Introduction
These days data is growing massively and since the data is growing the amount of bad data is also growing. Today we'll talk about one form of bad data which we can call duplicate data. We use few database management systems to store and process the data ie sql server, postgresql, oracle etc. We store lot of data in these systems but sometimes we get lot duplicate data as well in our systems which can impact the true data and reporting on which lot of businesses and important decisions are dependent.
So for us working on data, it's very important to identify the duplicates and remove them from the database which help businesses towards better reporting.
Let's get into the practical to understand the duplicates. First we need to create a table with few columns.
In this tutorial I am using postgresql.
now let's insert some data with duplicates.
Query the data to see results:
Let's find the duplicates for entire row.
or
Using CTE as well to get the duplicates.
Above query will give you the count. You can remove the count in select list if you want and then you will get duplicate record without count number.
Sometimes we want to find duplicates based on one column. Sometime we don't have entire row is duplicate rather just one column have some duplicate values.
So we can definitely use partition by clause like below.
Conclusion
That was it about duplicates. Always test above code in lower environment before implementing in Production.