Airflow, Kubernetes and Docker Documentation : duplicates

Introduction

These days data is growing massively and since the data is growing the amount of bad data is also growing. Today we'll talk about one form of bad data which we can call duplicate data. We use few database management systems to store and process the data ie sql server, postgresql, oracle etc. We store lot of data in these systems but sometimes we get lot duplicate data as well in our systems which can impact the true data and reporting on which lot of businesses and important decisions are dependent.

So for us working on data, it's very important to identify the duplicates and remove them from the database which help businesses towards better reporting.

Let's get into the practical to understand the duplicates. First we need to create a table with few columns.

In this tutorial I am using postgresql.

CREATE TABLE duplicate_data_test(
id int,
name text
)

now let's insert some data with duplicates.

insert into duplicate_data_test values(1,'One');
insert into duplicate_data_test values(1,'One');
insert into duplicate_data_test values(2,'two');
insert into duplicate_data_test values(3,'three');
insert into duplicate_data_test values(4,'four');
insert into duplicate_data_test values(5,'three');

Query the data to see results:

select * from duplicate_data_test;

Let's find the duplicates for entire row.

select id,name,count(*) as count from duplicate_data_test
group by id,name having count(*) > 1;

Using CTE as well to get the duplicates.

with cte(id,name)
as
(select id,name,row_number() over (partition by id,name order by id) as rn 
from duplicate_data_test)
select * from cte;

Above query will give you the count. You can remove the count in select list if you want and then you will get duplicate record without count number.

Sometimes we want to find duplicates based on one column. Sometime we don't have entire row is duplicate rather just one column have some duplicate values.

So we can definitely use partition by clause like below.

select * from (
select id,name,rank() over (partition by name order by id) as rnk 
from duplicate_data_test
) as dt where dt.rnk > 1

SQL query to find duplicate records

Introduction

Conclusion

That was it about duplicates. Always test above code in lower environment before implementing in Production.

Quantum Computing: The Future of Supercomputing Explained