Airflow, Kubernetes and Docker Documentation : sql

Introduction

These days data is growing massively and since the data is growing the amount of bad data is also growing. Today we'll talk about one form of bad data which we can call duplicate data. We use few database management systems to store and process the data ie sql server, postgresql, oracle etc. We store lot of data in these systems but sometimes we get lot duplicate data as well in our systems which can impact the true data and reporting on which lot of businesses and important decisions are dependent.

So for us working on data, it's very important to identify the duplicates and remove them from the database which help businesses towards better reporting.

Let's get into the practical to understand the duplicates. First we need to create a table with few columns.

In this tutorial I am using postgresql.

CREATE TABLE duplicate_data_test(
id int,
name text
)

now let's insert some data with duplicates.

insert into duplicate_data_test values(1,'One');
insert into duplicate_data_test values(1,'One');
insert into duplicate_data_test values(2,'two');
insert into duplicate_data_test values(3,'three');
insert into duplicate_data_test values(4,'four');
insert into duplicate_data_test values(5,'three');

Query the data to see results:

select * from duplicate_data_test;

Let's find the duplicates for entire row.

select id,name,count(*) as count from duplicate_data_test
group by id,name having count(*) > 1;

Using CTE as well to get the duplicates.

with cte(id,name)
as
(select id,name,row_number() over (partition by id,name order by id) as rn 
from duplicate_data_test)
select * from cte;

Above query will give you the count. You can remove the count in select list if you want and then you will get duplicate record without count number.

Sometimes we want to find duplicates based on one column. Sometime we don't have entire row is duplicate rather just one column have some duplicate values.

So we can definitely use partition by clause like below.

select * from (
select id,name,rank() over (partition by name order by id) as rnk 
from duplicate_data_test
) as dt where dt.rnk > 1

Conclusion

That was it about duplicates. Always test above code in lower environment before implementing in Production.

What is SQL

SQL stands for structured query language. It is used to query the relational databases. Since the name is saying the structured means it handles structured data.

What is structured data

Here structured data means a data which is in tabular format. And the data that have some relations between data is called relational data.

You can query structured data residing inside relational database management system. Relational database management system is a management system which stores and handles the data. Data volume could be too much like in GB, TB,PB. RDMS(Relational database management system) can handle this much of volume.

There are lot of RDBMS in the market like Oracle, SQL server, MYSQL and Postgres etc. Here MYSQL and Postgres both are open source.

Previously SQL was called structured english query language (SEQUEL) . Later it was renamed as structured query language(SQL). Many people call it SEQUEL and many call it SQL. So it's up to you what you want to call it.

A brief History of SQL

- 1970 − Dr. E. F. Codd of IBM is known as the father of relational databases. He described a relational model for databases.
- 1974 − Structured Query Language appeared.
- 1978 − IBM worked to develop Codd's ideas and released a product named System/R.
- 1986 − IBM developed the first prototype of relational database and standardized by ANSI. The first relational database was released by Relational Software which later came to be known as Oracle.

How to write SQL and how to fetch data

First you need to decide which RDBMS you are going to install on your system. I would recommend using postgres which is open source or SQL server which is not open source but you can download it's free version. Microsoft is developer of SQL server.

Now I am assuming that you have downloaded the RDBMS. If not then you can practice on any online platform which provide you sql editor.

Suppose I have table EMPLOYEE, which has three columns name, age, city.

Name	Age	City
Suraj	25	Banglore
Mohit	26	Chennai
Rahul	27	Lucknow

I want to fetch all records inside this table, so I will write below query:

SELECT * FROM EMPLOYEE;

Here you are query data then you need to write SELECT , * means all columns (Name,Age,City) ,write from keyword then table name. In this case table name is EMPLOYEE.

So above query will give you all records from the table. If I want to select only name then I need write only Name column in my query like Below:

SELECT Name FROM EMPLOYEE;

Similar for Age and City as well:

SELECT Age FROM EMPLOYEE;

SELECT City FROM EMPLOYEE;

Suppose I want to display all details about Suraj then I'll write below query:

SELECT * FROM EMPLOYEE WHERE NAME = 'Suraj'

Name	Age	City
Suraj	25	Banglore

You can use a wild card for search but there is a lot to cover on SQL. I suggest to read more on basics of SQL to get started.

SQL query to find duplicate records