Showing posts with label postgreSQL. Show all posts
Showing posts with label postgreSQL. Show all posts

SQL query to find duplicate records

Introduction

These days data is growing massively and since the data is growing the amount of bad data is also growing. Today we'll talk about one form of bad data which we can call duplicate data. We use few database management systems to store and process the data ie sql server, postgresql, oracle etc. We store lot of data in these systems but sometimes we get lot duplicate data as well in our systems which can impact the true data and reporting on which lot of businesses and important decisions are dependent.

  So for us working on data, it's very important to identify the duplicates and remove them from the database which help businesses towards better reporting. 

Let's get into the practical to understand the duplicates. First we need to create a table with few columns.

In this tutorial I am using postgresql. 

CREATE TABLE duplicate_data_test(
id int,
name text
)

now let's insert some data with duplicates.

insert into duplicate_data_test values(1,'One');
insert into duplicate_data_test values(1,'One');
insert into duplicate_data_test values(2,'two');
insert into duplicate_data_test values(3,'three');
insert into duplicate_data_test values(4,'four');
insert into duplicate_data_test values(5,'three');

Query the data to see results:

select * from duplicate_data_test;













Let's find the duplicates for entire row.

select id,name,count(*) as count from duplicate_data_test
group by id,name having count(*) > 1;

or

Using CTE as well to get the duplicates.

with cte(id,name)
as
(select id,name,row_number() over (partition by id,name order by id) as rn
from duplicate_data_test)
select * from cte;

Above query will give you the count. You can remove the count in select list if you want and then you will get duplicate record without count number. 















Sometimes we want to find duplicates based on one column. Sometime we don't have entire row is duplicate rather just one column have some duplicate values. 

So we can definitely use partition by clause like below.

select * from (
select id,name,rank() over (partition by name order by id) as rnk
from duplicate_data_test
) as dt where dt.rnk > 1













Conclusion

That was it about duplicates. Always test above code in lower environment before implementing in Production.


How to run PostgreSQL on Docker



Postgres on Docker 

Postgres is most advanced object relational database management system(ORDBMS). Postgres implements majority of SQL:2011 standard. It's ACID compliant and It avoids locking issues using multiversion concurrency control. So today we are going to run Postgres on Docker.

To start with Postgres we first need to pull the image from DockerHub. DockerHub is image repository for all images. Let's run the below command and pull the image:

docker pull postgres

Using default tag: latest

latest: Pulling from library/postgres

a9eb63951c1c: Pull complete 

b192c7f382df: Pull complete 

e7ce3f587986: Pull complete 

4098744a1414: Pull complete 

4c98d6f3399d: Pull complete 

65e57fefc38a: Pull complete 

d61d9528cfd5: Pull complete 

de6b20f44659: Pull complete 

25db13ff0bef: Pull complete 

7f74f4b0e936: Pull complete 

144c847b11fb: Pull complete 

cf0afd1be009: Pull complete 

fe0c14991327: Pull complete 


Now let's check that we have downloaded the image.

docker images

REPOSITORY        TAG       IMAGE ID       CREATED       SIZE

postgres          latest    83ce63c594ee   5 days ago    355MB


Let's run the image and start a container.

docker run --name test -e POSTGRES_PASSWORD=Test@123 -d postgres


Just run the docker ps command to check if container is running

docker ps

CONTAINER ID   IMAGE      COMMAND                  CREATED         STATUS   

83ec4a222   postgres   "docker-entrypoint.s…"   2 minutes ago   Up 


Let's enter in bash shell of container by running below command

docker exec -it 83ec4a222 bash

root@83ec4a222:/# 


Connect to Postgres now:

psql -h localhost -p 5432 -U postgres -w

psql (14.0 (Debian 14.0-1.pgdg110+1))

Type "help" for help.


You are connected to Postgres now. Lets' create some tables and execute some queries.

postgres=# \l

                                 List of databases

   Name    |  Owner   | Encoding |  Collate   |   Ctype    |   Access privileges   

-----------+----------+----------+------------+------------+-----------------------

 postgres  | postgres | UTF8     | en_US.utf8 | en_US.utf8 | 

 template0 | postgres | UTF8     | en_US.utf8 | en_US.utf8 | =c/postgres          +

           |          |          |            |            | postgres=CTc/postgres

 template1 | postgres | UTF8     | en_US.utf8 | en_US.utf8 | =c/postgres          +

           |          |          |            |            | postgres=CTc/postgres

(3 rows)


postgres=# 



Let's check the current database name by running below command.

postgres=# select current_database();

 current_database 

------------------

 postgres

(1 row)


So current database is Postgres. We'll check now how many databases are there on the system.

postgres=# select datname from pg_catalog.pg_database;

  datname  

-----------

 postgres

 template1

 template0

(3 rows)


There are total 3 databases on system.

You can check all tables on a database by querying information schema.

postgres=# select table_name from information_schema.tables limit 10;

      table_name       

-----------------------

 pg_statistic

 pg_type

 pg_foreign_table

 pg_authid

 pg_shadow

 pg_statistic_ext_data

 pg_roles

 pg_settings

 pg_file_settings

 pg_hba_file_rules

(10 rows)



We can do a lot more than this on Postgres this was just a small part about Postgres. We can get all information about all tables and databases just by using information schema. Docker can be very useful in this case when we don't want to install it on system and want to run Postgres inside container and can leverage the power of Docker.

Note: If you think this helped you and you want to learn more stuff on devops, then I would recommend joining the Kodecloud devops course and go for the complete certification path by clicking this link

Quantum Computing: The Future of Supercomputing Explained

  Introduction Quantum computing is revolutionizing the way we solve complex problems that classical computers struggle with. Unlike tradi...