Airflow, Kubernetes and Docker Documentation : postgreSQL

Introduction

These days data is growing massively and since the data is growing the amount of bad data is also growing. Today we'll talk about one form of bad data which we can call duplicate data. We use few database management systems to store and process the data ie sql server, postgresql, oracle etc. We store lot of data in these systems but sometimes we get lot duplicate data as well in our systems which can impact the true data and reporting on which lot of businesses and important decisions are dependent.

So for us working on data, it's very important to identify the duplicates and remove them from the database which help businesses towards better reporting.

Let's get into the practical to understand the duplicates. First we need to create a table with few columns.

In this tutorial I am using postgresql.

CREATE TABLE duplicate_data_test(
id int,
name text
)

now let's insert some data with duplicates.

insert into duplicate_data_test values(1,'One');
insert into duplicate_data_test values(1,'One');
insert into duplicate_data_test values(2,'two');
insert into duplicate_data_test values(3,'three');
insert into duplicate_data_test values(4,'four');
insert into duplicate_data_test values(5,'three');

Query the data to see results:

select * from duplicate_data_test;

Let's find the duplicates for entire row.

select id,name,count(*) as count from duplicate_data_test
group by id,name having count(*) > 1;

Using CTE as well to get the duplicates.

with cte(id,name)
as
(select id,name,row_number() over (partition by id,name order by id) as rn 
from duplicate_data_test)
select * from cte;

Above query will give you the count. You can remove the count in select list if you want and then you will get duplicate record without count number.

Sometimes we want to find duplicates based on one column. Sometime we don't have entire row is duplicate rather just one column have some duplicate values.

So we can definitely use partition by clause like below.

select * from (
select id,name,rank() over (partition by name order by id) as rnk 
from duplicate_data_test
) as dt where dt.rnk > 1

Conclusion

That was it about duplicates. Always test above code in lower environment before implementing in Production.

Postgres on Docker

Postgres is most advanced object relational database management system(ORDBMS). Postgres implements majority of SQL:2011 standard. It's ACID compliant and It avoids locking issues using multiversion concurrency control. So today we are going to run Postgres on Docker.

To start with Postgres we first need to pull the image from DockerHub. DockerHub is image repository for all images. Let's run the below command and pull the image:

docker pull postgres

Using default tag: latest

latest: Pulling from library/postgres

a9eb63951c1c: Pull complete

b192c7f382df: Pull complete

e7ce3f587986: Pull complete

4098744a1414: Pull complete

4c98d6f3399d: Pull complete

65e57fefc38a: Pull complete

d61d9528cfd5: Pull complete

de6b20f44659: Pull complete

25db13ff0bef: Pull complete

7f74f4b0e936: Pull complete

144c847b11fb: Pull complete

cf0afd1be009: Pull complete

fe0c14991327: Pull complete

Now let's check that we have downloaded the image.

docker images

REPOSITORY TAG IMAGE ID CREATED SIZE

postgres latest 83ce63c594ee 5 days ago 355MB

Let's run the image and start a container.

docker run --name test -e POSTGRES_PASSWORD=Test@123 -d postgres

Just run the docker ps command to check if container is running

docker ps

CONTAINER ID IMAGE COMMAND CREATED STATUS

83ec4a222 postgres "docker-entrypoint.s…" 2 minutes ago Up

Let's enter in bash shell of container by running below command

docker exec -it 83ec4a222 bash

root@83ec4a222:/#

Connect to Postgres now:

psql -h localhost -p 5432 -U postgres -w

psql (14.0 (Debian 14.0-1.pgdg110+1))

Type "help" for help.

You are connected to Postgres now. Lets' create some tables and execute some queries.

postgres=# \l

List of databases

-----------+----------+----------+------------+------------+-----------------------

(3 rows)

postgres=#

Let's check the current database name by running below command.

postgres=# select current_database();

current_database

------------------

postgres

(1 row)

So current database is Postgres. We'll check now how many databases are there on the system.

postgres=# select datname from pg_catalog.pg_database;

datname

-----------

postgres

template1

template0

(3 rows)

There are total 3 databases on system.

You can check all tables on a database by querying information schema.

postgres=# select table_name from information_schema.tables limit 10;

table_name

-----------------------

pg_statistic

pg_type

pg_foreign_table

pg_authid

pg_shadow

pg_statistic_ext_data

pg_roles

pg_settings

pg_file_settings

pg_hba_file_rules

(10 rows)

We can do a lot more than this on Postgres this was just a small part about Postgres. We can get all information about all tables and databases just by using information schema. Docker can be very useful in this case when we don't want to install it on system and want to run Postgres inside container and can leverage the power of Docker.

Note: If you think this helped you and you want to learn more stuff on devops, then I would recommend joining the Kodecloud devops course and go for the complete certification path by clicking this link

SQL query to find duplicate records

Introduction

Conclusion

That was it about duplicates. Always test above code in lower environment before implementing in Production.

How to run PostgreSQL on Docker

Postgres on Docker

Quantum Computing: The Future of Supercomputing Explained