ArticlesSQL

How to remove duplicates in SQL


Remove duplicates (SQL)

Very often we do not need duplicated values. It could be because we need to analyse specific data without duplicating items. Or maybe we need to store only unique data in a database, such as user names.

In SQL, there are several ways of tackling such tasks. We will look at different ways of dealing with duplicates.

This page covers:

  • how to restrict duplicates
  • how to extract unique values
  • identify duplicate values
  • remove duplicates in SQL

Let’s explore each approach separately.

Restrict duplicates

In SQL, you can set up columns in such way, that duplicates cannot be entered.

This can be implemented through the constraint unique. With unique, columns cannot contain duplicated values.

Let’s see how to use it with an example. We create a new table (called employee).

The table includes 4 columns:

  • id – identifier of the specific employee
  • firstName – first name of the employee
  • lastName – last name of the employee
  • age – age of the employee

We implement the unique constraint in the employee first name column.

CREATE TABLE employee (
    id int IDENTITY(1, 1) NOT NULL PRIMARY KEY,
    firstName varchar(50) UNIQUE,
    lastName varchar(50),
    age int
    );

Output: “Commands completed successfully.”

Now, let’s try and insert some data into the new table.

INSERT INTO employee
(firstName, lastName, age)
VALUES
('Carolina', 'Tomson', 36),
('Sam', 'Smith', 21);

Checking the outcome with a select statement (SELECT * FROM employee;).

idfirstNamelastNameage
1CarolinaTomson36
2SamSmith21

The result above is from the first insert statement.

But let’s try another one, with another employee named Carolina.

INSERT INTO employee
(firstName, lastName, age)
VALUES
('Carolina', 'Johnson', 27);

The output of the above query produces an error “Violation of UNIQUE KEY constraint..“.

Extract unique values

Often, tables will have duplicate values. But if needed, we can extract (select) only unique values.

This can be achieved through a filter called “distinct“.

Distinct is used with a select statement and its purpose is to extract only unique (no duplicated) values.

Let’s start fresh, with the above employee table example. But in this instance, there is no unique constraint in the first name column.

First, we need to drop the existing employee table.

DROP TABLE employee;

Output: Commands completed successfully.

Now, let’s create the same table, without the constraint. In addition, we are inserting the same data as before.

CREATE TABLE employee (
    id int IDENTITY(1, 1) NOT NULL PRIMARY KEY,
    firstName varchar(50),
    lastName varchar(50),
    age int
    );
INSERT INTO employee
(firstName, lastName, age)
VALUES
('Carolina', 'Tomson', 36),
('Sam', 'Smith', 21),
('Carolina', 'Johnson', 27);

In this scenario, there are no errors with the duplicates as the unique constraint is not affecting the first name column.

Let’s check the result with a select statement (SELECT * FROM employee;).

idfirstNamelastNameage
1CarolinaTomson36
2SamSmith21
3CarolinaJohnson27

The table above now contains duplicate values (2 x Carolina).

Sometimes, we may need to extract only unique values.

For this, we implement a select distinct statement.

SELECT DISTINCT firstName FROM employee;

Please be aware that we apply the filter only on the first name column (and not on all data inside the table).

The outcome is below.

firstName
Carolina
Sam

Identify duplicate values

In some situations, we need to identify duplicated values.

This process can be implemented with the “group by” and “having” clauses.

Examining duplicates can happen with the group by clause – with the help of the function “count()“.

We use the previous example (employee table) for this demonstration.

SELECT firstName, COUNT(*)
FROM employee
GROUP BY firstName;

The outcome is below.

firstNameNo column name
Carolina2
Sam1

From the above, it can be seen that the name Sam is found 1 time and the name Carolina 2 times. In such way, we examine the output and identify the duplicated values.

But we can make the query more efficient, especially with large datasets, by including the having clause.

Let’s see an example.

SELECT firstName, COUNT(*)
FROM employee
GROUP BY firstName
HAVING COUNT(*) > 1;
firstNameNo column name
Carolina2

Now, the outcome shows only values that have a higher count than 1 (meaning, only duplicates).

Remove duplicates in SQL

There are various ways of tackling duplicates.

We explore the simple method of identifying and removing manually duplicated records.

Let’s continue the above example, having 3 employees (of which two are named Carolina).

We have also already identified the duplicates through the select, group by, and having statements.

firstNameNo column name
Carolina2

Now, we can find and display each record with a select statement (with the help of a where clause).

SELECT * FROM employee
WHERE firstName = 'Carolina';

We get the following outcome.

idfirstNamelastNameage
1CarolinaTomson36
3CarolinaJohnson27

The above is all duplicate names under “Carolina”. The easiest way to identify a record is through the “id” column.

We can now delete (drop) the one necessary. In this scenario, it can be any of the two rows as this is just an example of the method.

Let’s apply the delete statement. The where clause is a must in this instance. Otherwise, the delete statement will delete all records in the table.

We are deleting the record of Carolina with id 3.

DELETE FROM employee
WHERE id = '3';

Displaying the result (SELECT * FROM employee;).

idfirstNamelastNameage
1CarolinaTomson36
2SamSmith21

And the duplicate value is gone.