Bash remove duplicate lines without sorting
One of the most frequent requirements I face when bash scripting is to extract a column of data from an input file, find the unique values across its rows, and then do something with that output. The standard way I achieve that is something like this –
cat file1.txt | sort | uniq > file2.txt
What the above does is first sort the input file; the sort is needed because of a limitation of uniq – it only removes duplicates on ADJACENT lines of a file.
That works perfectly well in 99% of cases, but on the rare occasion the step following the ‘unique’ phase might need the output to be in the same order as the input file.
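To see both issues at play, here is a quick demonstration (the sample values are my own) –

```shell
# uniq on its own only removes ADJACENT duplicates, so
# non-adjacent repeats survive:
printf 'aaa\nbbb\naaa\n' | uniq
# aaa
# bbb
# aaa

# sort | uniq removes them all, but the original input
# order is lost: the output comes back sorted.
printf 'rrr\naaa\nrrr\n' | sort | uniq
# aaa
# rrr
```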
There are 2 ways to achieve this and I will demonstrate both below. First as an example here is my input file –
aaa
bbb
rrr
fff
aaa
iii
yyy
fff
As you can see there are 2 x aaa records, and also 2 x fff records in the file.
So here we go with…
Method 1, which I will call RECORD LEVEL INDEXING
Here is the full command (I will explain each part below) –
cat -n file1.txt | sort -uk2 | sort -nk1 | cut -f2-
OK, an explanation of what that is doing… A great command for numbering records is ‘cat -n’ –
cat -n file1.txt
1 aaa
2 bbb
3 rrr
4 fff
5 aaa
6 iii
7 yyy
8 fff
So now you have numbered output. The next thing you want to do is sort it on unique values in column 2 –
cat -n file1.txt | sort -uk2
-u
means unique
-k2
means key 2, i.e. column 2
That outputs this –
1 aaa
2 bbb
4 fff
6 iii
3 rrr
7 yyy
As you can see, the values you are interested in are now unique, BUT they are not in the original sequence. So now you want to re-sort the file on the FIRST column, which is the index / record number –
cat -n file1.txt | sort -uk2 | sort -nk1
-n
means numeric sort
-k1
says to use key/column 1
1 aaa
2 bbb
3 rrr
4 fff
6 iii
7 yyy
Your file is now both unique and in the correct order. But you no longer need the index number, so let's strip that out –
cat -n file1.txt | sort -uk2 | sort -nk1 | cut -f2-
aaa
bbb
rrr
fff
iii
yyy
Job done!
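If you use this often, Method 1 tidies up into a small function (the name dedup_keep_order is just my own choice, not a standard tool) –

```shell
# Hypothetical wrapper for Method 1: deduplicate while
# preserving the original line order.
dedup_keep_order() {
  # "${1:--}" means: use the first argument as the filename,
  # or read stdin if no argument was given.
  cat -n "${1:--}" | sort -uk2 | sort -nk1 | cut -f2-
}
```

Then `dedup_keep_order file1.txt`, or use it as a filter: `some_command | dedup_keep_order`.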
Method 2 I will call AWK ARRAY MAGIC
cat file1.txt | awk '!x[$0]++'
Perfectly clear what that is doing, yeah…?? Here is the breakdown: $0 holds the entire contents of the line, and x[$0] uses it as the key of an array element. That element is incremented (++), but because ++ is applied after the test, the ! is true only the first time a line is seen – so each line is printed once, on its first appearance.
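This is also where awk earns its keep for the column-extraction case mentioned at the start: swap $0 for a field number and you deduplicate on just that column while keeping whole lines. A small sketch with made-up data –

```shell
# Keep only the first line seen for each value in column 1.
# $1 is the first whitespace-separated field; use $2, $3, …
# to key on a different column.
printf 'aaa 1\nbbb 2\naaa 3\n' | awk '!x[$1]++'
# aaa 1
# bbb 2
```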
Bash Remove Duplicate records – Performance of each method
Quick reminder of both methods –
1) cat -n file1.txt | sort -uk2 | sort -nk1 | cut -f2-
2) cat file1.txt | awk '!x[$0]++'
The time command is the simplest way to measure the performance of a command. It is always best to run the command 2 or 3 times and take the average.
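For example, timing both methods against the sample file looks like this (in bash, time is a shell keyword and it times an entire pipeline) –

```shell
# Recreate the sample input so the timing is reproducible.
printf 'aaa\nbbb\nrrr\nfff\naaa\niii\nyyy\nfff\n' > file1.txt

# real = wall-clock time, user = CPU time in user space,
# sys = CPU time spent in the kernel.
time cat -n file1.txt | sort -uk2 | sort -nk1 | cut -f2- > /dev/null
time awk '!x[$0]++' file1.txt > /dev/null
```

Redirecting to /dev/null keeps the commands' own output out of the timing figures.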
The best for Method 1 –
real 0m0.004s
user 0m0.001s
sys 0m0.003s
The best for method 2 –
real 0m0.003s
user 0m0.000s
sys 0m0.002s
So on this TINY file method 2 comes out slightly ahead. How that holds up on a file with 1,000 or 10,000 or 1 million rows you can test yourself.
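One way to run that test yourself (assuming GNU coreutils for seq and shuf; the file size here is just my pick) is to generate a shuffled file with many duplicates –

```shell
# 1,000 distinct values repeated 100 times each, shuffled:
# a 100,000-line file where most lines are duplicates.
seq 1 1000 | awk '{for (i = 0; i < 100; i++) print "val" $0}' | shuf > big.txt

time cat -n big.txt | sort -uk2 | sort -nk1 | cut -f2- > /dev/null   # method 1
time awk '!x[$0]++' big.txt > /dev/null                              # method 2
```

Worth knowing: the awk version holds one array entry per distinct line in memory, while sort can spill to temporary files on disk, so on files with a huge number of unique lines the trade-off can shift.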
Have fun!