Bash remove duplicate lines without sorting

By | March 11, 2016

One of the most frequent requirements I face when bash scripting is to extract a column of data from an input file, find each unique value across its rows, and then do something with that output. The standard way I achieve that is something like this –

cat file1.txt | sort | uniq > file2.txt

The above first sorts the input file and then passes it to uniq. The sort is needed because of a limitation of uniq: it only removes duplicates that sit on ADJACENT lines, so identical lines have to be brought together before uniq can collapse them.
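To see that limitation for yourself, here is a quick sketch using a made-up three line input (the variable names are just for illustration):

```shell
# Three lines where the two 'aaa' lines are NOT next to each other.
adjacent_only=$(printf 'aaa\nbbb\naaa\n' | uniq)
sorted_first=$(printf 'aaa\nbbb\naaa\n' | sort | uniq)

# uniq on its own leaves both 'aaa' lines in place: aaa, bbb, aaa
printf '%s\n' "$adjacent_only"

# sorting first puts the duplicates side by side, so uniq can drop one: aaa, bbb
printf '%s\n' "$sorted_first"
```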

That works perfectly well in 99% of cases, but on that rare occasion the step following that ‘unique’ phase might need the output to be in the same order as the input file was.

There are 2 ways to achieve this and I will demonstrate both below. First as an example here is my input file –

aaa
bbb
rrr
fff
aaa
iii
yyy
fff

As you can see there are 2 x aaa records, and also 2 x fff records in the file.

So here we go with..

Method 1 which I will call RECORD LEVEL INDEXING

Here is the full command (I will explain each part below) –

cat -n file1.txt | sort -uk2 | sort -nk1 | cut -f2-

OK, an explanation of what that is doing… A great command for numbering records is ‘cat -n’ –

cat -n file1.txt

1 aaa
2 bbb
3 rrr
4 fff
5 aaa
6 iii
7 yyy
8 fff

So now you have numbered output. The next step is to sort it, keeping only the unique values in column 2 –

cat -n file1.txt | sort -uk2

-u means unique (keep only one line for each distinct key)
-k2 means key 2, so the sort key runs from column 2 to the end of the line

That outputs this –

1 aaa
2 bbb
4 fff
6 iii
3 rrr
7 yyy

As you can see, the values you are interested in are now unique, BUT they are not in the original sequence. Therefore you now want to re-sort the file based on the FIRST column, which is the index / record number –

cat -n file1.txt | sort -uk2 | sort -nk1

-n means numeric sort
-k1 says to use key/column 1

1 aaa
2 bbb
3 rrr
4 fff
6 iii
7 yyy

Your file is now both unique and in the correct order. But you are not interested in the index number any more, so let's strip that out –

cat -n file1.txt | sort -uk2 | sort -nk1 | cut -f2-

aaa
bbb
rrr
fff
iii
yyy

Job done!
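If you use this a lot, the pipeline could be wrapped in a small function. What follows is only a sketch: the name dedupe_by_index is made up, and the extra options are my own assumptions for making it a bit more defensive rather than part of the command above. LC_ALL=C makes sort compare raw bytes, -t sets the tab that cat -n writes after the line number as the field separator, and -s (supported by GNU and BSD sort) asks for a stable sort. Note that POSIX does not say WHICH of several duplicate lines -u keeps, so keeping the first occurrence relies on how your sort behaves.

```shell
# dedupe_by_index is a hypothetical helper name, not a standard command.
# Usage: dedupe_by_index FILE
dedupe_by_index() {
    # 1. cat -n   : prefix every line with its line number and a tab
    # 2. sort -u  : keep one line per distinct text (field 2 to end of line)
    # 3. sort -n  : put the survivors back in line-number order
    # 4. cut -f2- : strip the line number again
    cat -n -- "$1" |
        LC_ALL=C sort -t "$(printf '\t')" -s -u -k2 |
        sort -n -k1,1 |
        cut -f2-
}
```

You would then call it as dedupe_by_index file1.txt > file2.txt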

Method 2 I will call AWK ARRAY MAGIC

cat file1.txt | awk '!x[$0]++'

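Perfectly clear what that is doing, yeah….?? $0 holds the entire contents of the line, and x[$0] uses it as the key of an array element that counts how many times that line has been seen. The ++ increments the count after its old value has been read, and the ! negates that old value, so the expression is true only the first time a line appears (when the count was still 0). Because no action is given, awk prints the line whenever the expression is true.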
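If the one-liner is still a bit too magic, here it is spelled out the long way. This is just a sketch with made-up names (dedupe_awk_verbose, seen, count), but it should behave the same as '!x[$0]++':

```shell
# dedupe_awk_verbose is a hypothetical name; it reads files given as
# arguments, or standard input when there are none.
dedupe_awk_verbose() {
    awk '{
        count = seen[$0]        # times this exact line was seen before (0 at first)
        seen[$0] = count + 1    # remember that we have now seen it once more
        if (count == 0)         # first occurrence only
            print $0
    }' "$@"
}
```

Usage is the same as before, for example dedupe_awk_verbose file1.txt or cat file1.txt | dedupe_awk_verbose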

Bash Remove Duplicate records – Performance of each method

Quick reminder of both methods –

1) cat -n file1.txt | sort -uk2 | sort -nk1 | cut -f2-
2) cat file1.txt | awk '!x[$0]++'

The time command reports how long a command took to run (real, user and sys time). It is always best to run the command 2 or 3 times to get an average.

The best for Method 1 –

real 0m0.004s
user 0m0.001s
sys 0m0.003s

The best for method 2 –

real 0m0.003s
user 0m0.000s
sys 0m0.002s

On this TINY file method 2 comes out marginally ahead, but a difference of 1ms is well within measurement noise, so don't read too much into it. How each method behaves on a file with, say, 1000 or 10000 or 1 million rows you can test yourself.
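To try it on something bigger, a rough test harness could look like the one below. It is a sketch under a few assumptions: it uses bash's time keyword, the ROWS variable and the file names are made up, and the test data is just pseudo-random numbers generated with awk so that there are plenty of duplicates. I have not included any timings because they will depend entirely on your machine.

```shell
# Number of rows to generate; override with e.g. ROWS=1000000 (hypothetical variable).
rows=${ROWS:-100000}
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT   # tidy up the scratch files when the shell exits

# Build a test file of pseudo-random numbers with lots of repeated values.
awk -v n="$rows" 'BEGIN { srand(1); for (i = 0; i < n; i++) print int(rand() * 5000) }' > "$workdir/big.txt"

echo "Method 1:"
time (cat -n "$workdir/big.txt" | sort -uk2 | sort -nk1 | cut -f2- > "$workdir/out1.txt")

echo "Method 2:"
time (awk '!x[$0]++' "$workdir/big.txt" > "$workdir/out2.txt")

# Both methods should give the same lines in the same order.
cmp "$workdir/out1.txt" "$workdir/out2.txt" && echo "outputs match"
```

Run it a few times with different ROWS values and compare the real times for each method.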

Have fun!
