Bash remove duplicate lines without sorting
One of the most frequent requirements I face when bash scripting is to extract a column of data from an input file, find the unique values across its rows, and then do something with that output. The standard way I achieve that is something like this –
cat file1.txt | sort | uniq > file2.txt
What the above does is first sort the input file; the sort is needed because of a limitation of uniq – it only removes duplicates on ADJACENT lines of a file.
That works perfectly well in 99% of cases, but on the rare occasion the step following the ‘unique’ phase might need the output to be in the same order as the input file.
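To see both issues at play, here is a quick demonstration (the sample values are my own) –

```shell
# uniq on its own only removes ADJACENT duplicates, so
# non-adjacent repeats survive:
printf 'aaa\nbbb\naaa\n' | uniq
# aaa
# bbb
# aaa

# sort | uniq removes them all, but the original input
# order is lost: the output comes back sorted.
printf 'rrr\naaa\nrrr\n' | sort | uniq
# aaa
# rrr
```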
There are 2 ways to achieve this and I will demonstrate both below. First as an example here is my input file –
aaa
bbb
rrr
fff
aaa
iii
yyy
fff
As you can see there are 2 x aaa records, and also 2 x fff records in the file.
So here we go with…
Method 1, which I will call RECORD LEVEL INDEXING
Here is the full command (I will explain each part below) –
cat -n file1.txt | sort -uk2 | sort -nk1 | cut -f2-
OK, an explanation of what that is doing… A great command for numbering records is ‘cat -n’ –
cat -n file1.txt
1 aaa
2 bbb
3 rrr
4 fff
5 aaa
6 iii
7 yyy
8 fff
So now you have numbered output. The next thing you want to do is sort it on unique values in column 2 –
cat -n file1.txt | sort -uk2
-u
means unique
-k2
means key 2, i.e. column 2
That outputs this –
1 aaa
2 bbb
4 fff
6 iii
3 rrr
7 yyy
As you can see, the values you are interested in are now unique, BUT they are not in the original sequence. So now you want to re-sort the file on the FIRST column, which is the index / record number –
cat -n file1.txt | sort -uk2 | sort -nk1
-n
means numeric sort
-k1
says to use key/column 1
1 aaa
2 bbb
3 rrr
4 fff
6 iii
7 yyy
Your file is now both unique and in the correct order. But you no longer need the index number, so let's strip that out –
cat -n file1.txt | sort -uk2 | sort -nk1 | cut -f2-
aaa
bbb
rrr
fff
iii
yyy
Job done!
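If you use this often, Method 1 tidies up into a small function (the name dedup_keep_order is just my own choice, not a standard tool) –

```shell
# Hypothetical wrapper for Method 1: deduplicate while
# preserving the original line order.
dedup_keep_order() {
  # "${1:--}" means: use the first argument as the filename,
  # or read stdin if no argument was given.
  cat -n "${1:--}" | sort -uk2 | sort -nk1 | cut -f2-
}
```

Then `dedup_keep_order file1.txt`, or use it as a filter: `some_command | dedup_keep_order`.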
Method 2 I will call AWK ARRAY MAGIC
cat file1.txt | awk '!x[$0]++'
Perfectly clear what that is doing, yeah…?? Here is the breakdown: $0 holds the entire contents of the line, and x[$0] uses it as the key of an array element. That element is incremented (++), but because ++ is applied after the test, the ! is true only the first time a line is seen – so each line is printed once, on its first appearance.
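This is also where awk earns its keep for the column-extraction case mentioned at the start: swap $0 for a field number and you deduplicate on just that column while keeping whole lines. A small sketch with made-up data –

```shell
# Keep only the first line seen for each value in column 1.
# $1 is the first whitespace-separated field; use $2, $3, …
# to key on a different column.
printf 'aaa 1\nbbb 2\naaa 3\n' | awk '!x[$1]++'
# aaa 1
# bbb 2
```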
Bash Remove Duplicate records – Performance of each method
Quick reminder of both methods –
1) cat -n file1.txt | sort -uk2 | sort -nk1 | cut -f2-
2) cat file1.txt | awk '!x[$0]++'
The time command is the simplest way to measure the performance of a command. It is always best to run the command 2 or 3 times and take the average.
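For example, timing both methods against the sample file looks like this (in bash, time is a shell keyword and it times an entire pipeline) –

```shell
# Recreate the sample input so the timing is reproducible.
printf 'aaa\nbbb\nrrr\nfff\naaa\niii\nyyy\nfff\n' > file1.txt

# real = wall-clock time, user = CPU time in user space,
# sys = CPU time spent in the kernel.
time cat -n file1.txt | sort -uk2 | sort -nk1 | cut -f2- > /dev/null
time awk '!x[$0]++' file1.txt > /dev/null
```

Redirecting to /dev/null keeps the commands' own output out of the timing figures.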
The best for Method 1 –
real 0m0.004s
user 0m0.001s
sys 0m0.003s
The best for method 2 –
real 0m0.003s
user 0m0.000s
sys 0m0.002s
So on this TINY file method 2 comes out slightly ahead. How that holds up on a file with 1,000 or 10,000 or 1 million rows you can test yourself.
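One way to run that test yourself (assuming GNU coreutils for seq and shuf; the file size here is just my pick) is to generate a shuffled file with many duplicates –

```shell
# 1,000 distinct values repeated 100 times each, shuffled:
# a 100,000-line file where most lines are duplicates.
seq 1 1000 | awk '{for (i = 0; i < 100; i++) print "val" $0}' | shuf > big.txt

time cat -n big.txt | sort -uk2 | sort -nk1 | cut -f2- > /dev/null   # method 1
time awk '!x[$0]++' big.txt > /dev/null                              # method 2
```

Worth knowing: the awk version holds one array entry per distinct line in memory, while sort can spill to temporary files on disk, so on files with a huge number of unique lines the trade-off can shift.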
Have fun!