Word lists in a shell

I often need to manipulate sets of words and compare them. This post presents some of the one-liners that I frequently use to work with such lists.

Get a set of words from a text file

If you start from a text file, the following command will convert it to a list of words:

cat input.txt |\
  sed 's/\>/\n/g' |\
  sed 's/^[[:space:]]*//' |\
  sed 's/[[:space:]]*$//' |\
  grep -v "^$" |\
  sort |\
  uniq  > output.txt

On OS X, whose sed differs from the GNU version, use this command instead (note the literal newline in the middle of the command):

cat input.txt | sed 's/[[:>:]]/\
/g' | sed 's/^[[:space:]]*//' | sed 's/[[:space:]]*$//' | grep -v "^$" | sort | uniq  > output.txt

Both of these commands replace word boundaries with newlines, trim each line, drop the empty lines, and then write a sorted list of unique words to output.txt.
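If you would rather not maintain two versions, a sketch based on tr should behave the same on both systems. It is not equivalent to the sed commands above: it assumes that anything other than a letter or a digit is a separator, so punctuation is discarded. The words function and the sample file are hypothetical names used for illustration.

```shell
# Hypothetical helper: print the sorted, unique words of the file given as $1.
# Assumption: every character that is not a letter or a digit separates words.
words() {
  tr -cs '[:alnum:]' '\n' < "$1" | grep -v '^$' | LC_ALL=C sort -u
}

tmp=$(mktemp -d)
printf 'the cat, the dog\n' > "$tmp/sample.txt"   # made-up sample input
words "$tmp/sample.txt"
```

On this sample the function prints cat, dog and the, one per line.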

Intersection

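A first one-liner finds the elements that are common to two lists (assuming that neither list contains duplicates, which is the case for lists produced by the command above):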

cat file1.txt file2.txt | sort | uniq -d
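If a word is repeated inside a single file, uniq -d reports it even when the other file does not contain it. A minimal sketch that guards against this by deduplicating each file first (the file contents are made up for the example):

```shell
tmp=$(mktemp -d)
printf 'apple\napple\npear\n' > "$tmp/file1.txt"   # 'apple' is repeated on purpose
printf 'pear\nplum\n' > "$tmp/file2.txt"

# Deduplicate each file on its own, so that uniq -d only reports
# the words that are present in both files.
intersection=$( { sort -u "$tmp/file1.txt"; sort -u "$tmp/file2.txt"; } | sort | uniq -d )
echo "$intersection"
```

Here only pear is printed; without the two sort -u steps, apple would be listed as well.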

Union

A similar one-liner finds the elements that are in one set or the other:

cat file1.txt file2.txt | sort | uniq
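The same result can probably be obtained with a single command, since sort accepts several files and its -u option keeps one copy of each line. A small sketch on made-up files:

```shell
tmp=$(mktemp -d)
printf 'apple\npear\n' > "$tmp/file1.txt"
printf 'pear\nplum\n' > "$tmp/file2.txt"

# sort reads both files at once; -u keeps a single copy of each line.
union=$(LC_ALL=C sort -u "$tmp/file1.txt" "$tmp/file2.txt")
echo "$union"
```

This prints apple, pear and plum, one per line.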

Union minus intersection

To get the words that are only in file1.txt or file2.txt, but not both:

cat file1.txt file2.txt | sort | uniq -u

Difference

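To get the elements that are in file1.txt but not in file2.txt, list file2.txt twice so that each of its words appears at least twice and is therefore discarded by uniq -u: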

cat file1.txt file2.txt file2.txt | sort | uniq -u
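The standard comm utility is another way to compute a difference. It compares two sorted files line by line, so the sketch below sorts each input into a temporary file first. The file names and contents are assumptions made for the example.

```shell
tmp=$(mktemp -d)
printf 'pear\napple\nfig\n' > "$tmp/file1.txt"
printf 'plum\npear\n' > "$tmp/file2.txt"

# comm needs sorted input; sort -u also drops duplicates within each file.
LC_ALL=C sort -u "$tmp/file1.txt" > "$tmp/a.sorted"
LC_ALL=C sort -u "$tmp/file2.txt" > "$tmp/b.sorted"

# -2 hides the lines found only in the second file,
# -3 hides the lines common to both: what remains is file1 minus file2.
difference=$(LC_ALL=C comm -23 "$tmp/a.sorted" "$tmp/b.sorted")
echo "$difference"
```

This prints apple and fig. Replacing -23 with -12 would give the intersection instead.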

Histogram of words

As a bonus, we can tweak our first command to get a histogram of words:

cat input.txt |\
  sed 's/\>/\n/g' |\
  sed 's/^[[:space:]]*//' |\
  sed 's/[[:space:]]*$//' |\
  grep -v "^$" |\
  sort |\
  uniq -c |\
  sort -nr
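To see what the last two steps do, here is a small sketch that starts from a made-up file already containing one word per line. uniq -c prefixes each distinct line with its number of occurrences, and sort -nr orders the result by that number, largest first. The final awk only normalizes the padding that uniq -c puts in front of the counts.

```shell
tmp=$(mktemp -d)
printf 'the\ncat\nthe\ndog\nthe\ncat\n' > "$tmp/words.txt"   # one word per line

# Count each distinct word, sort by decreasing count, then tidy the spacing.
histogram=$(sort "$tmp/words.txt" | uniq -c | sort -nr | awk '{print $1, $2}')
echo "$histogram"
```

The output is "3 the", "2 cat" and "1 dog", one entry per line.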

The following variation prints the words that appear at least 10 times:

cat input.txt |\
  sed 's/\>/\n/g' |\
  sed 's/^[[:space:]]*//' |\
  sed 's/[[:space:]]*$//' |\
  grep -v "^$" |\
  sort |\
  uniq -c |\
  sort -nr |\
  awk '$1 >= 10 {print $2}'
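If the threshold changes often, it can be passed to awk as a variable rather than hard-coded. The sketch below wraps this in a hypothetical frequent_words function and assumes that the input file already contains one word per line.

```shell
# Hypothetical helper: frequent_words MIN FILE
# Prints the words that occur at least MIN times in FILE (one word per line).
frequent_words() {
  sort "$2" | uniq -c | sort -nr | awk -v min="$1" '$1 >= min {print $2}'
}

tmp=$(mktemp -d)
printf 'the\ncat\nthe\ndog\nthe\ncat\n' > "$tmp/words.txt"   # made-up sample
frequent_words 2 "$tmp/words.txt"
```

With a threshold of 2, this sample yields the and cat.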