Anagram finder implementations – Lars Nordenström's blog

In Colin Dexters mystery novel The way through the woods a vital clue comes in the form of a poem:

Find me, find the Swedish daughter - // Thaw my frosted tegument!
Dry the azured skylit water, // Sky my everlasting tent.

Who spied, who spied that awful spot? // O find me! Find the woodman's daughter!
Ask the stream: "Why tellst thou not // The truth thou knowst - the tragic slaughter?"

Ask the tiger, ask the sun // Whither riding, what my plight?
Till the given day be run, // Till the burning of the night.

Thyme, I saw Thyme flow'ring here // A creature white trapped in a gin,
Panting like a hunted deer // Licking still the bloodied skin.

With clues surveyed so wondrous laden, // Hunt the ground beneath thy feet!
Find me, find me now, thy maiden, // I will kiss thee when we meet.

Perhaps not a very good poem, but in it are several anagrams of a place-name near Oxford. Not being as clever as Detective Chief Inspector Morse I decided to write a program to find anagrams in the text. This program I have now implemented in a number of languages.

The implementations are listed with compiled or semi-compiled languages first, and scripting languages after that. Within each group they are listed alphabetically.

`Ada`	Once the language of the future.
`C, traditional`	A traditional version, based on the C OO-inspired version.
`C, OO-inspired`	As close to the C++ version that you can get in C.
`C++`	The reference implementation.
`C++, optimized`	A brutally optimized C++ version. Lasciate ogne speranza, voi ch’intrate.
`C#`	Microsoft’s language.
`Cobol`	The oldest language. No classes, no dynamic memory. Retro computing.
`D`	A variation, or derivation, of C++
`Fortran 77`	First standardised Fortran with string handling.
`Fortran 2018`	Recent Fortran standard.
`Java`	Sun Micosystems’s language.
`Kotlin`	Google’s language.
`ML (OCaml)`	A functional language
`Perl`	Scripting inspired by awk and sed.
`Perl, optimized`	A faster, more Perlish, version.
`PHP`	Server-side web scripting.
`Python`	Procedural-style scripting.
`Rust`	A variation, or derivation, of C++, focussed on memory safety
`Swift`	Apple’s language.
`Ruby`	Procedural-style scripting.

There are a few languages on the to-do list:

Javascript	Browser-side language from Netscape, now an ECMA standard.
Erlang	Must have at least on functional language. Very different from the rest.
Lisp	Also very different from the rest.

Each version show the complete source code, and some comments. What is really striking is how very similar they all are.

Of course, the Cobol code looks very different, but it has basically the same structure and uses the same data.

First I will describe how the program works, then what the major data types are, what the major variables are, and finally the most important functions.

One thing to remember. A word is not an anagram by itself, only in relation to another word.

Key to the algorithm used is a `unified’ version of each word. It is simply the word with all letters rearranged in alphabetical order. For example, the unified version of TERASK is AEKRST. Two words are anagrams of each other if and only of they have the same unified version.

The steps involved are:

Read the input file, while removing all characters except letters. This will give exactly 512 letters
Convert all letters to upper case.
Given the length of the anagram, create a list of words.
If the length is 6, then the words are all consecutive six-character sequences in the text. There are 507 words of length 6.
Sort the words, according to
1) the unified version of the word, and
2) the offset within the letter-only string.
This sorting is key to the speed of the algoritm.
Compare the 507 words to each other to find anagrams. Since they are sorted, we only need to compare words that are next to one another.
All words with the same unified value are next to each other in the list of words and are candidates for the list of anagrams of a particular word. Excluded multiple occurences of the same sequence, as well as sequences that overlap. The exclusion rules are somewhat arbitrary, but generates all interesting sequenes without a lot of uninteresting ones.
After each range of idententical unified values, Print the result if it contains at least 3 words.
Repeat the process for a range of lengths, say from 4 letters to 10 letters.

The 512 character letter-only string looks like:

    FINDMEFINDTHESWEDISHDAUGHTERTHAW...THYMAIDENIWILLKISSTHEEWHENWEMEET

The output for lengths 4 to 7 is:

len = 4
  5:   68:ATER  148:TERA  158:TREA  199:ETRA  332:REAT
  3:   28:THAW  248:WHAT  311:AWTH
  3:  219:HETI  237:ITHE  340:HITE
  3:   24:HTER  180:HETR  238:THER
  3:   10:THES  155:HEST  497:STHE

len = 5
  3:   26:ERTHA  147:HTERA  198:HETRA
  3:   68:ATERS  148:TERAS  157:STREA

len = 6
  3:   68:ATERSK  148:TERASK  213:ERASKT
  5:   28:THAWMY  161:AMWHYT  248:WHATMY  311:AWTHYM  479:WTHYMA
  3:    0:FINDME  123:INDMEF  467:NDMEFI

len = 7
  3:  147:HTERASK  213:ERASKTH  225:RASKTHE

The first number on an anagram line is the number of anagrams. The number before the word is the relative position of the character within the letter-only string. It is convenient for verifying the result.

The order of lines within each length is a by-product of the algorithm used. It would have been trivial to sort them either in string order, or by the position of the first occurence. But that would have only added some not very interesting code, and I want to keep the programs short.

At first it seems that we would need to compare each word to each other word, making the algorithm ordo(n2), but by sorting the words in a clever way, it can be done in ordo (n logn) time. for large input, the execution time is dominated by the sorting. All other processing is ordo(n) or better.

There are two data types in the program, Word and AnagramFinder. A word contains:

offset	The offset within the letter-only string. Use for sorting, and is also printed in the result. Example: 148.
value	The word itself. Example TERASK.
unified	The unified version of the word. Example: AEKRST.

The only member function is the constructor.

constructor(offset, value, unified) Called by the AnagramFinder constructor

The class AnagramFinder is instantiated for each anagram length. It typically contains:

text	A pointer or reference to the letter-only string.
len	The length of the anagrams to find. Useful range is something like from 4 to 10.
words	An array of Word. 507 elements in the example.
accepted	The list of accepted words for the current anagram.

The following invariants must hold for the words in accepted:

All the words are anagrams of each other.
The words do not overlap each other.
There are no duplicates

Member functions.

constructor(text, len)	Builds the member array words, but does not sort it.
get_unified(value)	Returns the unified version of value. get_unified(“FINDME”) => “DEFIMN”.
print_one_result()	Prints a set of words that are anagrams of each other, giving a single output line. Called last thing by add_range().
add_candidate(cand)	Verifies that the candidate is not a duplicate and that it does not overlap, and if so adds it to accepted. Called by add_range().
add_range(begin, end)	Adds a range of candidates by calling add_candidate. Then, if there are at least 3 members, prints a result line. For each set of anagrams there is only a single call to add_range(). Called by find_all().
find_all()	Sorts words. This will put all words with identical unified values, and thus are anagrams of each other, next to each other. Iterates over words to find ranges that have identical unified values, and passes the ranges to add_range. Called by the main() program.

There are also a few global functions:

read_text(text)	Returns a single string containing all the upper-cased letters in the input.
main	Decodes the range of anagrams, given as one or two arguments on the command line. Creates one `AnagramFinder` for each length, and runs its find_all() member function.

A look at some data may make it all clearer. To the left below is an extract of the words array, after it has been constructed, but before it has been sorted. The three values are the offset, the word itself and the unified version of the word. The extract includes all anagrams of ‘FINDME’ and ‘THAWMY’. To the right the words have been sorted by unified value and by offset.

 words[  0]:    0 FINDME DEFIMN                     words[138]:   28 THAWMY AHMTWY
 words[  1]:    1 INDMEF DEFIMN                     words[139]:  161 AMWHYT AHMTWY
 words[  2]:    2 NDMEFI DEFIMN                     words[140]:  248 WHATMY AHMTWY
 words[  3]:    3 DMEFIN DEFIMN                     words[141]:  311 AWTHYM AHMTWY
 words[  4]:    4 MEFIND DEFIMN                     words[142]:  479 WTHYMA AHMTWY
 words[  5]:    5 EFINDT DEFINT                                    ...
 words[  6]:    6 FINDTH DFHINT                     words[201]:    0 FINDME DEFIMN
 words[  7]:    7 INDTHE DEHINT                     words[202]:    1 INDMEF DEFIMN
 words[  8]:    8 NDTHES DEHNST                     words[203]:    2 NDMEFI DEFIMN
 words[  9]:    9 DTHESW DEHSTW                     words[204]:    3 DMEFIN DEFIMN
 words[ 10]:   10 THESWE EEHSTW                     words[205]:    4 MEFIND DEFIMN
                  ...                               words[206]:  122 FINDME DEFIMN
 words[ 28]:   28 THAWMY AHMTWY                     words[207]:  123 INDMEF DEFIMN
                  ...                               words[208]:  124 NDMEFI DEFIMN
 words[122]:  122 FINDME DEFIMN                     words[209]:  125 DMEFIN DEFIMN
 words[123]:  123 INDMEF DEFIMN                     words[210]:  126 MEFIND DEFIMN
 words[124]:  124 NDMEFI DEFIMN                     words[211]:  465 FINDME DEFIMN
 words[125]:  125 DMEFIN DEFIMN                     words[212]:  466 INDMEF DEFIMN
 words[126]:  126 MEFIND DEFIMN                     words[213]:  467 NDMEFI DEFIMN
 words[127]:  127 EFINDT DEFINT                     words[214]:  468 DMEFIN DEFIMN
 words[128]:  128 FINDTH DFHINT                     words[215]:  469 MEFIND DEFIMN
 words[129]:  129 INDTHE DEHINT                     words[216]:  470 EFINDM DEFIMN
                  ...                               words[217]:  471 FINDME DEFIMN
 words[161]:  161 AMWHYT AHMTWY                                    ...
                  ...                               words[218]:    5 EFINDT DEFINT
 words[248]:  248 WHATMY AHMTWY                     words[219]:  127 EFINDT DEFINT
                  ...                               words[220]:  463 ETFIND DEFINT
 words[311]:  311 AWTHYM AHMTWY                                    ...
                  ...                               words[223]:    7 INDTHE DEHINT
 words[463]:  463 ETFIND DEFINT                     words[224]:  129 INDTHE DEHINT
                  ...                                              ...
 words[465]:  465 FINDME DEFIMN                     words[231]:    8 NDTHES DEHNST
 words[466]:  466 INDMEF DEFIMN                                    ...
 words[467]:  467 NDMEFI DEFIMN                     words[238]:    9 DTHESW DEHSTW
 words[468]:  468 DMEFIN DEFIMN                                    ...
 words[469]:  469 MEFIND DEFIMN                     words[266]:    6 FINDTH DFHINT
 words[470]:  470 EFINDM DEFIMN                     words[267]:  128 FINDTH DFHINT
 words[471]:  471 FINDME DEFIMN                                    ...
                  ...                               words[312]:   10 THESWE EEHSTW
 words[479]:  479 WTHYMA AHMTWY                     words[313]:  497 STHEEW EEHSTW
                  ...
 words[497]:  497 STHEEW EEHSTW

There are 8 ranges: [138-142], [201-217], [218-220], [223-224], [231] and three more in the sorted extract to the right. find_all() will pass each of these ranges to add_range(). In the first range all values will be accepted because there are no duplicates and no overlaps, resulting in a set of five anagrams. In the second range there are many duolicates and many overlaps, so the result is a set of only three anagrams. The third range, with three elements, has one duplicate, so the resulting set do not reach the threshold of three anagrams. The five last ranges are all to short short to start with, so add_range() will just skip them. Note that add_range() clears the result set every time it starts on a new range.

The above extract will produce:

  5:   28:THAWMY  161:AMWHYT  248:WHATMY  311:AWTHYM  479:WTHYMA
  3:    0:FINDME  123:INDMEF  467:NDMEFI

The sorted list explains the order of the result sets; the result sets are produced in the order of their unifed value.

A real AnagramFinder would of course return a result, perhaps a list of anagram sets, not just print the result and forget it.

Notes on the implementations

The implementations are all very similar. Most have classes, some don’t, but they are all similarly structured. All have a Word data type in some way, all have an array accepted of some sort for storing the current result set, and all store the input in a string text. As far as a language provides a convenient feature that does not obscure the code too much, I have used it. Some languages have anonymous functions or classes, but I have avoided them; the syntax is less than easy to read, and besides, putting a name to something, such as a function or a class, makes the code easier to understand.

You can reach me by email at “lars dash 7 at sdu dot se” or by telephone +46 705 189090

View source for the content of this page.

Notes on the implementations

Cancel reply