In Colin Dexters mystery novel The way through the woods a vital clue comes in the form of a poem:
Find me, find the Swedish daughter - // Thaw my frosted tegument! Dry the azured skylit water, // Sky my everlasting tent. Who spied, who spied that awful spot? // O find me! Find the woodman's daughter! Ask the stream: "Why tellst thou not // The truth thou knowst - the tragic slaughter?" Ask the tiger, ask the sun // Whither riding, what my plight? Till the given day be run, // Till the burning of the night. Thyme, I saw Thyme flow'ring here // A creature white trapped in a gin, Panting like a hunted deer // Licking still the bloodied skin. With clues surveyed so wondrous laden, // Hunt the ground beneath thy feet! Find me, find me now, thy maiden, // I will kiss thee when we meet.
Perhaps not a very good poem, but in it are several anagrams of a place-name near Oxford. Not being as clever as Detective Chief Inspector Morse I decided to write a program to find anagrams in the text. This program I have now implemented in a number of languages.
The implementations are listed with compiled or semi-compiled languages first, and scripting languages after that. Within each group they are listed alphabetically.
|Ada||Once the language of the future.|
|C, traditional||A traditional version, based on the C OO-inspired version.|
|C, OO-inspired||As close to the C++ version that you can get in C.|
|C++||The reference implementation.|
|C++, optimized||A brutally optimized C++ version. Lasciate ogne speranza, voi ch’intrate.|
|Cobol||The oldest language. No classes, no dynamic memory. Retro computing.|
|D||A variation, or derivation, of C++|
|Fortran 77||First standardised Fortran with string handling.|
|Fortran 2018||Recent Fortran standard.|
|Java||Sun Micosystems’s language.|
|ML (OCaml)||A functional language|
|Perl||Scripting inspired by awk and sed.|
|Perl, optimized||A faster, more Perlish, version.|
|PHP||Server-side web scripting.|
|Rust||A variation, or derivation, of C++, focussed on memory safety|
There are a few languages on the to-do list:
|Erlang||Must have at least on functional language. Very different from the rest.|
|Lisp||Also very different from the rest.|
Each version show the complete source code, and some comments. What is really striking is how very similar they all are.
Of course, the Cobol code looks very different, but it has basically the same structure and uses the same data.
First I will describe how the program works, then what the major data types are, what the major variables are, and finally the most important functions.
One thing to remember. A word is not an anagram by itself, only in relation to another word.
Key to the algorithm used is a `unified’ version of each word. It is simply the word with all letters rearranged in alphabetical order. For example, the unified version of TERASK is AEKRST. Two words are anagrams of each other if and only of they have the same unified version.
The steps involved are:
- Read the input file, while removing all characters except letters. This will give exactly 512 letters
- Convert all letters to upper case.
- Given the length of the anagram, create a list of words.
- If the length is 6, then the words are all consecutive six-character sequences in the text. There are 507 words of length 6.
- Sort the words, according to
- 1) the unified version of the word, and
- 2) the offset within the letter-only string.
- This sorting is key to the speed of the algoritm.
- Compare the 507 words to each other to find anagrams. Since they are sorted, we only need to compare words that are next to one another.
- All words with the same unified value are next to each other in the list of words and are candidates for the list of anagrams of a particular word. Excluded multiple occurences of the same sequence, as well as sequences that overlap. The exclusion rules are somewhat arbitrary, but generates all interesting sequenes without a lot of uninteresting ones.
- After each range of idententical unified values, Print the result if it contains at least 3 words.
- Repeat the process for a range of lengths, say from 4 letters to 10 letters.
The 512 character letter-only string looks like:
The output for lengths 4 to 7 is:
len = 4 5: 68:ATER 148:TERA 158:TREA 199:ETRA 332:REAT 3: 28:THAW 248:WHAT 311:AWTH 3: 219:HETI 237:ITHE 340:HITE 3: 24:HTER 180:HETR 238:THER 3: 10:THES 155:HEST 497:STHE len = 5 3: 26:ERTHA 147:HTERA 198:HETRA 3: 68:ATERS 148:TERAS 157:STREA len = 6 3: 68:ATERSK 148:TERASK 213:ERASKT 5: 28:THAWMY 161:AMWHYT 248:WHATMY 311:AWTHYM 479:WTHYMA 3: 0:FINDME 123:INDMEF 467:NDMEFI len = 7 3: 147:HTERASK 213:ERASKTH 225:RASKTHE
The first number on an anagram line is the number of anagrams. The number before the word is the relative position of the character within the letter-only string. It is convenient for verifying the result.
The order of lines within each length is a by-product of the algorithm used. It would have been trivial to sort them either in string order, or by the position of the first occurence. But that would have only added some not very interesting code, and I want to keep the programs short.
At first it seems that we would need to compare each word to each other word, making the algorithm ordo(n2), but by sorting the words in a clever way, it can be done in ordo (n logn) time. for large input, the execution time is dominated by the sorting. All other processing is ordo(n) or better.
There are two data types in the program, Word and AnagramFinder. A word contains:
|offset||The offset within the letter-only string. Use for sorting, and is also printed in the result. Example: 148.|
|value||The word itself. Example TERASK.|
|unified||The unified version of the word. Example: AEKRST.|
The only member function is the constructor.
|constructor(offset, value, unified)||Called by the AnagramFinder constructor|
The class AnagramFinder is instantiated for each anagram length. It typically contains:
|text||A pointer or reference to the letter-only string.|
|len||The length of the anagrams to find. Useful range is something like from 4 to 10.|
|words||An array of Word. 507 elements in the example.|
|accepted||The list of accepted words for the current anagram.|
The following invariants must hold for the words in accepted:
- All the words are anagrams of each other.
- The words do not overlap each other.
- There are no duplicates
|constructor(text, len)||Builds the member array words, but does not sort it.|
|get_unified(value)||Returns the unified version of value. get_unified(“FINDME”) => “DEFIMN”.|
|print_one_result()||Prints a set of words that are anagrams of each other, giving a single output line. Called last thing by add_range().|
|add_candidate(cand)||Verifies that the candidate is not a duplicate and that it does not overlap, and if so adds it to accepted. Called by add_range().|
|add_range(begin, end)||Adds a range of candidates by calling add_candidate. Then, if there are at least 3 members, prints a result line. For each set of anagrams there is only a single call to add_range(). Called by find_all().|
|find_all()||Sorts words. This will put all words with identical unified values, and thus are anagrams of each other, next to each other. Iterates over words to find ranges that have identical unified values, and passes the ranges to add_range. Called by the main() program.|
There are also a few global functions:
|read_text(text)||Returns a single string containing all the upper-cased letters in the input.|
|main||Decodes the range of anagrams, given as one or two arguments on the command line. Creates one AnagramFinder for each length, and runs its find_all() member function.|
A look at some data may make it all clearer. To the left below is an extract of the words array, after it has been constructed, but before it has been sorted. The three values are the offset, the word itself and the unified version of the word. The extract includes all anagrams of ‘FINDME’ and ‘THAWMY’. To the right the words have been sorted by unified value and by offset.
words[ 0]: 0 FINDME DEFIMN words: 28 THAWMY AHMTWY words[ 1]: 1 INDMEF DEFIMN words: 161 AMWHYT AHMTWY words[ 2]: 2 NDMEFI DEFIMN words: 248 WHATMY AHMTWY words[ 3]: 3 DMEFIN DEFIMN words: 311 AWTHYM AHMTWY words[ 4]: 4 MEFIND DEFIMN words: 479 WTHYMA AHMTWY words[ 5]: 5 EFINDT DEFINT ... words[ 6]: 6 FINDTH DFHINT words: 0 FINDME DEFIMN words[ 7]: 7 INDTHE DEHINT words: 1 INDMEF DEFIMN words[ 8]: 8 NDTHES DEHNST words: 2 NDMEFI DEFIMN words[ 9]: 9 DTHESW DEHSTW words: 3 DMEFIN DEFIMN words[ 10]: 10 THESWE EEHSTW words: 4 MEFIND DEFIMN ... words: 122 FINDME DEFIMN words[ 28]: 28 THAWMY AHMTWY words: 123 INDMEF DEFIMN ... words: 124 NDMEFI DEFIMN words: 122 FINDME DEFIMN words: 125 DMEFIN DEFIMN words: 123 INDMEF DEFIMN words: 126 MEFIND DEFIMN words: 124 NDMEFI DEFIMN words: 465 FINDME DEFIMN words: 125 DMEFIN DEFIMN words: 466 INDMEF DEFIMN words: 126 MEFIND DEFIMN words: 467 NDMEFI DEFIMN words: 127 EFINDT DEFINT words: 468 DMEFIN DEFIMN words: 128 FINDTH DFHINT words: 469 MEFIND DEFIMN words: 129 INDTHE DEHINT words: 470 EFINDM DEFIMN ... words: 471 FINDME DEFIMN words: 161 AMWHYT AHMTWY ... ... words: 5 EFINDT DEFINT words: 248 WHATMY AHMTWY words: 127 EFINDT DEFINT ... words: 463 ETFIND DEFINT words: 311 AWTHYM AHMTWY ... ... words: 7 INDTHE DEHINT words: 463 ETFIND DEFINT words: 129 INDTHE DEHINT ... ... words: 465 FINDME DEFIMN words: 8 NDTHES DEHNST words: 466 INDMEF DEFIMN ... words: 467 NDMEFI DEFIMN words: 9 DTHESW DEHSTW words: 468 DMEFIN DEFIMN ... words: 469 MEFIND DEFIMN words: 6 FINDTH DFHINT words: 470 EFINDM DEFIMN words: 128 FINDTH DFHINT words: 471 FINDME DEFIMN ... ... words: 10 THESWE EEHSTW words: 479 WTHYMA AHMTWY words: 497 STHEEW EEHSTW ... words: 497 STHEEW EEHSTW
There are 8 ranges: [138-142], [201-217], [218-220], [223-224],  and three more in the sorted extract to the right. find_all() will pass each of these ranges to add_range(). In the first range all values will be accepted because there are no duplicates and no overlaps, resulting in a set of five anagrams. In the second range there are many duolicates and many overlaps, so the result is a set of only three anagrams. The third range, with three elements, has one duplicate, so the resulting set do not reach the threshold of three anagrams. The five last ranges are all to short short to start with, so add_range() will just skip them. Note that add_range() clears the result set every time it starts on a new range.
The above extract will produce:
5: 28:THAWMY 161:AMWHYT 248:WHATMY 311:AWTHYM 479:WTHYMA 3: 0:FINDME 123:INDMEF 467:NDMEFI
The sorted list explains the order of the result sets; the result sets are produced in the order of their unifed value.
A real AnagramFinder would of course return a result, perhaps a list of anagram sets, not just print the result and forget it.
Notes on the implementations
The implementations are all very similar. Most have classes, some don’t, but they are all similarly structured. All have a Word data type in some way, all have an array accepted of some sort for storing the current result set, and all store the input in a string text. As far as a language provides a convenient feature that does not obscure the code too much, I have used it. Some languages have anonymous functions or classes, but I have avoided them; the syntax is less than easy to read, and besides, putting a name to something, such as a function or a class, makes the code easier to understand.
You can reach me by email at “lars dash 7 at sdu dot se” or by telephone +46 705 189090