One reason for that is that a naive approach needs at least N searches (one for each word in the text), and each search may require up to M string comparisons (one for each word in the dictionary) – worst case O(N*M), which gets significantly slower as the text and the dictionary grow. With a 300000-word dictionary, that's potentially 300000 string matching operations for a single input word – pretty much guaranteed to be painfully slow.
One way to handle this is to model, using objects, the way we actually look up words in a printed dictionary. You don't check every whole word; you zoom in, letter by letter. Looking up "Esempio" – first you go to the E section; then find the "Es" entries within that section; then "Ese", and so on.
So the letters are really arranged in a tree structure (a trie).
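Just to show the shape before building it automatically: here is a hand-built fragment of such a tree for one word. (This is a sketch for illustration only – the code below constructs the same structure programmatically.)

```supercollider
(
// Each level is a dictionary keyed by Char; a complete word
// stores its data at the level reached by its last letter.
var leaf = IdentityDictionary[\wholeString -> "hello", \hyphenation -> "hel-lo"];
var tree = IdentityDictionary[
	$h -> IdentityDictionary[
		$e -> IdentityDictionary[
			$l -> IdentityDictionary[
				$l -> IdentityDictionary[$o -> leaf]
			]
		]
	]
];
tree[$h][$e][$l][$l][$o][\hyphenation].postln;  // prints hel-lo
)
```

Words sharing a prefix ("there", "thereupon") share the same initial chain of levels and branch only where they differ.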
(
~makeDict = { |stringPairs|
	var dict = IdentityDictionary.new;
	stringPairs.pairsDo { |str, hyphenation| ~putOneString.(dict, str, hyphenation) };
	dict
};

~putOneString = { |dict, str, hyphenation|
	var level = dict, sublevel;
	// walk down one level per character, creating levels as needed
	str.do { |char|
		sublevel = level[char];
		if(sublevel.isNil) {
			sublevel = IdentityDictionary.new;
			level.put(char, sublevel);
		};
		level = sublevel;
	};
	// mark the final level as a complete word
	level[\wholeString] = str;
	level[\hyphenation] = hyphenation;
};

~matchString = { |dict, str|
	var level = dict;
	var stream = CollStream(str);
	var char;
	// descend one level per character; stop at the end of the string,
	// or as soon as there is no branch for the current character
	while {
		char = stream.next;
		if(char.notNil) {
			level = level[char];
		};
		char.notNil and: { level.notNil }
	};
	// and here, retrieve the hyphenation (nil if there was no match)
	level.tryPerform(\at, \hyphenation)
};
)
d = ~makeDict.([
"hello", "hel-lo",
"there", "there",
"thereupon", "there-up-on"
]);
This dictionary is ugly when it prints out, but it will have everything. (Don’t try to print the whole dictionary…)
Then the lookup function just walks down through the levels, character by character. A nil
result indicates that the search word wasn’t found.
~matchString.(d, "hello"); // -> "hel-lo"
~matchString.(d, "hel"); // nil (prefix only, not a whole word)
~matchString.(d, "hellothere"); // nil
~matchString.(d, "not-even-close"); // nil
~matchString.(d, "there"); // -> "there"
~matchString.(d, "thereup"); // nil
~matchString.(d, "thereupon"); // -> "there-up-on"
You will still have to do one search for each word in the text – minimum O(N). But each lookup now costs time proportional to the number of characters in the word, not to the size of the dictionary. Each IdentityDictionary lookup has a cost too, but a hash-table lookup is much less expensive than a linear search through an array.
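If you want to get a rough feel for the lookup cost, SuperCollider's built-in `bench` will do (this assumes the definitions above and the small `d` dictionary have already been evaluated; absolute numbers will of course vary by machine):

```supercollider
(
// time 100000 lookups against the small test dictionary
var words = ["hello", "there", "thereupon", "nonsense"];
bench {
	100000.do {
		~matchString.(d, words.choose);
	};
};
)
```

The key point is that this time should stay essentially flat as the dictionary grows, because each lookup only ever touches one level per character.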
EDIT: Hm, you could maybe even replace each IdentityDictionary level with a 27-element letter array – indexed by `letters[char.ascii - 97]` – and use the 27th item (index 26) to hold either nil (not yet a complete word) or the hyphenation. Then each hash-table lookup would be replaced by a faster, constant-time array indexing operation.
I don’t have your large dataset so I can’t benchmark, but if your hyphenation dictionary contains 300000 words, this approach could well be much faster.
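A rough, untested sketch of that array-per-level idea might look like this (the names `~putOneStringArr` and `~matchStringArr` are mine, and it assumes input is lowercase a–z only):

```supercollider
(
// Each level is a 27-slot Array: indices 0-25 for letters a-z,
// index 26 for the hyphenation (nil until a complete word ends here).
~putOneStringArr = { |root, str, hyphenation|
	var level = root, i, sublevel;
	str.do { |char|
		i = char.ascii - 97;
		sublevel = level[i];
		if(sublevel.isNil) {
			sublevel = Array.newClear(27);
			level[i] = sublevel;
		};
		level = sublevel;
	};
	level[26] = hyphenation;
};

~matchStringArr = { |root, str|
	var level = root;
	str.do { |char|
		// once level is nil, stay nil (no branch for this character)
		if(level.notNil) { level = level[char.ascii - 97] };
	};
	level.tryPerform(\at, 26)
};

~root = Array.newClear(27);
~putOneStringArr.(~root, "hello", "hel-lo");
~matchStringArr.(~root, "hello").postln;  // prints hel-lo
~matchStringArr.(~root, "hel").postln;    // prints nil
)
```

The trade-off is memory: every level allocates 27 slots even for letters that never occur, but for a dictionary of this size that is probably acceptable.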
hjh