Search for a specific string in a file, print the index

Robin_Morabito · May 15, 2023, 3:40am

Hello,
this is probably going to be a dumb question, but I can’t find a way to do what I need.
Essentially I’m trying to make a hyphenation / syllable parser in SuperCollider.
What I hope to achieve is a system where I can input live text, for instance a string of text in supercollider or updating a text document outside supercollider, and have SC split the text in words, and the words in syllables, to ultimately use this as rhythmic patterns. I’m interested in doing this with the italian language, and since I was having a hard time implementing all the hyphenation rules algorithmically, I’ve decided to try a different approach. I have collected about 300+k words in italian from an open source database; divided all them in syllables using an online hyphenation software; done some housekeeping in Python and obtained two lists of 1 word per line:

one list contains all the entire words, one by one;
the other contains the same words but divided in syllables, one by one.
For my own comfort, I turned these txt lists into two CSV files, where each line is a single word (so there’s 1 column / 300000 rows approx).

My aim at this point is to be able to write a line of text (“Esempio, questa Stringa”) and have supercollider divide it first in words([“esempio”, “questa”, “stringa”]) and then in corresponding syllables ([ [“e”, “sem”, “pio”], [“que”, “sta”], [“strin”, “ga”] ], for now with no case-specificity and no interest for punctuation. To do so, I am using the .split method to separate at least the words every time there’s a space in the text, although I should also get to remove (or identify) punctuation, and make this case insensitive.

The biggest bottleneck I’ve found so far though is trying to parse each input word to its index in the database: I would need this so that I can find the corresponding entry of the input word in the hyphenated array(eg. first spotting “Esempio” in the whole-words array, then using the same index to associate “e-sem-pio” in the second array). The method .findAll doesn’t work as I expected, probably because I’m using arrays of strings nested inside another array.
Anyone has any tips? The task at hand is mainly: parsing an input string, case insensitive, with an array of strings (amongst which there must be the input), and outputting the corresponding index.

Thanks to everyone!
Robin

(PS. I haven’t abandoned the idea of getting to an algorithmic hyphenation system, which would have the advantage of being theoretically much faster and neat, but that will take longer tests and at this point I wanted to play with this toy for a bit since I’ve been moving quite slow at the beginning phase. That’s why I’m using datasets containing pretty much all text I could possibly write in italian, but feel free to suggest any other approach you might find useful)

jordan · May 15, 2023, 7:43am

Why not use https://pyphen.org/ then import it into supercollider?

jamshark70 · May 15, 2023, 7:53am

One reason for that is that you need to do at least N searches (one for each word in the text) and potentially M matching operations (one for each word in the dictionary) = worst case O(N*M), which, as the text and the dictionary grow, will get significantly slower. Potentially up to 300000 string matching operations for one input text word – that’s pretty much guaranteed to be painfully slow.

One way to handle this is to model, using objects, the way that we look up words in the dictionary. You don’t check every whole word; you zoom in, letter by letter. “Esempio” – first you go to the E’s; then find the s’s within that chapter; then ese’s and so on.

So the letters are really arranged in a tree structure.

(
~makeDict = { |stringPairs|
	var dict = IdentityDictionary.new;
	stringPairs.pairsDo { |str, hyphenation| ~putOneString.(dict, str, hyphenation) };
	dict
};

~putOneString = { |dict, str, hyphenation|
	var level = dict, sublevel;
	str.do { |char|
		sublevel = level[char];
		if(sublevel.isNil) {
			sublevel = IdentityDictionary.new;
			level.put(char, sublevel);
		};
		level = sublevel;
	};
	level[\wholeString] = str;
	level[\hyphenation] = hyphenation;
};

~matchString = { |dict, str|
	var level = dict;
	var stream = CollStream(str);
	var char;
	while {
		char = stream.next;
		if(char.notNil) {
			level = level[char];
		};
		char.notNil and: { level.notNil }
	};
	// and here, retrieve hyphenation
	level.tryPerform(\at, \hyphenation)
};
)

d = ~makeDict.([
	"hello", "hel-lo",
	"there", "there",
	"thereupon", "there-up-on"
]);

This dictionary is ugly when it prints out, but it will have everything. (Don’t try to print the whole dictionary…)

Then the lookup function just walks down through the levels, character by character. A nil result indicates that the search word wasn’t found.

~matchString = { |dict, str|
	var level = dict;
	var stream = CollStream(str);
	var char;
	while {
		char = stream.next;
		if(char.notNil) {
			level = level[char];
		};
		char.notNil and: { level.notNil }
	};
	// and here, retrieve hyphenation
	level.tryPerform(\at, \hyphenation)
};
)

~matchString.(d, "hello");  // yes
~matchString.(d, "hel");    // nil
~matchString.(d, "hellothere");  // nil
~matchString.(d, "not-even-close");  // nil

~matchString.(d, "there");    // yes
~matchString.(d, "thereup");  // nil
~matchString.(d, "thereupon");  // yes

You will still have to do one search for each word in the text – minimum O(N). But the lookup cost is now based on the number of characters in each word, not the size of the dictionary. Each IdentityDictionary lookup has a cost too, but hash table lookup is much less expensive than a linear search through an array.

EDIT: Hm, you could maybe even replace each IdentityDictionary level with a letter array – letters[char.ascii - 97] – and then use the 27th array item (index 26) to be either nil (not arrived at a word) or the hyphenation. Then the hash table lookup would get replaced by a faster, constant-time array indexing operation.

I don’t have your large dataset so I can’t benchmark, but if your hyphenation dictionary contains 300000 words, this approach could well be much faster.

hjh

jamshark70 · May 15, 2023, 8:10am

Faster array approach:

(
~makeDict = { |stringPairs|
	var dict = Array.newClear(27);
	stringPairs.pairsDo { |str, hyphenation| ~putOneString.(dict, str, hyphenation) };
	dict
};

~putOneString = { |dict, str, hyphenation|
	var level = dict, sublevel;
	str.do { |char|
		char = char.toLower;
		sublevel = level[char.ascii - 97];
		if(sublevel.isNil) {
			sublevel = Array.newClear(27);
			level.put(char.ascii - 97, sublevel);
		};
		level = sublevel;
	};
	// level[\wholeString] = str;
	level[26] = hyphenation;
};

~matchString = { |dict, str|
	var level = dict;
	var stream = CollStream(str);
	var char;
	while {
		char = stream.next;
		if(char.notNil) {
			level = level[char.ascii - 97];
		};
		char.notNil and: { level.notNil }
	};
	// and here, retrieve hyphenation
	level.tryPerform(\at, 26)
};
)

d = ~makeDict.([
	"hello", "hel-lo",
	"there", "there",
	"thereupon", "there-up-on"
]);

d looks really funny!

> [ nil, nil, nil, nil, nil, nil, nil, [ nil, nil, nil, nil, [ nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, [ nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, [ nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, [ nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, hel-lo ], nil, nil, nil, nil, ... ]

But the ~matchString calls in my earlier post all still work this way.

Edit: This way might be a little trickier with accented characters though – it assumes 26 letters only. Accents would need to map onto indices… “exercise for the reader.”

hjh

Robin_Morabito · May 15, 2023, 12:52pm

Thanks for the hint! I didn’t know about the existence of this. Seems quite appropriate for what I’m doing, but I’m not sure how I would make it communicate real-time with Supercollider, maybe OSC?
The thing of having the hyphenation happen in supercollider is to having a fast lane for writing → decomposing text → recomposing rhythms. TBH I’m not sure where i’m going ahhahahah, I’m checking out PyPhen to figure out how to make them talk together

jordan · May 15, 2023, 1:48pm

Assuming you can use unixCmdGetStdOut.

Run
pip install pyphen in the terminal first,
then in supercollider:

~get_sylabs = {|text|
	format(
		"python -c \"import pyphen; out = pyphen.Pyphen(left=1, right=1, lang='it_IT').inserted('%'); print(out)\"",
		text
	).unixCmdGetStdOut.split($-)
}

~get_sylabs.("Esempio, questa Stringa").do(_.postln)

returns:

E
sem
pio, 
que
sta 
Strin
ga

Seems pretty snappy to me.

Edit: added the split($-)

Edit 2: just fixed the fact it split Esem wrong

"Esempio, questa Stringa".reject(_ == $,).split($ ).collect(~get_sylabs)

Robin_Morabito · May 15, 2023, 2:04pm

This looks really elegant but I’m not quite grasping it. Would you mind walk me through your steps a bit?
The way I understand it I would mainly have to create a dictionary of associations string (whole word) with hyphenation (word syllables). Is that what you did with the variable d?
I’ve never used IdentityDictionary before, so perhaps I will need to take a deep dive there to properly get it. Trying it out, it does work in a pretty neat way (except for accented vowels) but I’m not gonna lie, your coding is VERY ESOTERIC for me

jamshark70 · May 15, 2023, 3:01pm

The main problem may be understanding the data structure.

This is basically guaranteed to be inefficient.

Instead what I did was to create a dictionary of first letters: keys are a, b, c … z. The “a” entry points to a dictionary of second letters coming after an initial a; the “b” entry points to a dictionary of second letters coming after an initial b and so on. Then for example the “a” entry underneath “b” points to a dictionary of third letters coming after “ba” and so on.

So the original words exist in the dictionary only as chains of associations.

At the end of each chain, there’s the hyphenated variant.

So instead of searching every entry, I thought to walk through the tree, where each letter narrows down the territory until you find either a matching end of the chain, or you run out of letters in the search string (failure) or you get to the end of the chain without exhausting the search string (also failure). This is exactly the process you use when you open up a paper dictionary.

If you’re looking for an 8 letter word, you’ll either find it, or not, in 8 steps, not potentially thousands of iterations.

hjh

Robin_Morabito · May 15, 2023, 3:27pm

Thanks for the tips, am I doing something wrong?

returns [ ]

jordan · May 15, 2023, 3:51pm

Did you install pyphen and get their demo working from their website? Also are you on windows?

Robin_Morabito · May 15, 2023, 3:59pm

i installed PyPhen and it works fine using the .runInTerminal method in SC instead of .unixCmdGetStdOut (it opens the terminal and prints the output there). I’m on OSX 10.13.
oh boy, oh my

jordan · May 15, 2023, 4:05pm

Weird…
does this return anything?

"ls".unixCmdGetStdOut;

Robin_Morabito · May 15, 2023, 4:07pm

Ok I can follow this. My only question is about the d = ~makeDict. As far as I understand, that’s where you created a tiny dictionary of 3 words with their corresponding hyphenation - is that correct?

If so, I was just wondering if it makes sense to create an array of associations containing whole words interleaved with their hyphenation like you did, something like ~array = ["aba", "a-ba", "abaca", "a-ba-ca"..."zaino", "zai-no"], and then simply throw it into d = ~makeDict.([~array])

I understand that you’re not iterating through the whole thing every time but only chaining letters together, it’s just not clear to me how these associations are stored initially. Hope this question makes sense and as always appreciate a lot your explanations

Robin_Morabito · May 15, 2023, 4:07pm

yes, it returns

-> Applications
Documents
Library
Network
SuperCollider
System
Users
Volumes
WaveShell-AU 9.3.component
WaveShell-VST 9.3.vst
Waveshells
bin
cartella senza titolo
cores
dev
etc
home
ixilang
libtensorflowlite_c.dylib
net
opt
private
sbin
tmp
usr
var

jordan · May 15, 2023, 4:26pm

Then I’m stuck because it works here and since, both, pyphen and get from terminal are working it should work for you.

Try running this in the terminal directly? does it return E-sem-pio

python -c "import pyphen; out = pyphen.Pyphen(left=1, right=1, lang='it_IT').inserted('Esempio'); print(out)"

Maybe try copying the function again, I edited it a few times.

~get_sylabs = {|text|
	format(
		"python -c \"import pyphen; out = pyphen.Pyphen(left=1, right=1, lang='it_IT').inserted('%'); print(out)\"",
		text
	).unixCmdGetStdOut.split($-)
};

~get_sylabs.("Esempio, questa Stringa")

Robin_Morabito · May 15, 2023, 4:55pm

actually it does for some reason it doesn’t work in supercollider. Well, thanks anyway for your help, I’ll try to figure out what’s up with unixCmdGetStdOut

Robin_Morabito · May 15, 2023, 10:22pm

Ok, made a laced array in Python to have it in the form [“word-string”, “word-syllables”, … “wordn-string”, “wordn-syllables”]. I’ve tried the function you wrote and you’re totally right: it’s extremely fast, and accurate. This is the perfect solution for what I was looking for! It takes only a few seconds to initialize the merged array (it contains now way more than a million entries! it was more than 600k words, same amount of hyphenated words) but after that, I can work the way I wanted (typing a word → getting hyphenation). Super cool. Now I’m gonna have fun hooking it up to a bunch of stuff! Hope to post interesting updates soon

jamshark70 · May 16, 2023, 12:27am

Yep – alternately, you could read the pairs incrementally and call ~putOneString.(d, word, word_syllables) for each pair (avoiding the need for a very large array of pairs).

Glad to hear that it performs as well as I hoped!

hjh

prko · May 16, 2023, 2:57am

I got nothing from your code in my SC-IDE under MacOS 12.6.5 on the Apple M1 Max chip:

jordan:

Maybe try copying the function again, I edited it a few times.

~get_sylabs = {|text|
	format(
		"python -c \"import pyphen; out = pyphen.Pyphen(left=1, right=1, lang='it_IT').inserted('%'); print(out)\"",
		text
	).unixCmdGetStdOut.split($-)
};

~get_sylabs.("Esempio, questa Stringa")

It returns as follows:

-> [  ]

The following works in the terminal:

Last login: Tue May 16 08:18:32 on ttys000
(base) prko@prkoMBP2021 ~ % which python
/Users/prko/opt/anaconda3/bin/python
(base) prko@prkoMBP2021 ~ % which python3
/Users/prko/opt/anaconda3/bin/python3
(base) prko@prkoMBP2021 ~ % which python3

In Supercollider, the following codes return the expected lists:

["ls"].unixCmd
"ls".unixCmd

x = "ls".unixCmdGetStdOut
x

y = ["ls"].unixCmdGetStdOut
y

How could I solve the problem?

Robin_Morabito · May 16, 2023, 3:08am

Yea i’ve also bailed this for the moment, I’m not sure what’s causing this. There might be a way but I haven’t gotten the facilities to try it lol - it would be trying to force Python to respond to the request. It seems the problem might be that Supercollider is not receiving the print(out) string, but if you have the identical problem as me, if you launch this

~get_sylabs = {|text|
	format(
		"python -c \"import pyphen; out = pyphen.Pyphen(left=1, right=1, lang='it_IT').inserted('%'); print(out)\"",
		text
	).runInTerminal.split($-)
};

and then this

~get_sylabs.("Esempio, questa Stringa")

you should have the output in the terminal, right?
I’m not so skilled in Python either, but I couldn’t find nothing online about this issue. Perhaps avoiding the (print) and just letting python store the content of the string into a variable, it could be possible to use Supercollider to retrieve it?