This is probably a dumb question, but I can’t find a way to do what I need.
Essentially I’m trying to make a hyphenation / syllable parser in SuperCollider.
What I hope to achieve is a system where I can feed in live text (for instance a string inside SuperCollider, or a text document being updated outside it) and have SC split the text into words, and the words into syllables, to ultimately use them as rhythmic patterns.
I’m interested in doing this for Italian, and since I was having a hard time implementing all the hyphenation rules algorithmically, I’ve decided to try a different approach. I collected over 300,000 Italian words from an open source database, divided them all into syllables using an online hyphenation tool, did some housekeeping in Python, and ended up with two lists of one word per line:
- one list contains the whole words, one per line;
- the other contains the same words divided into syllables, one per line, in the same order.
For convenience, I turned these txt lists into two CSV files, where each row is a single word (so there is 1 column and approximately 300,000 rows).
My aim at this point is to be able to write a line of text (“Esempio, questa Stringa”) and have SuperCollider divide it first into words ([“esempio”, “questa”, “stringa”]) and then into the corresponding syllables ([ [“e”, “sem”, “pio”], [“que”, “sta”], [“strin”, “ga”] ]), for now ignoring case and punctuation. To do so, I am using the .split method to separate the words wherever there’s a space in the text, although I still need to remove (or identify) punctuation and make the whole thing case insensitive.
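To make the word-splitting step concrete, here is a minimal sketch in Python (which I used for the housekeeping), not SuperCollider code. The function name `split_words` is just something I made up for illustration, and I am assuming that anything that is not a letter (spaces, commas, apostrophes, digits) can simply be treated as a separator, which is probably too crude for elided forms like “l’albero”.

```python
import re

def split_words(text):
    """Lowercase the text and return its words, dropping punctuation."""
    # [^\W\d_]+ matches runs of Unicode letters only, so accented
    # Italian vowels (à, è, é, ì, ò, ù) stay inside their words.
    # Assumption: every non-letter character acts as a separator.
    return re.findall(r"[^\W\d_]+", text.lower())

words = split_words("Esempio, questa Stringa")
print(words)  # ['esempio', 'questa', 'stringa']
```

This is the behaviour I would like to reproduce in SC before moving on to the syllable lookup.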
The biggest bottleneck I’ve found so far, though, is mapping each input word to its index in the database. I need that index to find the corresponding entry in the hyphenated array (e.g. first locating “esempio” in the whole-words array, then using the same index to retrieve “e-sem-pio” from the second array). The .findAll method doesn’t work as I expected, probably because I’m using arrays of strings nested inside another array.
Does anyone have any tips? The task at hand is mainly: matching an input string, case insensitively, against an array of strings (which is guaranteed to contain the input), and outputting the corresponding index.
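In Python terms, the lookup I have in mind would be something like the sketch below. The three-word lists are tiny stand-ins for my real CSV files, the names `whole_words`, `hyphenated`, `index_of` and `syllables` are hypothetical, and I am assuming the hyphenated entries use “-” as the syllable separator. Building a word-to-index table once avoids scanning 300,000 entries for every input word; I assume a Dictionary could play the same role in SuperCollider, but I haven’t got that working.

```python
# Tiny stand-ins for the two parallel lists (same order in both).
whole_words = ["esempio", "questa", "stringa"]
hyphenated = ["e-sem-pio", "que-sta", "strin-ga"]

# Build the word -> index table once; keys are lowercased so that
# lookups are case insensitive.
index_of = {word.lower(): i for i, word in enumerate(whole_words)}

def syllables(word):
    """Return the syllables of `word`, or None if it is not in the list."""
    i = index_of.get(word.lower())
    if i is None:
        return None
    # Assumption: syllables are separated by "-" in the hyphenated list.
    return hyphenated[i].split("-")

print(index_of["questa"])    # 1
print(syllables("Esempio"))  # ['e', 'sem', 'pio']
```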
Thanks to everyone!
(PS. I haven’t abandoned the idea of an algorithmic hyphenation system, which would have the advantage of being theoretically much faster and neater, but that will take longer to test, and at this point I wanted to play with this toy for a bit, since I’ve been moving quite slowly in this beginning phase. That’s why I’m using datasets containing pretty much all the text I could possibly write in Italian, but feel free to suggest any other approach you might find useful.)