How to match "\" in String with findRegexp?

catniptwinz · September 1, 2020, 10:11pm

I’m trying to parse a compile string using findRegexp to detect all instances of Symbol plus any methods called on those Symbols and any values passed as arguments to those method calls.

I’m pretty clear on how to write the actual regex, but I’ve run into an embarrassingly basic sclang issue while implementing my String parsing Function: how to match an actual backslash character in a String?

Consider this string:

~str = "This String contains the Symbol \\foo, among other things."

Intuitively, I’d assume that we could match "\foo" with:

~str.findRegexp("\\\foo")

In other words, one backslash for the String escape, another for the regex escape, and a third for the literal backslash. That’s not the case, though – the above expression returns an empty Array.

Any thoughts on what I’m missing here would be very much appreciated!

jamshark70 · September 1, 2020, 11:13pm

So the regexp itself needs to contain two backslashes: one escape, and one for the character.

Then the SC string literal requires both of these to be escaped: "\\\\".

That’s “sc escape, regexp escape, sc escape, regexp character to match.”

hjh
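The two escaping layers described above can be checked outside sclang. The following is a minimal sketch in Python, whose non-raw string literals also treat `\\` as one backslash; it is only a stand-in, and the assumption is that sclang's string parser and regex engine handle these particular escapes the same way. The variable names are made up for the illustration.

```python
import re

# The text being searched contains ONE backslash before "foo".
text = "This String contains the Symbol \\foo, among other things."

# Layer 1 (string literal): "\\\\foo" holds the five characters \\foo.
pattern = "\\\\foo"

# Layer 2 (regex engine): \\ in the pattern matches one literal backslash.
match = re.search(pattern, text)

print(len(pattern))    # 5
print(match.group(0))  # \foo
print(match.start())   # 32

# The three-backslash attempt: in Python, "\\\foo" is a backslash, then the
# form-feed escape \f, then "oo" (four characters), so it cannot match "\foo".
# Assumption: sclang reads \f inside a String literal in a similar way.
three_backslashes = "\\\foo"
print(len(three_backslashes))                # 4
print(re.search(three_backslashes, text))    # None
```

Counting the characters that survive the string literal is a quick way to see which backslashes actually reach the regex engine.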

catniptwinz · September 1, 2020, 11:43pm

Ah, right. It’s:

~str = "This String contains the Symbol \\foo, among other things."
~str.findRegexp("\\\\foo")

…or more generically for \this form of Symbol:

~str.findRegexp("(?<!\\$)\\\\\\w\\w*")

Thanks so much!

catniptwinz · September 2, 2020, 12:37am

In case anyone finds this thread while trying to detect Symbols in a compile string, hopefully this will be helpful. I’m not especially a regex wizard myself, but I was able to start from the patterns that scvim uses for syntax highlighting and take things from there.

There are three forms of Symbol that we might need to capture: \symbol, 'symbol', and symbol:. Each of these has slightly different rules for which characters they can contain. Here’s a PCRE expression for each form as I understand it:

\symbol  -- (?<!\$)\\\w\w*
'symbol' -- (?<!(\w|\\))\'.*?(?<!\\)\'
 symbol: -- \w+:

Thanks to @jamshark70’s assistance above, here they are as properly escaped sclang Strings:

\symbol  -- "(?<!\\$)\\\\\\w\\w*"
'symbol' -- "(?<!(\\w|\\\\))\'.*?(?<!\\\\)\'"
 symbol: -- "\\w+:"

In practice you’ll probably want to capture multiple forms. In my case, I’m only concerned with catching arguments passed to particular instance methods of Symbol, so only \symbol and 'symbol' are relevant and the part of my regex that captures the Symbol itself looks like:

 "((?<!\\$)\\\\\\w\\w*|(?<!(\\w|\\\\))\'.*?(?<!\\\\)\')"

EDIT: please see below – these expressions are close, but not exact.
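For anyone who wants to try the three expressions before adapting them, here is a small sketch using Python's `re` module as a stand-in for sclang's engine (an assumption: the lookbehind, `\w`, and lazy-quantifier features used here are taken to behave the same way in both). The sample input and the helper name `find_all` are invented for the illustration, and the patterns are the first drafts from this post, not the refined ones discussed later.

```python
import re

# Raw strings stand in for the "bare" PCRE column above.
PATTERNS = {
    "backslash": r"(?<!\$)\\\w\w*",
    "quoted": r"(?<!(\w|\\))\'.*?(?<!\\)\'",
    "keyword": r"\w+:",
}

# The doubly escaped \symbol pattern, typed as a non-raw string literal.
ESCAPED_BACKSLASH_FORM = "(?<!\\$)\\\\\\w\\w*"


def find_all(pattern, text):
    """Return the full text of every match (group 0), ignoring capture groups."""
    return [m.group(0) for m in re.finditer(pattern, text)]


# Hypothetical input, made up for this illustration.
sample = r"a = \foo; b = 'bar baz'; c = (freq: 440); d = $\n;"

for name, pattern in PATTERNS.items():
    print(name, find_all(pattern, sample))

# After string-literal processing, the escaped form is the same regex.
print(ESCAPED_BACKSLASH_FORM == PATTERNS["backslash"])  # True
```

Note that the `$\n` character literal is skipped by the negative lookbehind, which is the reason for the `(?<!\$)` prefix.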

VIRTUALDOG · September 2, 2020, 5:31am

This is close but not quite there – I’m pretty sure that for symbol:s, it’s [a-zA-Z_]\w* (at least, this is closer to the truth than \w+); for \symbols, it’s (?<!\$)\\[a-zA-Z_]?\w*. \symbols beginning with a digit are disallowed and the “empty symbol” \ is allowed. Most lexers of SC grammar don’t get that right (even the IDE’s highlighter doesn’t correctly handle it).

catniptwinz · September 2, 2020, 5:11pm

This is great; thank you!

And as usual with regex, it looks like the edge cases might be a bit more complex still. For example, \1foo is captured by (?<!\$)\\[a-zA-Z_]?\w* but is not a valid Symbol. Interestingly, \1 seems to be a valid symbol, and it looks like _symbol: works as well.

Also, I’m seeing some unexpected results when parsing certain instances of symbol: Evaluating (1foo: true) throws a DoesNotUnderstandError for 1.foo, and indeed, evaluating (1neg: true) returns -1.

To be clear, that last issue doesn’t have any practical consequences for me; was just quite surprised to see it! I’d assume it’s a consequence of supporting the form 1 min: 0, as (1min: 0) returns 0. In fact:

(1min: 0) // 0
(\1min: 0) // 1
('1min': 0) // ('1min': 0)

VIRTUALDOG · September 2, 2020, 9:37pm

Thanks for catching that; it’s been a while since I revisited this. I made a mistake in my regex (the one I gave above matches the same language as yours) and a mistake in my description. You’re right that sequences of digits work with \symbols. I believe it should be:

\\(\d*|[a-zA-Z_]\w*)

That is, either 0 or more digits, or a non-digit word character followed by 0 or more word chars.

> Also, I’m seeing some unexpected results when parsing certain instances of symbol: Evaluating (1foo: true) throws a DoesNotUnderstandError for 1.foo, and indeed, evaluating (1neg: true) returns -1.

Yes, 1foo: true is parsed as literal 1, selector foo:, whitespace, literal true (sctweets folks, take note!). I think most people would expect intuitively that you need whitespace between 1 and foo there, but you don’t. I think this is a mistake in the interpreter’s lexer rather than an intentional decision.

catniptwinz · September 3, 2020, 3:26am

Definitely, but I think the first clause is actually “0 or more digits not followed by a non-digit word character,” so we don’t match e.g. \1foo. I think (?<!\$)\\(\d+(?![a-zA-Z_])|[a-zA-Z_]\w*) should do it.

I’d imagine so – I don’t typically use the “object method: arg” syntax, so hadn’t considered until now how much potential there is for collision with the syntax of Events. Thanks again for your insight!

VIRTUALDOG · September 3, 2020, 4:17am

The regex I provided will not match \1foo, but it will match the \1 inside it, which is a valid symbol, and that is the same way sclang’s lexer would tokenize that string – as a symbol followed by an identifier. What you are talking about now is a grammar-level concern: an identifier cannot directly follow a symbol in sclang. If you need to check that the string itself is valid sclang code, you may want to offload that work to String:-compile if you can. Regexes alone cannot analyze non-regular grammars like sclang’s.
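The difference between the tokenizer-style pattern and the lookahead variant can be seen on a few inputs. This is a sketch using Python's `re` as a stand-in; whether sclang's engine picks alternatives in the same order is an assumption. One thing it surfaces: in an engine with ordered alternation, the `\d*` branch can succeed with zero digits before the identifier branch is tried, so `\\(\d*|[a-zA-Z_]\w*)` as written stops at the bare backslash of `\foo`. Putting the identifier branch first (the `reordered` pattern, which is not from the thread) avoids that.

```python
import re

# Patterns from the thread, written as Python raw strings.
tokenizer_style = re.compile(r"\\(\d*|[a-zA-Z_]\w*)")
lookahead_style = re.compile(r"(?<!\$)\\(\d+(?![a-zA-Z_])|[a-zA-Z_]\w*)")

# Hypothetical variant: same branches, identifier branch tried first.
reordered = re.compile(r"\\([a-zA-Z_]\w*|\d*)")


def first_match(pattern, text):
    """Return the text of the first match, or None if there is none."""
    m = pattern.search(text)
    return m.group(0) if m else None


samples = ["\\foo", "\\1", "\\1foo", "\\"]
for s in samples:
    print(
        repr(s),
        repr(first_match(tokenizer_style, s)),
        repr(first_match(lookahead_style, s)),
        repr(first_match(reordered, s)),
    )
```

With these inputs, the lookahead variant rejects `\1foo` outright and also no longer matches the empty symbol `\`, since `\d+` requires at least one digit, while the reordered pattern yields `\1` for `\1foo`, in line with the symbol-then-identifier tokenization described above.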

I’m a bit confused about what you’re trying to accomplish beyond this particular problem, to be honest. Is there use in picking out symbols from a string even if it won’t compile? And is your input sanitized so that nested block comments are eliminated? (i.e. how do you make sure you don’t match either of the “symbols” in /* /* \foo */ \bar */?)
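To illustrate why nested block comments are out of reach for a plain regex, here is a minimal sketch of a pre-pass that removes them with a depth counter. It is written in Python for easy testing; the function name is hypothetical, and it deliberately ignores string literals, char literals, and `//` line comments, which a real sanitizer would also have to handle.

```python
def strip_nested_block_comments(source):
    """Remove /* ... */ comments, honoring nesting, with a depth counter."""
    out = []
    depth = 0
    i = 0
    while i < len(source):
        pair = source[i:i + 2]
        if pair == "/*":
            depth += 1          # entering a (possibly nested) comment
            i += 2
        elif pair == "*/" and depth > 0:
            depth -= 1          # leaving one nesting level
            i += 2
        else:
            if depth == 0:
                out.append(source[i])  # keep only text outside comments
            i += 1
    return "".join(out)


cleaned = strip_nested_block_comments("a /* /* \\foo */ \\bar */ b")
print(repr(cleaned))  # 'a  b'
```

Running a symbol regex on the cleaned text, and not on the raw source, keeps it from matching either of the commented-out "symbols".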

Btw, if you’d like a reference, here is where \symbols are tokenized: supercollider/lang/LangSource/PyrLexer.cpp at develop · supercollider/supercollider · GitHub

And here is where 'symbols' are tokenized: supercollider/lang/LangSource/PyrLexer.cpp at develop · supercollider/supercollider · GitHub

catniptwinz · September 3, 2020, 4:55am

That makes perfect sense; thank you.

Sorry about that. Indeed, as you may have guessed, this did turn out to be a dead end. In the interest of leaving the thread in a place that would be useful to anyone who finds it in the future, I wanted to put as fine a point as I could on the regex before moving on. Perhaps ironically, I think your reminder that I’ve unintentionally wandered outside the domain of regex proper is exactly that point!