Parser not accepting the full list of symbols after `\` that work inside single quotes

RFluff · June 14, 2020, 7:28am

Does someone reading this have the “institutional memory” to say why only some symbols can be written with leading backslash? E.g. the following are parse errors: \+ or \a+. More obscurely \1 is valid but \1a is not.

The documentation says on two different pages (“Syntax Shortcuts” and “Symbolic Notations”) that the single quotes and backslash-led symbols are equivalent, without mentioning any caveats.

There have also been some bug reports and even (abandoned) pull requests opened on this, e.g. https://github.com/supercollider/supercollider/pull/2676

I was updating the documentation (Syntax Shortcuts) recently, but it’s not clear what to say on the matter, i.e. whether to state that the backslash is intended not to work for some stuff, or to say nothing because that’s a bug that should be fixed…

The relevant bits from the lexer (I think I got all of them in this snippet)

// in the big "if"
    if (c == '\\')
        goto symbol1;
    else if (c == '\'')
        goto symbol3;
        
// then

symbol1:
    c = input();

    if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '_')
        goto symbol2;
    else if (c >= '0' && c <= '9')
        goto symbol4;
    else {
        unput(c);
        yytext[yylen] = 0;
        r = processsymbol(yytext);
        goto leave;
    }

symbol2:
    c = input();

    if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '_' || (c >= '0' && c <= '9'))
        goto symbol2;
    else {
        unput(c);
        yytext[yylen] = 0;
        r = processsymbol(yytext);
        goto leave;
    }

symbol4:
    c = input();
    if (c >= '0' && c <= '9')
        goto symbol4;
    else {
        unput(c);
        yytext[yylen] = 0;
        r = processsymbol(yytext);
        goto leave;
    }

symbol3 : {
    int startline, endchar;
    startline = lineno;
    endchar = '\'';

    /*do {
        c = input();
    } while (c != endchar && c != 0);*/
    for (; yylen < MAXYYLEN;) {
        c = input();
        if (c == '\n' || c == '\r') {
            post("Symbol open at end of line on line %d of %s\n", startline + errLineOffset,
                 printingCurrfilename.c_str());
            yylen = 0;
            r = 0;
            goto leave;
        }
        if (c == '\\') {
            yylen--;
            c = input();
        } else if (c == endchar)
            break;
        if (c == 0)
            break;
    }
    if (c == 0) {
        post("Open ended symbol started on line %d of %s\n", startline + errLineOffset, printingCurrfilename.c_str());
        yylen = 0;
        r = 0;
        goto leave;
    }
    yytext[yylen] = 0;
    yytext[yylen - 1] = 0;
    r = processsymbol(yytext);
    goto leave;
}

So it looks like (symbol4 branch) numbers were intended to be supported: if the first char is a digit, only digits are accepted thereafter. As for the symbol2 branch (taken on letters and underscore), only the same plus digits are accepted thereafter. So, fairly “usual” rules for identifiers in many programming languages. But nothing else seems to have been intended to work after a backslash as a symbol besides numbers and “identifiers”. (The symbol3 branch is for stuff in single quotes.)
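To make the two backslash branches easier to compare, here is a minimal C++ sketch of what they accept. `backslashSymbolName` and the two character-class helpers are hypothetical names for illustration, not functions from PyrLexer.cpp, and the sketch only models which characters get consumed, not the `yytext` bookkeeping.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helpers, not functions from PyrLexer.cpp.
bool isLetterOrUnderscore(char c) {
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '_';
}

bool isDigit(char c) { return c >= '0' && c <= '9'; }

// `rest` is the source text immediately after the backslash.
// Returns the characters that would be consumed as the symbol's name.
std::string backslashSymbolName(const std::string& rest) {
    std::size_t n = 0;
    if (!rest.empty() && isLetterOrUnderscore(rest[0])) {
        // symbol2: identifier-like, so digits may follow the first character
        n = 1;
        while (n < rest.size() && (isLetterOrUnderscore(rest[n]) || isDigit(rest[n])))
            ++n;
    } else if (!rest.empty() && isDigit(rest[0])) {
        // symbol4: a leading digit admits only further digits
        n = 1;
        while (n < rest.size() && isDigit(rest[n]))
            ++n;
    }
    // Anything else leaves n == 0, i.e. the empty symbol.
    return rest.substr(0, n);
}

// The cases from the first post, written as checks (call from main).
void checkBackslashExamples() {
    assert(backslashSymbolName("aiff") == "aiff");
    assert(backslashSymbolName("a+") == "a");  // "+" is left for the parser
    assert(backslashSymbolName("+") == "");    // empty symbol, then "+"
    assert(backslashSymbolName("1a") == "1");  // "a" is left over
}
```

Under this reading, the parse errors for \+, \a+ and \1a presumably come from the leftover text, not from the symbol itself.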

Sooo, there’s actually a 3rd help page on this, “Literals” (SuperCollider 3.12.2 Help), which actually has details:

Symbols

A symbol can be written in two ways. One method is to enclose the contents in single quotes. Any printing character may be used within a symbol except for non-space whitespace characters ( \f, \n, \r, \t, \v ). Any single quotes within the symbol must be escaped ( \' ).

'x'
'aiff'
'BigSwiftyAndAssoc'
'nowhere here'
'somewhere there'
'.+o*o+.'
'\'symbol_within_a_symbol\''

A second way of notating symbols is by prefixing the word with a backslash. This is only legal if the symbol consists of a single word (a sequence of alphanumeric and/or underscore characters).

\x
\aiff
\Big_Swifty_And_Assoc
\not really a symbol // illegal

Thus “a sequence of alphanumeric and/or underscore characters” is basically the help-given specification.
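For comparison, the single-quote form from the help text can be sketched the same way. This is a rough C++ model of the symbol3 branch quoted above, assuming the input is already in a string. `quotedSymbolName` is a hypothetical name, and only the line-break and end-of-input checks visible in the snippet are modelled (the help page's wider whitespace restriction is not).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper, not a function from PyrLexer.cpp.
// `src` is the source text immediately after the opening single quote.
// On success stores the symbol's name in `name` and returns true; returns
// false if a line break or the end of input arrives before the closing quote.
bool quotedSymbolName(const std::string& src, std::string& name) {
    name.clear();
    for (std::size_t i = 0; i < src.size(); ++i) {
        char c = src[i];
        if (c == '\n' || c == '\r')
            return false;  // "Symbol open at end of line"
        if (c == '\'')
            return true;   // unescaped closing quote
        if (c == '\\') {
            // The backslash is dropped and the next character is kept as-is.
            if (++i == src.size())
                return false;
            c = src[i];
        }
        name += c;
    }
    return false;  // "Open ended symbol"
}

// Examples taken from the help page (call from main).
void checkQuotedExamples() {
    std::string name;
    assert(quotedSymbolName("nowhere here'", name) && name == "nowhere here");
    assert(quotedSymbolName(".+o*o+.'", name) && name == ".+o*o+.");
    assert(quotedSymbolName("\\'symbol_within_a_symbol\\''", name)
           && name == "'symbol_within_a_symbol'");
    assert(!quotedSymbolName("never closed", name));
}
```

The point of the contrast: inside quotes almost any character is kept, whereas the backslash form stops at the first character that is not alphanumeric or an underscore.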

But it does look like someone has put effort into making \1a not work… because otherwise they could have added the digits to the initial char allowed in the symbol 2 branch (as opposed to adding the whole symbol 4 branch.)

jamshark70 · June 14, 2020, 10:08am

I’m pretty sure those parser rules predate my involvement in SC (and probably go back to SC2).

It’s never come up as a question. The intuitive rule is, if it’s a valid variable or class identifier, it can be written as a \symbol, and if it isn’t a valid identifier, better use single quotes. That covers '1a' because identifiers may not start with digits. (I vaguely remember that \CapitalizedSymbols were not supported at first though, and were added later.)

I have no idea why that symbol4 branch is there. I always assumed a symbol consisting of only digits should be in single quotes.

Huh. For fairly straightforward reasons, a \symbol can’t contain spaces or operators, while 'a symbol can'. So the documentation couldn’t be correct here. (Although, to be fair, neither of those pages claims to contain an exhaustive survey of the types of symbols one might create. Possibly they should.)

hjh

VIRTUALDOG · June 15, 2020, 2:18pm

It’s never come up as a question

But it has – I asked it

RFluff · June 15, 2020, 2:48pm

Well, I had noticed it (it’s linked in my first post here)… but I had been wondering if maybe there was some kind of “good reason” to disallow that. Although spaces are recommended around binary operators, technically they are not needed, so I’ve been experimenting for example with

\abc++\def // -> abcdef
\abc+\def // -> abc (just ???)
\abc+123 // -> abc
\abc++123 // -> abc123
(\++123).class // -> String

to see if something like that might have justified the limitations besides not allowing spaces in \-led symbols (which is the only really obvious one).

I suppose the last one (that \++123 is a String) might have justified adding support for \123 (that symbol4 rule), so that you don’t have to write something too awkward to put a number into a Symbol… since numbers can’t have spaces anyway.

But the original limitation on symbol2, which basically restricts them to identifiers is probably in part justified by the desire to be able to write “composition” (appending) expressions that don’t change their meaning with the spacing, e.g. if \++1 were parsed as '++1', then inserting a space as in \ ++1 would be a fairly obscure bug as the latter is just '1'.

Frankly allowing the empty Symbol to be written just as \ is a little debatable in itself as it looks rather confusing in some contexts…

I still haven’t found a compelling reason why \1a is disallowed, except maybe that it (if unquoted) is not a valid identifier in SC. So maybe the Symbols were originally intended to be just quoted identifiers. But then \_hmm is a valid Symbol but not a valid identifier (unquoted)… well ok it might be so as a primitive (call) name.

More obscurely, 1s is a valid number in SC

1s // -> 1.1

but it can’t be escaped as a symbol with backslash. Arguably, it is a floating point number though and . can’t be part of \-led symbols either as that’s taken as method invocation on the symbol.

But then

3r21 // -> 7

is actually an integer…
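The arithmetic behind that result is just positional notation in base 3: 2·3 + 1 = 7. A small C++ sketch of evaluating such a `<base>r<digits>` literal follows. `radixLiteralValue` is a hypothetical name, and the accepted range (bases 2 to 36, uppercase letters for digits above 9) is an assumption of the sketch, not a statement about what sclang accepts.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: evaluates text of the form "<base>r<digits>".
// Returns -1 for anything it does not understand. Assumes bases 2..36 and
// uppercase letters for digits above 9; no overflow check on long inputs.
long radixLiteralValue(const std::string& text) {
    std::size_t r = text.find('r');
    if (r == std::string::npos || r == 0 || r + 1 >= text.size())
        return -1;

    long base = 0;
    for (std::size_t i = 0; i < r; ++i) {
        if (text[i] < '0' || text[i] > '9')
            return -1;
        base = base * 10 + (text[i] - '0');
        if (base > 36)
            return -1;
    }
    if (base < 2)
        return -1;

    long value = 0;
    for (std::size_t i = r + 1; i < text.size(); ++i) {
        char c = text[i];
        long digit;
        if (c >= '0' && c <= '9')
            digit = c - '0';
        else if (c >= 'A' && c <= 'Z')
            digit = c - 'A' + 10;
        else
            return -1;
        if (digit >= base)
            return -1;  // e.g. the digit 3 does not exist in base 3
        value = value * base + digit;
    }
    return value;
}

void checkRadixExamples() {
    assert(radixLiteralValue("3r21") == 7);   // 2*3 + 1
    assert(radixLiteralValue("3r23") == -1);  // invalid digit for base 3
}
```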

RFluff · June 15, 2020, 11:01pm

Now for something truly obscure with the current lexer…

\123do: _.postln // 123

\foo1do: _.postln // syntax error

The first one works because once the lexer takes symbol4 branch due to the leading digit, it will terminate the Symbol upon encountering the first letter… which oddly can then be actually part of a “selector as infix operator” construct. (Stuff like 5div:2 can also be written without spaces.)

But I don’t really see practical uses for the first line… so changing the behavior of that program (into a syntax error, like that of the 2nd one) doesn’t seem much of a loss, even though it’s technically a “breaking change”.
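The split between the symbol and what follows it can be sketched in C++ like this. `splitBackslashSymbol` and its helpers are hypothetical names, and the sketch only models the character classes of the symbol2 and symbol4 branches, not the rest of the lexer.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical helpers, not functions from PyrLexer.cpp.
bool isIdentStart(char c) {
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '_';
}

bool isDecimalDigit(char c) { return c >= '0' && c <= '9'; }

// `rest` is the source text immediately after the backslash.
// Returns {symbol name, text left over for the following tokens}.
std::pair<std::string, std::string> splitBackslashSymbol(const std::string& rest) {
    std::size_t n = 0;
    if (!rest.empty() && isIdentStart(rest[0])) {
        // symbol2: keeps going through letters, digits and underscores
        n = 1;
        while (n < rest.size() && (isIdentStart(rest[n]) || isDecimalDigit(rest[n])))
            ++n;
    } else if (!rest.empty() && isDecimalDigit(rest[0])) {
        // symbol4: stops at the first non-digit, even if it is a letter
        n = 1;
        while (n < rest.size() && isDecimalDigit(rest[n]))
            ++n;
    }
    return std::make_pair(rest.substr(0, n), rest.substr(n));
}

void checkSplitExamples() {
    // \123do: leaves "do:" behind, which can then act as an infix selector.
    std::pair<std::string, std::string> a = splitBackslashSymbol("123do: _.postln");
    assert(a.first == "123" && a.second == "do: _.postln");

    // \foo1do: swallows "do" into the symbol and leaves a bare ":".
    std::pair<std::string, std::string> b = splitBackslashSymbol("foo1do: _.postln");
    assert(b.first == "foo1do" && b.second == ": _.postln");
}
```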

semiquaver · June 15, 2020, 11:28pm

I ran into this trying to make symbols for note names using \a- and \a# for example

As Rfluff points out someone might want to write var b = \k; var c = \cat++b without spaces and get “catk”…

but a simpler rule is always better: perhaps letters and numbers should be allowed, in any order. Alternatively the symbol could contain any characters up to white space, so \cat++b would give 'cat++b'. Either would be an improvement imo
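The two alternatives can be sketched side by side in C++. Both function names are hypothetical, neither rule exists in the current lexer, and keeping the underscore in the first rule is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: letter, digit or underscore.
bool isWordChar(char c) {
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
           (c >= '0' && c <= '9') || c == '_';
}

// Hypothetical helper: the usual ASCII whitespace characters.
bool isSpaceChar(char c) {
    return c == ' ' || c == '\t' || c == '\n' || c == '\r' || c == '\f' || c == '\v';
}

// Proposal (a): letters, digits and underscore in any order.
// `rest` is the source text immediately after the backslash.
std::string nameAlnumAnyOrder(const std::string& rest) {
    std::size_t n = 0;
    while (n < rest.size() && isWordChar(rest[n]))
        ++n;
    return rest.substr(0, n);
}

// Proposal (b): everything up to the next whitespace character.
std::string nameUpToWhitespace(const std::string& rest) {
    std::size_t n = 0;
    while (n < rest.size() && !isSpaceChar(rest[n]))
        ++n;
    return rest.substr(0, n);
}

void checkProposalExamples() {
    assert(nameAlnumAnyOrder("1a") == "1a");        // \1a becomes one symbol
    assert(nameAlnumAnyOrder("cat++b") == "cat");   // ++ still ends the symbol
    assert(nameUpToWhitespace("cat++b") == "cat++b");
    assert(nameUpToWhitespace("a- ") == "a-");
}
```

Rule (a) leaves \cat++b meaning what it means today, while rule (b) changes it, which is the trade-off between the two.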

jamshark70 · June 16, 2020, 12:34am

OK – missed that one. If I had to guess why, I suppose I’d say that my internal heuristics tend to favor questions with practical use cases behind them (which #2673 didn’t present, but which semiquaver did).

FWIW – I’m not sure how the emoji is used in the US currently, but where I live, it’s always ironic or sarcastic. Perhaps be aware that not everybody is going to read that as a friendly smile.

I do see what you’re after there, but unfortunately \a- is out because symbols respond to math operators by returning this. (That is likely to drive rfluff around the bend, but the reason for this is: prior to Rest(), rests were written in patterns by providing a symbol for any of the pitch-related keys: (degree: \r, ...). Because pitch-related keys go through a lot of math conversions, it was necessary to implement math for them as no-ops.)

# is not a valid character for a math operator AFAIK – but it’s probably a bit risky to start folding punctuation into otherwise alphanumeric symbols. For example, if we say that ‘:’ is not a valid character in a binary operator so it should be allowed in a \symbol, then we run into trouble here: e = (\a: 1, \b: 2);. This is more commonly written e = (a: 1, b: 2); but I’m certain somebody, somewhere, is using the leading \ on the keys, so we’d better not break it.

I could agree to that. Rfluff is correct that it would break one currently accepted syntax, but the chance that anyone is using the syntax is infinitesimally small.

I can’t quite go along with that. My gut feeling is that it would be risky to start messing around with parsing of binary operators.

There might be, but we will probably never know what it is.

I had a quick look at git blame for PyrLexer.cpp. Much of it, apart from reformatting, dates back to “initial revision.” That means it’s likely to come directly from SC2, whose syntax is almost identical to SC3. SC2 was never open-source, so there is no repository to check. So the only person who knows for sure is James McCartney. SC2 was 1996 IIRC, so it’s quite likely that after a quarter century, even he might not remember.

I suspect that, while JMc was making the thousands – hundreds of thousands – of decisions involved in designing a language, in some places, he needed to make a judgment call.

hjh

RFluff · June 16, 2020, 12:41am

Yeah that is a lot more problematic than merely allowing the first char in an alphanum \-symbol to be a digit.

The best I can think that could be done for that is to allow $ to escape chars inside a \ led symbol, i.e. so you could write \a$-. That will not mess with binary operators because $ is not a valid char for them, i.e. \a $- 1 is a syntax error already (at $ – “unexpected ASCII”).

But then 'a-' is also 4 chars-long and less confusing than \a$- which is of the same length to type…

Aside: chars don’t quite get converted to Symbols as often as they probably should

Pbinop('+', 1, 2).iter.next // ok
Pbinop($+, 1, 2).iter.next  // ERROR: Primitive '_ObjectPerform' failed. Wrong type.

Yeah that explains the difference between

\foo + \bar == \foo
// vs 
\foo ++ \bar == "foobar"

The first was alas needed so that, e.g. \r + \r + 1 == \r. (Frankly, having \r + \r == 'r r', as its string analogue does, would probably not break anything, but would surely look confusing when printing some events that added rests in the old format…)

Aside to that: a failing primitive op on + falls back to a commutation of arguments, which occasionally does have more questionable semantics

1 + \r // 'r'

[1, 2] + 3 // -> [ 4, 5 ]
[1, 2] + \r // -> 'r'

"foo" + \r // -> "foo r"
["foo", "bar"]  + \r // -> 'r'

"foo" + "r" // -> "foo r"
["foo", "bar"] + "r" // ERROR: Primitive '_ArrayAdd' failed. Wrong type.

VIRTUALDOG · June 16, 2020, 1:16am

noted! i like to use the literal : ) with no space but Discourse is very aggressive about replacing it.

semiquaver · June 16, 2020, 2:44am

Very interesting! I used to use symbols for rests but have switched to Rest() as instructed in the post window…

Now that symbols as rests are in the process of being deprecated maybe all of this should be rationalized.

It would be nice to be able to compose symbols \saw ++ \1 without .asSymbol.

To that end wondering do we really need both strings and symbols?

RFluff · June 16, 2020, 3:18am

\foo === \foo   // true
"foo" === "foo" // false

IdentityDictionary ["foo" -> 1, "foo" -> 2]
// -> IdentityDictionary[ (foo -> 2), (foo -> 1) ]

("foo": 1, "foo": 2) // -> ( "foo": 2, "foo": 1 )

Aside: # on an Array makes it immutable, but not actually mem-addr-unique:

(1: \one, 1: \another) // -> ( 1: another )
(#[1]: \one, #[1]: \another) // -> ( [ 1 ]: another, [ 1 ]: one )

Speaking of confusing syntax, allowing seemingly “empty” Chars is perhaps even more confusing than the empty Symbol

$ ++ 1 // " 1" (a String)
$++1 //  Message '+' not understood. RECEIVER: +

$ .ascii // -> 32

It even works for newlines

(
$
.ascii
) // -> 10

VIRTUALDOG · July 5, 2020, 6:39am

It would be nice to be able to compose symbols \saw ++ \1 without .asSymbol.

I’m of the opinion that ++ shouldn’t be defined on symbols at all, and anyplace where you would want to use it would be a code smell, because it can lead to polluting the symbol table with unused intermediate symbols. Having ++ on symbol return a string is the next best compromise, in my view. At the very least it still disincentivizes this kind of operation.

To that end wondering do we really need both strings and symbols?

Well, it’s a little late to be making that decision. ^^ But yes, the distinction is important, it makes some operations in SC faster and there are many data structures and algorithms in SC’s core library that rely on identity comparison. I wouldn’t definitively say the language needs both, but it’s a reasonable design and would be quite laborious to change.
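As a rough illustration of why identity comparison is cheap for symbols, here is a miniature interning table in C++. `SymbolTable` is a hypothetical class for illustration and not SuperCollider's actual implementation. The assumption is simply that every distinct name is stored once and handed out by address.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>

// Hypothetical miniature symbol table, not SuperCollider's implementation.
class SymbolTable {
public:
    // Returns the address of the single stored copy of `name`,
    // adding it to the table on first use.
    const std::string* intern(const std::string& name) {
        return &*names_.insert(name).first;
    }

    // Number of distinct names seen so far; nothing is ever removed.
    std::size_t size() const { return names_.size(); }

private:
    // Element addresses in an unordered_set stay valid when it rehashes.
    std::unordered_set<std::string> names_;
};

void checkInterningExamples() {
    SymbolTable table;

    // Same name, same address: comparing two symbols is one pointer compare.
    assert(table.intern("foo") == table.intern("foo"));

    // Two ordinary strings with equal contents are still distinct objects.
    std::string s1 = "foo";
    std::string s2 = "foo";
    assert(s1 == s2 && &s1 != &s2);

    // Composing names leaves every intermediate in the table.
    table.intern("saw");
    table.intern("1");
    table.intern("saw1");
    assert(table.size() == 4);
}
```

Because nothing is removed in this sketch, each composed name stays in the table for good, which is the kind of pollution the reply above is worried about.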

Speaking of confusing syntax, allowing seemingly “empty” Chars is perhaps even more confusing than the empty Symbol

$ forms a char literal with the next character that follows it. Once you know that rule, the confusion goes away. A newline char literal is also more idiomatically written as $\n.

In my experience an empty Symbol is used very rarely, and so it also doesn’t cause much confusion.