Would a reference parser generator be useful to SC hackers?

Hey,

tl;dr - I’m thinking of writing a reference grammar for sclang using the parser-generator tool ANTLR, which can generate parsers in a bunch of languages, including C++, JavaScript, and Python. We might also be able to add a module so it can generate parsers in SuperCollider too. Please tell me if you’d be interested in using it and/or helping to maintain/develop it.

Longer form:

I’ve seen at least a few discussions of ongoing projects or project proposals that might benefit from having a pre-made, well-supported parser:

  • Automatic sclang code formatters
  • Pre-processing frontends for the sclang interpreter
  • Development tooling (like LSP)

I’ve been thinking about developing some kind of generic, reusable parser for sclang for a while now. I was inspired by all the neat language analysis tools that grew up around clang, the LLVM C-languages frontend, precisely because it provided a reusable frontend parser for C++, a language that is notoriously difficult to parse.

It seemed most obvious to start by re-using the parser inside of sclang. I had written (and then abandoned) a SuperCollider PR long ago to expose the parse tree built by the sclang interpreter during compilation. The problem with this approach is that sclang, in the interest of compilation speed, transforms its parse tree somewhat while building it. So the result of parsing in sclang is a tree that is ready for the next stage of compilation, but one that no longer exactly represents the input language, which makes the use cases detailed above less obvious and less easy to support.

For example, the sclang parser expands syntax shortcuts into their underlying meaning (think of the performList syntax shortcuts or generator expressions), and it also does a first pass of dead code elimination (creating all those PyrDropNode objects), etc.
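
To make the lowering concrete, here’s a rough source-level paraphrase of the kind of transformation I mean (the real work happens on the parse nodes rather than on source text, so treat this as a sketch, not a literal dump of the tree):

    // performList shortcut: what you write...
    Point.new(*[3, 4]);
    // ...is represented after parsing as the equivalent of:
    Point.performList(\new, [3, 4]);

    // ...and in an expression sequence, the results of all but the last
    // expression are discarded, which is where the PyrDropNode objects come in:
    ( 1 + 1; 2 + 2; 3 + 3 );  // only the value of the last expression survives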

My next attempt was to try to prop Hadron’s parser up as a possible “official reference parser” for the community, but as I continue my development work on Hadron, a few problems with this approach have come up for me:

  • I think an official parser is useful enough to the community that I don’t necessarily want to couple it to the fate of a project as experimental and uncertain as Hadron.
  • Hadron is written in C++ so any consumer of the parse tree data has to be able to interop with C++ (or an external binary), lowering its usefulness.
  • I’d like to remove some of the design constraints on Hadron’s parser and follow in sclang’s footsteps of lowering the parse tree during construction.
  • During the bootstrapping phase of Hadron compilation, I need a reference parser. Compiler writing is full of fun chicken-or-egg paradoxes like this.

I’ve looked through several different parser generators, and I think ANTLR looks the most promising. It supports the languages I’ve seen discussed most often here, and I think we could also coerce it into generating an SC parser. That’s right: we could distribute a Quark that contains an ANTLR-generated sclang parser which takes an input string and produces a parse tree of SCLang objects.
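
For a rough sense of what that could look like from the user’s side, something like the following, where every class and method name is a hypothetical placeholder (nothing here exists yet):

    // Hypothetical sketch only: SCLangParser, parse, and dumpTree are
    // invented names for a Quark-provided, ANTLR-generated parser.
    (
    var parser, tree;
    parser = SCLangParser.new;
    tree = parser.parse("x = [1, 2, 3].collect { |i| i * 2 };");
    tree.dumpTree;  // walk or inspect the resulting tree of sclang objects
    )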

I haven’t started work yet on this, but am contemplating starting soon. I thought I’d ask for feedback and gauge interest first. So what do y’all think?

I’d be excited for such a thing, because as an SC educator I’d love to have an sclang static code analyzer.

When I was a beginner programmer working in JavaScript, there was a web app called JSLint where I would paste in my JS code and it would report everything from problematic language features to whitespace issues. Getting my code to pass the linter taught me a lot about being attentive to code style.

As many people learn SC as their first programming language, an “SCLint” would provide valuable automatic feedback, training beginners to adhere to proper whitespace conventions and warning about problems like use of the single-letter a-z interpreter globals, confusing reliance on order of operations (a + b * c), etc.
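
To spell out the operator example: sclang evaluates binary operators strictly left to right, with no arithmetic precedence, which is exactly the kind of thing a linter could flag when a mixed expression isn’t parenthesized:

    2 + 3 * 4;   // -> 20, i.e. (2 + 3) * 4, not 14
    2 + (3 * 4); // -> 14; parentheses are needed to get arithmetic precedence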

AFAIK a principal reason for the preprocessor is to translate non-sclang syntax into sclang syntax. If I’m understanding you correctly, a sclang parser would choke on that non-sclang syntax. I might be missing something here, though.

What would certainly help preprocessor usage is a set of functions to reliably scan through sclang syntax elements (e.g. “get the next string literal” while handling escaped characters, etc).
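
Something along these lines, just as a sketch of the kind of helper I mean (the name and interface here are made up for illustration):

    // Hypothetical helper: given a source string and the index of an opening
    // double quote, return the index just past the matching closing quote,
    // honoring backslash escapes along the way.
    ~nextStringLiteralEnd = { |source, startIndex|
        var i = startIndex + 1;  // assume source[startIndex] == $"
        var done = false;
        while { done.not and: { i < source.size } } {
            if (source[i] == $\\) {
                i = i + 2;  // skip over the escaped character
            } {
                if (source[i] == $") { done = true };
                i = i + 1;
            };
        };
        i
    };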

hjh

You’re right, this parser is designed to parse SCLang program inputs, and so it would reject non-sclang inputs as invalid. The preprocessor use case I was referring to was more along the lines of static analysis/automatic program translation: taking in valid sclang and translating it into some other valid sclang.

That kind of translation is a task easily accomplished by traversing the tree produced by the parser.
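
For instance, something like this (a purely hypothetical sketch; the node class and accessor names are invented, since nothing is implemented yet):

    // Hypothetical tree rewrite: walk the parse tree and, say, swap every
    // .play message for .plot. SCParseCall, selector, and children are
    // placeholders for whatever the real parser classes end up being called.
    ~rewrite = { |node|
        if (node.isKindOf(SCParseCall) and: { node.selector == \play }) {
            node.selector = \plot;
        };
        node.children.do { |child| ~rewrite.(child) };
        node
    };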

Sure, but not quite where I was headed with that, because: “One thing I tried to do when writing preprocessor code for Bacalao was to try to choose syntax that would not be valid SC code otherwise (e.g. deg"0 2 [8 6] 7"), because I want to be able to mix normal code with my ‘custom’ functionality” (totalgee, here).

Understood that you’re talking about something else, but I’ve spent rather a lot of time on mixed-dialect cases, and it gets a bit headachy right in this area: I want to “just get the next xyz element starting here,” but there are a bunch of exceptional cases that have to be re-handled every time. It would be great if a parsing backend could handle some of those while returning control to the caller after finishing a given element.

hjh

I’m happy to chat with @totalgee about their specific use case if they’d like, but at first glance, this doesn’t seem like a good fit for a sclang parser. It looks like they chose to go with a series of regular expressions, which is probably what I’d do too.

One part of this project that may be in scope is adding some extensions to ANTLR so that it can generate parsers in the SuperCollider language. That is, I will write the sclang grammar in ANTLR’s input format, from which ANTLR can then generate parser code in C++, Python, etc.; however, ANTLR currently doesn’t support SuperCollider as an output language.

I haven’t scoped how much work it would be to add SC output capabilities to ANTLR. If we build that, ANTLR can translate any arbitrary grammar into a SuperCollider class that can parse it. The most obvious use case is generating a SuperCollider parser in SuperCollider. Additionally, the sclang output capability might be useful for folks who translate non-sclang preprocessor input into sclang, as they could provide their own grammar to ANTLR.
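
To illustrate, the workflow for a dialect author might eventually look something like this (entirely hypothetical; the class and method names are placeholders for whatever ANTLR would actually generate from a user-supplied grammar):

    // Hypothetical: a parser class emitted as sclang code by ANTLR from a
    // user-defined grammar for a small pattern mini-language.
    (
    var parser, tree;
    parser = MyDialectParser.new;        // placeholder for the generated class
    tree = parser.parse("0 2 [8 6] 7");  // parse the dialect, not sclang
    tree.dumpTree;                       // then translate the tree into sclang
    )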

There are several example grammars included with ANTLR; writing SuperCollider output support into ANTLR means we get parsers for all of those essentially “for free.”

I’ll refrain from much further comment because you have a clear idea what you want to do, and it wouldn’t help me at all, so it’s probably best if I don’t muddy the waters further.

What I’m suggesting is that it might be useful to have a user-extensible SC parser.

(Speaking only for myself, I don’t favor regular expressions as a parser, because context awareness gets very tricky. I do use them to distinguish between types of expressions, but in my dialect the real work is done by crunching through the input pretty much character by character.)

hjh

Fine, sorry to hear this doesn’t help you.

I am proposing that we build a “user-extensible SC parser.” Maybe there’s some confusion here around what “user-extensibility” means, as that includes a diversity of use cases, some of which may or may not be a good fit for a sclang parser.

Where I’m starting from is a proposal to build a parser that can parse valid sclang input. It should accept any input that the sclang interpreter would accept. The “user extensibility” comes from the user acting on the concrete syntax tree provided by the parser.

It sounds like you want a parser that accepts a mix of sclang and non-sclang input. Just like the interpreter would not accept this, neither will the sclang parser I’m proposing. You could fork the ANTLR sclang grammar and then modify it to suit your needs, in which case the ANTLR sclang output project may be useful.

What I’m driving at is, admittedly, orthogonal to your idea, and of course you have no obligation to do anything that other people suggest. Buuuut… we have a preprocessor, which was added with the intention of supporting user-defined dialects. But then we, as a development community, do things like restrict the IDE so that it will evaluate code only from scd files, and assume that every scd file contains only sclang syntax, and make it impossible to disable colorizing in any document window where code execution is possible. So a user-definable dialect that conflicts with IDE colorizing rules is awkward to use in the IDE, and basically requires a custom editor. But… why? Why not make the behavior at least less restrictive?

As I see it, a parsing tool could be taken as an opportunity to deepen preprocessor support. Now, there are two sides to this position. One is, your parser is your project, and the scope is fully your decision. The thread is asking for input, and in response to that, I’m suggesting something that isn’t part of your usage but is a big part of mine, which I think could be a good conversation. You’re completely free to say, “eh, not the first few versions,” or even, “not even if pigs fly,” and I won’t be offended. The other is… I do wonder when the preprocessor will be taken seriously, and when devs might more broadly take it into account when devising related features.

That’s very likely the best starting point. It would be nice if there could be some SC methods to add parser elements dynamically, but that might be totally incompatible with ANTLR; and even if it is possible, maybe it’s not a good candidate for a first/second/third release.

hjh

I would find this very useful. My project involves scanning SC code + a little bit of custom syntax (basically annotations) and generating “augmented” SC code based on the scanned elements.

I haven’t researched ANTLR’s capabilities for modifying a generated parser at runtime. I welcome feedback on ANTLR’s usability from this perspective, up to and including consideration of another parser generator if one turns out to be more suitable. Please let me know what you find out.

There’s no existing parser generator that I know of that supports SuperCollider as an output language. Given that, I chose ANTLR because:

  • It is largely target-language agnostic, so I don’t have to write a separate grammar for each output language
  • It is broadly used (so likely to be supported/maintained for some time)
  • It generates parsers in languages that are generally of interest to SC users and also meet my use case, namely C++, JavaScript, and Python
  • It’s open-source and extensible, so I feel there’s a path towards adding sclang output

But there are some things I dislike:

  • (major concern) The existing online documentation is very thin, with the good stuff in a USD$30 book (for the online version), so this isn’t a particularly inclusive tool choice
  • (minor concern) It’s written in Java, which I’m less experienced developing in, so there’s a small learning curve around tooling here for me

So it’s not a “done deal” for me, hence my request for feedback. Thanks!

Welcome, @edrd, and thanks for the feedback!

I feel like this discussion has gone a bit off the rails. By “this” do you mean a sclang parser or the parser generator proposed above?

Hey @lnihlen, thanks!

I think either would work for me, even if I need to fork the parser to add my custom syntax.

By the way, if you go with ANTLR, I have reasonable experience with Java and could help in this area.

Cool, good to know, thanks. It looks like ANTLR language support requires one to implement a runtime component too, so there will be a chance to contribute to both the Java and SCLang side of an ANTLR-based sclang parser generator.

As the projects I’ve been working on in SC have become more complex over time, I’ve started to feel the “tooling gap” between sclang and other, more mainstream languages like Python and JavaScript more and more, particularly the lack of a debugger and of static analysis tools for linting and formatting. So I think having those would also be useful for more experienced users, not just as learning aids for beginners. I don’t have any experience with static analysis tool development, but at least intuitively, I can definitely see how having a reference grammar in a parser generator as popular and well supported as ANTLR could make their development much more streamlined.

As far as adding sclang output to ANTLR - if I’m understanding correctly, I see a lot of potential for this too, especially for people writing EDSLs such as livecoding dialects or small languages for pattern generation or expressing graph relations. This includes some of my favorite sclang projects from the community - Steno, Bacalao, ddwChucklib, just to name a few. I’m reminded of my experience of working through some of SICP, a famous computer science textbook that teaches Lisp programming. In one of the lectures, Gerald Sussman talks about a general approach to solving problems using Lisp, which basically boils down to writing a small DSL first and then using it to solve the problem. This prefigured the more recent notion of Language-oriented programming. I found this very illuminating, but not every language is equally suited to this approach, and few match the ease with which Lisp’s macros and language structures let you build parser logic. I’ve messed around with writing my own lexers/parsers in sclang before, and I feel like I have more ideas for very small but still somewhat expressive DSLs for certain tasks than I have energy or motivation for writing yet another parser out of a heap of hard-to-debug/refactor regexes. A parser generator would definitely make me experiment with them more.

Excuse the shameless self-promotion, but Hadron has debugger support planned and partially implemented (some DAP fundamentals and a debugging virtual machine). It’s been on the roadmap since project inception. If you’re interested in language hacking, whether you’re an expert or a total beginner, I could use some help!

Your post reminds me of Greenspun’s Tenth Rule, which these days I encounter in a more general form. Sometimes a system becomes complicated enough that implementing a DSEL is the most elegant approach.

I’m pleasantly surprised there’s so much interest in the parser generator! To evaluate ANTLR a bit further, I’ve started working on the sclang grammar. I took Hadron’s Bison grammar, removed all the actions, and started hammering it into ANTLR form. There’s a VSCode plugin for ANTLR, and so far my development experience has been pretty smooth (after I bought the book!). I expect the result will be more than adequate for my C++ needs, at which point I will share what I have and we can talk next steps.

I’ve checked in a first draft of the SCLang grammar to the project repository, which I’m calling sparkler. I’m now working on some CMake scripting to generate the C++ sources from the grammar and compile them into a static parser library. I’ll be performing some light validation of the parser as I go, but my needs for a bootstrap compiler for Hadron are relatively straightforward.

Currently soliciting contributions for:

  1. Additional grammar validation and bugfixes
  2. Automatic library generation in other languages (Python, JS?)
  3. sclang parser generator support

Cheers!

I’ve fixed up the parser to the point where it successfully parses every valid sclang file in the supercollider repository. The parser is now complete for my needs in Hadron, and I plan to switch gears for a bit to finish up the task that started me on the parser in the first place.

If folks were planning to start in on the sclang parser generator, or to start another project based on the parser, I expect the code will be stable for a while. I’ll ping this thread when I start up again, but if anyone wants the parser generator urgently, I’d be happy to help them get started on implementing it themselves.