Code Formatter Development for Supercollider

Oh that’s fantastic.
The goal of the project is to make something reasonably standalone, like Black or clang-format. Would the parser be able to run so that you don't need the server running, or is the server a prerequisite?

It doesn't look like the server is involved in Hadron at all, and this seems like a great solution.

I wonder if Hadron's parser could also be used in lieu of Tree-sitter for us SCNvim folks @davidgranstrom


@lnihlen the code for the Hadron project is fantastic. Super duper clean - very nicely done.

Hey, thanks a lot for the kind words! I’m very glad to hear you think this might be useful.

I could create a binary that runs on the command line and prints a JSON dump to stdout of the parse tree of any input file. Would that work?

It may! I'll have to see what the output looks like and how quickly it runs. Are there, perhaps, any external language bindings in the code?

In short, there are no “external bindings” for the code. In longer form:

Hadron is written in C++ and compiles into a static library, libhadron.a, so I suppose you could generate bindings for almost any language that can call against C++17 headers and a static lib. That's not a use case I'd yet seriously considered, but I haven't ruled it out either. I don't have time to prioritize that work right now, but if someone wanted to generate automatic SWIG bindings for sclang data structures, I'd be happy to help guide and review PRs.

I also wrote preliminary support for an LSP interface but have recently gutted that. I feel @scztt's work on a sclang-based LSP implementation is more inclusive of others in the dev community who are more comfortable with sclang code than C++. Adding LSP support to provide a parse tree over JSON-RPC wouldn't be much work, but I take the comments above to mean that a server isn't a helpful approach for folks here, which is fine. That code's not long for this world, anyway.

I’ve been working for the past several weeks on a complete refactor of the compilation pipeline to use intermediate data structures accessible from the sclang side. I’ve already completed the parser. In a sense, sclang serves as an “external binding” for the Hadron parse tree! I hoped to follow in the LSP tool’s footsteps by enabling the development of language tools written in sclang, including things like automatic code formatters.

I have a foreign function interface planned for Hadron, but that's a nontrivial effort and not likely to materialize soon. The task here, though, is to marshal some simple data structures between languages, and I think JSON is suitable for that purpose. I had plans for this anyway, to support another volunteer refactoring my vistool to sclang. The vistool consumes parser JSON (and other artifacts) to produce detailed graphs of parser data structures, which has been valuable for fixing the parser when it breaks.

In terms of speed, it's likely that reading the file from disk (or stdin) and writing the resulting JSON to disk (or stdout) will be slower than the actual parsing job. I have no plans at this time to support incremental re-parsing. Parsers are highly context-sensitive machines, and the sclang grammar is pretty complex. For example, there are some "vexing parses" in sclang; a few of my favorite syntactically valid expressions are below:

1 - - 1 // the second hyphen is unary negation, and this evaluates to 2
12345x0 // the sclang Lexer accepts any numeric characters prefixing the "x"
        // for hexadecimal, this evaluates to 0
Foo {
  ** { } // an instance method named '**'
  * * { } // a class method named '*'
}
/* /* Nested comments are totally a thing in sclang! This was
 *  * a nightmare to lex with Ragel
 */ */

Good times! I can’t imagine trying to modify the parser to re-parse only part of input sequences like this.

Anyway, I’m wrapping up some work on porting another part of the compiler to generate sclang data structures. I should finish that soon and start work on the JSON serializer binary for the parse tree, and I can report back here when done.

Awesome - very much looking forward to this. I was thinking about it last night, and it may make a lot of sense to use the bindings from the language, or to write something using Hadron directly if there's a lot of code that needs to be formatted. Do you have a proof of concept of what the JSON looks like?

Unfortunately, Hadron is not ready for use as a general-purpose sclang interpreter. That's my goal, and I'm making steady progress towards it, but writing a JIT compiler is a big project, and right now I'm the only contributor to Hadron, so it's slow going.

If you wanted to start work on a sclang version of the code formatter (with the goal of eventually running it directly in Hadron), you could begin by extending the classes in HadronParseNode to deserialize from JSON. For now, you could run the formatter using normal sclang, with the expectation that once Hadron starts working, you could skip the serialization/deserialization phase and use the objects provided by the Hadron parser directly.

Regardless of language choice, the JSON dump will be a verbatim serialization of the HadronParseNode classes. I haven't seen a documented standard for serializing sclang objects to JSON; if there were one, I would conform to it. There are some complexities in serialization (like object cycles) that we can gloss over here. For now, I'm envisioning that each object is a JSON dictionary with a required key _className, whose value is the class name of that object, followed by key/value pairs for each non-nil, externally readable member of the object, so something like:

{
  "_className": "HadronBlockNode",
  "token": { "_className": "HadronLexToken", "name": "openCurly", ... },
  "arguments": { "_className": "HadronArgListNode", ... },
  "variables": { "_className": "HadronVarListNode", ... },
  "body": { "_className": "HadronExprSeqNode", ... }
}

Because of ambiguities in the original sclang grammar, the parser will need to know in advance whether you are expecting to parse a class file or an interpreter string. Each parse node descends from HadronParseNode and will contain a corresponding HadronLexToken. The lexer token contains a lot of what I think would be useful for the code formatter, including the position of the token in the input string. The parser will always return a single root-level object. If parsing a class file, the root-level object will be a HadronClassNode or a HadronClassExtensionNode; if parsing an interpreter file, it will always be a HadronBlockNode. If there are multiple classes or blocks present in the input, the next member of the root object will point to the next one, and so on.
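To give a feel for consuming this output from another language, here's a minimal Python sketch that walks a parse-tree dictionary shaped like the example above and collects every _className. Note this is a sketch against an assumed schema: only the _className key is confirmed in this thread, and the exact member names may change.

```python
import json

def collect_class_names(node, found=None):
    """Recursively gather every _className value in a parse-tree dict."""
    if found is None:
        found = []
    if isinstance(node, dict):
        name = node.get("_className")
        if name is not None:
            found.append(name)
        for value in node.values():
            collect_class_names(value, found)
    elif isinstance(node, list):
        for item in node:
            collect_class_names(item, found)
    return found

# A tiny hand-written tree in the shape described above (hypothetical members).
sample = json.loads("""
{
  "_className": "HadronBlockNode",
  "token": {"_className": "HadronLexToken", "name": "openCurly"},
  "body": {"_className": "HadronExprSeqNode"}
}
""")
print(collect_class_names(sample))
# ['HadronBlockNode', 'HadronLexToken', 'HadronExprSeqNode']
```

The same walk would work from any language with a JSON parser, which is the appeal of a plain stdout dump over language-specific bindings.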

I need to finish and merge this WIP PR, which is a considerable change to Hadron, including some changes to the ergonomics of the C++ wrappers around sclang data structures that will make finishing the parser serialization much more pleasant. Once that's in, I'll start on the serialization.


I'm very glad you made the distinction between interpreted code and class code, because I was starting to bump into that and wondering what distinction, if any, there was between the two. That's a great bit of insight, and I'm sure it'll clear things up. To shore up my understanding: it is not possible for interpreted code and class code to exist in the same file, correct? That is, if you have a file that you are going to format, you should expect it to be one or the other, never both?

Since your JSON output has all of the data required for lexing, including the positions of the characters, most of the logic I'm currently using will still apply. I generally rebuild the tree after every pass, but since the lexed values will never change, the only thing that will need to be rebuilt is the character positions, and that can be done without calling the parser again.
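A toy sketch of that idea in Python: after a formatting pass edits the source, shift the stored offsets of later tokens instead of re-running the parser. Token here is a hypothetical stand-in for the position data on a HadronLexToken, not an actual Hadron class.

```python
from dataclasses import dataclass

@dataclass
class Token:
    text: str
    offset: int  # character position of the token in the source string

def shift_after(tokens, edit_offset, delta):
    """Shift every token at or past edit_offset by delta characters."""
    for tok in tokens:
        if tok.offset >= edit_offset:
            tok.offset += delta

# "a=b" lexes to three tokens; reformatting it to "a = b" inserts two spaces.
tokens = [Token("a", 0), Token("=", 1), Token("b", 2)]
shift_after(tokens, 1, 1)  # insert a space before '='
shift_after(tokens, 3, 1)  # insert a space after '=' (which now ends at offset 3)
print([(t.text, t.offset) for t in tokens])
# [('a', 0), ('=', 2), ('b', 4)]  -- matches "a = b"
```

A real formatter would batch edits and apply the shifts in one pass, but the principle is the same: token identities are stable, only positions move.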

This sounds super-duper promising. When you're done with the PR, I'd love to see how it handles the .scd files in the codebase and what percentage of lines fail.

I also expect that the deserializer you're using will only read data from disk and won't accept data from stdin, correct?

Yes, that’s correct. More accurately, I would say that the sclang interpreter is designed for two distinct use patterns: Ahead-of-Time (AOT) compilation of the class library, and Just-In-Time (JIT) compilation of interpreter code. So the sclang parser grammar, which I take as the authoritative grammar of the SuperCollider language, was never designed to mix class definitions and interpreter code in the same input string and would consider that input invalid.

I think with some minor changes it could be possible to mix these two use modes, but the more I work on the compiler the more I feel that the ability to define (or redefine) classes at runtime adds a lot of complexity to the compiler, requires the programmer to keep a lot of state in their mind while working, and may not add proportionate value to the language when weighed against that.

I’ve merged the PR and have started work on the parser JSON dump. I should have some statistics around parsing once I’ve ironed out some of the kinks around class library compilation.

I don't understand this question. I'm producing a PR right now that will produce a compiled C++ binary that runs on macOS (I can add Linux and/or Windows as needed). It will take a --sourceFile command line flag, will decide whether the input is a class file based on the presence of the ".sc" extension, and on lexing/parsing success will write a JSON stream containing the parse tree of the file to stdout; on failure, it may write a probably-not-very-helpful error message to stderr and exit with a nonzero code. Is that useful? Or is there some other flow of data here you'd need support for?


I have a proof-of-concept of the JSON dump of the parse tree working. I took an example input from above, slightly modified to make it compile in sclang:

(
var y = { |a, b, c| var d; q; d = a * b; d=a*b*d; "foo".postln; a = d*b; c = d*a; d = a &b|c ; c + d; };


y = { |a, b, c| var d; q; d = a * b;




c + d; };


y = { |a, b, c| var d, q; var x;  q; d = a * b;




c + d; };

y = { |a, b, c| var d;
    d = a * b;
        d = a * b * d;
      a = d*b; c = d*a; d = a &b|c ; c + d; };
)

Unfortunately, the resulting JSON dump exceeded the character limit for this post, so I've posted it as a gist here.

Each object is a dictionary with two reserved keys: _className, which is the name of the object's class, and _identityHash, which uniquely identifies that object. There are a few cycles in this object graph, only in the tail member of each parse node, which is a member you can ignore. Additional references to the same object produce a dictionary with a single key, _reference, whose value matches the _identityHash of the referenced object. Symbols are encoded as strings, floats and integers get their normal values, and nil is encoded as JSON null. I'm going to add some additional code for serialization of specialized container objects, notably Arrays and their RawArray cousins, but otherwise this is mostly working as intended, so I thought I'd share some sample output.
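One possible way to resolve that _reference/_identityHash scheme on the consuming side is a two-pass approach: index every dictionary by its _identityHash, then replace each {"_reference": hash} dictionary with the object it points to. Here's a Python sketch; the node shapes in the example are invented for illustration.

```python
def resolve_references(root):
    """Replace {"_reference": h} dicts with the node whose _identityHash is h."""
    index = {}

    def build_index(node):
        # Pass 1: record every dict that carries an _identityHash.
        if isinstance(node, dict):
            h = node.get("_identityHash")
            if h is not None:
                index[h] = node
            for value in node.values():
                build_index(value)
        elif isinstance(node, list):
            for item in node:
                build_index(item)

    def patch(node):
        # Pass 2: swap reference stubs for the indexed objects.
        if isinstance(node, dict):
            for key, value in node.items():
                if isinstance(value, dict) and set(value) == {"_reference"}:
                    node[key] = index[value["_reference"]]
                else:
                    patch(value)
        elif isinstance(node, list):
            for i, item in enumerate(node):
                if isinstance(item, dict) and set(item) == {"_reference"}:
                    node[i] = index[item["_reference"]]
                else:
                    patch(item)

    build_index(root)
    patch(root)
    return root

tree = {
    "_className": "HadronExprSeqNode", "_identityHash": 1,
    "expr": {"_className": "HadronNameNode", "_identityHash": 2},
    "alias": {"_reference": 2},
}
resolve_references(tree)
print(tree["alias"] is tree["expr"])  # True: object sharing is restored
```

Cycles through the tail member would also survive this scheme, since patching happens after the full index is built.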

Next up is a load-time optimization I've been meaning to add for a while, which should make the dump-diag binary that produced this JSON run faster. I should have statistics on successful parsing by then, and can add the code needed for the parser to handle close to 100% of the extant sclang code.


Tangentially, for Python/C++ interop I overheard some colleagues at work discussing CLIF, which generates Python wrappers around C++ objects for direct usage of a C++ library in Python. It uses LLVM as a dependency and seems really complex and powerful.


Wasn't checking the threads because of Memorial Day - got a lot to catch up on!
I'll take a deep dive later and come back to the thread. Looks like a lot of great things are happening.
I’ll take a deep dive later and come back to the thread. Looks like a lot of great things are happening.

Cool, no rush on my part. I gathered some statistics for parsing classes. I added a --doesItParse flag to dump-diag, which prints either YES or NO, followed by the filename, to stdout. Then, using these commands:

% find ../../third_party/supercollider -name '*.sc' -print0 | xargs -n1 -0 -I file ./dump-diag --doesItParse --sourceFile 'file' | wc -l
     469
% find ../../third_party/supercollider -name '*.sc' -print0 | xargs -n1 -0 -I file ./dump-diag --doesItParse --sourceFile 'file' | grep YES | wc -l
     406

So the parser currently parses 406/469 of the .sc files in the supercollider repository, or about 86%.

I don't have a statistic for .scd files yet; some file is causing a crash on parse. I think I'm going to merge this PR, then start another PR to get the parser to 100% of the .sc and .scd files within the supercollider repository.

I’ve merged the PR with the missing parser functionality. Hadron now parses every “valid” sclang file in the supercollider repository with one exception. By “valid,” I mean every input file that sclang also parses. In supercollider/testsuite/classlibrary/TestMethod.sc, Hadron returns a parse failure on a test input that used to crash sclang when parsed.

Hadron doesn’t parse some .scd files in the examples/ directory. The common problem is that they have multiple blocks designed to be run independently instead of running the file as a whole. I spot-checked several of them, but with no automated means of determining which ones sclang can parse, I didn’t want to spend the time going through each one.

I think I'm still missing some corner cases, but I'm generally satisfied that Hadron can parse "most" valid sclang input. I'm going to return to my previous project of bringing the rest of Hadron's compilation artifacts into sclang-accessible data structures, a lead-up to an interactive debugger for sclang.

If you encounter bugs, have questions, or have specific feature requests, please reach out!

Cheers


I’m working on a PR that does two things of interest to this thread:

a) Introduce a HadronDeserializer class that can consume the JSON generated by the dump-diag tool and convert it back to HadronParseNode objects in sclang. So, that class can serve as an example of deserializing the JSON in other languages, or you can use the code directly to work with the parse node objects in sclang if you wish.

b) Add code to convert a tree of HadronParseNode objects to a Graphviz DOT file, which allows you to visualize the parse trees. I’ve continually found this helpful when developing Hadron, so I am porting the current Python implementation to sclang now that I have access to the data structures there.
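For anyone curious what conversion (b) involves, here is a rough Python sketch of emitting Graphviz DOT from a nested parse-node dictionary. The dictionary shape is hypothetical, and the real implementation is in sclang; this just shows the general recursion.

```python
def to_dot(root):
    """Render a nested parse-node dict as a Graphviz DOT digraph string."""
    lines = ["digraph parseTree {"]
    counter = [0]  # mutable counter for assigning unique node ids

    def emit(node):
        node_id = counter[0]
        counter[0] += 1
        lines.append(f'  n{node_id} [label="{node["_className"]}"];')
        for key, child in node.items():
            if isinstance(child, dict):  # each dict-valued member is a child node
                child_id = emit(child)
                lines.append(f'  n{node_id} -> n{child_id} [label="{key}"];')
        return node_id

    emit(root)
    lines.append("}")
    return "\n".join(lines)

tree = {
    "_className": "HadronBlockNode",
    "body": {"_className": "HadronExprSeqNode"},
}
print(to_dot(tree))
```

Feeding the output to `dot -Tsvg` would produce a diagram like the one below.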

For example, here’s a lightly-edited version of the Integer method factors presented as a code block:

(
var factors = { |num|
		var array, prime;
		if(num <= 1) { ^[] }; // no prime factors exist below the first prime
		num = num.abs;
		// there are 6542 16 bit primes from 2 to 65521
		6542.do {|i|
			prime = i.nthPrime;
			while { (num mod: prime) == 0 }{
				array = array.add(prime);
				num = num div: prime;
				if (num == 1) {^array}
			};
			if (prime.squared > num) {
				array = array.add(num);
				^array
			};
		};
		// because Integer is 32 bit, and we have tested all 16 bit primes,
		// any remaining number must be a prime.
		array = array.add(num);
		^array
	};
factors(23).postln;
)

And the accompanying parse tree visualization:


Sorry for being out of the loop on this for a while. School and work murdered me. I’ll catch up on your work @lnihlen and see where this is at!

Excited - looks like there’s a lot of great progress.

Hey, welcome back!

I’ve been moving in a different direction with Hadron’s parser recently, and would no longer advise its use for general-purpose parsing. Instead, you might want to check out Sparkler, which is an ANTLR grammar for SuperCollider. I’ve got it generating C++ parsers, but it can also generate JavaScript, Python, Java, and other language parsers. We’re also looking for volunteers to write an sclang parser generator plugin for ANTLR so we can parse sclang in sclang. The ANTLR grammar is complete to the best of my knowledge, and can successfully parse everything I’ve tested it with, including all the sclang code in the supercollider repository.

Cheers


That's fantastic! I'll download it and see if I can't get it to work. Parsing the whole of the SC codebase is a great start. I'm also going to try the latest version of the Tree-sitter library to see if they've worked out some of the null-pointer exceptions we were hitting a few months ago.

For anyone checking this thread, I just posted a link to an auto-formatter based on Sparkler: https://scsynth.org/t/sclang-auto-indent-tool/7342