Code Formatter Development for SuperCollider

@scztt

It looks like traversing the tree top-down may ultimately be the right way to go for the indentation - you can modify the tree and the text at the same time by doing something like this:

nodes_to_investigate = [
    "arithmetic_series",
    "code_block",
    "function_block",
    "method_name",
    "variable_definition",
]

# Depth-first walk with a TreeCursor, editing the source text and the tree
# in lockstep whenever the cursor lands on a node type we care about.
reached_root = False
while not reached_root:

    # print(cursor.node.start_point)

    # Check Cursor
    if cursor.node.type in nodes_to_investigate:
        node = cursor.node
        print(node)
        text_start = Helpers.get_start_of_node(node, newline_offsets)
        text_end = Helpers.get_end_of_node(node, newline_offsets)
        node_start = node.start_point
        old_node_end = (node.end_point[0], node.end_point[1] + 1)
        new_node_end = (node.end_point[0], node.end_point[1] + 2)
        # node_end = (node.end_point[0] + 1, 0)

        # Insert a tab into the text, then describe the same edit to the tree
        data = data[:text_start] + "\t" + data[text_start:]
        # newline_offsets = Helpers.get_all_newline_offsets(data)
        tree.edit(
            start_byte=text_start,
            old_end_byte=text_start + 1,
            new_end_byte=text_start + 2,
            start_point=node_start,
            old_end_point=old_node_end,
            new_end_point=new_node_end,
        )

    # Rebuild Tree If Anything Changed

    # Descend first, then try siblings, then retrace up to the next branch
    if cursor.goto_first_child():
        continue

    if cursor.goto_next_sibling():
        continue

    retracing = True
    while retracing:
        if not cursor.goto_parent():
            retracing = False
            reached_root = True

        if cursor.goto_next_sibling():
            retracing = False
which updates both at the same time. The "get_all_newline_offsets" helper is inefficient because it re-scans the whole string rather than just the region that could have changed, so I need to write a method that only checks for newlines within the potentially modified span.
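Something like this might work for that - just a sketch, assuming the offsets are plain string indices (i.e. effectively ASCII input), and the names are placeholders rather than the real helper:

import bisect

def update_newline_offsets(newline_offsets, edit_start, old_len, new_text):
    # Incrementally update a sorted list of '\n' offsets after replacing
    # old_len characters at edit_start with new_text, instead of re-scanning
    # the whole string.
    delta = len(new_text) - old_len
    lo = bisect.bisect_left(newline_offsets, edit_start)
    hi = bisect.bisect_left(newline_offsets, edit_start + old_len)
    # Newlines inside the replaced span come from the new text; everything
    # after it just shifts by the change in length.
    inserted = [edit_start + i for i, ch in enumerate(new_text) if ch == "\n"]
    shifted = [off + delta for off in newline_offsets[hi:]]
    return newline_offsets[:lo] + inserted + shifted

# e.g. after inserting a single tab at text_start:
# newline_offsets = update_newline_offsets(newline_offsets, text_start, 0, "\t")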

But yeah, it looks like I'm almost there with getting indentation in one pass with a DFS of the tree, but there are still some things to work out. This is tricky!

3 Likes

I'm currently making it so that short functions (and short Pdefs too, when I get to them) are collapsed onto one line.

so something like:


y = { |a, b, c| var d; d = a * b; c + d };

instead of

y = { |a, b, c| 
  var d; 
  d = a * b; 
  c + d 
};

Not sure if that makes sense - what do you guys think? And if collapsing a function like this is too much, where do you think the cutoff for "short" should be?
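For the cutoff, one possible heuristic would be something like the sketch below - purely illustrative, with an arbitrary placeholder threshold, and assuming tree-sitter node attributes (start_byte/end_byte) plus ASCII input so byte offsets line up with string indices:

MAX_COLLAPSED_LENGTH = 60  # placeholder cutoff, in characters

def is_short_function(node, data):
    # A function_block counts as "short" if, with its whitespace collapsed
    # to single spaces, it fits under MAX_COLLAPSED_LENGTH characters.
    text = " ".join(data[node.start_byte:node.end_byte].split())
    return len(text) <= MAX_COLLAPSED_LENGTH

With a 60-character cutoff, the collapsed example above would stay on one line, while a longer function with many statements would get expanded.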
I should have a proof of concept for function indenting by the end of this morning.
I needed to break it into two parts - overloading a function to do both line splitting and indentation was too much.

Let's goooooo

Change this:

y = { |a, b, c| var d; q; d = a * b; d=a*b*d; "foo".postln; a = d*b; c = d*a; d = a &b|c ; c + d; };


y = { |a, b, c| var d; q; d = a * b;




c + d; };


y = { |a, b, c| var d, q; var x;  q; d = a * b;




c + d; };

y = { |a, b, c| var d;
    d = a * b;
        d = a * b * d;
      a = d*b; c = d*a; d = a &b|c ; c + d; };

to this:

y = { |a, b, c|
  var d;
  q;
  d = a * b;
  d = a * b * d;
  "foo".postln;
  a = d * b;
  c = d * a;
  d = a & b | c;
  c + d
};

y = { |a, b, c| var d; q; d = a * b; c + d };

y = { |a, b, c|
  var d, q;
  var x;
  q;
  d = a * b;

  c + d
};

y = { |a, b, c|
  var d;
  d = a * b;
  d = a * b * d;
  a = d * b;
  c = d * a;
  d = a & b | c;
  c + d
};

All of these manage to do a pass over the data without disturbing the tree, too, so you don't end up doing gymnastics to keep the tree and the text in sync.
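One way to do this kind of tree-preserving pass - a rough sketch rather than the actual formatter code - is to collect the text edits during the walk and apply them back-to-front, so the byte offsets taken from the untouched tree stay valid:

def apply_edits(data, edits):
    # edits is a list of (start, end, replacement) tuples, with offsets taken
    # from the unmodified tree. Applying them in reverse offset order means
    # earlier offsets are never invalidated by later insertions.
    # Assumes the edit ranges do not overlap.
    for start, end, replacement in sorted(edits, key=lambda e: e[0], reverse=True):
        data = data[:start] + replacement + data[end:]
    return data

# e.g. edits collected during the DFS:
# edits = [(node.start_byte, node.start_byte, "\t"), ...]
# data = apply_edits(data, edits)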

@madskjeldgaard

As a test, I started going through the examples shipped with the SuperCollider codebase to see which ones would pass language parsing (without any formatting) and which ones would fail. Ideally, I would like to reformat the example code so that it's uniform and presents well in the repo. However, it looks like a lot of them do not pass the initial tree-sitter parse.

FWIW, I don't know if this is even worth doing, since some of the code is very old, but in case you expected these to parse, I wanted to bring it to your attention.

EDIT: Reversed pass/fail in my initial post.

FAIL examples/other/Exploring_SCLang.scd
FAIL examples/other/keepyuppy.scd
FAIL examples/other/quines.scd
FAIL examples/other/KeyboardWindow.scd
FAIL examples/Bela/bela_test_cases.scd
PASS examples/Bela/bela_start_scsynth.scd
PASS examples/Bela/bela_example_digitalio.scd
PASS examples/Bela/bela_example_analogin_2.scd
PASS examples/Bela/bela_example_digitalout.scd
PASS examples/Bela/bela_example_analogout.scd
PASS examples/Bela/bela_example_digital.scd
PASS examples/Bela/bela_example_analogin.scd
FAIL examples/pieces/hommage_a_duerer.scd
FAIL examples/pieces/Little_Man_From_Another_Place.scd
FAIL examples/pieces/deap_sea.scd
FAIL examples/pieces/DreamHouse.scd
FAIL examples/pieces/RM-octaver.scd
FAIL examples/pieces/acid_otophilia.scd
FAIL examples/pieces/microhelix_study.scd
FAIL examples/pieces/DrummerSynthDef.scd
FAIL examples/pieces/some_constellations.scd
FAIL examples/pieces/Trains.scd
FAIL examples/pieces/aftergoeyvaerts.scd
FAIL examples/pieces/Termite_College.scd
PASS examples/pieces/Poeme Symphonique.scd
PASS examples/pieces/Record Scratcher.scd
PASS examples/pieces/atheoryofharmony.scd
PASS examples/pieces/spacelab.scd
FAIL examples/pieces/autohausen.scd
PASS examples1.scd
PASS examples/x-y-plot.scd
FAIL examples2.scd
FAIL examples/TwoMultiSlidersInOne.scd
FAIL examples/Nick's LetterGimmick.scd
FAIL examples/strike.scd
FAIL examples/analog-drum-tuner.scd
PASS examples/rotary hommage duchamp.scd
PASS examples/ColorBrowser.scd
PASS examples/demonstrations/Atari2600.scd
FAIL examples/demonstrations/stealthissound.scd
PASS examples/demonstrations/TriggerBufferGranulator.scd
PASS examples/demonstrations/Theremin.scd
FAIL examples/demonstrations/DemandingStudies.scd
PASS examples/demonstrations/100 FM Synths.scd
PASS examples/demonstrations/env automation.scd
PASS examples/demonstrations/oh yes more fibs.scd
FAIL examples_2.scd
PASS examples/demonstrations/Chaotic Patterns.scd
PASS examples_1.scd
FAIL examples/demonstrations/snare909.scd
PASS examples/demonstrations/sc_oneliner.scd
FAIL examples/demonstrations/single_sample_feedback_02.scd
PASS examples/demonstrations/DrumSynths.scd
FAIL examples/demonstrations/HarmonicsVoice.scd
PASS examples/demonstrations/babbling brook.scd
PASS examples/demonstrations/Modal Space.scd
PASS examples/demonstrations/bit_reduction.scd
PASS examples/demonstrations/more graphs.scd
FAIL examples/demonstrations/fft.scd
PASS examples/demonstrations/GetTheTwits.scd
FAIL examples/demonstrations/single_sample_feedback.scd
PASS examples/demonstrations/SC_website_example.scd
FAIL examples/research_and_tools/ASA.scd
FAIL examples/research_and_tools/Diamond.scd
PASS examples/research_and_tools/ShepardTones SignalBuf.scd
FAIL examples/research_and_tools/SeminaireMusical.scd
FAIL examples/research_and_tools/trochoid curve.scd
FAIL examples/between_languages/fortran_supercollider.scd

Example output for a broken file:

murphy@emergentprocess ~/Dropbox/GitHub/sclang_format (main)
$ for i in $(find /home/murphy/Downloads/supercollider-Version-3.12.1/examples/ -type f -name *.scd | head -1 ); do python ./src/sclang_format.py -f $i -l /home/murphy/Dropbox/GitHub/tidalCompositions/supercollider_sketchpad/formatter/sclang.so ; done
FAIL /home/murphy/Downloads/supercollider-Version-3.12.1/examples/other/Exploring_SCLang.scd
Error: 10:0 - [Object]
Error: 12:0 - [Object]
Error: 14:0 - [Event]
Error: 16:0 - [SinOsc]
Error: 37:0 - [Class]
Error: 37:27 - [""]
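If anyone wants to run a similar check, it boils down to asking tree-sitter whether the parse tree contains error nodes. A minimal standalone version would look roughly like this (using the older py-tree-sitter API; the library path and language name here are assumptions - substitute whatever the grammar was built with):

import sys
from tree_sitter import Language, Parser

# assumptions: compiled grammar at ./sclang.so, registered as "supercollider"
SC_LANGUAGE = Language("./sclang.so", "supercollider")

def parses_cleanly(path):
    parser = Parser()
    parser.set_language(SC_LANGUAGE)
    with open(path, "rb") as f:
        tree = parser.parse(f.read())
    # has_error is true if an ERROR or MISSING node exists anywhere in the tree
    return not tree.root_node.has_error

if __name__ == "__main__":
    for path in sys.argv[1:]:
        print(("PASS" if parses_cleanly(path) else "FAIL"), path)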
1 Like

This is not surprising, I guess. Unfortunately I can't get TreeSitter to parse most of my .scd or .sc files. I've spent a little time adding support for the language structures that don't parse, but unfortunately I just haven't had much time to make progress on this.

1 Like

All good! I just wanted to make people aware in case it was surprising.

1 Like

For reference, in the SC codebase, these are the files that parse with tree-sitter and those that don't.

In total:

murphy@emergentprocess /tmp
$ cat _success_rate
    312 FAIL
    217 PASS

There's probably one or two language constructs that are gumming up the works.

It's not quite ready yet, but I've put a ton of work into Hadron's parser, including recently refactoring it so that the parse tree it generates is accessible from sclang (see hadron/HadronParseNode.sc at main · hadron-sclang/hadron · GitHub). Parsing is fast compared to most other compilation steps, so I haven't looked at incremental re-parse or anything fancy like that. But I'm curious - is there some way this parser could be useful to the formatting effort? I could pretty quickly create a Hadron binary that dumps the parse tree to JSON if that's helpful.

1 Like

How close is the Hadron parser to being able to parse the complete sclang syntax / classlib compared to TreeSitter? Seems like TreeSitter is at about 50%…

I'm not sure about the exact percentage, but I'd estimate well over 50%. Hadron's parser is written in Bison, just like sclang's, and I have over 75% of the Bison statements in sclang ported. If there were interest in this use case for Hadron, I could pretty quickly turn around the rest of the grammar to get to 100% support; this is a few days' work at most.

Oh, that's fantastic.
The goal of the project is to make something reasonably standalone, like black or clang-format - could the parser be run so that you don't need the server running, or is that a prereq?

Doesn't look like the server is involved in Hadron at all, and this seems like a great solution.

I wonder if Hadron's parser could also be used in lieu of Treesitter for us SCNvim folks, @davidgranstrom

1 Like

@lnihlen the code for the Hadron project is fantastic. Super duper clean - very nicely done.

Hey, thanks a lot for the kind words! I'm very glad to hear you think this might be useful.

I could create a binary that runs on the command line and prints a JSON dump to stdout of the parse tree of any input file. Would that work?

It may! I have to see what the output looks like and how quickly it could run - are there, perhaps, any external language bindings in the code?

In short, there are no "external bindings" for the code. In longer form:

Hadron is written in C++ and compiles into a static library, libhadron.a, so I suppose you could generate bindings from almost any language that can call against C++17 headers and a static lib. That's not a use case I'd yet seriously considered, but I haven't ruled it out either. I don't have time to prioritize that work right now, but if someone wanted to generate automatic SWIG bindings for the sclang data structures, I'd be happy to help guide and review PRs.

I also wrote preliminary support for an LSP interface but have recently gutted that. I feel @scztt's work on a sclang-based LSP implementation is more inclusive of others in the dev community who are more comfortable with sclang code than C++. Adding LSP support to provide a parse tree over JSON-RPC wouldn't be much work, but I take the comments above that the server isn't a helpful approach for folks here, which is fine. That code's not long for this earth, anyway.

I've been working for the past several weeks on a complete refactor of the compilation pipeline to use intermediate data structures accessible from the sclang side. I've already completed the parser. In a sense, sclang serves as an "external binding" for the Hadron parse tree! I hoped to follow in the LSP tool's footsteps by enabling the development of language tools written in sclang, including things like automatic code formatters.

I have a foreign function interface planned for Hadron, but that's a nontrivial effort and not likely to materialize soon. But the task here is to marshal some simple data structures between languages, and I think JSON is suitable for this purpose. I had plans for this anyway, to support another volunteer refactoring my vistool to sclang. The vistool consumes parser JSON (and other artifacts) to produce detailed graphs of parser data structures, which has been valuable for fixing the parser when it breaks.

In terms of speed, it's likely that reading the file from disk (or stdin) and writing the resulting JSON to disk (or stdout) will be slower than the actual parsing job. I have no plans at this time to support incremental re-parsing. Parsers are highly context-sensitive machines, and the sclang grammar is pretty complex. For example, there are some "vexing parses" in sclang; some of my favorite syntactically valid expressions are below:

1 - - 1 // the second hyphen is unary negation, and this evaluates to 2
12345x0 // the sclang Lexer accepts any numeric characters prefixing the "x"
        // for hexadecimal, this evaluates to 0
Foo {
  ** { } // an instance method named '**'
  * * { } // a class method named '*'
}
/* /* Nested comments are totally a thing in sclang! This was
 *  * a nightmare to lex with Ragel
 */ */

Good times! I can't imagine trying to modify the parser to re-parse only part of input sequences like this.

Anyway, I'm wrapping up some work on porting another part of the compiler to generate sclang data structures. I should finish that soon and start work on the JSON serializer binary for the parse tree, and I can report back here when done.

Awesome - very much looking forward to this. I was thinking about it last night, and it may make a lot of sense to use the bindings from the language, or to write something using Hadron directly, if there's a lot of code that needs to be formatted. Do you have a proof of concept of what the JSON looks like?

Unfortunately, Hadron is not ready for use as a general-purpose sclang interpreter. That's my goal, and I'm making steady progress towards it, but writing a JIT compiler is a big project, and right now I'm the only contributor to Hadron, so it's slow going.

If you wanted to start work on a sclang version of the code formatter (to eventually run it directly in Hadron), you could begin by extending the classes in HadronParseNode to deserialize from JSON. For now, you could run the formatter in normal sclang, with the expectation that once Hadron starts working you could skip the serialization/deserialization phase and use the objects provided by the Hadron parser directly.

Regardless of language choice, the JSON dump will be a verbatim serialization of the HadronParseNode classes. I haven't seen an overall standard documented for serializing sclang objects to JSON; if there were one, I would conform to it. There are some complexities in serialization (like object cycles) that we can gloss over here. For now, I'm envisioning that each object is a JSON dictionary with a required key _className, whose value is the class name of that object, followed by key/value pairs for each non-nil, externally readable member of the object, so something like:

{
  "_className": "HadronBlockNode",
  "token": { "_className": "HadronLexToken", "name": "openCurly", ... },
  "arguments": { "_className": "HadronArgListNode", ... },
  "variables": { "_className": "HadronVarListNode", ... },
  "body": { "_className": "HadronExprSeqNode", ... }
}

Because of ambiguities in the original sclang grammar, the parser will need to know in advance if you are expecting to parse a class file or an interpreter string. Each parse node descends from HadronParseNode and will contain a corresponding HadronLexToken. The lexer token contains a lot of what I think would be useful for the code formatter, including the position of the token in the input string. The parser will always return a single root-level object. If parsing a class file, the root-level object will be a HadronClassNode or a HadronClassExtensionNode, and if parsing an interpreter file it will always be a HadronBlockNode. If there are multiple classes or blocks present in the input, the next member of the root object will contain the next one, and so on.

I need to finish and merge this WIP PR, which is a considerable change to Hadron, including some changes in the ergonomics of the C++ wrappers around sclang data structures that will make finishing the parser serialization much more pleasant. Once that's in, I'll start on the serialization.

1 Like

I'm very glad you made the distinction between interpreted code and class code, because I was starting to bump into that and wondering what, if any, distinction there was between the two. That's a great bit of insight and I'm sure it'll clear things up. To shore up my understanding: it is not possible for interpreted code and class code to exist in the same file, correct? That is, if you have a file that you are going to format, you should expect it to be one or the other, never both?

Since your JSON output has all of the lexing data, including the character positions of the tokens, most of the logic I'm currently using will still apply. I generally rebuild the tree after every pass, but since the lexed values will never change, the only thing that will need to be rebuilt is the character positions, and that can be done without calling the parser again.
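Just to sketch what I mean, assuming the dump looks like your example above (nested dicts keyed on _className, each node carrying a token), and guessing at the position field name since I haven't seen the real HadronLexToken serialization yet:

import json

def walk_nodes(node):
    # Yield every parse-node dict (anything carrying a _className) in the dump.
    if isinstance(node, dict):
        if "_className" in node:
            yield node
        for value in node.values():
            yield from walk_nodes(value)
    elif isinstance(node, list):
        for item in node:
            yield from walk_nodes(item)

def token_positions(dump_path):
    # Collect (node class, token name, position) for every node's lex token.
    # "offset" is a guess at the position field name.
    with open(dump_path) as f:
        root = json.load(f)
    for node in walk_nodes(root):
        token = node.get("token")
        if isinstance(token, dict):
            yield node["_className"], token.get("name"), token.get("offset")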

This sounds super-duper promising. When you're done with the PR, I'd love to see how it handles the .scd files in the codebase and what percentage of lines are failing.

I also expect that the de-serializer you're using will only read data from disk and will not accept data from stdin, correct?