Code Formatter Development for Supercollider

Oh that’s a great idea about the array - it shouldn’t be too bad to implement.
However, I don’t think align command should be inline, should be a comment in the line before the array/block, as otherwise the code would need to be moved around in the event that the line would be too long. I’ll try to set that up and see how it goes.

I think the array data being populated with data is something that’s unique to supercollider, but the other datastructures (i.e. events) are somethings that have analogs in other languages. The general feeling there is that all text is aligned left, and even though it’s awkward at first, I find that over time it kind of fades away and in very large codebases (i.e. over 20-50k sloc) having the alignment consistently left works better because even though the intra-block consistency may be a bit off, having the same alignment type in every block works well.

This is the precedent set by other formatters and I’m inclined to agree with it through my experience.

In my mind your code would be formatted as:

~event = (
    degree: [3, 5],              
    octave: 3,
    amp: 0.6,
    finish: { ~legato = rrand(0.5, 2) },
    callback: { "note played".postln }
);

And if the function block bled over the 80 character limit, then it would be formatted in K&R style. (Which leads me to - functions should be formatted in a single line if they are a single statement and the statement does not exceed the line limit.)

As a side note, doing these changes off the semantics of the code generated by treesitter is making these formatters super easy to write. There’s still some stuff I can pare down/make functions, but even the idea of removing the semicolon from a return statement boils down to a block like this:

 for match in captures:

            # Find the last statement - either a return statement or some
            # sort of expression. This is going to be the return statement.
            # The last statement is the statment that's neither a semicolon
            # nor a closing bracket.

            semicolon_found = False
            for child in reversed(match[0].children):
                if child.type == ";":
                    semicolon_found = True
                if child.type != "}" and child.type != ";":
                    return_statement = child
                    break

            start_line = return_statement.start_point[0]
            start_offset = return_statement.start_point[1]
            matched_char_loc = newline_offsets[start_line] + start_offset
            end_line = return_statement.end_point[0]
            end_offset = return_statement.end_point[1]
            end_matched_char_loc = newline_offsets[end_line] + end_offset
            return_statement_text = data[matched_char_loc:end_matched_char_loc]

            # Remove semicolon at end of statement to further distinguish return
            return_statement_text = re.sub(";", "", return_statement_text)

            # Splice Return Statement Into Data
            if not semicolon_found:
                data = (
                    data[:matched_char_loc]
                    + return_statement_text
                    + data[end_matched_char_loc:]
                )
            else:
                data = (
                    data[:matched_char_loc]
                    + return_statement_text
                    + data[end_matched_char_loc + 1 :]
                )

And I think that could even be made more efficient by removing that regex and just offsetting the splice. But, the point is, that there’s semantics, so if we can accurately describe what the formatting is, it should be easy to transliterate to code.

Yeah, this would be a more standard / naive formatting of this (which I definitely don’t have a problem with). For complex code, I will often manually format it to be something closer to what I posted (probably not EXACTLY that because it’s labor intensive) - I would of course rather write a formatter rule to do this for me.

Eesh, the expressiveness of sclang makes formatting so tough - there are plenty of reasonable cases where two statements should be left as a single line - for example, I know I had probably 100+ setter functions around my code written like:

SomeClass {
    lowpass_{ |value| lowpass = value; this.changed(\lowpass) }
    amp_{     |value| amp = value; this.changed(\amp) }
    pan_{     |value| pan = value; this.changed(\pan) }
}

It doesn’t feel great to expand each of these boilerplate one-liners to 4 lines?

Of course, we could nit-pick each of our own syntax idiosyncrasies forever and never come to an agreement (I’m definitely not interested in doing that!) - but it makes me think that there are a few tiers/categories of formatting:

  1. Left-hand only whitespace (e.g. indenting)
  2. Line-breaks based on column limits
  3. Intra-line spacing (e.g all whitespace that’s NOT indent whitespace)
  4. Semantic line-breaks (e.g. “multiple statements in a function should be split to separate lines”)
  5. Non-whitespace linting changes (e.g. “remove trailing ;”)

These seem to get progressively (a) harder to implement correctly, and (b) harder to find a consensus “right answer” that won’t annoy people / disrupt the expressiveness of sclang - where the “linting” style changes of #5 actually have the potential to break code (if there are bugs) or severely disrupt livecoding workflows - but also have the potential to radically clean up code / decrease the cognitive overhead of sclang’s over-expressiveness (where you can write anything 100 different ways).

It’s also important to note that, right now, the sclang formatter only does #1. In terms of workflow, MANY people will justifiably not want to enable any of 2-5 right away - muscle memory is muscle memory after all.

1 Like

yeah! I think the idea is that this ends up being a formatter that no one is 100% happy with, but is opinionated and consistent. I feel like formatters help identify smells and places where code can be cleaned up, and consistency is the biggest driver in making that happen. If there’s a hard-and-fast rule we can apply, we should do it, and we should limit the amount of configurability as much as possible.

I agree that sclang is extremely expressive, but this is both a blessing and a curse - I work with C++ for my job, and it’s similarly nimble, but jumping from codebase to codebase can feel like reading a different language entirely. This is great for personal work, but for larger projects, and within an entity, it can get exhausting trying to understand what’s going on. Once my team and I worked together to put .clang-format rules together and apply them globally to the codebase, things got easier to read, development cadence increased, and the annoying little details of code reviews for indentation/etc. faded away. (As an aside, this one of the benefits of left-aligning code - adding or removing a line doesn’t affect the surrounding lines and limits diffs.)

The problem with clang-format is that it’s not universal, and between business units, or even between repositories, the formatting can differ. One of the great benefits of python, especially now that Black is widely adopted, is that code universally looks the same. This means that when I look up a project on Github, I can scan it and pretty much grok what’s going on.

One of the things that I find difficult about native sclang (that is, not any of the wonderful DSLs built into Supercollider) is that when I look to find an answer or some example code on the web, it’s generally formatted in a wonky way, or it’s written in some sort of ideosyncratic form. I would love it if, like python, I could pull up a piece of sclang code and understand what was happening more or less immediately. Part of the goal of this project is to help move that along, and that leads into points #2-5.

I don’t think it’s necessary to require people to come over to the dark side right away, but having the option there (especially if it’s bundled up with the treesitter) will hopefully help normalize code over time. Programmers are a stubborn bunch, but given the ability to put some consistency into their code, they’re generally happy and will ultimately migrate. (Even if they do complain at points - that’s just the nature of us all being simultaneously cranky, creative, and opinionated.)

#######

To that end, I’d like to figure out how to get point #1 written out - indentation seems like it’s a bit complex, and I’m not sure exactly where to begin. I’ll think about it, write up some rules, and come back to this thread.

To me, these lines feel like they should be calling a function that takes in a value and field, all of them being consistent. If it were a code review, the repetition would jump out to me, and I’d likely ask to have those collapsed into a generic class-level getter.
Then the formatting could all be aligned left. :slight_smile:

The example from this post - Sound design tricks - #2 by Rainer - is a great candidate for figuring out what indentation/formatting should look like. (The treesitter chokes on it, too!)

To answer a previous question re. indentation - the cases I remember being tough when I tried to write “new” indentation logic:

Multi-line statements

a = b 
    + c
    + d;

b.method1
    .method2
    .method3;

And, how handle “last lines”:

a = {
    something.call();
    "called".postln;
};              // last line has only "})|;" tokens - dedent!

b = arr.send([1, 2, 3, 4
      5, 6, 7, 8]); // last line has other tokes, so don't dedent!

The first case there’s an obvious right answer in MOST cases (if the last line is only a ;, it seems like it follows the second rule, but probably some room for debate here). In the second case, it’s clear that the abstract RULE pretty much captures how sclang code is written, and there are valid cases for both - but I found that getting this right 100% of the time was very hard.

Part of the challenge was that I was writing with a limited tokenizer but not e.g. a properly parsed result from treesitter, so perhaps this will be easier now that we have a reasonable toolchain. I think when a context opens (meaning [({| or a new statement after a ;), you basically walk through every subsequent line that’s still within that statement and bump it’s indent by 1. When you’ve finished, you walk your lines but collapse indent jumps of >1 to 1 (you don’t want to double-indent after ([). The final line EITHER has a +1 indent, or has the indent where the outermost context started depending on whether it has tokens other than context closing tokens like })|; - but again, this one is tough to pin down.

If you can get a complex method chaining expression to work, this is probably going to cover most of the hard cases:

object.method(1)
    .method(2)
    .method([
        3, 4, [
            5
    ]]) // indent here is ambiguous ....
    .method({
          |foo=({ 
               |bar|
               bar * [
                   1, 2 
               ]
          })|
          foo.()
    })
;

Great! Thank you. This is some good stuff to work with!
I didn’t know you could put a function as a parameter to an argument list.
Using the rules of - arrays stay on the same line unless too long, the parameter list is on the same line as the function, multiple “.” connectors are one on each line indented, and one statement functions remain on the same line, I got this as the output:

object
  .method(1)
  .method(2)
  .method([3, 4, [5]])
  .method({|foo=({|bar| bar * [1, 2]})|
    foo.()
  });

what do you think of something like that ? I think it’s fairly readable, but the nested functions are always a pain in the neck.

I think this is largely correct as a final output, but I think removing from the original line breaks feels questionable to me in this case? I’m still not sure if there’s a net gain to removing human-added linebreaks in these kinds of cases in general - it will always reduce the line count, and specifically it will reduce the line count in places where the programmer explicitly added a line break…

There are basically two possible cases here…

  1. The programmer correctly added a line break to split apart a line that would otherwise have too much semantic content to be easily readable - to visually separate pieces of a complex expression.
  2. The programmer added a line break by accident, or because of a bad intuition or mistaken understanding about what the code is doing.

In case 2, the auto-formatted result is likely somewhat better, and MAY help the programmer better understand the code they’re writing, but could also just confuse.
In case 1, the auto-formatted result is almost always going to be worse.

I can see skewing toward “fixing” line breaks in cases where there’s likely to be a lot of case 2 (first-time SC user, MAYBE live coding), and “trusting” existing line breaks where there will be an abundance of case 1 (basically any experienced SC programmer). Probably this just means choosing the right default enablement of this rule?

A small thing - I believe there’s a common idiom to always have one space of padding inside of function braces, { something } rather than {something}. As I look around in the classlib, this seems to be followed almost everywhere?

That’s correct. I was manually editing that block, and I neglected to add that. Good catch - this is why we need a formatter! :slight_smile: (I do have that rule already written - it’s in the Code style guidelines · supercollider/supercollider Wiki · GitHub section about “Add Spaces with Curly Brackets”. )

I think with line breaks, part of the job of the formatter is to say “no, I know better” and put the line breaks where they belong semantically. Black does this, and it’s been in my experience the best way to do things - the exception it gives is for breaks between statements, in which case it allows for the user to put one empty line, and it puts two empty lines between classes or methods.

The work to make this function properly is on the developer of the code formatter, to put rules in place that are logical and consistent. Given your cases, I think this is especially important in places like live-coding, where you want to type sludge and then have the computer present it to you in a way that’s easier to read.

Newline management is in my todo-list this week, so this is a good thing to think about.

This looks better to me, though I’d imagine it’d be cleaned up if it were real code.

object
  .method(1)
  .method(2)
  .method([3, 4, [5]])
  .method({ |foo = ({ |bar| bar * [ 1, 2 ] })|
    foo.()
  });

Took a bit of doing, but added code to normalize the text into blocks, allowing two or more new lines to be separated into a single newline for display purposes.

Also collapses the above into one line and does some hacky shenanigans to get around the treesitter’s spacing issue in the argument list. (This results in the right side of the argument list not having a space - that’ll be taken care of in the ParameterListAlignment rule.)

Output:

Next on the list is fixing the argument list and then adding indentation for each logical piece.

2 Likes

Hey John.

I made an extremely simple little NeoVim plugin, mostly for myself to make it easier to try this out. Keep up the good work!

3 Likes

oh sweet! This is much better than the one I made.
Great stuff, thanks!

1 Like

The treesitter is wonderful, @madskjeldgaard. I feel like it bears repeating how much simpler it has made this project.

I’m getting to the point of analysis of what should be indented, and the tree basically gives me everything - looking at the tree, I’m seeing data like this:

(object.method(1).method(2).method([3, 4, [5]]).method(6).method({ |foo = 12| foo * 13 });)

((1..3) collect: { |x| x + 1 } bubble: 0)
((1..3) collect: { |x| x + 1 } bubble: 0)
{ |a, b, c| var d; d = a * b; c + d } - source_file
  (object.method(1).method(2).method([3, 4, [5]]).method(6).method({ |foo = 12| foo * 13 });) - code_block
    ( - (
    object.method(1).method(2).method([3, 4, [5]]).method(6).method({ |foo = 12| foo * 13 }) - function_call
      object - receiver
        object - variable
          object - local_var
            object - identifier
      .method(1) - method_call
        . - .
        method - method_name
        ( - (
        1 - parameter_call_list
          1 - argument_calls
            1 - unnamed_argument
              1 - literal
                1 - number
                  1 - integer
        ) - )
      .method(2) - method_call
        . - .
        method - method_name

Which gives us everything we need, and now just need to pare it down to make sure the newlines and tabs are put in the right place. Should hopefully have an indent proof of concept by next week.

3 Likes

Here’s a gist with a few parsed trees. Gonna review this today and see where it makes sense to indent and if there are any good rules to apply for when to split lines and when to keep them intact. (Like, I don’t know if it’s a fools errand to try to keep the pdefs readable/musical - it may be better to lean into using flucoma to render a pattern in western notation and do that quickly (similar to how you would render an envelope to view it), and focus on making the data easier to edit rather than easier to read.)

Another question, too, and this may be a bit more controversial - should we look to enforce using fractions when they evenly divide ? I don’t think we need to, but 1/4 , 1/8, 1/16 is inherently more kin to western notation than 0.25, 0.125, 0.0625.

If you do, please parenthesize them always.

SinOsc.ar * 3/8 will produce two BinaryOpUGens: a multiplication and a division.

SinOsc.ar * (3/8) will produce a single multiplication UGen.

The latter will incur less CPU cost in the server.

hjh

1 Like

Yeah, this is a semantic change, which should absolutely be avoided. In general, you should have the same compiled result (bytecode) before and after formatting - probably it would be good to build some comparison tests to validate this eventually.

2 Likes

Great - that’s something I didn’t even consider. To your point and @scztt s - is there a way to get the compiled bytecode to compare the formatted and unformatted results ?