Why are strings immutable?

I’m sure that I’m missing something here… but why are strings immutable?

In other languages, you want immutable strings so you can intern them, making equal strings share a single instance (and hence a single hash).
This isn’t necessary in SuperCollider, because we already have explicitly interned strings: Symbol.
Because String isn’t interned, strings are always compared element by element ("asdf" == "asdf" is true), while identity comparison goes through their pointers ("asdf" === "asdf" is false).
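
To spell that out, a small sketch of the comparison behaviour:

"asdf" == "asdf";   // true: Strings compare element by element
"asdf" === "asdf";  // false: each literal is a distinct object
\asdf === \asdf;    // true: Symbols are interned, so identity is equality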

…so why can’t you change them?

f = {
    var a = "ABCDEFGHI";
    a.put(a.size.rand, "abcdefghi".choose);
};

If string literals were mutable, then executing this function would corrupt its own function definition.
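
A minimal sketch of the workaround: copying the literal gives each invocation a fresh, mutable string, leaving the literal in the function definition untouched.

f = {
    var a = "ABCDEFGHI".copy;  // copy: a fresh mutable String per call
    a.put(a.size.rand, "abcdefghi".choose);
    a
};
f.value;  // returns a mutated copy; the literal itself is unchanged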

hjh

To add on to James’ point: ONLY string literals are immutable - non-literal strings can be mutated; they’re just arrays.

(
var literal = "literal";
var runtime = literal.copy;

literal[0] = $L; // fails, immutable
runtime[0] = $L; // works
)

The slightly strange thing is - Strings are just Arrays, but Array literals ARE mutable. This makes sense for pragmatic reasons (mutating an array literal seems like a much more common thing to do), but it feels a little weird that String is a special case here.

Hm, seems like we missed that case when fixing the string literal issue. Replace the string with an array literal and you’d have the same problem.

hjh

are Array literals mutable?

#[1, 2, 3].put(1, 0) // ERROR: Primitive '_BasicPut' failed. Attempted write to immutable object.

Interesting:

(
[\a, \b, \c].put(0, \start); // okay
#[a, b, c].put(0, \start); // not okay, immutable
)

@scztt @jamshark70 Thanks both for your examples!

… but why are there language-side string literals? Or array literals?

The only benefit I can see for literals is in defining variables in functions: it means you don’t have to copy the object on each function invocation, but can store it once in the constants table. However, that seems like quite a small optimisation, given that symbols exist and all other objects are created anew.

f = { var a = "str"; a }
f.def.prototypeFrame == ["str"]
f.def.dumpByteCodes
BYTECODES: (2)
  0   30		 PushTempZeroVar 'a'
  1   F2       BlockReturn

As an aside, why do variables in array literals turn into symbols?

a = "asdf";
g = #[a];
g[0] == \a;

More than that. If you don’t have literals, then the way to build a collection is to add items into it one by one:

{ [1, 2, 3] }.def.dumpByteCodes;

That’s a bigger performance drain than copying.

We had a related case in the SCDoc code, where one function in a big switch couldn’t be inlined. The compiler rendered this as pushing something like a hundred items onto the stack (a few dozen symbol/value pairs) and using a special byte code for dispatch (IIRC). Inlining the function, and so avoiding the incremental population of the branches, made SCDoc rendering at least an order of magnitude faster. It really matters, especially in high-traffic areas.

Here, I’m not quite sure what you’re driving at. First, as noted, the optimization matters more than you’d think; second, even if it were a minor performance gain, it’s… not a bad thing, right?

They can’t be variables, because if there’s an expression in the array, then by definition it can’t be a literal. There is no way to have a literal array that evaluates a and substitutes the on-the-fly result.
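
A small sketch of that distinction - in a literal array, bare names parse as Symbols at compile time, while a dynamic array evaluates them:

(
a = "asdf";
#[a, b] == [\a, \b];      // true: the names became Symbols at compile time
[a, \b] == ["asdf", \b];  // true: the dynamic array evaluated a
)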

hjh

Ah, no, I get the point of them in the compiler. What I’m asking is why does this…

{ #[1, 2, 3] }.def.dumpByteCodes;
BYTECODES: (2)
  0   40       PushLiteral instance of Array (0x5590d8fc4178, size=3, set=2)
  1   F2       BlockReturn

… ‘push’ an immutable array onto the stack, rather than push a copy of it? Copying would mean there’d be no need for immutable ‘literal’ objects at all. Or is there some other reason this wouldn’t work that I can’t see?

I imagine this will probably be the answer, I’m just looking to see if there is another reason which might improve my knowledge of the interpreter!

Optimization aside, I also see a bit of a conceptual problem. Given something like:

a = "literal";

… If we want the right-hand side to be a new instance of a string every time this function is executed, then this has to compile into something that calls a String constructor. This gives you, at the AST level, something that might look like:

a = String.newFrom("literal")

But, then, what is the argument for newFrom? It has to be… a string literal - this data has to be stored SOMEWHERE. We don’t get to remove string literals from the compiler or runtime - in the end, we would only be able to hide them slightly from the user via syntactic sugar.

It’s worth noting that the ONLY thing string-literal immutability prevents is in-place replacement of single characters: anything that changes the length of a string, concatenates, etc., will generally construct a new string anyway. sclang semantics are often very functional-programming-derived (for better and for worse…) - generally you modify arrays by applying functions to them that return new arrays. In-place mutation, while possible, isn’t really a central paradigm of the language - you’d much more often use things like format, replace, split, and join to modify strings. Python, for example, ONLY has immutable strings, and even mutation-like functions still return a new string.
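
For instance (a small sketch - these methods all return new Strings rather than mutating the receiver):

(
var s = "literal";
var t = s.toUpper;        // returns a new String
s.postln;                 // -> literal (unchanged)
t.postln;                 // -> LITERAL
(s ++ " string").postln;  // ++ also builds a new String
)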

I think sclang’s fast-and-loose jumping between functional programming, immutable-data concepts, and fully mutable data is honestly a deeper conceptual problem that makes things quite confusing. In Python, for example, any operation on a string takes the form of an assignment, string = string.doSomething() - there’s no need to think about it; this is just the way all code is written. But in sclang, some things mutate and some things return a new object - it’s not obvious which do which, and code assuming one approach or the other might work just fine 95% of the time, but do unexpected things if you call the wrong method.
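
Perhaps the classic sclang example of this ambiguity is add, which may grow the array in place or return a grown copy, so the safe idiom is always to reassign:

(
var a = Array(3);  // capacity 3, size 0
a = a.add(1);      // always reassign: add MAY return a new array
a = a.add(2);
a = a.add(3);
a = a.add(4);      // capacity exceeded here, so a grown copy comes back
a.postln;          // -> [ 1, 2, 3, 4 ]
)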

For me, the question is more like: why does { [1, 2, 3] } construct a new Array?! “All literals are immutable” seems like a straightforward rule - there are plausible performance benefits, and it only takes a .copy to get a mutable version of something you want to mutate. If I had to guess, this was the original guiding principle, and Arrays were switched to being constructed at runtime for some kind of pragmatic reason (e.g. it allows overriding the add method)?

Actually, I was a bit curious:

(
bench {
    10000.do {
        var a;
        a = [$a, $b, $c, $d, $e, $f, $g];
    }
};

bench {
    10000.do {
        var a;
        a = "abcdefg";
    }
};

bench {
    var string = [$a, $b, $c, $d, $e, $f, $g];
    10000.do {
        var a = Array(7);
        a.addAll(string);
    }
};
)

These are almost identical from an operational perspective (the array version only uses slightly more memory; otherwise the result should be basically the same), yet the array version is more than 10x slower. I’m curious how much code would break if we made arrays compile to immutable literals as well, because the performance benefit could be huge.

Weirdly, the last one SHOULD be faster than the first, because it uses addAll, where the first just calls add a bunch of times - but for some reason it was even slower (2x…)? It turned out this was only hitting a slow path because of a type difference between the String and the Array - with that fixed, the addAll version is 2x faster than the Array literal.

That could easily be solved with a Symbol, though, as Symbols are interned and immutable - which is essentially all a runtime string literal is, albeit interned in the constant table of the chunk rather than VM-wide.

a = String.newFrom('literal')

I wonder if function signatures and an LSP could solve this problem (in another language), since you could get a warning when a function returns a new object that the user discards?

I wonder… are there any interpreted languages that call the interpreter during compilation, keeping track of which functions are ‘pure’ and which mutate global state, and evaluating the creation of the object to be stored in the chunk at compile time?

I’m on a bit of a language-design learning spree at the moment…

VERY good questions to ask in the context of this :slight_smile:

Point taken, though one BIG factor here: Symbols are not exactly immutable strings - they’re more like content-based global keys. VERY importantly, they all have a unique id - converting a string into a symbol (e.g. by hashing it and then finding it in the symbol table) is expensive, and increasing the number of symbols can have an effect on performance. Making string literals into Symbols would, at a glance, mean adding 5000 or 6000 new entries to the symbol table, in spite of basically none of them actually needing the “unique id” functionality of Symbols. For really pragmatic reasons, this would probably throw a wrench in the approach (but it would make string comparisons faster, like Python!).
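
A sketch of that distinction (conversion cost aside):

'foo' === 'foo';          // true: both resolve to the same symbol-table entry
"foo".asSymbol === \foo;  // true: asSymbol hashes and looks up the interned entry
"foo" === "foo";          // false: two distinct String instances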

sclang is the only language off the top of my head that has a clear distinction between Symbol (globally unique strings) and String (opaque data containers), which I think is fantastic - it’s something I often miss in other languages, though as of C++20 I think you can BASICALLY implement the same thing in C++?

This is something I hadn’t considered, both in terms of the hash and the increased size of the symbol table. Yeah, I really like SuperCollider’s explicitness here too!

I’m actually making a language right now (dream big :grin:): typed with interfaces, errors as values, no inheritance but mixins, and modules. I got as far as making closures and a type system - only floats so far, no complex types - and now I’m rewriting and hooking in the GC before carrying on. It’s in Zig, which I hadn’t used before this, and it’s just a joy compared to C/C++. In this language, the type id is stored next to the slot tag, so small-string optimisation should be trivial, which would be another solution to this particular issue, since most string literals aren’t that big.