While the new opcode abstractions are a step forward for compiler code clarity, they underscore the absence of a shared specification. Without one, the compiler and interpreter operate on implicit agreements that are prone to strange errors.
It’s almost as if this step was skipped in the process.
A bytecode specification plays a crucial role in any virtual machine-based language. It is the authoritative contract between the compiler, which emits bytecode, and the interpreter, which runs it.
A well-defined, machine-verifiable specification makes this contract explicit rather than implied. It outlines:
The list of opcodes
Their binary representation
Operand types
Instruction lengths and invariants
This isn’t just documentation — when shared between the compiler and VM, the spec can automatically validate correctness and generate boilerplate safely, avoiding many classes of errors that otherwise show up only at runtime.
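To make this concrete, here is a sketch of what one machine-readable entry could look like; the names, values, and fields are hypothetical, not taken from sclang:

```cpp
#include <array>
#include <cstdint>
#include <string_view>

// Hypothetical operand kinds; the real sclang operand set differs.
enum class OperandKind : std::uint8_t { None, Nibble, Byte, JumpOffset };

// One declarative entry per opcode: name, encoding, length, operand types.
struct OpcodeSpec {
    std::string_view name;
    std::uint8_t value;                   // binary encoding of the opcode byte
    std::uint8_t length;                  // total instruction length in bytes
    std::array<OperandKind, 2> operands;  // operand types, in order
};

// A single shared table that both the compiler and the interpreter could include.
inline constexpr OpcodeSpec kOpcodes[] = {
    { "PushLiteral", 0x01, 2, { OperandKind::Byte, OperandKind::None } },
    { "JumpIfFalse", 0x02, 3, { OperandKind::JumpOffset, OperandKind::None } },
};
```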
The interpreter still consumes raw bytes directly (as before!). If the new compiler-side opcode abstractions diverge in ordering, layout, or semantics, nothing enforces that the interpreter stays in sync.
What Can Go Wrong Without a Spec?
A LOT
These problems are subtle. Code may compile but misbehave in edge cases, especially with less common opcodes or combinations.
The goal: a single definition, verified across both layers. In other words, a formal, machine-verifiable specification that itself needs to be tested as well. (Yes, it is not rare for specifications themselves to contain mistakes.)
Some of these things are enforced by the type system; for others, in debug mode, they are asserted.
These problems are subtle. Code may compile but misbehave in edge cases, especially with less common opcodes or combinations.
The presence of a formal spec doesn’t ensure that the interpreter implements it correctly. If we were to make a spec now, it would simply describe what the interpreter does today (warts and all). I don’t think that would be useful. Perhaps if we were designing a language from scratch this would be useful, but due to backwards compatibility, the correct behaviour of sclang is its current behaviour. In my opinion, it would be more useful to have specific tests for unusual parts of the language (I’ve added a few regarding int literals), although simply compiling the class library and running the test suite is pretty thorough.
You mention that much of the structure is encoded in the type system and asserted in debug builds, and that’s not bad. The missing piece is that there’s no shared source of truth between both layers. Right now, the interpreter consumes raw bytes, and the compiler emits them via abstractions that are, in a way, defined detached from it. If either side changes (even innocently, a tiny thing), there’s no structural way to detect desyncs other than runtime behavior or failing tests.
I accept your point: formal specs don’t catch bugs on their own. But they are fundamental, and they can enable tools that do:
Round-trip tests
Auto-generated opcode tables
Scripts to compare what the compiler emits with what the interpreter expects
A clear way to review what each opcode expects
Even just a declarative table could make tests more meaningful and bridge the gap between the high-level abstractions and the raw reads.
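As a sketch of the round-trip idea (the emit function and the expected values are made up for illustration, not the real compiler API):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for the compiler-side abstraction; purely illustrative.
std::vector<std::uint8_t> emitPushLiteral(std::uint8_t index) {
    return { 0x01, index };
}

// Round-trip check: the emitted bytes are compared against what a spec entry
// for this opcode would say about its encoding and length.
void roundTripPushLiteral() {
    constexpr std::uint8_t kExpectedOpcode = 0x01;  // opcode byte from the spec
    constexpr std::size_t kExpectedLength = 2;      // opcode byte + literal index
    const auto bytes = emitPushLiteral(7);
    assert(bytes.size() == kExpectedLength);
    assert(bytes[0] == kExpectedOpcode);
    assert(bytes[1] == 7);  // the operand round-trips unchanged
}

int main() { roundTripPushLiteral(); }
```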
–
Jordan, I mean this as a constructive critique. If we are going to touch the current raw bytecodes, we should do it in a good way. Readability is not enough (actually, now we have to change two layers to modify one thing). I think a specification is not controversial here.
This all sounds good, but I still don’t understand what change you are proposing to the C++ code. Is there some tool that does all this? Some project that does this that we should be copying?
no shared source of truth between both layers.
I’ve attempted to encode this ‘truth’ in code so it can be enforced.
With this PR:
If a bytecode changes (the identifier bit, not the operands) and the value doesn’t collide, it will work so long as the labelled goto table is updated (I can’t find a way to enforce this; perhaps a generator could be used, but that seems overkill?). If it collides, it won’t compile.
If the number of operands changes, neither the compiler nor the interpreter will compile, because in the former case not enough arguments are given to the emit method, and in the latter the structured binding won’t have enough arguments.
If the types of the operands change, this one is a bit more subtle, as structured bindings don’t let you declare the types of the operands, but in most cases there will be a compile error in the interpreter where the operand is used. In the compiler there will always be a compile error. (The last two mechanisms are sketched below.)
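A minimal sketch of those last two points, with hypothetical names (the real PR differs in its details):

```cpp
#include <cstdint>
#include <vector>

// Hypothetical opcode type: identifier and operand declared in one place.
struct PushLiteral {
    static constexpr std::uint8_t code = 0x01;
    std::uint8_t literalIndex;  // the single operand
};

// Compiler side: the emit function takes exactly the declared operands, so
// changing the operand count breaks every call site at compile time.
void emitPushLiteral(std::vector<std::uint8_t>& out, std::uint8_t literalIndex) {
    out.push_back(PushLiteral::code);
    out.push_back(literalIndex);
}

// Interpreter side: a structured binding with the wrong number of names also
// refuses to compile, so both layers are forced to change together.
std::uint8_t interpret(const PushLiteral& op) {
    const auto& [literalIndex] = op;  // exactly one binding per operand
    return literalIndex;
}
```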
If we are going to touch the current raw bytecodes, we should do it in a good way.
If you have a concrete way to enforce any more of the opcodes’ structure, or to otherwise make things more explicit, or any other specific feedback, I’d love to know and would gladly make the changes. I’m just struggling to understand what you are actually recommending.
Let’s see… A practical approach in C++ is to define one canonical opcode list and share it between the compiler and interpreter, for example an X-macro or a constexpr array of structs in a header. In this case (a speculation for now), each entry encodes the opcode name and its metadata. This single list is then expanded to generate enums, tables, etc.
Going back to the main point: something like this ensures one source of truth: the compiler and interpreter include the same header/array.
To be fancy, why not use the shared list to generate interpreter code? For instance, an X-macro can generate the switch/case blocks or a dispatch table automatically.
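Something along these lines (an illustrative list, not the actual SuperCollider opcodes):

```cpp
#include <cstdint>
#include <cstdio>

// Illustrative X-macro list; the real opcode names and values differ.
#define OPCODE_LIST \
    X(PushLiteral, 0x01) \
    X(JumpIfFalse, 0x02) \
    X(Return,      0x03)

// Expansion 1: the enum the compiler uses when emitting.
#define X(name, value) name = value,
enum class Opcode : std::uint8_t { OPCODE_LIST };
#undef X

// Expansion 2: a name table the interpreter or a disassembler can use.
#define X(name, value) case Opcode::name: return #name;
constexpr const char* opcodeName(Opcode op) {
    switch (op) { OPCODE_LIST }
    return "unknown";
}
#undef X

int main() {
    std::printf("%s\n", opcodeName(Opcode::JumpIfFalse));  // prints "JumpIfFalse"
}
```

The same list could just as well drive a dispatch table or a disassembler, so the enum, the tables, and the interpreter cases can never drift apart.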
C++ experts can refine the specifics of such an implementation; the point is something that gives an early warning and a single source of truth without much redesign.
(Of course, here I tried to reply to your implementation question rather than talk about how important it is to have a specification.)
EDIT: With C++20 concepts or with type-tagging operands in the table, this will be safer.
The problem with a ‘constexpr array of structs in a header’ is that you now have to refer to them by index, not by name. What you need to do is make the number a part of the structure instead. This is exactly what I have done. If we had C++26 and the upcoming reflection, this could be simplified, but given we don’t, I think this is the best compromise.
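Roughly this shape, with illustrative names rather than the PR’s actual types:

```cpp
#include <cstdint>

// Each opcode type carries its own number, so call sites refer to it by name
// (PushInt::code) rather than by position in some array.
struct PushInt {
    static constexpr std::uint8_t code = 0x10;
    std::int32_t value;  // operand
};

struct StoreVar {
    static constexpr std::uint8_t code = 0x11;
    std::uint8_t slot;  // operand
};

// A collision between two opcode numbers can then be rejected at compile time.
static_assert(PushInt::code != StoreVar::code, "opcode values must not collide");
```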
Going back to the main point: something like this ensures one source of truth: the compiler and interpreter include the same header/array.
Yup that’s what they do. The compiler builds the opcodes, the interpreter uses them, but they include the same header. One source of ‘truth’. In fact, that was the whole point of this PR.
It appears you are just restating what I’ve done in the PR, but without considering the compromises one has to make during implementation. If you think there is a better approach or any changes given the technical constraints of the project to date, I’d be more than happy to consider it!
EDIT: With C++20 concepts or with type-tagging operands in the table, this will be safer.
No it won’t: all the constraints (bar the goto labels) of the opcodes can be expressed in C++17. Concepts just give you a better alternative to SFINAE (which I don’t use because it isn’t needed).
My critique wasn’t about implementation details, but about fundamental architectural principles. You seem to have misconstrued my point about specifications as merely suggesting alternative C++ techniques.
Yes, your PR implements a shared header file, but my concern was broader: having an explicit, documented contract between the compiler and interpreter components. This isn’t just about code organization; it’s about system architecture.
Your dismissive response to suggestions (“X-macro’s are horrid”) misses the forest for the trees.
The improvements in readability are valuable, but they don’t automatically verify the guarantees that a proper specification would. Your approach does address some of the type-safety concerns through C++17 constraints, but it doesn’t provide the level of formalism that would make the contract truly explicit.
I stand by my original point: a language like this benefits significantly from having a formal, machine-verifiable specification that serves as a contract between components.
Also, think of subtle desynchronization. Types enforce some constraints, but they don’t document intention or verify semantic correctness across the boundary. I raised these points not to criticize your specific implementation, but to emphasize that a formal specification matters.
I think that’s all to be said in the paragraph above.
A structured specification goes beyond what well-typed C++17 code can do.
Since a lot of the post was on implementation, just to remind us: C++20 concepts offer more than just replacing SFINAE; they provide clearer semantics that would benefit this situation.
Your implementation works, but a formal specification would centralize the semantic relationships across a complex boundary rather than scattering constraints throughout the code. I hope I am not the only one who sees the difference.
I see the point about writing a detailed specification of what the bytecodes do, but I think moss has really done this. The PR applies that specification, enforcing as much as possible in the type system, as I’ve tried to explain by giving examples and by showing that this PR meets much, if not all, of the definition of a machine-verifiable specification you have given. If you think there is a specific case where this could be improved, I’d be happy to make changes.
Using concepts for operands is a bad idea. By using a specific type (as my PR does), the error will tell you that the operand should be of type X (e.g., Operand::BinaryMathNibble), whereas a concept produces a complicated constraint violation. Fundamentally, concepts are used to match a set of types, whereas in the opcodes only one type should be accepted, as opcodes have a definitive signature, as the existing bytecode documentation tells us.
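To illustrate the difference with hypothetical signatures (the concept version would need C++20):

```cpp
#include <concepts>
#include <cstdint>

// Approach 1: a specific operand type. Passing the wrong argument yields an
// error that names the expected type directly.
struct MathNibble { std::uint8_t value; };
void emitBinaryMath(MathNibble op);

// Approach 2: a concept. It describes a *set* of acceptable types, and a
// mismatch is reported as a constraint violation, which is noisier when the
// opcode accepts exactly one operand type anyway.
template <typename T>
concept NibbleLike = requires(T t) {
    { t.value } -> std::convertible_to<std::uint8_t>;
};
void emitBinaryMathGeneric(NibbleLike auto op);

// emitBinaryMath(42);        // error: cannot convert 'int' to 'MathNibble'
// emitBinaryMathGeneric(42); // error: constraints not satisfied
```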
I see: there’s a misunderstanding. You need a true machine-verifiable specification that serves as a formal contract between the compiler and interpreter, expressed as structured data (JSON/YAML/XML). I have not seen this. Such a spec would be:
Declarative rather than imperative
Separate from implementation
Formally verifiable
But that’s one approach. If you are confident that this is unnecessary, let’s find out. Maybe one ends up doing exactly that: just writing tests and discovering all the desynchronizations through them. The problem is that those “desynchronizations” can happen in odd cases and certain combinations.
I am not suggesting a hardcore formal verification (TLA+ or Coq). That is just too much.
The most important aspect is ensuring that both compiler and interpreter derive their implementation from the same specification source.
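As a sketch of the lightweight end of that spectrum (illustrative names and values, not real sclang code), a shared constexpr table whose invariants are checked at compile time is one way both sides can derive from the same source:

```cpp
#include <cstddef>
#include <cstdint>
#include <iterator>

// Illustrative shared table; both compiler and interpreter would include the
// header it lives in, and violations of its invariants fail the build.
struct OpcodeInfo {
    std::uint8_t value;   // opcode byte
    std::uint8_t length;  // total instruction length in bytes
};

inline constexpr OpcodeInfo kOpcodeTable[] = {
    { 0x01, 2 },  // PushLiteral
    { 0x02, 3 },  // JumpIfFalse
    { 0x03, 1 },  // Return
};

// Invariant checked at compile time: no two opcodes share a value.
constexpr bool valuesAreUnique() {
    for (std::size_t i = 0; i < std::size(kOpcodeTable); ++i)
        for (std::size_t j = i + 1; j < std::size(kOpcodeTable); ++j)
            if (kOpcodeTable[i].value == kOpcodeTable[j].value)
                return false;
    return true;
}

static_assert(valuesAreUnique(), "opcode values must not collide");
static_assert(kOpcodeTable[0].length >= 1, "every instruction includes at least the opcode byte");
```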