Haskell Audio

Continuing the discussion from White House urges developers to avoid C & C++?:

Hey @amindfv !

I’ve looked into it, and indeed the garbage collector has gotten better for our purposes. I’ve also seen some compilation flags that may influence that, although I’ve never tried them.

I believe it’s possible, but not with low latency, and I’m not willing to give that up ))

I’d love to see a proof of concept of that!

Although I’ve seen projects like this one on Hackage, which claim "hard real-time": atom: An EDSL for embedded hard realtime applications. I don’t know what the trick is there.

It’s definitely possible to write non-allocating, non-GC-ing real-time code in Haskell with predictable performance, but you’d need to do it within a framework.

For example, you could use destination-passing style - this might be what I’d choose.
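
A minimal sketch of the destination-passing idea, using mutable storable vectors rather than any particular DPS library (the function name, buffer size, and parameters are made up for illustration): the caller allocates the destination once, outside the audio loop, and the DSP code only writes into it.

    import Control.Monad (forM_)
    import qualified Data.Vector.Storable.Mutable as MV

    -- The destination buffer is passed in; nothing is allocated in here.
    sineBlock :: Double -> Double -> MV.IOVector Double -> IO ()
    sineBlock sampleRate freq dst =
      forM_ [0 .. MV.length dst - 1] $ \i ->
        MV.write dst i (sin (2 * pi * freq * fromIntegral i / sampleRate))

    main :: IO ()
    main = do
      buf <- MV.new 64                -- one-time allocation, outside the hot path
      sineBlock 48000 440 buf
      MV.read buf 1 >>= print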

Or you could go heavy-duty and use:

I haven’t looked into the Atom DSL which you linked to, but it looks interesting too.

You can also get control over allocations without completely removing them, using streaming libraries like pipes/conduit/etc. These are sometimes claimed to be as fast as C.
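
As a rough illustration (assuming the conduit package; pipes would look similar), here is a toy block of samples pushed through a small chain of stages, which the library fuses so that no intermediate list is materialised:

    import Conduit (runConduit, (.|), yieldMany, mapC, sinkList)

    main :: IO ()
    main = do
      -- gain stage followed by a hard clip, fused into a single pass
      out <- runConduit $ yieldMany [0.8, -1.6, 0.3 :: Double]
                       .| mapC (* 0.5)
                       .| mapC (max (-1) . min 1)
                       .| sinkList
      print out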

An older way of dealing with streaming computations is to use arrows. These are kind of weird to work with but they at least “feel” like composing UGens.
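
To give a flavour of that wiring style, a tiny sketch using plain functions as arrows (real libraries such as Yampa use a dedicated signal-function type, but composition reads much the same; the "patch" here is invented for illustration):

    import Control.Arrow ((>>>), (&&&), arr)

    -- Split the input into two gain stages, sum them, then clip.
    patch :: Double -> Double
    patch = ((* 0.5) &&& (* 0.25)) >>> arr (uncurry (+)) >>> (max (-1) . min 1)

    main :: IO ()
    main = print (patch 0.8)   -- 0.8*0.5 + 0.8*0.25 = 0.6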

A couple other possible avenues:

  • Linear types (a big area)
  • ST: restricted (safer) C-style coding (a quick sketch follows below)
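
For the ST idea, a minimal sketch (assuming unboxed vectors; the function name and gain value are made up): in-place, C-style mutation that stays local to the computation, so the function as a whole remains pure.

    import Control.Monad (forM_)
    import Control.Monad.ST (runST)
    import qualified Data.Vector.Unboxed as V
    import qualified Data.Vector.Unboxed.Mutable as MV

    -- Apply a gain in place inside ST; callers only ever see immutable vectors.
    scaleBlock :: Double -> V.Vector Double -> V.Vector Double
    scaleBlock gain xs = runST $ do
      buf <- V.thaw xs
      forM_ [0 .. MV.length buf - 1] $ \i ->
        MV.modify buf (* gain) i
      V.freeze buf

    main :: IO ()
    main = print (scaleBlock 0.5 (V.fromList [1, 2, 3]))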

This is just an overview, and there are other avenues to explore. But this is long enough already!

1 Like

Tom,

Thank you very much for the thoughtful and surprising response. I’m going to delve into these possibilities, and we should be in touch.))

I’ve been working on a DSL, and recently with your library too, which is quite cool.

I do wonder though: when working with large blocks of memory one can run into all kinds of timing issues due to caching behavior. Memory cache misses can be quite expensive in terms of time and energy consumption. Pre-allocating a large block of destination memory therefore in general will not automatically lead to predictable performance.

1 Like

What kind of audio process would require this? Allocating a buffer would be done by another process, I suppose?

For example, in SC the Synth has nothing to do with buffer allocation.

If I understood correctly, the destination-passing style (DPS) mentioned above pre-allocates memory to write results into and then passes it around to the calculations. So depending on what you are calculating (e.g. writing output to a buffer) you might run into such issues.

1 Like

I think there is some trickery there: “The elements of the intermediate vectors are consumed as they are produced”. The important thing is that there is no interruption. As I understand it, it’s not so different from scsynth. (I’m still absorbing this stuff, idk)

I do wonder though: when working with large blocks of memory one can run into all kinds of timing issues due to caching behavior. Memory cache misses can be quite expensive in terms of time and energy consumption. Pre-allocating a large block of destination memory therefore in general will not automatically lead to predictable performance.

I think the usual strategy in audio is to allocate everything in a big block, where areas of memory are adjacent when their execution would be adjacent (meaning, basically, if you have node_a -> node_b then you have [memory_of_node_a, memory_of_node_b]). IIRC both scsynth and supernova are already doing this. This way, you have a higher likelihood of cache hits as you traverse the graph. Both servers do this for executable memory as well; instruction cache misses are, I think, even more costly.
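
A toy Haskell sketch of that layout (the function name and node sizes are invented; scsynth and supernova do the real thing in C++): one allocation for the whole graph, with each node handed a slice of it in execution order, so walking the graph walks the memory linearly.

    import qualified Data.Vector.Storable.Mutable as MV

    -- One big block; each node's working memory is a slice of it,
    -- laid out in the order the nodes will be executed.
    allocGraph :: [Int] -> IO (MV.IOVector Double, [MV.IOVector Double])
    allocGraph sizes = do
      block <- MV.new (sum sizes)
      let offsets = scanl (+) 0 sizes
          slices  = [ MV.slice off len block | (off, len) <- zip offsets sizes ]
      pure (block, slices)

    main :: IO ()
    main = do
      (_, nodeMem) <- allocGraph [64, 64, 128]   -- node_a, node_b, node_c
      mapM_ (print . MV.length) nodeMem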

You’re definitely right re “automatically lead to predictable performance”, but a big block with a predictable order is better than a big block with a bad order, or heavily fragmented memory.

@scztt I’m confused, you said the opposite. So a big block WILL lead to predictable performance in this context (no GC, no malloc, etc.), or are you saying “but not automatically”?

So it boils down to always getting a big block for all your nodes, regardless.

“Big block” is probably the simplest strategy for getting predictable performance. I say “but not automatically” just because cache performance is complex and not perfectly predictable, and can vary a lot across e.g. different CPUs. You can make some basic assumptions, but most of the code I’ve encountered where cache optimization has been done has at least one or two things that would not be obvious without discovering them in real-world testing.

1 Like

Yes. A Graph and all the UGens it contains are allocated as a single large memory block.

One interesting thing Supernova does is that when it processes a Unit, it already prefetches the memory for the next one: supercollider/server/supernova/sc/sc_synth.hpp at db7eed2a17c361503dbc7f70a557874b6001e3cd · supercollider/supercollider · GitHub.

Generally, you want to keep the memory footprint as small as possible. For example, all the input and output buffers are allocated from one larger block of memory with a buffer coloring algorithm to keep the total amount of accessed memory as small as possible. (In scsynth, these wire buffers can be shared among all Synth instances; in Supernova, however, Synths may run in parallel so each of them needs its own set of wire buffers.)

For anyone interested in the topic of data locality and data oriented design, I highly recommend Mike Acton’s (in)famous CppCon talk: https://www.youtube.com/watch?v=rX0ItVEVjHc

2 Likes

He presents a perspective that seems to oppose Simon Peyton Jones’ paper. Did you do that on purpose? ))) My attention wasn’t fully on it and I still have to watch the whole thing, but he seemed to draw a sharp distinction between “abstract models” (other coders) and those who immerse themselves in the shit, I mean, in the practicalities of data and hardware (his company).
It strikes me because, for me, concepts such as categories and functors are more than mere mental constructs or ways to avoid confronting reality; they’re a way to understand how we think in general. His presentation was refreshing and made me think about it again.
Functors emerge as unique conduits between different domains, preserving their inherent structures. Imagine a functor as a means to integrate one domain into another seamlessly, without disrupting their relational dynamics. It’s akin to embedding a model from one domain into another, where the original serves as a blueprint or scheme within the new context.
The integration of one domain into another can take VARIOUS forms, some methods resembling each other and others markedly different. One approach might condense the entire original domain into a single point, whereas another might meticulously map each element and its connections to new counterparts in the receiving domain. The integration can be realized in multiple ways, and we can always compare these different realizations: there is a special class of mappings between functors (natural transformations) that preserves the functors’ structure.
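
A small Haskell illustration of those two extremes, and of the structure-preserving mappings between functors (natural transformations), just to anchor the analogy; the names here are only for the example:

    import Data.Functor.Const (Const (..))
    import Data.Maybe (maybeToList)

    -- One functor collapses everything to a single point...
    collapse :: a -> Const () a
    collapse _ = Const ()

    -- ...while Maybe and [] keep each element and its relations intact.
    -- maybeToList is a mapping between those two functors that respects
    -- their structure (a natural transformation): it commutes with fmap.
    main :: IO ()
    main = do
      print (maybeToList (fmap (+ 1) (Just 41)))   -- [42]
      print (fmap (+ 1) (maybeToList (Just 41)))   -- [42] as well: naturality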

The thing is, this man KNOWS A FUCKING LOT ABOUT HARDWARE. I’m sure about that!!! But I thought he didn’t need the manifesto, did he? hehe

It’s just one of those ways of integrating domains…

I mean, I would prefer to know more about memory management and not ask you guys stupid questions (I learn a lot from you). I hope I will get there, but I don’t think that knowing more would change anything essential.

He presents a perspective that seems to oppose Simon Peyton Jones’ paper. Did you do that on purpose? )))

I haven’t read the paper, so it was not intentional :slight_smile:

I hope I will get there, but I don’t think if I knew more it would change something ‘deep’.

I don’t think there is anything ‘deep’ about low-level programming. You need to follow the hardware, and a simple imperative programming style is likely the best way to achieve that. We use OOP and FP to build all these wonderful abstractions, but they eventually all break down once we try to squeeze the last bit of performance out of our machines. Also, keep in mind that OOP and FP were invented at a time when memory latency simply wasn’t an issue; RAM access was basically as fast as any other instruction. This has changed dramatically over the last decades!

To be clear: OOP and FP certainly have their place, just not in low-level high-performance code :slight_smile:

EDIT: if you want the real fun, have a look at advanced SIMD programming. Often you cannot even write trivial for-loops; instead you need to figure out how to manually reorder and interleave your operations to make the best use of the SIMD registers and prevent CPU pipeline stalls.

1 Like

One does not need for-loops with lambda calculus either. I’ll take a look, thanks for the tip.

That’s the paper (link at the top of the discussion):

I think there’s a misunderstanding. My point was that with SIMD programming sometimes you can’t even do simple sequential programming. Lambda calculus simply does not help you at all when trying to deal with the individual bits of memory in the most efficient way.

It was a joke, my man! :slight_smile: :crazy_face: It was invented when “computer” meant a person doing computation, not a machine!

In theory, LC and Turing machines are equivalent. Maybe in another dimension, they evolved the other way around.

I’m wondering what the memory latency was in those times :smiley:

1 Like