Experiences with RAVE UGen?

hey @elgiano :slight_smile:

thank you for nn.ar!
just to add to the conversation: I tested it here using the RAVE-provided models, and I seem to have no dropouts even with lower buffer sizes :slight_smile:

running on Arch Linux, ThinkPad T14s Gen 3.


Buffer size seems to depend a whole lot on which model I'm using. With the "wheel" model I was able to run

(
{
  var in = SoundIn.ar(0) * 12; // boost the mic input
  var out = NN.ar(\wheel, \forward, 128, in); // model, method, blockSize, input
  out;
}.play
)

and it didn't glitch out too much (though it was spiky). Oddly, the low buffer size didn't seem to make a big difference in real-time delay, i.e. the delay was still quite noticeable, though I didn't test extensively.

but with “vintage” model I ran

{
  var in = PlayBuf.ar(1, c, BufRateScale.kr(c), loop: 0); // c: a previously allocated source Buffer
  var latent = NN.ar(\vintage, \encode, 2048*8, in * 6); // encode audio into latent codes
  var mod = latent.collect { |l| l + LFNoise1.ar(1).range(-1, 1) }; // wobble each latent dimension
  NN.ar(\vintage, \decode, 2048*8, mod)!2; // decode back to audio, duplicated to stereo
}.play;

(which was about as high as I could go before getting error messages), and still it wasn't able to render in real time, so I just recorded the server output and turned down my speakers.

Here is my as-is M1 build in case it's helpful to others:
https://drive.google.com/drive/folders/1lsxMF8eNkPXI0uVJdh2ZefEEVg4_PL43?usp=sharing


I compiled nn.ar for Windows to test it; here is the compiled plugin in case someone wants to try it out :slight_smile:
https://bgo.la/nn-ar-win-build.zip

I had zero experience compiling things on Windows, so you might need to put the .dll files from libtorch in the same folder as your scsynth.exe (I used https://download.pytorch.org/libtorch/cpu/libtorch-win-shared-with-deps-2.0.1%2Bcpu.zip)


thanks @bgola and @Eric_Sluyter for your builds!
I added GitHub Actions, so builds are now generated automatically for Linux, macOS x64, macOS arm64, and Windows. However, I haven't tested any of them (I have no access to Mac or Windows machines at the moment). Would you mind testing them?

I also polished the interfaces and documentation a bit. I've tried some NRT synthesis and, as expected, it sounds good (no dropouts). So, even if RT is not really viable on "weaker" computers (like mine), at least we have more expressive tools for NRT renders.

@scztt (and everyone interested): I have more time this week, would you like to have a look at the cpp code and sc interfaces together?


Yes, maybe we can do a code review together? I'll DM you tomorrow once I figure out my schedule, but this would be fun.


Hey all!
I've released v0.0.2-alpha (:spaghetti:)

Apart from fixing some release issues, it includes recent updates to nn_tilde, and some optimizations that make it lighter on the DSP chain. By this I mean it became more asynchronous: if the processing thread takes too long, you'll get silence in the NN.ar output, but it won't freeze the rest of the audio chain.

Let me know how it goes if you try it!

Hey all!
Version v0.0.3-alpha is out (:troll:)
Fixed some issues and changed the interface a little bit.

Let me know how it goes if you try it!

3 Likes

I've added some considerations about latency to the README; I thought it could be nice to share them here too:

Latency considerations (RAVE)

RAVE models can exhibit significant latency, from various sources. Here is what I found:

tl;dr: if latency is important, consider training with --config causal.

  • The first obvious source of latency is buffering: we fill bufferSize samples of data before passing them to the model. With most of my rave v2 models, this is 2048/44100 = ~46ms.
  • Then processing latency: on my 2048/44100 rave v2 models, on my i5 machine from 2016, this is between 15 and 30ms. That is very often bigger than 1024/44100 (~23ms, my usual hardware block size), so I have to use the external thread all the time to avoid pops.
  • RAVE's intrinsic latency: cached convolutions introduce delay. According to the paper Streamable Neural Audio Synthesis With Non-Causal Convolutions, this can be about 650ms, accounting for most of the latency on my system. Consider using models trained with --config causal, which reduces this latency to about 5ms, at the cost of a "small but consistent loss of accuracy".
  • Transferring inputs to an external thread doesn't contribute significantly to latency (I've measured delays on the order of 0.1ms).
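Adding these numbers up gives a rough worst-case figure for my setup, a quick back-of-the-envelope sum in sclang (with --config causal the convolution term drops to ~5ms, for a total around 80ms):

(
var buffering = 2048 / 44100; // buffer fill: ~46ms
var processing = 0.030;       // model processing, worst case: ~30ms
var convolution = 0.650;      // non-causal cached convolutions: ~650ms
("total: " ++ (buffering + processing + convolution).round(0.001) ++ " s").postln; // -> ~0.726 s
)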

Thanks for this. I experienced dropouts with the previous version, but this one seems better and I have no drops.

On my Mac M1 max and OS 13.3.1, the following does not work:

// 2. play
{ NN(\rave, \forward).ar(1024, WhiteNoise.ar) }.play;

However, this example from your helpfile works:

{ NN(\rave, \forward).ar(WhiteNoise.ar) }.play

Adam

Hi Adam!
with version 0.0.3 I changed the interface a little bit: the first argument to .ar is now inputs, not blockSize anymore; blockSize was moved to be the second argument.
So the new syntax is:

NN(model, method).ar(inputs, blockSize, ...)
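
For example, your first snippet would now be written with the arguments swapped (blockSize can also be omitted entirely, as in your second example):

{ NN(\rave, \forward).ar(WhiteNoise.ar, 1024) }.play; // inputs first, then blockSize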

This is why your second example works and the first one doesn't anymore. I hope I have updated the documentation in all places :smiley:

Happy to hear the dropouts got better! Just out of curiosity, are you using a pretrained model or did you train your own?

Got it! I think your GitHub README still uses the old format, but I should be using the SC help files anyway.

I am using pretrained models for testing for now, but hope to train my own soon.

Speaking of pretrained models: I am using the ones from here: https://acids-ircam.github.io/rave_models_download . If I understand correctly, in these .ts files the prior is part of the file rather than separate. In your implementation, is there a way of using the prior from these particular files, rather than loading a separate prior?

Again, thanks so much for the work on this!

Sure, it's just another method. The old RAVE prior is a method with 1 input (temperature) and "latent size" outputs. It outputs latent codes, so you need to decode them:

{
  var temp = MouseY.kr.range(0, 10); // prior temperature
  var latent = NN(\rave, \prior).ar(temp); // "latent size" channels of latent codes
  NN(\rave, \decode).ar(latent);
}.play;

By the way you can check if a model has a prior method (not all of them do):

NN(\rave).methods

I'll update the README asap! I thought I had, but maybe something went wrong, sorry for the confusion!

Excellent, this works, thanks so much!
I found that, at least for the prior of the VCTK.ts model, the latent codes need a multiplication factor larger than 1 in order for the decoder to keep outputting audio. This is not the case with the prior of vintage.ts.
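
In code that means something like the following, where \vctk is the key I loaded VCTK.ts under, and the factor of 2 is just an illustrative value to tune by ear:

{
  var latent = NN(\vctk, \prior).ar(MouseY.kr.range(0, 10));
  NN(\vctk, \decode).ar(latent * 2); // scaling by > 1 keeps the decoder producing audio
}.play;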

In case anyone missed it, we released 15 RAVE models recently:


My TidalCycles / SuperDirt integration is now based on NN.ar:

Thanks for sharing the models and releasing code for transfer learning! Can I ask what your experience with transfer learning has been? I was surprised that it doesn't seem to be the default way of training RAVE.

Can I also ask something regarding training models?

I trained one model on around 5 to 10 minutes of spoken text by different people. Considering the number of parameters, I would say this is maybe not enough data, but the paper doesn't say anything about these aspects either.

I trained it over 5 days on a 3080 for 2M steps, and the multiband spectral distance developed like this, which looks somewhat fine, I think? At least it is clear that the 2nd stage of tuning the decoder has started.

Yet the outcome sounds really off, like extreme tape wobble. The input here is a sample from the dataset, a really clean recording of a male voice, but it's not even recognizable what the person is saying. And this is at fidelity 0.99.

5-10 minutes is definitely not enough for RAVE; you would need a minimum of 2 hours, and ideally 4-10 hours or more.

In general, I would recommend reading/asking on the RAVE Discord: RAVE


Thanks for the reply.
BTW, do you have any checkpoints for the voice models? I only found them for the 2 organ models.

It’s a pity that more and more information gets locked behind Discord :confused:

We don't, as we're in the early days of evaluating models created via transfer learning, but there will likely be more in the future.

But the goal of transfer learning / "foundation models" is that you don't need to be in the same stylistic domain. Our transfer-learned models so far have covered voice, percussion, electronic, and acoustic instruments, and they work extremely well, especially considering they only take hours to train instead of days or weeks.

To stay up to date with our work you can join our Discord: Intelligent Instruments


But only if the dataset contains high enough variance, right? I've already wondered why there hasn't been a "big universal" model to do transfer learning from. Is the model not capable of handling that much variety?

Also, do you know if anyone has looked at conditioning the VAE? I think this would allow for more precise movement within the latent space. I will try looking into it, but will most likely ditch the dynamic latent-space dimension-reduction algorithm while at it.