Splitting very large arrays

lionqueen · July 21, 2021, 6:23am

I’m trying to generate arrays containing all possible permutations of n elements. That I got, the problem is that when n gets to, say 10, that’s already above 3.6 million permutations, which means roughly 36 million indices, so my computer (2020 MBP) starts spinning its fans like crazy and that’s about it. n = 11 is nearly 40 million sets, and so on, so it gets much worse

var a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11];
~b = (a.size.factorial).asInteger.collect{arg i; a.permute(i)};
~b = ~b.flatten.asStream;

Is there a way to calculate/collect that in blocks, making sure each block does not exceed whatever value seems reasonable for array size? that way i could (hopefully) load block-arrays rather than the supermassive one that seems very unmanageable.

Thanks!

redFrik · July 21, 2021, 7:40am

start by collecting your results in a single List instead of in Arrays. that’ll give you a massive increase in performance.
a List is what you want for anything that’s growing. see its helpfile for a bit of explanation and how it compares to Array.

( //~8.5sec
{
var a = [1, 2, 3, 4, 5, 6, 7, 8];
~b = (a.size.factorial).asInteger.collect{arg i; a.permute(i)};
~b = ~b.flatten.asStream;
}.bench;
)

( //~0.09sec
{
var a = [1, 2, 3, 4, 5, 6, 7, 8];
var size = (a.size.factorial).asInteger;
var result = List.new(size);
size.do{arg i; result.addAll(a.permute(i))};
~b = result.asStream;
}.bench;
)

then, if/when you run into new problems, you can split up your code into blocks using a Routine. a .wait now and then will give the Interpreter some time for itself to collect garbage and what not.


(
Routine.run({
var a = [1, 2, 3, 4, 5, 6, 7, 8];
var size = (a.size.factorial).asInteger;
var result = List.new(size);
size.do{arg i;
if(i%10000==0, {
"breathe! collected % numbers".format(result.size).postln;
0.1.wait;
});
result.addAll(a.permute(i));
};
~b = result.asStream;
"done".postln;
result.size.postln;
});
)

good luck,
_f

21 juli 2021 kl. 08:30 skrev lioness via scsynth <noreply@m.scsynth.org>:

| lionqueen
July 21 |

| - |

I’m trying to generate arrays containing all possible permutations of n elements. That I got, the problem is that when n gets to, say 10, that’s already above 3.6 million permutations, which means roughly 36 million indices, so my computer (2020 MBP) starts spinning its fans like crazy and that’s about it. n = 11 is nearly 40 million sets, and so on, so it gets much worse
var a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11];
~b = (a.size.factorial).asInteger.collect{arg i; a.permute(i)};
~b = ~b.flatten.asStream;
Is there a way to calculate/collect that in blocks, making sure each block does not exceed whatever value seems reasonable for array size? that way i could (hopefully) load block-arrays rather than the supermassive one that seems very unmanageable.

Thanks!

#|
fredrikolofsson.com musicalfieldsforever.com

redFrik · July 21, 2021, 9:22am

oops, i now realised that in my example below the initial capacity for the List should’ve been bigger…

var result = List.new(size*a.size);

not a big deal but it’ll reduce the need to allocate more space during the growth.

but i also realised that, as you know the total size in advance, you can just create a single empty Array and .put your numbers in at the correct position.
that then allow you to use a special array class that’s more memory efficient - like Int8Array.


( //~0.08sec
{
var a = [1, 2, 3, 4, 5, 6, 7, 8];
var size = (a.size.factorial).asInteger;
var result = Int8Array.newClear(size*a.size);
size.do{arg i; a.permute(i).do{|b, j| result.put(i*a.size+j, b)}};
~b = result.asStream;
}.bench;
)

so this is probably the better solution if you’re only permuting 8bit values. uses a lot less RAM and you won’t need a growing List because with .newClear we create a fixed size Int8Array in advance.
_f

( //~0.09sec
{
var a = [1, 2, 3, 4, 5, 6, 7, 8];
var size = (a.size.factorial).asInteger;
var result = List.new(size);  //should really be size*a.size
size.do{arg i; result.addAll(a.permute(i))};
~b = result.asStream;
}.bench;
)

lionqueen · July 21, 2021, 11:09pm

Not sure what I’m doing differently (“nothing”, i would say!) but when I first tried your code in the morning it was working like a charm, even with a (1…11) array. I’ve been going at it for hours (again, haven’t changed your code one bit other than the initial array) and it seems totally stalled. Tried a few times already. Any clues?

jamshark70 · July 22, 2021, 3:02am

a List is what you want for anything that’s growing

There are a couple of concerns here: 1, ease of adding elements (List is the winner here), and 2, performance (List will never beat Array).

A List is just a wrapper for an array. It’s easier to add things because it does a = a.add(thing) internally, for you, saving you from the trouble of having to remember the reassignment for yourself. (The overhead of reallocating arrays is still there, just hidden from the visible code.) But add dispatches to Array’s add method – two method calls for each add – so it will always be slower.

Five benchmarks:

Wrong use of Array:add. This is the worst in terms of performance, because every expanded array created by add is discarded = huge garbage-collector load. And the final array is the wrong size.
Correct use of Array:add, without pre-allocating the number of elements. a = a.add(...) is necessary.
Correct use of Array:add, with pre-allocation. This is a bit faster because it doesn’t have to garbage-collect intermediate arrays.
List, without pre-allocating.
List, with pre-allocation. For 4 and 5, I believe the overhead of the double method calls outweighs other considerations – comparison of these two is probably random.

(
var n = 10000000;

bench {
	var a = Array.new;
	n.do { a.add(1.0.rand) };
	a.size.debug("unallocated, no reassignment");
};

bench {
	var a = Array.new;
	n.do { a = a.add(1.0.rand) };
	a.size.debug("unallocated, with reassignment");
};

bench {
	var a = Array.new(n);
	n.do { a.add(1.0.rand) };
	a.size.debug("pre-allocated, no reassignment");
};

bench {
	var l = List.new;
	n.do { l.add(1.0.rand) };
	l.size.debug("List, unallocated");
};

bench {
	var l = List.new(n);
	n.do { l.add(1.0.rand) };
	l.size.debug("List, pre-allocated");
};
)

unallocated, no reassignment: 1    // BZZZT wrong
time to run: 2.589702371 seconds.
unallocated, with reassignment: 10000000
time to run: 1.229955393 seconds.
pre-allocated, no reassignment: 10000000
time to run: 0.95807274900005 seconds.
List, unallocated: 10000000
time to run: 2.022845863 seconds.
List, pre-allocated: 10000000
time to run: 2.051731603 seconds.

In the 8.5 second example, it’s actually flatten that kills it. Merely restructuring the code (as fredrik did) to avoid flatten is enough. The distinction between List and Array isn’t really significant here.

( //~8.5sec
var a = [1, 2, 3, 4, 5, 6, 7, 8];
bench { ~b = (a.size.factorial).asInteger.collect{arg i; a.permute(i)}; };
bench { ~b = ~b.flatten.asStream };
)

time to run: 0.067028030000074 seconds.
time to run: 15.500484482 seconds.

hjh

PS Apparently my laptop is not very good

redFrik · July 22, 2021, 7:00am

that’s a fantastic explanation. thank you James.

and just in case - i’d like to mention lazy evaluation here too. as the OP had a .asStream in the example i wonder if maybe all this collecting into arrays and lists could be avoided in the first place…


(
r= Routine({
var n= 11;
n.factorial.do{|i| (1..n).permute(i).do{|x| x.yield}};
}).asStream;
)

r.next;
r.next;
//etc

r.nextN(4); //get chunks

_f

22 juli 2021 kl. 05:08 skrev James Harkins via scsynth <noreply@m.scsynth.org>:

| jamshark70
July 22 |

| - |

a List is what you want for anything that’s growing

There are a couple of concerns here: 1, ease of adding elements (List is the winner here), and 2, performance (List will never beat Array).

#|
fredrikolofsson.com musicalfieldsforever.com

lionqueen · August 2, 2021, 11:28pm

Hi all,

Supposing ~x is the extremely long List or Int8Array, i was looking for a way to store it out of SC.

I tried ~x.writeArchive("~/Desktop/test.hmm".standardizePath)but it crashed the interpreter as expected. I bet there’s a more convenient way to do it, though.

Thanks!

Sam_Pluta · August 3, 2021, 6:29am

What about using a buffer? SoundFile will store your array in a wav file, which can be pretty dang large. I don’t think you can do an Int8Array, but you can do an Int16Array:

a = Int16Array.fill(1000000, {128.rand})

f = SoundFile.new;
f.openWrite("~/Desktop/test.wav".standardizePath);
f.writeData(a)
f.close

g = SoundFile.new;
g.openRead("~/Desktop/test.wav".standardizePath)
h = Int16Array.newClear(1000000);
g.readData(h);  //the help file says this needs to be a FloatArray, but it seems an Int16 works as well
h.postln;

lionqueen · August 3, 2021, 11:42am

This is excellent and it blew my mind. I’m trying to store that to use it outside of SC, though. That’s why I thought a text file would be the most universal way.

Sam_Pluta · August 3, 2021, 1:02pm

Where are you using it? Most coding environments can read wav files. I’ve been using Rust quite a bit, and using the hound crate to read wav files. No problem. Unless you want the file to be human readable this should work almost anywhere.

lionqueen · August 3, 2021, 4:42pm

The list/array is a very long sequence of midi events, which i would like to play from some low level MIDI sequencer, possibly making my own on a teensy/arduino. So that would just read this and play it (stored on an SD card or whetever). Another possibility could be a Raspberry Pi running SC but it definitely needs to be small and unobtrusive, as it needs to be installed for a few months.
Thanks!

Sam_Pluta · August 3, 2021, 4:58pm

Use File?

a = Int16Array.fill(10000, {128.rand})

f = File.new("~/Desktop/test.wav".standardizePath, "w");
a.do{|int| f.write(int.asString++",")}
f.close

g = File.readAllString("~/Desktop/test.wav".standardizePath);
h = g.split($,).asInteger;
h.postln;

TXMod · August 3, 2021, 6:43pm

For this you could use class SimpleMIDIFile in the “wslib” quark to write MIDI files.

lionqueen · August 5, 2021, 12:27pm

Thanks for that, Sam.
The thing is that, while your example works perfectly, SC crashes when it tries to do it with a much larger array, with millions of items.
Maybe there’s a good way to turn that sound file storing all the data into text outside of SC? Any ideas?

Thanks!!

jamshark70 · August 5, 2021, 3:30pm

readAllString and split are doing all the parsing in memory, which is not guaranteed to scale up to massive files.

This is a thing that happens a lot in SC: you start with a convenience method, then crash into its limitations, and then you have to learn what it’s doing inside and build a better-optimized solution.

I’d suggest to read characters one by one, and when hitting a $, (or nil at the end) convert the string-in-progress to a number, reset the string and continue. Then, at any time you’d have no more than a dozen characters held in memory.

I’m not at the computer right now; could whip up an example later.

hjh

jamshark70 · August 6, 2021, 12:17am

(
a = Int16Array.fill(1000000, { 128.rand });

f = File.new("~/Desktop/test.txt".standardizePath, "w");
a.do { |int| f << int.asString << "," };
f.close;
)

(
~readMem = {
	g = File.readAllString("~/Desktop/test.txt".standardizePath);
	g.split($,).asInteger;
};

~readNum = {
	var f = File("~/Desktop/test.txt".standardizePath, "r");
	var str, char;
	var outArray = Int16Array.new;
	if(f.isOpen) {
		protect {
			str = String.new;
			while {
				char = f.getChar;
				char.notNil
			} {
				if(char == $,) {
					outArray = outArray.add(str.asInteger);
					str = String.new;
				} {
					str = str ++ char;
				}
			};
		} {
			f.close
		};
	} {
		Error("failed to open").throw
	};
	outArray
};
)

// 10000 benchmark -- 30.8 kb file
bench(~readMem);  // ~ 0.05 sec
bench(~readNum);  // ~ 0.035 - 0.04 sec

// 1000000 benchmark -- 3 MB file
bench(~readMem);  // ~ 4.35 - 4.5 sec
bench(~readNum);  // ~ 1.7 sec

// 10,000,000 benchmark -- 29.9 MB file
bench(~readMem);  // ~ 44.4 sec (1 trial only, it's long enough)
bench(~readNum);  // ~ 17 sec

~readNum is more code, but by reducing memory usage and garbage collector load, it performs roughly 3x better (except for the smallest test, which is already blink-of-an-eye).

But, if you know the data type in advance, you can do even better:

(
a = Int16Array.fill(10000000, { 128.rand });

f = File.new("~/Desktop/test2.txt".standardizePath, "wb");
a.do { |int| f.putInt16(int) };
f.close;
)

(
~readInt = {
	var int;
	var a = Int16Array.new;
	
	f = File("~/Desktop/test2.txt".standardizePath, "rb");
	if(f.isOpen) {
		protect {
			while {
				int = f.getInt16;
				int.notNil
			} {
				a = a.add(int);
			};
		} {
			f.close
		};
	} {
		Error("failed to open").throw;
	};
	a
};
)

bench(~readInt)  // 1.27 sec, vs 17 sec for the text way

hjh

lionqueen · August 6, 2021, 11:35pm

Hello James,

Impressive example. Not sure if I’m using it right, though. On my computer it says it’ll take 66.769560979 seconds to run when i bench it, but when i try to evaluate it, the file remains seemingly empty. Large, very large, close to a GB, but no text is written to it.

jamshark70 · August 7, 2021, 12:00am

If you mean the putInt16 example, it’s not writing text – it’s writing the integers as binary numbers, e.g., 300 is (in hex) 12C, so it will write two bytes into the file, 01 and 2C. They won’t be human readable, but by skipping the conversion to a string and the conversion back to an integer, it’s much faster.

66 seconds is a lot though. Which platform are you on?

Btw benchmarking does evaluate the function; it just measures the time (not part of normal evaluation).

hjh