Scalar languages have a different problem: the default is to process one value at a time, which wastes the potential parallelism that array languages exploit with SIMD algorithms. Because this is the status quo, it's easy to overlook as a concern. And the solution is again blocking: the best algorithm for SIMD-friendly work is generally a blocked array method.
In practice, whether an array language performs well really depends on the specific problem. Of course, for most practical uses performance doesn't matter at all; I believe k's reputation mostly comes from kdb being fast as a database rather than from k implementations being fast languages. But array languages can be surprisingly fast just by focusing on elegant array algorithms rather than machine-specific considerations. I have comments and benchmarks comparing with C here:
A language that can stay in L1 cache for the duration of a computation will run circles around one that explicitly computes and stores every intermediate value in full.
Also, array-based languages can easily hit the wall of system memory capacity whereas traditional code tends to be streaming and can handle unbounded input lengths.
Then there's the slightly simpler option of splitting the operation into chunks of a couple dozen kilobytes of input/output, so that unnecessary temporary memory usage is bounded; as far as I know no array language does this, but I hope to one day do something like it for CBQN. This is even something a user can do manually in an array language (and indeed often has to, to maximize performance).
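A minimal sketch of that chunking idea in Python (illustrative names only, no real array-language machinery): the temporaries are bounded by the chunk size instead of the full input length.

```python
# Hypothetical sketch of chunked evaluation: computing sum((a + b) ** 2)
# without materializing full-size temporaries. Illustrative only.

def chunked_sum_sq(a, b, chunk=8192):
    """Sum of (a[i] + b[i])**2, touching only `chunk` elements at a time."""
    total = 0
    for start in range(0, len(a), chunk):
        # These temporaries are bounded by `chunk`, not len(a).
        t = [x + y for x, y in zip(a[start:start + chunk], b[start:start + chunk])]
        total += sum(v * v for v in t)
    return total

a = list(range(100_000))
b = list(range(100_000))
assert chunked_sum_sq(a, b) == sum((x + y) ** 2 for x, y in zip(a, b))
```

The same trick in an actual array language means applying the whole expression to slices of the input and reducing the partial results.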
Close, they are called "idioms".
> +|x
The plus should be an asterisk, but yes, that is an idiom the ngn/k interpreter recognizes.
It's of course possible to do it in a different way to deal with those issues, but those solutions can be a bit longer and less pretty.
The APL dialect I'm working on (Kap) solves this problem in many cases by delaying the computation until the result is needed, which means that you can write the code using the straightforward approach but still avoid computing results that will be thrown away.
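A toy sketch in Python of that delay-until-needed idea (purely illustrative; Kap's actual mechanism is more sophisticated): the "array" records the computation and only evaluates the elements actually asked for.

```python
# Toy deferred-evaluation sketch: a "lazy array" stores a function and a
# length, and computes elements on demand instead of materializing them.
# (Illustrative only; not how Kap actually works internally.)

class LazyArray:
    def __init__(self, fn, n):
        self.fn, self.n = fn, n
    def __getitem__(self, i):
        return self.fn(i)    # computed on demand, never stored
    def __len__(self):
        return self.n

# A "billion-element array" that costs no memory until indexed:
squares = LazyArray(lambda i: i * i, 1_000_000_000)
print(squares[3], squares[999_999_999])
```

Only the two requested elements are ever computed; results that would be thrown away never exist.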
If the necessary temporary arrays are really huge, array programmers will just do the computation in chunks that fit comfortably in memory.
1. The primitive p: generates the nth prime number (zero-indexed), so p:0 generates the first prime number 2, and p:1 generates the 2nd prime number 3.
p:6 generates the 7th prime number, 17.
2. p: supports arrays, so p: 0 1 2 3 will print the prime numbers at indices 0, 1, 2 & 3, which are 2 3 5 7.
3. The primitive/verb i. generates a sequential list, so i. 5 will generate 0 1 2 3 4.
4. p: can consume the output of other verbs like i.
So to print the first 1,000 prime numbers, it is simply p: (i.1000)
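For readers who don't know J, here's a rough Python sketch of what p: (i.1000) computes (a naive trial-division generator, purely illustrative; J's p: is far cleverer):

```python
# Rough Python equivalent of J's `p: i. n`: the first n primes,
# i.e. the primes at indices 0 .. n-1. Naive trial division, for clarity.

def first_primes(n):
    primes = []
    candidate = 2
    while len(primes) < n:
        # candidate is prime if no earlier prime divides it
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes

print(first_primes(4))   # [2, 3, 5, 7]
```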
My question is: will J cleverly optimize away the i.1000 array? Is it lazily generated, like in Haskell? Or is this just one of those "we don't really care about the memory of the i.1000 array, it's usually small in practice"? Like, for high performance stuff, it seems relevant to me that you spend an extra O(n) memory for no real reason.
So you have an arbitrary list of numbers that may or may not be prime, and you want to know how many are prime? (Let me know if I'm wrong.)
1. Pick the highest number, H in the arbitrary list of 1000 numbers.
2. Generate all prime numbers up to the highest number, H.
3. Use the e. verb (also known as "member of") to check which items of the arbitrary list are prime: +/ (arbitrary list of 1000 numbers e. all primes up to H) will give you the number of items in the arbitrary list that are prime.
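The same membership approach sketched in Python (not J; the helper names here are made up):

```python
# Steps 1-3 above: find the max H, generate all primes up to H with a
# sieve, then count how many items of the list are members of that set.
# This mirrors J's  +/ (list e. primes)  shape of the computation.

def primes_up_to(h):
    sieve = [True] * (h + 1)
    sieve[0:2] = [False, False]
    for i in range(2, int(h ** 0.5) + 1):
        if sieve[i]:
            sieve[i*i::i] = [False] * len(sieve[i*i::i])
    return {i for i, is_p in enumerate(sieve) if is_p}

def count_primes(xs):
    table = primes_up_to(max(xs))       # steps 1 & 2: primes up to H
    return sum(x in table for x in xs)  # step 3: membership, then sum

print(count_primes([4, 7, 10, 13, 15]))   # 2  (7 and 13)
```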
> will J cleverly optimize away the i.1000 array?
The primitives & verbs in array languages are designed to be as fast and efficient as possible.
So even if your code is mediocre, it will perform better than any mediocre imperative code.
You're right, but it's not just code golf; ideally the few characters are a clean, high-level way to express intent, and that is in tension with caring how a machine executes the algorithm most quickly.
> "This is not an issue in the imperative/functional cases, because they don't have to generate the array in step 1, they can just loop/recurse/lazily generate the values (i.e. memory is O(1) for this)"
It kind of is the same problem, just pushed down a level or two. Writing it casually in Python will not get you the fastest performance either: you should be working in C or you're wasting a lot of potential running the Python layer; then care about how you parallelize the computation or you're wasting 7/8ths of an 8-core processor; then care about making the best use of the CPU's SIMD instructions or you're wasting another half or more of the machine's potential; then branch predictors, caches, etc. Saying a loop written in Python is "not wasteful" but materializing a large array "is wasteful" draws a fairly arbitrary line for the industry. The ArrayCast podcast episode 52 touches on this[1]; I've cut/edited some relevant parts from the transcript below. [ML] is Marshall Lochbaum, who did performance optimization work on Dyalog APL and designs/implements the BQN array language, and [CH] is Conor Hoekstra, who hosts the podcast and works in NVIDIA research:
> ML: "using linear memory is just how array languages work. For any problem, pretty much, you're going to have to make a bunch of new arrays."
> CH: "there could be an array language that avoids [materializing] arrays when possible."
> ML: "it's going very much against the grain of the language to say, "All right, I've specified my answer in these high-level array terms, and now I want you to turn it into a C program for me"."
> ML: "it's still nice to specify a problem this way, but this array form for specifying gives you some pretty big advantages. [...] if you try to pack that all into a big iteration, your algorithm is no longer expressed as an array operation - and these array operations are things that we know how to do really quickly. You are giving up some performance information if you tell it, well, yes, I'm in an array language, but don't actually make me any arrays."
> CH: "my dream is that I want to be able to write like the most expressive solutions to problems and then have [the array language implementation] do the most performant thing. Like for Kadane's [algorithm], for example, the most performant thing is to hand roll that reduction yourself. It's going to be faster than materialize."
> ML: "I'm not convinced of that"
> CH: [where I work we have to optimize for teams working on large problems, recently had a discussion where 2 billion items is a small number]
> ML: "2 billion /is/ a small number"
> ML: "what you can do, even when you have an array algorithm, is split it into smaller arrays. This is often a lot better because you get to use vector operations with these. So for Kadane's algorithm in particular, I don't know how to express that purely with vector operations, but I think there probably is a way. And in that case, if you write it in C style, where you interleave scan and reduction, then it's much harder to go from that to a vectorized algorithm, which (if it exists) would almost definitely be the fastest way. The way you would get the array thing to be cache friendly is that you run it in blocks. And then within a block, it's doing a bunch of vector operations, but what you really want is, you know, working on two vector registers, say, at a time. The array language doesn't automatically chunk, but it's not that hard"
> ML: "Yeah, and I think the way for the implementation to get the best [performance] - maybe not with this particular problem - but definitely for things that are friendlier to arrays, where you don't have any compound functions inside scans. The way to optimize those is not to immediately break it down into a series of scalar operations, but instead to be more careful and start with your whole-array stuff and break that down as necessary, and maybe even compile these array operations into operations on individual registers, which is hard. Nobody's really done that, but coming from the other side, from C, there have been who knows how many man-hours poured into work on auto-vectorization. And it's still pretty terrible. It would have absolutely no chance of handling something like Kadane's algorithm."
> ML: "getting the best implementation of an idea is pretty difficult. But I think actually starting from an array representation, you do have a pretty good chance without going through a C style scalar representation first."
Those are edited parts from a longer chat; but yes, if you want to write algorithm X without wasting machine resources, that's a hard problem and takes a lot of low-level C/SIMD/CPU skill and time. Array languages can vector-accelerate their array primitives much more easily than C compilers can identify and vectorize arbitrary looping code.
You can read more interesting things about the implementation of BQN, comparison with Co-dfns, and performance here: https://mlochbaum.github.io/BQN/implementation/codfns.html
[1] https://www.arraycast.com/episode-52-transcript - around 00:39:21
Troels Henriksen (Futhark developer) pointed out to me that in expanding the state to make the scan associative I'd reinvented a fairly well-known method. The transformation from a specification as "maximum over the sums of each subarray" to the associative scan is very often used as an example of the power of the Bird-Meertens formalism or Squiggol[0], and some newer papers have demonstrated that it can be derived automatically, although not very quickly. Troels also wrote a simpler ISPC[1] implementation[2] that tested slower than C on an older CPU and faster on a newer one. Then I translated that to BQN and found it was about 4x slower than sequential C.
[0] https://en.wikipedia.org/wiki/Bird%E2%80%93Meertens_formalis...
[1] https://ispc.github.io/index.html
[2] https://gist.github.com/athas/f016084ea749602476b96c05ae415a...
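For the curious, the state expansion mentioned above is small enough to sketch in Python: the max-subarray ("Kadane") reduction becomes associative once each segment carries its total, best prefix, best suffix, and best overall sum, and that associativity is what lets an implementation split the work into blocks or vector lanes.

```python
from functools import reduce

# Expanded-state Kadane: each segment is summarized by four values
# (total sum, best prefix sum, best suffix sum, best subarray sum),
# and combining two adjacent segments is associative.

def lift(x):
    # A one-element segment: all four quantities are just x.
    return (x, x, x, x)

def combine(a, b):
    # Combine segment a followed by segment b.
    at, ap, asuf, ab = a
    bt, bp, bsuf, bb = b
    return (at + bt,                  # total of the concatenation
            max(ap, at + bp),         # best prefix: a's, or all of a plus b's prefix
            max(bsuf, bt + asuf),     # best suffix: b's, or all of b plus a's suffix
            max(ab, bb, asuf + bp))   # best subarray: in a, in b, or spanning the join

def max_subarray(xs):
    return reduce(combine, map(lift, xs))[3]

print(max_subarray([-2, 1, -3, 4, -1, 2, 1, -5, 4]))   # 6  (the subarray 4 -1 2 1)
```

Because combine is associative, a sequential reduce, a blocked reduce, or a parallel scan over these tuples all give the same answer.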
(To be clear, this is about the general case where you don't know exactly what P is. For the specific problem of finding primes smaller than N, you could obviously use sieves and such, but I'm more curious about the general pattern of problems like this.)
Having only had limited exposure to array languages, my understanding is that in general the way to do that would be to do something like:
1. Generate an array from 1 to N
2. Test the array against the predicate, getting a new array with 0s and 1s (a mask, essentially)
3. Apply that mask to the array to get only elements where the predicate is true
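Sketched in Python rather than an array language (the predicate here is just a stand-in), the three steps look like this, including the O(N) intermediates the question is about:

```python
# The three steps above, made explicit. Both `arr` and `mask` are
# full-size O(N) intermediates; that's the cost being asked about.
N = 20
pred = lambda x: x % 3 == 0                 # stand-in for "is prime" etc.

arr  = list(range(1, N + 1))                # 1. generate 1..N
mask = [1 if pred(x) else 0 for x in arr]   # 2. boolean mask, 0s and 1s
kept = [x for x, m in zip(arr, mask) if m]  # 3. compress arr by the mask

print(kept)   # [3, 6, 9, 12, 15, 18]
```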
Which is all fine, and I imagine that it takes a stupendously small amount of characters to do this. But my question is: what if N is large, and the predicate isn't true very often? Like, what if N is a billion or something? This is not an issue in the imperative/functional cases, because they don't have to generate the array in step 1, they can just loop/recurse/lazily generate the values (i.e. memory is O(1) for this). But it seems extraordinarily wasteful in array languages both of memory and resources to generate two different billion element arrays (the 1..N array and the mask) just to get out a handful of values.
That question is pretty specific, but this is one of my biggest questions about array languages in general. Aren't you always generating TONS of temporary arrays for no real reason, that will just pass through the calculation? Isn't that really slow? Or does the implementation somehow optimize this stuff away, maybe by using something like lazy evaluation?