Scalar languages have a different problem: the default is to process one value at a time, which wastes the potential parallelism that array languages exploit with SIMD algorithms. Because this is the status quo, it's easy to overlook as a concern. And the solution is again blocking: the best algorithm for SIMD-friendly work is generally a blocked array method.
In practice, whether an array language performs well really depends on the specific problem. Of course, for most practical uses performance doesn't matter at all; I believe k's reputation mostly comes from kdb being fast as a database rather than from k implementations being fast languages. But array languages can be surprisingly fast just by focusing on elegant array algorithms rather than machine-specific considerations. I have comments and benchmarks comparing with C here:
A language that can stay in L1 cache for the duration of a computation will run circles around one that explicitly computes and stores every intermediate value in full.
Also, array-based languages can easily hit the wall of system memory capacity whereas traditional code tends to be streaming and can handle unbounded input lengths.
Then there's the slightly simpler option of splitting the operation into chunks of a couple dozen kilobytes of input/output, so that unnecessary temporary memory usage is bounded; as far as I know no array language does this, but I hope to one day do something like it for CBQN. This is even something a user can do manually in an array language (and indeed often has to, to maximize performance).
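A minimal sketch of that chunking idea in Python (illustrative names only, no real array-language machinery): the temporaries are bounded by the chunk size instead of the full input length.

```python
# Hypothetical sketch of chunked evaluation: computing sum((a + b) ** 2)
# without materializing full-size temporaries. Illustrative only.

def chunked_sum_sq(a, b, chunk=8192):
    """Sum of (a[i] + b[i])**2, touching only `chunk` elements at a time."""
    total = 0
    for start in range(0, len(a), chunk):
        # These temporaries are bounded by `chunk`, not len(a).
        t = [x + y for x, y in zip(a[start:start + chunk], b[start:start + chunk])]
        total += sum(v * v for v in t)
    return total

a = list(range(100_000))
b = list(range(100_000))
assert chunked_sum_sq(a, b) == sum((x + y) ** 2 for x, y in zip(a, b))
```

The same trick in an actual array language means applying the whole expression to slices of the input and reducing the partial results.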
Close, they are called "idioms".
> +|x
The plus should be an asterisk, but yes, that is an idiom the ngn/k interpreter recognizes.
It's of course possible to do it in a different way to deal with those issues, but those solutions can be a bit longer and less pretty.
The APL dialect I'm working on (Kap) solves this problem in many cases by delaying the computation until the result is needed, which means that you can write the code using the straightforward approach but still avoid computing results that will be thrown away.
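A toy sketch in Python of that delay-until-needed idea (purely illustrative; Kap's actual mechanism is more sophisticated): the "array" records the computation and only evaluates the elements actually asked for.

```python
# Toy deferred-evaluation sketch: a "lazy array" stores a function and a
# length, and computes elements on demand instead of materializing them.
# (Illustrative only; not how Kap actually works internally.)

class LazyArray:
    def __init__(self, fn, n):
        self.fn, self.n = fn, n
    def __getitem__(self, i):
        return self.fn(i)    # computed on demand, never stored
    def __len__(self):
        return self.n

# A "billion-element array" that costs no memory until indexed:
squares = LazyArray(lambda i: i * i, 1_000_000_000)
print(squares[3], squares[999_999_999])
```

Only the two requested elements are ever computed; results that would be thrown away never exist.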
If the necessary temporary arrays are really huge, array programmers will just do the computation in chunks that fit comfortably in memory.
1. The primitive p: generates the nth prime number (zero-indexed), so p:0 generates the first prime number 2, and p:1 generates the 2nd prime number 3.
p:6 generates the 7th prime number, 17.
2. p: supports arrays, so p: 0 1 2 3 will print the prime numbers at indices 0, 1, 2 & 3, which are 2 3 5 7.
3. The primitive/verb i. generates a sequential list, so i. 5 will generate 0 1 2 3 4.
4. p: can consume the output of other verbs like i.
So to print the first 1,000 prime numbers, it is simply p: (i.1000)
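For readers who don't know J, here's a rough Python sketch of what p: (i.1000) computes (a naive trial-division generator, purely illustrative; J's p: is far cleverer):

```python
# Rough Python equivalent of J's `p: i. n`: the first n primes,
# i.e. the primes at indices 0 .. n-1. Naive trial division, for clarity.

def first_primes(n):
    primes = []
    candidate = 2
    while len(primes) < n:
        # candidate is prime if no earlier prime divides it
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes

print(first_primes(4))   # [2, 3, 5, 7]
```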
My question is: will J cleverly optimize away the i.1000 array? Is it lazily generated, like in Haskell? Or is this just one of those "we don't really care about the memory of the i.1000 array, it's usually small in practice"? Like, for high performance stuff, it seems relevant to me that you spend an extra O(n) memory for no real reason.
So you have an arbitrary list of numbers that may or may not be prime, and you want to know how many are prime? (Let me know if I'm wrong.)
1. Pick the highest number, H in the arbitrary list of 1000 numbers.
2. Generate all prime numbers up to the highest number, H.
3. Use the e. verb (also known as "member of") to check which items of the arbitrary list are prime: +/ (arbitrary list of 1000 numbers e. all primes up to H) will give you the number of items in the arbitrary list that are prime.
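The same membership approach sketched in Python (not J; the helper names here are made up):

```python
# Steps 1-3 above: find the max H, generate all primes up to H with a
# sieve, then count how many items of the list are members of that set.
# This mirrors J's  +/ (list e. primes)  shape of the computation.

def primes_up_to(h):
    sieve = [True] * (h + 1)
    sieve[0:2] = [False, False]
    for i in range(2, int(h ** 0.5) + 1):
        if sieve[i]:
            sieve[i*i::i] = [False] * len(sieve[i*i::i])
    return {i for i, is_p in enumerate(sieve) if is_p}

def count_primes(xs):
    table = primes_up_to(max(xs))       # steps 1 & 2: primes up to H
    return sum(x in table for x in xs)  # step 3: membership, then sum

print(count_primes([4, 7, 10, 13, 15]))   # 2  (7 and 13)
```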
> will J cleverly optimize away the i.1000 array?
The primitives & verbs in array languages are designed to be as fast and efficient as possible.
So even if your code is mediocre, it will perform better than any mediocre imperative code.
You're right, but it's not just code golf; ideally the few characters are a clean, high-level way to express intent, and that is in tension with caring how a machine executes the algorithm most quickly.
> "This is not an issue in the imperative/functional cases, because they don't have to generate the array in step 1, they can just loop/recurse/lazily generate the values (i.e. memory is O(1) for this)"
It kind of is the same problem, just pushed down a level or two. Writing it casually in Python will not get you the fastest performance either: you should be working in C or you're wasting a lot of potential running the Python layer; then care about how you parallelize the computation or you're wasting 7/8ths of an 8-core processor; then care about making the best use of the CPU's SIMD instructions or you're wasting another half or more of the machine's potential; then branch predictors, caches, etc. Saying a loop written in Python is "not wasteful" but materializing a large array "is wasteful" draws a fairly arbitrary line for the industry. The ArrayCast podcast episode 52 touches on this[1]; I've cut/edited some relevant parts from the transcript below. [ML] is Marshall Lochbaum, who did performance optimization work on Dyalog APL and designs/implements the BQN array language, and [CH] is Conor Hoekstra, who hosts the podcast and works in NVIDIA research:
> ML: "using linear memory is just how array languages work. For any problem, pretty much, you're going to have to make a bunch of new arrays."
> CH: "there could be an array language that avoids [materializing] arrays when possible."
> ML: "it's going very much against the grain of the language to say, "All right, I've specified my answer in these high-level array terms, and now I want you to turn it into a C program for me"."
> ML: "it's still nice to specify a problem this way, but this array form for specifying gives you some pretty big advantages. [...] if you try to pack that all into a big iteration, your algorithm is no longer expressed as an array operation - and these array operations are things that we know how to do really quickly. You are giving up some performance information if you tell it, well, yes, I'm in an array language, but don't actually make me any arrays."
> CH: "my dream is that I want to be able to write like the most expressive solutions to problems and then have [the array language implementation] do the most performant thing. Like for Kadane's [algorithm], for example, the most performant thing is to hand roll that reduction yourself. It's going to be faster than materialize."
> ML: "I'm not convinced of that"
> CH: [where I work we have to optimize for teams working on large problems, recently had a discussion where 2 billion items is a small number]
> ML: "2 billion /is/ a small number"
> ML: "what you can do, even when you have an array algorithm, is split it into smaller arrays. This is often a lot better because you get to use vector operations with these. So for Kadane's algorithm in particular, I don't know how to express that purely with vector operations, but I think there probably is a way. And in that case, if you write it in C style, where you interleave scan and reduction, then it's much harder to go from that to a vectorized algorithm, which (if it exists) would almost definitely be the fastest way. The way you would get the array thing to be cache friendly is that you run it in blocks. And then within a block, it's doing a bunch of vector operations, but what you really want is, you know, working on two vector registers, say, at a time. The array language doesn't automatically chunk, but it's not that hard"
> ML: "Yeah, and I think the way for the implementation to get the best [performance] - maybe not with this particular problem - but definitely for things that are friendlier to arrays, where you don't have any compound functions inside scans. The way to optimize those is not to immediately break it down into a series of scalar operations, but instead to be more careful and start with your whole-array stuff and break that down as necessary, and maybe even compile these array operations into operations on individual registers, which is hard. Nobody's really done that, but coming from the other side, from C, there have been who knows how many man-hours poured into work on auto-vectorization. And it's still pretty terrible. It would have absolutely no chance of handling something like Kadane's algorithm."
> ML: "getting the best implementation of an idea is pretty difficult. But I think actually starting from an array representation, you do have a pretty good chance without going through a C style scalar representation first."
Those are edited parts from a longer chat; but yes, if you want to write algorithm X without wasting machine resources, that's a hard problem and takes a lot of low-level C/SIMD/CPU skill and time. Array languages can vector-accelerate their array primitives much more easily than C compilers can identify and vectorize arbitrary looping code.
You can read more interesting things about the implementation of BQN, comparison with Co-dfns, and performance here: https://mlochbaum.github.io/BQN/implementation/codfns.html
[1] https://www.arraycast.com/episode-52-transcript - around 00:39:21
Troels Henriksen (Futhark developer) pointed out to me that in expanding the state to make the scan associative I'd reinvented a fairly well-known method. The transformation from a specification as "maximum over the sums of each subarray" to the associative scan is very often used as an example of the power of the Bird-Meertens formalism or Squiggol[0], and some newer papers have demonstrated that it can be derived automatically, although not very quickly. Troels also wrote a simpler ISPC[1] implementation[2] that tested slower than C on an older CPU and faster on a newer one. Then I translated that to BQN and found it was about 4x slower than sequential C.
[0] https://en.wikipedia.org/wiki/Bird%E2%80%93Meertens_formalis...
[1] https://ispc.github.io/index.html
[2] https://gist.github.com/athas/f016084ea749602476b96c05ae415a...
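For the curious, the state expansion mentioned above is small enough to sketch in Python: the max-subarray ("Kadane") reduction becomes associative once each segment carries its total, best prefix, best suffix, and best overall sum, and that associativity is what lets an implementation split the work into blocks or vector lanes.

```python
from functools import reduce

# Expanded-state Kadane: each segment is summarized by four values
# (total sum, best prefix sum, best suffix sum, best subarray sum),
# and combining two adjacent segments is associative.

def lift(x):
    # A one-element segment: all four quantities are just x.
    return (x, x, x, x)

def combine(a, b):
    # Combine segment a followed by segment b.
    at, ap, asuf, ab = a
    bt, bp, bsuf, bb = b
    return (at + bt,                  # total of the concatenation
            max(ap, at + bp),         # best prefix: a's, or all of a plus b's prefix
            max(bsuf, bt + asuf),     # best suffix: b's, or all of b plus a's suffix
            max(ab, bb, asuf + bp))   # best subarray: in a, in b, or spanning the join

def max_subarray(xs):
    return reduce(combine, map(lift, xs))[3]

print(max_subarray([-2, 1, -3, 4, -1, 2, 1, -5, 4]))   # 6  (the subarray 4 -1 2 1)
```

Because combine is associative, a sequential reduce, a blocked reduce, or a parallel scan over these tuples all give the same answer.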
(To be clear, this is about the general case where you don't know exactly what P is. For the specific problem of finding primes smaller than N, you could obviously use sieves and such, but I'm more curious about the general pattern of problems like this.)
Having only had limited exposure to array languages, my understanding is that in general the way to do that would be to do something like:
1. Generate an array from 1 to N
2. Test the array against the predicate, getting a new array with 0s and 1s (a mask, essentially)
3. Apply that mask to the array to get only elements where the predicate is true
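Sketched in Python rather than an array language (the predicate here is just a stand-in), the three steps look like this, including the O(N) intermediates the question is about:

```python
# The three steps above, made explicit. Both `arr` and `mask` are
# full-size O(N) intermediates; that's the cost being asked about.
N = 20
pred = lambda x: x % 3 == 0                 # stand-in for "is prime" etc.

arr  = list(range(1, N + 1))                # 1. generate 1..N
mask = [1 if pred(x) else 0 for x in arr]   # 2. boolean mask, 0s and 1s
kept = [x for x, m in zip(arr, mask) if m]  # 3. compress arr by the mask

print(kept)   # [3, 6, 9, 12, 15, 18]
```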
Which is all fine, and I imagine that it takes a stupendously small amount of characters to do this. But my question is: what if N is large, and the predicate isn't true very often? Like, what if N is a billion or something? This is not an issue in the imperative/functional cases, because they don't have to generate the array in step 1, they can just loop/recurse/lazily generate the values (i.e. memory is O(1) for this). But it seems extraordinarily wasteful in array languages both of memory and resources to generate two different billion element arrays (the 1..N array and the mask) just to get out a handful of values.
That question is pretty specific, but this is one of my biggest questions about array languages in general. Aren't you always generating TONS of temporary arrays for no real reason, that will just pass through the calculation? Isn't that really slow? Or does the implementation somehow optimize this stuff away, maybe by using something like lazy evaluation?