https://aidnn.ai
- This is roughly what my startup is doing, automating financials.
We didn't pick this because it was super technical, but because the finance team is the team closest to the CEO which is both overstaffed and overworked at the same time - you have 3-4 days of crunch time for which you retain 6 people to get it done fast.
This was the org which had extremely methodical, smart people who constantly told us "We'll buy anything which means I'm not editing spreadsheets during my kid's gymnastics class".
The trouble is that the UI each customer wants has zero overlap with the others - if we actually added a drop-down for each special thing one person wanted, this would look like a cockpit & no new customer would be able to do anything with it.
The AI bit is really making the required interface complexity invisible (but also hard to discover).
In a world where OpenAI is Intel and Anthropic is AMD, we're working on a new Excel.
However, to build something here you need a high-quality, message-passing, co-operatively multi-tasking AI kernel & you have to sort of optimize your L1 caches ("context") well.
- > Well, if you don't fsync, you'll go fast, but you'll go even faster piping customer data to /dev/null, too.
The trouble is that you need to specifically optimize for fsyncs, because usually it is either no brakes or hand-brake.
The middle-ground of multi-transaction group-commit fsync seems to not exist anymore because of SSDs and massive IOPS you can pull off in general, but now it is about syscall context switches.
Two minutes is a bit too much (also fdatasync vs fsync).
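The group-commit middle ground is easy to sketch - batch records, pay for one sync per batch instead of one per record. A Python sketch (file name and record layout are made up; os.fdatasync is not available on every platform):

```python
import os
import tempfile

def group_commit(fd, records):
    # Batch several writes, then pay for a single sync syscall: the
    # middle ground between "no brakes" (never sync) and "hand-brake"
    # (sync after every record).
    for rec in records:
        os.write(fd, rec)
    # fdatasync flushes file data but not metadata like mtime, so it
    # is usually cheaper than a full fsync.
    os.fdatasync(fd)

path = tempfile.mkstemp()[1]
fd = os.open(path, os.O_WRONLY | os.O_APPEND)
group_commit(fd, [b"txn-1\n", b"txn-2\n", b"txn-3\n"])
os.close(fd)
```

The syscall-context-switch cost shows up in the one fdatasync per batch, not per record.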
- > produce something as high-quality as GoT
Netflix is a different creature because of streaming and time shifting.
They don't care whether people watch a pilot episode or binge watch the last 3 seasons when a show takes off.
The quality metric therefore is all over the place, it is a mildly moderated popularity contest.
If people watch "Love is Blind", you'll get more of those.
On the other hand, this means they can take a slightly bigger risk than a TV network with ads, because you're more likely to switch to a different Netflix show that you like and continue to pay for it than to switch to a different channel which pays a different TV network.
As long as something sticks, the revenue numbers stay - the ROI on any single show can be shaky.
Black Mirror Bandersnatch for example was impossible to do on TV, but Netflix could do it.
Also, if GoT had been a Netflix show, they'd have cancelled it after Season 6 & we'd be lamenting the loss of whatever wonders it'd have gotten to by Season 9.
- > For double/bigint joins that leads to observable differences between joins and plain comparisons, which is very bad.
This was one of the bigger hidden performance issues when I was working on Hive - the default coercion goes to Double, which has a bad hash code implementation [1] that causes joins to cluster & chain, so every miss on the hashtable probed that many slots away from the original index.
The hashCode itself was smeared so that values within machine epsilon hash to the same bucket, letting .equals do its join, but all of this really messed up the folks who needed 22-digit numeric keys (eventually the Decimal implementation handled it by adding a big fixed integer).
Double join keys were one of the red flags in a SQL query - mostly, if you see one, someone messed something up.
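You can see why the wide numeric keys broke in two lines of Python - a double only carries 53 mantissa bits, so 22-digit keys silently collapse onto the same value after coercion:

```python
# Two distinct bigint keys...
a = 10**22
b = 10**22 + 1
assert a != b                # plain integer comparison sees two keys

# ...become the same join key once coerced to double, because a
# 64-bit float only has 53 bits of mantissa.
assert float(a) == float(b)
```

That's the "observable difference between joins and plain comparisons" in miniature.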
- > trauma that our parents, or grandparents experienced could lead to behavior modifications and poorer outcomes in us
The nurture part of it is already well established, this is the nature part of it.
However, this is not a net-positive for the folks who already discriminate.
The "faults in our genes" thinking assumes that this is not redeemable by policy changes, so it goes back to eugenics and usually suggests cutting such people out of the gene pool.
The "better nurture" proponents for the next generation (free school lunches, early intervention and magnet schools) will now have to swim up this waterfall before arguing more investment into the uplifting traumatized populations.
We need to believe that Change (with a capital C) is possible right away if start right now.
- > Can you build a Linux version? :-)
Generally speaking, it is the hardware not the OS that makes it easier to build for Macs right now.
Apple Neural Engine is a sleeping giant, in the middle of all this.
- > would a fixed line in India typically be above that speed?
My family lives outside of a tier 2 city border, in what used to be farmland in the 90s.
They have Asianet FTTH at 1Gbps, but most of the video/streaming traffic ends at the CDN hosts in the same city.
That CDN push to the edge is why Hotstar is faster to load there - the latency on seeks isn't going around the planet.
- The useful part is that duckdb is so easy to use as a client with an embedded server, because duckdb is a great client (+ a library).
Similar to how git can serve a repo from a simple http server with no git installed on that (git update-server-info).
The frozen part is what iceberg promised in the beginning, away from Hive's mutable metastore.
Point to a manifest file + parquet/orc & all you need to query it is S3 API calls (there is no metadata/table server, the server is the client).
> Creating and publishing a Frozen DuckLake with about 11 billion rows, stored in 4,030 S3-based Parquet files took about 22 minutes on my MacBook
Hard to pin down how much of it is CPU and how much is IO from s3, but doing something like HLL over all the columns + rows is pretty heavy on the CPU.
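For a sense of where the CPU goes, here's a minimal HyperLogLog sketch (register count and hash choice are arbitrary; real implementations add small-range bias corrections):

```python
import hashlib

def hll_estimate(items, p=12):
    # One 2^p-register HyperLogLog. The per-row work - a hash plus a
    # leading-zero count, per column - is what chews up the CPU when
    # you run this over billions of rows.
    m = 1 << p
    regs = [0] * m
    for it in items:
        h = int.from_bytes(hashlib.sha1(it.encode()).digest()[:8], "big")
        idx = h >> (64 - p)                      # first p bits pick a register
        rest = h & ((1 << (64 - p)) - 1)
        rank = (64 - p) - rest.bit_length() + 1  # leading zeros + 1
        regs[idx] = max(regs[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / m)             # bias constant for large m
    return alpha * m * m / sum(2.0 ** -r for r in regs)

est = hll_estimate(str(i) for i in range(100_000))
```

100k distinct values come back within a few percent, at the price of one hash per value per column.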
- > will try to learn more about normal sockets to see if I could perhaps make them work with the app.
There's a whole skit in the vein of "What have the Romans ever done for us?" about ZeroMQ[1] which has probably been lost to the search index by now.
As someone who has held a socket wrench before and fought tcp_cork and dsack, I'd say WebSockets isn't a bad abstraction to sit on top of, especially if you are intending to throw TLS in there anyway.
Low-level sockets are like assembly: you can use them, but it is a whole box of complexity (you might still use them completely raw sometimes, like the tickle ACK in the ctdb[2] implementation).
- > I had a friend who would drink a gallon of whole milk a day to maintain weight because he did so much at the gym.
That honestly might be an absorption issue, not an intake issue - you can hit aerobic limits enough for your body to skip digesting stuff & just shove protein directly out of the stomach instead of bothering to break it down.
My experience with this was a brief high altitude climb above 5km in the sky, where eating eggs & ramen stopped working and only glucon-d kept me out of it.
The way I like to think of it is that the fat in your body can be eaten or drunk, but needs to be breathed out as CO2 to leave it.
The rate at which you can put it in and the rate of letting it go are completely different.
- UUIDv7 is only bad for range partitioning and privacy concerns.
The "naturally sortable" is a good thing for postgres and for most people who want to use UUID, because there is no sorted distribution buckets where the last bucket always grows when inserting.
I want to see something like HBase or S3 paths when UUIDv7 gets used.
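The sortability is easy to see from the layout - a hand-rolled sketch of a UUIDv7-ish value (hypothetical, version/variant bits omitted, so this is not RFC-conformant; stdlib uuid7 support is very recent):

```python
import os
import time

def uuid7_like():
    # UUIDv7-style value: a 48-bit unix-millisecond timestamp up
    # front, random bits behind it, so the fixed-width hex form
    # sorts by creation time.
    ts = int(time.time() * 1000) & ((1 << 48) - 1)
    rand = int.from_bytes(os.urandom(10), "big")
    return f"{(ts << 80) | rand:032x}"

a = uuid7_like()
time.sleep(0.005)
b = uuid7_like()
# a sorts before b because the timestamp prefix dominates
```

That same timestamp prefix is exactly what makes the last range partition take all the inserts.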
- > are array languages competitive with something like C or Fortran
The REPL is what matters - while also being performant.
Someone asks you a question, you write something, you run it and say an answer, the next question is asked etc.
I've seen these tools be invaluable in that model, over the "write software, compile and run a thousand times" problems which C/Fortran live in.
- > but it claims to be a tutorial.
It is, but only for someone who wants to do JIT work without writing assembly code, while being able to read assembly code back into C (or automate that part).
Instead of doing all the register allocation manually in the JIT, you get to fill in the blanks with the actual inputs after a (maybe) more diligent compiler has allocated the registers, pushed them and all that.
There's a similar set of implementation techniques in Apache Impala, where the JIT only invokes the library functions when generating JIT code, instead of writing inline JIT operations, so that they can rely on shorter compile times for the JIT and deeper optimization passes for the called functions.
- It is a meme, but it's always DNS
This error can happen if there's an AAAA record, but it contains the ipv4 address packed inside an ipv6 mask.
If the AAAA record says ::ffff:10.0.0.105, then you can either fix DNS or do what's in the blog, which is to stop checking for AAAA records.
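Python's stdlib can detect that case directly - `ipaddress` knows about v4-mapped addresses (the address below is the one from the comment; the rest is a sketch):

```python
import ipaddress

# A v4-mapped address smuggled inside an AAAA answer:
addr = ipaddress.ip_address("::ffff:10.0.0.105")
# ipv4_mapped recovers the embedded IPv4 address,
# and is None for a genuine IPv6 address.
v4 = addr.ipv4_mapped

real = ipaddress.ip_address("2001:db8::1")
```

So instead of dropping AAAA lookups entirely, you can unwrap the mapped address and connect over v4.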
- > would've prevented technically competent leadership from testing customer hostile business decisions?
Technically competent doesn't always mean empathetic.
The decisions can sometimes look like the xkcd cartoon about scientists[1].
[1] - https://xkcd.com/242/
- > Clearly useful to people who are already competent developers
> Utterly useless to people who have no clue what they're doing
> the same way that a fighter jet is not useless
AI is currently like a bicycle, while we were all running hills before.
There's a skill barrier, and it is getting less complicated each week.
The marketing goal is to say "Push the pedal and it goes!" like it was a car on a highway, but it is a bicycle - you have to keep pedaling.
The effect on the skilled-in-something-else folks is where this is making a difference.
If you were training for running, the goal was to strengthen your tendons to handle the pavement - and a 2-hour marathon pace is almost impossible to hit.
A bicycle makes a sub-2-hour marathon distance "easy" for someone who does competitive rowing, while remaining impossible for those who have been training for foot races forever.
That's because the bicycle moves the problem away from unsprung weight and energy recovery into a VO2-max problem, and also into a novel aerodynamics problem.
And if you need to walk a rock garden, now you have to lug the bike along with you - it is not without its costs.
This AI thing is a bicycle for the mind, but a lot of people go only downhill and with no brakes.
- > if you are into load balancing, you might also want to look into the 'power of 2 choices'.
You can do that better if you don't use a random number for the choice - instead flip a coin (well, check a bit of a hash of the hash) to make sure hash expansion works well.
This trick means that when you go from N to N+1 buckets, all the keys that move land in the new (N+1)th bucket instead of being rearranged across all of them.
I saw this two decades ago and, after seeing your comment, felt like getting Claude to recreate what I remembered from back then & write a fake paper [1] out of it.
See the MSB bit in the implementation.
That said, consistent hashes can split ranges by traffic not popularity, so back when I worked in this, the Membase protocol used capacity & traffic load to split the virtual buckets across real machines.
Hot partition rebalancing is hard with a fixed algorithm.
[1] - https://github.com/t3rmin4t0r/magic-partitioning/blob/main/M...
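That coin-flip-per-expansion idea is what Google's jump consistent hash (Lamping & Veach) pins down - a sketch of their algorithm, not the code from the repo above:

```python
def jump_hash(key: int, num_buckets: int) -> int:
    # Jump consistent hash: each step draws a "coin flip" from a
    # keyed LCG to decide whether the key jumps to a later bucket.
    # Going from N to N+1 buckets only ever moves keys into the new
    # bucket N - nothing gets rearranged among the old buckets.
    b, j = -1, 0
    while j < num_buckets:
        b = j
        key = (key * 2862933555777941757 + 1) & 0xFFFFFFFFFFFFFFFF
        j = int((b + 1) * (1 << 31) / ((key >> 33) + 1))
    return b
```

The fixed-algorithm limitation holds here too: it balances key counts, not per-key traffic, so hot partitions still need something stateful on top.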
- > Is the big win that you can send around custom little vector embedding databases with a built in sandbox?
No, this is a compatibility layer for future encoding changes.
For example, ORCv2 has never shipped because we tried to bundle all the new features into a new format version, ship all the writers with the features disabled, then ship all the readers with support and then finally flip the writers to write the new format.
Specifically, there was a new flipped bit version of float encoding which sent the exponent, mantissa and sign as integers for maximum compression - this would've been so much easier to ship if I could ship a wasm shim with the new file and skip the year+ wait for all readers to support it.
We'd have made progress with the format, but we'd also be able to deprecate a reader impl in code without losing compatibility if the older files carried their own information.
Today, something like Spark's variant type would benefit from this - the sub-columnarization that it does would be so much easier to ship as bytecode instead of as an interpreter that contains support for all possible recombinations of split-up columns.
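The flipped-bit float trick is simple to sketch in Python - split float32s into three homogeneous integer streams that compress better than interleaved raw bits (function names are mine, not the actual ORC patch):

```python
import struct

def split_floats(values):
    # Writer side: pull each IEEE-754 float32 apart into sign,
    # exponent and mantissa integers, stored as three separate
    # streams for the compressor.
    signs, exps, mants = [], [], []
    for v in values:
        bits = struct.unpack("<I", struct.pack("<f", v))[0]
        signs.append(bits >> 31)
        exps.append((bits >> 23) & 0xFF)
        mants.append(bits & 0x7FFFFF)
    return signs, exps, mants

def join_floats(signs, exps, mants):
    # Reader side: reassemble the exact bit pattern.
    return [
        struct.unpack("<f", struct.pack("<I", (s << 31) | (e << 23) | m))[0]
        for s, e, m in zip(signs, exps, mants)
    ]
```

Shipping the `join_floats` half as a wasm shim with the file is the part that would have skipped the year-long reader rollout.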
PS: having spent a lot of nights tweaking tpc-h with ORC and fixing OOMs in the writer, it warms my heart to see it sort of hold up those bits in the benchmark
- > Python is never the best language to do it in, but is almost always the second-best language to do it in.
I've been writing python since the last century and this year is the first time I'm writing production-quality python code; everything up to this point has been first-cut prototypes or utility scripts.
The real reason why it has stuck with me while others came and went is the REPL-first attitude.
A question like

    >>> 0.2 + 0.1 > 0.3
    True

is much harder to demonstrate in other languages.
The REPL isn't just for the code you typed out - it also allows you to import and run your lib functions locally to verify a question you have.
It is not without its craziness with decorators, fancy inheritance[1] or operator precedence[2], but you don't have to use it if you don't want to.
[1] - __subclasshook__ is crazy, right?
[2] - you can abuse __ror__ like this https://notmysock.org/blog/hacks/pypes
- We can optimize the hash function to make it more space efficient.
Instead of using remainders to locate filter positions, we can use a Mersenne prime mask (like, say, 31), but in this case I have a feeling the best hash function to use would be to mask with (2^1)-1.