This is a demo of our recent work presented at Oakland (IEEE S&P): https://eprint.iacr.org/2022/368. The server and client code are written in Rust and available here: https://github.com/menonsamir/spiral-rs. The general aim of our work is to show that homomorphic encryption is practical today for real-world applications. The server we use to serve this costs $35/month!
A quick overview: the client uses homomorphic encryption to encrypt the article number that they would like to retrieve. The server processes the query and produces an encrypted result containing the desired article, and sends this back to the client, who can decrypt and obtain the article. A malicious server is unable to determine which article the client retrieved. All search and autocomplete is done locally. The technical details are in the paper, but the high-level summary is that the client creates a large one-hot vector of encrypted bits (0’s except for the index of the desired article, where they place a 1) and then the server computes something like a ‘homomorphic dot product’ between the query and the plaintext articles.
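Setting the encryption aside, the selection step is just a dot product with a one-hot vector. A toy, unencrypted Python sketch (function names are mine, not from the paper):

```python
def one_hot(index, length):
    """Query vector: 1 at the desired position, 0 everywhere else."""
    return [1 if i == index else 0 for i in range(length)]

def pir_dot_product(query, articles):
    """Server side: dot product of the query with the article list.
    In the real scheme the query entries are ciphertexts, so the server
    computes this without ever learning which entry holds the 1."""
    return sum(q * a for q, a in zip(query, articles))

articles = [1001, 1002, 1003, 1004]   # articles encoded as numbers
query = one_hot(2, len(articles))     # client wants the article at index 2
assert pir_dot_product(query, articles) == 1003
```

The whole trick of the paper is doing this same arithmetic while the query bits stay encrypted.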
I’d like to caveat that this is an in-browser demo to show it is practical to use homomorphic encryption at this scale. As a real product, you’d probably want to distribute a signed client executable (or Electron app) since otherwise, a malicious server could simply deliver bad client JS on the fly.
Happy to answer any questions!
(And this is not a criticism; this is a compliment. You start so far behind the eight-ball with homomorphic encryption with regard to the resources it consumes I wasn't convinced it was ever going to be even remotely useful for much of anything. Precisely because I was so skeptical, I am that impressed to see something work this well. It's not the fastest Wikipedia mirror, but... honestly... I've been on slower websites! Websites with far less excuse.)
One reason homomorphic encryption has a reputation for being absurdly slow is that people typically talk about “fully” homomorphic encryption, which essentially means that you can compute an arbitrarily large function on the encrypted data. This involves a very expensive process called bootstrapping, which incurs a cost of roughly 15 ms per binary operation evaluated. As you can imagine, that adds up.
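To see how quickly that adds up, a back-of-the-envelope calculation (circuit sizes are hypothetical, the 15 ms figure is from the comment above):

```python
ms_per_gate = 15  # rough bootstrapping cost per binary gate

for gates in (1_000, 1_000_000):
    seconds = gates * ms_per_gate / 1000
    print(f"{gates:>9,} gates -> {seconds:,.0f} s")

# Even 1,000 gates is already 15 seconds of compute;
# a million-gate circuit would take over four hours.
```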
This work uses “leveled” homomorphic encryption, where we only perform a function of ‘bounded’ size (that is, the homomorphic dot product). So, we do not have to perform bootstrapping, and thus avoid some of the more extreme costs.
The other reason this work is practical is that we ‘tailored’ our construction to the particular problem of private information retrieval. When people try to apply homomorphic encryption generically to their problem, they typically end up with disappointingly slow and expensive results. Cool work on better ‘FHE compilers’ is in progress, so hopefully that will also help!
What is the meaning of the word "large" here? How are we defining the size of a function?
I'm also working closely with a team at $WORK that's using a protocol very similar to Apple's, but not doing CSAM detection. We are seeing some severe pushback on this technology. I wouldn't be surprised if there are multiple homomorphic-encryption-based products at Big Tech that have yet to see the light of day.
[1] https://www.hackerneue.com/item?id=28223141
> This demo allows private access to 6 GB (~30%) of English Wikipedia. In theory, even if the server is malicious, it will be unable to learn which articles you request. All article title searches are performed locally, and no images are available.
In this demo, the number of article titles is relatively small – a few million – and enumerable.
If the server is truly malicious, and it issues itself requests for every known title, does it remain true that this "Private Information Retrieval" (PIR) scheme still gives it no hints that subsequent requests from others for individual articles retrieve particular data?
(Presumably: every request touches every byte of the same full 6GB of data, and involves every such byte in constant-run-time calculations that vary per request, and thus have the effect of returning only what each request wanted – but not at all in any way correlatable with other requests for the exact same article, from the same or different clients?)
Thank you, these 4 words really helped with my understanding, so I'm calling it out in case it helps others. I was wondering: what prevents you from replaying the query and getting the same page back? But it seems the answer is: that would only produce a gibberish response, because you don't have the key.
The mitigations for this are somewhat heuristic; the client should perhaps 'pace'/'debounce' subsequent requests when a retrieval fails. For example, enforce that the re-retrieval will happen at a uniformly random time between 1-5 seconds later.
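A minimal sketch of that pacing idea (a hypothetical helper, not code from the project; the real client would wire this into its query path):

```python
import random
import time

def paced_retry(fetch, min_delay=1.0, max_delay=5.0, attempts=3):
    """Retry a failed retrieval after a uniformly random delay in
    [min_delay, max_delay] seconds, so the timing of the re-query
    reveals less about why the first attempt failed."""
    for _ in range(attempts):
        result = fetch()
        if result is not None:
            return result
        time.sleep(random.uniform(min_delay, max_delay))
    return None
```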
It's good to note that this kind of attack is somewhat unavoidable, since denial-of-service is (generally) always possible. An interesting mitigation might be for the server to produce a ZKP that the full dot product was computed correctly, and have the client always check the ZKP before displaying the article. This is mostly a theoretical fix for now, I believe, since ZKPs over homomorphic computation are still in the realm of theory more than practice.
Edit: this assumes that the client gets a trusted index from a set of trusted servers who are at least as up to date as the latest index that is made available, which timestamped signatures or similar should suffice for. Or the client can supply the timestamped index signature and the server can reply with a different message if it's too old.
1. Record what item was retrieved from disk for a query
2. Run a dictionary through the query system, and see which item matches the record
Would it be feasible to add some other zero knowledge proof to this that would confirm a user has paid a subscription for access? For example, if this were a news site, the user would have to prove a valid subscription to read articles, but the site would not be able to know which articles any subscriber decided to read?
If that is possible, what could the site do to prevent a paying subscriber from sharing their access with an unreasonable number of others? Would it be possible to impose a rate limit per subscriber?
> With a proper implementation of PIR, the server still needs to scan through the entire encrypted dataset (this is unavoidable, otherwise its I/O patterns would leak information)
Is this technique therefore practical only when the server side dataset is relatively small (or full scans for every query are tolerable)?
(edit: sorry, misattributed the quote)
Indeed, except in some (exciting!) new theoretical constructions, server work is always linear in the database size.
However, I’d emphasize that our work shows that the constant factor on this linear operation can be quite low. We process at 300MB/s to 1.9GB/s on a single core, which is fast enough for relatively large databases. Remember that the computation is embarrassingly parallel, so you can always throw more compute at larger databases. To summarize, we think the speed is now fast enough that it really can be feasible to scan the whole database to answer a single query.
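To put those throughput numbers in context, rough arithmetic for this demo's 6 GB database (throughput figures are from the comment above; the core counts are hypothetical):

```python
db_gb = 6  # size of the demo's Wikipedia snapshot

for rate_gb_s in (0.3, 1.9):       # reported single-core throughput range
    for cores in (1, 8, 32):       # hypothetical machine sizes
        seconds = db_gb / (rate_gb_s * cores)
        print(f"{rate_gb_s} GB/s x {cores:>2} cores -> {seconds:6.2f} s/query")

# Worst case (0.3 GB/s on one core) is a ~20 s scan; at 1.9 GB/s
# across 32 cores the full-database scan drops to ~0.1 s.
```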
Sorry if this is too much of a tangent, but I would love to know what these are!
The key idea is that the client issues a query that is independent of what they really want. This is the “offline” query. This query requires linear server work (unavoidable), and the client gets some hints as a result.
Then later, the client can issue one or more queries for elements they want and those queries can be sublinear in terms of computation. The client uses hints to make that happen.
Some papers on this are:
https://eprint.iacr.org/2019/1075.pdf
https://eprint.iacr.org/2021/1438
https://eprint.iacr.org/2022/081.pdf
And there are clever ways you can make a system like this "scale" even if the overall dataset size is limited. For instance, the authors cite another interesting paper [1] that uses a similar technique to build a fully-private voice chat system. The basic idea seems to be that you build a "database" consisting of the most recent snippet of audio from every ongoing conversation, and let each client privately query for the segment that's addressed to it. And every fraction of a second when new audio data arrives, you just throw away the old database and build a new one, so the amount of data doesn't depend on the length of the conversation.
Even if this doesn't scale to arbitrary numbers of users, it could still be used to run a "cell" of a few thousand people, in which it's not possible for an adversary to reconstruct communication patterns within the cell.
[1]: https://www.usenix.org/conference/osdi21/presentation/ahmad
Do you think that people would want private DNS? I suppose it would still be an improvement over what we have today, but I’m not sure that it will make it meaningfully harder for ISPs to collect and sell data to advertisers.
Regardless, a person today has a choice of which DNS server to use but they all could track the requests made. Tracking site visits via IP is a different link in that chain.
Would people pay? I don't know, but I could see it being a feature used to differentiate a VPN service from its competitors.
Google’s CT design docs also state PIR Audits as the long term goal, so this would be a good area to focus on. https://docs.google.com/document/d/1G1Jy8LJgSqJ-B673GnTYIG4b...
This sounds like magic :O. How does it behave when new articles (elements) are added, does it need to rebuild the whole database and distribute new parameters?
I wonder how practical it would be for clients to synchronize content without the server being able to deduce the client's synchronization state.
Parameters only need to be changed based on the number of items in the database (not the content of the items). Also, they don’t really need to be modified as long as we are within the same power of two number of items. So, I think client and server agreement on parameters seems feasible. Right now, it’s just hard coded :)
A trivial solution to this problem would be for the client to just download a copy of the entire Wikipedia database and pick out the particular article it's interested in. With a proper implementation of PIR, the server still needs to scan through the entire encrypted dataset (this is unavoidable, otherwise its I/O patterns would leak information) but the amount of information that needs to be sent to the client is much smaller.
Not high, which is why it might not be working well for folks right now…
Time scales almost perfectly linearly with cores, since the computation is embarrassingly parallel.
In terms of cost, we’re still talking only 18 CPU-seconds and ~300KB of outgoing bandwidth per query, which is not a ton at today’s prices.
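Plugging those figures into some ballpark cloud prices (the prices below are my assumptions, not from the thread):

```python
# Assumed rates -- hypothetical ballpark cloud pricing, adjust to taste:
usd_per_cpu_hour = 0.04
usd_per_gb_egress = 0.09

cpu_seconds = 18     # per-query compute, from the comment above
egress_kb = 300      # per-query response size, from the comment above

compute_cost = cpu_seconds / 3600 * usd_per_cpu_hour
egress_cost = egress_kb / 1e6 * usd_per_gb_egress   # KB -> GB
print(f"compute ~${compute_cost:.5f}, egress ~${egress_cost:.6f} per query")

# At these rates, both components are well under a tenth of a cent.
```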
This doesn't seem like a particularly compelling application – can you give some practical problems that homomorphic encryption solves? I've always heard vote counting as the example.
As far as more compelling applications, several folks have suggested DNS, OCSP, and certificate transparency as potential examples. Using PIR as a building block, it is also possible to build more complex systems, like metadata-private messaging (systems like ‘Pung’) and even private voice calling (systems like ‘Addra’).
For arbitrary key-value retrieval, a hashing scheme would work pretty well, modulo some imbalances that will occur.
If the server needs to go pull the article from Wikipedia, how is it blind to which one is being requested?
If you've pre-seeded the server with an encrypted 30% of Wikipedia, how can I trust you haven't retained information that would enable you to derive what I requested?
The only way I understand this works is if the client itself seeded the encrypted data in the first place (or at least an encrypted index if all the server pushes back is article numbers).
Maybe I'm ignorant of something; if so thanks for ELI5.
With homomorphic encryption, the client sends a series of encrypted numbers. Nobody can decrypt them except the client. The server can do arithmetic with them, making new secret numbers that nobody can decrypt except the client.
There is no usable information to retain.
So the question becomes: what can you calculate using arithmetic on secret numbers?
Well, for this demo, treat every article as a number. Then multiply all the articles you don't want by 0, and the article you want by 1, and add them all together.
The server just sees that it's multiplying every article by a secret number. It can't tell what the number is. It can't tell if the output is "encrypted article" or "encrypted 000000..."
Then the server adds them all up. If the client asked for no articles, the result will be "encrypted 000000..." If the client asked for one article, the result will be that article, encrypted. If the client asked for multiple, the result will be a garbled mush of overlapping articles, encrypted. The server can't tell the difference. It just knows it has an encrypted number.
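The demo itself uses a lattice-based scheme, but the same “multiply by encrypted 0/1 and add” trick can be sketched with any additively homomorphic scheme. Here is a toy Paillier version (tiny, insecure primes; illustration only, not the paper's construction):

```python
import math
import random

# Toy Paillier keypair: additively homomorphic, hopelessly insecure sizes.
p, q = 1000003, 1000033                  # hypothetical tiny primes
n, n2 = p * q, (p * q) ** 2
g = n + 1
lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)
mu = pow(lam, -1, n)                     # valid because g = n + 1

def enc(m):
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:           # r must be coprime to n
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def dec(c):
    return ((pow(c, lam, n2) - 1) // n * mu) % n

# PIR as a homomorphic dot product: articles are plaintext numbers, the
# query is an encrypted one-hot vector. Enc(b)^d = Enc(b*d), and a product
# of ciphertexts adds the plaintexts -- so the server sums b_i * article_i
# without ever seeing which bit is the 1.
articles = [111, 222, 333, 444]
query = [enc(1 if i == 2 else 0) for i in range(len(articles))]  # want index 2

response = 1
for c, article in zip(query, articles):
    response = (response * pow(c, article, n2)) % n2

assert dec(response) == 333              # only the chosen article survives
```

The real scheme replaces Paillier with much faster lattice ciphertexts, but the dot-product structure is the same.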
If the server has removed an article, then it's possible it would send back a broken response to you, but the server would not be able to tell if that happened. There would only be an information leak if you reacted to the broken response.
If you mean observing that you made a query at all, versus not making a query, sure that's observable.
The 'magic' of homomorphic encryption is that the server is able to take an encrypted one-hot vector of bits, and a vector of plaintext articles, compute a homomorphic dot product, and produce a single ciphertext encoding the single desired plaintext article. This single output ciphertext is crucially also encrypted under the client's secret key, so it reveals nothing to the server.
So basically (if I've got this right and at the risk of over-simplifying): Client computes a bit mask and encrypts it, sends that to the server, and says "do a big multiplication of this against Wikipedia content". That leaves only the relevant portion in the encrypted result, which only the client can decrypt.
How does the client know which bits in the initial "mask" (aka one-hot vector) to set? Does it have to download a full index of article titles? (So in a sense this is just a big retrieval system that pulls an array element based on its index, not a freeform search tool).
If the index got too large, we could use a hash to bucket things, but this incurs some waste and would only support exact article title matches.
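That bucketing step can be sketched like this (a hypothetical helper; the real demo ships a local title index instead):

```python
import hashlib

NUM_BUCKETS = 1 << 16   # the database becomes NUM_BUCKETS retrievable records

def bucket_of(title):
    """Map an exact article title to a bucket index. The client then
    privately retrieves that whole bucket and picks out its article
    locally, so the server still learns nothing."""
    digest = hashlib.sha256(title.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS
```

The waste comes from buckets being uneven: some hold several articles, all of which must be fetched together, and only exact title matches can be looked up.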
Yes, it is definitely not a freeform search tool.
The nicer way to look at homomorphic encryption is like math, but the more literal way is like a bunch of logic gates. There are N output wires. You control the bits on the input wires. Nothing you do will change the number of wires.
If you think about it, the pacing of user queries could always reveal something (if there’s a long gap between queries, perhaps it’s a dense mathematics article?). So the best we can hope for is pacing requests either randomly, or for even further protection, perhaps making dummy requests on a fixed cadence.
Every query touches every byte of that snapshot, in the same way... but the homomorphic-encryption math distills that down to just the fragment the client requested.
And it does so without giving even the CPU executing that math any hint of what bytes/ranges/topics survive the math. The much-smaller – but still, at every server step, encrypted – result is then sent to the client, which performs the decryption.
It's moon-math magic.
I'm assuming it'll depend on breaking down questions into hierarchical sub-questions that can either be recomposed locally or in another homomorphic context. But can that sort of thing be done without data-leaks, or prohibitively expensive inter-node communication?
Are there any introductory resources (that you know of) on homomorphic encryption and compute that'll turn this into less of a mind-fuck?
Basically, you upload encrypted blobs to a P2P network, and you can issue special proxy re-encryption keys to people that allow them to download the encrypted content, without revealing any of the content to the people running the nodes (who store and replicate the data). You can also do really interesting things like revoking keys to remove access for people who haven't downloaded the blob yet.
Since this has only recently become practical, there is a dearth of resources on it at the moment. I’m going to try to do my part and write a blog post at some point.
- How do you handle variable length data? Do you need to pad it?
- What is the memory overhead of the storage of encrypted data?
I think that at least for video data, the streaming scheme "leaks" the size of the encrypted data with the number of streaming packets.
The memory overhead is significant but not prohibitive… we have to keep something like a 5-8x larger encoded database in memory, but this overhead is from encoding the plaintext in a special format to allow a fast dot product, not from inefficiency in the packing.
Or does the server go through a large chunk of its memory (say, at least a quarter of all of Wikipedia) and perform some oblivious computation on all of that data (applying the result modulo this 100KB return buffer)? That sounds very resource-intensive, at least for something large like Wikipedia (a doctor's office with some information pages of a few KB each could more easily do such a thing).
In the latter case, is each request unique (does it involve some sort of IV that the client can xor out of the data again) or could an index be built similar to a list of hashed PIN codes mapped back to plain text numbers?
Edit: I had already read some comments but just two comments further would have been my answer... :) https://www.hackerneue.com/item?id=31669630
> > With a proper implementation of PIR, the server still needs to scan through the entire encrypted dataset (this is unavoidable, otherwise its I/O patterns would leak information)
Now let's see if I can also find the answer regarding the ability to build a lookup table...
Edit #2: found that also! https://www.hackerneue.com/item?id=31669924
> One query for one item in the database is indistinguishable (without the client’s key) from another query for the same item later; in other words, it’s similar to something like the guarantee of CBC or GCM modes, where as long as you use it correctly, it is secure even if the attacker can see many encryptions of its choosing.
That is some cool stuff indeed. I'm going to have to up my game when building or reviewing privacy-aware applications. Sure, a file sharing service is not going to practically allow this, but I'm sure that with this knowledge, I will come across places where it makes sense from both a usefulness (e.g. medical info) and practicality (data set size) perspective.
Well, as the author points out here [0], it doesn't actually translate to exorbitant cost increases when almost all the cost is in bandwidth rather than compute. A file sharing service rather seems like an ideal example (assuming you can derive additional revenue from the privacy promises).
[0]: https://www.hackerneue.com/item?id=31673122
For me, compute has always been by far my biggest expense, and that's without homomorphic encryption! Buying a server or the monthly rental (e.g. VPS) costs is where most of my budget goes. Next on the expense list are a few domain names, and probably about as much in electricity if we're considering the at-home scenario. Bandwidth is usually included, be it with the VPS or with the server I run at home (since I want that internet uplink for normal use anyway).
When it’s done it’ll be at https://samirmenon.com/.
http://static.wiki/
See the previous news article. https://www.hackerneue.com/item?id=28012829
Have you considered running (# of CPUs) parallel scanners continuously? An inbound query “hops on” the least-loaded scanner; at each article/chunk the scanner runs all the queries; each query “hops off” and returns after it has completed the cycle through the entire DB.
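A minimal sketch of that “hop on / hop off” loop for a single scanner (all names hypothetical): each query joins at the scanner's current position and completes once the scan has wrapped around the whole database.

```python
def rotating_scan(db, arrivals):
    """db: list of chunks. arrivals: {step: [query_id, ...]} mapping scan
    steps to queries that join at that step. Summing db[pos] stands in
    for the per-chunk homomorphic work; each query finishes after one
    full pass, regardless of where it joined."""
    active = {}       # query_id -> [chunks_processed, accumulator]
    finished = {}
    step = 0
    while arrivals or active:
        pos = step % len(db)
        for query in arrivals.pop(step, []):
            active[query] = [0, 0]          # hop on mid-cycle
        for query, state in list(active.items()):
            state[1] += db[pos]             # stand-in for homomorphic work
            state[0] += 1
            if state[0] == len(db):         # full wrap-around: hop off
                finished[query] = state[1]
                del active[query]
        step += 1
    return finished

db = [1, 2, 3, 4]
done = rotating_scan(db, {0: ["q1"], 2: ["q2"]})
assert done == {"q1": 10, "q2": 10}  # every query touched the whole DB
```

The upshot is that the scanner streams the database exactly once per cycle while amortizing that pass over every query currently riding it.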
I hope it doesn't become standard practice for general websites (as I imagine some would like to see), but it's an amazing tool and there will probably be many wonderful uses.
Arguably, a malicious server could deliver a bad executable too.
This seems to be some kind of search applied on an encrypted dataset, is that right?
Quoting it here for convenience:
> With homomorphic encryption, the client sends a series of encrypted numbers. Nobody can decrypt them except the client. The server can do arithmetic with them, making new secret numbers that nobody can decrypt except the client.
> There is no usable information to retain.
> So the question becomes: what can you calculate using arithmetic on secret numbers?
> Well, for this demo, treat every article as a number. Then multiply all the articles you don't want by 0, and the article you want by 1, and add them all together.
> The server just sees that it's multiplying every article by a secret number. It can't tell what the number is. It can't tell if the output is "encrypted article" or "encrypted 000000..."
> Then the server adds them all up. If the client asked for no articles, the result will be "encrypted 000000..." If the client asked for one article, the result will be that article, encrypted. If the client asked for multiple, the result will be a garbled mush of overlapping articles, encrypted. The server can't tell the difference. It just knows it has an encrypted number.
If you found the explanation useful, you can upvote the original comment linked above.
http://blog.tyrannyofthemouse.com/2013/05/i-added-your-numbe...
I suggested that my goal would be to add a v3 onion service. They actually had listed years of "homomorphic encryption" as a requirement. I phoned up the recruiter and basically said it's ok if there is a personality conflict, but the role as written was impossible to fill, and it scared me that very good suggestions for privacy as well as the health of the Tor network were discarded.
(If you set up a dot onion, that frees up traffic on exit nodes, whose capacity are limited.)
Big thanks to the OP for being willing to share this work, it's very cool and I'm about to read your eprint.
I'm excited about the potential of homomorphic encryption, though I worry about things like CPU cost -- I recall when folks had to really be nudged not to encrypt huge blocks of data with PGP, but instead use it to encrypt the passphrase to a Truecrypt volume using a symmetric cipher like AES.
(I'd love to know how we got to a point where Twitter added an onion service then banned me, but Wikipedia continues to not even support MFA for logins – I recently registered an account intending to eventually upload some art to the commons, but the perpetual refusal to allow folks to make healthy choices disturbs me.)
In fact, after reading articles like these ones[1][2], it makes me question the integrity of the folks I interacted with during the interview process.
On my end, it was especially disturbing since prior to enrolling in my PhD, the alternative path I discussed was becoming an FBI agent focused on counter intelligence in the "cyber" realm.
The agent I spoke with told me I'd serve "at the needs of the bureau", so that would mean probably not using my computer skills, which would then languish, then after a couple years I might still not get my desired position, and gave me a card, which I eventually lost.
Years later, prior to the insurrection, I had to walk down to Carnegie Mellon and ask if anyone had his contact information, and was shocked that folks refused to even point me at a link to the lecture, which had been listed as open to the public.
I'm someone who reads Wikipedia, not really edits, but the vast majority of users are readers, not editors, and this perpetual pattern of refusing to enable privacy-enhancing technologies, paired with using privileged access to make hiring decisions against folks who lack the physical ability to make good privacy decisions, offended me on a deep, personal level, and is why I often post in a brash, erratic manner.
Because I see zero incentive to stay silent -- if I'm quiet, people will slowly drain my bank account.
If I post, there is a chance someone will see what I say, notice my skills, and offer full time employment. So I have to continue risking offending folks until I find a full time job, which I have not had since I left the Center for Democracy and Technology under duress following a series of electronic and physical attacks, paired with threats and harassment by staffers in the organization.
TL;DR: Great research, but I hope they also add an onion service rather than jump straight to using this :-)
[1] https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@list...
[2] https://slate.com/technology/2021/10/wikipedia-mainland-chin...
I worry they have insider threat issues that remain unsolved.
I hope they add an onion service.
I think the tech is cool, but issues about untested code aside, I worry about CPU overhead.
(However, on the last point, I suspect much like when we worried about CPU overhead,
I often feel like I have to speak exhaustively and at length to get my point across, but that may be a side effect of a large chunk of my professional network being K Streeters - they like to misunderstand on purpose then complain you explained things at length.
Is the above better? I can skip marketing myself if that's the issue - I just notice a persistent issue that folks say they need certain skills, I know I have them, but folks disbelieve me. Short of being arrested for a CFAA violation I'm not sure how to prove it to those types at this point, and I don't intend on doing that, LOL.
If you actually care, it's painful to see people destroy the things you care about.
Could I, as a malicious server, request myself a target article and correlate that with legitimate user requests?
https://www.hackerneue.com/item?id=31669370 (not original poster)