It worked because it also had a conventional data-processing pipeline that revolved around JSON documents.
For (2) it seems a system like that should be able to generate a script in Python, a co-designed DSL, or some other language to do the conversion.
One interesting thing about the product I worked on was that it functioned as a profiler by looking at one cell at a time, so if some field contained "Gruff Rhys" or "范冰冰" it could tell that was probably somebody's name -- all the better if it could also see that the field label was something like "Full Name" or "姓名". I'd contrast that with more conventional column-based profilers, which might notice that a certain field only has the values "true" and "false" throughout the whole column and would probably have some rule that determines it's a boolean field.
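Roughly the difference between the two approaches, with made-up heuristics rather than the product's actual rules:

    // Cell-level: classify a single value, using the field label as a hint when present.
    function classifyCell(value: string, label?: string): string {
      if (label && /full\s*name|姓名/i.test(label)) return "person-name";
      if (/^\+?[\d\s().-]{7,}$/.test(value)) return "phone-number";
      if (/^\p{L}+([ ·]\p{L}+)+$/u.test(value)) return "person-name?"; // looks like a multi-word name
      return "unknown";
    }

    // Column-level: infer a type from every value in the column at once.
    function classifyColumn(values: string[]): string {
      const distinct = new Set(values.map((v) => v.trim().toLowerCase()));
      if ([...distinct].every((v) => v === "true" || v === "false")) return "boolean";
      if (values.every((v) => v !== "" && !Number.isNaN(Number(v)))) return "numeric";
      return "text";
    }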
One thing that system could do is recognize private data inside unstructured data. Where I work, for instance, we have
https://www.spirion.com/sensitive-data-discovery
which scans text and other files and warns if it sees what looks like a lot of personal data, such as an Excel spreadsheet full of names, addresses and phone numbers -- even if I just made them up as test data.
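A crude sketch of the idea (nothing to do with Spirion's actual implementation): look for clusters of things shaped like personal data and warn when the density gets suspicious.

    // Illustrative patterns only -- real scanners use far more than a few regexes.
    const piiPatterns: RegExp[] = [
      /[\w.+-]+@[\w-]+\.[\w.-]+/g,   // email addresses
      /\+?\d[\d\s().-]{8,}\d/g,      // phone-number-ish strings
      /\b\d{3}-\d{2}-\d{4}\b/g,      // US SSN-shaped numbers
    ];

    function looksLikeBulkPersonalData(text: string, threshold = 20): boolean {
      let hits = 0;
      for (const pattern of piiPatterns) {
        hits += (text.match(pattern) ?? []).length;
      }
      // Many hits in one file => warn, even if the values are made-up test data.
      return hits >= threshold;
    }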
IMO this use case is exactly what Copilot is for. Write a comment including one example each of input and output, and tab-complete in your language of choice to have it create a rewriter for you.
One benefit (and danger) is that it will look at the values, not just the keys, and also may generate arbitrary code that can e.g. adapt a firstName and lastName to a fullName. But that's why you have a human being triggering and auditing this for subtle bugs, and putting it through code review and source control, right?
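For a concrete (entirely invented) example, the comment plus completion might end up looking something like this, including the firstName/lastName adaptation:

    // Rewrite records like  { "firstName": "Ada", "lastName": "Lovelace", "zip": "02139" }
    // into records like     { "fullName": "Ada Lovelace", "postalCode": "02139" }
    interface InputRecord {
      firstName: string;
      lastName: string;
      zip: string;
    }

    interface OutputRecord {
      fullName: string;
      postalCode: string;
    }

    function rewrite(input: InputRecord): OutputRecord {
      return {
        fullName: `${input.firstName} ${input.lastName}`,
        postalCode: input.zip,
      };
    }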
There are a few ways this could be made less expensive to run:
1. Cache those embeddings somewhere. You're only embedding simple strings like "name" and "address" - no need to do that work more than once in an entire lifetime of running the tool.
2. As suggested here https://www.hackerneue.com/item?id=40973028 change the design of the tool so that instead of doing the work itself it returns a reusable data structure mapping input keys to output keys -- you only have to run it once, and can then use that generated data structure to apply the transformation to large amounts of data in the future (see the sketch after this list).
3. Since so many of the keys are going to have predictable names ("name", "address" etc) you could even pre-calculate embeddings for the 1,000 most common keys across all three embedding providers and ship those as part of the package.
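A hypothetical sketch of 1 and 2 combined -- embedText stands in for whichever provider call the library actually makes, and all the other names here are made up:

    type Embedder = (text: string) => Promise<number[]>;

    function cosineSimilarity(a: number[], b: number[]): number {
      let dot = 0, normA = 0, normB = 0;
      for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
      }
      return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // 1. Memoize embeddings so strings like "name" and "address" are only embedded once.
    function withCache(embedText: Embedder): Embedder {
      const cache = new Map<string, number[]>();
      return async (text) => {
        const hit = cache.get(text);
        if (hit) return hit;
        const vector = await embedText(text);
        cache.set(text, vector);
        return vector;
      };
    }

    // 2. Return a reusable source-key -> target-key mapping instead of transforming
    //    the data directly; the mapping can then be applied to any number of records
    //    without touching an embedding API again.
    async function buildKeyMapping(
      embedText: Embedder,
      sourceKeys: string[],
      targetKeys: string[]
    ): Promise<Record<string, string>> {
      const embed = withCache(embedText);
      const targetVectors = await Promise.all(targetKeys.map((k) => embed(k)));
      const mapping: Record<string, string> = {};
      for (const key of sourceKeys) {
        const vector = await embed(key);
        let best = 0;
        for (let i = 1; i < targetVectors.length; i++) {
          if (cosineSimilarity(vector, targetVectors[i]) > cosineSimilarity(vector, targetVectors[best])) {
            best = i;
          }
        }
        mapping[key] = targetKeys[best];
      }
      return mapping;
    }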
Also: in https://github.com/rectanglehq/Shapeshift/blob/d954dab2a866c... you're using Promise.map() to run multiple embeddings through the OpenAI API at once, which risks tripping their rate-limit. You should be able to pass the text as an array in a single call instead, something like this:
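(Untested sketch -- the model name is just a placeholder, not necessarily what Shapeshift uses.)

    import OpenAI from "openai";

    const openai = new OpenAI();

    async function embedKeys(keys: string[]): Promise<number[][]> {
      // One request for the whole batch instead of one request per key.
      const response = await openai.embeddings.create({
        model: "text-embedding-3-small", // placeholder model name
        input: keys,
      });
      // Each result carries an index, so sort before returning to keep the order stable.
      return response.data
        .sort((a, b) => a.index - b.index)
        .map((item) => item.embedding);
    }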
https://platform.openai.com/docs/api-reference/embeddings/cr... says input can be a string OR an array - that's reflected in the TypeScript library here too: https://github.com/openai/openai-node/blob/5873a017f0f2040ef...