- I think the biggest injury to the hiring of junior devs happened after COVID made remote work ubiquitous. It's a lot harder for a junior dev to get real mentorship, including the ambient kind of mentorship-by-osmosis, when everyone works alone in a sad dark room in their basement, rather than in an office with their peers and mentors.
The advent of agentic coding is probably punch #2 in the one-two punch against juniors, but it's an extension of a pattern that's been unfolding for 5+ years now.
- I read it. I also searched the page for the word "Opus" and it didn't appear anywhere. The word "Sonnet" appears, but only once.
There's also "GPT-4.1 or GPT-5", but that's not what my question implied, which was that it's weird to offer Sonnet but not Opus.
- Sonnet only?
- A similar kind of question about "understanding" is asking whether a house cat understands the physics of leaping up onto a countertop. When you see the cat preparing to jump, it takes a moment and gazes upward at its target. Then it wiggles its rump, shifts its tail, and springs up into the air.
Do you think there are components of the cat's brain that calculate forces and trajectories, incorporating the gravitational constant and the cat's static mass?
Probably not.
So, does a cat "understand" the physics of jumping?
The cat's knowledge about jumping comes from trial and error, and their brain builds a neural network that encodes the important details about successful and unsuccessful jumping parameters, even if the cat has no direct cognitive access to those parameters.
So the cat can "understand" jumping without having a "meta-understanding" of that understanding. When a cat "thinks" about jumping, and prepares to leap, they aren't rehearsing their understanding of the physics, but repeating the ritual that has historically led them to successful jumps.
I think the theory of mind of an LLM is like that. In my interactions with LLMs, I think "thinking" is a reasonable word to describe what they're doing. And I don't think it will be very long before I'd also use the word "consciousness" to describe the architecture of their thought processes.
- Same for me :)
- Is there a way to get "speech marks" alongside the generated audio?
FYI, speech marks provide a millisecond timestamp for each word in a generated audio file/stream (and a start/end index into your original source string), as a stream of JSONL objects, like this:
{"time":6,"type":"word","start":0,"end":5,"value":"Hello"}
{"time":732,"type":"word","start":7,"end":11,"value":"it's"}
{"time":932,"type":"word","start":12,"end":16,"value":"nice"}
{"time":1193,"type":"word","start":17,"end":19,"value":"to"}
{"time":1280,"type":"word","start":20,"end":23,"value":"see"}
{"time":1473,"type":"word","start":24,"end":27,"value":"you"}
{"time":1577,"type":"word","start":28,"end":33,"value":"today"}
AWS uses these speech marks (with variants for "sentence", "word", "viseme", or "ssml") in their Polly TTS service...
The sentence or word marks are useful for highlighting text as the TTS reads aloud, while the "viseme" marks are useful for doing lip-sync on a facial model.
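For reference, here's a minimal boto3 sketch of requesting word marks from Polly (the voice and text are just placeholders):

    import json
    import boto3

    polly = boto3.client("polly")

    # OutputFormat="json" returns speech marks instead of audio; you'd make
    # a second call with OutputFormat="mp3" to get the audio itself.
    response = polly.synthesize_speech(
        Text="Hello it's nice to see you today",
        VoiceId="Joanna",
        OutputFormat="json",
        SpeechMarkTypes=["word"],  # or "sentence", "viseme", "ssml"
    )

    # The AudioStream body is the JSONL stream shown above.
    for line in response["AudioStream"].read().decode("utf-8").splitlines():
        mark = json.loads(line)
        print(mark["time"], mark["value"])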
- If I'm reading the pricing correctly, these models are SIGNIFICANTLY cheaper than ElevenLabs.
https://platform.openai.com/docs/pricing
If these are the "gpt-4o-mini-tts" models, and if the pricing estimate of "$0.015 per minute" of audio is correct, then these prices are 85% cheaper than those of ElevenLabs.
With ElevenLabs, if I choose their most cost-effective "Business" plan at $1,100 per month (with annual billing of $13,200, a savings of 17% over monthly billing), then I get 11,000 minutes of TTS, and each minute is billed at 10 cents.
With OpenAI, I could get 11,000 minutes of TTS for $165.
Somebody check my math... Is this right?
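Here's the arithmetic spelled out (prices copied from above, not independently verified):

    openai_per_min = 0.015          # $/minute, per the OpenAI pricing estimate
    eleven_per_min = 1100 / 11000   # Business plan: $1,100/month for 11,000 minutes

    print(11000 * openai_per_min)               # 165.0 -> $165 for 11,000 minutes
    print(1 - openai_per_min / eleven_per_min)  # 0.85  -> 85% cheaper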
- Nope. They needed the maximum amount of thrust from those boosters in order to propel the spacecraft toward Jupiter, so they couldn't save enough fuel for the boosters to land themselves. This was the 6th flight of these boosters, so we thank them for their service!
- Awesome, I've been following Seph's work for many years! Always thoughtful and well-executed. Probably the most prolific and insightful engineer in the "collaborative text editing" universe.
I use ShareDB every day, which originated from Seph's excellent work on OT algorithms. Good stuff!
- It doesn't sound like they "failed" any actual safety test, but rather that they rushed their safety tests, thereby "failing" (in the eyes of many people) to conduct sufficiently rigorous tests.
Now that the 4o model has been out in the wild for 2 months, have there been any claims of serious safety failures? The article doesn't seem to imply any such thing.
- Nice! Thank you for the links!
- This reminds me of the amazing molecular animations of Drew Berry, which he showed in this TED talk:
https://youtu.be/WFCvkkDSfIU?si=JNe06VS8TjIrHpqh
That was 12 years ago! After watching that video, I had a much greater appreciation for how our bodies are made up of trillions of tiny protein machines. Fascinating stuff!!
- I find myself wondering if Apple applied some kind of back-channel pressure to oust Riccitiello.
With Unity at such a privileged position in the developer ecosystem of the upcoming Apple Vision Pro, I can imagine that Apple execs were pissed off that Unity would do something so stupid and shortsighted to jeopardize their developer ecosystem.
I haven't heard anyone float that idea yet, and the term "Apple" doesn't appear anywhere (yet!) in the comments of this post, so it doesn't seem to be on most people's minds. But still, I wonder...
- I'm a software engineer, and I think the fully remote-work culture can often be less personally fulfilling.
I enjoy going to an office, having a change of scenery, interacting with people (both close friends and casual acquaintances), brainstorming ideas with a small group around a whiteboard, getting lunch with colleagues, etc, etc...
Sure, the "writing code" part of the job is easier when there are fewer distractions. And a crowded open office can be an annoying source of distractions.
And I definitely enjoy having lunch more often with my family, or hanging out with the dogs, or sitting out on my deck under the trees while I read my morning email.
But working in an office was nice too, and I miss it.
- I'm still trying to get a handle on that part myself... But my ever-evolving understanding goes something like this:
The "Query" matrix is like a mask that is capable of selecting certain kinds of features from the context, while the "Key" matrix focuses the "Query" on specific locations in the context.
Using the Query + Key combination, we select and extract those features from the context matrix. And then we apply the "Value" matrix to those features in order to prepare them for feed-forward into the next layer.
There are multiple "Attention Heads" per layer (GPT-3 had 96 heads per layer), and each Head performs its own separate QKV operation. After applying those 96 Q+K->V attention operations per layer, the results are merged back into a single matrix so that they can be fed-forward into the next layer.
Or something like that...
I'm still trying to grok it myself, and if anyone here can shed more light on the details, I'd be very grateful!
I'm still trying to understand, for example, how many QKV matrices are actually stored in a model with a particular number of parameters. For example, in a GPT-NeoX-20B model (with 20 billion params) how many distinct Q, K, and V matrices are there, and what is their dimensionality?
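In the meantime, here's the picture currently in my head, as a numpy sketch of a single attention head with made-up sizes (definitely not the real GPT-NeoX dimensions):

    import numpy as np

    # In this picture there are (layers x heads) distinct Q, K, V weight
    # matrices, each mapping d_model down to d_head = d_model / n_heads.
    d_model, n_heads, seq_len = 4096, 32, 8
    d_head = d_model // n_heads                # 128

    x = np.random.randn(seq_len, d_model)      # the context: one row per word here
    W_q = np.random.randn(d_model, d_head)     # learned during training
    W_k = np.random.randn(d_model, d_head)
    W_v = np.random.randn(d_model, d_head)

    Q, K, V = x @ W_q, x @ W_k, x @ W_v        # (seq_len, d_head) each
    scores = Q @ K.T / np.sqrt(d_head)         # how strongly each word attends to each other word
    scores -= scores.max(-1, keepdims=True)    # for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)  # softmax
    out = weights @ V                          # (seq_len, d_head), merged across heads afterward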
EDIT:
I just read Imnimo's comment below, and it provides a much better explanation about QKV vectors. I learned a lot!
- Okay, here's my attempt!
First, we take a sequence of words and represent it as a grid of numbers: each column of the grid is a separate word, and each row of the grid is a measurement of some property of that word. Words with similar meanings are likely to have similar numerical values on a row-by-row basis.
(During the training process, we create a dictionary of all possible words, with a column of numbers for each of those words. More on this later!)
This grid is called the "context". Typical systems will have a context that spans several thousand columns and several thousand rows. Right now, context length (column count) is rapidly expanding (1k to 2k to 8k to 32k to 100k+!!) while the dimensionality of each word in the dictionary (row count) is pretty static at around 4k to 8k...
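To make the "dictionary" idea concrete, here's a toy numpy sketch with made-up sizes:

    import numpy as np

    vocab_size, d_model = 50000, 4096           # made-up sizes

    # The learned "dictionary": one column of numbers for each possible word
    embedding = np.random.randn(d_model, vocab_size)

    # A sequence of word ids becomes the "context" grid: one column per word
    word_ids = [17, 4291, 98, 5]                # hypothetical token ids
    context = embedding[:, word_ids]            # shape: (4096, 4)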
Anyhow, the Transformer architecture takes that grid and passes it through a multi-layer transformation algorithm. The functionality of each layer is identical: receive the grid of numbers as input, then perform a mathematical transformation on the grid of numbers, and pass it along to the next layer.
Most systems these days have around 64 or 96 layers.
After the grid of numbers has passed through all the layers, we can use it to generate a new column of numbers that predicts the properties of some word that would maximize the coherence of the sequence if we add it to the end of the grid. We take that new column of numbers and comb through our dictionary to find the actual word that most closely matches the properties we're looking for.
That word is the winner! We add it to the sequence as a new column, remove the first column, and run the whole process again! That's how we generate long text-completions one word at a time :D
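Here's that loop as a rough numpy sketch, with a do-nothing stand-in for the real layer stack:

    import numpy as np

    def layer_stack(context):
        return context                          # identity stand-in for the 96 real layers

    def generate(context, embedding, n_words):
        for _ in range(n_words):
            predicted = layer_stack(context)[:, -1]   # predicted properties of the next word
            scores = embedding.T @ predicted          # compare against every dictionary word
            winner = int(scores.argmax())             # the closest match wins!
            new_col = embedding[:, [winner]]
            context = np.concatenate([context[:, 1:], new_col], axis=1)  # slide the window
            yield winner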
So the interesting bits are located within that stack of layers. This is why it's called "deep learning".
The mathematical transformation in each layer is called "self-attention", and it involves a lot of matrix multiplications and dot-product calculations with a learned set of "Query, Key and Value" matrices.
It can be hard to understand what these layers are doing linguistically, but we can use image-processing and computer-vision as a good metaphor, since images are also grids of numbers, and we've all seen how photo-filters can transform that entire grid in lots of useful ways...
You can think of each layer in the transformer as being like a "mask" or "filter" that selects various interesting features from the grid, and then tweaks the image with respect to those masks and filters.
In image processing, you might apply a color-channel mask (chroma key) to select all the green pixels in the background, so that you can erase the background and replace it with other footage. Or you might apply a "gaussian blur" that mixes each pixel with its nearest neighbors, to create a blurring effect. Or you might do the inverse of a gaussian blur, to create a "sharpening" operation that helps you find edges...
But the basic idea is that you have a library of operations that you can apply to a grid of pixels, in order to transform the image (or part of the image) for a desired effect. And you can stack these transforms to create arbitrarily-complex effects.
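In code, those filters are just small grids of numbers swept across the big grid. A toy numpy/scipy example:

    import numpy as np
    from scipy.signal import convolve2d

    image = np.random.rand(64, 64)              # a toy grayscale "grid of numbers"

    blur = np.ones((3, 3)) / 9                  # mix each pixel with its neighbors
    sharpen = np.array([[ 0, -1,  0],
                        [-1,  5, -1],
                        [ 0, -1,  0]])          # boost each pixel against its neighbors

    blurred = convolve2d(image, blur, mode="same")
    sharpened = convolve2d(image, sharpen, mode="same")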
The same thing is true in a linguistic transformer, where a text sequence is modeled as a matrix.
The language-model has a library of "Query, Key and Value" matrices (which were learned during training) that are roughly analogous to the "Masks and Filters" we use on images.
Each layer in the Transformer architecture attempts to identify some features of the incoming linguistic data, and then, having identified those features, it mixes its transformation back into the matrix, so that the next layer sees the original plus the adjustments.
We don't know exactly what each of these layers is doing in a linguistic model, but we can imagine it's probably doing things like: performing part-of-speech identification (in this context, is the word "ring" a noun or a verb?), reference resolution (who does the word "he" refer to in this sentence?), etc, etc.
And the "dot-product" calculations in each attention layer are there to make each word "entangled" with its neighbors, so that we can discover all the ways that each word is connected to all the other words in its context.
So... that's how we generate word-predictions (aka "inference") at runtime!
But why does it work?
To understand why it's so effective, you have to understand a bit about the training process.
The flow of data during inference always flows in the same direction. It's called a "feed-forward" network.
But during training, there's another step called "back-propagation".
For each document in our training corpus, we go through all the steps I described above, passing each word into our feed-forward neural network and making word-predictions. We start out with a completely randomized set of QKV matrices, so the results are often really bad!
During training, when we make a prediction, we KNOW what word is supposed to come next. And we have a numerical representation of each word (4096 numbers in a column!) so we can measure the error between our predictions and the actual next word. Those "error" measurements are also represented as columns of 4096 numbers (because we measure the error in every dimension).
So we take that error vector and pass it backward through the whole system! Each layer takes the back-propagated error and uses it to make tiny adjustments to its Query, Key, and Value matrices; it also calculates how much of that error came from its own inputs, and passes that error signal backward to the previous layer. So we make tiny corrections on all 96 layers, and eventually to the word-vectors in the dictionary itself!
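Shrunk down to a single made-up linear layer (no attention, just one weight matrix), one training step looks roughly like this:

    import numpy as np

    d = 4096
    W = np.random.randn(d, d) * 0.01            # a stand-in for one learned matrix

    x = np.random.randn(d)                      # input word-vector
    target = np.random.randn(d)                 # the word we KNOW comes next

    prediction = W @ x
    error = prediction - target                 # a column of 4096 error measurements

    grad_W = np.outer(error, x)                 # how each weight contributed to the error
    grad_x = W.T @ error                        # the error signal for the previous layer
    W -= 0.001 * grad_W                         # a tiny correction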
Like I said earlier, we don't know exactly what those layers are doing. But we know that they're performing a hierarchical decomposition of concepts.
Hope that helps!
- No, I don't pay for it myself. My employer set up the WeWork and pays the monthly bill, on the premise that there would be a handful of people in our city (Portland, OR) who also want to work from an office. A few other people used to come in occasionally, but now it's dwindled down to mostly just me.
I also ride my bike ~10 miles each way (even in Portland winter!) because it's a great way for me to make sure I get daily exercise and breathe some fresh air.
Without the commute, I get cabin fever in the dreary Portland winter.
I have a pretty robust social life with my wife and other non-work friends, but if I work from home every day, that cabin fever sets in anyway...
And I specifically miss the PROFESSIONAL SOCIAL LIFE that comes from having a close face-to-face relationship with my collaborators.
- Yeah, that's what I wanted to see too.