Email: <myusername>@gmail.com
Blog: http://www.music.mcgill.ca/~sinclair/blog
- I find it really interesting that it uses a Mamba hybrid with Transformers. Is it the only significant model right now using (at least partially) SSM layers? This must contribute to lower VRAM requirements, right? Does it impact how KV caching works?
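Back-of-envelope, the way I understand it: attention needs a KV cache that grows with context length, while an SSM layer carries a fixed-size recurrent state, so I'd guess that's where the savings come from. A quick sketch with made-up dimensions (not this model's actual config):

    # Per-layer, per-sequence memory: attention KV cache vs. fixed SSM state.
    # All dimensions here are hypothetical, just to show the scaling behaviour.
    bytes_per_value = 2              # fp16
    n_kv_heads, head_dim = 8, 128
    seq_len = 32_768

    # Attention caches K and V for every past token -> grows with seq_len.
    kv_cache = 2 * n_kv_heads * head_dim * seq_len * bytes_per_value

    # A Mamba-style SSM keeps one recurrent state, independent of seq_len.
    d_inner, d_state = 4096, 16
    ssm_state = d_inner * d_state * bytes_per_value

    print(f"attention KV cache: {kv_cache / 2**20:8.2f} MiB")   # ~128 MiB
    print(f"SSM state:          {ssm_state / 2**20:8.2f} MiB")  # ~0.12 MiB

If that's right, then in a hybrid only the few remaining attention layers pay the per-token cache cost, and the SSM layers' "cache" stays constant no matter the context length.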
- you're kind of describing the figure in table 1 (page 8) of the diffusion forcing paper
https://arxiv.org/abs/2407.01392
of course it doesn't redraw the image on every step, so it's not exactly what you're suggesting (interesting idea btw), but i think it's relevant.
- Maybe what they should do in the future is just automatically provide AI reviews for all papers and state that the reviewers' job is to correct any problems and fill in details that were missed. That would encourage manual review of the AI's work and would also allow authors to predict, in a structured way, what kind of feedback they'll get. (E.g. suppose the standard prompt were made public; authors could then optimize their submissions for the initial automatic review, forcing the human reviewer to fill in the gaps.)
OK, of course the human reviewers could still use AI here, but then so could the authors, ad infinitum..
- A lot of "generative" work is like this. While you can come up with benchmarks galore, at the end of the day how a model "feels" only seems to come out through actual usage. Just read /r/localllama for opinions on which models are "benchmaxed", as they put it. It seems to be common knowledge in the local LLM community that many models perform well on benchmarks, but that this doesn't always reflect how good they actually are.
In my case, I was until recently working on TTS, and this was a huge barrier for us. We used all the common signal-quality and MOS-simulation models that judged so-called "naturalness" and "expressiveness", etc. But we found that none of these really helped us much in deciding when one model was better than another, or when a model was "good enough" for release. Our internal evaluations correlated poorly with them, and we even disagreed quite a bit within the team on the quality of the output. This made hyperparameter tuning as well as commercial planning extremely difficult, and we suffered greatly for it. (Notice my use of the past tense here..)
Having good metrics is really key, and I'm now at the point where I'd go so far as to say that if good metrics don't exist, it's almost not even worth working on something. (Almost.)
- > fuzzifying logical computation?
Isn't that basically what the sigmoid operator does? Or, more in the direction of averaging many logical computations, we have random forests.
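To illustrate what I mean: a sigmoid over a weighted sum behaves like a soft AND gate, and sharpening the weights recovers the hard boolean function (a tiny sketch, with the weights picked by hand):

    import math

    def sigmoid(x):
        return 1 / (1 + math.exp(-x))

    # Soft AND: output is high only when both inputs are ~1.
    # As w grows, this approaches the hard boolean AND.
    def soft_and(a, b, w=10.0):
        return sigmoid(w * (a + b - 1.5))

    for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print(a, b, round(soft_and(a, b), 3))
    # -> 0 0 0.0 / 0 1 0.007 / 1 0 0.007 / 1 1 0.993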
- God of the gaps
- Ah, so nothing bad is happening anymore due to people believing what they read on the internet, huh? Interesting take.
- I would love to read more, but apart from not finding a lot of time lately, when I do read, it's fiction. Occasionally I have read a textbook on a topic I am really interested in, and I've read blogs and articles on various sciency themes, but when it comes to books, I have just never been very into reading non-fiction. I don't try often, but when I do, I get one or two chapters in and just .. fail to pick it up again.
I know that non-fiction would be "good for me," particularly reading more on topics I'm less knowledgeable about, like finance and business and politics. Personal growth. However, I do find that fiction helps expand my perspective and even, somehow, my knowledge, but it's different from non-fiction, less direct. I don't read for that explicitly, although I do like the effect. But I read because.. I guess, because it's nice for my brain to be somewhere else. I don't know. But non-fiction has never done it for me.. my mind just gets.. bored, I think, trying to absorb what someone else wants me to know. Even when I find the topic interesting.
I guess there are people who like non-fiction and people who like fiction, and they often cross over, but I think most people lean one way or the other. I can see there being positives and negatives to either side. People who read both equally must be rare? Or maybe it's just my impression.
- It is a strange phenomenon though, these walls of text that LLMs output, when you consider that one thing they're really good at is summarization, and that if they're trained on bug-report data, you'd think they would reproduce its style and conciseness.
Is it mainly post-training that causes this behaviour? They seem to do it for everything; they're really biased towards super-verbose output these days. Maybe it's something to do with reasoning models being trained for longer outputs?
- This sounds more correct to me. I've read somewhere that better generalization is usually associated with wider, smoother minima, and that this is why regularization is important: it has a smoothing effect on the loss landscape.
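A cheap way to see the wide-vs-sharp distinction is to perturb the weights around a found minimum and watch how much the loss rises: a flat minimum barely moves, a sharp one blows up. A sketch of such a probe (generic PyTorch; sigma and the sample count are arbitrary choices, not from any particular paper):

    import torch

    @torch.no_grad()
    def sharpness_proxy(model, loss_fn, data, target, sigma=1e-2, n_samples=10):
        """Average loss increase under small Gaussian weight perturbations.

        Higher values suggest a sharper minimum; flatter (wider) minima
        should score lower. sigma and n_samples are arbitrary choices.
        """
        base = loss_fn(model(data), target).item()
        params = [p for p in model.parameters() if p.requires_grad]
        increases = []
        for _ in range(n_samples):
            noise = [torch.randn_like(p) * sigma for p in params]
            for p, n in zip(params, noise):
                p.add_(n)
            increases.append(loss_fn(model(data), target).item() - base)
            for p, n in zip(params, noise):
                p.sub_(n)  # restore the original weights exactly
        return sum(increases) / n_samples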
- Location: Utrecht, The Netherlands
Remote: Yes (Remote or Hybrid, EU Timezone)
Willing to relocate: No
Technologies: [Deep Learning] PyTorch, TensorFlow, MLFlow; [Languages] Python, C, C++; [Infrastructure] AWS, Docker, PostgreSQL, DynamoDB.
I'm a Senior Machine Learning Engineer with 10+ years of R&D. My core expertise is Deep Learning Model Development and Pipeline Engineering, taking specialized models from concept to reliable output.
My recent work spans Computer Vision (traffic scenario analysis, SLAM, skin deformation analysis) and Generative Audio (speech synthesis focused on naturalness, novel voice generation, and controllability/editability).
I understand the full ML lifecycle, from novel research to scalable, cloud-ready API deployments. Seeking hands-on Senior-level roles and Lead positions to drive innovative model development. As Head of ML R&D (3 years), my work included model development, AWS deployment, and rapid prototyping of LLM/GenAI applications for demos, all very hands-on, alongside a team of 10.
Background: PhD & Master's in Music Technology and Audio-Haptic Robotics (McGill).
CV and more details: https://sinclairs.gitlab.io/cv/sinclair_cv2025.pdf
Email: stephen.sinclair [..at ..] nonnegativ.com
- Just search for "chess LLM leaderboard"; there are already several. Also check https://www.reddit.com/r/llmchess/ though admittedly it doesn't get a lot of traffic.
- one word: chairdogs
- That's exactly the problem in Europe though. It's quite the opposite here.
- I agree, and honestly it may as well be considered a form of ABI incompatibility. They should make this explicit, such that existing C extensions need to be updated to use some new API call at initialization to flag that they are ready to run without the GIL, so that older extensions cannot even be loaded successfully when the GIL is disabled.
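If I remember right, CPython 3.13's free-threaded build went partway here: extensions declare support via a Py_mod_gil module slot, but an unflagged extension just re-enables the GIL with a warning rather than failing to load. The louder failure I mean could be approximated from the Python side (a sketch using 3.13's sys._is_gil_enabled(); the module name is a stand-in):

    import sys

    def import_strict(name):
        """Import an extension, but fail loudly if importing it silently
        re-enabled the GIL on a free-threaded build.

        "some_legacy_ext" below is a stand-in for any old C extension.
        """
        gil_was_off = hasattr(sys, "_is_gil_enabled") and not sys._is_gil_enabled()
        module = __import__(name)
        if gil_was_off and sys._is_gil_enabled():
            raise ImportError(
                f"{name} is not flagged as free-threading ready; "
                "importing it re-enabled the GIL"
            )
        return module

    # usage: ext = import_strict("some_legacy_ext")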
- what do you think vector databases are? absolutely. i think the idea of a database and a "model" could start to really be merged this way..
- Oh wow, finally! They should have done this 20 years ago but this is awesome news.
- I'm curious, what is the use case for open-ended labeling like this? I can think of clustering, i.e. finding similar tweets, but that can also just be done via vector similarity. Otherwise, maybe the labels contain interesting semantics, but 6000 sounds like too many to analyze by hand. Maybe you are using LLMs to do further clustering, working towards a graph or hierarchical "ontology" of tweets?
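For what it's worth, the pipeline I'm imagining would be something like: embed the 6000 free-form labels themselves, then cluster those to collapse them into a manageable set of themes. A sketch (the embedding function here is a dummy stand-in for a real model):

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering

    def embed(text):
        # Dummy stand-in for a real sentence-embedding model.
        rng = np.random.default_rng(abs(hash(text)) % 2**32)
        return rng.normal(size=384)

    # A few placeholder labels standing in for the ~6000 LLM outputs.
    labels = ["complaint about airline", "joke about cats", "airline delay rant"]
    vectors = np.stack([embed(s) for s in labels])

    # Merge similar labels bottom-up; a distance threshold instead of a fixed
    # cluster count, since we don't know how many themes there are.
    clustering = AgglomerativeClustering(
        n_clusters=None, distance_threshold=0.5, metric="cosine", linkage="average"
    ).fit(vectors)

    for label, cluster_id in zip(labels, clustering.labels_):
        print(cluster_id, label)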
- it's literally the same technology as LLMs. Transformers were proposed for translation.
(But I don't know what methods the Firefox translation uses. I assume it's a local model but don't even know that for sure.)
- > scalability needs a whole bunch of complexity
I am not sure this is true. Complexity is a function of architecture. Scalability can be achieved through abstraction; it doesn't necessarily imply a highly coupled architecture. In fact, scalability benefits from decoupling as much as possible, which effectively reduces complexity.
If you have a simple job to do that fits in an AWS Lambda, why not deploy it that way? Scalability is essentially free. But the real advantage is that by writing it as a Lambda you are forced to think of it in stateless terms. On the other hand, if it suddenly needs to coordinate with 50 other Lambdas or services, then you have complexity -- and usually scalability will suffer in this case, as things become more and more synchronous and interdependent.
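To make the "stateless terms" point concrete, a sketch of what I mean (table and field names invented): the handler keeps nothing in process memory between invocations, so the platform is free to run as many copies as it likes.

    import json
    import boto3

    # No module-level mutable state: each invocation is self-contained,
    # which is what lets the platform scale it horizontally.
    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("jobs")  # hypothetical table name

    def handler(event, context):
        job_id = event["job_id"]
        result = len(event.get("payload", ""))  # stand-in for the actual work
        # Any state that must outlive this call goes to the external store.
        table.put_item(Item={"job_id": job_id, "result": result})
        return {"statusCode": 200, "body": json.dumps({"job_id": job_id})}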
> The monolith is composed of separate modules (modules which all run together in the same process).
It's of course great to have a modular architecture, but whether or not the modules run in the same process should be an implementation detail. Barriers should be explicit. By writing it all to depend on local, synchronous, same-process logic, you are likely building in all sorts of implicit barriers that will become hidden dangers when you suddenly do need to scale. And by the way, that's one of the reasons to think about scaling in advance: when the need comes, it comes quickly.
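Concretely, making the barrier explicit might look like this sketch (all names invented): callers code against an interface, and whether the implementation is in-process or behind a network hop stays an implementation detail.

    import json
    import urllib.request
    from typing import Protocol

    class BillingService(Protocol):
        """The explicit barrier: callers only ever see this interface."""
        def charge(self, user_id: str, cents: int) -> bool: ...

    class LocalBilling:
        """Same-process implementation: a plain function call."""
        def charge(self, user_id: str, cents: int) -> bool:
            return True  # stand-in for real logic

    class RemoteBilling:
        """Out-of-process implementation: same interface, network behind it."""
        def __init__(self, base_url: str):
            self.base_url = base_url

        def charge(self, user_id: str, cents: int) -> bool:
            req = urllib.request.Request(
                f"{self.base_url}/charge",
                data=json.dumps({"user": user_id, "cents": cents}).encode(),
                headers={"Content-Type": "application/json"},
            )
            with urllib.request.urlopen(req) as resp:
                return resp.status == 200

    def checkout(billing: BillingService, user_id: str) -> bool:
        # The caller can't tell (and shouldn't care) which one it got.
        return billing.charge(user_id, 999)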
It's not that you should scale early. But if you're designing a system architecture, I think it's better to think about scaling, not because you need it, but because doing so forces you to modularize, decouple, and make synchronization barriers explicit. If done correctly, this will lead to a better, more robust system even when it's small.
Just like premature optimization -- it's better not to get caught up doing it too early, but you still want to design your system so that you'll be able to do it later when needed, because that time will come, and the opportunity to start over is not going to come as easily as you might imagine.