- Not OP but I also often listen to ambient while programming. A couple of recommendations would be "Music for Nine Post Cards" and other works by Hiroshi Yoshimura, and "Music for 18 Musicians" and others by Steve Reich.
In fact, the use of loops described in this article reminded me of what Reich called "phasing", basically the same concept of emerging/shifting melodic patterns between different samples.
- The main reason a wafer scale chip works there is that their cores are extremely tiny, so the silicon area that gets fused off in the event of a defect is much smaller than on NVIDIA chips, where a whole SM can get disabled. AFAIU this approach is not easily applicable to complex core designs.
- The NVidia driver also has userland submission (in fact it does not support kernel-mode submission at all). I don't think it leads to a significant simplification of the userland code: the driver has to keep track of basically the same things it would otherwise have submitted through an ioctl. If anything there are some subtleties that require careful consideration.
The major upside is removing the context switch on each submission. The idea is that an application only talks to the kernel for queue setup/teardown; everything else happens in userland.
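As a rough sketch of the difference (all names below are made up for illustration, this is not the actual NVidia or DRM interface): kernel-mode submission goes through an ioctl for every submit, while user-mode submission only writes into a ring buffer and rings a doorbell that were both mapped during queue setup.

    /* Toy sketch of the two submission models, with hypothetical names.
     * The real queue setup ioctls, ring layout and doorbell semantics are
     * driver-specific. */
    #include <stddef.h>
    #include <stdint.h>
    #include <sys/ioctl.h>

    struct user_queue {
        uint32_t          *ring;      /* command ring, mapped into userspace at queue setup */
        volatile uint32_t *doorbell;  /* MMIO doorbell page, also mapped at queue setup */
        uint32_t           head;
    };

    /* Kernel-mode submission: one syscall (and context switch) per submit. */
    static void submit_kernel(int fd, void *cmds, size_t size) {
        struct { void *cmds; size_t size; } args = { cmds, size };
        ioctl(fd, 0 /* hypothetical DRM_IOCTL_..._SUBMIT */, &args);
    }

    /* User-mode submission: copy commands into the ring and ring the doorbell.
     * The kernel is not involved after the queue has been created. */
    static void submit_user(struct user_queue *q, const uint32_t *cmds, uint32_t n) {
        for (uint32_t i = 0; i < n; i++)
            q->ring[q->head + i] = cmds[i];
        q->head += n;
        *q->doorbell = q->head;  /* tells the GPU new work is available */
    }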
- It actually doesn't make much difference: https://chipsandcheese.com/i/138977378/decoder-differences-a...
- No problem, just be aware there's a bunch of optimizations I haven't had time to implement yet. In particular, I'd like to remove the reset kernel, fuse the VLD/IDCT ones, and try different strategies and hw-dependent specializations for the IDCT routine (AAN algorithm, packed FP16, cooperative matrices).
- Do you have a link for that? I'm the guy working on the Vulkan ProRes decoder mentioned as "in review" in this changelog, as part of a GSoC project.
I'm curious how a WebGPU implementation would differ from a Vulkan one. Here's mine if you're interested: https://github.com/averne/FFmpeg/tree/vk-proresdec
- Hardware GPU encoders are dedicated ASIC engines, separate from the main shader cores. So they run in parallel, and there is no performance penalty for using both simultaneously, besides increased power consumption.
Generally, you're right that these hardware blocks favor latency. One example of this is motion estimation (one of the most expensive operations during encoding). The NVENC engine on NVidia GPUs will only use fairly basic detection loops, but can optionally be fed motion hints from an external source. I know that NVidia has a CUDA-based motion estimator (called CEA) for this purpose. On recent GPUs there is also the optical flow engine (another separate block) which might be able to do higher quality detection.
- Self-plug, but I wrote an open-source NVDEC driver for the Tegra X1, working on both the Switch OS and NVidia's Linux distro (L4T): https://github.com/averne/FFmpeg.
It currently integrates all the low-level bits into FFmpeg directly, though I am looking at moving those to a separate library. Eventually, I hope to support desktop cards as well with minimal code changes.
- The mushrooms are imported from China or Poland as mycelium, and the harvest is done in France. Since the law distinguishes between mycelium and mushroom, the mushrooms were technically produced in France.
https://web.archive.org/web/20240121180131/https://www.reddi...
- It's not so clear cut. The author of the original PR had serious gripes about jart's handling of the situation, especially how hard they pushed their PR, practically forcing the merge before legitimate concerns were addressed.
See this post https://www.hackerneue.com/item?id=35418066
- This isn't true anymore. It was their first approach, but since then they have switched to their own JIT recompiler. You can read their rationale here: https://github.com/Ryujinx/Ryujinx/pull/693
For the macOS port, they also added an ARM-to-ARM JIT in case the hypervisor runs into issues.
- There are OpenGL extensions that can import an existing GPU buffer as a texture; using those you can achieve zero-copy.
For instance, with VAAPI->OpenGL you would use vaExportSurfaceHandle in conjunction with glEGLImageTargetTexture2DOES.
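A rough sketch of that luma-plane import (error handling omitted; assumes a decoded NV12 surface and an existing EGL context/GL texture, the chroma plane is handled the same way at half resolution):

    #include <va/va.h>
    #include <va/va_drmcommon.h>
    #include <EGL/egl.h>
    #include <EGL/eglext.h>
    #include <GLES2/gl2.h>
    #include <GLES2/gl2ext.h>

    void import_luma(VADisplay va_dpy, VASurfaceID surface,
                     EGLDisplay egl_dpy, GLuint luma_tex)
    {
        /* Export the decoded surface as DMA-BUF file descriptors. */
        VADRMPRIMESurfaceDescriptor desc;
        vaExportSurfaceHandle(va_dpy, surface,
                              VA_SURFACE_ATTRIB_MEM_TYPE_DRM_PRIME_2,
                              VA_EXPORT_SURFACE_READ_ONLY |
                              VA_EXPORT_SURFACE_SEPARATE_LAYERS,
                              &desc);

        /* Wrap plane 0 (luma) in an EGLImage backed by the DMA-BUF, no copy. */
        EGLAttrib attribs[] = {
            EGL_WIDTH,                     desc.width,
            EGL_HEIGHT,                    desc.height,
            EGL_LINUX_DRM_FOURCC_EXT,      desc.layers[0].drm_format,
            EGL_DMA_BUF_PLANE0_FD_EXT,     desc.objects[desc.layers[0].object_index[0]].fd,
            EGL_DMA_BUF_PLANE0_OFFSET_EXT, desc.layers[0].offset[0],
            EGL_DMA_BUF_PLANE0_PITCH_EXT,  desc.layers[0].pitch[0],
            EGL_NONE,
        };
        EGLImage image = eglCreateImage(egl_dpy, EGL_NO_CONTEXT,
                                        EGL_LINUX_DMA_BUF_EXT, NULL, attribs);

        /* The texture now samples the decoder's memory directly.  On desktop
         * GL, load glEGLImageTargetTexture2DOES via eglGetProcAddress. */
        glBindTexture(GL_TEXTURE_2D, luma_tex);
        glEGLImageTargetTexture2DOES(GL_TEXTURE_2D, image);
    }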
Check out the "hwdec" mechanism in MPV:
https://github.com/mpv-player/mpv/blob/master/video/out/hwde...
https://github.com/mpv-player/mpv/blob/master/video/out/hwde...
- The nouveau project used a kernel module to intercept mmio accesses: https://nouveau.freedesktop.org/MmioTrace.html. Generally speaking, hooking into driver code is one of the preferred ways of doing dynamic reverse engineering. For userspace components, you can build an LD_PRELOAD stub that logs ioctls, and so on.
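For example, a minimal ioctl-logging shim (generic, not tied to any particular driver) can look like this:

    /* shim.c -- build: gcc -shared -fPIC -o shim.so shim.c -ldl
     * usage:    LD_PRELOAD=./shim.so ./target_app 2>ioctl.log */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <stdarg.h>
    #include <stdio.h>

    int ioctl(int fd, unsigned long request, ...)
    {
        static int (*real_ioctl)(int, unsigned long, ...);
        if (!real_ioctl)
            real_ioctl = (int (*)(int, unsigned long, ...))dlsym(RTLD_NEXT, "ioctl");

        /* ioctl is variadic but effectively takes a single pointer/integer argument. */
        va_list ap;
        va_start(ap, request);
        void *arg = va_arg(ap, void *);
        va_end(ap);

        int ret = real_ioctl(fd, request, arg);
        fprintf(stderr, "ioctl(fd=%d, req=0x%lx, arg=%p) = %d\n", fd, request, arg, ret);
        return ret;
    }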
- > how much work would have been involved in getting this release open sourced
Close to no actual effort (the headers are autogenerated). However there was probably a lot of work behind the scenes with their legal team/whatever to clear the release.
About the shader ISA, I wish I knew. It's certain that the documentation exists, because they provide it to some developers (I've been told the Maxwell ISA docs are part of the Nintendo Switch SDK), so it's not like they have to write it from scratch.
And AMD provides full docs for its shader ISA [1] (not sure about Intel), so I don't see how keeping it secret could provide a significant edge over the competition. Maybe raytracing instructions? But for a motivated reverse engineer this stuff isn't impossible to figure out. I think it's down to company culture and inertia.
[1] https://developer.amd.com/wp-content/resources/RDNA2_Shader_...
- That's not exactly correct. These are register maps for the 3D engine (also called a class); what you describe would be closer to the shader ISA.
In driver code you'll see them building command buffers that set registers in those classes to certain values. It could be the RGBA values of the clear color, or a virtual address in the GPU space.
This release documents the names of these registers, which makes reverse engineering somewhat easier as you don't really have to guess anymore. But in most cases it was pretty clear from the start, and for some generations those were already well documented by open source efforts. One of the best known gens is probably Maxwell since it was used in the Nintendo Switch, see for instance [1] (or the code in yuzu/Ryujinx), which is the equivalent of those headers NV published.
However this isn't a very big step in documenting their GPUs. The exact functions of those registers aren't explained, but most importantly the shader ISA isn't documented at all, which is essential to build a good open-source driver.
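To make the command buffer part concrete, here's a toy sketch of what "setting registers in a class" looks like from the driver side. The method names, offsets and packing below are made up for illustration; the real names are exactly what those headers document, and the real header word uses NVidia's pushbuffer encoding.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Hypothetical method offsets within the 3D class (placeholders; the real
     * values and names come from headers like [1] or the ones NV published). */
    #define HYP_3D_SET_CLEAR_COLOR_R  0x0d80
    #define HYP_3D_SET_CLEAR_COLOR_G  0x0d84

    /* Append a "write <value> to <method>" command to the command buffer.
     * In reality the first word is a packed header (opcode, subchannel,
     * method, count) rather than the raw method offset. */
    static void push(uint32_t *buf, size_t *pos, uint32_t method, uint32_t value)
    {
        buf[(*pos)++] = method;
        buf[(*pos)++] = value;
    }

    static uint32_t f2u(float f) { uint32_t u; memcpy(&u, &f, sizeof(u)); return u; }

    /* The GPU's command processor reads this buffer and writes the values
     * into the engine's registers, e.g. the clear color used by a later
     * clear operation. */
    void emit_clear_color(uint32_t *cmdbuf, size_t *pos, float r, float g)
    {
        push(cmdbuf, pos, HYP_3D_SET_CLEAR_COLOR_R, f2u(r));
        push(cmdbuf, pos, HYP_3D_SET_CLEAR_COLOR_G, f2u(g));
    }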
Source: I have reverse engineered some driver code for Maxwell (and used similar headers to write drivers for nvdec and nvjpg).
[1] https://github.com/devkitPro/deko3d/blob/master/source/maxwe...
- I've written an open-source driver for the decoding side of the nvjpg module found in the Tegra X1 (i.e. an earlier hardware revision than the one in the A100).
I did some quick benchmarks against libjpeg-turbo, if that can give you an idea. I expect encoding performance would be similar.
- I'm not sure there's much to be simplified, interpreted JS is just that slow. In more recent firmwares Nintendo introduced more security-oriented changes (CFI, PAC) that potentially slowed the browser down even further. The Switch CPU is also not fast to begin with, and severely underclocked compared to what you would see on a regular Tegra X1. It used to be that you could access the Switch eshop using a dumped console certificate, and it was very smooth on a regular desktop browser.
- > Fabrice won International Obfuscated C Code Contest three times and you need a certain mindset to create code like that—which creeps into your other work. So despite his implementation of FFmpeg was fast-working, it was not very nice to debug or refactor, especially if you’re not Fabrice