- Not OP but I also often listen to ambient while programming. A couple of recommendations would be "Music for Nine Post Cards" and other works by Hiroshi Yoshimura, and "Music for 18 Musicians" and others by Steve Reich.
In fact, the use of loops described in this article reminded me of what Reich called "phasing", basically the same concept of emerging/shifting melodic patterns between different samples.
- The main reason a wafer scale chip works there is that their cores are extremely tiny, so the silicon area that gets fused off in the event of a defect is much smaller than on NVIDIA chips, where a whole SM can get disabled. AFAIU this approach is not easily applicable to complex core designs.
- The NVidia driver also has userland submission (in fact it does not support kernel-mode submission at all). I don't think it leads to a significant simplification of the userland code: the driver has to keep track of basically the same things it would otherwise have submitted through an ioctl. If anything there are some subtleties that require careful consideration.
The major upside is removing the context switch on each submission. The idea is that an application only talks to the kernel for queue setup/teardown; everything else happens in userland.
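As a rough sketch of the difference (all names below are made up for illustration, this is not the actual NVidia or DRM interface): kernel-mode submission goes through an ioctl for every submit, while user-mode submission only writes into a ring buffer and rings a doorbell that were both mapped during queue setup.

    /* Toy sketch of the two submission models, with hypothetical names.
     * The real queue setup ioctls, ring layout and doorbell semantics are
     * driver-specific. */
    #include <stddef.h>
    #include <stdint.h>
    #include <sys/ioctl.h>

    struct user_queue {
        uint32_t          *ring;      /* command ring, mapped into userspace at queue setup */
        volatile uint32_t *doorbell;  /* MMIO doorbell page, also mapped at queue setup */
        uint32_t           head;
    };

    /* Kernel-mode submission: one syscall (and context switch) per submit. */
    static void submit_kernel(int fd, void *cmds, size_t size) {
        struct { void *cmds; size_t size; } args = { cmds, size };
        ioctl(fd, 0 /* hypothetical DRM_IOCTL_..._SUBMIT */, &args);
    }

    /* User-mode submission: copy commands into the ring and ring the doorbell.
     * The kernel is not involved after the queue has been created. */
    static void submit_user(struct user_queue *q, const uint32_t *cmds, uint32_t n) {
        for (uint32_t i = 0; i < n; i++)
            q->ring[q->head + i] = cmds[i];
        q->head += n;
        *q->doorbell = q->head;  /* tells the GPU new work is available */
    }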
- It actually doesn't make much difference: https://chipsandcheese.com/i/138977378/decoder-differences-a...
- No problem, just be aware there's a bunch of optimizations I haven't had time to implement yet. In particular, I'd like to remove the reset kernel, fuse the VLD/IDCT ones, and try different strategies and hw-dependent specializations for the IDCT routine (AAN algorithm, packed FP16, cooperative matrices).
- Do you have a link for that? I'm the guy working on the Vulkan ProRes decoder mentioned as "in review" in this changelog, as part of a GSoC project.
I'm curious how a WebGPU implementation would differ from a Vulkan one. Here's mine if you're interested: https://github.com/averne/FFmpeg/tree/vk-proresdec
- Hardware GPU encoders are dedicated ASIC engines, separate from the main shader cores. So they run in parallel, and there is no performance penalty for using both simultaneously, besides increased power consumption.
Generally, you're right that these hardware blocks favor latency. One example of this is motion estimation (one of the most expensive operations during encoding). The NVENC engine on NVidia GPUs will only use fairly basic detection loops, but can optionally be fed motion hints from an external source. I know that NVidia has a CUDA-based motion estimator (called CEA) for this purpose. On recent GPUs there is also the optical flow engine (another separate block) which might be able to do higher quality detection.
- Self-plug, but I wrote an open-source NVDEC driver for the Tegra X1, working on both the Switch OS and NVidia's Linux distro (L4T): https://github.com/averne/FFmpeg.
It currently integrates all the low-level bits into FFmpeg directly, though I am looking at moving those to a separate library. Eventually, I hope to support desktop cards as well with minimal code changes.
- The mushrooms are imported from China or Poland as mycelium, and the harvest is done in France. Since the law distinguishes between mycelium and mushroom, the mushrooms were technically produced in France.
https://web.archive.org/web/20240121180131/https://www.reddi...
- It's not so clear cut. The author of the original PR had serious gripes about jart's handling of the situation, especially how hard they pushed their PR, practically forcing the merge before legitimate concerns were addressed.
See this post https://www.hackerneue.com/item?id=35418066
- This isn't true anymore. It was their first approach, but since then they have switched to their own JIT recompiler. You can read their rationale here: https://github.com/Ryujinx/Ryujinx/pull/693
For the macOS port, they also added an ARM-to-ARM JIT in case the hypervisor runs into issues.
- There are OpenGL extensions that can import an existing GPU buffer as a texture; using those you can achieve zero-copy.
For instance, with VAAPI->OpenGL you would use vaExportSurfaceHandle in conjunction with glEGLImageTargetTexture2DOES.
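A rough sketch of that luma-plane import (error handling omitted; assumes a decoded NV12 surface and an existing EGL context/GL texture, the chroma plane is handled the same way at half resolution):

    #include <va/va.h>
    #include <va/va_drmcommon.h>
    #include <EGL/egl.h>
    #include <EGL/eglext.h>
    #include <GLES2/gl2.h>
    #include <GLES2/gl2ext.h>

    void import_luma(VADisplay va_dpy, VASurfaceID surface,
                     EGLDisplay egl_dpy, GLuint luma_tex)
    {
        /* Export the decoded surface as DMA-BUF file descriptors. */
        VADRMPRIMESurfaceDescriptor desc;
        vaExportSurfaceHandle(va_dpy, surface,
                              VA_SURFACE_ATTRIB_MEM_TYPE_DRM_PRIME_2,
                              VA_EXPORT_SURFACE_READ_ONLY |
                              VA_EXPORT_SURFACE_SEPARATE_LAYERS,
                              &desc);

        /* Wrap plane 0 (luma) in an EGLImage backed by the DMA-BUF, no copy. */
        EGLAttrib attribs[] = {
            EGL_WIDTH,                     desc.width,
            EGL_HEIGHT,                    desc.height,
            EGL_LINUX_DRM_FOURCC_EXT,      desc.layers[0].drm_format,
            EGL_DMA_BUF_PLANE0_FD_EXT,     desc.objects[desc.layers[0].object_index[0]].fd,
            EGL_DMA_BUF_PLANE0_OFFSET_EXT, desc.layers[0].offset[0],
            EGL_DMA_BUF_PLANE0_PITCH_EXT,  desc.layers[0].pitch[0],
            EGL_NONE,
        };
        EGLImage image = eglCreateImage(egl_dpy, EGL_NO_CONTEXT,
                                        EGL_LINUX_DMA_BUF_EXT, NULL, attribs);

        /* The texture now samples the decoder's memory directly.  On desktop
         * GL, load glEGLImageTargetTexture2DOES via eglGetProcAddress. */
        glBindTexture(GL_TEXTURE_2D, luma_tex);
        glEGLImageTargetTexture2DOES(GL_TEXTURE_2D, image);
    }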
Check out the "hwdec" mechanism in MPV:
https://github.com/mpv-player/mpv/blob/master/video/out/hwde...
https://github.com/mpv-player/mpv/blob/master/video/out/hwde...
- The nouveau project used a kernel module to intercept mmio accesses: https://nouveau.freedesktop.org/MmioTrace.html. Generally speaking, hooking into driver code is one of the preferred ways of doing dynamic reverse engineering. For userspace components, you can build an LD_PRELOAD stub that logs ioctls, and so on.
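For example, a minimal ioctl-logging shim (generic, not tied to any particular driver) can look like this:

    /* shim.c -- build: gcc -shared -fPIC -o shim.so shim.c -ldl
     * usage:    LD_PRELOAD=./shim.so ./target_app 2>ioctl.log */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <stdarg.h>
    #include <stdio.h>

    int ioctl(int fd, unsigned long request, ...)
    {
        static int (*real_ioctl)(int, unsigned long, ...);
        if (!real_ioctl)
            real_ioctl = (int (*)(int, unsigned long, ...))dlsym(RTLD_NEXT, "ioctl");

        /* ioctl is variadic but effectively takes a single pointer/integer argument. */
        va_list ap;
        va_start(ap, request);
        void *arg = va_arg(ap, void *);
        va_end(ap);

        int ret = real_ioctl(fd, request, arg);
        fprintf(stderr, "ioctl(fd=%d, req=0x%lx, arg=%p) = %d\n", fd, request, arg, ret);
        return ret;
    }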
- > how much work would have been involved in getting this release open sourced
Close to no actual effort (the headers are autogenerated). However there was probably a lot of work behind the scenes with their legal team/whatever to clear the release.
About the shader ISA, I wish I knew. It's certain that the documentation exists, because they provide it to some developers (I've been told the Maxwell ISA docs are part of the Nintendo Switch SDK), so it's not like they have to write it from scratch.
And AMD provides full docs for its shader ISA [1] (not sure about Intel), so I don't see how keeping it secret could provide a significant edge over the competition. Maybe raytracing instructions? But for a motivated reverse engineer this stuff isn't impossible to figure out. I think it's down to company culture and inertia.
[1] https://developer.amd.com/wp-content/resources/RDNA2_Shader_...
- That's not exactly correct. These are register maps for the 3D engine (also called a class); what you describe would be closer to the shader ISA.
In driver code you'll see them building command buffers that set registers in those classes to certain values. It could be the RGBA values of the clear color, or a virtual address in the GPU space.
This release documents the names of these registers, which makes reverse engineering somewhat easier as you don't really have to guess anymore. But in most cases it was pretty clear from the start, and for some generations those were already well documented by open source efforts. One of the best known gens is probably Maxwell since it was used in the Nintendo Switch, see for instance [1] (or the code in yuzu/Ryujinx), which is the equivalent of those headers NV published.
However this isn't a very big step in documenting their GPUs. The exact functions of those registers aren't explained, but most importantly the shader ISA isn't documented at all, which is essential to build a good open-source driver.
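To make the command buffer part concrete, here's a toy sketch of what "setting registers in a class" looks like from the driver side. The method names, offsets and packing below are made up for illustration; the real names are exactly what those headers document, and the real header word uses NVidia's pushbuffer encoding.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Hypothetical method offsets within the 3D class (placeholders; the real
     * values and names come from headers like [1] or the ones NV published). */
    #define HYP_3D_SET_CLEAR_COLOR_R  0x0d80
    #define HYP_3D_SET_CLEAR_COLOR_G  0x0d84

    /* Append a "write <value> to <method>" command to the command buffer.
     * In reality the first word is a packed header (opcode, subchannel,
     * method, count) rather than the raw method offset. */
    static void push(uint32_t *buf, size_t *pos, uint32_t method, uint32_t value)
    {
        buf[(*pos)++] = method;
        buf[(*pos)++] = value;
    }

    static uint32_t f2u(float f) { uint32_t u; memcpy(&u, &f, sizeof(u)); return u; }

    /* The GPU's command processor reads this buffer and writes the values
     * into the engine's registers, e.g. the clear color used by a later
     * clear operation. */
    void emit_clear_color(uint32_t *cmdbuf, size_t *pos, float r, float g)
    {
        push(cmdbuf, pos, HYP_3D_SET_CLEAR_COLOR_R, f2u(r));
        push(cmdbuf, pos, HYP_3D_SET_CLEAR_COLOR_G, f2u(g));
    }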
Source: I have reverse engineered some driver code for Maxwell (and used similar headers to write drivers for nvdec and nvjpg).
[1] https://github.com/devkitPro/deko3d/blob/master/source/maxwe...
- I've written an open-source driver for the decoding side of the nvjpg module found in the Tegra X1 (i.e. an earlier hardware revision than the one in the A100).
I did some quick benchmarks against libjpeg-turbo, if that can give you an idea. I expect encoding performance would be similar.
- I'm not sure there's much to be simplified, interpreted JS is just that slow. In more recent firmwares Nintendo introduced more security-oriented changes (CFI, PAC) that potentially slowed the browser down even further. The Switch CPU is also not fast to begin with, and severely underclocked compared to what you would see on a regular Tegra X1. It used to be that you could access the Switch eshop using a dumped console certificate, and it was very smooth on a regular desktop browser.
- > Fabrice won International Obfuscated C Code Contest three times and you need a certain mindset to create code like that—which creeps into your other work. So despite his implementation of FFmpeg was fast-working, it was not very nice to debug or refactor, especially if you’re not Fabrice