Const-me
6,291 karma
meet.hn/city/42.4303762,18.6988104/Tivat

http://const.me/


  1. > if they are stored unpowered for a couple of years, then you clearly aren't doing regular backups

    I do regular backups, yet I have a few backup disks that have been unpowered for years. They are older, progressively smaller backup HDDs I keep for extra redundancy.

    Every 2-4 years I get a larger backup drive and clone the previous backup drive onto the new one. This way, when a backup drive fails (which happened around 2013 because I was unlucky enough to get one of the notoriously unreliable 3TB Seagates), I lose little if any data, because most of the new stuff is still on the computers and the old stuff remains on these older backup drives.

  2. Flash drives are less than ideal for backups. I think when they are stored cold, i.e. unpowered, flash memory only retains data for a couple of years. Spinning hard drives are way more reliable for this use case.
  3. > keeps on nudging me to use Onedrive

    At least on Windows 10, you only have to decline once; here’s how. Ctrl+Shift+Esc to launch Task Manager, “Startup” tab, right-click on “Microsoft OneDrive”, select “Disable” from the context menu, then either reboot, or log out and log in.

  4. > How large is large?

    About the same as the amount of physical memory. For Word 97, the minimum system requirement was 8 MB of RAM, and that was not just for Word but for the entire OS.

    > Loading and saving a few GiB from my SSD is pretty fast

    Indeed, that’s one of the reasons why modern word processors stopped doing complicated tricks like the ones I described, and instead serialize complete documents.

    > a special file format and use virtual memory

    That’s not the brightest idea: it inflates disk bandwidth by at least a factor of 2. A modern example of software which routinely handles datasets much larger than physical memory is database engines (large ones, not embedded). They avoid virtual memory as much as possible because the IO amplification leads to unpredictable latency.

    > Saddling some guy on an underpowered Chromebook

    The guy will be fine. The program might allocate a large buffer on startup, but it will only use the small initial slice because Chromebooks don’t come with particularly large screens. The Linux kernel does not automatically commit allocated memory.
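
    To illustrate that last point, here’s a small hypothetical demo (not from the original comment) of lazy commit on Linux: it reserves far more address space than a cheap laptop has RAM, but only a few megabytes of physical memory ever get committed, because only those pages are written.

    ```cpp
    // Hypothetical demo of lazy commit on Linux: reserve 16 GB of address
    // space, but touch only 8 MB, so only ~8 MB of physical RAM is used.
    #include <cstdio>
    #include <cstring>
    #include <sys/mman.h>

    int main()
    {
        const size_t reserved = 16ull * 1024 * 1024 * 1024; // address space reserved
        const size_t used = 8ull * 1024 * 1024;             // bytes actually written

        // MAP_NORESERVE reserves address space only; physical pages are
        // committed lazily, when the memory is first written to.
        void* p = mmap(nullptr, reserved, PROT_READ | PROT_WRITE,
            MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        if (p == MAP_FAILED)
        {
            perror("mmap");
            return 1;
        }
        memset(p, 0xCC, used);
        printf("Reserved %zu bytes, touched %zu\n", reserved, used);
        munmap(p, reserved);
        return 0;
    }
    ```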

  5. > to know "what is clearly a bug" 100% of the time in projects you don't own

    Owning a project is counter-productive for QA. If it’s your project, you know where to click and where to not click.

    OTOH, you don’t need to know anything about a project to conclude that a crash with an access violation, or a hang with 100% CPU usage, is clearly a bug.

  6. > What if you don't know ahead of time how big that monitor is that you are displaying stuff on?

    Use a reasonable upper estimate?

    > ad-hoc re-implementation of virtual memory?

    If you rely on actual virtual memory instead of a specially designed file format, saving large files becomes prohibitively slow. On each save you have to stream the entire document from the page file into actual memory, serialize the document, produce the entire file, then replace the old one. And when resuming editing after the save, you probably have to load the visible portion back from disk.

  7. It’s technically possible to do, just very complicated and hard. Quite often, prohibitively so.

    Still, the main idea is that even though the input files are arbitrarily large, you don’t need the entire file in memory, because displays aren’t remotely large enough to render a megabyte of text. Technically, you can load only the visible portion of the input file and stream from/to disk when the user scrolls. Furthermore, if you own the file format, you can design it in a way which allows editing without overwriting the entire file: mark deleted portions without moving subsequent content, write inserts to the end of the file, maybe organize the file as a B+ tree, etc. (see the sketch at the end of this comment).

    That’s how software like Word 97 supported editing of documents much larger than available memory. As you can imagine, the complexity of such file formats, and of the software handling them, was overwhelming. Which is why software developers stopped doing things like that as soon as computers gained enough memory to keep entire documents, and instead serialize them into sane formats, like zipped XML in the case of modern MS Office.
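
    Here’s a minimal piece-table sketch of the “append inserts, never rewrite the file” idea. It’s a hypothetical illustration, not the Word 97 format: the original file stays read-only, new text goes into an append-only log, and the logical document is just an ordered list of spans.

    ```cpp
    // Minimal piece-table sketch: the original file is never rewritten,
    // inserts are appended, and deletes only shrink or drop spans.
    #include <cstdint>
    #include <string>
    #include <vector>

    struct Span
    {
        bool fromAppendLog;  // false = original file, true = append-only log
        uint64_t offset;     // byte offset within that region
        uint64_t length;
    };

    struct PieceTable
    {
        std::string original;    // stands in for the read-only file on disk
        std::string appendLog;   // stands in for bytes appended to the file
        std::vector<Span> spans; // logical order of the document

        explicit PieceTable(std::string file) : original(std::move(file))
        {
            spans.push_back({ false, 0, original.size() });
        }

        // Insert text at a logical position: append the bytes to the log,
        // then split the span which contains that position.
        void insert(uint64_t pos, const std::string& text)
        {
            const Span added{ true, appendLog.size(), text.size() };
            appendLog += text;
            for (size_t i = 0; i < spans.size(); i++)
            {
                if (pos <= spans[i].length)
                {
                    const Span tail{ spans[i].fromAppendLog, spans[i].offset + pos, spans[i].length - pos };
                    spans[i].length = pos;
                    spans.insert(spans.begin() + i + 1, { added, tail });
                    return;
                }
                pos -= spans[i].length;
            }
            spans.push_back(added); // insert at the very end of the document
        }
    };
    ```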

  8. I forgot to mention something important. Since about Vista, Microsoft has tended to replace or supplement the C WinAPI with IUnknown-based object-oriented APIs. Note that IUnknown doesn’t necessarily imply COM; for example, Direct3D is not COM: no IDispatch, IPC, registration, or type libraries.

    IUnknown-based ABIs expose methods of objects without any symbols exported from DLLs; the virtual method tables are internal implementation details, not public symbols. By testing SDK-defined magic numbers, like the SDKVersion argument of the D3D11CreateDevice factory function, the DLL implementing the factory function can create very different objects for programs built against different versions of the Windows SDK.
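
    For instance, here’s roughly what that magic number looks like at the call site; a hedged sketch, the function signature is from the public D3D11 headers and the createDevice wrapper name is made up.

    ```cpp
    // The SDKVersion magic number is baked in at compile time from the
    // Windows SDK headers; d3d11.dll may adjust behaviour based on it.
    #include <d3d11.h>
    #pragma comment(lib, "d3d11.lib")

    HRESULT createDevice(ID3D11Device** dev, ID3D11DeviceContext** ctx)
    {
        D3D_FEATURE_LEVEL level;
        return D3D11CreateDevice(
            nullptr,                  // default adapter
            D3D_DRIVER_TYPE_HARDWARE,
            nullptr, 0,               // no software rasterizer, no flags
            nullptr, 0,               // default feature levels
            D3D11_SDK_VERSION,        // the SDK-defined magic number
            dev, &level, ctx);
    }
    ```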

  9. > versioned symbols are a thing on Windows

    There are quite a few mechanisms they use for that. The oldest one: call a special API function on startup, like InitCommonControlsEx, and other API functions from that DLL will resolve or behave differently. A similar tactic is requiring an SDK-defined magic number as a parameter to some initialization function, with different magic numbers switching symbols from the same library; examples are WSAStartup and MFStartup (see the sketch at the end of this comment).

    Around Win2k they added side-by-side assemblies, or WinSxS. Include a special XML manifest as an embedded resource of your EXE, and you can request a specific version of a dependent API DLL; the OS keeps multiple versions internally.

    Then there are compatibility mechanisms, both OS built-in and user-controllable (right-click on an EXE or LNK, Compatibility tab). Compatibility mode is yet another way to control the versions of the DLLs used by an application.

    Pretty sure there’s more and I forgot something.
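
    A minimal sketch of the magic-number tactic with WSAStartup; the initSockets wrapper name is made up, the rest is the documented Winsock startup call.

    ```cpp
    // The caller asks for Winsock 2.2 with MAKEWORD(2, 2); an old program
    // requesting 1.1 gets the old behaviour from the same ws2_32.dll.
    #include <winsock2.h>
    #pragma comment(lib, "ws2_32.lib")

    bool initSockets()
    {
        WSADATA wsa;
        return WSAStartup(MAKEWORD(2, 2), &wsa) == 0;
    }
    ```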

  10. > there are source generators

    Last time I tried them, I discovered that source generators in the current .NET 10 SDK are broken beyond repair, because Microsoft does not support dependencies between source generators.

    Want to auto-generate COM proxies or similar? Impossible, because library import and export are implemented with other source generators. Want to generate something JSON-serializable? Impossible, because in modern .NET the JSON serializer is implemented with another source generator. Generate regular expressions? Another SDK-provided source generator, as long as you want good runtime performance.

  11. A new keyboard for an X1 Carbon 7th gen is available on eBay for $50-100.

    A faulty keyboard is IMO not a good reason to replace a whole computer.

  12. > it cannot be faster than normalization alone

    Modern processors generally compute stuff way faster than they can load and store bytes from main memory.

    Code which does on-the-fly normalization only needs to normalize a small window. If you’re careful, you can even keep that window in registers, which have single-cycle access latency and ridiculously high throughput, like 500 GB/sec. Even if you have to store and reload, on-the-fly normalization is likely to handle tiny windows which fit in the in-core L1D cache. The access cost for L1D is ~5 cycles of latency, with similarly high throughput, because many modern processors can load two 64-byte vectors and store one vector every cycle.

  13. > I'm genuinely surprised Microsoft's attitude towards "wndprocs don't have a context pointer"

    They designed window classes to be reusable, and assumed many developers were going to reuse window classes across windows.

    Consider the following use case. A programmer creates a window class for a custom control and registers the class, designs a dialog template with several of these custom controls in a single dialog, then creates the dialog by calling DialogBoxW or similar.

    These custom controls are created automatically, several at once, so it’s hard to provide a context pointer for each control.
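
    Not from the comment, but for context: the usual workaround for a custom control is for each instance to allocate its own state and stash it in the window, rather than receive a context pointer through the wndproc. A rough sketch (the names and the click counter are made up):

    ```cpp
    // Each control instance allocates its own state in WM_NCCREATE and
    // stores the pointer in GWLP_USERDATA; no per-instance context pointer
    // ever arrives through the window procedure itself.
    #include <windows.h>

    struct ControlState
    {
        int clicks = 0;
    };

    static LRESULT CALLBACK customControlProc(HWND wnd, UINT msg, WPARAM wp, LPARAM lp)
    {
        ControlState* state = reinterpret_cast<ControlState*>(GetWindowLongPtrW(wnd, GWLP_USERDATA));
        switch (msg)
        {
        case WM_NCCREATE:
            SetWindowLongPtrW(wnd, GWLP_USERDATA, reinterpret_cast<LONG_PTR>(new ControlState));
            break;
        case WM_NCDESTROY:
            delete state;
            break;
        case WM_LBUTTONDOWN:
            if (state)
                state->clicks++;
            break;
        }
        return DefWindowProcW(wnd, msg, wp, lp);
    }
    ```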

  14. I find WSL 1 incredibly useful. C++ and .NET compiler toolchains, ssh and scp clients, and many other command-line Linux tools work flawlessly for me despite the fake emulated kernel lacking some of the APIs. When I develop anything related to Linux, be it embedded or servers, I use WSL 1 a lot.

    I find WSL 2 pretty much useless. When I want Linux inside a VM, I use VMware, which is just better. VMware has a tree of snapshots to roll back disk state, hardware-accelerated 3D graphics (limited though, I think only GL is there, no Vulkan, but it’s better than nothing), can attach complete USB devices to the guest OS, can set up proper virtual networks with multiple VMs, and the GUI to do all that is decent, no command line required.

  15. Hey, Katie. Just FYI, last week I paid $250 to your competitor after being unable to pass your identity verification. BTW, your website didn’t even offer me an option to verify my ID with a PayPal payment.
  16. > the api and the different ways everything has to be fed for it leaves a lot to be desired

    I think Microsoft fixed that in Windows Vista by providing higher-level APIs on top of IOCP. See the CreateThreadpoolIo, CloseThreadpoolIo, StartThreadpoolIo, and WaitForThreadpoolIoCallbacks WinAPI functions.
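
    Roughly like this; a hedged sketch with made-up function names and no error handling, but the thread pool calls themselves are the documented Vista+ API.

    ```cpp
    // Vista+ thread pool I/O on top of IOCP. The file handle must be opened
    // with FILE_FLAG_OVERLAPPED.
    #include <windows.h>

    static void CALLBACK onIoComplete(PTP_CALLBACK_INSTANCE, void* context,
        void* overlapped, ULONG ioResult, ULONG_PTR bytesTransferred, PTP_IO)
    {
        // Runs on a thread pool thread when the read completes.
    }

    void readAsync(HANDLE file, void* buffer, DWORD size, OVERLAPPED* ov)
    {
        PTP_IO io = CreateThreadpoolIo(file, &onIoComplete, nullptr, nullptr);
        StartThreadpoolIo(io); // must be called before every async operation
        if (!ReadFile(file, buffer, size, nullptr, ov) && GetLastError() != ERROR_IO_PENDING)
            CancelThreadpoolIo(io); // the operation failed synchronously

        WaitForThreadpoolIoCallbacks(io, FALSE); // wait for pending callbacks
        CloseThreadpoolIo(io);
    }
    ```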

  17. You’re describing an edge case. Generally speaking, memory is only reused after old objects are deallocated. And here’s the relevant quote from the OP’s post:

    > Having all the intermediate calculations still available is helpful in the debugger

  18. For performance-critical code, you want to reuse L1D cache lines as much as possible. In many cases, allocation of a new immutable object boils down to malloc(), and newly allocated memory is unlikely to be found in the L1D cache. OTOH, replacing data in recently accessed memory and reusing that memory is very likely to result in L1D cache hits at runtime.
  19. > is written in languages that inherited their array ordering from C

    It’s not just C. Modern GPU hardware only supports row-major memory layout for 2D and 3D textures (ignoring specialized layouts like swizzling and block compression, but none of them are column-major either). Modern image and video codecs only support row-major layout for bitmaps.

  20. What you wrote only applies to rotational latency, not seek latency. Seek latency is the time it takes for the head to reach the target track. The head arm only rotates within a small range, something like [ 0 .. 25° ], and it’s designed for rapid movement in either direction.
  21. > would you know if there is something similar that works on u128 instead of just u32/u64?

    Not as far as I’m aware, but I think your use case is handled rather well by the u64 version. Instead of u128, use an array of two uint64 integers and pack the length into the unused high bits of one of them.

    Here’s an example in C++: https://godbolt.org/z/Mrfv3hrzr The packing function in that source file requires AVX2; the unpack is scalar code based on that BMI1 instruction.

    Another version with even fewer instructions to unpack, but one extra memory load: https://godbolt.org/z/hnaMY48zh Might be faster if you have a lot of these packed vectors, are extracting numbers in a tight loop, and the s_extractElements lookup table remains in L1D cache.

    P.S. I’ve only tested that code a couple of times, so there might be bugs.

  22. The support for BMI1 instruction set extension is almost universal by now. The extension was introduced in AMD Jaguar and Intel Haswell, both launched in 2013 i.e. 12 years ago.

    Instead of doing stuff like (word >> bit_offset) & self.mask, in C or C++ I usually write _bextr_u64, or when using modern C#, Bmi1.X64.BitFieldExtract. Note, however, that these intrinsics compile into 2 instructions, not one, because the CPU instruction takes the start/length arguments from the lower/higher bytes of a single 16-bit number.
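
    For illustration, the two equivalent ways to extract a bit field in C++; the function names are made up, and the shift/mask version assumes length < 64.

    ```cpp
    #include <cstdint>
    #include <immintrin.h> // _bextr_u64 needs BMI1 enabled at compile time

    // Portable shift + mask; breaks for length == 64 because of the shift.
    uint64_t extractShiftMask(uint64_t word, unsigned start, unsigned length)
    {
        return (word >> start) & ((1ull << length) - 1);
    }

    // BMI1 intrinsic; compiles into mov + bextr, because the instruction
    // wants start and length packed into a single 16-bit control value.
    uint64_t extractBmi1(uint64_t word, unsigned start, unsigned length)
    {
        return _bextr_u64(word, start, length);
    }
    ```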

  23. > full-platter seek time: ~8ms; half-platter seek time (avg): ~4ms

    The average distance between two points (the first is the current location, the second is the target location), when both are uniformly distributed in the [ 0 .. 1 ] interval, is not 0.5, it’s 1/3. If the full-platter seek time is 8 ms, the average seek time should be about 2.666 ms.
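
    For reference, the 1/3 comes from a short integral over the two uniform positions X and Y:

    \[
    \mathbb{E}\left[\,|X - Y|\,\right]
      = 2 \int_0^1 \int_0^x (x - y)\, dy\, dx
      = 2 \int_0^1 \frac{x^2}{2}\, dx
      = \frac{1}{3}
    \]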

  24. For FFTW the showstopper was the GPL license. For IPP, 200 MB of binary dependencies; also, I remember when Intel was caught testing for Intel CPUs specifically in their runtime libraries, instead of checking CPUID feature bits, deliberately crippling performance on AMD CPUs. I literally don’t have any Intel CPUs left in this house. For cuFFT, the issue is vendor lock-in to nVidia.

    And the problem is IMO too small to justify large dependencies. I only needed something like a 200×400 FFT as a minor component of larger software.

  25. I recently needed a decently performing FFT. Instead of doing Cooley-Tukey, I realized the brute-force version essentially computes two vector×matrix products, so I interleaved and reshaped the matrices for sequential full-vector loads, and implemented the brute-force version with AVX1 and FMA3 intrinsics. Good enough for my use case of a moderately sized FFT where the matrices fit in L2 cache.
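
    A scalar sketch of that brute-force DFT-as-matrix-product idea (hypothetical; the real code interleaved the rows and used AVX1/FMA3 intrinsics, this only shows the math):

    ```cpp
    // Brute-force DFT of a real input as two dense matrix-vector products,
    // with precomputed cosine and sine matrices.
    #include <cmath>
    #include <vector>

    struct DftMatrices
    {
        size_t n;
        std::vector<float> cosines, sines; // row k stores cos/sin( 2*pi*k*i / n )

        explicit DftMatrices(size_t size) : n(size), cosines(size * size), sines(size * size)
        {
            const double pi = 3.14159265358979323846;
            for (size_t k = 0; k < n; k++)
                for (size_t i = 0; i < n; i++)
                {
                    const double angle = 2.0 * pi * double(k * i) / double(n);
                    cosines[k * n + i] = float(std::cos(angle));
                    sines[k * n + i] = float(std::sin(angle));
                }
        }
    };

    void dftBruteForce(const DftMatrices& m, const float* input, float* outRe, float* outIm)
    {
        for (size_t k = 0; k < m.n; k++)
        {
            float re = 0, im = 0;
            for (size_t i = 0; i < m.n; i++)
            {
                re += input[i] * m.cosines[k * m.n + i];
                im -= input[i] * m.sines[k * m.n + i];
            }
            outRe[k] = re;
            outIm[k] = im;
        }
    }
    ```
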
  26. Good article, but it uses a less than ideal formula for the weights of the Gaussian blur kernel.

    The Gaussian function for the coefficients is fine for large sigmas, but for a small blur radius you’d better integrate properly. Luckily, the C++ standard library has the std::erf function you’re going to need for the proper formula. Here’s more info: https://bartwronski.com/2021/10/31/practical-gaussian-filter...
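
    A sketch of that integration approach (the gaussianKernel function name is made up): each weight is the integral of the Gaussian over its pixel, expressed via std::erf, then the weights are normalized.

    ```cpp
    #include <cmath>
    #include <vector>

    // Kernel weights by integrating a zero-centered Gaussian over each pixel
    // [ i - 0.5, i + 0.5 ], instead of sampling the function at pixel centers.
    std::vector<float> gaussianKernel(int radius, float sigma)
    {
        std::vector<float> weights(2 * radius + 1);
        const double scale = 1.0 / (sigma * std::sqrt(2.0));
        double sum = 0;
        for (int i = -radius; i <= radius; i++)
        {
            const double w = 0.5 * (std::erf((i + 0.5) * scale) - std::erf((i - 0.5) * scale));
            weights[i + radius] = float(w);
            sum += w;
        }
        for (float& w : weights)
            w = float(w / sum); // normalize so the weights add up to 1
        return weights;
    }
    ```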

  27. C# would catch the bug at compile time, just like Rust.

    https://www.rocksolidknowledge.com/articles/locking-asyncawa...

  28. > does not account for frequency scaling on laptops

    Are you sure about that?

    > time spent in syscalls (if you don’t want to count it)

    The time spent in syscalls was precisely what the OP was measuring.

    > cycle counter

    While technically interesting, most of the time when I do micro-benchmarks I only care about wall-clock time. Contrary to what you see in search engines and ChatGPT, the RDTSC instruction is not a cycle counter, it’s a high-resolution wall-clock timer. That instruction was counting CPU cycles some 20 years ago; it doesn’t do that anymore.

  29. Not sure if that’s relevant, but when I do micro-benchmarks like that, measuring time intervals way smaller than 1 second, I use the __rdtsc() compiler intrinsic instead of standard library functions.

    On all modern processors, that instruction measures wall-clock time with a counter which increments at the base frequency of the CPU, unaffected by dynamic frequency scaling.

    Apart from the great resolution, that time-measuring method has the upside of being very cheap, a couple of orders of magnitude faster than an OS kernel call.
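
    A sketch of how I’d use it (the calibration against std::chrono and the 100 ms sleep are my own assumptions, not part of the original comment): measure the TSC frequency once, then time short regions with __rdtsc().

    ```cpp
    #include <chrono>
    #include <cstdint>
    #include <cstdio>
    #include <thread>
    #ifdef _MSC_VER
    #include <intrin.h>
    #else
    #include <x86intrin.h>
    #endif

    // Calibrate once: how many TSC ticks elapse per second of wall-clock time.
    static double ticksPerSecond()
    {
        using namespace std::chrono;
        const uint64_t t0 = __rdtsc();
        const auto c0 = steady_clock::now();
        std::this_thread::sleep_for(milliseconds(100));
        const uint64_t t1 = __rdtsc();
        const auto c1 = steady_clock::now();
        return double(t1 - t0) / duration_cast<duration<double>>(c1 - c0).count();
    }

    int main()
    {
        const double frequency = ticksPerSecond();
        const uint64_t begin = __rdtsc();
        // ... code under test goes here ...
        const uint64_t end = __rdtsc();
        printf("%.1f nanoseconds\n", 1e9 * double(end - begin) / frequency);
        return 0;
    }
    ```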

  30. Good article, but I believe it lacks information on what specifically these magical dFdx, dFdy, and fwidth = abs(dFdx) + abs(dFdy) functions are computing.

    The following StackExchange answer addresses that question rather well: https://gamedev.stackexchange.com/a/130933/3355 As you can see, dFdx and dFdy are not exactly derivatives; they are discrete screen-space approximations of these derivatives, very cheap to compute due to the weird execution model of pixel shaders running on GPU hardware.

