Comment by koolba - Hacker Neue

koolba Jan 25, 2017 parent

> Non-QEMU implementation: Google does not use QEMU, the user-space virtual machine monitor and hardware emulation. Instead, we wrote our own user-space virtual machine monitor that has the following security advantages over QEMU: Simple host and guest architecture support matrix. QEMU supports a large matrix of host and guest architectures, along with different modes and devices that significantly increase complexity.

I've noticed that a lot of projects that do support multiple architectures, particularly obscure ones, tend to find oddball edge cases more easily than those that don't. For example, not assuming the endianness of the CPU arch forces you to get network code to cooperate well.

> Because we support a single architecture and a relatively small number of devices, our emulator is much simpler.

No doubt it's simpler than QEMU but I wonder if adding tests to QEMU, even if they're only for the specific architectures they're running (most likely x86-64) would have been just as usable.

Then again when you're Google and have the resources to build a VM runtime from the ground up it's easier to convince management that "This is the right decision!".

luu Jan 25, 2017

> I wonder if adding tests to QEMU, even if they're only for the specific architectures they're running (most likely x86-64) would have been just as usable.

> Then again when you're Google and have the resources to build a VM runtime from the ground up it's easier to convince management that "This is the right decision!".

Is it possible you're either underestimating the effort it takes to make QEMU solid or overestimating the effort it takes to write an emulator?

I worked at a company where I hacked up QEMU as a stopgap before we switched to an in-house solution (this wasn't Google, although I've also worked at Google). I made literally hundreds of bug fixes to get QEMU into what was, for us, a barely usable state and then someone else wrote a solution from scratch in maybe a month and a half or two months. I doubt I could have gotten QEMU into the state we needed in a couple months. And to be clear, when I say bug fixes, I don't mean features or things that could possibly arguably be "working as intended", I mean bugs like "instruction X does the wrong thing instead of doing what an actual CPU does".

BTW, I don't mean to knock QEMU. It's great for what it is, but it's often the case that a special purpose piece of software tailored for a specific usecase is less effort than making a very general framework suitable for the same usecase. Even for our usecase, where QEMU was a very bad fit, the existence of QEMU let us get an MVP up in a week; I applied critical fixes to our hacked up QEMU while someone worked on the real solution, which gave us a two month head start over just writing something from scratch. But the effort it would have taken to make QEMU production worthy for us didn't seem worth it.

bonzini Jan 25, 2017

KVM does not use any of the CPU/instruction emulation in QEMU. It only uses the device emulation code and the interface to the host (asynchronous I/O, sockets, VNC, etc.).

We are adding unit tests for a lot of new code, and some parts of the code (especially the block device backends) have a comprehensive set of regression tests.

Also, distributions can disable obsolete devices if they wish. Red Hat does that in RHEL, for both security and supportability reasons. So if you want a free hardened QEMU, use CentOS. :-) Several other companies do so, including Nutanix and Virtuozzo.

j_s Jan 25, 2017

disable obsolete devices

Highly recommended!

Venom – A security vulnerability in virtual floppy drive code (~2 years ago)

https://www.hackerneue.com/item?id=9538437

bonzini Jan 25, 2017

Unfortunately VENOM was not so easy because some OSes (ehm Windows XP but also 2003...) only support driver floppies as opposed to driver CD-ROMs.

But we disable a bunch of old SCSI adapters, NICs, most audio cards, the whole Bluetooth emulation subsystem. All the cross-architecture emulation is also compiled out (x86-on-x86 emulation is still left in, until nested virtualization matures---which the Google folks are helping us with too!---but we only support it for libguestfs appliances).

Furthermore, in RHEL most image formats are forbidden or only supported read-only in the emulator (you can still use qemu-img to convert to and from them). Read-only support can be useful because of virt-v2v, an appliance that reads from VMware or Hyper-V images and tweaks them to run as KVM guests.

pm215 Jan 25, 2017

Mmm, in this case I suspect a lot of the benefit is not having to carry around the 90% of QEMU that's completely unused in the x86-KVM usecase (ie all of the emulation, all the devices for non-x86, all the random PCI devices you don't care about, etc) -- you don't have to security-audit that (it won't run if you're using QEMU but you'd have to convince yourself of that). Plus you don't need to care about maintaining compatibility with previous QEMU command lines or migration data formats.

Incidentally for instruction emulation the quality is rather variable: for instance I trust the 64-bit ARM userspace instructions pretty well because we were able to random-instruction-sequence test them and worked all the bugs out; x86 emulation I trust rather less because very few people want to emulate that these days because everybody's got the hardware, so bugs don't get found or fixed. QEMU is an enormous million-line codebase which satisfies multiple use cases several of which barely overlap at all, and its level of robustness and testing depends a lot on which parts you're looking at...

bonzini Jan 25, 2017

They have moved MMIO instruction emulation from KVM to userspace though. This is not yet part of upstream KVM.

I'm not sure how much of the emulation they left in the kernel, but something probably is there because handling simple MOV instructions in the kernel can have a massive effect on performance. Andy, what can you say? :)

amluto Jan 26, 2017

That VMX is seriously unfriendly toward full exits to user mode. I have some ideas to mitigate this. Intel could step up and fix it easily if they cared to.

For those unfamiliar with the issue: in a hypervisor like KVM on arcane hardware like x86, switching from guest mode to host kernel mode is considerably faster than switching from guest mode to host user mode. The reason you'd expect is that guest -> host user involves going to host kernel first and then to host user, but the actual kernel->user transition uses SYSRET and is very fast. The problem is that, in VMX (i.e., Intel's VM extensions), a guest exit kicks you back to the host with a whole bunch of the host control register state badly corrupted. To run normal kernel code, the host only needs to fix up some of the state, but to go all the way to user mode, the kernel needs to fix up the state completely, and Intel never tried to optimize control register programming, so this takes a long time (several thousand cycles, I think). I don't know if SVM (AMD's version) is much better.

As just one example, many things on x86 depend on GDTR, the global descriptor table register. VMX restores the GDTR base address on VM exit, but it doesn't restore the GDTR size. Exits to host user mode need to fix up the size, and writing to GDTR is slow.

How hard would it be to instrument the in-kernel emulation to see which instructions matter for performance? I bet that MOV (reg to/from mem) accounts for almost all of it with ADD and maybe MOVNT making up almost all the balance. Instructions without a memory argument may only matter for exploits and for hosts without unrestricted guest mode.

Hmm. Is SYSCALL still busted? The fact that we emulate things like IRET scares me, too.

Edit: added background info

bonzini Jan 26, 2017

Well I was thinking of andyhonig but I am not surprised to see you here, either...

pm215 Jan 25, 2017

Wait, x86 still requires instruction emulation for non-weirdo non-legacy cases? My vague recollection of the KVM Forum talk G. did was that you don't need it for "modern" guests.

(We were talking about emulation-via-just-interpret-one-instruction in userspace in upstream QEMU the other day -- you'd want it for OSX hypervisor.framework support too, after all. And maybe for the corner cases in TCG where you'd otherwise emulate one instruction and throw away the cached translation immediately.)

bonzini Jan 25, 2017

Apart from the legacy case, you need it for MMIO---KVM for ARM also has a mini parser for LDR/STR instructions.

x86 however has all sorts of wonderful read-modify-write instructions too. You need to support those, but it would still be a small subset of the full x86 instruction set if you all you want to support is processors newer than circa 2010.

pm215 Jan 26, 2017

KVM for ARM doesn't parse instructions -- you can just use the register info the hardware gives you in the syndrome register, which covers everything except oddball cases like trying load-multiple to a device, which doesn't happen in practice and so we don't support it.

jsolson Jan 26, 2017

Yeah, it still gets hit now and then. It should not get hit often in the typical steady state, though, which is why you can punt it to userspace with little performance penalty.

(I work on the custom VMM we run)

strstr Jan 26, 2017

(Echoing Bonzini) You don't need it to be in the kernel for modern guests (performance wise), but you still need it.

strstr Jan 25, 2017

Current implementation has everything in userspace. The perf hit hasn't been compelling enough to make even minor perf improvements.

(Yet-another-Googler: I worked on this and spoke about it at KVM Forum)

bonzini Jan 26, 2017

Interesting. So ioeventfd is also handled in userspace, I guess.

A couple years ago I measured a huge slowdown on userspace vmexits for guests spanning multiple NUMA nodes, because of cacheline bouncing on tsk->sighand->siglock. Maybe you're not using KVM_SET_SIGNAL_MASK.

(Steve, I suppose?)

jsolson Jan 27, 2017

ioeventfd for PIO exits is still handled in the kernel, but that one is easy since it's a dedicated VMEXIT type.

We do very little that typically requires trapping MMIO, particularly in places that are performance sensitive (VIRTIO Net and VIRTIO SCSI do not, and honestly there's not too much that guests do inside GCE that isn't either disk or networking :).

nellydpa Jan 26, 2017

You are right, some instructions are not suitable for userspace due to their performance implications and have to stay in the kernel. We identified a small set of them, for example, some parts of IOAPIC support have to stay put.

bonzini Jan 26, 2017

LAPIC I think? But those get their own special vmexit code so they do not need emulation (on Ivy Bridge or newer Xeons).

IOAPIC is legacy and replaced by MSI. I am surprised you don't use ioeventfd though!

jsolson Jan 27, 2017

> I am surprised you don't use ioeventfd though!

We do in some cases, for both networking and storage. Since our devices are (mostly) VIRTIO (of pre-1.0 vintage), we're using it for OUTs into BAR0 (which again of course get their own VMEXIT and don't require emulation).

By and large we try to elide the exits entirely if we can, naturally, although in today's GCE production environment serialized request/response type workloads will see exits on every packet. Streaming workloads fare better, as we do make use of EVENT_IDX and aggressively trying to find more work before advancing the used.avail_idx field.

koolba OP Jan 25, 2017

> Is it possible you're either underestimating the effort it takes to make QEMU solid or overestimating the effort it takes to write an emulator?

Not just possible, it's highly likely.

walterbell Jan 25, 2017

Are there open-source efforts to create a better emulator?

wmf Jan 25, 2017

Intel is working on QEMU-lite and there are some QEMU alternatives that aren't focused on emulating the full PC architecture like novm and kvmtool.

bonzini Jan 25, 2017

Depends on how you define better. If better means running old games and demos more faithfully, there's DOSEMU and DOSBOX. If better means emulating newer processors on older ones, there's Bochs. None of them supports KVM (except maybe DOSEMU??).

For KVM use, QEMU is pretty much the only choice with support for a wide range of guests, architectures, and features. lkvm (aka kvmtool) doesn't support Windows, UEFI, s390 hosts, live migration, etc.

At the same time, QEMU's binary code translator is improving. As pm215 said elsewhere, 64-bit ARM guest support is much better than x86 support, and we're also working on SMP which is scalable and pretty much state-of-the-art for cross-architecture emulators (of course QEMU is already scalable to multiple guest CPUs when using KVM, but doing it in an emulator is a different story).

rrdharan Jan 25, 2017

> No doubt it's simpler than QEMU but I wonder if adding tests to QEMU, even if they're only for the specific architectures they're running (most likely x86-64) would have been just as usable.

A little later in the post I believe this is somewhat addressed: > QEMU code lacks unit tests and has many interdependencies that would make unit testing extremely difficult.

Personally based on my previous experience at VMware and passing familiarity with QEMU, I think they made the right call.

(I work at Google, but not on the hypervisor)

jsolson Jan 26, 2017

> No doubt it's simpler than QEMU but I wonder if adding tests to QEMU, even if they're only for the specific architectures they're running (most likely x86-64) would have been just as usable.

There are some folks who have this view, but doing it from scratch also has the advantage that it integrates much more cleanly with Google's global shared codebase. There's a huge body of existing work that I can leverage more or less trivially. This includes things like Google's internal metrics and monitoring framework, our RPC framework, etc. Yes, you could bolt these onto the side of qemu, but qemu is a C codebase and most of Google (including the custom VMM described in the article) is not.

Additionally, when software is built using the same style, tools, and best practices as the rest of Google's codebase, it makes it easy for other engineers in the company to contribute. We benefit from Google-wide code cleanups, *SAN analysis tools, codebase-wide refactorings that make code easier to reason about the correctness of, etc.

Several years ago I think the question would've been a lot more difficult to answer, but today I the advantages of the route taken are unambiguous.

(my team owns the virtual network devices visible to GCE VM and the first chunk of the on-host dataplane, one virtual hardware component of the custom VMM we run :)

bonzini Jan 26, 2017

So you don't run vhost? This post is getting interestinger and interestinger! :-)

jsolson Jan 26, 2017

We do not. That it's not vhost you can infer (correctly) from the topology of our VIRTIO Net device; specifics on that front will have to wait for another day, though :)

aseipp Jan 25, 2017

> I've noticed that a lot of projects that do support multiple architectures, particularly obscure ones, tend to find oddball edge cases more easily than those that don't. For example, not assuming the endianness of the CPU arch forces you to get network code to cooperate well.

It's a balancing act, like anything is. Do you add or reject this patch from a contributor? This new feature someone wants, or a bug fixed? Is it better off designed/done in a different way? Is this kind of work maintainable e.g. 3 years from now when I'm not working on it? Can we reliably continue to support these systems or know someone who will care for them? Can we reasonably understand the entire system, and evolve it safely? Do you need more moving parts than are absolutely required? Does that use case even matter, everything else aside? The last one is very important.

Of course there's something to be said about software systems that maintain high quality code while supporting a diverse set of use cases and environments, like you mention.

But -- I'd probably say that, that result? It's likely more a function of focused design than it is a function of "trying to target a lot of architectures". Maybe targeting lots of architectures was a design goal, but nonetheless, the designing aspect is what's important. No amount of portability can fix fundamental misunderstandings of what you're trying to implement, of course. And that's where a lot of problems can creep in.

In this case, they have a very clear view of what they want, and a lot of what QEMU does is very irrelevant. It may be part of QEMU's design to be portable, but ultimately it is a lot of unnecessary, moving parts. You can cut down its scope dramatically with this knowledge -- from a security POV, that's very often going to be a win, to remove that surface area.

(Also, I'm definitely not saying QEMU is bloated or something, either. It's great software and I use it every day, just to be clear.)

mappu Jan 25, 2017

The KVM API is quite simple: https://lwn.net/Articles/658511/

This item has no comments currently.