
They have moved MMIO instruction emulation from KVM to userspace though. This is not yet part of upstream KVM.

I'm not sure how much of the emulation they left in the kernel, but something probably is there because handling simple MOV instructions in the kernel can have a massive effect on performance. Andy, what can you say? :)


amluto
That VMX is seriously unfriendly toward full exits to user mode. I have some ideas to mitigate this. Intel could step up and fix it easily if they cared to.

For those unfamiliar with the issue: in a hypervisor like KVM on arcane hardware like x86, switching from guest mode to host kernel mode is considerably faster than switching from guest mode to host user mode. You'd expect the reason to be that guest -> host user involves going through the host kernel first and then to host user, but the actual kernel->user transition uses SYSRET and is very fast. The real problem is that, in VMX (i.e., Intel's VM extensions), a guest exit kicks you back to the host with a whole bunch of the host control register state badly corrupted. To run normal kernel code, the host only needs to fix up some of that state, but to go all the way to user mode, the kernel needs to fix it up completely, and Intel never tried to optimize control register programming, so this takes a long time (several thousand cycles, I think). I don't know whether SVM (AMD's version) is much better.

As just one example, many things on x86 depend on GDTR, the global descriptor table register. VMX restores the GDTR base address on VM exit, but it doesn't restore the GDTR size. Exits to host user mode need to fix up the size, and writing to GDTR is slow.

How hard would it be to instrument the in-kernel emulation to see which instructions matter for performance? I bet that MOV (reg to/from mem) accounts for almost all of it with ADD and maybe MOVNT making up almost all the balance. Instructions without a memory argument may only matter for exploits and for hosts without unrestricted guest mode.
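(A sketch of the kind of counting I mean -- the hook and names below are made up for illustration, not actual KVM internals: just bump a per-opcode counter every time the emulator decodes something, then see what dominates.)

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical instrumentation sketch: count which opcode bytes the
 * MMIO emulator actually decodes, to test the conjecture that MOV
 * accounts for almost everything. record_emulated_opcode() would be
 * called from the emulator's decode path; these are made-up names. */
static uint64_t opcode_hits[256];

static void record_emulated_opcode(uint8_t opcode)
{
        opcode_hits[opcode]++;
}

static void dump_opcode_histogram(void)
{
        for (int op = 0; op < 256; op++)
                if (opcode_hits[op])
                        printf("opcode %02x: %llu hits\n",
                               op, (unsigned long long)opcode_hits[op]);
}
```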

Hmm. Is SYSCALL still busted? The fact that we emulate things like IRET scares me, too.

Edit: added background info

bonzini OP
Well I was thinking of andyhonig but I am not surprised to see you here, either...
Wait, x86 still requires instruction emulation for non-weirdo non-legacy cases? My vague recollection of the KVM Forum talk G. did was that you don't need it for "modern" guests.

(We were talking about emulation-via-just-interpret-one-instruction in userspace in upstream QEMU the other day -- you'd want it for OSX hypervisor.framework support too, after all. And maybe for the corner cases in TCG where you'd otherwise emulate one instruction and throw away the cached translation immediately.)

bonzini OP
Apart from the legacy case, you need it for MMIO---KVM for ARM also has a mini parser for LDR/STR instructions.

x86, however, has all sorts of wonderful read-modify-write instructions too. You need to support those, but it would still be a small subset of the full x86 instruction set if all you want to support is processors newer than circa 2010.
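(Roughly what the "small subset" decoder looks like -- this sketch only recognizes plain 32-bit MOV between a register and memory; a real one also has to handle prefixes, SIB bytes, displacements, and the read-modify-write forms. The struct and function are illustrative, not KVM's actual code.)

```c
#include <stdbool.h>
#include <stdint.h>

struct mmio_op {
        bool is_store;   /* guest writes to the device */
        uint8_t reg;     /* register operand from ModRM bits 5:3 */
};

/* Decode the two plain MOV forms: 0x89 (MOV r/m32, r32, a store)
 * and 0x8B (MOV r32, r/m32, a load). Everything else is rejected. */
static bool decode_simple_mov(const uint8_t *insn, struct mmio_op *op)
{
        uint8_t opcode = insn[0];

        if (opcode != 0x89 && opcode != 0x8B)
                return false;

        op->is_store = (opcode == 0x89);
        op->reg = (insn[1] >> 3) & 0x7;  /* ModRM.reg field */
        return true;
}
```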

KVM for ARM doesn't parse instructions -- you can just use the register info the hardware gives you in the syndrome register, which covers everything except oddball cases like trying load-multiple to a device, which doesn't happen in practice and so we don't support it.
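(Concretely, "the register info the hardware gives you" means the data-abort syndrome fields in ESR_EL2: access size, transfer register, and direction, per the ARMv8 ARM. A rough sketch, not KVM's actual code:)

```c
#include <stdbool.h>
#include <stdint.h>

struct mmio_access {
        unsigned size;   /* bytes: 1, 2, 4 or 8 */
        unsigned reg;    /* Xt, the transfer register */
        bool is_write;
};

/* Pull the MMIO access description straight out of the ESR_EL2
 * syndrome for a data abort -- no instruction fetch or parse needed.
 * ISV is clear for the oddball cases (e.g. load/store-multiple),
 * which is exactly what KVM for ARM declines to support. */
static bool decode_esr_mmio(uint64_t esr, struct mmio_access *acc)
{
        if (!(esr & (1ull << 24)))       /* ISV: syndrome valid? */
                return false;

        acc->size = 1u << ((esr >> 22) & 0x3);  /* SAS: 2^n bytes */
        acc->reg = (esr >> 16) & 0x1f;          /* SRT */
        acc->is_write = esr & (1ull << 6);      /* WnR */
        return true;
}
```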
jsolson
Yeah, it still gets hit now and then. It should not get hit often in the typical steady state, though, which is why you can punt it to userspace with little performance penalty.

(I work on the custom VMM we run)

strstr
(Echoing Bonzini) You don't need it to be in the kernel for modern guests (performance-wise), but you still need it.
strstr
Current implementation has everything in userspace. The perf hit hasn't been compelling enough to pursue even minor perf improvements.

(Yet-another-Googler: I worked on this and spoke about it at KVM Forum)

bonzini OP
Interesting. So ioeventfd is also handled in userspace, I guess.

A couple years ago I measured a huge slowdown on userspace vmexits for guests spanning multiple NUMA nodes, because of cacheline bouncing on tsk->sighand->siglock. Maybe you're not using KVM_SET_SIGNAL_MASK.

(Steve, I suppose?)

jsolson
ioeventfd for PIO exits is still handled in the kernel, but that one is easy since it's a dedicated VMEXIT type.

We do very little that typically requires trapping MMIO, particularly in places that are performance-sensitive (VIRTIO Net and VIRTIO SCSI do not, and honestly there's not too much that guests do inside GCE that isn't either disk or networking :).

nellydpa
You are right: some instructions are not suitable for userspace because of their performance implications and have to stay in the kernel. We identified a small set of them; for example, some parts of IOAPIC support have to stay put.
bonzini OP
LAPIC I think? But those get their own special vmexit code so they do not need emulation (on Ivy Bridge or newer Xeons).

IOAPIC is legacy and replaced by MSI. I am surprised you don't use ioeventfd though!

jsolson
> I am surprised you don't use ioeventfd though!

We do in some cases, for both networking and storage. Since our devices are (mostly) VIRTIO (of pre-1.0 vintage), we're using it for OUTs into BAR0 (which again of course get their own VMEXIT and don't require emulation).

By and large we try to elide the exits entirely when we can, naturally, although in today's GCE production environment serialized request/response type workloads will see exits on every packet. Streaming workloads fare better, as we make use of EVENT_IDX and aggressively try to find more work before advancing the used.avail_idx field.
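(For reference, the EVENT_IDX suppression boils down to the VIRTIO spec's vring_need_event check: only kick the other side if it asked to be woken at an index inside the range you just published. Batching more work before publishing is what lets streaming workloads skip most exits.)

```c
#include <stdbool.h>
#include <stdint.h>

/* The notification-suppression test from the VIRTIO spec. All
 * arithmetic is mod 2^16, so ring-index wraparound works out:
 * returns true iff event_idx sits in (old_idx, new_idx]. */
static bool vring_need_event(uint16_t event_idx,
                             uint16_t new_idx, uint16_t old_idx)
{
        return (uint16_t)(new_idx - event_idx - 1) <
               (uint16_t)(new_idx - old_idx);
}
```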
