On Mon, Nov 7, 2011 at 12:03 PM, Anthony Liguori <anthony@xxxxxxxxxxxxx> wrote:
> On 11/07/2011 11:52 AM, Sasha Levin wrote:
>>
>> Hi Anthony,
>>
>> Thank you for your comments!
>>
>> On Mon, 2011-11-07 at 11:37 -0600, Anthony Liguori wrote:
>>>
>>> On 11/06/2011 02:40 PM, Sasha Levin wrote:
>>>>
>>>> Hi all,
>>>>
>>>> I'm planning on doing a small fork of the KVM tool to turn it into a
>>>> 'Secure KVM' enabled hypervisor. Now you probably ask yourself, Huh?
>>>>
>>>> The idea was discussed briefly a couple of months ago, but never got off
>>>> the ground - which is a shame IMO.
>>>>
>>>> It's easy to explain the problem: If an attacker finds a security hole
>>>> in any of the devices which are exposed to the guest, the attacker would
>>>> be able to either crash the guest, or possibly run code on the host
>>>> itself.
>>>>
>>>> The solution is also simple to explain: Split the devices into different
>>>> processes and use seccomp to sandbox each device into the exact set of
>>>> resources it needs to operate, nothing more and nothing less.
>>>>
>>>> Since I'll be basing it on the KVM tool, which doesn't really emulate
>>>> that many legacy devices, I'll focus first on the virtio family for the
>>>> sake of simplicity (and covering 90% of the options).
>>>>
>>>> This is my basic overview of how I'm planning on implementing the
>>>> initial POC:
>>>>
>>>> 1. First I'll focus on the simple virtio-rng device, it's simple enough
>>>> to allow us to focus on the aspects which are important for the POC
>>>> while still covering most bases (i.e. sandbox to single file
>>>> - /dev/urandom and such).
>>>>
>>>> 2. Do it on a one process per device concept, where for each device
>>>> (notice - not device *type*) requested, a new process which handles it
>>>> will be spawned.
>>>>
>>>> 3. That process will be limited exactly to the resources it needs to
>>>> operate, for example - if we run a virtio-blk device, it would be able
>>>> to access only the image file which it should be using.
>>>>
>>>> 4. Connection between hypervisor and devices will be based on unix
>>>> sockets, this should allow for better separation compared to other
>>>> approaches such as shared memory.
>>>>
>>>> 5. While performance is an aspect, complete isolation is more important.
>>>> Security is primary, performance is secondary.
>>>>
>>>> 6. Share as much code as possible with current implementation of virtio
>>>> devices, make it possible to run virtio devices either like it's being
>>>> done now, or by spawning them as separate processes - the amount of
>>>> specific code for the separate process case should be minimal.
>>>>
>>>>
>>>> That's all I have for now, comments are *very* welcome.
>>>
>>> I thought about this a bit and have some ideas that may or may not help.
>>>
>>> 1) If you add device save/load support, then it's something you can
>>> potentially use to give yourself quite a bit of flexibility in changing
>>> the sandbox. At any point in run time, you can save the device model's
>>> state in the sandbox, destroy the sandbox, and then build a new sandbox
>>> and restore the device to its former state.
>>>
>>> This might turn out to be very useful in supporting things like device
>>> hotplug and/or memory hot plug.
>>>
>>> 2) I think it's largely possible to implement all device emulation
>>> without doing any dynamic memory allocation. Since memory allocation DoS
>>> is something you have to deal with anyway, I suspect most device
>>> emulation already uses a fixed amount of memory per device. This can
>>> potentially dramatically simplify things.
>>>
>>> 3) I think virtio can/should be used as a generic "backend to frontend"
>>> transport between the device model and the tool.
>>
>> virtio requires server and client to have shared memory, so if we
>> already go with shared memory we can just let the device manage the
>> actual virtio driver directly, no?
>
> Let's say you're implementing an IDE device model in the sandbox. You can
> try to implement the block layer in the sandbox but I think that quickly
> will become too difficult.
>
> You can do as Avi suggested and do all DMA accesses from the IDE device
> model as RPCs, or you can map guest memory as shared memory and utilize (1)
> in order to change that mapping as you need to.
>
> At some point, you end up with a struct iovec and an offset that you want to
> read/write to the virtual disk. You need a way to send that to the
> "frontend" that will then handle that as a raw/qcow2 request.
>
> Well, virtio is great at doing exactly that :-) So if you increase your
> shared memory to have a little bit extra to stick another vring, you can use
> that for device model -> front end communication without paying an extra
> memcpy.
>
> For notifications, the easiest thing to do is set up an "event channel"
> bitmap and use a single eventfd to multiplex that event channel bitmap.
> This is pretty much how Xen works btw. A single interrupt is reserved and
> a bitmap is used to dispatch the actual events.
>
> So the sandbox loop would look like:
>
> void main() {
>     setup_devices();
>
>     read_from_event_channel(main_channel);
>     for i in vrings:
>         check_vring_notification(i);
> }
>
> One vring would be used for dispatching PIO/MMIO. The remaining vrings
> could be used for anything really.
>
> Like I mentioned elsewhere, just think of the sandbox as just an extension
> of the guest's firmware. The purpose of the sandbox is to reduce a very
> complicated, legacy device model, into a very simple and easy to audit,
> purely virtio based model.
>
>>
>> Also, things like interrupts would also require some sort of a different
>> IPC, which would complicate things a bit.
>>
>>
>>> 4) Lack of select() is really challenging. I understand why it's not
>>> there since it can technically be emulated but it seems like a no-risk
>>> syscall to whitelist and it would make programming in a sandbox so much
>>> easier. Maybe Andrea has some comments here? I might be missing
>>> something here.
>>
>> There are several of these which would be nice to have, and if we can
>> get seccomp filters we have good flexibility with which APIs we allow
>> for each device.
>
> Yeah, filters are nice but I fear that you lose some of the PR benefits of
> sandboxing. Once the first application claims to use sandboxing, whitelists
> a syscall it shouldn't, you'll start getting slashdot articles about "Linux
> sandbox broken, Linux security hopelessly broken". Then what's the point of
> all of this?

Approaching the limit: since no security code/infrastructure is
perfect, then what's the point of all of this? :)

When I've spoken about seccomp_filter, I've tried to avoid the word
'sandbox' as that comes with more baggage than just creating a means
of reducing the kernel's attack surface. Ideally, seccomp_filter just
fills the void between read/write/sigreturn/exit and
all-the-system-calls: Don't want select? ok. Want epoll? ok. . .

It does mean that developers will have to determine the tradeoffs
themselves (or with some general guidance). But, I expect there'd be
quite a few more consumers of seccomp if it was possible to not need
to emulate select() behavior or if, for example, brk() was allowed.

cheers!
will
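
For reference, the event-channel scheme Anthony describes above (a pending
bitmap multiplexed over a single eventfd) could be sketched roughly as
follows. This is only an illustration in Python, not code from the KVM tool
or QEMU. The EventChannels class, the dispatch() helper and the channel
numbers are hypothetical names. The bitmap here lives in one process rather
than in shared memory, and a pipe stands in for eventfd where os.eventfd is
not available (it needs Linux and Python 3.10 or later).

```python
import os
import struct


class EventChannels:
    """Pending-event bitmap multiplexed over one wakeup descriptor (sketch)."""

    def __init__(self):
        self.pending = 0  # bit N set means channel N has an event waiting
        if hasattr(os, "eventfd"):
            # One eventfd serves as both the read and the write end.
            self._rfd = self._wfd = os.eventfd(0)
        else:
            # Assumption: a pipe is used as a stand-in for eventfd.
            self._rfd, self._wfd = os.pipe()

    def notify(self, channel):
        """Mark a channel pending, then kick the single descriptor."""
        self.pending |= 1 << channel
        # eventfd expects an 8-byte native-endian counter increment.
        os.write(self._wfd, struct.pack("=Q", 1))

    def drain(self):
        """Block until kicked, then return and clear the pending bitmap."""
        os.read(self._rfd, 8)
        bits, self.pending = self.pending, 0
        return bits


def dispatch(bits, handlers):
    """Call the handler for every set bit, lowest channel first."""
    fired = []
    channel = 0
    while bits:
        if bits & 1:
            handlers[channel]()
            fired.append(channel)
        bits >>= 1
        channel += 1
    return fired


# Hypothetical layout: channel 0 carries PIO/MMIO, channel 3 a block vring.
log = []
handlers = {
    0: lambda: log.append("pio/mmio"),
    3: lambda: log.append("block request"),
}

chan = EventChannels()
chan.notify(3)
chan.notify(0)
fired = dispatch(chan.drain(), handlers)
```

Two notifications result in a single wakeup here, and the bitmap then says
which channels actually need servicing. A real split between device process
and tool would need the bitmap in shared memory with atomic bit operations,
which this sketch does not attempt.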