On 04/30/2010 11:24 AM, Avi Kivity wrote:
>> I'd argue the opposite. There's no point in having the host do swapping
>> on behalf of guests if guests can do it themselves; it's just a
>> duplication of functionality.
>
> The problem with relying on the guest to swap is that it's voluntary.
> The guest may not be able to do it. When the hypervisor needs memory
> and guests don't cooperate, it has to swap.

Or fail whatever operation it's trying to do. You can only use
overcommit to fake unlimited resources for so long before you need a
government bailout.

>> You end up having two IO paths for each
>> guest, and the resulting problems in trying to account for the IO,
>> rate-limit it, etc. If you can simply say "all guest disk IO happens
>> via this single interface", it's much easier to manage.
>>
>
> With tmem you have to account for that memory, make sure it's
> distributed fairly, claim it back when you need it (requiring guest
> cooperation), live migrate and save/restore it. It's a much larger
> change than introducing a write-back device for swapping (which has
> the benefit of working with unmodified guests).

Well, with caveats. To be useful with migration the backing store needs
to be shared like other storage, so you can't use a specific host-local
fast (SSD) swap device. And because the device is backed by pagecache
with delayed writes, it has much weaker integrity guarantees than a
normal device, so you need to be sure that the guests are only going to
use it for swap. Sure, these are deployment issues rather than code
ones, but they're still issues.

>> If frontswap has value, it's because it's providing a new facility to
>> guests that doesn't already exist and can't be easily emulated with
>> existing interfaces.
>>
>> It seems to me the great strengths of the synchronous interface are:
>>
>> * it matches the needs of an existing implementation (tmem in Xen)
>> * it is simple to understand within the context of the kernel code
>>   it's used in
>>
>> Simplicity is important, because it allows the mm code to be understood
>> and maintained without having to have a deep understanding of
>> virtualization.
>
> If we use the existing paths, things are even simpler, and we match
> more needs (hypervisors with dma engines, the ability to reclaim
> memory without guest cooperation).

Well, you still can't reclaim memory; you can write it out to storage.
It may be cheaper/byte, but it's still a resource dedicated to the
guest. But that's just a consequence of allowing overcommit, and to
what extent you're happy to allow it.

What kind of DMA engine do you have in mind? Are there practical
memory->memory DMA engines that would be useful in this context?

>>> At this point we're back with the ordinary swap API. Simply have your
>>> host expose a device which is write cached by host memory, you'll have
>>> all the benefits of frontswap with none of the disadvantages, and with
>>> no changes to guest code.
>>>
>> Yes, that's comfortably within the "guests page themselves" model.
>> Setting up a block device for the domain which is backed by pagecache
>> (something we usually try hard to avoid) is pretty straightforward. But
>> it doesn't work well for Xen unless the blkback domain is sized so that
>> it has all of Xen's free memory in its pagecache.
>>
>
> Could be easily achieved with ballooning?

It could be achieved with ballooning, but it isn't completely trivial.
It wouldn't work terribly well with a driver domain setup, unless all
the swap-devices turned out to be backed by the same domain (which in
turn would need to know how to balloon in response to overall system
demand).
The partitioning of the pagecache among the guests would be at the
mercy of the mm subsystem rather than subject to any specific QoS or
other per-domain policies you might want to put in place (maybe fiddling
around with [fm]advise could get you some control over that).

>
>> That said, it does concern me that the host/hypervisor is left holding
>> the bag on frontswapped pages. An evil/uncooperative/lazy guest can just
>> pump a whole lot of pages into the frontswap pool and leave them there. I
>> guess this is mitigated by the fact that the API is designed such that
>> they can't update or read the data without also allowing the hypervisor
>> to drop the page (updates can fail destructively, and reads are also
>> destructive), so the guest can't use it as a clumsy extension of their
>> normal dedicated memory.
>>
>
> Eventually you'll have to swap frontswap pages, or kill uncooperative
> guests. At which point all of the simplicity is gone.

Killing guests is pretty simple. Presumably the OOM killer will get kvm
processes like anything else?

    J