On Wed, Feb 24, 2021 at 02:45:03PM +0100, Alexander Graf wrote:
> > Above should try harder to explain what are the things that need to be
> > scrubbed and why. For example, I personally don't really know what is
> > the OpenSSL session token example and what makes it vulnerable. I guess
> > snapshots can attack each other?
> >
> >
> > Here's a simple example of a workflow that submits transactions
> > to a database and wants to avoid duplicate transactions.
> > This does not require overseer magic. It does however require
> > a correct genid from hypervisor, so no mmap tricks work.
> >
> >
> > 	int genid, oldgenid;
> > 	read(&genid);
> > start:
> > 	oldgenid = genid;
> > 	transid = submit transaction
> > 	read(&genid);
> > 	if (genid != oldgenid) {
> > 		revert transaction (transid);
> > 		goto start;
> > 	}
>
> I'm not sure I fully follow. For starters, if this is a VM local database, I
> don't think you'd care about the genid. If it's a remote database, your
> connection would get dropped already at the point when you clone/resume,
> because TCP and your connection state machine will get really confused when
> you suddenly have a different IP address or two consumers of the same stream
> :).
>
> But for the sake of the argument, let's assume you can have a connectionless
> database connection that maintains its own connection uniqueness logic.

Right. E.g. not uncommon with REST APIs. They survive disconnect easily
and use cookies or such.

> That
> database connector would need to understand how to abort the connection (and
> thus the transaction!) when the generation changes.

The point is that instead of all that, you discover that the transaction
is a duplicate and revert it.

> And that's logic you
> would do with the read/write/notify mechanism. So your main loop would check
> for reads on the genid fd and after sending a connection termination, notify
> the overlord that it's safe to use the VM now.
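
To make the retry workflow quoted above concrete, here is a minimal,
self-contained C sketch of the same control flow. Everything in it is a
stand-in I made up for illustration: struct sim, read_genid(),
submit_transaction() and revert_transaction() simulate the generation
counter and the database, they are not the sysgenid API or any real
database client. It only shows the "submit, re-read genid, revert and
retry on change" shape, with a bounded retry count instead of the
unbounded goto.

```c
#include <assert.h>

/* Hypothetical simulation state: NOT a real sysgenid or database API. */
struct sim {
	unsigned genid;      /* current simulated system generation */
	unsigned bump_after; /* bump genid during the Nth submit (0 = never) */
	unsigned submits;    /* number of submit calls so far */
	int committed;       /* transactions currently recorded */
};

/* Stand-in for reading the generation counter. */
static unsigned read_genid(const struct sim *s)
{
	return s->genid;
}

/* Stand-in for submitting a transaction; may "race" with a snapshot. */
static int submit_transaction(struct sim *s)
{
	s->committed++;
	s->submits++;
	if (s->bump_after && s->submits == s->bump_after)
		s->genid++; /* pretend a snapshot/resume happened here */
	return s->committed; /* used as the transaction id */
}

/* Stand-in for reverting a previously submitted transaction. */
static void revert_transaction(struct sim *s, int transid)
{
	(void)transid;
	s->committed--;
}

/*
 * Submit one transaction, reverting and retrying if the generation
 * changed while it was in flight. Returns the number of attempts
 * used, or -1 if max_attempts was exhausted.
 */
static int submit_once(struct sim *s, int max_attempts)
{
	unsigned genid = read_genid(s);

	assert(max_attempts > 0);
	for (int attempt = 1; attempt <= max_attempts; attempt++) {
		unsigned oldgenid = genid;
		int transid = submit_transaction(s);

		genid = read_genid(s);
		if (genid == oldgenid)
			return attempt;
		revert_transaction(s, transid);
	}
	return -1;
}
```

With bump_after == 0 the first attempt sticks; with bump_after == 1 the
first attempt is reverted and the second one goes through, leaving
exactly one committed transaction. Whether the revert is cheaper than
aborting the connection is of course the open question in this thread.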
>
> The OpenSSL case (with mmap) is for libraries that are stateless and can not
> guarantee that they receive a genid notification event timely.
>
> Since you asked, this is mainly important for the PRNG. Imagine an https
> server. You create a snapshot. You resume from that snapshot. OpenSSL is
> fully initialized with a user space PRNG randomness pool that it considers
> safe to consume. However, that means your first connection after resume will
> be 100% predictable randomness wise.

I wonder whether something similar is possible here. I.e. use the secret
to encrypt stuff but check the gen ID before actually sending data.
If it changed, re-encrypt. Hmm?

>
> The mmap mechanism allows the PRNG to reseed after a genid change. Because
> we don't have an event mechanism for this code path, that can happen minutes
> after the resume. But that's ok, we "just" have to ensure that nobody is
> consuming secret data at the point of the snapshot.

Something I am still not clear on is whether it's really important to
skip the system call here. If not, I think it's prudent to just stick to
read for now; I think there's a slightly lower chance that it will get
misused. mmap, which gives you a laggy gen id value, really seems like
it would be hard to use correctly.

> > >
> > > +Simplifying assumption - safety prerequisite
> > > +--------------------------------------------
> > > +
> > > +**Control the snapshot flow**, disallow snapshots coming at arbitrary
> > > +moments in the workload lifetime.
> > > +
> > > +Use a system-level overseer entity that quiesces the system before
> > > +snapshot, and post-snapshot-resume oversees that software components
> > > +have readjusted to new environment, to the new generation. Only after,
> > > +will the overseer un-quiesce the system and allow active workloads.
> > > +
> > > +Software components can choose whether they want to be tracked and
> > > +waited on by the overseer by using the ``SYSGENID_SET_WATCHER_TRACKING``
> > > +IOCTL.
> > > +
> > > +The sysgenid framework standardizes the API for system software to
> > > +find out about needing to readjust and at the same time provides a
> > > +mechanism for the overseer entity to wait for everyone to be done, the
> > > +system to have readjusted, so it can un-quiesce.
> > > +
> > > +Example snapshot-safe workflow
> > > +------------------------------
> > > +
> > > +1) Before taking a snapshot, quiesce the VM/container/system. Exactly
> > > +   how this is achieved is very workload-specific, but the general
> > > +   description is to get all software to an expected state where their
> > > +   event loops dry up and they are effectively quiesced.
> >
> > If you have ability to do this by communicating with
> > all processes e.g. through a unix domain socket,
> > why do you need the rest of the stuff in the kernel?
> > Quiescing is a harder problem than waking up.
>
> That depends. Think of a typical VM workload. Let's take the web server
> example again. You can preboot the full VM and snapshot it as is. As long as
> you don't allow any incoming connections, you can guarantee that the system
> is "quiesced" well enough for the snapshot.

Well, you can use a firewall or such to block incoming packets, but I am
not at all sure that means e.g. all socket buffers are empty.

> This is really what this bullet point is about. The point is that you're not
> consuming randomness you can't reseed asynchronously (see the above OpenSSL
> PRNG example).
>
>
> Alex
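
For reference, the PRNG case discussed above (a user space pool that
checks a generation counter before handing out randomness and reseeds
when it changed) can be sketched roughly as follows. This is only an
illustration under my own assumptions: struct rng_pool, pool_init(),
pool_next() and demo_seed() are hypothetical names, the generation
counter is a plain variable rather than an mmap'ed page, demo_seed() is
a deterministic placeholder where real code would pull from the kernel
(e.g. getrandom(2)), and the xorshift generator is a toy, not something
OpenSSL uses and not cryptographically secure.

```c
#include <stdint.h>

typedef uint64_t (*seed_fn)(void);

/* Hypothetical user space randomness pool tied to a generation counter. */
struct rng_pool {
	const volatile uint32_t *gen; /* where the current generation lives */
	uint32_t seen_gen;            /* generation the pool was seeded under */
	uint64_t state;               /* toy generator state, must be nonzero */
	seed_fn seed;                 /* source of fresh seed material */
	unsigned reseeds;             /* reseeds triggered by a generation change */
};

/* Placeholder seed source: returns a distinct nonzero value per call. */
static uint64_t demo_seed(void)
{
	static uint64_t counter;

	return 0x9e3779b97f4a7c15ULL * ++counter;
}

static void pool_init(struct rng_pool *p, const volatile uint32_t *gen,
		      seed_fn seed)
{
	p->gen = gen;
	p->seed = seed;
	p->reseeds = 0;
	p->seen_gen = *gen;
	p->state = seed();
}

/* Return the next value, reseeding first if the generation changed. */
static uint64_t pool_next(struct rng_pool *p)
{
	uint32_t now = *p->gen;
	uint64_t x;

	if (now != p->seen_gen) {
		p->state = p->seed();
		p->seen_gen = now;
		p->reseeds++;
	}

	/* xorshift64 step: toy generator for illustration only */
	x = p->state;
	x ^= x << 13;
	x ^= x >> 7;
	x ^= x << 17;
	p->state = x;
	return x;
}
```

Two copies of the same pool (standing in for two clones of a snapshot)
produce identical output until the generation changes, and diverge after
it. Note that the sketch does nothing about the window between the check
and the use of the value, which is exactly the concern with a laggy
generation value raised earlier in this mail.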