Hi Alex, On Mon, May 02, 2022 at 08:57:01PM +0200, Alexander Graf wrote: > So far I see little that would block your patch? It seems to go into a > good direction from all I can tell. With kernel code for the kernel, if it's good enough, it's good enough, and can be made better later. With kernel code for userspace, it might not be possible to make it better later, so if we've got a plethora of options, it makes the incrementalism I was enthusiastic about a little bit more risky. Here's where we're at: - /proc/sys has this easy to use poll() interface [which currently breaks normal poll() semantics; see patch 1/2]. It'd be nice to use this as an async notifier interface. - A race-free mmap()d mechanism needs virtual hardware support to be useful, so I don't want to add it now. But when we do have it, where should it be added? * Atop the same /proc/sys file that this patch has returning -ENODATA at the moment? /proc/sys doesn't do mmap now so that'd be some plumbing work, which would be pretty odd for the sysctl abstraction. * Atop some new file elsewhere? Not hard to do, but also begs the question of which driver should do it, where the file should be, and a whole host of bikesheddy questions. * In the VDSO? This would make the most sense to me if we're ever going to do a VDSO RNG, and to design it along with the userspace RNG stuff people think they want, even if it's maybe not the best idea in the end. This is a huge can of worms. Termites, even, and they'll eat through the beams of your house. * Nowhere, because even though this race-free mechanism is attractive, it's way too hard to program for, and nobody is going to use it, and not many people care /that/ much about having this mechanism being 100% race-free. Colm said Amazon doesn't, for example, since it uses quiescent states. And apparently Microsoft doesn't care much either, and they don't even use quiescent states. Maybe I'm the only one raising a fuss about the race window (while hypocritically also not being very eager to write code around an interface that has to check a variable all the time). - A file in /proc/sys can return -ENODATA, as this patch has it doing, or it can return something else: * A counter. Flawed for reasons discussed elsewhere. * A UUID, just like /proc/sys/kernel/random/boot_id, except we'd name it generation_id or something and it'd change on VM fork. An earlier version of this patch did that. * -ENODATA, as this patch does, because each caller can get a new unique value with getrandom(), as there's no point in having some global identifier like with boot_id. This is what I'm leaning toward. It seems like getrandom() would handle Lennart's case, Go's case, s2n's case, and we don't need a particular ID. * The direct value of the ACPI vmgenid value. Discussed in earlier emails; I don't want to do this. - You don't care about race-free, but you do care about mmap(), for "programming reasons" in s2n. I'm not yet sympathetic to this case :-P. The point of mmap() should be for the race-free stuff. Otherwise poll() should do the job, and if you can't poll, you can hack around it with another process or a thread or whatever other trick. Your systemd proposal with a file was just punting the idea elsewhere in userspace (which I think is a bit ugly too...). This state of affairs inclines me to: - Have a /proc/sys/kernel/random/fork_event that has a poll() interface with reads returning -ENODATA (this patch). - Try to convince whomever that poll()ing on fork_event and then calling getrandom() for a process-local snapshot ID is better than poll()ing on and subsequently read()ing a /proc/sys/kernel/random/generation_id that returns a global UUID like boot_id. [Lennart - I'm specifically interested in trying to convince you of this. Thoughts?] - Given that you are okay with an async mechanism, figure out how poll() can work for s2n rather than mmap(). - Not really pursue the race-free mmap() thing unless Microsoft goes full steam ahead with it because it's complicated to program for and maybe (or maybe not? I don't know? data please?) a theoretical concern. > You block the thread on poll, so the only option is a thread. I was > until now always under the working assumption that we can't do this in a > thread because you don't want your single threaded application to turn > into a pthreaded one, but you make me wonder. Let me check with Torben > tomorrow. Can't you use raw clone() and just have a super dirty thread that shares a single page mapping and nothing else? Poll on the pidfd of the parent and the fork_event fd, quit on the former, and ++ on the latter. > >> 2) A way to notify larger applications (think Java here) that a system > >> is going to be suspended soon so it can wipe PII before it gets cloned > >> for example. > > Suspension, like S3 power notification stuff? Talk to Rafael about that; > > > Whether you go running -> S3 -> clone or you go running -> paused -> > clone is an implementation detail I'm not terribly worried about. Users > can do either, because on both cases the VM is in paused state. Ahh, I see. https://lore.kernel.org/lkml/20220501123849.3858-1-Jason@xxxxxxxxx/ might interest you. Jason