On Thu, May 11, 2023 at 1:40 PM Axel Rasmussen <axelrasmussen@xxxxxxxxxx> wrote:
>
> On Thu, May 11, 2023 at 1:29 PM Mike Kravetz <mike.kravetz@xxxxxxxxxx> wrote:
> >
> > On 05/11/23 11:24, Axel Rasmussen wrote:

Apologies for the noise, I should have CC'ed +Jiaqi on this series too,
since he is working on other parts of the memory poisoning / recovery
stuff internally.

> > > The basic idea here is to "simulate" memory poisoning for VMs. A VM
> > > running on some host might encounter a memory error, after which some
> > > page(s) are poisoned (i.e., future accesses SIGBUS). They expect that
> > > once poisoned, pages can never become "un-poisoned". So, when we live
> > > migrate the VM, we need to preserve the poisoned status of these pages.
> > >
> > > When live migrating, we try to get the guest running on its new host as
> > > quickly as possible. So, we start it running before all memory has been
> > > copied, and before we're certain which pages should be poisoned or not.
> > >
> > > So the basic way to use this new feature is:
> > >
> > > - On the new host, the guest's memory is registered with userfaultfd, in
> > >   either MISSING or MINOR mode (doesn't really matter for this purpose).
> > > - On any first access, we get a userfaultfd event. At this point we can
> > >   communicate with the old host to find out if the page was poisoned.
> >
> > Just curious, what is this communication channel with the old host?
>
> James can probably describe it in more detail / more correctly than I
> can. My (possibly wrong :) ) understanding is:
>
> On the source machine we maintain a bitmap indicating which pages are
> clean or dirty (meaning, modified after the initial "precopy" of
> memory to the target machine) or poisoned. Eventually the entire
> bitmap is sent to the target machine, but this takes some time (maybe
> seconds on large machines).
> After this point though we have all the
> information we need, we no longer need to communicate with the source
> to find out the status of pages (although there may still be some
> memory contents to finish copying over).
>
> In the meantime, I think the target machine can also ask the source
> machine about the status of individual pages (for quick on-demand
> paging).
>
> As for the underlying mechanism, it's an internal protocol but the
> publicly-available thing it's most similar to is probably gRPC [1]. At
> a really basic level, we send binary serialized protocol buffers [2]
> over the network in a request / response fashion.
>
> [1] https://grpc.io/
> [2] https://protobuf.dev/
>
> > --
> > Mike Kravetz
> >
> > > - If so, we can respond with a UFFDIO_SIGBUS - this places a swap marker
> > >   so any future accesses will SIGBUS. Because the pte is now "present",
> > >   future accesses won't generate more userfaultfd events, they'll just
> > >   SIGBUS directly.
> > >
> > > UFFDIO_SIGBUS does not handle unmapping previously-present PTEs. This
> > > isn't needed, because during live migration we want to intercept
> > > all accesses with userfaultfd (not just writes, so WP mode isn't useful
> > > for this). So whether minor or missing mode is being used (or both), the
> > > PTE won't be present in any case, so handling that case isn't needed.
> > >
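
For anyone following along, the target-side bookkeeping described above
(a per-page clean / dirty / poisoned map, plus asking the source about
individual pages until the full map has arrived) could be sketched
roughly as below. This is only an illustration under assumptions: the
names (status_map, on_first_access, the ACT_* actions), the
2-bits-per-page encoding, and the choice to re-fetch dirty pages are all
made up here, not taken from the internal implementation or from the
series. The userfaultfd and network calls are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical per-page states; the thread only names clean, dirty, poisoned. */
enum page_status { PAGE_CLEAN = 0, PAGE_DIRTY = 1, PAGE_POISONED = 2 };

/* Hypothetical actions a target-side fault handler could take. */
enum fault_action {
    ACT_ASK_SOURCE,         /* map incomplete: query the source for this page */
    ACT_INSTALL_PAGE,       /* precopied contents are current */
    ACT_FETCH_THEN_INSTALL, /* assumed: dirty pages must be re-copied first */
    ACT_POISON              /* e.g. respond with the proposed UFFDIO_SIGBUS */
};

struct status_map {
    size_t npages;
    int complete;   /* nonzero once the whole map has arrived from the source */
    uint8_t *bits;  /* 2 bits per page, 4 pages per byte */
};

/* Allocate a zeroed map, so every page starts out as PAGE_CLEAN. */
static int status_map_init(struct status_map *m, size_t npages)
{
    m->npages = npages;
    m->complete = 0;
    m->bits = calloc((npages + 3) / 4, 1);
    return m->bits ? 0 : -1;
}

/* Overwrite the 2-bit state of one page, leaving its neighbours untouched. */
static void status_map_set(struct status_map *m, size_t page, enum page_status s)
{
    assert(page < m->npages);
    unsigned shift = (unsigned)(page % 4) * 2;
    unsigned cleared = m->bits[page / 4] & ~(3u << shift);
    m->bits[page / 4] = (uint8_t)(cleared | ((unsigned)s << shift));
}

/* Read back the 2-bit state of one page. */
static enum page_status status_map_get(const struct status_map *m, size_t page)
{
    assert(page < m->npages);
    unsigned shift = (unsigned)(page % 4) * 2;
    return (enum page_status)((m->bits[page / 4] >> shift) & 3u);
}

/* Decide what to do for the first access to a page on the target. */
static enum fault_action on_first_access(const struct status_map *m, size_t page)
{
    if (!m->complete)
        return ACT_ASK_SOURCE;
    switch (status_map_get(m, page)) {
    case PAGE_POISONED: return ACT_POISON;
    case PAGE_DIRTY:    return ACT_FETCH_THEN_INSTALL;
    default:            return ACT_INSTALL_PAGE;
    }
}
```

In a real handler, on_first_access() would be called from the thread
reading userfaultfd events, and ACT_ASK_SOURCE would turn into one
request / response round trip to the source for that page.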