On Tue, Mar 12, 2024 at 8:38 AM Peter Xu <peterx@xxxxxxxxxx> wrote:
>
> On Mon, Mar 11, 2024 at 03:28:28PM -0700, Jiaqi Yan wrote:
> > On Mon, Mar 11, 2024 at 2:27 PM James Houghton <jthoughton@xxxxxxxxxx> wrote:
> > >
> > > On Mon, Mar 11, 2024 at 12:28 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
> > > >
> > > > On Mon, Mar 11, 2024 at 11:59:59AM -0700, Axel Rasmussen wrote:
> > > > > I'd prefer not to require root or CAP_SYS_ADMIN or similar for
> > > > > UFFDIO_POISON, because those control access to lots more things
> > > > > besides, which we don't necessarily want the process using UFFD to be
> > > > > able to do. :/
> > >
> > > I agree; UFFDIO_POISON should not require CAP_SYS_ADMIN.
> >
> > +1.
> > > > >
> > > > > Ratelimiting seems fairly reasonable to me. I do see the concern about
> > > > > dropping some addresses though.
> > > >
> > > > Do you know how much could an admin rely on such addresses? How frequent
> > > > would MCE generate normally in a sane system?
> > >
> > > I'm not sure about how much admins rely on the address themselves. +cc
> > > Jiaqi Yan
> >
> > I think admins mostly care about MCEs from **real** hardware. For
> > example they may choose to perform some maintenance if the number of
> > hardware DIMM errors, keyed by PFN, exceeds some threshold. And I
> > think mcelog or /sys/devices/system/node/node${X}/memory_failure are
> > better tools than dmesg. In the case all memory errors are emulated by
> > hypervisor after a live migration, these dmesgs may confuse admins to
> > think there is dimm error on host but actually it is not the case. In
> > this sense, silencing these emulated by UFFDIO_POISON makes sense (if
> > not too complicated to do).
>
> Now we have three types of such error: (1) PFN poisoned, (2) swapin error,
> (3) emulated. Both 1+2 should deserve a global message dump, while (3)
> should be process-internal, and nobody else should need to care except the
> process itself (via the signal + meta info).
>
> If we want to differentiate (2) vs. (3), we may need 1 more pte marker bit
> to show whether such poison is "global" or "local" (while as of now 2+3
> share the usage of the same PTE_MARKER_POISONED bit); a swapin error can
> still be seen as a "global" error (instead of a mem error, it can be a disk
> error, and the err msg still applies to it describing a VA corrupt).
> Another VM_FAULT_* flag is also needed to reflect that locality, then
> ignore a global broadcast for "local" poison faults.

It's easy to implement, as long as folks aren't too offended by taking
one more bit. :) I can send a patch for this on Monday if there are no
objections.

>
> >
> > SIGBUS (and logged "MCE: Killing %s:%d due to hardware memory
> > corruption fault at %lx\n") emitted by the fault handler due to UFFDIO_POISON
> > are less useful to admins AFAIK. They are for sure crucial to
> > userspace / vmm / hypervisor, but the SIGBUS sent already contains the
> > poisoned address (in si_addr from force_sig_mceerr).
> >
> > >
> > > It's possible for a sane hypervisor dealing with a buggy guest / guest
> > > userspace to trigger lots of these pr_errs. Consider the case where a
> > > guest userspace uses HugeTLB-1G, finds poison (which HugeTLB used to
> > > ignore), and then ignores SIGBUS. It will keep getting MCEs /
> > > SIGBUSes.
> > >
> > > The sane hypervisor will use UFFDIO_POISON to prevent the guest from
> > > re-accessing *real* poison, but we will still get the pr_err, and we
> > > still keep injecting MCEs into the guest. We have observed scenarios
> > > like this before.
> > >
> > > >
> > > > > Perhaps we can mitigate that concern by defining our own ratelimit
> > > > > interval/burst configuration?
> > > >
> > > > Any details?
> > > >
> > > > > Another idea would be to only ratelimit it if !CONFIG_DEBUG_VM or
> > > > > similar. Not sure if that's considered valid or not. :)
> > > >
> > > > This, OTOH, sounds like an overkill..
> > > >
> > > > I just checked again on the detail of the ratelimit code, where by default
> > > > it has:
> > > >
> > > > #define DEFAULT_RATELIMIT_INTERVAL (5 * HZ)
> > > > #define DEFAULT_RATELIMIT_BURST 10
> > > >
> > > > So it allows a 10 times burst rather than 2.. IIUC it means even if
> > > > there're 10 continuous MCEs it won't get suppressed, until the 11th comes, in
> > > > a 5 second interval. I think it means it's possibly even less of a concern
> > > > to directly use pr_err_ratelimited().
> > >
> > > I'm okay with any rate limiting everyone agrees on. IMO, silencing
> > > these pr_errs if they came from UFFDIO_POISON (or, perhaps, if they
> > > did not come from real hardware MCE events) sounds like the most
> > > correct thing to do, but I don't mind. Just don't make UFFDIO_POISON
> > > require CAP_SYS_ADMIN. :)
> > >
> > > Thanks.
> > >
> --
> Peter Xu
>
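
For reference, the burst arithmetic quoted above (10 messages per 5 * HZ
interval, with the 11th suppressed) can be modelled in user space. This is
only a rough sketch and not the kernel's ratelimit code: every model_* name
is hypothetical, MODEL_HZ is an assumed tick rate, and the window handling
is simplified.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed tick rate for this model only; the kernel's HZ is a build option. */
#define MODEL_HZ 1000UL
/* Mirrors the default values quoted in the thread. */
#define MODEL_RATELIMIT_INTERVAL (5 * MODEL_HZ)
#define MODEL_RATELIMIT_BURST 10

/* Hypothetical user-space stand-in for a ratelimit state. */
struct model_ratelimit {
	unsigned long interval; /* window length in ticks */
	int burst;              /* messages allowed per window */
	unsigned long begin;    /* tick at which the current window started */
	bool started;           /* false until the first event is seen */
	int printed;            /* messages allowed in the current window */
	int missed;             /* messages suppressed so far */
};

static void model_ratelimit_init(struct model_ratelimit *rs)
{
	assert(rs != NULL);
	rs->interval = MODEL_RATELIMIT_INTERVAL;
	rs->burst = MODEL_RATELIMIT_BURST;
	rs->begin = 0;
	rs->started = false;
	rs->printed = 0;
	rs->missed = 0;
}

/* Returns true if a message arriving at tick `now` may be printed. */
static bool model_ratelimit_allow(struct model_ratelimit *rs, unsigned long now)
{
	assert(rs != NULL);
	/* Open a new window on the first event or once the old one expired. */
	if (!rs->started || now - rs->begin > rs->interval) {
		rs->begin = now;
		rs->started = true;
		rs->printed = 0;
	}
	if (rs->printed < rs->burst) {
		rs->printed++;
		return true;
	}
	rs->missed++;
	return false;
}

/* Feeds n events, spaced `gap` ticks apart from `start`; returns how many pass. */
static int model_count_allowed(struct model_ratelimit *rs, unsigned long start,
			       unsigned long gap, int n)
{
	int allowed = 0;

	for (int i = 0; i < n; i++)
		if (model_ratelimit_allow(rs, start + (unsigned long)i * gap))
			allowed++;
	return allowed;
}
```

Feeding this model 11 events one tick apart lets the first 10 through and
suppresses the 11th; an event that arrives after the 5 second window has
passed is allowed again.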