On Fri, Aug 27 2021 at 16:22, Tony Luck wrote:
> On Fri, Aug 27, 2021 at 09:57:10PM +0000, Al Viro wrote:
>> On Fri, Aug 27, 2021 at 09:48:55PM +0000, Al Viro wrote:
>>
>> > [btrfs]search_ioctl()
>> >         Broken with memory poisoning, for either variant of semantics.
>> > Same for arm64 sub-page permission differences, I think.
>>
>> > So we have 3 callers where we want all-or-nothing semantics - two in
>> > arch/x86/kernel/fpu/signal.c and one in btrfs. HWPOISON will be a problem
>> > for all 3, AFAICS...
>> >
>> > IOW, it looks like we have two different things mixed here - one that wants
>> > to try and fault stuff in, with callers caring only about having _something_
>> > faulted in (most of the users) and one that wants to make sure we *can* do
>> > stores or loads on each byte in the affected area.
>> >
>> > Just accessing a byte in each page really won't suffice for the second kind.
>> > Neither will g-u-p use, unless we teach it about HWPOISON and other fun
>> > beasts... Looks like we want that thing to be a separate primitive; for
>> > btrfs I'd probably replace fault_in_pages_writeable() with clear_user()
>> > as a quick fix for now...
>> >
>> > Comments?
>>
>> Wait a sec... Wasn't HWPOISON a per-page thing? arm64 definitely does have
>> smaller-than-page areas with different permissions, so btrfs search_ioctl()
>> has a problem there, but arch/x86/kernel/fpu/signal.c doesn't have to deal
>> with that...
>>
>> Sigh... I really need more coffee...
>
> On Intel poison is tracked at the cache line granularity. Linux
> inflates that to per-page (because it can only take a whole page away).
> For faults triggered in ring3 this is pretty much the same thing because
> mm/memory_failure.c unmaps the page ... so while you see a #MC on first
> access, you get a #PF when you retry. The x86 fault handler sees a magic
> signature in the page table and sends a SIGBUS.
>
> But it's all different if the #MC is triggered from ring0. The machine
> check handler can't unmap the page. It just schedules task_work to do
> the unmap when next returning to user mode.
>
> But if your kernel code loops and tries again without a return to user
> mode, then you get another #MC.

But that's not the case for restore_fpregs_from_user() when it hits #MC:

restore_fpregs_from_user()
	...
	ret = __restore_fpregs_from_user(buf, xrestore, fx_only);

	/* Try to handle #PF, but anything else is fatal. */
	if (ret != -EFAULT)
		return -EINVAL;

Now let's look at __restore_fpregs_from_user():

__restore_fpregs_from_user()
	return $FPUVARIANT_rstor_from_user_sigframe()

which all end up in user_insn().

user_insn() returns 0 or the negated trap number, which results in
-EFAULT for #PF, but for #MC the negated trap number is -18, i.e.
!= -EFAULT. IOW, there is no endless loop.

This used to be a problem before commit:

  aee8c67a4faa ("x86/fpu: Return proper error codes from user access functions")

and as the changelog says, the initial reason for that commit was #GP
going into the fault path, but I'm pretty sure that I also discussed
the #MC angle with Borislav back then. Obviously I should have added
some more comments there.

Thanks,

        tglx
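
For illustration, here is a minimal stand-alone C sketch of the error-path
logic described above. It is not the kernel code: fake_user_insn() and
fake_restore() are hypothetical stand-ins for user_insn() and the error
path in restore_fpregs_from_user(). The constants match x86, where
EFAULT == 14 == X86_TRAP_PF and X86_TRAP_MC == 18, which is why #PF maps
onto the retry path while #MC falls through to -EINVAL:

	/*
	 * Illustrative sketch only, not kernel code: shows why the
	 * negated-trap-number convention from commit aee8c67a4faa
	 * makes #MC fatal instead of causing an endless retry loop.
	 */
	#include <stdio.h>

	#define EFAULT		14	/* Bad address */
	#define EINVAL		22	/* Invalid argument */
	#define X86_TRAP_PF	14	/* Page fault */
	#define X86_TRAP_MC	18	/* Machine check */

	/* Stand-in for user_insn(): returns 0 or the negated trap number. */
	static int fake_user_insn(int trapnr)
	{
		return trapnr ? -trapnr : 0;
	}

	/* Stand-in for the error path in restore_fpregs_from_user(). */
	static int fake_restore(int trapnr)
	{
		int ret = fake_user_insn(trapnr);

		if (!ret)
			return 0;
		/* Try to handle #PF, but anything else is fatal. */
		if (ret != -EFAULT)
			return -EINVAL;
		/* #PF: would fault the buffer in and retry (not modeled). */
		return -EFAULT;
	}

	int main(void)
	{
		/* #PF: -X86_TRAP_PF == -14 == -EFAULT -> retry path. */
		printf("#PF -> %d (would fault in and retry)\n",
		       fake_restore(X86_TRAP_PF));
		/* #MC: -X86_TRAP_MC == -18 != -EFAULT -> fatal, no loop. */
		printf("#MC -> %d (fatal, no endless loop)\n",
		       fake_restore(X86_TRAP_MC));
		return 0;
	}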