On Thu, Mar 23, 2023 at 06:56:34PM -0700, Hugh Dickins wrote: > On Thu, 23 Mar 2023, Song Liu wrote: > > On Thu, Mar 23, 2023 at 2:56 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote: > > > > > > On Thu, Mar 23, 2023 at 12:07:46PM -0700, Hugh Dickins wrote: > > > > On an earlier audit, for different reasons, I did also run across > > > > lib/buildid.c build_id_parse() using find_get_page() without checking > > > > PageUptodate() - looks as if it might do the wrong thing if it races > > > > with khugepaged collapsing text to huge, and should probably have a > > > > similar fix. > > > > > > That shouldn't be using find_get_page(). It should probably use > > > read_cache_folio() which will actually read in the data if it's not > > > present in the page cache, and return an ERR_PTR if the data couldn't > > > be read. > > > > build_id_parse() can be called from NMI, so I don't think we can let > > read_cache_folio() read-in the data. > > Interesting. > > This being the same Layering_Violation_ID which is asking for a home in > everyone's struct file? (Okay, I'm being disagreeable, no need to answer!) Yes, that's the one. Part of the BPF "splatter stuff all across the kernel that we don't really understand" (see, I can be just as disagreeable!) > I think even the current find_get_page() is unsafe from NMI: imagine that > NMI interrupting a sequence (maybe THP collapse or splitting, maybe page > migration, maybe others) when the page refcount has been frozen to 0: > you'll just have to reboot the machine? Correct; if it's been frozen by this CPU, we'll spin forever. I think the other conditions where we retry are temporary and caused by something _another_ CPU is doing. For example, if _this_ CPU is in the middle of modifying the tree when an NMI happens, we won't hit the xas_retry() case. That can only happen if we've observed something another CPU did, and then a second write happened from that same other CPU. The easiest example of that would be that we hit this kind of race: CPU 0 CPU 1 load root of tree, points to node store entry in root of tree wmb(); store RETRY entry in node load entry from node, observe RETRY entry The retry will happen and we'll observe the new state of the tree with the entry we're looking for in the root. If CPU 1 takes an NMI and interrupts itself, it will always see a consistent tree. > I guess the RCU-safety of find_get_page() implies that its XArray basics > are safe in NMI; but you need a low-level variant (xas_find()?) which > does none of the "goto retry"s, and fails immediately if anything is > wrong - including !PageUptodate. The Uptodate flag check needs to be done by the caller; the find_get_page() family return !uptodate pages. But find_get_page() does not advertise itself as NMI-safe. And I think it's wrong to try to make it NMI-safe. Most of the kernel is not NMI-safe. I think it's incumbent on the BPF people to get the information they need ahead of taking the NMI. NMI handlers are not supposed to be doing a huge amount of work! I don't really understand why it needs to do work in NMI context; surely it can note the location of the fault and queue work to be done later (eg on irq-enable, task-switch or return-to-user)