On Fri, Sep 13, 2024 at 12:33:49PM -0400, Chris Mason wrote:
> > If you could get the precise index numbers, that would be an important
> > clue.  It would be interesting to know the index number in the xarray
> > where the folio was found rather than folio->index (as I suspect that
> > folio->index is completely bogus because folio->mapping is wrong).
> > But gathering that info is going to be hard.
> 
> This particular debug session was late at night while we were urgently
> trying to roll out some NFS features.  I didn't really save many of the
> details because my plan was to reproduce it and make a full bug report.
> 
> Also, I was explaining the details to people in workplace chat, which is
> wildly bad at rendering long lines of structured text, especially when
> half the people in the chat are on a mobile device.
> 
> You're probably wondering why all of that is important... what I'm
> really trying to say is that I've attached a screenshot of the
> debugging output.
> 
> It came from an older drgn script, where I'm still clinging to "radix",
> and you probably can't trust the string representation of the page
> flags because I wasn't yet using Omar's helpers and may have hard-coded
> them from an older kernel.

That's all _fine_.  This is enormously helpful.

First, we see the same folio appear three times, and I think that's
particularly significant.  Modulo 64 (the number of entries per node),
the indices the bad folio is found at are 16, 32 and 48.  So I think
the _current_ order of the folio is 4, but at the time the folio was
put in the xarray, it was order 6.

Except ... at order 6 we elide a level of the xarray, so we shouldn't
be able to see this at all.  Hm.

Oh!  I think split is the key.  Let's say we have an order-6 (or
larger) folio, and we call split_huge_page() (or whatever it's called
in your kernel version).  That calls xas_split_alloc() followed by
xas_split().  xas_split_alloc() puts the entry in node->slots[0] and
initialises node->slots[1..XA_CHUNK_SIZE] to sibling entries.  (There
is a toy model of the resulting slot layout at the end of this mail.)

Now, if we do allocate those nodes in xas_split_alloc(), we're supposed
to free them with radix_tree_node_rcu_free(), which zeroes all the
slots.  But what if we don't, somehow?  (This is my best current
theory.)  Then the node gets allocated to a different tree, and any
time we look something up in it, unless it's the index for which we
allocated the node, we find a sibling entry that points to a stale
pointer.  The second sketch below models that scenario.

I'm going to think on this a bit more, but so far this is all good
evidence for my leading theory.
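
To make the layout concrete, here's a quick userspace toy (NOT kernel
code: xa_mk_sibling() is faked, and I'm assuming the split-to-order-4
fill pattern, i.e. the canonical entry repeated at each multiple of
1 << new_order with sibling back-pointers in between) showing why one
folio would show up at offsets 16, 32 and 48 of a 64-slot node:

	/*
	 * Userspace toy, not kernel code: the slot layout that
	 * xas_split_alloc() would pre-build when splitting an order-6
	 * entry into order-4 pieces.  Note order 6 == XA_CHUNK_SHIFT:
	 * an order-6 folio spans a whole 64-slot node, which is why
	 * that level is normally elided.
	 */
	#include <stdio.h>

	#define XA_CHUNK_SHIFT	6
	#define XA_CHUNK_SIZE	(1U << XA_CHUNK_SHIFT)	/* 64 slots per node */

	int main(void)
	{
		unsigned int new_order = 4;
		unsigned int mask = (1U << new_order) - 1;	/* 15 */
		void *folio = (void *)0xf0110;			/* stand-in folio */
		void *slots[XA_CHUNK_SIZE], *sibling = NULL;

		for (unsigned int i = 0; i < XA_CHUNK_SIZE; i++) {
			if ((i & mask) == 0) {
				slots[i] = folio;		/* canonical entry */
				/* fake stand-in for xa_mk_sibling(i) */
				sibling = (void *)(((unsigned long)i << 2) | 2);
			} else {
				slots[i] = sibling;		/* back-pointer */
			}
		}

		/* Prints offsets 0, 16, 32 and 48 -- the ones from the
		 * screenshot, modulo slot 0 (see the next sketch). */
		for (unsigned int i = 0; i < XA_CHUNK_SIZE; i++)
			if (slots[i] == folio)
				printf("folio at offset %u\n", i);
		return 0;
	}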
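
And a second toy modelling the free-without-zeroing theory.  Again
userspace C with a one-element "slab" standing in for the
radix_tree_node_cachep constructor semantics; node_free_bad() is my
hypothetical buggy path, not an actual kernel function:

	/*
	 * Userspace toy of the theory, not kernel code.  The radix
	 * tree node slab relies on a constructor invariant: nodes must
	 * be returned to the cache with slots[] already zeroed (that's
	 * the memset radix_tree_node_rcu_free() does).  node_free_bad()
	 * models the suspected missing zeroing.
	 */
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>

	#define NSLOTS 64

	struct toy_node { void *slots[NSLOTS]; };

	static struct toy_node *freelist;	/* one-element "slab cache" */

	static struct toy_node *node_alloc(void)
	{
		if (freelist) {			/* reuse: ctor does NOT re-run */
			struct toy_node *n = freelist;
			freelist = NULL;
			return n;
		}
		return calloc(1, sizeof(struct toy_node)); /* ctor zeroes fresh nodes */
	}

	static void node_free_bad(struct toy_node *n)
	{
		freelist = n;			/* slots[] left stale: the bug */
	}

	int main(void)
	{
		void *stale_folio = (void *)0xdead;
		struct toy_node *n = node_alloc();

		/* xas_split_alloc()-style prefill that never gets zeroed. */
		for (int i = 0; i < NSLOTS; i++)
			n->slots[i] = stale_folio;

		node_free_bad(n);		/* should have memset slots first */

		n = node_alloc();		/* a different tree gets the node */
		n->slots[0] = (void *)0xbee5;	/* only its own index is stored */

		/* Lookups at any other offset hit the previous tree's
		 * folio; slot 0 holding a fresh entry would explain the
		 * stale folio appearing at 16/32/48 but not at 0. */
		printf("slot 16 = %p (stale)\n", n->slots[16]);
		return 0;
	}

If that's what's happening, the fix would be making sure every path
that disposes of an xas_split_alloc()ed node goes through the zeroing
free, but I haven't found the guilty path yet.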