On Fri, Sep 20, 2024 at 03:54:55PM +0200, Chris Mason wrote:
> On 9/19/24 12:36 AM, Matthew Wilcox wrote:
> > My brain is still mushy, but I think there is still a problem (both
> > with the simple fix for 6.9 and indeed with 6.10).
> >
> > For splitting a folio, we have the folio locked, so we know it's
> > not going anywhere.  The tree may get rearranged around it while we
> > don't have the xa_lock, but we're somewhat protected.
> >
> > In this case we're splitting something that was, at one point, a
> > shadow entry.  There's no struct there to lock.  So I think we can
> > have a situation where we replicate 'old' (in 6.10) or xa_load()
> > (in 6.9) into the nodes we allocate in xas_split_alloc().  In 6.10,
> > that's at least guaranteed to be a shadow entry, but in 6.9, it
> > might already be a folio by this point because we've raced with
> > something else also doing a split.
> >
> > Probably xas_split_alloc() needs to just do the alloc, like the
> > name says, and drop the 'entry' argument.  ICBW, but I think it
> > explains what you're seeing?  Maybe it doesn't?
>
> Jens and I went through a lot of iterations making the repro more
> reliable, and we were able to pretty consistently show a UAF with the
> debug code that Willy suggested:
>
>         XA_NODE_BUG_ON(xas->xa_alloc,
>                        memchr_inv(&xas->xa_alloc->slots, 0,
>                                   sizeof(void *) * XA_CHUNK_SIZE));
>
> But I didn't really catch what Willy was saying about
> xas_split_alloc() until this morning.
>
> xas_split_alloc() does the allocation and also shoves an entry into
> some of the slots.  When the tree changes, the entry we've stored is
> wildly wrong, but xas_reset() doesn't undo any of that.  So when we
> actually use the xas->xa_alloc nodes we've set up, they are pointing
> to the wrong things.
>
> Which is probably why the commits in 6.10 added this:
>
>         /* entry may have changed before we re-acquire the lock */
>         if (alloced_order && (old != alloced_shadow || order != alloced_order)) {
>                 xas_destroy(&xas);
>                 alloced_order = 0;
>         }
>
> The only way to undo the work done by xas_split_alloc() is to call
> xas_destroy().  I hadn't fully understood this until today.

Here's what the code in 6.9 did (grossly simplified):

        do {
                unsigned int order = xa_get_order(xas.xa, xas.xa_index);

                if (order > folio_order(folio))
                        xas_split_alloc(&xas, xa_load(xas.xa, xas.xa_index),
                                        order, gfp);
                xas_lock_irq(&xas);
                if (old) {
                        order = xa_get_order(xas.xa, xas.xa_index);
                        if (order > folio_order(folio))
                                xas_split(&xas, old, order);
                }
                xas_store(&xas, folio);
                xas_unlock_irq(&xas);
        } while (xas_nomem(&xas, gfp));

The intent was that xas_store() would use the node allocated by
xas_nomem() and xas_split() would use the nodes allocated by
xas_split_alloc().  That doesn't end up happening if the split already
happened before we got the lock: xas_split() is skipped, but the nodes
xas_split_alloc() pre-filled are still hanging off xas.xa_alloc,
pointing at entries that are now stale, and xas_store() will consume
them.

So if we were looking for a minimal fix for pre-6.10, calling
xas_destroy() when we've called xas_split_alloc() but don't go on to
call xas_split() would fix the problem (sketched below).  But I think
we're better off backporting the 6.10 patches.

For 6.12, I'm going to put this in -next:

http://git.infradead.org/?p=users/willy/xarray.git;a=commitdiff;h=6684aba0780da9f505c202f27e68ee6d18c0aa66

and then send it to Linus in a couple of weeks as an "obviously
correct" bit of hardening.  We really should have called xas_reset()
before retaking the lock.
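To spell that minimal fix out, here's the same grossly simplified loop
with an xas_destroy() call added.  This is an untested sketch, and the
'alloced' flag is my own bookkeeping, not anything in the tree; it's
there so we only throw away nodes that xas_split_alloc() pre-filled,
and not the node xas_nomem() may have allocated for xas_store():

        do {
                unsigned int order = xa_get_order(xas.xa, xas.xa_index);
                bool alloced = false;

                if (order > folio_order(folio)) {
                        xas_split_alloc(&xas, xa_load(xas.xa, xas.xa_index),
                                        order, gfp);
                        alloced = true;
                }
                xas_lock_irq(&xas);
                if (old) {
                        order = xa_get_order(xas.xa, xas.xa_index);
                        if (order > folio_order(folio)) {
                                xas_split(&xas, old, order);
                                alloced = false;        /* nodes consumed */
                        }
                }
                /*
                 * We allocated nodes for a split that somebody else
                 * already did while we weren't holding the lock, so
                 * they're pre-filled with stale entries.  Free them
                 * before xas_store() can consume them.
                 */
                if (alloced)
                        xas_destroy(&xas);
                xas_store(&xas, folio);
                xas_unlock_irq(&xas);
        } while (xas_nomem(&xas, gfp));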
Beyond that, I really want to revisit how, when and what we split.
A few months ago we came to the realisation that splitting order-9
folios into 512 order-0 folios was just legacy thinking.  What each
user really wants is to specify a precise page and say "I want this
page to end up in a folio of order N" (where N is smaller than the
order of the folio it's currently in).

That is, if we truncate a file which is currently a multiple of 2MB in
size to one which has a tail of, say, 13377ea bytes, we'd want to
create a 1MB folio which we leave at the end of the file, then a 512kB
folio which we free, then a 256kB folio which we keep, a 128kB folio
which we discard, a 64kB folio which we discard, ... (a toy model of
this walk follows below).

So we need to do that first; then all this code becomes much easier,
and xas_split_alloc() no longer needs to fill in the node at the
wrong time.
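To make the arithmetic of that walk concrete, here's a toy userspace
model.  It's my sketch only, not kernel code or any existing API; it
assumes 4kB pages, reads the 13377ea above as hex, and uses its offset
within the final 2MB folio, 0x1377ea.  Starting from order 9, it drops
one order per step, keeps the half that lies entirely before the cut,
frees the half that lies entirely past it, and descends into whichever
half straddles it:

        /* toy_split.c - userspace model, not kernel code */
        #include <stdio.h>

        #define PAGE_SHIFT      12              /* assume 4kB pages */

        int main(void)
        {
                unsigned long cut = 0x1377ea;   /* EOF offset inside the folio */
                unsigned long base = 0;         /* sub-folio still to be split */
                int order = 9;                  /* start from a 2MB folio */

                while (order > 0) {
                        unsigned long mid;

                        order--;
                        mid = base + (1UL << (order + PAGE_SHIFT));

                        if (cut == mid) {
                                /* cut falls on the boundary: both halves are whole */
                                printf("keep    order-%d at 0x%06lx\n", order, base);
                                printf("discard order-%d at 0x%06lx\n", order, mid);
                                return 0;
                        }
                        if (cut > mid) {
                                /* left half lies entirely before EOF: keep it whole */
                                printf("keep    order-%d at 0x%06lx\n", order, base);
                                base = mid;     /* right half straddles EOF */
                        } else {
                                /* right half lies entirely past EOF: free it whole */
                                printf("discard order-%d at 0x%06lx\n", order, mid);
                        }
                }
                printf("keep    order-0 at 0x%06lx (straddles EOF, zero the tail)\n",
                       base);
                return 0;
        }

For this offset it emits ten sub-folios (at most two per order)
instead of 512 order-0 pages; the exact keep/free pattern of course
depends on where the cut falls.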