On 9/20/24 3:54 PM, Chris Mason wrote: [ ... ] > xas_split_alloc() does the allocation and also shoves an entry into some of > the slots. When the tree changes, the entry we've stored is wildly > wrong, but xas_reset() doesn't undo any of that. So when we actually > use the xas->xa_alloc nodes we've setup, they are pointing to the > wrong things. > > Which is probably why the commits in 6.10 added this: > > /* entry may have changed before we re-acquire the lock */ > if (alloced_order && (old != alloced_shadow || order != alloced_order)) { > xas_destroy(&xas); > alloced_order = 0; > } > > The only way to undo the work done by xas_split_alloc() is to call > xas_destroy(). > > To prove this theory, I tried making a minimal version that also > called destroy, but it all ended up less minimal than the code > that's actually in 6.10. I've got a long test going now with > an extra cond_resched() to make the race bigger, and a printk of victory. > > It hasn't fired yet, and I need to hop on an airplane, so I'll just leave > it running for now. But long story short, I think we should probably > just tag all of these for stable: > > https://lore.kernel.org/all/20240415171857.19244-2-ryncsn@xxxxxxxxx/T/#mdb85922624c39ea7efb775a044af4731890ff776 > > Also, Willy's proposed changes to xas_split_alloc() seem like a good > idea. A few days of load later and some extra printks, it turns out that taking the writer lock in __filemap_add_folio() makes us dramatically more likely to just return EEXIST than go into the xas_split_alloc() dance. With the changes in 6.10, we only get into that xas_destroy() case above when the conflicting entry is a shadow entry, so I changed my repro to use memory pressure instead of fadvise. I also added a schedule_timeout(1) after the split alloc, and with all of that I'm able to consistently make the xas_destroy() case trigger without causing any system instability. Kairui Song's patches do seem to have fixed things nicely. -chris