On Mon, May 02, 2022 at 08:18:24AM -0400, Brian Foster wrote:
> On Sat, Apr 30, 2022 at 04:44:07AM +0100, Matthew Wilcox wrote:
> > On Thu, Apr 28, 2022 at 11:53:18AM -0400, Brian Foster wrote:
> > > The above is the variant of generic/068 failure I was reproducing and
> > > used to bisect [1]. With some additional tracing added to ioend
> > > completion, what I'm seeing is that the bio_for_each_folio_all() bvec
> > > iteration basically seems to go off the rails. What happens more
> > > specifically is that at some point during the loop, bio_next_folio()
> > > actually lands in the second page of the just-processed folio instead
> > > of the actual next folio (i.e. as if it's walking to the next page from
> > > the head page of the folio instead of to the next 16k folio). I suspect
> > > completion is racing with some form of truncation/reclaim/invalidation
> > > here (what exactly I don't know) that perhaps breaks down the folio and
> > > renders the iteration (bio_next_folio() -> folio_next()) unsafe. To test
> > > that theory, I open coded and modified the loop to something like the
> > > following:
> > >
> > > 	for (bio_first_folio(&fi, bio, 0); fi.folio; ) {
> > > 		f = fi.folio;
> > > 		l = fi.length;
> > > 		bio_next_folio(&fi, bio);
> > > 		iomap_finish_folio_write(inode, f, l, error);
> > > 		folio_count++;
> > > 	}
> > >
> > > ... to avoid accessing folio metadata after writeback is cleared on it,
> > > and this seems to make the problem disappear (so far; I'll need to let
> > > this spin for a while longer to be completely confident in that).
> >
> > _Oh_.
> >
> > It's not even a terribly weird race, then. It's just this:
> >
> > CPU 0					CPU 1
> > truncate_inode_partial_folio()
> > folio_wait_writeback();
> > 					bio_next_folio(&fi, bio)
> > 					iomap_finish_folio_write(fi.folio)
> > 					folio_end_writeback(folio)
> > split_huge_page()
> > 					bio_next_folio()
> > ... oops, now we only walked forward one page instead of the entire folio.
>
> Yep, though once I noticed and turned on the mm_page_free tracepoint, it
> looked like it was actually the I/O completion path breaking down the
> compound folio:
>
>   kworker/10:1-440 [010] ..... 355.369899: iomap_finish_ioend: 1090: bio 00000000bc8445c7 index 192 fi (00000000dc8c03bd 0 16384 32768 27)
>   ...
>   kworker/10:1-440 [010] ..... 355.369905: mm_page_free: page=00000000dc8c03bd pfn=0x182190 order=2
>   kworker/10:1-440 [010] ..... 355.369907: iomap_finish_ioend: 1090: bio 00000000bc8445c7 index 1 fi (00000000f8b5d9b3 0 4096 16384 27)
>
> I take that to mean the truncate path executes while the completion side
> holds a reference, folio_end_writeback() ends up dropping the last
> reference and falling into the free/split path, and the iteration breaks
> from there. Same idea either way, I think.

Absolutely. That's probably the more common path anyway; we truncate an
entire folio instead of a partial one, so it could be:

	truncate_inode_partial_folio():
		folio_wait_writeback(folio);
		if (length == folio_size(folio)) {
			truncate_inode_folio(folio->mapping, folio);

or basically the same code in truncate_inode_pages_range() or
invalidate_inode_pages2_range().
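
For reference when decoding the fi tuples in the trace lines above, the
folio_iter state is roughly this (paraphrased from include/linux/bio.h
around v5.17/v5.18, so the details may not match any given tree exactly):

	struct folio_iter {
		struct folio *folio;	/* current folio */
		size_t offset;		/* offset of this segment in the folio */
		size_t length;		/* bytes of this segment in the folio */
		/* private: for use by the iterator */
		size_t _seg_count;	/* bytes remaining in the current bvec */
		int _i;			/* index of the current bvec in the bio */
	};

Assuming Brian's debug tracepoint prints the fields in declaration order,
"fi (00000000dc8c03bd 0 16384 32768 27)" reads as folio/offset/length/
_seg_count/_i: a 16k folio at offset 0, with 32k still left in the bvec
that spans it, at bvec index 27. The next entry then shows a 4096-byte
length, consistent with the iterator having degraded to single-page steps.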
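
And the iteration helpers themselves, likewise paraphrased from memory
(folio_next() from include/linux/mm.h, bio_next_folio() from
include/linux/bio.h), which is where the single-page step would come from
once the folio has been torn down:

	static inline struct folio *folio_next(struct folio *folio)
	{
		/*
		 * folio_nr_pages() reads the compound order from the head
		 * page.  Once the folio has been split or freed, that reads
		 * reinitialised metadata and returns 1, so we step forward
		 * one page instead of the full (e.g. 16k) folio.
		 */
		return (struct folio *)folio_page(folio, folio_nr_pages(folio));
	}

	static inline void bio_next_folio(struct folio_iter *fi, struct bio *bio)
	{
		fi->_seg_count -= fi->length;
		if (fi->_seg_count) {
			/* More of this bvec to go: walk to the next folio. */
			fi->folio = folio_next(fi->folio);
			fi->offset = 0;
			fi->length = min(folio_size(fi->folio), fi->_seg_count);
		} else if (fi->_i + 1 < bio->bi_vcnt) {
			bio_first_folio(fi, bio, fi->_i + 1);
		} else {
			fi->folio = NULL;
		}
	}

Both folio_nr_pages() and folio_size() depend on the folio metadata
staying stable across the call to iomap_finish_folio_write(), which is
exactly the assumption the reordered loop above no longer makes.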