Re: [PATCH v3 25/25] xfs: Support large folios

Dave Chinner <david@xxxxxxxxxxxxx> · Tue, 28 Jun 2022 09:35:15 +1000

On Mon, Jun 27, 2022 at 03:16:19PM -0700, Darrick J. Wong wrote:
> On Mon, Jun 27, 2022 at 03:10:40PM +0100, Matthew Wilcox wrote:
> > On Sun, Jun 26, 2022 at 09:15:27PM -0700, Darrick J. Wong wrote:
> > > On Wed, Jun 22, 2022 at 05:42:11PM -0700, Darrick J. Wong wrote:
> > > > [resend with shorter 522.out file to keep us under the 300k maximum]
> > > > 
> > > > On Thu, Dec 16, 2021 at 09:07:15PM +0000, Matthew Wilcox (Oracle) wrote:
> > > > > Now that iomap has been converted, XFS is large folio safe.
> > > > > Indicate to the VFS that it can now create large folios for XFS.
> > > > > 
> > > > > Signed-off-by: Matthew Wilcox (Oracle) <willy@xxxxxxxxxxxxx>
> > > > > Reviewed-by: Christoph Hellwig <hch@xxxxxx>
> > > > > Reviewed-by: Darrick J. Wong <djwong@xxxxxxxxxx>
> > > > > ---
> > > > >  fs/xfs/xfs_icache.c | 2 ++
> > > > >  1 file changed, 2 insertions(+)
> > > > > 
> > > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > > > > index da4af2142a2b..cdc39f576ca1 100644
> > > > > --- a/fs/xfs/xfs_icache.c
> > > > > +++ b/fs/xfs/xfs_icache.c
> > > > > @@ -87,6 +87,7 @@ xfs_inode_alloc(
> > > > >  	/* VFS doesn't initialise i_mode or i_state! */
> > > > >  	VFS_I(ip)->i_mode = 0;
> > > > >  	VFS_I(ip)->i_state = 0;
> > > > > +	mapping_set_large_folios(VFS_I(ip)->i_mapping);
> > > > >  
> > > > >  	XFS_STATS_INC(mp, vn_active);
> > > > >  	ASSERT(atomic_read(&ip->i_pincount) == 0);
> > > > > @@ -320,6 +321,7 @@ xfs_reinit_inode(
> > > > >  	inode->i_rdev = dev;
> > > > >  	inode->i_uid = uid;
> > > > >  	inode->i_gid = gid;
> > > > > +	mapping_set_large_folios(inode->i_mapping);
> > > > 
> > > > Hmm.  Ever since 5.19-rc1, I've noticed that fsx in generic/522 now
> > > > reports file corruption after 20 minutes of runtime.  The corruption is
> > > > surprisingly reproducible (522.out.bad attached below) in that I ran it
> > > > three times and always got the same bad offset (0x6e000) and always the
> > > > same opcode (6213798(166 mod 256) MAPREAD).
> > > > 
> > > > I turned off multipage folios and now 522 has run for over an hour
> > > > without problems, so before I go do more debugging, does this ring a
> > > > bell to anyone?
> > > 
> > > I tried bisecting, but that didn't yield anything productive and
> > > 5.19-rc4 still fails after 25 minutes; however, it seems that g/522 will
> > > run without problems for at least 3-4 days after reverting this patch
> > > from -rc3.
> > > 
> > > So I guess I have a blunt force fix if we can't figure this one out
> > > before 5.19 final, but I'd really rather not.  Will keep trying this
> > > week.
> > 
> > I'm on holiday for the next week, so I'm not going to be able to spend
> > any time on this until then.  I have a suspicion that this may be the
> > same bug Zorro is seeing here:
> > 
> > https://lore.kernel.org/linux-mm/20220613010850.6kmpenitmuct2osb@zlang-mailbox/
> > 
> > At least I hope it is, and finding a folio that has been freed would
> > explain (apparent) file corruption.
> 
> Hm.  I suppose it /could/ be a lost folio getting into the works
> somewhere.
> 
> Today I remembered fsx -X, which makes this reproduce a bit faster (~3-8
> minutes instead of 20-25).  That has helped me to narrow things down a
> little more:
> 
> - Turning off INSERT/COLLAPSE_RANGE doesn't make the problem go away,
>   but it does make reading the fsx log much easier.
> 
> - Turning off clone/dedupe (either via -J -B or formatting with -m
>   reflink=0) makes the problem go away completely.  If you define
>   letting fsx run for 90 minutes as "completely".
> 
> - Neutering vfs_dedupe_file_range_compare by replacing it with an -EBADE
>   return doesn't affect the reproducibility, so it's not the comparison
>   function misbehaving.
> - I modified fsx.c so that when there's file corruption, it'll report
>   both the first 16 bytes of corruption as well as every corruption that
>   happens on a page boundary.
> 
> - I also modified run_fsx() to diff the good and junk files, and
>   complain about any corruption happening on a page boundary.  Now I see
>   things like this:
> 
> 2153984(  0 mod 256): SKIPPED (no operation)
> 2153985(  1 mod 256): DEDUPE 0xf000 thru 0x23fff        (0x15000 bytes) to 0x2a000 thru 0x3efff ******BBBB
> 2153986(  2 mod 256): COPY 0xe794 thru 0x2ae41  (0x1c6ae bytes) to 0x60ac4 thru 0x7d171
> 2153987(  3 mod 256): TRUNCATE DOWN     from 0x7d172 to 0x535da
> 2153988(  4 mod 256): SKIPPED (no operation)
> 2153989(  5 mod 256): MAPREAD  0x40b93 thru 0x535d9     (0x12a47 bytes)
> 2153990(  6 mod 256): COPY 0x5edd thru 0x20282  (0x1a3a6 bytes) to 0x3a9aa thru 0x54d4f
> 2153991(  7 mod 256): SKIPPED (no operation)
> 2153992(  8 mod 256): SKIPPED (no operation)
> 2153993(  9 mod 256): ZERO     0x542d3 thru 0x67006     (0x12d34 bytes)
> 2153994( 10 mod 256): COPY 0x42cf6 thru 0x538a7 (0x10bb2 bytes) to 0x23fe7 thru 0x34b98 ******EEEE
> 2153995( 11 mod 256): MAPWRITE 0x5a1fc thru 0x6b067     (0x10e6c bytes)
> 2153996( 12 mod 256): SKIPPED (no operation)
> 2153997( 13 mod 256): CLONE 0x38000 thru 0x38fff        (0x1000 bytes) to 0x77000 thru 0x77fff
> 2153998( 14 mod 256): FALLOC   0x49bdd thru 0x62a55     (0x18e78 bytes) INTERIOR
> 2153999( 15 mod 256): CLONE 0xf000 thru 0x1bfff (0xd000 bytes) to 0x2c000 thru 0x38fff  ******JJJJ
> Log of operations saved to "/mnt/junk.fsxops"; replay with --replay-ops
> Correct content saved for comparison
> (maybe hexdump "/mnt/junk" vs "/mnt/junk.fsxgood")
> junk file 177
> -02e000  ec  20  ec  5a  ec  78  ec  b2  ec  e6  ec  1e  ec  43  ec  0f
> -02f000  ec  30  ec  32  ec  4c  ec  ac  ec  5c  ec  d2  ec  62  ec  d3
> -030000  ec  73  ec  ce  ec  8c  ec  cb  ec  94  ec  59  ec  81  ec  34
> +02e000  77  db  f1  db  ba  db  01  db  d5  db  9c  db  4d  db  de  db
> +02f000  b3  d8  35  d8  e2  d8  bb  d8  a4  d8  c8  d8  5b  d8  83  d8
> +030000  23  d8  c8  d8  22  d8  da  d8  97  d8  e0  d8  7e  d8  61  d8
> 
> When I remount the test filesystem, I see further corruption:
> 
> $ diff -Naurp <(od -tx1 -Ax -c $TEST_DIR/junk.fsxgood) <(od -tx1 -Ax -c $TEST_DIR/junk) | grep '^[+-]0..000'
> -011000  ec  20  ec  5a  ec  78  ec  b2  ec  e6  ec  1e  ec  43  ec  0f
> -012000  ec  30  ec  32  ec  4c  ec  ac  ec  5c  ec  d2  ec  62  ec  d3
> -013000  ec  73  ec  ce  ec  8c  ec  cb  ec  94  ec  59  ec  81  ec  34
> +011000  77  db  f1  db  ba  db  01  db  d5  db  9c  db  4d  db  de  db
> +012000  b3  d8  35  d8  e2  d8  bb  d8  a4  d8  c8  d8  5b  d8  83  d8
> +013000  23  d8  c8  d8  22  d8  da  d8  97  d8  e0  d8  7e  d8  61  d8
> -02e000  ec  20  ec  5a  ec  78  ec  b2  ec  e6  ec  1e  ec  43  ec  0f
> -02f000  ec  30  ec  32  ec  4c  ec  ac  ec  5c  ec  d2  ec  62  ec  d3
> -030000  ec  73  ec  ce  ec  8c  ec  cb  ec  94  ec  59  ec  81  ec  34
> +02e000  77  db  f1  db  ba  db  01  db  d5  db  9c  db  4d  db  de  db
> +02f000  b3  d8  35  d8  e2  d8  bb  d8  a4  d8  c8  d8  5b  d8  83  d8
> +030000  23  d8  c8  d8  22  d8  da  d8  97  d8  e0  d8  7e  d8  61  d8
> 
> This is really quite strange!  The same corruption patterns we saw at
> pages 0x2e - 0x30 now appear at 0x11-0x13 after the remount!

Hmmmm - look at what the last operation before failure
does - it clones 0xf000-0x1bfff to 0x2c000-0x38fff. IOWs, those
ranges *should* be identical and the the corruption is actually
occuring at 0x11000-0x13fff. It's not until that range gets cloned
to 0x2e000-0x30fff that the corruption is detected.

So we're looking in the wrong spot for the page cache corruption -
we need to be looking at operations over the range 0x11000-0x13fff
for misbehaviour, not where fsx detected the corrupt data.

> By comparison, the junk.fsxgood file only contains this 77/db/f1
> sequence at:
> 
> $ od -tx1 -Ax -c $TEST_DIR/junk.fsxgood | grep '77  db  f1'
> 008530  db  34  db  77  db  f1  db  ba  db  01  db  d5  db  9c  db  4d
> 03d000  77  db  f1  db  ba  db  01  db  d5  db  9c  db  4d  db  de  db
> 
> Curiously, the same byte trios at 0x2f000 and 0x30000 have similar
> repetitions at similar looking offsets:
> 
> $ od -tx1 -Ax -c $TEST_DIR/junk.fsxgood | grep 'b3  d8  35'
> 009530  d8  bb  d8  b3  d8  35  d8  e2  d8  bb  d8  a4  d8  c8  d8  5b
> 03e000  b3  d8  35  d8  e2  d8  bb  d8  a4  d8  c8  d8  5b  d8  83  d8
> $ od -tx1 -Ax -c $TEST_DIR/junk.fsxgood | grep '23  d8  c8'
> 00a530  d8  f5  d8  23  d8  c8  d8  22  d8  da  d8  97  d8  e0  d8  7e
> 03f000  23  d8  c8  d8  22  d8  da  d8  97  d8  e0  d8  7e  d8  61  d8
> 
> Though the only pattern that happens consistently is that garbage bytes
> end up at the reflink dest, and later at the reflink source.  I never
> see any VM_BUG_ON_FOLIO asserts, nor does KASAN report anything.

Smells like the page cache over the clone source is not getting
marked dirty and/or flushed to disk correctly before the clone is
run. It then shares the extent with stale data to the new location
(the destination) which then fails the contents validation.

Do we have a case where we are only writing back the head page of
the multipage folio?

> I also added a debug function to dump the folios it finds in the
> pagecache for the fsx junk file, but nothing looks odd:
> 
>      522-5099  [001]   491.954659: console:              [U] FSX FAILURE
>   xfs_io-5125  [002]   491.961232: console:              XFS (sda): EXPERIMENTAL online scrub feature in use. Use at your own risk!
>   xfs_io-5125  [002]   491.961238: bprint:               filemap_dump: ino 0xb1 pos 0x0 pfn 0x515cc order 0
>   xfs_io-5125  [002]   491.961238: bprint:               filemap_dump: ino 0xb1 pos 0x1000 pfn 0x515cd order 0
>   xfs_io-5125  [002]   491.961239: bprint:               filemap_dump: ino 0xb1 pos 0x2000 pfn 0x515ce order 0
>   xfs_io-5125  [002]   491.961239: bprint:               filemap_dump: ino 0xb1 pos 0x3000 pfn 0x515cf order 0
>   xfs_io-5125  [002]   491.961239: bprint:               filemap_dump: ino 0xb1 pos 0x4000 pfn 0x50c48 order 0
>   xfs_io-5125  [002]   491.961240: bprint:               filemap_dump: ino 0xb1 pos 0x5000 pfn 0x50c49 order 0
>   xfs_io-5125  [002]   491.961240: bprint:               filemap_dump: ino 0xb1 pos 0x6000 pfn 0x50c4a order 0
>   xfs_io-5125  [002]   491.961241: bprint:               filemap_dump: ino 0xb1 pos 0x7000 pfn 0xc8a8 order 0
>   xfs_io-5125  [002]   491.961241: bprint:               filemap_dump: ino 0xb1 pos 0x8000 pfn 0x50988 order 2
>   xfs_io-5125  [002]   491.961241: bprint:               filemap_dump: ino 0xb1 pos 0xc000 pfn 0x509e0 order 2
>   xfs_io-5125  [002]   491.961242: bprint:               filemap_dump: ino 0xb1 pos 0x10000 pfn 0x4db64 order 2

So this is the folio that likely has the problem (the source)...

>   xfs_io-5125  [002]   491.961242: bprint:               filemap_dump: ino 0xb1 pos 0x14000 pfn 0x50c4c order 0
>   xfs_io-5125  [002]   491.961243: bprint:               filemap_dump: ino 0xb1 pos 0x15000 pfn 0x12485 order 0
>   xfs_io-5125  [002]   491.961243: bprint:               filemap_dump: ino 0xb1 pos 0x16000 pfn 0x50c4d order 0
>   xfs_io-5125  [002]   491.961243: bprint:               filemap_dump: ino 0xb1 pos 0x17000 pfn 0x50c4e order 0
>   xfs_io-5125  [002]   491.961244: bprint:               filemap_dump: ino 0xb1 pos 0x18000 pfn 0x4eef8 order 2
>   xfs_io-5125  [002]   491.961244: bprint:               filemap_dump: ino 0xb1 pos 0x1c000 pfn 0x4eefc order 2
>   xfs_io-5125  [002]   491.961245: bprint:               filemap_dump: ino 0xb1 pos 0x20000 pfn 0x4eef0 order 2
>   xfs_io-5125  [002]   491.961245: bprint:               filemap_dump: ino 0xb1 pos 0x24000 pfn 0x50f2c order 2
>   xfs_io-5125  [002]   491.961245: bprint:               filemap_dump: ino 0xb1 pos 0x28000 pfn 0x50f28 order 2
>   xfs_io-5125  [002]   491.961246: bprint:               filemap_dump: ino 0xb1 pos 0x2c000 pfn 0x50f24 order 2

... not the one at the destination here.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx