Re: [PATCH] nilfs2: Fix race condition that causes file system corruption

Unfortunately the patch didn't fix this issue:

[  612.614615] NILFS (sda1): segctord starting. Construction interval = 5 seconds, CP frequency < 30 seconds
[ 4298.841376] NILFS (sda1): nilfs_ioctl_move_inode_block: conflicting data buffer: ino=1155, cno=11, offset=1, blocknr=1216036, vblocknr=1198436
[ 4298.841877] NILFS (sda1): error -17 preparing GC: cannot read source blocks
[ 4298.841877] NILFS (sda1): error -17 preparing GC: cannot read source blocks

Looks like the bugsimulator hit another bug.

When this happened, the disk usage was 100%:

$ df /dev/sda1
Filesystem     1K-blocks     Used Available Use% Mounted on
/dev/sda1      104849404 99655676         0 100% /test

Also, the error killed cleanerd. However, I could cleanly unmount the
partition, and I didn't see any errors after remounting it.

As you commented, I have changed some parameters in /etc/nilfs_cleanerd.conf:
-------------------------------------------
mc_nsegments_per_clean  16
nsegments_per_clean     16
max_clean_segments              30%
cleaning_interval       0.1
mc_cleaning_interval    0.1
-------------------------------------------

I will look into this issue with bugsimulator.
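For anyone who wants a feel for the workload without the real tool, here is a hypothetical, drastically shrunk sketch of a bugsimulator-style run (a layered tree of 4k files overwritten in a loop). The ROOT/DIRS/FILES/PASSES values are illustrative stand-ins, not the tool's actual parameters, and of course it only stresses the cleaner when ROOT lives on a NILFS2 mount:

```python
# Hypothetical miniature of the bugsimulator workload: a layered tree
# of 4k files, overwritten in a loop.  ROOT/DIRS/FILES/PASSES are
# illustrative stand-ins; the real benchmark writes ~13 million files.
import os

ROOT, DIRS, FILES, PASSES = "minibench", 4, 8, 2

for _ in range(PASSES):                      # overwrite loop
    for d in range(DIRS):
        os.makedirs(f"{ROOT}/{d}", exist_ok=True)
        for f in range(FILES):
            with open(f"{ROOT}/{d}/{f}", "wb") as fh:
                fh.write(os.urandom(4096))   # one 4k file

count = sum(len(files) for _, _, files in os.walk(ROOT))
print(count, "files")  # prints: 32 files
```

Scale the counts up and point ROOT at the NILFS2 test partition to make the cleaner work for real.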

Can anyone reproduce the original corruption issue in some other way?

Regards,
Ryusuke Konishi


2017-10-29 2:16 GMT+09:00 Ryusuke Konishi <konishi.ryusuke@xxxxxxxxx>:
> Hi,
>
> Sorry for my late response.
> I could reproduce a corruption issue with the bug simulator.
>
> -----
> [88805.039545] NILFS (sda1): segctord starting. Construction interval = 5 seconds, CP frequency < 30 seconds
> [92817.941644] NILFS (sda1): nilfs_ioctl_move_inode_block: conflicting data buffer: ino=10846, cno=92, offset=2, blocknr=11425709, vblocknr=11227714
> [92817.941936] NILFS (sda1): error -17 preparing GC: cannot read source blocks
> -----
>
> Now I'm testing the patch. I will send it upstream once it turns out
> to suppress the issue.
>
> The patch itself looks good to me, though I guess there may be similar
> bugs. What makes things complicated is that the inode block is not
> marked dirty at the moment the "propagate" function carries a dirty
> state up from a dirty node/data block.
>
> Anyway, I will hasten to apply this patch.
>
> Thanks,
> Ryusuke Konishi
>
>
> 2017-07-13 15:44 GMT+09:00 Andreas Rohner <andreas.rohner@xxxxxxx>:
>> Hi,
>>
>> On Wed, 12 Jul 2017 09:08:57 +0900,
>> Ryusuke Konishi <konishi.ryusuke@xxxxxxxxx> wrote:
>>
>>> Hi,
>>>
>>> 2017-07-11 15:19 GMT+09:00 Andreas Rohner <andreas.rohner@xxxxxxx>:
>>> >
>>> > There is a race condition between the function nilfs_dirty_inode()
>>> > and nilfs_set_file_dirty().
>>> >
>>> > When a file is opened, nilfs_dirty_inode() is called to update the
>>> > access timestamp in the inode. It calls __nilfs_mark_inode_dirty()
>>> > in a separate transaction. __nilfs_mark_inode_dirty() caches the
>>> > ifile buffer_head in the i_bh field of the inode info structure and
>>> > marks it as dirty.
>>> >
>>> > After some data was written to the file in another transaction, the
>>> > function nilfs_set_file_dirty() is called, which adds the inode to
>>> > the ns_dirty_files list.
>>> >
>>> > Then the segment construction calls
>>> > nilfs_segctor_collect_dirty_files(), which goes through the
>>> > ns_dirty_files list and checks the i_bh field. If there is a cached
>>> > buffer_head in i_bh it is not marked as dirty again.
>>> >
>>> > Since nilfs_dirty_inode() and nilfs_set_file_dirty() use separate
>>> > transactions, it is possible that a segment construction that
>>> > writes out the ifile occurs in-between the two. If this happens the
>>> > inode is not on the ns_dirty_files list, but its ifile block is
>>> > still marked as dirty and written out.
>>> >
>>> > In the next segment construction, the data for the file is written
>>> > out and nilfs_bmap_propagate() updates the b-tree. Eventually the
>>> > bmap root is written into the i_bh block, which is not dirty,
>>> > because it was written out in another segment construction.
>>> >
>>> > As a result the bmap update can be lost, which leads to file system
>>> > corruption. Either the virtual block address points to an
>>> > unallocated DAT block, or the DAT entry will be reused for
>>> > something different.
>>> >
>>> > The error can remain undetected for a long time. A typical error
>>> > message would be one of the "bad btree" errors or a warning that a
>>> > DAT entry could not be found.
>>> >
>>> > This bug can be reproduced reliably by a simple benchmark that
>>> > creates and overwrites millions of 4k files.
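The interleaving described above can be sketched as a toy model in plain Python (not kernel code; the names mirror the kernel identifiers, but all of the logic is heavily simplified):

```python
# Toy model of the lost-dirty-flag race (plain Python, not kernel
# code).  writeback() stands in for one segment construction: it
# persists a buffer only if its dirty bit is set, then clears it.
persisted = []

class Buffer:
    def __init__(self):
        self.dirty = False
        self.data = None

def writeback(buf):
    if buf.dirty:
        persisted.append(buf.data)
        buf.dirty = False

i_bh = Buffer()

# 1. nilfs_dirty_inode(): cache the ifile buffer and mark it dirty.
i_bh.data = "atime update"
i_bh.dirty = True

# 2. A segment construction runs between the two transactions and
#    writes the ifile out; i_bh is now clean.
writeback(i_bh)

# 3. After nilfs_set_file_dirty(), bmap propagation stores the new
#    bmap root in the cached buffer.  Mirroring the unpatched
#    collect-dirty-files path, the buffer is NOT re-marked dirty,
#    because i_bh is already cached.
i_bh.data = "new bmap root"

# 4. The next segment construction skips the clean buffer, so the
#    bmap update never reaches disk.
writeback(i_bh)
print("new bmap root" in persisted)  # prints: False
```

Always re-dirtying the cached buffer in step 3, as the patch does, makes the second writeback pick it up.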
>>> >
>>> > Signed-off-by: Andreas Rohner <andreas.rohner@xxxxxxx>
>>>
>>> Will review, please wait for a while.
>>
>> Thank you, I have uploaded my benchmark on GitHub:
>>
>> https://github.com/zeitgeist87/bugsimulator
>>
>> The benchmark is very simple. It writes out 13107200 4k files in a
>> two-layer directory tree, so that every directory contains about 1000 files.
>> Then it overwrites these files in an infinite loop.
>>
>> You need quite aggressive cleaner settings for it to keep up with the
>> bugsimulator:
>>
>> mc_nsegments_per_clean  16
>> nsegments_per_clean     16
>> max_clean_segments              30%
>> cleaning_interval       0.1
>> mc_cleaning_interval    0.1
>>
>> If you run this benchmark on a fresh 100GB NILFS2 partition with the
>> above cleaner settings, the file system will be corrupted in about
>> half an hour, at least on my machine.
>>
>> With the patch for the race condition, it can run indefinitely.
>>
>> Regards,
>> Andreas
>>
>>> Regards,
>>> Ryusuke Konishi
>>>
>>> >
>>> > ---
>>> >  fs/nilfs2/segment.c | 6 ++++--
>>> >  1 file changed, 4 insertions(+), 2 deletions(-)
>>> >
>>> > diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
>>> > index 70ded52dc1dd..50e12956c737 100644
>>> > --- a/fs/nilfs2/segment.c
>>> > +++ b/fs/nilfs2/segment.c
>>> > @@ -1958,8 +1958,6 @@ static int nilfs_segctor_collect_dirty_files(struct nilfs_sc_info *sci,
>>> >  					   err, ii->vfs_inode.i_ino);
>>> >  				return err;
>>> >  			}
>>> > -			mark_buffer_dirty(ibh);
>>> > -			nilfs_mdt_mark_dirty(ifile);
>>> >  			spin_lock(&nilfs->ns_inode_lock);
>>> >  			if (likely(!ii->i_bh))
>>> >  				ii->i_bh = ibh;
>>> > @@ -1968,6 +1966,10 @@ static int nilfs_segctor_collect_dirty_files(struct nilfs_sc_info *sci,
>>> >  			goto retry;
>>> >  		}
>>> >
>>> > +		// Always redirty the buffer to avoid race condition
>>> > +		mark_buffer_dirty(ii->i_bh);
>>> > +		nilfs_mdt_mark_dirty(ifile);
>>> > +
>>> >  		clear_bit(NILFS_I_QUEUED, &ii->i_state);
>>> >  		set_bit(NILFS_I_BUSY, &ii->i_state);
>>> >  		list_move_tail(&ii->i_dirty, &sci->sc_dirty_files);
>>> > --
>>> > 2.13.2
>>> >
>>> > --
>>> > To unsubscribe from this list: send the line "unsubscribe
>>> > linux-nilfs" in the body of a message to majordomo@xxxxxxxxxxxxxxx
>>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>


