If someone could confirm that the patch by Andreas fixes the original
corruption issue, I will send the patch upstream anyway. Verification (a
"Tested-by" tag) is needed to let it go to the stable trees.

Ryusuke Konishi

2017-10-29 14:09 GMT+09:00 Ryusuke Konishi <konishi.ryusuke@xxxxxxxxx>:
> Unfortunately the patch didn't fix this issue:
>
> [  612.614615] NILFS (sda1): segctord starting. Construction interval
>     = 5 seconds, CP frequency < 30 seconds
> [ 4298.841376] NILFS (sda1): nilfs_ioctl_move_inode_block: conflicting
>     data buffer: ino=1155, cno=11, offset=1, blocknr=1216036,
>     vblocknr=1198436
> [ 4298.841877] NILFS (sda1): error -17 preparing GC: cannot read source blocks
>
> Looks like the bugsimulator hit another bug.
>
> When this happened, the disk usage was 100%:
>
> $ df /dev/sda1
> Filesystem     1K-blocks     Used Available Use% Mounted on
> /dev/sda1      104849404 99655676         0 100% /test
>
> And the error killed cleanerd. However, I could cleanly unmount the
> partition, and I didn't see any errors after remounting it.
>
> As you commented, I have changed some parameters in /etc/nilfs_cleanerd.conf:
> -------------------------------------------
> mc_nsegments_per_clean  16
> nsegments_per_clean     16
> max_clean_segments      30%
> cleaning_interval       0.1
> mc_cleaning_interval    0.1
> -------------------------------------------
>
> I will look into this issue with the bugsimulator.
>
> Can anyone reproduce the original corruption issue in another way?
>
> Regards,
> Ryusuke Konishi
>
>
> 2017-10-29 2:16 GMT+09:00 Ryusuke Konishi <konishi.ryusuke@xxxxxxxxx>:
>> Hi,
>>
>> Sorry for my late response.
>> I could reproduce a corruption issue with the bugsimulator:
>>
>> -----
>> [88805.039545] NILFS (sda1): segctord starting.
>>     Construction interval = 5 seconds, CP frequency < 30 seconds
>> [92817.941644] NILFS (sda1): nilfs_ioctl_move_inode_block: conflicting
>>     data buffer: ino=10846, cno=92, offset=2, blocknr=11425709,
>>     vblocknr=11227714
>> [92817.941936] NILFS (sda1): error -17 preparing GC: cannot read source blocks
>> -----
>>
>> Now I'm testing the patch. I will send it upstream once it has turned
>> out to suppress the issue.
>>
>> The patch itself looks good to me, though I guess there may be similar
>> bugs. What makes things complicated seems to be that the inode block is
>> not marked as dirty when the "propagate" function carries a dirty state
>> from a dirty node/data block.
>>
>> Anyway, I will hasten to apply this patch.
>>
>> Thanks,
>> Ryusuke Konishi
>>
>>
>> 2017-07-13 15:44 GMT+09:00 Andreas Rohner <andreas.rohner@xxxxxxx>:
>>> Hi,
>>>
>>> On Wed, 12 Jul 2017 09:08:57 +0900,
>>> Ryusuke Konishi <konishi.ryusuke@xxxxxxxxx> wrote:
>>>
>>>> Hi,
>>>>
>>>> 2017-07-11 15:19 GMT+09:00 Andreas Rohner <andreas.rohner@xxxxxxx>:
>>>> >
>>>> > There is a race condition between the functions nilfs_dirty_inode()
>>>> > and nilfs_set_file_dirty().
>>>> >
>>>> > When a file is opened, nilfs_dirty_inode() is called to update the
>>>> > access timestamp in the inode. It calls __nilfs_mark_inode_dirty()
>>>> > in a separate transaction. __nilfs_mark_inode_dirty() caches the
>>>> > ifile buffer_head in the i_bh field of the inode info structure and
>>>> > marks it as dirty.
>>>> >
>>>> > After some data was written to the file in another transaction, the
>>>> > function nilfs_set_file_dirty() is called, which adds the inode to
>>>> > the ns_dirty_files list.
>>>> >
>>>> > Then the segment construction calls
>>>> > nilfs_segctor_collect_dirty_files(), which goes through the
>>>> > ns_dirty_files list and checks the i_bh field. If there is a cached
>>>> > buffer_head in i_bh, it is not marked as dirty again.
>>>> >
>>>> > Since nilfs_dirty_inode() and nilfs_set_file_dirty() use separate
>>>> > transactions, it is possible that a segment construction that
>>>> > writes out the ifile occurs in between the two. If this happens, the
>>>> > inode is not on the ns_dirty_files list, but its ifile block is
>>>> > still marked as dirty and written out.
>>>> >
>>>> > In the next segment construction, the data for the file is written
>>>> > out and nilfs_bmap_propagate() updates the b-tree. Eventually the
>>>> > bmap root is written into the i_bh block, which is not dirty,
>>>> > because it was written out in another segment construction.
>>>> >
>>>> > As a result the bmap update can be lost, which leads to file system
>>>> > corruption. Either the virtual block address points to an
>>>> > unallocated DAT block, or the DAT entry will be reused for
>>>> > something different.
>>>> >
>>>> > The error can remain undetected for a long time. A typical error
>>>> > message would be one of the "bad btree" errors or a warning that a
>>>> > DAT entry could not be found.
>>>> >
>>>> > This bug can be reproduced reliably by a simple benchmark that
>>>> > creates and overwrites millions of 4k files.
>>>> >
>>>> > Signed-off-by: Andreas Rohner <andreas.rohner@xxxxxxx>
>>>>
>>>> Will review; please wait for a while.
>>>
>>> Thank you. I have uploaded my benchmark to GitHub:
>>>
>>> https://github.com/zeitgeist87/bugsimulator
>>>
>>> The benchmark is very simple. It writes out 13107200 4k files in a
>>> two-layer directory tree, so that every directory contains about 1000
>>> files. Then it overwrites these files in an infinite loop.
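The workload described above can be sketched roughly as follows (a minimal, hypothetical re-implementation for illustration only; the real benchmark is the bugsimulator linked above, and the names `write_files`, `run_bugsim`, and `FILES_PER_DIR` are invented here):

```c
/*
 * Rough sketch of the bugsimulator workload: create "nfiles" 4k files
 * in a two-layer directory tree with about FILES_PER_DIR entries per
 * directory, then overwrite them in a loop.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

#define FILES_PER_DIR	1000
#define FILE_SIZE	4096

/* Write (or overwrite) nfiles 4k files below root; "pass" varies the
 * contents so every overwrite really dirties the data blocks. */
static void write_files(const char *root, long nfiles, int pass)
{
	char path[512], buf[FILE_SIZE];
	long i;

	memset(buf, 'a' + (pass % 26), sizeof(buf));
	for (i = 0; i < nfiles; i++) {
		long d1 = i / (FILES_PER_DIR * FILES_PER_DIR);
		long d2 = (i / FILES_PER_DIR) % FILES_PER_DIR;
		FILE *f;

		snprintf(path, sizeof(path), "%s/%ld", root, d1);
		mkdir(path, 0755);	/* EEXIST after pass 0 is fine */
		snprintf(path, sizeof(path), "%s/%ld/%ld", root, d1, d2);
		mkdir(path, 0755);
		snprintf(path, sizeof(path), "%s/%ld/%ld/%ld",
			 root, d1, d2, i);

		f = fopen(path, "w");
		if (!f) {
			perror(path);
			exit(1);
		}
		fwrite(buf, 1, sizeof(buf), f);
		fclose(f);
	}
}

/* Create the tree once, then overwrite it a fixed number of times,
 * or forever when passes < 0, as the benchmark does. */
static void run_bugsim(const char *root, long nfiles, int passes)
{
	int pass;

	mkdir(root, 0755);
	for (pass = 0; passes < 0 || pass < passes; pass++)
		write_files(root, nfiles, pass);
}
```

Every overwrite dirties both data blocks and inode metadata in separate transactions, which, per the analysis above, maximizes the chance that a segment construction falls between the two.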
>>>
>>> You need quite aggressive cleaner settings for it to keep up with the
>>> bugsimulator:
>>>
>>> mc_nsegments_per_clean  16
>>> nsegments_per_clean     16
>>> max_clean_segments      30%
>>> cleaning_interval       0.1
>>> mc_cleaning_interval    0.1
>>>
>>> If you run this benchmark on a fresh 100GB NILFS2 partition with the
>>> above cleaner settings, the file system will be corrupted in about
>>> half an hour, at least on my machine.
>>>
>>> With the patch for the race condition, it can run indefinitely.
>>>
>>> Regards,
>>> Andreas
>>>
>>>> Regards,
>>>> Ryusuke Konishi
>>>>
>>>> >
>>>> > ---
>>>> >  fs/nilfs2/segment.c | 6 ++++--
>>>> >  1 file changed, 4 insertions(+), 2 deletions(-)
>>>> >
>>>> > diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
>>>> > index 70ded52dc1dd..50e12956c737 100644
>>>> > --- a/fs/nilfs2/segment.c
>>>> > +++ b/fs/nilfs2/segment.c
>>>> > @@ -1958,8 +1958,6 @@ static int nilfs_segctor_collect_dirty_files(struct nilfs_sc_info *sci,
>>>> >  					  err, ii->vfs_inode.i_ino);
>>>> >  				return err;
>>>> >  			}
>>>> > -			mark_buffer_dirty(ibh);
>>>> > -			nilfs_mdt_mark_dirty(ifile);
>>>> >  			spin_lock(&nilfs->ns_inode_lock);
>>>> >  			if (likely(!ii->i_bh))
>>>> >  				ii->i_bh = ibh;
>>>> > @@ -1968,6 +1966,10 @@ static int nilfs_segctor_collect_dirty_files(struct nilfs_sc_info *sci,
>>>> >  			goto retry;
>>>> >  		}
>>>> >
>>>> > +		// Always redirty the buffer to avoid race condition
>>>> > +		mark_buffer_dirty(ii->i_bh);
>>>> > +		nilfs_mdt_mark_dirty(ifile);
>>>> > +
>>>> >  		clear_bit(NILFS_I_QUEUED, &ii->i_state);
>>>> >  		set_bit(NILFS_I_BUSY, &ii->i_state);
>>>> >  		list_move_tail(&ii->i_dirty, &sci->sc_dirty_files);
>>>> > --
>>>> > 2.13.2
>>>> >
>>>> > --
>>>> > To unsubscribe from this list: send the line "unsubscribe linux-nilfs"
>>>> > in the body of a message to majordomo@xxxxxxxxxxxxxxx
>>>> > More majordomo info at http://vger.kernel.org/majordomo-info.html