Re: [PATCH] nilfs2: Fix race condition that causes file system corruption

2017-10-29 17:07 GMT+09:00 Andreas Rohner <andreas.rohner@xxxxxxx>:
> Hi Ryusuke,
>
> Thank you for reviewing my patch. I will rerun my tests to make sure
> my conclusions were correct.
>
> The log output suggests to me that you are hitting a totally unrelated
> bug. I have seen that message before on one of my experimental branches,
> but I didn't think it could happen with stock NILFS2.
>
> My solution was to add the following piece of code to
> nilfs_vdesc_is_live() in nilfs-utils:
>
> if (vdesc->vd_period.p_end == vdesc->vd_cno) {
>         /*
>          * This block was overwritten in the same logical segment, but
>          * in a different partial segment, probably because of
>          * fdatasync() or a flush to disk.
>          * Without this check, GC will cause a buffer conflict error
>          * if both partial segments are cleaned at the same time:
>          * there will be two vdescs with the same ino, cno and offset.
>          */
>         return 0;
> }
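>
> For reference, vd_cno and vd_period come from struct nilfs_vdesc, the
> GC descriptor passed between cleanerd and the kernel (defined in the
> nilfs2 UAPI header, linux/nilfs2_api.h on recent kernels). A minimal
> standalone illustration of the condition (just a sketch, not the actual
> nilfs-utils function):
>
> #include <linux/nilfs2_api.h>
>
> /*
>  * Nonzero if the block described by @vdesc already went dead in the
>  * same checkpoint it belongs to, i.e. it was overwritten again before
>  * the next checkpoint was created.
>  */
> static int vdesc_died_in_own_checkpoint(const struct nilfs_vdesc *vdesc)
> {
>         return vdesc->vd_period.p_end == vdesc->vd_cno;
> }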
>
> I can't tell for sure if that will fix your problem; I am just guessing.
> It isn't necessary with my setup to reproduce the race condition bug.

Thank you, Andreas.  This snippet seems to have fixed the error so far.

Could you shape it into a patch (preferably a separate one) for the nilfs-utils repo?

I haven't yet succeeded in reproducing the original fs corruption error.
I will continue trying to reproduce the bug while waiting for a report
from you and/or Clemens.

Thanks,
Ryusuke Konishi

>
> I am rerunning my tests now to make sure I didn't make a mistake.
>
> Best regards,
>
> Andreas
>
> On Sun, 29 Oct 2017 14:09:40 +0900,
> Ryusuke Konishi <konishi.ryusuke@xxxxxxxxx> wrote:
>
>> Unfortunately the patch didn't fix this issue:
>>
>> [  612.614615] NILFS (sda1): segctord starting. Construction interval
>> = 5 seconds, CP frequency < 30 seconds
>> [ 4298.841376] NILFS (sda1): nilfs_ioctl_move_inode_block: conflicting
>> data buffer: ino=1155, cno=11, offset=1, blocknr=1216036,
>> vblocknr=1198436
>> [ 4298.841877] NILFS (sda1): error -17 preparing GC: cannot read
>> source blocks
>>
>> Looks like the bugsimulator hit another bug.
>>
>> When this happened, the disk usage was 100%:
>>
>> $ df /dev/sda1
>> Filesystem     1K-blocks     Used Available Use% Mounted on
>> /dev/sda1      104849404 99655676         0 100% /test
>>
>> And the error killed cleanerd.  However, I could cleanly unmount the
>> partition, and I didn't see any errors after remounting it.
>>
>> As you commented, I have changed some parameters
>> in /etc/nilfs_cleanerd.conf:
>> -------------------------------------------
>> mc_nsegments_per_clean  16
>> nsegments_per_clean     16
>> max_clean_segments              30%
>> cleaning_interval       0.1
>> mc_cleaning_interval    0.1
>> -------------------------------------------
>>
>> I will look into this issue with bugsimulator.
>>
>> Can anyone reproduce the original corruption issue in another way?
>>
>> Regards,
>> Ryusuke Konishi
>>
>>
>> 2017-10-29 2:16 GMT+09:00 Ryusuke Konishi <konishi.ryusuke@xxxxxxxxx>:
>> > Hi,
>> >
>> > Sorry for my late response.
>> > I could reproduce a corruption issue with the bug simulator.
>> >
>> > -----
>> > [88805.039545] NILFS (sda1): segctord starting. Construction
>> > interval = 5 seconds, CP frequency < 30 seconds
>> > [92817.941644] NILFS (sda1): nilfs_ioctl_move_inode_block:
>> > conflicting data buffer: ino=10846, cno=92, offset=2,
>> > blocknr=11425709, vblocknr=11227714
>> > [92817.941936] NILFS (sda1): error -17 preparing GC: cannot read
>> > source blocks
>> > -----
>> >
>> > Now I'm testing the patch.  I will send it upstream once it has
>> > turned out to suppress the issue.
>> >
>> > The patch itself looks good to me, though I guess there may be
>> > similar bugs. What seems to complicate things is that the inode block
>> > is not marked dirty at the moment the "propagate" function carries a
>> > dirty state up from a dirty node/data block.
>> >
>> > Anyway, I will hasten to apply this patch.
>> >
>> > Thanks,
>> > Ryusuke Konishi
>> >
>> >
>> > 2017-07-13 15:44 GMT+09:00 Andreas Rohner
>> > <andreas.rohner@xxxxxxx>:
>> >> Hi,
>> >>
>> >> On Wed, 12 Jul 2017 09:08:57 +0900,
>> >> Ryusuke Konishi <konishi.ryusuke@xxxxxxxxx> wrote:
>> >>
>> >>> Hi,
>> >>>
>> >>> 2017-07-11 15:19 GMT+09:00 Andreas Rohner
>> >>> <andreas.rohner@xxxxxxx>:
>> >>> >
>> >>> > There is a race condition between the function
>> >>> > nilfs_dirty_inode() and nilfs_set_file_dirty().
>> >>> >
>> >>> > When a file is opened, nilfs_dirty_inode() is called to update
>> >>> > the access timestamp in the inode. It calls
>> >>> > __nilfs_mark_inode_dirty() in a separate transaction.
>> >>> > __nilfs_mark_inode_dirty() caches the ifile buffer_head in the
>> >>> > i_bh field of the inode info structure and marks it as dirty.
>> >>> >
>> >>> > After some data was written to the file in another transaction,
>> >>> > the function nilfs_set_file_dirty() is called, which adds the
>> >>> > inode to the ns_dirty_files list.
>> >>> >
>> >>> > Then the segment construction calls
>> >>> > nilfs_segctor_collect_dirty_files(), which goes through the
>> >>> > ns_dirty_files list and checks the i_bh field. If there is a
>> >>> > cached buffer_head in i_bh it is not marked as dirty again.
>> >>> >
>> >>> > Since nilfs_dirty_inode() and nilfs_set_file_dirty() use
>> >>> > separate transactions, it is possible that a segment
>> >>> > construction that writes out the ifile occurs in-between the
>> >>> > two. If this happens the inode is not on the ns_dirty_files
>> >>> > list, but its ifile block is still marked as dirty and written
>> >>> > out.
>> >>> >
>> >>> > In the next segment construction, the data for the file is
>> >>> > written out and nilfs_bmap_propagate() updates the b-tree.
>> >>> > Eventually the bmap root is written into the i_bh block, which
>> >>> > is not dirty, because it was written out in another segment
>> >>> > construction.
>> >>> >
>> >>> > As a result the bmap update can be lost, which leads to file
>> >>> > system corruption. Either the virtual block address points to an
>> >>> > unallocated DAT block, or the DAT entry will be reused for
>> >>> > something different.
>> >>> >
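>> >>> > Roughly, the interleaving looks like this (a simplified sketch of
>> >>> > the call sequence, not the exact kernel call chains):
>> >>> >
>> >>> > /* task context: open() updates atime */
>> >>> > nilfs_dirty_inode(inode);
>> >>> >         /* -> nilfs_transaction_begin()
>> >>> >          * -> __nilfs_mark_inode_dirty(): reads the ifile block,
>> >>> >          *    caches it in ii->i_bh and marks it dirty
>> >>> >          * -> nilfs_transaction_commit()
>> >>> >          */
>> >>> >
>> >>> > /* segment construction runs here: it writes out the dirty ifile
>> >>> >  * block and clears its dirty flag; the inode is not yet on
>> >>> >  * ns_dirty_files, but ii->i_bh stays cached */
>> >>> >
>> >>> > /* task context: write() dirties data */
>> >>> > nilfs_set_file_dirty(inode, nr_dirty);
>> >>> >         /* adds the inode to ns_dirty_files */
>> >>> >
>> >>> > /* next segment construction:
>> >>> >  * nilfs_segctor_collect_dirty_files() finds ii->i_bh != NULL and,
>> >>> >  * without this patch, does not mark it dirty again, so the bmap
>> >>> >  * root that nilfs_bmap_propagate() later stores in that block is
>> >>> >  * never written out */
>> >>> >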
>> >>> > The error can remain undetected for a long time. A typical error
>> >>> > message would be one of the "bad btree" errors or a warning
>> >>> > that a DAT entry could not be found.
>> >>> >
>> >>> > This bug can be reproduced reliably by a simple benchmark that
>> >>> > creates and overwrites millions of 4k files.
>> >>> >
>> >>> > Signed-off-by: Andreas Rohner <andreas.rohner@xxxxxxx>
>> >>>
>> >>> Will review, please wait for a while.
>> >>
>> >> Thank you, I have uploaded my benchmark on GitHub:
>> >>
>> >> https://github.com/zeitgeist87/bugsimulator
>> >>
>> >> The benchmark is very simple. It writes out 13107200 4k files in a
>> >> two-layer directory tree, so that every directory contains about
>> >> 1000 files. Then it overwrites these files in an infinite loop.
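>> >>
>> >> Roughly, the access pattern is equivalent to this (a stripped-down
>> >> sketch with placeholder sizes and a single directory level, not the
>> >> actual bugsimulator source; the real tree is two layers deep with
>> >> roughly 13000 leaf directories):
>> >>
>> >> #include <fcntl.h>
>> >> #include <stdio.h>
>> >> #include <string.h>
>> >> #include <sys/stat.h>
>> >> #include <unistd.h>
>> >>
>> >> #define NDIRS  128     /* placeholder directory count */
>> >> #define NFILES 1000    /* about 1000 files per directory */
>> >> #define BLKSZ  4096
>> >>
>> >> int main(void)
>> >> {
>> >>         static char buf[BLKSZ];
>> >>         char path[64];
>> >>         int i, j, fd;
>> >>
>> >>         memset(buf, 'a', sizeof(buf));
>> >>         for (;;) {      /* first pass creates, later passes overwrite */
>> >>                 for (i = 0; i < NDIRS; i++) {
>> >>                         snprintf(path, sizeof(path), "d%d", i);
>> >>                         mkdir(path, 0755);
>> >>                         for (j = 0; j < NFILES; j++) {
>> >>                                 snprintf(path, sizeof(path), "d%d/f%d", i, j);
>> >>                                 fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
>> >>                                 if (fd < 0)
>> >>                                         continue;
>> >>                                 write(fd, buf, sizeof(buf));
>> >>                                 close(fd);
>> >>                         }
>> >>                 }
>> >>         }
>> >>         return 0;
>> >> }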
>> >>
>> >> The cleaner needs quite aggressive settings to keep up with the
>> >> bugsimulator:
>> >>
>> >> mc_nsegments_per_clean  16
>> >> nsegments_per_clean     16
>> >> max_clean_segments              30%
>> >> cleaning_interval       0.1
>> >> mc_cleaning_interval    0.1
>> >>
>> >> If you run this benchmark on a fresh 100GB NILFS2 partition with
>> >> the above cleaner settings, the file system will be corrupted in
>> >> about half an hour, at least on my machine.
>> >>
>> >> With the patch for the race condition, it can run indefinitely.
>> >>
>> >> Regards,
>> >> Andreas
>> >>
>> >>> Regards,
>> >>> Ryusuke Konishi
>> >>>
>> >>> >
>> >>> > ---
>> >>> >  fs/nilfs2/segment.c | 6 ++++--
>> >>> >  1 file changed, 4 insertions(+), 2 deletions(-)
>> >>> >
>> >>> > diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
>> >>> > index 70ded52dc1dd..50e12956c737 100644
>> >>> > --- a/fs/nilfs2/segment.c
>> >>> > +++ b/fs/nilfs2/segment.c
>> >>> > @@ -1958,8 +1958,6 @@ static int nilfs_segctor_collect_dirty_files(struct nilfs_sc_info *sci,
>> >>> >                                            err, ii->vfs_inode.i_ino);
>> >>> >                                 return err;
>> >>> >                         }
>> >>> > -                       mark_buffer_dirty(ibh);
>> >>> > -                       nilfs_mdt_mark_dirty(ifile);
>> >>> >                         spin_lock(&nilfs->ns_inode_lock);
>> >>> >                         if (likely(!ii->i_bh))
>> >>> >                                 ii->i_bh = ibh;
>> >>> > @@ -1968,6 +1966,10 @@ static int nilfs_segctor_collect_dirty_files(struct nilfs_sc_info *sci,
>> >>> >                                 goto retry;
>> >>> >                 }
>> >>> >
>> >>> > +               // Always redirty the buffer to avoid race condition
>> >>> > +               mark_buffer_dirty(ii->i_bh);
>> >>> > +               nilfs_mdt_mark_dirty(ifile);
>> >>> > +
>> >>> >                 clear_bit(NILFS_I_QUEUED, &ii->i_state);
>> >>> >                 set_bit(NILFS_I_BUSY, &ii->i_state);
>> >>> >                 list_move_tail(&ii->i_dirty, &sci->sc_dirty_files);
>> >>> > --
>> >>> > 2.13.2
>> >>> >
>> >>
>


