Re: [PATCH] nilfs2: Fix race condition that causes file system corruption


 



>> I haven't yet succeeded in reproducing the original fs corruption
>> error. I will continue trying to reproduce the bug while waiting for
>> reports from you and/or Clemens.
>
> I think I forgot to mention that I use a protection period of 30
> seconds. Here is my full nilfs_cleanerd.conf:

Thanks. This is likely to affect the test.

Regards,
Ryusuke Konishi

2017-10-29 23:06 GMT+09:00 Andreas Rohner <andreas.rohner@xxxxxxx>:
> Am Sun, 29 Oct 2017 21:53:38 +0900
> schrieb Ryusuke Konishi <konishi.ryusuke@xxxxxxxxx>:
>
>> 2017-10-29 17:07 GMT+09:00 Andreas Rohner <andreas.rohner@xxxxxxx>:
>> > Hi Ryusuke,
>> >
>> > Thank you for reviewing my patch. I will rerun my tests to make sure
>> > my conclusions were correct.
>> >
>> > The log output suggests to me that you are hitting a totally
>> > unrelated bug. I have seen that message before on one of my
>> > experimental branches, but I didn't think it could happen with
>> > stock NILFS2.
>> >
>> > My solution was to add the following piece of code to
>> > nilfs_vdesc_is_live() in nilfs-utils:
>> >
>> > if (vdesc->vd_period.p_end == vdesc->vd_cno) {
>> >         /*
>> >          * This block was overwritten in the same logical segment,
>> >          * but in a different partial segment, probably because of
>> >          * fdatasync() or a flush to disk.
>> >          * Without this check, gc will cause a buffer confliction
>> >          * error if both partial segments are cleaned at the same
>> >          * time. In that case there will be two vdescs with the
>> >          * same ino, cno and offset.
>> >          */
>> >         return 0;
>> > }
>> >
>> > I can't tell for sure if that will fix your problem; I am just
>> > guessing. It isn't necessary with my setup to reproduce the race
>> > condition bug.
>>
>> Thank you, Andreas.  This snippet looks to have fixed the error so
>> far.
>>
>> Could you shape it to a patch (preferably a separate one) for
>> nilfs-utils repo ?
>
> Of course, but I am not sure exactly how this error can occur.
> I tried to reproduce it with a small shell script using
> fdatasync(), but I wasn't able to.
>
> However, I am pretty sure that the patch does no harm and is safe.
>
>> I haven't yet succeeded in reproducing the original fs corruption
>> error. I will continue trying to reproduce the bug while waiting for
>> reports from you and/or Clemens.
>
> I think I forgot to mention that I use a protection period of 30
> seconds. Here is my full nilfs_cleanerd.conf:
>
> protection_period          30
> min_clean_segments         10%
> max_clean_segments         30%
> clean_check_interval       10
> nsegments_per_clean        16
> mc_nsegments_per_clean     16
> cleaning_interval          0.1
> mc_cleaning_interval       0.1
> retry_interval             60
> min_reclaimable_blocks     10%
> mc_min_reclaimable_blocks  1%
> use_set_suinfo
> use_mmap
> log_priority               info
>
> I can reproduce the corruption error with the latest kernel 4.14.0, and
> I can confirm that it is fixed after applying the patch.
> I used an older six-core AMD Phenom(tm) II X6 1090T processor and a
> Samsung 850 EVO SSD.
>
> Best regards,
>
> Andreas
>
>> Thanks,
>> Ryusuke Konishi
>>
>> >
>> > I am rerunning my tests now to make sure I didn't make a mistake.
>> >
>> > Best regards,
>> >
>> > Andreas
>> >
>> > Am Sun, 29 Oct 2017 14:09:40 +0900
>> > schrieb Ryusuke Konishi <konishi.ryusuke@xxxxxxxxx>:
>> >
>> >> Unfortunately the patch didn't fix this issue:
>> >>
>> >> [  612.614615] NILFS (sda1): segctord starting. Construction
>> >> interval = 5 seconds, CP frequency < 30 seconds
>> >> [ 4298.841376] NILFS (sda1): nilfs_ioctl_move_inode_block:
>> >> conflicting data buffer: ino=1155, cno=11, offset=1,
>> >> blocknr=1216036, vblocknr=1198436
>> >> [ 4298.841877] NILFS (sda1): error -17 preparing GC: cannot read
>> >> source blocks
>> >>
>> >> Looks like the bugsimulator hit another bug.
>> >>
>> >> When this happened, the disk usage was 100%:
>> >>
>> >> $ df /dev/sda1
>> >> Filesystem     1K-blocks     Used Available Use% Mounted on
>> >> /dev/sda1      104849404 99655676         0 100% /test
>> >>
>> >> The error also killed cleanerd. However, I could cleanly unmount
>> >> the partition, and I didn't see any errors after remounting it.
>> >>
>> >> As you commented, I have changed some parameters
>> >> in /etc/nilfs_cleanerd.conf:
>> >> -------------------------------------------
>> >> mc_nsegments_per_clean  16
>> >> nsegments_per_clean     16
>> >> max_clean_segments              30%
>> >> cleaning_interval       0.1
>> >> mc_cleaning_interval    0.1
>> >> -------------------------------------------
>> >>
>> >> I will look into this issue with bugsimulator.
>> >>
>> >> Can anyone reproduce the original corruption issue in another way?
>> >>
>> >> Regards,
>> >> Ryusuke Konishi
>> >>
>> >>
>> >> 2017-10-29 2:16 GMT+09:00 Ryusuke Konishi
>> >> <konishi.ryusuke@xxxxxxxxx>:
>> >> > Hi,
>> >> >
>> >> > Sorry for my late response.
>> >> > I could reproduce a corruption issue with the bug simulator.
>> >> >
>> >> > -----
>> >> > [88805.039545] NILFS (sda1): segctord starting. Construction
>> >> > interval = 5 seconds, CP frequency < 30 seconds
>> >> > [92817.941644] NILFS (sda1): nilfs_ioctl_move_inode_block:
>> >> > conflicting data buffer: ino=10846, cno=92, offset=2,
>> >> > blocknr=11425709, vblocknr=11227714
>> >> > [92817.941936] NILFS (sda1): error -17 preparing GC: cannot read
>> >> > source blocks
>> >> > -----
>> >> >
>> >> > Now I'm testing the patch. I will send it upstream once it has
>> >> > turned out to suppress the issue.
>> >> >
>> >> > The patch itself looks good to me, though I guess there may be
>> >> > similar bugs. What makes things complicated is that the inode
>> >> > block is not marked as dirty at the moment the "propagate"
>> >> > function carries a dirty state up from a dirty node/data block.
>> >> >
>> >> > Anyway, I will hasten to apply this patch.
>> >> >
>> >> > Thanks,
>> >> > Ryusuke Konishi
>> >> >
>> >> >
>> >> > 2017-07-13 15:44 GMT+09:00 Andreas Rohner
>> >> > <andreas.rohner@xxxxxxx>:
>> >> >> Hi,
>> >> >>
>> >> >> Am Wed, 12 Jul 2017 09:08:57 +0900
>> >> >> schrieb Ryusuke Konishi <konishi.ryusuke@xxxxxxxxx>:
>> >> >>
>> >> >>> Hi,
>> >> >>>
>> >> >>> 2017-07-11 15:19 GMT+09:00 Andreas Rohner
>> >> >>> <andreas.rohner@xxxxxxx>:
>> >> >>> >
>> >> >>> > There is a race condition between the function
>> >> >>> > nilfs_dirty_inode() and nilfs_set_file_dirty().
>> >> >>> >
>> >> >>> > When a file is opened, nilfs_dirty_inode() is called to
>> >> >>> > update the access timestamp in the inode. It calls
>> >> >>> > __nilfs_mark_inode_dirty() in a separate transaction.
>> >> >>> > __nilfs_mark_inode_dirty() caches the ifile buffer_head in
>> >> >>> > the i_bh field of the inode info structure and marks it as
>> >> >>> > dirty.
>> >> >>> >
>> >> >>> > After some data was written to the file in another
>> >> >>> > transaction, the function nilfs_set_file_dirty() is called,
>> >> >>> > which adds the inode to the ns_dirty_files list.
>> >> >>> >
>> >> >>> > Then the segment construction calls
>> >> >>> > nilfs_segctor_collect_dirty_files(), which goes through the
>> >> >>> > ns_dirty_files list and checks the i_bh field. If there is a
>> >> >>> > cached buffer_head in i_bh it is not marked as dirty again.
>> >> >>> >
>> >> >>> > Since nilfs_dirty_inode() and nilfs_set_file_dirty() use
>> >> >>> > separate transactions, it is possible that a segment
>> >> >>> > construction that writes out the ifile occurs in between the
>> >> >>> > two. If this happens, the inode is not on the ns_dirty_files
>> >> >>> > list, but its ifile block is still marked as dirty and
>> >> >>> > written out.
>> >> >>> >
>> >> >>> > In the next segment construction, the data for the file is
>> >> >>> > written out and nilfs_bmap_propagate() updates the b-tree.
>> >> >>> > Eventually the bmap root is written into the i_bh block,
>> >> >>> > which is not dirty, because it was written out in another
>> >> >>> > segment construction.
>> >> >>> >
>> >> >>> > As a result the bmap update can be lost, which leads to file
>> >> >>> > system corruption. Either the virtual block address points
>> >> >>> > to an unallocated DAT block, or the DAT entry will be reused
>> >> >>> > for something different.
>> >> >>> >
>> >> >>> > The error can remain undetected for a long time. A typical
>> >> >>> > error message would be one of the "bad btree" errors or a
>> >> >>> > warning that a DAT entry could not be found.
>> >> >>> >
>> >> >>> > This bug can be reproduced reliably by a simple benchmark
>> >> >>> > that creates and overwrites millions of 4k files.
>> >> >>> >
>> >> >>> > Signed-off-by: Andreas Rohner <andreas.rohner@xxxxxxx>
>> >> >>>
>> >> >>> Will review, please wait for a while.
>> >> >>
>> >> >> Thank you, I have uploaded my benchmark on GitHub:
>> >> >>
>> >> >> https://github.com/zeitgeist87/bugsimulator
>> >> >>
>> >> >> The benchmark is very simple. It writes out 13107200 4k files
>> >> >> in a two-layer directory tree, so that every directory contains
>> >> >> about 1000 files. Then it overwrites these files in an infinite
>> >> >> loop.
>> >> >>
>> >> >> You need quite aggressive cleaner settings for it to keep up
>> >> >> with the bugsimulator:
>> >> >>
>> >> >> mc_nsegments_per_clean  16
>> >> >> nsegments_per_clean     16
>> >> >> max_clean_segments              30%
>> >> >> cleaning_interval       0.1
>> >> >> mc_cleaning_interval    0.1
>> >> >>
>> >> >> If you run this benchmark on a fresh 100GB NILFS2 partition with
>> >> >> the above cleaner settings, the file system will be corrupt in
>> >> >> about half an hour. At least on my machine.
>> >> >>
>> >> >> With the patch for the race condition, it can run indefinitely.
>> >> >>
>> >> >> Regards,
>> >> >> Andreas
>> >> >>
>> >> >>> Regards,
>> >> >>> Ryusuke Konishi
>> >> >>>
>> >> >>> >
>> >> >>> > ---
>> >> >>> >  fs/nilfs2/segment.c | 6 ++++--
>> >> >>> >  1 file changed, 4 insertions(+), 2 deletions(-)
>> >> >>> >
>> >> >>> > diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
>> >> >>> > index 70ded52dc1dd..50e12956c737 100644
>> >> >>> > --- a/fs/nilfs2/segment.c
>> >> >>> > +++ b/fs/nilfs2/segment.c
>> >> >>> > @@ -1958,8 +1958,6 @@ static int nilfs_segctor_collect_dirty_files(struct nilfs_sc_info *sci,
>> >> >>> >                                              err, ii->vfs_inode.i_ino);
>> >> >>> >                                 return err;
>> >> >>> >                         }
>> >> >>> > -                       mark_buffer_dirty(ibh);
>> >> >>> > -                       nilfs_mdt_mark_dirty(ifile);
>> >> >>> >                         spin_lock(&nilfs->ns_inode_lock);
>> >> >>> >                         if (likely(!ii->i_bh))
>> >> >>> >                                 ii->i_bh = ibh;
>> >> >>> > @@ -1968,6 +1966,10 @@ static int nilfs_segctor_collect_dirty_files(struct nilfs_sc_info *sci,
>> >> >>> >                         goto retry;
>> >> >>> >                 }
>> >> >>> >
>> >> >>> > +               // Always redirty the buffer to avoid race condition
>> >> >>> > +               mark_buffer_dirty(ii->i_bh);
>> >> >>> > +               nilfs_mdt_mark_dirty(ifile);
>> >> >>> > +
>> >> >>> >                 clear_bit(NILFS_I_QUEUED, &ii->i_state);
>> >> >>> >                 set_bit(NILFS_I_BUSY, &ii->i_state);
>> >> >>> >                 list_move_tail(&ii->i_dirty, &sci->sc_dirty_files);
>> >> >>> > --
>> >> >>> > 2.13.2
>> >> >>> >
>> >> >>> > --
>> >> >>> > To unsubscribe from this list: send the line "unsubscribe
>> >> >>> > linux-nilfs" in the body of a message to
>> >> >>> > majordomo@xxxxxxxxxxxxxxx More majordomo info at
>> >> >>> > http://vger.kernel.org/majordomo-info.html
>> >> >>
>> >
>


