Re: [PATCH] nilfs2: Fix race condition that causes file system corruption

I could reproduce the issue in two and a half hours.
I will proceed with verification of the patch later.

Regards,
Ryusuke Konishi

2017-10-30 1:29 GMT+09:00 Ryusuke Konishi <konishi.ryusuke@xxxxxxxxx>:
>>> I haven't yet succeeded in reproducing the original fs corruption error.
>>> I will keep trying to reproduce the bug while waiting for a report from
>>> you and/or Clemens.
>>
>> I think I forgot to mention that I use a protection period of 30
>> seconds. Here is my full nilfs_cleanerd.conf:
>
> Thanks. This is likely to affect the test.
>
> Regards,
> Ryusuke Konishi
>
> 2017-10-29 23:06 GMT+09:00 Andreas Rohner <andreas.rohner@xxxxxxx>:
>> On Sun, 29 Oct 2017 21:53:38 +0900,
>> Ryusuke Konishi <konishi.ryusuke@xxxxxxxxx> wrote:
>>
>>> 2017-10-29 17:07 GMT+09:00 Andreas Rohner <andreas.rohner@xxxxxxx>:
>>> > Hi Ryusuke,
>>> >
>>> > Thank you for reviewing my patch. I will rerun my tests to make sure
>>> > my conclusions were correct.
>>> >
>>> > The log output suggests to me that you are hitting a totally
>>> > unrelated bug. I have seen that message before on one of my
>>> > experimental branches, but I didn't think it could happen with
>>> > stock NILFS2.
>>> >
>>> > My solution was to add the following piece of code to
>>> > nilfs_vdesc_is_live() in nilfs-utils:
>>> >
>>> > if (vdesc->vd_period.p_end == vdesc->vd_cno) {
>>> >         /*
>>> >          * This block was overwritten in the same logical segment,
>>> >          * but in a different partial segment, probably because of
>>> >          * fdatasync() or a flush to disk.
>>> >          * Without this check, GC will cause a buffer confliction
>>> >          * error if both partial segments are cleaned at the same
>>> >          * time. In that case there will be two vdescs with the
>>> >          * same ino, cno and offset.
>>> >          */
>>> >         return 0;
>>> > }
>>> >
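>>> > For context, the check sits near the top of nilfs_vdesc_is_live()
>>> > in cleanerd, roughly like this (a sketch only; the signature and
>>> > the surrounding liveness checks are paraphrased from memory, not
>>> > copied from the nilfs-utils sources):
>>> >
>>> > /* Types come from the nilfs-utils headers (struct nilfs_vdesc,
>>> >  * nilfs_cno_t, NILFS_CNO_MAX). */
>>> > static int nilfs_vdesc_is_live(const struct nilfs_vdesc *vdesc,
>>> >                                nilfs_cno_t protect)
>>> > {
>>> >         /* Blocks that were never overwritten are trivially live. */
>>> >         if (vdesc->vd_period.p_end == NILFS_CNO_MAX)
>>> >                 return 1;
>>> >
>>> >         if (vdesc->vd_period.p_end == vdesc->vd_cno) {
>>> >                 /* Overwritten within the same logical segment; see
>>> >                  * the comment above. Treat it as reclaimable. */
>>> >                 return 0;
>>> >         }
>>> >
>>> >         /* ... snapshot and protection-period checks elided ... */
>>> >         return vdesc->vd_period.p_end > protect;
>>> > }
>>> >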
>>> > I can't tell for sure if that will fix your problem. I am just
>>> > guessing. It isn't necessary with my setup to reproduce the race
>>> > condition bug.
>>>
>>> Thank you, Andreas. This snippet seems to have fixed the error so
>>> far.
>>>
>>> Could you shape it into a patch (preferably a separate one) for the
>>> nilfs-utils repo?
>>
>> Of course, but I am not sure how this error can occur exactly.
>> I tried to reproduce it with a small shell script using
>> fdatasync(), but I wasn't able to.
>>
>> However, I am pretty sure that the patch does no harm and is safe.
>>
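>> For reference, the pattern I was trying to hit looks roughly like
>> this in C (a sketch of the idea only; the path and sizes are
>> arbitrary):
>>
>> #include <fcntl.h>
>> #include <string.h>
>> #include <unistd.h>
>>
>> int main(void)
>> {
>>         char buf[4096];
>>         int fd = open("/test/conflict", O_CREAT | O_RDWR, 0644);
>>
>>         if (fd < 0)
>>                 return 1;
>>
>>         memset(buf, 'a', sizeof(buf));
>>         pwrite(fd, buf, sizeof(buf), 0);
>>         fdatasync(fd);                   /* flushes a partial segment */
>>
>>         memset(buf, 'b', sizeof(buf));
>>         pwrite(fd, buf, sizeof(buf), 0); /* overwrite the same block */
>>
>>         close(fd);
>>         return 0;
>> }
>>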
>>> I haven't yet succeeded in reproducing the original fs corruption error.
>>> I will keep trying to reproduce the bug while waiting for a report from
>>> you and/or Clemens.
>>
>> I think I forgot to mention that I use a protection period of 30
>> seconds. Here is my full nilfs_cleanerd.conf:
>>
>> protection_period          30
>> min_clean_segments         10%
>> max_clean_segments         30%
>> clean_check_interval       10
>> nsegments_per_clean        16
>> mc_nsegments_per_clean     16
>> cleaning_interval          0.1
>> mc_cleaning_interval       0.1
>> retry_interval             60
>> min_reclaimable_blocks     10%
>> mc_min_reclaimable_blocks  1%
>> use_set_suinfo
>> use_mmap
>> log_priority               info
>>
>> I can reproduce the corruption error with the latest kernel 4.14.0, and
>> I can confirm that it is fixed after applying the patch.
>> I used an older six-core AMD Phenom(tm) II X6 1090T processor and a
>> Samsung 850 EVO SSD.
>>
>> Best regards,
>>
>> Andreas
>>
>>> Thanks,
>>> Ryusuke Konishi
>>>
>>> >
>>> > I am rerunning my tests now to make sure I didn't make a mistake.
>>> >
>>> > Best regards,
>>> >
>>> > Andreas
>>> >
>>> > On Sun, 29 Oct 2017 14:09:40 +0900,
>>> > Ryusuke Konishi <konishi.ryusuke@xxxxxxxxx> wrote:
>>> >
>>> >> Unfortunately the patch didn't fix this issue:
>>> >>
>>> >> [  612.614615] NILFS (sda1): segctord starting. Construction
>>> >> interval = 5 seconds, CP frequency < 30 seconds
>>> >> [ 4298.841376] NILFS (sda1): nilfs_ioctl_move_inode_block:
>>> >> conflicting data buffer: ino=1155, cno=11, offset=1,
>>> >> blocknr=1216036, vblocknr=1198436
>>> >> [ 4298.841877] NILFS (sda1): error -17 preparing GC: cannot read
>>> >> source blocks
>>> >>
>>> >> Looks like the bugsimulator hit another bug.
>>> >>
>>> >> When this happened, the disk usage was 100%:
>>> >>
>>> >> $ df /dev/sda1
>>> >> Filesystem     1K-blocks     Used Available Use% Mounted on
>>> >> /dev/sda1      104849404 99655676         0 100% /test
>>> >>
>>> >> And the error killed cleanerd. However, I could cleanly unmount
>>> >> the partition, and I didn't see any errors after remounting it.
>>> >>
>>> >> As you commented, I have changed some parameters
>>> >> in /etc/nilfs_cleanerd.conf:
>>> >> -------------------------------------------
>>> >> mc_nsegments_per_clean  16
>>> >> nsegments_per_clean     16
>>> >> max_clean_segments      30%
>>> >> cleaning_interval       0.1
>>> >> mc_cleaning_interval    0.1
>>> >> -------------------------------------------
>>> >>
>>> >> I will look into this issue with the bugsimulator.
>>> >>
>>> >> Can anyone reproduce the original corruption issue in another way?
>>> >>
>>> >> Regards,
>>> >> Ryusuke Konishi
>>> >>
>>> >>
>>> >> 2017-10-29 2:16 GMT+09:00 Ryusuke Konishi
>>> >> <konishi.ryusuke@xxxxxxxxx>:
>>> >> > Hi,
>>> >> >
>>> >> > Sorry for my late response.
>>> >> > I could reproduce a corruption issue with the bug simulator.
>>> >> >
>>> >> > -----
>>> >> > [88805.039545] NILFS (sda1): segctord starting. Construction
>>> >> > interval = 5 seconds, CP frequency < 30 seconds
>>> >> > [92817.941644] NILFS (sda1): nilfs_ioctl_move_inode_block:
>>> >> > conflicting data buffer: ino=10846, cno=92, offset=2,
>>> >> > blocknr=11425709, vblocknr=11227714
>>> >> > [92817.941936] NILFS (sda1): error -17 preparing GC: cannot read
>>> >> > source blocks
>>> >> > -----
>>> >> >
>>> >> > Now I'm testing the patch. I will send it upstream once it
>>> >> > turns out to suppress the issue.
>>> >> >
>>> >> > The patch itself looks good to me, though I guess there may be
>>> >> > similar bugs. What complicates things is that the inode block
>>> >> > is not marked as dirty at the moment the "propagate" function
>>> >> > carries a dirty state over from a dirty node/data block.
>>> >> >
>>> >> > Anyway, I will hasten to apply this patch.
>>> >> >
>>> >> > Thanks,
>>> >> > Ryusuke Konishi
>>> >> >
>>> >> >
>>> >> > 2017-07-13 15:44 GMT+09:00 Andreas Rohner
>>> >> > <andreas.rohner@xxxxxxx>:
>>> >> >> Hi,
>>> >> >>
>>> >> >> On Wed, 12 Jul 2017 09:08:57 +0900,
>>> >> >> Ryusuke Konishi <konishi.ryusuke@xxxxxxxxx> wrote:
>>> >> >>
>>> >> >>> Hi,
>>> >> >>>
>>> >> >>> 2017-07-11 15:19 GMT+09:00 Andreas Rohner
>>> >> >>> <andreas.rohner@xxxxxxx>:
>>> >> >>> >
>>> >> >>> > There is a race condition between the functions
>>> >> >>> > nilfs_dirty_inode() and nilfs_set_file_dirty().
>>> >> >>> >
>>> >> >>> > When a file is opened, nilfs_dirty_inode() is called to
>>> >> >>> > update the access timestamp in the inode. It calls
>>> >> >>> > __nilfs_mark_inode_dirty() in a separate transaction.
>>> >> >>> > __nilfs_mark_inode_dirty() caches the ifile buffer_head in
>>> >> >>> > the i_bh field of the inode info structure and marks it as
>>> >> >>> > dirty.
>>> >> >>> >
>>> >> >>> > After some data was written to the file in another
>>> >> >>> > transaction, the function nilfs_set_file_dirty() is called,
>>> >> >>> > which adds the inode to the ns_dirty_files list.
>>> >> >>> >
>>> >> >>> > Then the segment construction calls
>>> >> >>> > nilfs_segctor_collect_dirty_files(), which goes through the
>>> >> >>> > ns_dirty_files list and checks the i_bh field. If there is a
>>> >> >>> > cached buffer_head in i_bh it is not marked as dirty again.
>>> >> >>> >
>>> >> >>> > Since nilfs_dirty_inode() and nilfs_set_file_dirty() use
>>> >> >>> > separate transactions, it is possible that a segment
>>> >> >>> > construction that writes out the ifile occurs between the
>>> >> >>> > two. If this happens, the inode is not on the ns_dirty_files
>>> >> >>> > list, but its ifile block is still marked as dirty and
>>> >> >>> > written out.
>>> >> >>> >
>>> >> >>> > In the next segment construction, the data for the file is
>>> >> >>> > written out and nilfs_bmap_propagate() updates the b-tree.
>>> >> >>> > Eventually the bmap root is written into the i_bh block,
>>> >> >>> > which is not dirty because it was already written out by the
>>> >> >>> > earlier segment construction.
>>> >> >>> >
>>> >> >>> > As a result the bmap update can be lost, which leads to file
>>> >> >>> > system corruption. Either the virtual block address points
>>> >> >>> > to an unallocated DAT block, or the DAT entry will be reused
>>> >> >>> > for something different.
>>> >> >>> >
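>>> >> >>> > In timeline form, the interleaving looks like this (an
>>> >> >>> > illustrative sketch of the steps described above, not code
>>> >> >>> > from the patch):
>>> >> >>> >
>>> >> >>> > /*
>>> >> >>> >  * process context                 segctord
>>> >> >>> >  * ---------------                 --------
>>> >> >>> >  * nilfs_dirty_inode()
>>> >> >>> >  *   __nilfs_mark_inode_dirty()
>>> >> >>> >  *     caches ifile bh in ii->i_bh
>>> >> >>> >  *     and marks it dirty
>>> >> >>> >  *                                 segment construction writes
>>> >> >>> >  *                                 out the ifile; the cached
>>> >> >>> >  *                                 ii->i_bh is now clean
>>> >> >>> >  * data is written to the file
>>> >> >>> >  * nilfs_set_file_dirty()
>>> >> >>> >  *   adds inode to ns_dirty_files
>>> >> >>> >  *                                 nilfs_segctor_collect_dirty_files()
>>> >> >>> >  *                                 sees the cached ii->i_bh and
>>> >> >>> >  *                                 does not re-mark it dirty
>>> >> >>> >  *                                 nilfs_bmap_propagate() writes
>>> >> >>> >  *                                 the bmap root into the clean
>>> >> >>> >  *                                 i_bh block => update is lost
>>> >> >>> >  */
>>> >> >>> >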
>>> >> >>> > The error can remain undetected for a long time. A typical
>>> >> >>> > error message would be one of the "bad btree" errors or a
>>> >> >>> > warning that a DAT entry could not be found.
>>> >> >>> >
>>> >> >>> > This bug can be reproduced reliably by a simple benchmark
>>> >> >>> > that creates and overwrites millions of 4k files.
>>> >> >>> >
>>> >> >>> > Signed-off-by: Andreas Rohner <andreas.rohner@xxxxxxx>
>>> >> >>>
>>> >> >>> Will review, please wait for a while.
>>> >> >>
>>> >> >> Thank you. I have uploaded my benchmark to GitHub:
>>> >> >>
>>> >> >> https://github.com/zeitgeist87/bugsimulator
>>> >> >>
>>> >> >> The benchmark is very simple. It writes out 13107200 4k files
>>> >> >> in a two-layer directory tree, so that every directory contains
>>> >> >> about 1000 files. Then it overwrites these files in an infinite
>>> >> >> loop.
>>> >> >>
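>>> >> >> In outline, the workload looks like this (a sketch only; the
>>> >> >> real source is in the repository above, and the constants here
>>> >> >> are illustrative):
>>> >> >>
>>> >> >> #include <fcntl.h>
>>> >> >> #include <stdio.h>
>>> >> >> #include <string.h>
>>> >> >> #include <sys/stat.h>
>>> >> >> #include <sys/types.h>
>>> >> >> #include <unistd.h>
>>> >> >>
>>> >> >> #define NDIRS   115   /* ~115 * 115 * 1000 ~= 13 million files */
>>> >> >> #define NFILES  1000
>>> >> >>
>>> >> >> static void write_file(const char *path)
>>> >> >> {
>>> >> >>         static char buf[4096];
>>> >> >>         int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
>>> >> >>
>>> >> >>         if (fd >= 0) {
>>> >> >>                 memset(buf, 'x', sizeof(buf));
>>> >> >>                 write(fd, buf, sizeof(buf));
>>> >> >>                 close(fd);
>>> >> >>         }
>>> >> >> }
>>> >> >>
>>> >> >> int main(void)
>>> >> >> {
>>> >> >>         char path[128];
>>> >> >>         int i, j, k;
>>> >> >>
>>> >> >>         for (;;) {   /* create once, then keep overwriting */
>>> >> >>                 for (i = 0; i < NDIRS; i++) {
>>> >> >>                         snprintf(path, sizeof(path), "d%d", i);
>>> >> >>                         mkdir(path, 0755);
>>> >> >>                         for (j = 0; j < NDIRS; j++) {
>>> >> >>                                 snprintf(path, sizeof(path),
>>> >> >>                                          "d%d/d%d", i, j);
>>> >> >>                                 mkdir(path, 0755);
>>> >> >>                                 for (k = 0; k < NFILES; k++) {
>>> >> >>                                         snprintf(path, sizeof(path),
>>> >> >>                                                  "d%d/d%d/f%d",
>>> >> >>                                                  i, j, k);
>>> >> >>                                         write_file(path);
>>> >> >>                                 }
>>> >> >>                         }
>>> >> >>                 }
>>> >> >>         }
>>> >> >> }
>>> >> >>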
>>> >> >> You need quite aggressive cleaner settings for it to keep up
>>> >> >> with the bugsimulator:
>>> >> >>
>>> >> >> mc_nsegments_per_clean  16
>>> >> >> nsegments_per_clean     16
>>> >> >> max_clean_segments      30%
>>> >> >> cleaning_interval       0.1
>>> >> >> mc_cleaning_interval    0.1
>>> >> >>
>>> >> >> If you run this benchmark on a fresh 100GB NILFS2 partition
>>> >> >> with the above cleaner settings, the file system will be
>>> >> >> corrupted in about half an hour, at least on my machine.
>>> >> >>
>>> >> >> With the patch for the race condition, it can run indefinitely.
>>> >> >>
>>> >> >> Regards,
>>> >> >> Andreas
>>> >> >>
>>> >> >>> Regards,
>>> >> >>> Ryusuke Konishi
>>> >> >>>
>>> >> >>> >
>>> >> >>> > ---
>>> >> >>> >  fs/nilfs2/segment.c | 6 ++++--
>>> >> >>> >  1 file changed, 4 insertions(+), 2 deletions(-)
>>> >> >>> >
>>> >> >>> > diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
>>> >> >>> > index 70ded52dc1dd..50e12956c737 100644
>>> >> >>> > --- a/fs/nilfs2/segment.c
>>> >> >>> > +++ b/fs/nilfs2/segment.c
>>> >> >>> > @@ -1958,8 +1958,6 @@ static int nilfs_segctor_collect_dirty_files(struct nilfs_sc_info *sci,
>>> >> >>> >                                           err, ii->vfs_inode.i_ino);
>>> >> >>> >                                 return err;
>>> >> >>> >                         }
>>> >> >>> > -                       mark_buffer_dirty(ibh);
>>> >> >>> > -                       nilfs_mdt_mark_dirty(ifile);
>>> >> >>> >                         spin_lock(&nilfs->ns_inode_lock);
>>> >> >>> >                         if (likely(!ii->i_bh))
>>> >> >>> >                                 ii->i_bh = ibh;
>>> >> >>> > @@ -1968,6 +1966,10 @@ static int nilfs_segctor_collect_dirty_files(struct nilfs_sc_info *sci,
>>> >> >>> >                         goto retry;
>>> >> >>> >                 }
>>> >> >>> >
>>> >> >>> > +               // Always redirty the buffer to avoid race condition
>>> >> >>> > +               mark_buffer_dirty(ii->i_bh);
>>> >> >>> > +               nilfs_mdt_mark_dirty(ifile);
>>> >> >>> > +
>>> >> >>> >                 clear_bit(NILFS_I_QUEUED, &ii->i_state);
>>> >> >>> >                 set_bit(NILFS_I_BUSY, &ii->i_state);
>>> >> >>> >                 list_move_tail(&ii->i_dirty, &sci->sc_dirty_files);
>>> >> >>> > --
>>> >> >>> > 2.13.2