Re: [PATCH] nilfs2: Fix race condition that causes file system corruption

Hi,

The patch by Andreas seems to have fixed the corruption issue
that I could reproduce yesterday.  In the stress test in my
environment, it has prevented fs corruption for more than 9 hours.

Therefore, I applied the patch to the nilfs repo with Tested-by tags.

Thanks,
Ryusuke Konishi

2017-10-30 7:19 GMT+09:00 Ryusuke Konishi <konishi.ryusuke@xxxxxxxxx>:
> I could reproduce the issue in two and a half hours.
> Will proceed with verification of the patch later.
>
> Regards,
> Ryusuke Konishi
>
> 2017-10-30 1:29 GMT+09:00 Ryusuke Konishi <konishi.ryusuke@xxxxxxxxx>:
>>>> I haven't yet succeeded in reproducing the original fs corruption error.
>>>> I will keep trying to reproduce the bug while waiting for reports from
>>>> you and/or Clemens.
>>>
>>> I think I forgot to mention that I use a protection period of 30
>>> seconds. Here is my full nilfs_cleanerd.conf:
>>
>> Thanks.   This is likely to affect the test.
>>
>> Regards,
>> Ryusuke Konishi
>>
>> 2017-10-29 23:06 GMT+09:00 Andreas Rohner <andreas.rohner@xxxxxxx>:
>>> On Sun, 29 Oct 2017 21:53:38 +0900,
>>> Ryusuke Konishi <konishi.ryusuke@xxxxxxxxx> wrote:
>>>
>>>> 2017-10-29 17:07 GMT+09:00 Andreas Rohner <andreas.rohner@xxxxxxx>:
>>>> > Hi Ryusuke,
>>>> >
>>>> > Thank you for reviewing my patch. I will rerun my tests to make sure
>>>> > my conclusions were correct.
>>>> >
>>>> > The log output suggests to me that you are hitting a totally
>>>> > unrelated bug. I have seen that message before on one of my
>>>> > experimental branches, but I didn't think it could happen with
>>>> > stock NILFS2.
>>>> >
>>>> > My solution was to add the following piece of code to
>>>> > nilfs_vdesc_is_live() in nilfs-utils:
>>>> >
>>>> > if (vdesc->vd_period.p_end == vdesc->vd_cno) {
>>>> >         /*
>>>> >          * This block was overwritten in the same logical segment,
>>>> >          * but in a different partial segment, probably because of
>>>> >          * fdatasync() or a flush to disk.
>>>> >          * Without this check, gc will cause a buffer conflict error
>>>> >          * if both partial segments are cleaned at the same time.
>>>> >          * In that case there will be two vdescs with the same ino,
>>>> >          * cno and offset.
>>>> >          */
>>>> >         return 0;
>>>> > }
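>>>> >
>>>> > For reference, the period fields that check compares come from
>>>> > struct nilfs_period in the nilfs2 API header; the annotations are
>>>> > my reading of the semantics:
>>>> >
>>>> > struct nilfs_period {
>>>> >         __u64 p_start;  /* checkpoint in which this block version
>>>> >                            became live */
>>>> >         __u64 p_end;    /* checkpoint in which it was overwritten
>>>> >                            or deleted */
>>>> > };
>>>> >
>>>> > If I read it correctly, p_end == vd_cno means the version was
>>>> > already superseded within its own checkpoint, so it cannot be live.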
>>>> >
>>>> > I can't tell for sure if that will fix your problem. I am just
>>>> > guessing. It isn't necessary with my setup to reproduce the race
>>>> > condition bug.
>>>>
>>>> Thank you, Andreas.  This snippet seems to have fixed the error so
>>>> far.
>>>>
>>>> Could you shape it into a patch (preferably a separate one) for the
>>>> nilfs-utils repo?
>>>
>>> Of course, but I am not sure exactly how this error can occur.
>>> I tried to reproduce it with a small shell script using
>>> fdatasync(), but I wasn't able to.
>>>
>>> However, I am pretty sure that the patch does no harm and is safe.
>>>
>>>> I haven't yet succeeded in reproducing the original fs corruption error.
>>>> I will keep trying to reproduce the bug while waiting for reports from
>>>> you and/or Clemens.
>>>
>>> I think I forgot to mention that I use a protection period of 30
>>> seconds. Here is my full nilfs_cleanerd.conf:
>>>
>>> protection_period          30
>>> min_clean_segments         10%
>>> max_clean_segments         30%
>>> clean_check_interval       10
>>> nsegments_per_clean        16
>>> mc_nsegments_per_clean     16
>>> cleaning_interval          0.1
>>> mc_cleaning_interval       0.1
>>> retry_interval             60
>>> min_reclaimable_blocks     10%
>>> mc_min_reclaimable_blocks  1%
>>> use_set_suinfo
>>> use_mmap
>>> log_priority               info
>>>
>>> I can reproduce the corruption error with the latest kernel 4.14.0, and
>>> I can confirm that it is fixed after applying the patch.
>>> I used an older six-core AMD Phenom(tm) II X6 1090T processor and a
>>> Samsung 850 EVO SSD.
>>>
>>> Best regards,
>>>
>>> Andreas
>>>
>>>> Thanks,
>>>> Ryusuke Konishi
>>>>
>>>> >
>>>> > I am rerunning my tests now to make sure I didn't make a mistake.
>>>> >
>>>> > Best regards,
>>>> >
>>>> > Andreas
>>>> >
>>>> > On Sun, 29 Oct 2017 14:09:40 +0900,
>>>> > Ryusuke Konishi <konishi.ryusuke@xxxxxxxxx> wrote:
>>>> >
>>>> >> Unfortunately the patch didn't fix this issue:
>>>> >>
>>>> >> [  612.614615] NILFS (sda1): segctord starting. Construction
>>>> >> interval = 5 seconds, CP frequency < 30 seconds
>>>> >> [ 4298.841376] NILFS (sda1): nilfs_ioctl_move_inode_block:
>>>> >> conflicting data buffer: ino=1155, cno=11, offset=1,
>>>> >> blocknr=1216036, vblocknr=1198436
>>>> >> [ 4298.841877] NILFS (sda1): error -17 preparing GC: cannot read
>>>> >> source blocks
>>>> >>
>>>> >> Looks like the bugsimulator hit a different bug.
>>>> >>
>>>> >> When this happened, the disk usage was 100%:
>>>> >>
>>>> >> $ df /dev/sda1
>>>> >> Filesystem     1K-blocks     Used Available Use% Mounted on
>>>> >> /dev/sda1      104849404 99655676         0 100% /test
>>>> >>
>>>> >> And the error killed cleanerd.  However, I could cleanly unmount
>>>> >> the partition, and I didn't see any errors after remounting it.
>>>> >>
>>>> >> As you commented, I have changed some parameters
>>>> >> in /etc/nilfs_cleanerd.conf:
>>>> >> -------------------------------------------
>>>> >> mc_nsegments_per_clean  16
>>>> >> nsegments_per_clean     16
>>>> >> max_clean_segments      30%
>>>> >> cleaning_interval       0.1
>>>> >> mc_cleaning_interval    0.1
>>>> >> -------------------------------------------
>>>> >>
>>>> >> I will look into this issue with the bugsimulator.
>>>> >>
>>>> >> Can anyone reproduce the original corruption issue in another way?
>>>> >>
>>>> >> Regards,
>>>> >> Ryusuke Konishi
>>>> >>
>>>> >>
>>>> >> 2017-10-29 2:16 GMT+09:00 Ryusuke Konishi
>>>> >> <konishi.ryusuke@xxxxxxxxx>:
>>>> >> > Hi,
>>>> >> >
>>>> >> > Sorry for my late response.
>>>> >> > I could reproduce a corruption issue with the bug simulator.
>>>> >> >
>>>> >> > -----
>>>> >> > [88805.039545] NILFS (sda1): segctord starting. Construction
>>>> >> > interval = 5 seconds, CP frequency < 30 seconds
>>>> >> > [92817.941644] NILFS (sda1): nilfs_ioctl_move_inode_block:
>>>> >> > conflicting data buffer: ino=10846, cno=92, offset=2,
>>>> >> > blocknr=11425709, vblocknr=11227714
>>>> >> > [92817.941936] NILFS (sda1): error -17 preparing GC: cannot read
>>>> >> > source blocks
>>>> >> > -----
>>>> >> >
>>>> >> > Now I'm testing the patch.  I will send it upstream once it has
>>>> >> > turned out to suppress the issue.
>>>> >> >
>>>> >> > The patch itself looks good to me, though I guess there may be
>>>> >> > similar bugs. What makes things complicated is that the inode
>>>> >> > block is not marked as dirty at the moment the "propagate"
>>>> >> > function carries a dirty state up from a dirty node/data block.
>>>> >> >
>>>> >> > Anyway, I will hasten to apply this patch.
>>>> >> >
>>>> >> > Thanks,
>>>> >> > Ryusuke Konishi
>>>> >> >
>>>> >> >
>>>> >> > 2017-07-13 15:44 GMT+09:00 Andreas Rohner
>>>> >> > <andreas.rohner@xxxxxxx>:
>>>> >> >> Hi,
>>>> >> >>
>>>> >> >> On Wed, 12 Jul 2017 09:08:57 +0900,
>>>> >> >> Ryusuke Konishi <konishi.ryusuke@xxxxxxxxx> wrote:
>>>> >> >>
>>>> >> >>> Hi,
>>>> >> >>>
>>>> >> >>> 2017-07-11 15:19 GMT+09:00 Andreas Rohner
>>>> >> >>> <andreas.rohner@xxxxxxx>:
>>>> >> >>> >
>>>> >> >>> > There is a race condition between the functions
>>>> >> >>> > nilfs_dirty_inode() and nilfs_set_file_dirty().
>>>> >> >>> >
>>>> >> >>> > When a file is opened, nilfs_dirty_inode() is called to
>>>> >> >>> > update the access timestamp in the inode. It calls
>>>> >> >>> > __nilfs_mark_inode_dirty() in a separate transaction.
>>>> >> >>> > __nilfs_mark_inode_dirty() caches the ifile buffer_head in
>>>> >> >>> > the i_bh field of the inode info structure and marks it as
>>>> >> >>> > dirty.
>>>> >> >>> >
>>>> >> >>> > After some data was written to the file in another
>>>> >> >>> > transaction, the function nilfs_set_file_dirty() is called,
>>>> >> >>> > which adds the inode to the ns_dirty_files list.
>>>> >> >>> >
>>>> >> >>> > Then the segment construction calls
>>>> >> >>> > nilfs_segctor_collect_dirty_files(), which goes through the
>>>> >> >>> > ns_dirty_files list and checks the i_bh field. If there is a
>>>> >> >>> > cached buffer_head in i_bh, it is not marked as dirty again.
>>>> >> >>> >
>>>> >> >>> > Since nilfs_dirty_inode() and nilfs_set_file_dirty() use
>>>> >> >>> > separate transactions, it is possible that a segment
>>>> >> >>> > construction that writes out the ifile occurs in between the
>>>> >> >>> > two. If this happens, the inode is not on the ns_dirty_files
>>>> >> >>> > list, but its ifile block is still marked as dirty and
>>>> >> >>> > written out.
>>>> >> >>> >
>>>> >> >>> > In the next segment construction, the data for the file is
>>>> >> >>> > written out and nilfs_bmap_propagate() updates the b-tree.
>>>> >> >>> > Eventually the bmap root is written into the i_bh block,
>>>> >> >>> > which is not dirty, because it was written out in another
>>>> >> >>> > segment construction.
>>>> >> >>> >
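>>>> >> >>> > Schematically, the interleaving described above (function
>>>> >> >>> > names as in fs/nilfs2; this is just an illustration of the
>>>> >> >>> > scenario, not code from the tree):
>>>> >> >>> >
>>>> >> >>> > /*
>>>> >> >>> >  * nilfs_dirty_inode()             (atime update, txn A)
>>>> >> >>> >  *   __nilfs_mark_inode_dirty()
>>>> >> >>> >  *     -> caches the ifile bh in i_bh and marks it dirty
>>>> >> >>> >  *
>>>> >> >>> >  * segment construction #1
>>>> >> >>> >  *   -> writes out the ifile block; the bh is clean again,
>>>> >> >>> >  *      but i_bh still holds it
>>>> >> >>> >  *
>>>> >> >>> >  * nilfs_set_file_dirty()          (data write, txn B)
>>>> >> >>> >  *   -> puts the inode on ns_dirty_files
>>>> >> >>> >  *
>>>> >> >>> >  * segment construction #2
>>>> >> >>> >  *   nilfs_segctor_collect_dirty_files()
>>>> >> >>> >  *     -> sees i_bh already cached; without this patch the
>>>> >> >>> >  *        bh is not redirtied
>>>> >> >>> >  *   nilfs_bmap_propagate()
>>>> >> >>> >  *     -> writes the new bmap root into the clean bh, so
>>>> >> >>> >  *        the update is never written to disk
>>>> >> >>> >  */
>>>> >> >>> >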
>>>> >> >>> > As a result, the bmap update can be lost, which leads to file
>>>> >> >>> > system corruption: either the virtual block address points
>>>> >> >>> > to an unallocated DAT block, or the DAT entry is reused
>>>> >> >>> > for something different.
>>>> >> >>> >
>>>> >> >>> > The error can remain undetected for a long time. A typical
>>>> >> >>> > error message would be one of the "bad btree" errors or a
>>>> >> >>> > warning that a DAT entry could not be found.
>>>> >> >>> >
>>>> >> >>> > This bug can be reproduced reliably by a simple benchmark
>>>> >> >>> > that creates and overwrites millions of 4k files.
>>>> >> >>> >
>>>> >> >>> > Signed-off-by: Andreas Rohner <andreas.rohner@xxxxxxx>
>>>> >> >>>
>>>> >> >>> Will review, please wait for a while.
>>>> >> >>
>>>> >> >> Thank you. I have uploaded my benchmark to GitHub:
>>>> >> >>
>>>> >> >> https://github.com/zeitgeist87/bugsimulator
>>>> >> >>
>>>> >> >> The benchmark is very simple. It writes out 13107200 4k files
>>>> >> >> in a two-layer directory tree, so that every directory contains
>>>> >> >> about 1000 files. Then it overwrites these files in an infinite
>>>> >> >> loop.
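>>>> >> >>
>>>> >> >> The core of it is roughly the following (a minimal sketch of the
>>>> >> >> idea, not the actual bugsimulator source; 128 x 100 directories
>>>> >> >> x 1024 files = 13107200):
>>>> >> >>
>>>> >> >> #include <fcntl.h>
>>>> >> >> #include <stdio.h>
>>>> >> >> #include <string.h>
>>>> >> >> #include <sys/stat.h>
>>>> >> >> #include <unistd.h>
>>>> >> >>
>>>> >> >> static void write_file(const char *path, const char *buf, size_t len)
>>>> >> >> {
>>>> >> >>         int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
>>>> >> >>
>>>> >> >>         if (fd < 0)
>>>> >> >>                 return;
>>>> >> >>         if (write(fd, buf, len) < 0)
>>>> >> >>                 perror("write");
>>>> >> >>         close(fd);
>>>> >> >> }
>>>> >> >>
>>>> >> >> int main(void)
>>>> >> >> {
>>>> >> >>         char path[64], buf[4096];
>>>> >> >>
>>>> >> >>         memset(buf, 'a', sizeof(buf));
>>>> >> >>         for (;;) {  /* create once, then overwrite forever */
>>>> >> >>                 for (int i = 0; i < 128; i++) {
>>>> >> >>                         snprintf(path, sizeof(path), "d%03d", i);
>>>> >> >>                         mkdir(path, 0755);  /* EEXIST later is fine */
>>>> >> >>                         for (int j = 0; j < 100; j++) {
>>>> >> >>                                 snprintf(path, sizeof(path),
>>>> >> >>                                          "d%03d/d%02d", i, j);
>>>> >> >>                                 mkdir(path, 0755);
>>>> >> >>                                 for (int f = 0; f < 1024; f++) {
>>>> >> >>                                         snprintf(path, sizeof(path),
>>>> >> >>                                                  "d%03d/d%02d/f%04d",
>>>> >> >>                                                  i, j, f);
>>>> >> >>                                         write_file(path, buf,
>>>> >> >>                                                    sizeof(buf));
>>>> >> >>                                 }
>>>> >> >>                         }
>>>> >> >>                 }
>>>> >> >>         }
>>>> >> >> }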
>>>> >> >>
>>>> >> >> You need quite aggressive cleaner settings for the cleaner to
>>>> >> >> keep up with the bugsimulator:
>>>> >> >>
>>>> >> >> mc_nsegments_per_clean  16
>>>> >> >> nsegments_per_clean     16
>>>> >> >> max_clean_segments      30%
>>>> >> >> cleaning_interval       0.1
>>>> >> >> mc_cleaning_interval    0.1
>>>> >> >>
>>>> >> >> If you run this benchmark on a fresh 100GB NILFS2 partition with
>>>> >> >> the above cleaner settings, the file system will be corrupted in
>>>> >> >> about half an hour, at least on my machine.
>>>> >> >>
>>>> >> >> With the patch for the race condition, it can run indefinitely.
>>>> >> >>
>>>> >> >> Regards,
>>>> >> >> Andreas
>>>> >> >>
>>>> >> >>> Regards,
>>>> >> >>> Ryusuke Konishi
>>>> >> >>>
>>>> >> >>> >
>>>> >> >>> > ---
>>>> >> >>> >  fs/nilfs2/segment.c | 6 ++++--
>>>> >> >>> >  1 file changed, 4 insertions(+), 2 deletions(-)
>>>> >> >>> >
>>>> >> >>> > diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
>>>> >> >>> > index 70ded52dc1dd..50e12956c737 100644
>>>> >> >>> > --- a/fs/nilfs2/segment.c
>>>> >> >>> > +++ b/fs/nilfs2/segment.c
>>>> >> >>> > @@ -1958,8 +1958,6 @@ static int nilfs_segctor_collect_dirty_files(struct nilfs_sc_info *sci,
>>>> >> >>> >                                           err, ii->vfs_inode.i_ino);
>>>> >> >>> >                                 return err;
>>>> >> >>> >                         }
>>>> >> >>> > -                       mark_buffer_dirty(ibh);
>>>> >> >>> > -                       nilfs_mdt_mark_dirty(ifile);
>>>> >> >>> >                         spin_lock(&nilfs->ns_inode_lock);
>>>> >> >>> >                         if (likely(!ii->i_bh))
>>>> >> >>> >                                 ii->i_bh = ibh;
>>>> >> >>> > @@ -1968,6 +1966,10 @@ static int nilfs_segctor_collect_dirty_files(struct nilfs_sc_info *sci,
>>>> >> >>> >                         goto retry;
>>>> >> >>> >                 }
>>>> >> >>> >
>>>> >> >>> > +               // Always redirty the buffer to avoid race condition
>>>> >> >>> > +               mark_buffer_dirty(ii->i_bh);
>>>> >> >>> > +               nilfs_mdt_mark_dirty(ifile);
>>>> >> >>> > +
>>>> >> >>> >                 clear_bit(NILFS_I_QUEUED, &ii->i_state);
>>>> >> >>> >                 set_bit(NILFS_I_BUSY, &ii->i_state);
>>>> >> >>> >                 list_move_tail(&ii->i_dirty, &sci->sc_dirty_files);
>>>> >> >>> > --
>>>> >> >>> > 2.13.2