Hi,

The patch by Andreas seems to have fixed the corruption issue that I
could reproduce yesterday. In the stress test in my environment, it has
prevented fs corruption for more than 9 hours. Therefore, I have applied
the patch to the nilfs repo with Tested-by tags.

Thanks,
Ryusuke Konishi

2017-10-30 7:19 GMT+09:00 Ryusuke Konishi <konishi.ryusuke@xxxxxxxxx>:
> I could reproduce the issue in two and a half hours.
> I will proceed with verification of the patch later.
>
> Regards,
> Ryusuke Konishi
>
> 2017-10-30 1:29 GMT+09:00 Ryusuke Konishi <konishi.ryusuke@xxxxxxxxx>:
>>>> I haven't yet succeeded in reproducing the original fs corruption error.
>>>> I will continue trying to reproduce the bug while waiting for reports
>>>> from you and/or Clemens.
>>>
>>> I think I forgot to mention that I use a protection period of 30
>>> seconds. Here is my full nilfs_cleanerd.conf:
>>
>> Thanks. This is likely to affect the test.
>>
>> Regards,
>> Ryusuke Konishi
>>
>> 2017-10-29 23:06 GMT+09:00 Andreas Rohner <andreas.rohner@xxxxxxx>:
>>> On Sun, 29 Oct 2017 21:53:38 +0900,
>>> Ryusuke Konishi <konishi.ryusuke@xxxxxxxxx> wrote:
>>>
>>>> 2017-10-29 17:07 GMT+09:00 Andreas Rohner <andreas.rohner@xxxxxxx>:
>>>> > Hi Ryusuke,
>>>> >
>>>> > Thank you for reviewing my patch. I will rerun my tests to make sure
>>>> > my conclusions were correct.
>>>> >
>>>> > The log output suggests to me that you are hitting a totally
>>>> > unrelated bug. I have seen that message before on one of my
>>>> > experimental branches, but I didn't think it could happen with
>>>> > stock NILFS2.
>>>> >
>>>> > My solution was to add the following piece of code to
>>>> > nilfs_vdesc_is_live() in nilfs-utils:
>>>> >
>>>> > if (vdesc->vd_period.p_end == vdesc->vd_cno) {
>>>> >         /*
>>>> >          * This block was overwritten in the same logical segment,
>>>> >          * but in a different partial segment. Probably because of
>>>> >          * fdatasync() or a flush to disk.
>>>> >          * Without this check, GC will cause a buffer conflict error
>>>> >          * if both partial segments are cleaned at the same time.
>>>> >          * In that case there will be two vdescs with the same ino,
>>>> >          * cno and offset.
>>>> >          */
>>>> >         return 0;
>>>> > }
>>>> >
>>>> > I can't tell for sure if that will fix your problem. I am just
>>>> > guessing. It isn't necessary with my setup to reproduce the race
>>>> > condition bug.
>>>>
>>>> Thank you, Andreas. This snippet seems to have fixed the error so
>>>> far.
>>>>
>>>> Could you shape it into a patch (preferably a separate one) for the
>>>> nilfs-utils repo?
>>>
>>> Of course, but I am not sure how exactly this error can occur.
>>> I tried to reproduce it with a small shell script using
>>> fdatasync(), but I wasn't able to.
>>>
>>> However, I am pretty sure that the patch does no harm and is safe.
>>>
>>>> I haven't yet succeeded in reproducing the original fs corruption error.
>>>> I will continue trying to reproduce the bug while waiting for reports
>>>> from you and/or Clemens.
>>>
>>> I think I forgot to mention that I use a protection period of 30
>>> seconds.
>>> Here is my full nilfs_cleanerd.conf:
>>>
>>> protection_period 30
>>> min_clean_segments 10%
>>> max_clean_segments 30%
>>> clean_check_interval 10
>>> nsegments_per_clean 16
>>> mc_nsegments_per_clean 16
>>> cleaning_interval 0.1
>>> mc_cleaning_interval 0.1
>>> retry_interval 60
>>> min_reclaimable_blocks 10%
>>> mc_min_reclaimable_blocks 1%
>>> use_set_suinfo
>>> use_mmap
>>> log_priority info
>>>
>>> I can reproduce the corruption error with the latest kernel 4.14.0, and
>>> I can confirm that it is fixed after applying the patch.
>>> I used an older six-core AMD Phenom(tm) II X6 1090T processor and a
>>> Samsung 850 EVO SSD.
>>>
>>> Best regards,
>>>
>>> Andreas
>>>
>>>> Thanks,
>>>> Ryusuke Konishi
>>>>
>>>> >
>>>> > I am rerunning my tests now to make sure I didn't make a mistake.
>>>> >
>>>> > Best regards,
>>>> >
>>>> > Andreas
>>>> >
>>>> > On Sun, 29 Oct 2017 14:09:40 +0900,
>>>> > Ryusuke Konishi <konishi.ryusuke@xxxxxxxxx> wrote:
>>>> >
>>>> >> Unfortunately the patch didn't fix this issue:
>>>> >>
>>>> >> [  612.614615] NILFS (sda1): segctord starting. Construction
>>>> >> interval = 5 seconds, CP frequency < 30 seconds
>>>> >> [ 4298.841376] NILFS (sda1): nilfs_ioctl_move_inode_block:
>>>> >> conflicting data buffer: ino=1155, cno=11, offset=1,
>>>> >> blocknr=1216036, vblocknr=1198436
>>>> >> [ 4298.841877] NILFS (sda1): error -17 preparing GC: cannot read
>>>> >> source blocks
>>>> >>
>>>> >> Looks like the bugsimulator hit another bug.
>>>> >>
>>>> >> When this happened, the disk usage was 100%:
>>>> >>
>>>> >> $ df /dev/sda1
>>>> >> Filesystem     1K-blocks      Used Available Use% Mounted on
>>>> >> /dev/sda1      104849404  99655676         0 100% /test
>>>> >>
>>>> >> And the error killed cleanerd. However, I could cleanly unmount the
>>>> >> partition, and I didn't see any errors after remounting it.
>>>> >>
>>>> >> As you commented, I have changed some parameters
>>>> >> in /etc/nilfs_cleanerd.conf:
>>>> >> -------------------------------------------
>>>> >> mc_nsegments_per_clean 16
>>>> >> nsegments_per_clean 16
>>>> >> max_clean_segments 30%
>>>> >> cleaning_interval 0.1
>>>> >> mc_cleaning_interval 0.1
>>>> >> -------------------------------------------
>>>> >>
>>>> >> I will look into this issue with the bugsimulator.
>>>> >>
>>>> >> Can anyone reproduce the original corruption issue in another way?
>>>> >>
>>>> >> Regards,
>>>> >> Ryusuke Konishi
>>>> >>
>>>> >>
>>>> >> 2017-10-29 2:16 GMT+09:00 Ryusuke Konishi <konishi.ryusuke@xxxxxxxxx>:
>>>> >> > Hi,
>>>> >> >
>>>> >> > Sorry for my late response.
>>>> >> > I could reproduce a corruption issue with the bugsimulator.
>>>> >> >
>>>> >> > -----
>>>> >> > [88805.039545] NILFS (sda1): segctord starting. Construction
>>>> >> > interval = 5 seconds, CP frequency < 30 seconds
>>>> >> > [92817.941644] NILFS (sda1): nilfs_ioctl_move_inode_block:
>>>> >> > conflicting data buffer: ino=10846, cno=92, offset=2,
>>>> >> > blocknr=11425709, vblocknr=11227714
>>>> >> > [92817.941936] NILFS (sda1): error -17 preparing GC: cannot read
>>>> >> > source blocks
>>>> >> > -----
>>>> >> >
>>>> >> > Now I'm testing the patch. I will send it upstream once it has
>>>> >> > turned out to suppress the issue.
>>>> >> >
>>>> >> > The patch itself looks good to me, though I guess there may be
>>>> >> > similar bugs. What makes things complicated seems to be that the
>>>> >> > inode block is not marked as dirty at the moment the "propagate"
>>>> >> > function carries a dirty state up from a dirty node/data block.
>>>> >> >
>>>> >> > Anyway, I will hasten to apply this patch.
>>>> >> >
>>>> >> > Thanks,
>>>> >> > Ryusuke Konishi
>>>> >> >
>>>> >> >
>>>> >> > 2017-07-13 15:44 GMT+09:00 Andreas Rohner <andreas.rohner@xxxxxxx>:
>>>> >> >> Hi,
>>>> >> >>
>>>> >> >> On Wed, 12 Jul 2017 09:08:57 +0900,
>>>> >> >> Ryusuke Konishi <konishi.ryusuke@xxxxxxxxx> wrote:
>>>> >> >>
>>>> >> >>> Hi,
>>>> >> >>>
>>>> >> >>> 2017-07-11 15:19 GMT+09:00 Andreas Rohner <andreas.rohner@xxxxxxx>:
>>>> >> >>> >
>>>> >> >>> > There is a race condition between the functions
>>>> >> >>> > nilfs_dirty_inode() and nilfs_set_file_dirty().
>>>> >> >>> >
>>>> >> >>> > When a file is opened, nilfs_dirty_inode() is called to
>>>> >> >>> > update the access timestamp in the inode. It calls
>>>> >> >>> > __nilfs_mark_inode_dirty() in a separate transaction.
>>>> >> >>> > __nilfs_mark_inode_dirty() caches the ifile buffer_head in
>>>> >> >>> > the i_bh field of the inode info structure and marks it as
>>>> >> >>> > dirty.
>>>> >> >>> >
>>>> >> >>> > After some data has been written to the file in another
>>>> >> >>> > transaction, the function nilfs_set_file_dirty() is called,
>>>> >> >>> > which adds the inode to the ns_dirty_files list.
>>>> >> >>> >
>>>> >> >>> > Then the segment construction calls
>>>> >> >>> > nilfs_segctor_collect_dirty_files(), which goes through the
>>>> >> >>> > ns_dirty_files list and checks the i_bh field. If there is a
>>>> >> >>> > cached buffer_head in i_bh, it is not marked as dirty again.
>>>> >> >>> >
>>>> >> >>> > Since nilfs_dirty_inode() and nilfs_set_file_dirty() use
>>>> >> >>> > separate transactions, it is possible that a segment
>>>> >> >>> > construction that writes out the ifile occurs in between the
>>>> >> >>> > two. If this happens, the inode is not on the ns_dirty_files
>>>> >> >>> > list, but its ifile block is still marked as dirty and
>>>> >> >>> > written out.
>>>> >> >>> >
>>>> >> >>> > In the next segment construction, the data for the file is
>>>> >> >>> > written out and nilfs_bmap_propagate() updates the b-tree.
>>>> >> >>> > Eventually the bmap root is written into the i_bh block,
>>>> >> >>> > which is not dirty, because it was written out in another
>>>> >> >>> > segment construction.
>>>> >> >>> >
>>>> >> >>> > As a result, the bmap update can be lost, which leads to file
>>>> >> >>> > system corruption. Either the virtual block address points
>>>> >> >>> > to an unallocated DAT block, or the DAT entry will be reused
>>>> >> >>> > for something different.
>>>> >> >>> >
>>>> >> >>> > The error can remain undetected for a long time. A typical
>>>> >> >>> > error message would be one of the "bad btree" errors or a
>>>> >> >>> > warning that a DAT entry could not be found.
>>>> >> >>> >
>>>> >> >>> > This bug can be reproduced reliably by a simple benchmark
>>>> >> >>> > that creates and overwrites millions of 4k files.
>>>> >> >>> >
>>>> >> >>> > Signed-off-by: Andreas Rohner <andreas.rohner@xxxxxxx>
>>>> >> >>>
>>>> >> >>> Will review, please wait for a while.
>>>> >> >>
>>>> >> >> Thank you, I have uploaded my benchmark to GitHub:
>>>> >> >>
>>>> >> >> https://github.com/zeitgeist87/bugsimulator
>>>> >> >>
>>>> >> >> The benchmark is very simple. It writes out 13107200 4k files
>>>> >> >> in a two-layer directory tree, so that every directory contains
>>>> >> >> about 1000 files. Then it overwrites these files in an infinite
>>>> >> >> loop.
>>>> >> >>
>>>> >> >> You need quite aggressive cleaner settings for it to keep up
>>>> >> >> with the bugsimulator:
>>>> >> >>
>>>> >> >> mc_nsegments_per_clean 16
>>>> >> >> nsegments_per_clean 16
>>>> >> >> max_clean_segments 30%
>>>> >> >> cleaning_interval 0.1
>>>> >> >> mc_cleaning_interval 0.1
>>>> >> >>
>>>> >> >> If you run this benchmark on a fresh 100GB NILFS2 partition with
>>>> >> >> the above cleaner settings, the file system will be corrupted in
>>>> >> >> about half an hour, at least on my machine.
>>>> >> >>
>>>> >> >> With the patch for the race condition, it can run indefinitely.
>>>> >> >>
>>>> >> >> Regards,
>>>> >> >> Andreas
>>>> >> >>
>>>> >> >>> Regards,
>>>> >> >>> Ryusuke Konishi
>>>> >> >>>
>>>> >> >>> >
>>>> >> >>> > ---
>>>> >> >>> >  fs/nilfs2/segment.c | 6 ++++--
>>>> >> >>> >  1 file changed, 4 insertions(+), 2 deletions(-)
>>>> >> >>> >
>>>> >> >>> > diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
>>>> >> >>> > index 70ded52dc1dd..50e12956c737 100644
>>>> >> >>> > --- a/fs/nilfs2/segment.c
>>>> >> >>> > +++ b/fs/nilfs2/segment.c
>>>> >> >>> > @@ -1958,8 +1958,6 @@ static int nilfs_segctor_collect_dirty_files(struct nilfs_sc_info *sci,
>>>> >> >>> >  					  err, ii->vfs_inode.i_ino);
>>>> >> >>> >  				return err;
>>>> >> >>> >  			}
>>>> >> >>> > -			mark_buffer_dirty(ibh);
>>>> >> >>> > -			nilfs_mdt_mark_dirty(ifile);
>>>> >> >>> >  			spin_lock(&nilfs->ns_inode_lock);
>>>> >> >>> >  			if (likely(!ii->i_bh))
>>>> >> >>> >  				ii->i_bh = ibh;
>>>> >> >>> > @@ -1968,6 +1966,10 @@ static int nilfs_segctor_collect_dirty_files(struct nilfs_sc_info *sci,
>>>> >> >>> >  			goto retry;
>>>> >> >>> >  		}
>>>> >> >>> >
>>>> >> >>> > +		// Always redirty the buffer to avoid race condition
>>>> >> >>> > +		mark_buffer_dirty(ii->i_bh);
>>>> >> >>> > +		nilfs_mdt_mark_dirty(ifile);
>>>> >> >>> > +
>>>> >> >>> >  		clear_bit(NILFS_I_QUEUED, &ii->i_state);
>>>> >> >>> >  		set_bit(NILFS_I_BUSY, &ii->i_state);
>>>> >> >>> >  		list_move_tail(&ii->i_dirty, &sci->sc_dirty_files);
>>>> >> >>> > --
>>>> >> >>> > 2.13.2
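---

For reference, here is a minimal, hypothetical sketch of the kind of workload
the bugsimulator benchmark is described as performing above: create millions
of 4k files in a two-layer directory tree with roughly 1000 files per leaf
directory, then overwrite them in an infinite loop. The actual benchmark is at
https://github.com/zeitgeist87/bugsimulator and may differ; every name and
constant below (NDIRS, NSUB, NFILES, the d%03d/s%03d/f%04d layout) is an
illustrative assumption, not taken from that repository.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>

#define NDIRS   128     /* first-layer directories (illustrative value) */
#define NSUB    100     /* second-layer directories per first-layer dir */
#define NFILES  1000    /* files per leaf directory, per the description */
#define FSIZE   4096    /* 4k per file */

/* Create or overwrite one 4k file. */
static void write_file(const char *path, const char *buf)
{
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (fd < 0 || write(fd, buf, FSIZE) != FSIZE) {
		perror(path);
		exit(1);
	}
	close(fd);
}

int main(void)
{
	static char buf[FSIZE];
	char path[64];
	unsigned long pass;
	int d, s, f;

	memset(buf, 'x', sizeof(buf));

	/* Pass 0 creates the tree; every later pass overwrites it. */
	for (pass = 0; ; pass++) {
		for (d = 0; d < NDIRS; d++) {
			snprintf(path, sizeof(path), "d%03d", d);
			mkdir(path, 0755);   /* EEXIST on later passes is fine */
			for (s = 0; s < NSUB; s++) {
				snprintf(path, sizeof(path), "d%03d/s%03d", d, s);
				mkdir(path, 0755);
				for (f = 0; f < NFILES; f++) {
					snprintf(path, sizeof(path),
						 "d%03d/s%03d/f%04d", d, s, f);
					write_file(path, buf);
				}
			}
		}
		printf("pass %lu done\n", pass);
	}
	return 0;
}

To approximate the setup described in the thread, such a program would be run
from the root of a freshly formatted NILFS2 partition together with the
aggressive cleaner settings quoted above.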