On Wed, Nov 9, 2022 at 12:26 PM Liu Shixin wrote: > On 2022/11/9 1:50, Ryusuke Konishi wrote: > > Hi Liu Shixin, > > > > I'm sorry for my repeated emails. > > > > On Wed, Nov 9, 2022 at 12:13 AM Ryusuke Konishi wrote: > >>> I think the most likely cause is metadata corruption. If so, we > >>> should fix it by adding a proper sanity check, yes. > >>> However, there is still the possibility that the error retry logic > >>> after detecting errors has a flaw. (I believe at least this is not a > >>> problem with normal paths.) > >>> The sufile state inconsistency above is hypothetical for now. > >>> Either way, I'd like to make sure what's actually happening (and where > >>> the anomaly is coming from) so we can decide how to fix it. > >> I noticed that this syzbot report has a disk image "mount_0.gz", so I > >> tried to mount it read-only. > >> The result was as follows: > >> > >> $ sudo mount -t nilfs2 -r ./mount_0 /mnt/test > >> $ lssu > >> SEGNUM DATE TIME STAT NBLOCKS > >> 0 26760730-10-29 19:40:00 -de 527958016 > >> 1 26760730-11-01 20:29:04 -de 527958016 > >> 2 1176433641-11-01 02:01:52 -de 2983038235 > >> 3 -1158249729-11-01 04:57:19 a-- 25375 > >> 4 1970-01-02 21:24:32 a-- 0 > >> 5 -1415215929-01-02 00:19:15 --e 1631451365 > >> 6 841067190-01-02 13:02:59 -d- 3101461260 > >> 7 1495233207-01-02 01:31:11 -de 1697748441 > >> 8 1875753393-01-02 21:54:14 -de 337757600 > >> 9 648878284-01-02 21:06:08 --- 2133388152 > >> 10 2122778986-01-02 17:49:43 --e 874605800 > >> 11 55846197-01-02 09:50:53 -de 4262365368 > >> 12 1872396026-01-02 06:45:18 --- 1051768258 > >> 13 820920473-01-02 19:28:35 --e 2932336675 > >> 14 2128306755-01-02 10:17:45 --- 3568501216 > >> 15 1457063063-01-02 01:24:17 --e 2330511560 > >> 16 586864775-01-02 16:08:15 --- 355283425 > >> 17 -824870041-01-02 22:47:26 -d- 4177999057 > >> 18 1562176264-01-02 08:06:45 --- 1312597355 > >> 19 -392925420-01-02 17:08:27 -d- 3811075948 > >> 20 1927575458-01-02 21:26:51 -de 1384562525 > >> 21 2139923505-01-02 08:16:04 -d- 41861305 > >> > >> Here, we can see that neither segment #3 (= ns_segnum) nor #4 (= > >> ns_nextnum) has the dirty flag > >> ("d" in STAT). So, as expected, this seems to be the root cause of > >> the duplicate assignments and oops. > >> The proper correction would be, therefore, to check and reject (or > >> fix) this anomaly while outputting an error message > >> (or warning if fix it at the same time). > >> It should be inserted in mount time logic because the segments that > >> nilfs2 itself allocates are always marked dirty with > >> nilfs_sufile_alloc(). > > I will change my opinion. > > > > Considering the possibility of sufile corruption at runtime, it seems > > that the sanity check should be done on every nilfs_sufile_alloc() > > call. > > > > I now think nilfs_sufile_alloc() should call nilfs_error() and return > > -EIO if the number of a newly found vacant segment matches > > nilfs->ns_segnum or nilfs->ns_nextnum. > Before we add the first segbuf into sci->sc_segbufs in nilfs_segctor_begin_construction(), > there is no possibility existing duplicate segnum. And the subsequent process should > not be affected by sufile corruption. So I think it's enough to only check for case alloc==0. > I don't find any other possible wrong scenarios. I think the problem is not yet fixed. With your patch, I still think there are scenarios that cause duplicate segment assignments: (1) If the segment at nilfs->ns_segnum is in a clean state due to sufile data corruption, and if sci->sc_write_logs is empty and nilfs->ns_segnum == nilfs->nextnum to start writing a segment from the beginning, then nilfs_sufile_alloc() is called in nilfs_segctor_begin_construction() and nextnum can be equal to nilfs->ns_segnum. This nextnum is stored in segbuf->sb_nextnum, and if segbuf is spliced with nilfs_segctor_extend_segments(), the segbuf->sb_nextnum will be used for sb_segnum of the added segbuf, and the list corruption of sb_segsum_buffers can happen between these two segbufs. (2) If the segment at nilfs->ns_segnum is in a clean state due to sufile data corruption, and if sci->sc_write_logs is not empty and segbuf->sb_rest_blocks < NILFS_PSEG_MIN_BLOCKS, then nilfs_sufile_alloc() is called in nilfs_segctor_begin_construction(), and the retrieved nextnum is set to sb_nextnum of the appended segbuf there, and this sb_nextnum can be equal to nilfs->ns_segnum. Then, if the segbufs are spliced with nilfs_segctor_extend_segments(), the last segbuf->sb_nextnum will be used for sb_segnum of the appended segbuf. Therefore, list corruption of sb_segsum_buffers can happen. (3) If the segment at nilfs->ns_segnum is in a clean state due to sufile data corruption, and then the segbuf is spliced with nilfs_segctor_extend_segments(), nilfs_sufile_alloc() is called, and the retrieved "nextnextnum" is set to sb_nextnum of the appended segbuf, but this can be equal to nilfs->ns_segnum. If one more segbuf is appended to the segbufs in nilfs_segctor_extend_segments(), list corruption of sb_segsum_buffers can happen due to the segnum duplication. And please call nilfs_error() after detecting the metadata anomaly to output this critical situation and degrade the filesystem appropriately (read-only for example). This metadata corruption is a critical situation and writes will continually fail. File system operations should not continue as if nothing happened. I would like to ask you to reconsider how to fix it, but will also comment on the current patch below. > > diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c > index 7be632c15f91..7b91c790b798 100644 > --- a/fs/nilfs2/segment.c > +++ b/fs/nilfs2/segment.c > @@ -1326,7 +1326,13 @@ static int nilfs_segctor_begin_construction(struct nilfs_sc_info *sci, > err = nilfs_sufile_alloc(nilfs->ns_sufile, &nextnum); > if (err) > goto failed; > + } else { > + /* Check the next segment has alreadly been allocated */ Here is a typo: "alreadly" -> "already" > + err = nilfs_sufile_test_allocated(nilfs->ns_sufile, nextnum); > + if (err) > + goto failed; > } > + > nilfs_segbuf_set_next_segnum(segbuf, nextnum, nilfs); > > BUG_ON(!list_empty(&sci->sc_segbufs)); > diff --git a/fs/nilfs2/sufile.c b/fs/nilfs2/sufile.c > index 853a8212114f..8dff12c56891 100644 > --- a/fs/nilfs2/sufile.c > +++ b/fs/nilfs2/sufile.c > @@ -399,6 +399,36 @@ int nilfs_sufile_alloc(struct inode *sufile, __u64 *segnump) > return ret; > } > > +int nilfs_sufile_test_allocated(struct inode *sufile, __u64 *segnump) The second argument should be "__u64 segnum". The type is different from the above caller. > +{ > + struct buffer_head *su_bh; > + struct nilfs_segment_usage *su; > + void *kaddr; > + int ret; > + > + down_write(&NILFS_MDT(sufile)->mi_sem); Please use down_read()/up_read() since the operation of this function is not mutative. See nilfs_sufile_get_stat() for an example. > + > + ret = nilfs_sufile_get_segment_usage_block(sufile, segnump, 1, "segnump" should be "segnum". If you use a pointer, this must be "*segnump". Either way this is wrong. > + &su_bh); > + if (ret < 0) > + goto out_sem; > + > + kaddr = kmap_atomic(su_bh->b_page); > + su = nilfs_sufile_block_get_segment_usage( > + sufile, segnump, su_bh, kaddr); Ditto. "segnump" here should be "segnum". > + > + if (nilfs_segment_usage_clean(su)) > + ret = -EIO; > + > + kunmap_atomic(kaddr); > + > + brelse(su_bh); > + > +out_sem: > + up_write(&NILFS_MDT(sufile)->mi_sem); > + return ret; > +} > + > void nilfs_sufile_do_cancel_free(struct inode *sufile, __u64 segnum, > struct buffer_head *header_bh, > struct buffer_head *su_bh) > diff --git a/fs/nilfs2/sufile.h b/fs/nilfs2/sufile.h > index 8e8a1a5a0402..02b61ca6f318 100644 > --- a/fs/nilfs2/sufile.h > +++ b/fs/nilfs2/sufile.h > @@ -24,6 +24,7 @@ unsigned long nilfs_sufile_get_ncleansegs(struct inode *sufile); > > int nilfs_sufile_set_alloc_range(struct inode *sufile, __u64 start, __u64 end); > int nilfs_sufile_alloc(struct inode *, __u64 *); > +int nilfs_sufile_test_allocated(struct inode *, __u64 *); This should be: int nilfs_sufile_test_allocated(struct inode *sufile, __u64 segnum); Please include the variable name in the argument for newly added or modified declarations. Regards, Ryusuke Konishi > int nilfs_sufile_mark_dirty(struct inode *sufile, __u64 segnum); > int nilfs_sufile_set_segment_usage(struct inode *sufile, __u64 segnum, > unsigned long nblocks, time64_t modtime); >