Re: [PATCH] nilfs2: fix NULL pointer dereference in nilfs_segctor_prepare_write()

Ryusuke Konishi <konishi.ryusuke@xxxxxxxxx> · Wed, 9 Nov 2022 23:00:01 +0900

On Wed, Nov 9, 2022 at 12:26 PM Liu Shixin wrote:
> On 2022/11/9 1:50, Ryusuke Konishi wrote:
> > Hi Liu Shixin,
> >
> > I'm sorry for my repeated emails.
> >
> > On Wed, Nov 9, 2022 at 12:13 AM Ryusuke Konishi wrote:
> >>> I think the most likely cause is metadata corruption.  If so, we
> >>> should fix it by adding a proper sanity check, yes.
> >>> However, there is still the possibility that the error retry logic
> >>> after detecting errors has a flaw. (I believe at least this is not a
> >>> problem with normal paths.)
> >>> The sufile state inconsistency above is hypothetical for now.
> >>> Either way, I'd like to make sure what's actually happening (and where
> >>> the anomaly is coming from) so we can decide how to fix it.
> >> I noticed that this syzbot report has a disk image "mount_0.gz", so I
> >> tried to mount it read-only.
> >> The result was as follows:
> >>
> >> $ sudo mount -t nilfs2 -r ./mount_0 /mnt/test
> >> $ lssu
> >>               SEGNUM        DATE     TIME STAT     NBLOCKS
> >>                    0  26760730-10-29 19:40:00  -de   527958016
> >>                    1  26760730-11-01 20:29:04  -de   527958016
> >>                    2  1176433641-11-01 02:01:52  -de  2983038235
> >>                    3  -1158249729-11-01 04:57:19  a--       25375
> >>                    4  1970-01-02 21:24:32  a--           0
> >>                    5  -1415215929-01-02 00:19:15  --e  1631451365
> >>                    6  841067190-01-02 13:02:59  -d-  3101461260
> >>                    7  1495233207-01-02 01:31:11  -de  1697748441
> >>                    8  1875753393-01-02 21:54:14  -de   337757600
> >>                    9  648878284-01-02 21:06:08  ---  2133388152
> >>                   10  2122778986-01-02 17:49:43  --e   874605800
> >>                   11  55846197-01-02 09:50:53  -de  4262365368
> >>                   12  1872396026-01-02 06:45:18  ---  1051768258
> >>                   13  820920473-01-02 19:28:35  --e  2932336675
> >>                   14  2128306755-01-02 10:17:45  ---  3568501216
> >>                   15  1457063063-01-02 01:24:17  --e  2330511560
> >>                   16  586864775-01-02 16:08:15  ---   355283425
> >>                   17  -824870041-01-02 22:47:26  -d-  4177999057
> >>                   18  1562176264-01-02 08:06:45  ---  1312597355
> >>                   19  -392925420-01-02 17:08:27  -d-  3811075948
> >>                   20  1927575458-01-02 21:26:51  -de  1384562525
> >>                   21  2139923505-01-02 08:16:04  -d-    41861305
> >>
> >> Here,  we can see that neither segment #3 (= ns_segnum) nor #4 (=
> >> ns_nextnum) has the dirty flag
> >> ("d" in STAT).   So, as expected, this seems to be the root cause of
> >> the duplicate assignments and oops.
> >> The proper correction would be, therefore, to check and reject (or
> >> fix) this anomaly while outputting an error message
> >> (or warning if fix it at the same time).
> >> It should be inserted in mount time logic because the segments that
> >> nilfs2 itself allocates are always marked dirty with
> >> nilfs_sufile_alloc().
> > I will change my opinion.
> >
> > Considering the possibility of sufile corruption at runtime, it seems
> > that the sanity check should be done on every nilfs_sufile_alloc()
> > call.
> >
> > I now think nilfs_sufile_alloc() should call nilfs_error() and return
> > -EIO if the number of a newly found vacant segment matches
> > nilfs->ns_segnum or nilfs->ns_nextnum.

> Before we add the first segbuf into sci->sc_segbufs in nilfs_segctor_begin_construction(),
> there is no possibility existing duplicate segnum. And the subsequent process should
> not be affected by sufile corruption. So I think it's enough to only check for case alloc==0.
> I don't find any other possible wrong scenarios.

I think the problem is not yet fixed.
With your patch, I still think there are scenarios that cause
duplicate segment assignments:
(1)  If the segment at nilfs->ns_segnum is in a clean state due to
sufile data corruption, and
if sci->sc_write_logs is empty and nilfs->ns_segnum == nilfs->nextnum
to start writing a segment from the beginning, then
nilfs_sufile_alloc() is called in nilfs_segctor_begin_construction()
and nextnum can be
equal to nilfs->ns_segnum.
This nextnum is stored in segbuf->sb_nextnum, and if segbuf is spliced
with nilfs_segctor_extend_segments(), the segbuf->sb_nextnum will be
used for sb_segnum of the added segbuf, and the list corruption of
sb_segsum_buffers can happen between these two segbufs.

(2)  If the segment at nilfs->ns_segnum is in a clean state due to
sufile data corruption, and
if sci->sc_write_logs is not empty and segbuf->sb_rest_blocks <
NILFS_PSEG_MIN_BLOCKS,
then nilfs_sufile_alloc() is called in
nilfs_segctor_begin_construction(), and the retrieved nextnum is set
to
sb_nextnum of the appended segbuf there, and this sb_nextnum can be
equal to nilfs->ns_segnum.
Then, if the segbufs are spliced with nilfs_segctor_extend_segments(),
the last segbuf->sb_nextnum will be used for sb_segnum of the appended
segbuf.  Therefore, list corruption of sb_segsum_buffers can happen.

(3) If the segment at nilfs->ns_segnum is in a clean state due to
sufile data corruption, and then the segbuf is spliced with
nilfs_segctor_extend_segments(), nilfs_sufile_alloc() is called, and
the retrieved "nextnextnum" is set to sb_nextnum of the appended
segbuf, but this can be equal to nilfs->ns_segnum.
If one more segbuf is appended to the segbufs in
nilfs_segctor_extend_segments(), list corruption of
sb_segsum_buffers can happen due to the segnum duplication.

And please call nilfs_error() after detecting the metadata anomaly to
output this critical situation and  degrade the filesystem
appropriately (read-only for example).  This metadata corruption is a
critical situation and writes will continually fail.  File system
operations should not continue as if nothing happened.

I would like to ask you to reconsider how to fix it, but will also
comment on the current patch below.

>
> diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
> index 7be632c15f91..7b91c790b798 100644
> --- a/fs/nilfs2/segment.c
> +++ b/fs/nilfs2/segment.c
> @@ -1326,7 +1326,13 @@ static int nilfs_segctor_begin_construction(struct nilfs_sc_info *sci,
>                 err = nilfs_sufile_alloc(nilfs->ns_sufile, &nextnum);
>                 if (err)
>                         goto failed;
> +       } else {
> +               /* Check the next segment has alreadly been allocated */

Here is a typo:  "alreadly" -> "already"

> +               err = nilfs_sufile_test_allocated(nilfs->ns_sufile, nextnum);
> +               if (err)
> +                       goto failed;
>         }
> +
>         nilfs_segbuf_set_next_segnum(segbuf, nextnum, nilfs);
>
>         BUG_ON(!list_empty(&sci->sc_segbufs));
> diff --git a/fs/nilfs2/sufile.c b/fs/nilfs2/sufile.c
> index 853a8212114f..8dff12c56891 100644
> --- a/fs/nilfs2/sufile.c
> +++ b/fs/nilfs2/sufile.c
> @@ -399,6 +399,36 @@ int nilfs_sufile_alloc(struct inode *sufile, __u64 *segnump)
>         return ret;
>  }
>
> +int nilfs_sufile_test_allocated(struct inode *sufile, __u64 *segnump)

The second argument should be "__u64 segnum".  The type is different
from the above caller.

> +{
> +       struct buffer_head *su_bh;
> +       struct nilfs_segment_usage *su;
> +       void *kaddr;
> +       int ret;
> +
> +       down_write(&NILFS_MDT(sufile)->mi_sem);

Please use down_read()/up_read() since the operation of this function
is not mutative.
See nilfs_sufile_get_stat() for an example.

> +
> +       ret = nilfs_sufile_get_segment_usage_block(sufile, segnump, 1,

"segnump" should be "segnum".  If you use a pointer, this must be "*segnump".
Either way this is wrong.

> +                                                  &su_bh);
> +       if (ret < 0)
> +               goto out_sem;
> +
> +       kaddr = kmap_atomic(su_bh->b_page);
> +       su = nilfs_sufile_block_get_segment_usage(
> +               sufile, segnump, su_bh, kaddr);

Ditto.   "segnump" here should be "segnum".

> +
> +       if (nilfs_segment_usage_clean(su))
> +               ret = -EIO;
> +
> +       kunmap_atomic(kaddr);
> +
> +       brelse(su_bh);
> +
> +out_sem:
> +       up_write(&NILFS_MDT(sufile)->mi_sem);
> +       return ret;
> +}
> +
>  void nilfs_sufile_do_cancel_free(struct inode *sufile, __u64 segnum,
>                                  struct buffer_head *header_bh,
>                                  struct buffer_head *su_bh)
> diff --git a/fs/nilfs2/sufile.h b/fs/nilfs2/sufile.h
> index 8e8a1a5a0402..02b61ca6f318 100644
> --- a/fs/nilfs2/sufile.h
> +++ b/fs/nilfs2/sufile.h
> @@ -24,6 +24,7 @@ unsigned long nilfs_sufile_get_ncleansegs(struct inode *sufile);
>
>  int nilfs_sufile_set_alloc_range(struct inode *sufile, __u64 start, __u64 end);
>  int nilfs_sufile_alloc(struct inode *, __u64 *);

> +int nilfs_sufile_test_allocated(struct inode *, __u64 *);

This should be:

int nilfs_sufile_test_allocated(struct inode *sufile, __u64 segnum);

Please include the variable name in the argument for newly added or
modified declarations.

Regards,
Ryusuke Konishi

>  int nilfs_sufile_mark_dirty(struct inode *sufile, __u64 segnum);
>  int nilfs_sufile_set_segment_usage(struct inode *sufile, __u64 segnum,
>                                    unsigned long nblocks, time64_t modtime);
>