Re: [PATCH] fs: fix guard_bio_eod to check for real EOD errors

Carlos Maiolino <cmaiolino@xxxxxxxxxx> · Tue, 26 Feb 2019 10:37:37 +0100

On Tue, Feb 26, 2019 at 10:03:04AM +0800, Ming Lei wrote:
> On Mon, Feb 25, 2019 at 9:26 PM Carlos Maiolino <cmaiolino@xxxxxxxxxx> wrote:
> >
> > Hi,
> >
> > On Sat, Feb 23, 2019 at 07:33:21AM +0800, Ming Lei wrote:
> > > Hi Carlos,
> > >
> > > Cc block list given it is related with interface between fs and block layer.
> > >
> > > On Fri, Feb 22, 2019 at 10:14 PM Carlos Maiolino <cmaiolino@xxxxxxxxxx> wrote:
> > > >
> > > > guard_bio_eod() can truncate a segment in bio to allow it to do IO on
> > > > odd last sectors of a device.
> > > >
> > > > It already checks if the IO starts past EOD, but it does not consider
> > > > the possibility of an IO request starting within device boundaries can
> > > > contain more than one segment past EOD.
> > > >
> > > > In such cases, truncated_bytes can be bigger than PAGE_SIZE, and will
> > > > underflow bvec->bv_len.
> > >
> > > It can cause memory corruption even for < PAGE_SIZE, also it can be correct
> > > to see > PAGE_SIZE truncated_bytes:
> > >
> > > - xfs is going to support big block size which may be 64k
> > > - multi-page bvec patches have been in block tree, and it is planned to land
> > > v5.1, so > PAGE_SIZE truncating may come because .bv_len can be much
> > > bigger than PAGE_SIZE
> > >
> > > - suppose fs block size is 4k, bio sector is 1022 and size is 4k, and
> > > disk size is
> > > 1024, last bvec is (0, 1024), then 'truncated_bytes' will be 3k, so
> > > the check can't
> > > capture this case, and memory corruption still may be caused.
> > >
> > > >
> > > > Fix this by checking if truncated_bytes is lower than PAGE_SIZE.
> > > >
> > > > This situation has been found on filesystems such as isofs and vfat,
> > > > which doesn't check the device size before mount, if the device is
> > > > smaller than the filesystem itself, a readahead on such filesystem,
> > > > which spans EOD, can trigger this situation, leading a call to
> > > > zero_user() with a wrong size possibly corrupting memory.
> > > >
> > > > I didn't see any crash, or didn't let the system run long enough to
> > > > check if memory corruption will be hit somewhere, but adding
> > > > instrumentation to guard_bio_end() to check truncated_bytes size, was
> > > > enough to see the error.
> > > >
> > > > The following script can trigger the error.
> > > >
> > > > MNT=/mnt
> > > > IMG=./DISK.img
> > > > DEV=/dev/loop0
> > > >
> > > > mkfs.vfat $IMG
> > > > mount $IMG $MNT
> > > > cp -R /etc $MNT &> /dev/null
> > > > umount $MNT
> > > >
> > > > losetup -D
> > > >
> > > > losetup --find --show --sizelimit 16247280 $IMG
> > > > mount $DEV $MNT
> > > >
> > > > find $MNT -type f -exec cat {} + >/dev/null
> > >
> > > BTW, which kernel is  for this issue? Did you investigate why the bad bio
> > > is made?
> >
> > The problem was reproduced, and the patch also wrote based on linux-next tree,
> > tag next-20190207.
> >
> > The bad bio is made while attempting to read the end of a filesystem, which goes
> > beyond EOD, as described in the patch description itself, it may happen for
> > filesystems which doesn't check the device size before mounting.
> 
> OK.
> 
> >
> > Despite the lack of validation of the block size from the filesystem part, the
> > block layer should avoid and deny such invalid IO requests. The fact is, the
> > block layer already validates such IO request via
> > generic_make_request_checks()->bio_check_eod(), but, before the IO layer could
> > reach there, guard_bio_eod() has already corrupted the bio.
> >
> > >
> > > >
> > > > Kudos to Eric Sandeen for coming up with the reproducer above
> > > >
> > > > Signed-off-by: Carlos Maiolino <cmaiolino@xxxxxxxxxx>
> > > > ---
> > > >
> > > > P.S. I'm not 100% proficient in bio internals, so I'm not sure if this is the
> > > > right fix, so comments are much appreciated.
> > > > Thanks
> > > >
> > > >  fs/buffer.c | 7 +++++++
> > > >  1 file changed, 7 insertions(+)
> > > >
> > > > diff --git a/fs/buffer.c b/fs/buffer.c
> > > > index 784de3dbcf28..32dc5cd2f6ba 100644
> > > > --- a/fs/buffer.c
> > > > +++ b/fs/buffer.c
> > > > @@ -3063,6 +3063,13 @@ void guard_bio_eod(int op, struct bio *bio)
> > > >         /* Uhhuh. We've got a bio that straddles the device size! */
> > > >         truncated_bytes = bio->bi_iter.bi_size - (maxsector << 9);
> > > >
> > > > +       /*
> > > > +        * The bio contains more than one segment which spans EOD, just return
> > > > +        * and let IO layer turn it into an EIO
> > > > +        */
> > > > +       if (truncated_bytes > PAGE_SIZE)
> > > > +               return;
> > > > +
> > >
> > > The correct check should be:
> > >
> > > +       if (truncated_bytes > bvec->bv_len)
> > > +               return;
> >
> > Thanks, I'll test it.
> >
> > >
> > > Also it should be helpful to warn on this bad bio, but it is fine to not do it
> > > here cause block layer logs this bad bio too.
> > >
> > > BTW, this is just a workaround, not one real fix on the reported problem.
> >
> > A suggestion of what would be the real fix is much appreciated then.
> 
> Seems this issue can't be reproduced in ISO, maybe it is related with
> the specific truncated size.
> 
> In theory, it is better to return failure during mounting because this fs image
> is broken, or the fs code should check the bio during making the bio given
> fs has the disk size info. However, maybe either one is hard to implement.

I do agree the filesystem should the checking it, however, I do also believe the
bio layer should still validate it.
Notice though, this was not reproduced using any crafted filesystem image, all
it needs is a filesystem which does not validate the underlying block device
size, and the same block device was shrank. The filesystem does not necessarily
need to be corrupted.

> 
> So I think this patch is good, if the check is fixed.
> 

Ok, I'll resubmit the patch with your suggestion. Thanks

> Thanks,
> Ming Lei

-- 
Carlos