Re: silent data corruption in fuse in rc1

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, Dec 09, 2024 at 09:14:33PM -0800, Joanne Koong wrote:
> On Mon, Dec 9, 2024 at 11:52 AM Joanne Koong <joannelkoong@xxxxxxxxx> wrote:
> >
> > On Mon, Dec 9, 2024 at 10:47 AM Joanne Koong <joannelkoong@xxxxxxxxx> wrote:
> > >
> > > On Mon, Dec 9, 2024 at 9:07 AM Malte Schröder <malte.schroeder@xxxxxxxx> wrote:
> > > >
> > > > On 09/12/2024 16:48, Josef Bacik wrote:
> > > > > On Mon, Dec 09, 2024 at 03:28:14PM +0000, Matthew Wilcox wrote:
> > > > >> On Mon, Dec 09, 2024 at 09:49:48AM -0500, Josef Bacik wrote:
> > > > >>>> Ha! This time I bisected from f03b296e8b51 to d1dfb5f52ffc. I ended up
> > > > >>>> with 3b97c3652d91 as the culprit.
> > > > >>> Willy, I've looked at this code and it does indeed look like a 1:1 conversion,
> > > > >>> EXCEPT I'm fuzzy about how how this works with large folios.  Previously, if we
> > > > >>> got a hugepage in, we'd get each individual struct page back for the whole range
> > > > >>> of the hugepage, so if for example we had a 2M hugepage, we'd fill in the
> > > > >>> ->offset for each "middle" struct page as 0, since obviously we're consuming
> > > > >>> PAGE_SIZE chunks at a time.
> > > > >>>
> > > > >>> But now we're doing this
> > > > >>>
> > > > >>>     for (i = 0; i < nfolios; i++)
> > > > >>>             ap->folios[i + ap->num_folios] = page_folio(pages[i]);
> > > > >>>
> > > > >>> So if userspace handed us a 2M hugepage, page_folio() on each of the
> > > > >>> intermediary struct page's would return the same folio, correct?  So we'd end up
> > > > >>> with the wrong offsets for our fuse request, because they should be based from
> > > > >>> the start of the folio, correct?
> > > > >> I think you're 100% right.  We could put in some nice asserts to check
> > > > >> this is what's happening, but it does seem like a rather incautious
> > > > >> conversion.  Yes, all folios _in the page cache_ for fuse are small, but
> > > > >> that's not guaranteed to be the case for folios found in userspace for
> > > > >> directio.  At least the comment is wrong, and I'd suggest the code is too.
> > > > > Ok cool, Malte can you try the attached only compile tested patch and see if the
> > > > > problem goes away?  Thanks,
> > > > >
> > > > > Josef
> > > > >
> > > > > diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> > > > > index 88d0946b5bc9..c4b93ead99a5 100644
> > > > > --- a/fs/fuse/file.c
> > > > > +++ b/fs/fuse/file.c
> > > > > @@ -1562,9 +1562,19 @@ static int fuse_get_user_pages(struct fuse_args_pages *ap, struct iov_iter *ii,
> > > > >               nfolios = DIV_ROUND_UP(ret, PAGE_SIZE);
> > > > >
> > > > >               ap->descs[ap->num_folios].offset = start;
> > > > > -             fuse_folio_descs_length_init(ap->descs, ap->num_folios, nfolios);
> > > > > -             for (i = 0; i < nfolios; i++)
> > > > > -                     ap->folios[i + ap->num_folios] = page_folio(pages[i]);
> > > > > +             for (i = 0; i < nfolios; i++) {
> > > > > +                     struct folio *folio = page_folio(pages[i]);
> > > > > +                     unsigned int offset = start +
> > > > > +                             (folio_page_idx(folio, pages[i]) << PAGE_SHIFT);
> > > > > +                     unsigned int len = min_t(unsigned int, ret, folio_size(folio) - offset);
> > > > > +
> > > > > +                     len = min_t(unsigned int, len, PAGE_SIZE);
> > > > > +
> > > > > +                     ap->descs[ap->num_folios + i].offset = offset;
> > > > > +                     ap->descs[ap->num_folios + i].length = len;
> > > > > +                     ap->folios[i + ap->num_folios] = folio;
> > > > > +                     start = 0;
> > > > > +             }
> > > > >
> > > > >               ap->num_folios += nfolios;
> > > > >               ap->descs[ap->num_folios - 1].length -=
> > > >
> > > > The problem persists with this patch.
> > > >
> >
> > Malte, could you try Josef's patch except with that last line
> > "ap->descs[ap->num_pages - 1].length  -= (PAGE_SIZE - ret) &
> > (PAGE_SIZE - 1);" also removed? I think we need that line removed as
> > well since that does a "-=" instead of a "=" and
> > ap->descs[ap->num_folios - 1].length gets set inside the for loop.
> >
> > In the meantime, I'll try to get a local repro running on fsx so that
> > you don't have to keep testing out repos for us.
> 
> I was able to repro this locally by doing:
> 
> -- start libfuse server --
> sudo ./libfuse/build/example/passthrough_hp --direct-io ~/src ~/fuse_mount
> 
> -- patch + compile this (rough / ugly-for-now) code snippet --
> diff --git a/ltp/fsx.c b/ltp/fsx.c
> index 777ba0de..9f040bc4 100644
> --- a/ltp/fsx.c
> +++ b/ltp/fsx.c
> @@ -1049,7 +1049,8 @@ dowrite(unsigned offset, unsigned size)
>         }
>  }
> 
> -
> +#define TWO_MIB (1 << 21)  // 2 MiB in bytes
> 
>  void
>  domapwrite(unsigned offset, unsigned size)
>  {
> @@ -1057,6 +1058,8 @@ domapwrite(unsigned offset, unsigned size)
>         unsigned map_size;
>         off_t    cur_filesize;
>         char    *p;
> +       int ret;
> +       unsigned size_2mib_aligned;
> 
>         offset -= offset % writebdy;
>         if (size == 0) {
> @@ -1101,6 +1104,41 @@ domapwrite(unsigned offset, unsigned size)
>         pg_offset = offset & PAGE_MASK;
>         map_size  = pg_offset + size;
> 
> +       size_2mib_aligned = (size + TWO_MIB - 1) & ~(TWO_MIB - 1);
> +       void *placeholder_map = mmap(NULL, size_2mib_aligned * 2, PROT_NONE,
> +                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> +       if (!placeholder_map) {
> +               prterr("domapwrite: placeholder map");
> +               exit(202);
> +       }
> +
> +       /* align address to nearest 2 MiB */
> +       void *aligned_address =
> +               (void *)(((uintptr_t)placeholder_map + TWO_MIB - 1) &
> ~(TWO_MIB - 1));
> +
> +       void *map = mmap(aligned_address, size_2mib_aligned, PROT_READ
> | PROT_WRITE,
> +                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED |
> MAP_POPULATE, -1, 0);
> +
> +       ret = madvise(map, size_2mib_aligned, MADV_COLLAPSE);
> +       if (ret) {
> +               prterr("domapwrite: madvise collapse");
> +               exit(203);
> +       }
> +
> +       memcpy(map, good_buf + offset, size);
> +
> +       if (lseek(fd, offset, SEEK_SET) == -1) {
> +               prterr("domapwrite: lseek");
> +               exit(204);
> +       }
> +
> +       ret = write(fd, map, size);
> +       if (ret == -1) {
> +               prterr("domapwrite: write");
> +               exit(205);
> +       }
> +
> +       /*
>         if ((p = (char *)mmap(0, map_size, PROT_READ | PROT_WRITE,
>                               MAP_FILE | MAP_SHARED, fd,
>                               (off_t)(offset - pg_offset))) == (char *)-1) {
> @@ -1119,6 +1157,15 @@ domapwrite(unsigned offset, unsigned size)
>                 prterr("domapwrite: munmap");
>                 report_failure(204);
>         }
> +       */
> +       if (munmap(map, size_2mib_aligned) != 0) {
> +               prterr("domapwrite: munmap map");
> +               report_failure(206);
> +       }
> +       if (munmap(placeholder_map, size_2mib_aligned * 2) != 0) {
> +               prterr("domapwrite: munmap placeholder_map");
> +               report_failure(207);
> +       }
>  }
> 
> -- run fsx test --
> sudo ./fsx -b 3 ~/fuse_mount/example.txt -N 5000
> 
> On the offending commit 3b97c3652, I'm seeing:
> [user]$ sudo ./fsx -b 3 ~/fuse_mount/example.txt -N 5000
> Will begin at operation 3
> Seed set to 1
> ...
> READ BAD DATA: offset = 0x1925f, size = 0xf7a3, fname =
> /home/user/fuse_mount/example.txt
> OFFSET      GOOD    BAD     RANGE
> 0x1e43f     0x4b4a  0x114a  0x0
> operation# (mod 256) for the bad data may be 74
> 0x1e441     0xa64a  0xeb4a  0x1
> operation# (mod 256) for the bad data may be 74
> 0x1e443     0x264a  0xe44a  0x2
> operation# (mod 256) for the bad data may be 74
> 0x1e445     0x254a  0x9e4a  0x3
> ...
> Correct content saved for comparison
> (maybe hexdump "/home/user/fuse_mount/example.txt" vs
> "/home/user/fuse_mount/example.txt.fsxgood")
> 
> 
> I tested Josef's patch with the "ap->descs[ap->num_pages - 1].length
> -= (PAGE_SIZE - ret) & (PAGE_SIZE - 1);" line removed and it fixed the
> issue:
> 
> [user]$ sudo ./fsx -b 3 ~/fuse_mount/example.txt -N 5000
> Will begin at operation 3
> Seed set to 1
> ...
> copying to largest ever: 0x3e19b
> copying to largest ever: 0x3e343
> fallocating to largest ever: 0x40000
> All 5000 operations completed A-OK!
> 
> 
> Malte, would you mind double-checking whether this fixes the issue
> you're seeing on your end?

I get the impression you might be flailing a bit, it seems you're not
exactly sure what's going on, and either Willy or Josef previously
alluded to a lack of assertions - so I'm going to echo that.

I've noticed a lot of (more junior?) kernel engineers are hesitant to
use assertions (because of e.g. checkpatch and "don't crash the
kernel!"), but assertions are one of the best tools we've got until we
get to languages where we can do embedded correctness proofs.

_Especially_ when you're doing tricky data structure work, like all the
folios stuff. As a lot of us have learned from painful experience, bugs
in low level data structures can easily translate into silent data
corruption bugs, so this is a situation where I'd even likely ignore the
"don't crash the kernel" guidance and prefer BUG_ON() to WARN_ON() -
crashing is better than silent data corruption, and it makes any bugs
noisier so they shake out quicker.

Haven't looked at the relevant patches yet, but if you'd like me to look
and offer suggestions I'd be happy to.




[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [NTFS 3]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [NTFS 3]     [Samba]     [Device Mapper]     [CEPH Development]

  Powered by Linux