Re: silent data corruption in fuse in rc1

Malte Schröder <malte.schroeder@xxxxxxxx> · Tue, 10 Dec 2024 18:53:06 +0100

On 10/12/2024 06:14, Joanne Koong wrote:
> On Mon, Dec 9, 2024 at 11:52 AM Joanne Koong <joannelkoong@xxxxxxxxx> wrote:
>> On Mon, Dec 9, 2024 at 10:47 AM Joanne Koong <joannelkoong@xxxxxxxxx> wrote:
>>> On Mon, Dec 9, 2024 at 9:07 AM Malte Schröder <malte.schroeder@xxxxxxxx> wrote:
>>>> On 09/12/2024 16:48, Josef Bacik wrote:
>>>>> On Mon, Dec 09, 2024 at 03:28:14PM +0000, Matthew Wilcox wrote:
>>>>>> On Mon, Dec 09, 2024 at 09:49:48AM -0500, Josef Bacik wrote:
>>>>>>>> Ha! This time I bisected from f03b296e8b51 to d1dfb5f52ffc. I ended up
>>>>>>>> with 3b97c3652d91 as the culprit.
>>>>>>> Willy, I've looked at this code and it does indeed look like a 1:1 conversion,
>>>>>>> EXCEPT I'm fuzzy about how how this works with large folios.  Previously, if we
>>>>>>> got a hugepage in, we'd get each individual struct page back for the whole range
>>>>>>> of the hugepage, so if for example we had a 2M hugepage, we'd fill in the
>>>>>>> ->offset for each "middle" struct page as 0, since obviously we're consuming
>>>>>>> PAGE_SIZE chunks at a time.
>>>>>>>
>>>>>>> But now we're doing this
>>>>>>>
>>>>>>>     for (i = 0; i < nfolios; i++)
>>>>>>>             ap->folios[i + ap->num_folios] = page_folio(pages[i]);
>>>>>>>
>>>>>>> So if userspace handed us a 2M hugepage, page_folio() on each of the
>>>>>>> intermediary struct page's would return the same folio, correct?  So we'd end up
>>>>>>> with the wrong offsets for our fuse request, because they should be based from
>>>>>>> the start of the folio, correct?
>>>>>> I think you're 100% right.  We could put in some nice asserts to check
>>>>>> this is what's happening, but it does seem like a rather incautious
>>>>>> conversion.  Yes, all folios _in the page cache_ for fuse are small, but
>>>>>> that's not guaranteed to be the case for folios found in userspace for
>>>>>> directio.  At least the comment is wrong, and I'd suggest the code is too.
>>>>> Ok cool, Malte can you try the attached only compile tested patch and see if the
>>>>> problem goes away?  Thanks,
>>>>>
>>>>> Josef
>>>>>
>>>>> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
>>>>> index 88d0946b5bc9..c4b93ead99a5 100644
>>>>> --- a/fs/fuse/file.c
>>>>> +++ b/fs/fuse/file.c
>>>>> @@ -1562,9 +1562,19 @@ static int fuse_get_user_pages(struct fuse_args_pages *ap, struct iov_iter *ii,
>>>>>               nfolios = DIV_ROUND_UP(ret, PAGE_SIZE);
>>>>>
>>>>>               ap->descs[ap->num_folios].offset = start;
>>>>> -             fuse_folio_descs_length_init(ap->descs, ap->num_folios, nfolios);
>>>>> -             for (i = 0; i < nfolios; i++)
>>>>> -                     ap->folios[i + ap->num_folios] = page_folio(pages[i]);
>>>>> +             for (i = 0; i < nfolios; i++) {
>>>>> +                     struct folio *folio = page_folio(pages[i]);
>>>>> +                     unsigned int offset = start +
>>>>> +                             (folio_page_idx(folio, pages[i]) << PAGE_SHIFT);
>>>>> +                     unsigned int len = min_t(unsigned int, ret, folio_size(folio) - offset);
>>>>> +
>>>>> +                     len = min_t(unsigned int, len, PAGE_SIZE);
>>>>> +
>>>>> +                     ap->descs[ap->num_folios + i].offset = offset;
>>>>> +                     ap->descs[ap->num_folios + i].length = len;
>>>>> +                     ap->folios[i + ap->num_folios] = folio;
>>>>> +                     start = 0;
>>>>> +             }
>>>>>
>>>>>               ap->num_folios += nfolios;
>>>>>               ap->descs[ap->num_folios - 1].length -=
>>>> The problem persists with this patch.
>>>>
>> Malte, could you try Josef's patch except with that last line
>> "ap->descs[ap->num_pages - 1].length  -= (PAGE_SIZE - ret) &
>> (PAGE_SIZE - 1);" also removed? I think we need that line removed as
>> well since that does a "-=" instead of a "=" and
>> ap->descs[ap->num_folios - 1].length gets set inside the for loop.
>>
>> In the meantime, I'll try to get a local repro running on fsx so that
>> you don't have to keep testing out repos for us.
> I was able to repro this locally by doing:
>
> -- start libfuse server --
> sudo ./libfuse/build/example/passthrough_hp --direct-io ~/src ~/fuse_mount
>
> -- patch + compile this (rough / ugly-for-now) code snippet --
> diff --git a/ltp/fsx.c b/ltp/fsx.c
> index 777ba0de..9f040bc4 100644
> --- a/ltp/fsx.c
> +++ b/ltp/fsx.c
> @@ -1049,7 +1049,8 @@ dowrite(unsigned offset, unsigned size)
>         }
>  }
>
> -
> +#define TWO_MIB (1 << 21)  // 2 MiB in bytes
>
>  void
>  domapwrite(unsigned offset, unsigned size)
>  {
> @@ -1057,6 +1058,8 @@ domapwrite(unsigned offset, unsigned size)
>         unsigned map_size;
>         off_t    cur_filesize;
>         char    *p;
> +       int ret;
> +       unsigned size_2mib_aligned;
>
>         offset -= offset % writebdy;
>         if (size == 0) {
> @@ -1101,6 +1104,41 @@ domapwrite(unsigned offset, unsigned size)
>         pg_offset = offset & PAGE_MASK;
>         map_size  = pg_offset + size;
>
> +       size_2mib_aligned = (size + TWO_MIB - 1) & ~(TWO_MIB - 1);
> +       void *placeholder_map = mmap(NULL, size_2mib_aligned * 2, PROT_NONE,
> +                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> +       if (!placeholder_map) {
> +               prterr("domapwrite: placeholder map");
> +               exit(202);
> +       }
> +
> +       /* align address to nearest 2 MiB */
> +       void *aligned_address =
> +               (void *)(((uintptr_t)placeholder_map + TWO_MIB - 1) &
> ~(TWO_MIB - 1));
> +
> +       void *map = mmap(aligned_address, size_2mib_aligned, PROT_READ
> | PROT_WRITE,
> +                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED |
> MAP_POPULATE, -1, 0);
> +
> +       ret = madvise(map, size_2mib_aligned, MADV_COLLAPSE);
> +       if (ret) {
> +               prterr("domapwrite: madvise collapse");
> +               exit(203);
> +       }
> +
> +       memcpy(map, good_buf + offset, size);
> +
> +       if (lseek(fd, offset, SEEK_SET) == -1) {
> +               prterr("domapwrite: lseek");
> +               exit(204);
> +       }
> +
> +       ret = write(fd, map, size);
> +       if (ret == -1) {
> +               prterr("domapwrite: write");
> +               exit(205);
> +       }
> +
> +       /*
>         if ((p = (char *)mmap(0, map_size, PROT_READ | PROT_WRITE,
>                               MAP_FILE | MAP_SHARED, fd,
>                               (off_t)(offset - pg_offset))) == (char *)-1) {
> @@ -1119,6 +1157,15 @@ domapwrite(unsigned offset, unsigned size)
>                 prterr("domapwrite: munmap");
>                 report_failure(204);
>         }
> +       */
> +       if (munmap(map, size_2mib_aligned) != 0) {
> +               prterr("domapwrite: munmap map");
> +               report_failure(206);
> +       }
> +       if (munmap(placeholder_map, size_2mib_aligned * 2) != 0) {
> +               prterr("domapwrite: munmap placeholder_map");
> +               report_failure(207);
> +       }
>  }
>
> -- run fsx test --
> sudo ./fsx -b 3 ~/fuse_mount/example.txt -N 5000
>
> On the offending commit 3b97c3652, I'm seeing:
> [user]$ sudo ./fsx -b 3 ~/fuse_mount/example.txt -N 5000
> Will begin at operation 3
> Seed set to 1
> ...
> READ BAD DATA: offset = 0x1925f, size = 0xf7a3, fname =
> /home/user/fuse_mount/example.txt
> OFFSET      GOOD    BAD     RANGE
> 0x1e43f     0x4b4a  0x114a  0x0
> operation# (mod 256) for the bad data may be 74
> 0x1e441     0xa64a  0xeb4a  0x1
> operation# (mod 256) for the bad data may be 74
> 0x1e443     0x264a  0xe44a  0x2
> operation# (mod 256) for the bad data may be 74
> 0x1e445     0x254a  0x9e4a  0x3
> ...
> Correct content saved for comparison
> (maybe hexdump "/home/user/fuse_mount/example.txt" vs
> "/home/user/fuse_mount/example.txt.fsxgood")
>
>
> I tested Josef's patch with the "ap->descs[ap->num_pages - 1].length
> -= (PAGE_SIZE - ret) & (PAGE_SIZE - 1);" line removed and it fixed the
> issue:
>
> [user]$ sudo ./fsx -b 3 ~/fuse_mount/example.txt -N 5000
> Will begin at operation 3
> Seed set to 1
> ...
> copying to largest ever: 0x3e19b
> copying to largest ever: 0x3e343
> fallocating to largest ever: 0x40000
> All 5000 operations completed A-OK!
>
>
> Malte, would you mind double-checking whether this fixes the issue
> you're seeing on your end?

My test still fails.

>
>
> Thanks,
> Joanne
>
>> Thanks,
>> Joanne
>>> Catching up on this thread now. I'll investigate this today.
>>>
>>>> /Malte
>>>>