On 10/12/2024 06:14, Joanne Koong wrote: > On Mon, Dec 9, 2024 at 11:52 AM Joanne Koong <joannelkoong@xxxxxxxxx> wrote: >> On Mon, Dec 9, 2024 at 10:47 AM Joanne Koong <joannelkoong@xxxxxxxxx> wrote: >>> On Mon, Dec 9, 2024 at 9:07 AM Malte Schröder <malte.schroeder@xxxxxxxx> wrote: >>>> On 09/12/2024 16:48, Josef Bacik wrote: >>>>> On Mon, Dec 09, 2024 at 03:28:14PM +0000, Matthew Wilcox wrote: >>>>>> On Mon, Dec 09, 2024 at 09:49:48AM -0500, Josef Bacik wrote: >>>>>>>> Ha! This time I bisected from f03b296e8b51 to d1dfb5f52ffc. I ended up >>>>>>>> with 3b97c3652d91 as the culprit. >>>>>>> Willy, I've looked at this code and it does indeed look like a 1:1 conversion, >>>>>>> EXCEPT I'm fuzzy about how how this works with large folios. Previously, if we >>>>>>> got a hugepage in, we'd get each individual struct page back for the whole range >>>>>>> of the hugepage, so if for example we had a 2M hugepage, we'd fill in the >>>>>>> ->offset for each "middle" struct page as 0, since obviously we're consuming >>>>>>> PAGE_SIZE chunks at a time. >>>>>>> >>>>>>> But now we're doing this >>>>>>> >>>>>>> for (i = 0; i < nfolios; i++) >>>>>>> ap->folios[i + ap->num_folios] = page_folio(pages[i]); >>>>>>> >>>>>>> So if userspace handed us a 2M hugepage, page_folio() on each of the >>>>>>> intermediary struct page's would return the same folio, correct? So we'd end up >>>>>>> with the wrong offsets for our fuse request, because they should be based from >>>>>>> the start of the folio, correct? >>>>>> I think you're 100% right. We could put in some nice asserts to check >>>>>> this is what's happening, but it does seem like a rather incautious >>>>>> conversion. Yes, all folios _in the page cache_ for fuse are small, but >>>>>> that's not guaranteed to be the case for folios found in userspace for >>>>>> directio. At least the comment is wrong, and I'd suggest the code is too. >>>>> Ok cool, Malte can you try the attached only compile tested patch and see if the >>>>> problem goes away? Thanks, >>>>> >>>>> Josef >>>>> >>>>> diff --git a/fs/fuse/file.c b/fs/fuse/file.c >>>>> index 88d0946b5bc9..c4b93ead99a5 100644 >>>>> --- a/fs/fuse/file.c >>>>> +++ b/fs/fuse/file.c >>>>> @@ -1562,9 +1562,19 @@ static int fuse_get_user_pages(struct fuse_args_pages *ap, struct iov_iter *ii, >>>>> nfolios = DIV_ROUND_UP(ret, PAGE_SIZE); >>>>> >>>>> ap->descs[ap->num_folios].offset = start; >>>>> - fuse_folio_descs_length_init(ap->descs, ap->num_folios, nfolios); >>>>> - for (i = 0; i < nfolios; i++) >>>>> - ap->folios[i + ap->num_folios] = page_folio(pages[i]); >>>>> + for (i = 0; i < nfolios; i++) { >>>>> + struct folio *folio = page_folio(pages[i]); >>>>> + unsigned int offset = start + >>>>> + (folio_page_idx(folio, pages[i]) << PAGE_SHIFT); >>>>> + unsigned int len = min_t(unsigned int, ret, folio_size(folio) - offset); >>>>> + >>>>> + len = min_t(unsigned int, len, PAGE_SIZE); >>>>> + >>>>> + ap->descs[ap->num_folios + i].offset = offset; >>>>> + ap->descs[ap->num_folios + i].length = len; >>>>> + ap->folios[i + ap->num_folios] = folio; >>>>> + start = 0; >>>>> + } >>>>> >>>>> ap->num_folios += nfolios; >>>>> ap->descs[ap->num_folios - 1].length -= >>>> The problem persists with this patch. >>>> >> Malte, could you try Josef's patch except with that last line >> "ap->descs[ap->num_pages - 1].length -= (PAGE_SIZE - ret) & >> (PAGE_SIZE - 1);" also removed? I think we need that line removed as >> well since that does a "-=" instead of a "=" and >> ap->descs[ap->num_folios - 1].length gets set inside the for loop. >> >> In the meantime, I'll try to get a local repro running on fsx so that >> you don't have to keep testing out repos for us. > I was able to repro this locally by doing: > > -- start libfuse server -- > sudo ./libfuse/build/example/passthrough_hp --direct-io ~/src ~/fuse_mount > > -- patch + compile this (rough / ugly-for-now) code snippet -- > diff --git a/ltp/fsx.c b/ltp/fsx.c > index 777ba0de..9f040bc4 100644 > --- a/ltp/fsx.c > +++ b/ltp/fsx.c > @@ -1049,7 +1049,8 @@ dowrite(unsigned offset, unsigned size) > } > } > > - > +#define TWO_MIB (1 << 21) // 2 MiB in bytes > > void > domapwrite(unsigned offset, unsigned size) > { > @@ -1057,6 +1058,8 @@ domapwrite(unsigned offset, unsigned size) > unsigned map_size; > off_t cur_filesize; > char *p; > + int ret; > + unsigned size_2mib_aligned; > > offset -= offset % writebdy; > if (size == 0) { > @@ -1101,6 +1104,41 @@ domapwrite(unsigned offset, unsigned size) > pg_offset = offset & PAGE_MASK; > map_size = pg_offset + size; > > + size_2mib_aligned = (size + TWO_MIB - 1) & ~(TWO_MIB - 1); > + void *placeholder_map = mmap(NULL, size_2mib_aligned * 2, PROT_NONE, > + MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); > + if (!placeholder_map) { > + prterr("domapwrite: placeholder map"); > + exit(202); > + } > + > + /* align address to nearest 2 MiB */ > + void *aligned_address = > + (void *)(((uintptr_t)placeholder_map + TWO_MIB - 1) & > ~(TWO_MIB - 1)); > + > + void *map = mmap(aligned_address, size_2mib_aligned, PROT_READ > | PROT_WRITE, > + MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED | > MAP_POPULATE, -1, 0); > + > + ret = madvise(map, size_2mib_aligned, MADV_COLLAPSE); > + if (ret) { > + prterr("domapwrite: madvise collapse"); > + exit(203); > + } > + > + memcpy(map, good_buf + offset, size); > + > + if (lseek(fd, offset, SEEK_SET) == -1) { > + prterr("domapwrite: lseek"); > + exit(204); > + } > + > + ret = write(fd, map, size); > + if (ret == -1) { > + prterr("domapwrite: write"); > + exit(205); > + } > + > + /* > if ((p = (char *)mmap(0, map_size, PROT_READ | PROT_WRITE, > MAP_FILE | MAP_SHARED, fd, > (off_t)(offset - pg_offset))) == (char *)-1) { > @@ -1119,6 +1157,15 @@ domapwrite(unsigned offset, unsigned size) > prterr("domapwrite: munmap"); > report_failure(204); > } > + */ > + if (munmap(map, size_2mib_aligned) != 0) { > + prterr("domapwrite: munmap map"); > + report_failure(206); > + } > + if (munmap(placeholder_map, size_2mib_aligned * 2) != 0) { > + prterr("domapwrite: munmap placeholder_map"); > + report_failure(207); > + } > } > > -- run fsx test -- > sudo ./fsx -b 3 ~/fuse_mount/example.txt -N 5000 > > On the offending commit 3b97c3652, I'm seeing: > [user]$ sudo ./fsx -b 3 ~/fuse_mount/example.txt -N 5000 > Will begin at operation 3 > Seed set to 1 > ... > READ BAD DATA: offset = 0x1925f, size = 0xf7a3, fname = > /home/user/fuse_mount/example.txt > OFFSET GOOD BAD RANGE > 0x1e43f 0x4b4a 0x114a 0x0 > operation# (mod 256) for the bad data may be 74 > 0x1e441 0xa64a 0xeb4a 0x1 > operation# (mod 256) for the bad data may be 74 > 0x1e443 0x264a 0xe44a 0x2 > operation# (mod 256) for the bad data may be 74 > 0x1e445 0x254a 0x9e4a 0x3 > ... > Correct content saved for comparison > (maybe hexdump "/home/user/fuse_mount/example.txt" vs > "/home/user/fuse_mount/example.txt.fsxgood") > > > I tested Josef's patch with the "ap->descs[ap->num_pages - 1].length > -= (PAGE_SIZE - ret) & (PAGE_SIZE - 1);" line removed and it fixed the > issue: > > [user]$ sudo ./fsx -b 3 ~/fuse_mount/example.txt -N 5000 > Will begin at operation 3 > Seed set to 1 > ... > copying to largest ever: 0x3e19b > copying to largest ever: 0x3e343 > fallocating to largest ever: 0x40000 > All 5000 operations completed A-OK! > > > Malte, would you mind double-checking whether this fixes the issue > you're seeing on your end? My test still fails. > > > Thanks, > Joanne > >> Thanks, >> Joanne >>> Catching up on this thread now. I'll investigate this today. >>> >>>> /Malte >>>>