On Tue, Dec 10, 2024 at 9:53 AM Malte Schröder <malte.schroeder@xxxxxxxx> wrote: > > On 10/12/2024 06:14, Joanne Koong wrote: > > On Mon, Dec 9, 2024 at 11:52 AM Joanne Koong <joannelkoong@xxxxxxxxx> wrote: > >> On Mon, Dec 9, 2024 at 10:47 AM Joanne Koong <joannelkoong@xxxxxxxxx> wrote: > >>> On Mon, Dec 9, 2024 at 9:07 AM Malte Schröder <malte.schroeder@xxxxxxxx> wrote: > >>>> On 09/12/2024 16:48, Josef Bacik wrote: > >>>>> On Mon, Dec 09, 2024 at 03:28:14PM +0000, Matthew Wilcox wrote: > >>>>>> On Mon, Dec 09, 2024 at 09:49:48AM -0500, Josef Bacik wrote: > >>>>>>>> Ha! This time I bisected from f03b296e8b51 to d1dfb5f52ffc. I ended up > >>>>>>>> with 3b97c3652d91 as the culprit. > >>>>>>> Willy, I've looked at this code and it does indeed look like a 1:1 conversion, > >>>>>>> EXCEPT I'm fuzzy about how how this works with large folios. Previously, if we > >>>>>>> got a hugepage in, we'd get each individual struct page back for the whole range > >>>>>>> of the hugepage, so if for example we had a 2M hugepage, we'd fill in the > >>>>>>> ->offset for each "middle" struct page as 0, since obviously we're consuming > >>>>>>> PAGE_SIZE chunks at a time. > >>>>>>> > >>>>>>> But now we're doing this > >>>>>>> > >>>>>>> for (i = 0; i < nfolios; i++) > >>>>>>> ap->folios[i + ap->num_folios] = page_folio(pages[i]); > >>>>>>> > >>>>>>> So if userspace handed us a 2M hugepage, page_folio() on each of the > >>>>>>> intermediary struct page's would return the same folio, correct? So we'd end up > >>>>>>> with the wrong offsets for our fuse request, because they should be based from > >>>>>>> the start of the folio, correct? > >>>>>> I think you're 100% right. We could put in some nice asserts to check > >>>>>> this is what's happening, but it does seem like a rather incautious > >>>>>> conversion. Yes, all folios _in the page cache_ for fuse are small, but > >>>>>> that's not guaranteed to be the case for folios found in userspace for > >>>>>> directio. At least the comment is wrong, and I'd suggest the code is too. > >>>>> Ok cool, Malte can you try the attached only compile tested patch and see if the > >>>>> problem goes away? Thanks, > >>>>> > >>>>> Josef > >>>>> > >>>>> diff --git a/fs/fuse/file.c b/fs/fuse/file.c > >>>>> index 88d0946b5bc9..c4b93ead99a5 100644 > >>>>> --- a/fs/fuse/file.c > >>>>> +++ b/fs/fuse/file.c > >>>>> @@ -1562,9 +1562,19 @@ static int fuse_get_user_pages(struct fuse_args_pages *ap, struct iov_iter *ii, > >>>>> nfolios = DIV_ROUND_UP(ret, PAGE_SIZE); > >>>>> > >>>>> ap->descs[ap->num_folios].offset = start; > >>>>> - fuse_folio_descs_length_init(ap->descs, ap->num_folios, nfolios); > >>>>> - for (i = 0; i < nfolios; i++) > >>>>> - ap->folios[i + ap->num_folios] = page_folio(pages[i]); > >>>>> + for (i = 0; i < nfolios; i++) { > >>>>> + struct folio *folio = page_folio(pages[i]); > >>>>> + unsigned int offset = start + > >>>>> + (folio_page_idx(folio, pages[i]) << PAGE_SHIFT); > >>>>> + unsigned int len = min_t(unsigned int, ret, folio_size(folio) - offset); > >>>>> + > >>>>> + len = min_t(unsigned int, len, PAGE_SIZE); > >>>>> + > >>>>> + ap->descs[ap->num_folios + i].offset = offset; > >>>>> + ap->descs[ap->num_folios + i].length = len; > >>>>> + ap->folios[i + ap->num_folios] = folio; > >>>>> + start = 0; > >>>>> + } > >>>>> > >>>>> ap->num_folios += nfolios; > >>>>> ap->descs[ap->num_folios - 1].length -= > >>>> The problem persists with this patch. > >>>> > >> Malte, could you try Josef's patch except with that last line > >> "ap->descs[ap->num_pages - 1].length -= (PAGE_SIZE - ret) & > >> (PAGE_SIZE - 1);" also removed? I think we need that line removed as > >> well since that does a "-=" instead of a "=" and > >> ap->descs[ap->num_folios - 1].length gets set inside the for loop. > >> > >> In the meantime, I'll try to get a local repro running on fsx so that > >> you don't have to keep testing out repos for us. > > I was able to repro this locally by doing: > > > > -- start libfuse server -- > > sudo ./libfuse/build/example/passthrough_hp --direct-io ~/src ~/fuse_mount > > > > -- patch + compile this (rough / ugly-for-now) code snippet -- > > diff --git a/ltp/fsx.c b/ltp/fsx.c > > index 777ba0de..9f040bc4 100644 > > --- a/ltp/fsx.c > > +++ b/ltp/fsx.c > > @@ -1049,7 +1049,8 @@ dowrite(unsigned offset, unsigned size) > > } > > } > > > > - > > +#define TWO_MIB (1 << 21) // 2 MiB in bytes > > > > void > > domapwrite(unsigned offset, unsigned size) > > { > > @@ -1057,6 +1058,8 @@ domapwrite(unsigned offset, unsigned size) > > unsigned map_size; > > off_t cur_filesize; > > char *p; > > + int ret; > > + unsigned size_2mib_aligned; > > > > offset -= offset % writebdy; > > if (size == 0) { > > @@ -1101,6 +1104,41 @@ domapwrite(unsigned offset, unsigned size) > > pg_offset = offset & PAGE_MASK; > > map_size = pg_offset + size; > > > > + size_2mib_aligned = (size + TWO_MIB - 1) & ~(TWO_MIB - 1); > > + void *placeholder_map = mmap(NULL, size_2mib_aligned * 2, PROT_NONE, > > + MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); > > + if (!placeholder_map) { > > + prterr("domapwrite: placeholder map"); > > + exit(202); > > + } > > + > > + /* align address to nearest 2 MiB */ > > + void *aligned_address = > > + (void *)(((uintptr_t)placeholder_map + TWO_MIB - 1) & > > ~(TWO_MIB - 1)); > > + > > + void *map = mmap(aligned_address, size_2mib_aligned, PROT_READ > > | PROT_WRITE, > > + MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED | > > MAP_POPULATE, -1, 0); > > + > > + ret = madvise(map, size_2mib_aligned, MADV_COLLAPSE); > > + if (ret) { > > + prterr("domapwrite: madvise collapse"); > > + exit(203); > > + } > > + > > + memcpy(map, good_buf + offset, size); > > + > > + if (lseek(fd, offset, SEEK_SET) == -1) { > > + prterr("domapwrite: lseek"); > > + exit(204); > > + } > > + > > + ret = write(fd, map, size); > > + if (ret == -1) { > > + prterr("domapwrite: write"); > > + exit(205); > > + } > > + > > + /* > > if ((p = (char *)mmap(0, map_size, PROT_READ | PROT_WRITE, > > MAP_FILE | MAP_SHARED, fd, > > (off_t)(offset - pg_offset))) == (char *)-1) { > > @@ -1119,6 +1157,15 @@ domapwrite(unsigned offset, unsigned size) > > prterr("domapwrite: munmap"); > > report_failure(204); > > } > > + */ > > + if (munmap(map, size_2mib_aligned) != 0) { > > + prterr("domapwrite: munmap map"); > > + report_failure(206); > > + } > > + if (munmap(placeholder_map, size_2mib_aligned * 2) != 0) { > > + prterr("domapwrite: munmap placeholder_map"); > > + report_failure(207); > > + } > > } > > > > -- run fsx test -- > > sudo ./fsx -b 3 ~/fuse_mount/example.txt -N 5000 > > > > On the offending commit 3b97c3652, I'm seeing: > > [user]$ sudo ./fsx -b 3 ~/fuse_mount/example.txt -N 5000 > > Will begin at operation 3 > > Seed set to 1 > > ... > > READ BAD DATA: offset = 0x1925f, size = 0xf7a3, fname = > > /home/user/fuse_mount/example.txt > > OFFSET GOOD BAD RANGE > > 0x1e43f 0x4b4a 0x114a 0x0 > > operation# (mod 256) for the bad data may be 74 > > 0x1e441 0xa64a 0xeb4a 0x1 > > operation# (mod 256) for the bad data may be 74 > > 0x1e443 0x264a 0xe44a 0x2 > > operation# (mod 256) for the bad data may be 74 > > 0x1e445 0x254a 0x9e4a 0x3 > > ... > > Correct content saved for comparison > > (maybe hexdump "/home/user/fuse_mount/example.txt" vs > > "/home/user/fuse_mount/example.txt.fsxgood") > > > > > > I tested Josef's patch with the "ap->descs[ap->num_pages - 1].length > > -= (PAGE_SIZE - ret) & (PAGE_SIZE - 1);" line removed and it fixed the > > issue: > > > > [user]$ sudo ./fsx -b 3 ~/fuse_mount/example.txt -N 5000 > > Will begin at operation 3 > > Seed set to 1 > > ... > > copying to largest ever: 0x3e19b > > copying to largest ever: 0x3e343 > > fallocating to largest ever: 0x40000 > > All 5000 operations completed A-OK! > > > > > > Malte, would you mind double-checking whether this fixes the issue > > you're seeing on your end? > > My test still fails. Hi Malte, I simulated your repro with installing bcachefs as the root filesystem on a VM running Arch, then running "sudo pacman -S flatpak" and then installing FreeCAD with "flatpak install flathub org.freecad.FreeCAD". On base commit 3b97c3652, I see the same corruption you noted: error: Failed to install org.kde.Platform: Error pulling from repo: While pulling runtime/org.kde.Platform/x86_64/6.7 from remote flathub: fsck content object 886fd60617b81e81475db5e62beda5846d3e85fe77562eae536d2dd2a7af5b33: Corrupted file object; checksum expected='886fd60617b81e81475db5e62beda5846d3e85fe77562eae536d2dd2a7af5b33' actual='67f5a60d19f7a65e1ee272d455fed138b864be73399816ad18fa71319614a418' i tried this patch on top of commit 3b97c3652 and it fixed it for me: diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 14af8c41fc83..0d213a22972b 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -1560,18 +1560,24 @@ static int fuse_get_user_pages(struct fuse_args_pages *ap, struct iov_iter *ii, nbytes += ret; - ret += start; - /* Currently, all folios in FUSE are one page */ - nfolios = DIV_ROUND_UP(ret, PAGE_SIZE); - - ap->folio_descs[ap->num_folios].offset = start; - fuse_folio_descs_length_init(ap->folio_descs, ap->num_folios, nfolios); - for (i = 0; i < nfolios; i++) - ap->folios[i + ap->num_folios] = page_folio(pages[i]); - - ap->num_folios += nfolios; - ap->folio_descs[ap->num_folios - 1].length -= - (PAGE_SIZE - ret) & (PAGE_SIZE - 1); + nfolios = DIV_ROUND_UP(ret + start, PAGE_SIZE); + + for (i = 0; i < nfolios; i++) { + struct folio *folio = page_folio(pages[i]); + unsigned int offset = start + + (folio_page_idx(folio, pages[i]) << PAGE_SHIFT); + unsigned int len = min_t(unsigned int, ret, folio_size(folio) - offset); + + len = min_t(unsigned int, len, PAGE_SIZE - start); + + ap->descs[ap->num_folios].offset = offset; + ap->descs[ap->num_folios].length = len; + ap->folios[ap->num_folios] = folio; + start = 0; + ret -= len; + ap->num_folios++; + } + nr_pages += nfolios; } kfree(pages); I ran this multiple times and don't see the corruption issues. However, I do see another issue crop up on some of my VMs, which is flatpak hanging with this corresponding stack trace: [ 368.520976] INFO: task pool-/usr/lib/f:582 blocked for more than 122 seconds. [ 368.521509] Not tainted 6.12.0-rc1-g86b74eb5a11e #734 [ 368.521905] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 368.522483] task:pool-/usr/lib/f state:D stack:0 pid:582 tgid:405 ppid:1 flags:0x00000002 [ 368.523238] Call Trace: [ 368.523411] <TASK> [ 368.523548] __schedule+0xaf0/0x27f0 [ 368.523773] ? __pfx___schedule+0x10/0x10 [ 368.524024] schedule+0x7e/0x300 [ 368.524233] __wait_on_freeing_inode+0xda/0x120 [ 368.524527] ? __pfx___wait_on_freeing_inode+0x10/0x10 [ 368.524830] ? _raw_spin_unlock_irqrestore+0xe/0x30 [ 368.525124] ? __pfx_wake_bit_function+0x10/0x10 [ 368.525411] bch2_inode_hash_find+0x372/0x580 [ 368.525700] ? __pfx_bch2_dirent_read_target+0x10/0x10 [ 368.526023] ? bch2_dirent_hash+0x23d/0x370 [ 368.526286] ? __pfx_bch2_inode_hash_find+0x10/0x10 [ 368.526583] ? __pfx_dirent_cmp_key+0x10/0x10 [ 368.526972] bch2_lookup_trans+0x61a/0x990 [ 368.527247] ? __pfx_bch2_lookup_trans+0x10/0x10 [ 368.527559] ? __do_sys_newfstatat+0x75/0xd0 [ 368.527850] ? do_syscall_64+0x4b/0x110 [ 368.528088] ? entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 368.528411] ? __pfx_bch2_hash_info_init+0x10/0x10 [ 368.528720] ? __mod_memcg_state+0x102/0x390 [ 368.529134] ? obj_cgroup_charge+0x1b4/0x4c0 [ 368.529500] ? __memcg_slab_post_alloc_hook+0x536/0xba0 [ 368.529913] ? kvm_sched_clock_read+0x11/0x20 [ 368.530265] ? _raw_spin_lock+0x85/0xe0 [ 368.530553] ? bch2_lookup_trans+0x323/0x990 [ 368.530909] ? __asan_memset+0x24/0x50 [ 368.531204] ? __bch2_trans_get+0x735/0xdd0 [ 368.531525] ? bch2_lookup+0x18a/0x350 [ 368.531859] bch2_lookup+0x18a/0x350 [ 368.532145] ? __pfx_bch2_lookup+0x10/0x10 [ 368.532466] ? __pfx_lockref_get_not_dead+0x10/0x10 [ 368.532860] __lookup_slow+0x182/0x350 [ 368.533159] ? __pfx___lookup_slow+0x10/0x10 [ 368.533513] walk_component+0x2ab/0x4f0 [ 368.533816] path_lookupat+0x120/0x660 [ 368.534110] ? vfs_fstatat+0x53/0xb0 [ 368.534413] filename_lookup+0x1aa/0x520 [ 368.534718] ? __pfx_filename_lookup+0x10/0x10 [ 368.535052] ? __wake_up_common+0xf2/0x170 [ 368.535385] ? _raw_spin_unlock_irq+0xe/0x30 [ 368.535683] ? __pfx_eventfd_write+0x10/0x10 [ 368.536019] ? kasan_save_stack+0x24/0x50 [ 368.536313] ? __kasan_record_aux_stack+0xad/0xc0 [ 368.536651] ? kmem_cache_free+0x353/0x550 [ 368.536919] ? do_syscall_64+0x4b/0x110 [ 368.537179] ? entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 368.537528] vfs_statx+0xbf/0x140 [ 368.537745] ? kmem_cache_alloc_noprof+0x12d/0x340 [ 368.538057] ? __pfx_vfs_statx+0x10/0x10 [ 368.538322] ? getname_flags.part.0+0xaf/0x4a0 [ 368.538627] vfs_fstatat+0x6c/0xb0 [ 368.538840] __do_sys_newfstatat+0x75/0xd0 [ 368.539100] ? __pfx___do_sys_newfstatat+0x10/0x10 [ 368.539415] ? kasan_unpoison+0x27/0x60 [ 368.539675] ? __pfx_slab_free_after_rcu_debug+0x10/0x10 [ 368.540081] ? fdget_pos+0x366/0x5d0 [ 368.540355] ? fput+0x1b/0x2d0 [ 368.540602] ? ksys_write+0x18c/0x1c0 [ 368.540934] ? __pfx_ksys_write+0x10/0x10 [ 368.541252] ? fpregs_assert_state_consistent+0x20/0xa0 [ 368.541657] ? clear_bhb_loop+0x45/0xa0 [ 368.541948] do_syscall_64+0x4b/0x110 [ 368.542270] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 368.542711] RIP: 0033:0x7fef789bb94e [ 368.543001] RSP: 002b:00007fef713fe468 EFLAGS: 00000246 ORIG_RAX: 0000000000000106 [ 368.543605] RAX: ffffffffffffffda RBX: 00007fef713fe5d0 RCX: 00007fef789bb94e [ 368.544129] RDX: 00007fef713fe540 RSI: 00007fef713fe5d0 RDI: 000000000000001f [ 368.544636] RBP: 00007fef713ff710 R08: 0000000000000073 R09: 0000000000000000 [ 368.545130] R10: 0000000000000100 R11: 0000000000000246 R12: 000000000000001f [ 368.545640] R13: 00007fef713fe540 R14: 00007fef5802dc00 R15: 00007fef713ff780 [ 368.546177] </TASK> I'm seeing this happen on commit 86b74eb5 as well, which is the commit that's before any of my or Josef's folio changes, and this persists across reboots. I've seen this happen for a couple other tasks too, "task:flatpak" and "task:pool-flatpak" with similar stack traces eg [ 368.505270] task:pool-flatpak in state:D stack:0 pid:568 tgid:538 ppid:537 flags:0x00000002 [ 368.505872] Call Trace: [ 368.506042] <TASK> [ 368.506191] __schedule+0xaf0/0x27f0 [ 368.506430] ? __pfx___schedule+0x10/0x10 [ 368.506715] schedule+0x7e/0x300 [ 368.506940] __wait_on_freeing_inode+0xda/0x120 [ 368.507490] ? __pfx___wait_on_freeing_inode+0x10/0x10 [ 368.507965] ? _raw_spin_unlock_irqrestore+0xe/0x30 [ 368.508312] ? __pfx_wake_bit_function+0x10/0x10 [ 368.508650] bch2_inode_hash_find+0x372/0x580 [ 368.508969] ? __pfx_bch2_dirent_read_target+0x10/0x10 [ 368.509332] ? bch2_dirent_hash+0x23d/0x370 [ 368.509637] ? __pfx_bch2_inode_hash_find+0x10/0x10 [ 368.509991] ? __pfx_dirent_cmp_key+0x10/0x10 [ 368.510309] bch2_lookup_trans+0x61a/0x990 [ 368.510662] ? __pfx_bch2_lookup_trans+0x10/0x10 [ 368.511155] ? vfs_statx+0xbf/0x140 [ 368.511429] ? vfs_fstatat+0x6c/0xb0 [ 368.511731] ? do_syscall_64+0x4b/0x110 [ 368.512014] ? __d_alloc+0x5cc/0x8f0 [ 368.512283] ? memcg_list_lru_alloc+0x184/0x8a0 [ 368.512612] ? __pfx_bch2_hash_info_init+0x10/0x10 [ 368.512943] ? entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 368.513327] ? __memcg_slab_post_alloc_hook+0x536/0xba0 [ 368.513753] ? kvm_sched_clock_read+0x11/0x20 [ 368.514133] ? _raw_spin_lock+0x85/0xe0 [ 368.515095] ? bch2_lookup_trans+0x323/0x990 [ 368.516994] ? __asan_memset+0x24/0x50 [ 368.517294] ? __bch2_trans_get+0x735/0xdd0 [ 368.517675] ? bch2_lookup+0x18a/0x350 [ 368.517987] bch2_lookup+0x18a/0x350 [ 368.518271] ? __pfx_bch2_lookup+0x10/0x10 [ 368.518568] ? __pfx_lockref_get_not_dead+0x10/0x10 [ 368.518933] __lookup_slow+0x182/0x350 [ 368.519264] ? __pfx___lookup_slow+0x10/0x10 [ 368.519633] walk_component+0x2ab/0x4f0 [ 368.519934] ? fput+0x1b/0x2d0 [ 368.520206] link_path_walk.part.0.constprop.0+0x4ad/0xac0 [ 368.520657] path_lookupat+0x72/0x660 [ 368.521022] ? vfs_fstatat+0x53/0xb0 [ 368.521343] filename_lookup+0x1aa/0x520 [ 368.521664] ? __pfx_filename_lookup+0x10/0x10 [ 368.522021] ? __pfx_xa_load+0x10/0x10 [ 368.522319] ? ___slab_alloc+0x128/0x9d0 [ 368.522683] ? mntput_no_expire+0xcf/0x760 [ 368.523018] ? lockref_put_return+0xc7/0x140 [ 368.523357] ? kmem_cache_free+0x19e/0x550 [ 368.523686] ? __pfx_mutex_unlock+0x10/0x10 [ 368.524035] ? do_renameat2+0x1f4/0xa50 [ 368.524404] ? do_renameat2+0x1f4/0xa50 [ 368.524716] vfs_statx+0xbf/0x140 [ 368.524980] ? kmem_cache_alloc_noprof+0x12d/0x340 [ 368.525355] ? __pfx_vfs_statx+0x10/0x10 [ 368.525702] ? getname_flags.part.0+0xaf/0x4a0 [ 368.526080] vfs_fstatat+0x6c/0xb0 [ 368.526391] __do_sys_newfstatat+0x75/0xd0 [ 368.526745] ? __pfx___do_sys_newfstatat+0x10/0x10 [ 368.527206] ? do_symlinkat+0x15d/0x260 [ 368.527509] ? do_mkdirat+0x19d/0x2c0 [ 368.527807] ? kasan_save_track+0x14/0x30 [ 368.528132] ? getname_flags.part.0+0xaf/0x4a0 [ 368.528509] ? fpregs_assert_state_consistent+0x20/0xa0 [ 368.529198] ? clear_bhb_loop+0x45/0xa0 [ 368.529536] do_syscall_64+0x4b/0x110 [ 368.529856] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 368.530279] RIP: 0033:0x7f592256194e [ 368.530610] RSP: 002b:00007f5915bfe548 EFLAGS: 00000217 ORIG_RAX: 0000000000000106 [ 368.531350] RAX: ffffffffffffffda RBX: 0000000000000013 RCX: 00007f592256194e [ 368.531943] RDX: 00007f5915bfe570 RSI: 00007f5915bfe600 RDI: 0000000000000013 [ 368.532491] RBP: 00007f5915bfe740 R08: 0000000000000073 R09: 0000000000000000 [ 368.533048] R10: 0000000000000100 R11: 0000000000000217 R12: 00007f5915bfe570 [ 368.533597] R13: 00007f5915bfe600 R14: 00007f5915bfe56c R15: 00007f5915bfe780 [ 368.534236] </TASK> [ 368.534419] INFO: task pool-flatpak in:571 blocked for more than 122 seconds. This seems like something we should investigate as well. I'm happy to help repro this if needed. I'm able to hit this pretty consistently. Malte, would you mind applying the patch above and confirming if this fixes it for you on your VM? And while you've been running flatpak, have you also been seeing some flatpak tasks hanging on some of your VMs? Thanks, Joanne > > > > > > > > Thanks, > > Joanne > > > >> Thanks, > >> Joanne > >>> Catching up on this thread now. I'll investigate this today. > >>> > >>>> /Malte > >>>>