On Thu, Dec 5, 2024 at 4:24 PM Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx> wrote: > (fixing typo in cc list: tujinjiang@xxxxxxxxx -> tujinjiang@xxxxxxxxxx) > > + Liam > > (JinJiang - you forgot to cc the correct maintainers, please ensure you run > scripts/get_maintainers.pl on files you change) > > On Thu, Dec 05, 2024 at 04:12:12PM +0100, Amir Goldstein wrote: > > On Thu, Dec 5, 2024 at 4:04 PM Lorenzo Stoakes > > <lorenzo.stoakes@xxxxxxxxxx> wrote: > > > > > > + Matthew for large folio aspect > > > > > > On Thu, Dec 05, 2024 at 10:30:38PM +0800, Jinjiang Tu wrote: > > > > During our tests in containers, there is a read-only file (i.e., shared > > > > libraies) in the overlayfs filesystem, and the underlying filesystem is > > > > ext4, which supports large folio. We mmap the file with PROT_READ prot, > > > > and then call madvise(MADV_COLLAPSE) for it. However, the madvise call > > > > fails and returns EINVAL. > > > > > > > > The reason is that the mapping address isn't aligned to PMD size. Since > > > > overlayfs doesn't support large folio, __get_unmapped_area() doesn't call > > > > thp_get_unmapped_area() to get a THP aligned address. > > > > > > > > To fix it, call get_unmapped_area() with the realfile. > > > > > > Isn't the correct solution to get overlayfs to support large folios? > > > > > > > > > > > Besides, since overlayfs may be built with CONFIG_OVERLAY_FS=m, we should > > > > export get_unmapped_area(). > > > > > > Yeah, not in favour of this at all. This is an internal implementation > > > detail. It seems like you're trying to hack your way into avoiding > > > providing support for large folios and to hand it off to the underlying > > > file system. > > > > > > Again, why don't you just support large folios in overlayfs? > > > > > > > This whole discussion seems moot. > > overlayfs does not have address_space operations > > It does not have its own page cache. > > And here we see my total lack of knowledge of overlayfs coming into play > here :) Thanks for pointing this out. > > In that case, I object even further to the original of course... > > > > > The file in vma->vm_file is not an overlayfs file at all - it is the > > real (e.g. ext4) file > > when returning from ovl_mmap() => backing_file_mmap() > > so I have very little clue why the proposed solution even works, > > but it certainly does not look correct. > > I think then Jinjiang in this cause you ought to go back to the drawing > board and reconsider what might be the underlying issue here. To summarize: overlayfs switches out the VMA's backing file in the ->mmap handler. ->get_unmapped_area has to be called on the original file, before the VMA is set up (obviously), but the VMA's ->vm_file can only be overridden once the overlayfs ->mmap handler is called. So the ->get_unmapped_area you see early in the mmap path is provided by overlayfs, while the VMA you have in the end is actually basically just a VMA of the backing file that doesn't have much to do with the original file. So I guess some possible solutions would be that overlayfs forwards the .get_unmapped_area to the backing file manually, or that the ->vm_file swapping mechanism is changed to use some new separate file_operations handler for "I want to use another backing file" that is called before the get_unmapped_area stuff? (But to be clear, I'm not saying whether these are good ideas or not. Maybe Lorenzo has more of an opinion on that than I do.) By the way, I think FUSE is kinda similar, FUSE also has a "passthrough" mode that uses backing_file_mmap(); FUSE also doesn't have any special code in their .get_unmapped_area handler for this. But FUSE's .get_unmapped_area is set to thp_get_unmapped_area, which I guess the passthrough mode it is sorta wrong the other way around and unnecessarily over-aligns even if the backing file can't do THP?