Re: [PATCH -next] ovl: respect underlying filesystem's get_unmapped_area()

Jann Horn <jannh@xxxxxxxxxx> · Tue, 10 Dec 2024 18:36:29 +0100

On Thu, Dec 5, 2024 at 4:24 PM Lorenzo Stoakes
<lorenzo.stoakes@xxxxxxxxxx> wrote:
> (fixing typo in cc list: tujinjiang@xxxxxxxxx -> tujinjiang@xxxxxxxxxx)
>
> + Liam
>
> (JinJiang - you forgot to cc the correct maintainers, please ensure you run
> scripts/get_maintainers.pl on files you change)
>
> On Thu, Dec 05, 2024 at 04:12:12PM +0100, Amir Goldstein wrote:
> > On Thu, Dec 5, 2024 at 4:04 PM Lorenzo Stoakes
> > <lorenzo.stoakes@xxxxxxxxxx> wrote:
> > >
> > > + Matthew for large folio aspect
> > >
> > > On Thu, Dec 05, 2024 at 10:30:38PM +0800, Jinjiang Tu wrote:
> > > > During our tests in containers, there is a read-only file (i.e., shared
> > > > libraies) in the overlayfs filesystem, and the underlying filesystem is
> > > > ext4, which supports large folio. We mmap the file with PROT_READ prot,
> > > > and then call madvise(MADV_COLLAPSE) for it. However, the madvise call
> > > > fails and returns EINVAL.
> > > >
> > > > The reason is that the mapping address isn't aligned to PMD size. Since
> > > > overlayfs doesn't support large folio, __get_unmapped_area() doesn't call
> > > > thp_get_unmapped_area() to get a THP aligned address.
> > > >
> > > > To fix it, call get_unmapped_area() with the realfile.
> > >
> > > Isn't the correct solution to get overlayfs to support large folios?
> > >
> > > >
> > > > Besides, since overlayfs may be built with CONFIG_OVERLAY_FS=m, we should
> > > > export get_unmapped_area().
> > >
> > > Yeah, not in favour of this at all. This is an internal implementation
> > > detail. It seems like you're trying to hack your way into avoiding
> > > providing support for large folios and to hand it off to the underlying
> > > file system.
> > >
> > > Again, why don't you just support large folios in overlayfs?
> > >
> >
> > This whole discussion seems moot.
> > overlayfs does not have address_space operations
> > It does not have its own page cache.
>
> And here we see my total lack of knowledge of overlayfs coming into play
> here :) Thanks for pointing this out.
>
> In that case, I object even further to the original of course...
>
> >
> > The file in  vma->vm_file is not an overlayfs file at all - it is the
> > real (e.g. ext4) file
> > when returning from ovl_mmap() => backing_file_mmap()
> > so I have very little clue why the proposed solution even works,
> > but it certainly does not look correct.
>
> I think then Jinjiang in this cause you ought to go back to the drawing
> board and reconsider what might be the underlying issue here.

To summarize: overlayfs switches out the VMA's backing file in the
->mmap handler. ->get_unmapped_area has to be called on the original
file, before the VMA is set up (obviously), but the VMA's ->vm_file
can only be overridden once the overlayfs ->mmap handler is called. So
the ->get_unmapped_area you see early in the mmap path is provided by
overlayfs, while the VMA you have in the end is actually basically
just a VMA of the backing file that doesn't have much to do with the
original file.

So I guess some possible solutions would be that overlayfs forwards
the .get_unmapped_area to the backing file manually, or that the
->vm_file swapping mechanism is changed to use some new separate
file_operations handler for "I want to use another backing file" that
is called before the get_unmapped_area stuff? (But to be clear, I'm
not saying whether these are good ideas or not. Maybe Lorenzo has more
of an opinion on that than I do.)

By the way, I think FUSE is kinda similar, FUSE also has a
"passthrough" mode that uses backing_file_mmap(); FUSE also doesn't
have any special code in their .get_unmapped_area handler for this.
But FUSE's .get_unmapped_area is set to thp_get_unmapped_area, which I
guess the passthrough mode it is sorta wrong the other way around and
unnecessarily over-aligns even if the backing file can't do THP?