Re: [PATCH -next] ovl: respect underlying filesystem's get_unmapped_area()

Amir Goldstein <amir73il@xxxxxxxxx> · Fri, 6 Dec 2024 13:58:12 +0100

On Fri, Dec 6, 2024 at 11:45 AM Kefeng Wang <wangkefeng.wang@xxxxxxxxxx> wrote:
>
>
>
> On 2024/12/6 17:25, Lorenzo Stoakes wrote:
> > To be clear - I'm not accepting the export of __get_unmapped_area() so if
> > you depend on this for this approach, you can't take this approach.
> >
> > It's an internal implementation detail. That you choose to make your
> > filesystem possibly a module doesn't mean that mm is required to export
> > internal impl details to you. Sorry.
> >
> > To rescind this would require a very strong argument, you have not provided
> > it.
> >
> > On Fri, Dec 06, 2024 at 11:35:08AM +0800, Jinjiang Tu wrote:
> >>
> >> 在 2024/12/5 23:04, Lorenzo Stoakes 写道:
> >>> + Matthew for large folio aspect
> >>>
> >>> On Thu, Dec 05, 2024 at 10:30:38PM +0800, Jinjiang Tu wrote:
> >>>> During our tests in containers, there is a read-only file (i.e., shared
> >>>> libraies) in the overlayfs filesystem, and the underlying filesystem is
> >>>> ext4, which supports large folio. We mmap the file with PROT_READ prot,
> >>>> and then call madvise(MADV_COLLAPSE) for it. However, the madvise call
> >>>> fails and returns EINVAL.
> >>>>
> >>>> The reason is that the mapping address isn't aligned to PMD size. Since
> >>>> overlayfs doesn't support large folio, __get_unmapped_area() doesn't call
> >>>> thp_get_unmapped_area() to get a THP aligned address.
> >>>>
> >>>> To fix it, call get_unmapped_area() with the realfile.
> >>> Isn't the correct solution to get overlayfs to support large folios?
> >>>
> >>>> Besides, since overlayfs may be built with CONFIG_OVERLAY_FS=m, we should
> >>>> export get_unmapped_area().
> >>> Yeah, not in favour of this at all. This is an internal implementation
> >>> detail. It seems like you're trying to hack your way into avoiding
> >>> providing support for large folios and to hand it off to the underlying
> >>> file system.
> >>>
> >>> Again, why don't you just support large folios in overlayfs?
> >>>
> >>> Literally no other file system or driver appears to make use of this
> >>> directly in this manner.
> >>>
> >>> And there's absolutely no way this should be exported non-GPL as if it were
> >>> unavoidable core functionality that everyone needs. Only you seem to...
> >>>
> >>>> Signed-off-by: Jinjiang Tu <tujinjiang@xxxxxxxxxx>
> >>>> ---
> >>>>    fs/overlayfs/file.c | 20 ++++++++++++++++++++
> >>>>    mm/mmap.c           |  1 +
> >>>>    2 files changed, 21 insertions(+)
> >>>>
> >>>> diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> >>>> index 969b458100fe..d0dcf675ebe8 100644
> >>>> --- a/fs/overlayfs/file.c
> >>>> +++ b/fs/overlayfs/file.c
> >>>> @@ -653,6 +653,25 @@ static int ovl_flush(struct file *file, fl_owner_t id)
> >>>>            return err;
> >>>>    }
> >>>>
> >>>> +static unsigned long ovl_get_unmapped_area(struct file *file,
> >>>> +          unsigned long addr, unsigned long len, unsigned long pgoff,
> >>>> +          unsigned long flags)
> >>>> +{
> >>>> +  struct file *realfile;
> >>>> +  const struct cred *old_cred;
> >>>> +  unsigned long ret;
> >>>> +
> >>>> +  realfile = ovl_real_file(file);
> >>>> +  if (IS_ERR(realfile))
> >>>> +          return PTR_ERR(realfile);
> >>>> +
> >>>> +  old_cred = ovl_override_creds(file_inode(file)->i_sb);
> >>>> +  ret = get_unmapped_area(realfile, addr, len, pgoff, flags);
> >>>> +  ovl_revert_creds(old_cred);
> >>> Why are you overriding credentials, then reinstating them here? That
> >>> seems... iffy? I knew nothing about overlayfs so this may just be a
> >>> misunderstanding...
> >>
> >> I refer to other file operations in overlayfs (i.e., ovl_fallocate, backing_file_mmap).
> >> Since get_unmapped_area() has security related operations (e.g., security_mmap_addr()),
> >> We should call it with the cred of the underlying file.
> >>
> >>>
> >>>> +
> >>>> +  return ret;
> >>>> +}
> >>>> +
> >>>>    const struct file_operations ovl_file_operations = {
> >>>>            .open           = ovl_open,
> >>>>            .release        = ovl_release,
> >>>> @@ -661,6 +680,7 @@ const struct file_operations ovl_file_operations = {
> >>>>            .write_iter     = ovl_write_iter,
> >>>>            .fsync          = ovl_fsync,
> >>>>            .mmap           = ovl_mmap,
> >>>> +  .get_unmapped_area = ovl_get_unmapped_area,
> >>>>            .fallocate      = ovl_fallocate,
> >>>>            .fadvise        = ovl_fadvise,
> >>>>            .flush          = ovl_flush,
> >>>> diff --git a/mm/mmap.c b/mm/mmap.c
> >>>> index 16f8e8be01f8..60eb1ff7c9a8 100644
> >>>> --- a/mm/mmap.c
> >>>> +++ b/mm/mmap.c
> >>>> @@ -913,6 +913,7 @@ __get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
> >>>>            error = security_mmap_addr(addr);
> >>>>            return error ? error : addr;
> >>>>    }
> >>>> +EXPORT_SYMBOL(__get_unmapped_area);
> >>> We'll need a VERY good reason to export this internal implementation
> >>> detail, and if that were provided we'd need a VERY good reason for it not
> >>> to be GPL.
> >>>
> >>> This just seems to be a cheap way of invoking (),
> >>> maybe, if it is being used by the underlying file system.
> >>
> >> But the underlying file system may not support large folio. In this case,
> >> the mmap address doesn't need to be aligned with THP size.
> >
> > But it'd not cause any problems to just do that anyway right? I don't think
> > many people think 'oh no I have a PMD aligned mapping now what will I do'?
> >
> > Again - the right solution here is to handle large folios in overlayfs as
> > far as I can tell.
>
> I think this is not to handle large folio for overlayfs, it is about vma
> alignment or vma allocation for memory mapped files,
>
>
> 1) many fs support THP mapping, using thp_get_unmapped_area(),
>
> fs/bcachefs/fs.c:       .get_unmapped_area = thp_get_unmapped_area,
> fs/btrfs/file.c:        .get_unmapped_area = thp_get_unmapped_area,
> fs/erofs/data.c:        .get_unmapped_area = thp_get_unmapped_area,
> fs/ext2/file.c: .get_unmapped_area = thp_get_unmapped_area,
> fs/ext4/file.c: .get_unmapped_area = thp_get_unmapped_area,
> fs/fuse/file.c: .get_unmapped_area = thp_get_unmapped_area,
> fs/xfs/xfs_file.c:      .get_unmapped_area = thp_get_unmapped_area,
>
> 2) and some fs has own get_unmapped_area callback too,
>
> fs/cramfs/inode.c:      .get_unmapped_area      = cramfs_physmem_get_unmapped_area,
> fs/hugetlbfs/inode.c:   .get_unmapped_area      = hugetlb_get_unmapped_area,
> fs/ramfs/file-mmu.c:    .get_unmapped_area      = ramfs_mmu_get_unmapped_area,
> fs/ramfs/file-nommu.c:  .get_unmapped_area      = ramfs_nommu_get_unmapped_area,
> fs/romfs/mmap-nommu.c:  .get_unmapped_area      = romfs_get_unmapped_area,
> mm/shmem.c:     .get_unmapped_area = shmem_get_unmapped_area,
>
> They has own rules to get a vma area, but with overlayfs(tries to
> present a filesystem which is the result over overlaying one filesystem
> on top of the other), we now only use the default
> mm_get_unmapped_area_vmflags() to get a vma area, since the overlayfs
> has no '.get_unmapped_area' callback.
>
> do_mmap
>    __get_unmapped_area
>      // get_area = NULL
>      mm_get_unmapped_area_vmflags
>    mmap_region
>      mmap_file
>        ovl_mmap
>
> It looks wrong, so we need to get the readfile via ovl_real_file()
> and use realfile' get_unmapped_area callback, and if the realfile
> is not with the callback, fallback to the default
> mm_get_unmapped_area(),
>
> >
> > In any case as per the above, we're just not exporting
> > __get_unmapped_area(), sorry.
> >
>
> So maybe use mm_get_unmapped_area() instead of __get_unmapped_area(),
> something like below,
>
> +static unsigned long ovl_get_unmapped_area(struct file *file,
> +               unsigned long addr, unsigned long len, unsigned long pgoff,
> +               unsigned long flags)
> +{
> +       struct file *realfile;
> +       const struct cred *old_cred;
> +
> +       realfile = ovl_real_file(file);
> +       if (IS_ERR(realfile))
> +               return PTR_ERR(realfile);
> +
> +       if (realfile->f_op->get_unmapped_area) {
> +               unsigned long ret;
> +
> +               old_cred = ovl_override_creds(file_inode(file)->i_sb);
> +               ret = realfile->f_op->get_unmapped_area(realfile, addr, len,
> +                                                       pgoff, flags);
> +               ovl_revert_creds(old_cred);
> +
> +               if (ret)
> +                       return ret;
> +       }
> +
> +       return mm_get_unmapped_area(current->mm, file, addr, len, pgoff,
> flags);
> +}
>
> Correct me If I'm wrong.
>

You just need to be aware of the fact that between ovl_get_unmapped_area()
and ovl_mmap(), ovl_real_file(file) could change from the lower file, to the
upper file due to another operation that initiated copy-up.

Thanks,
Amir.