On Fri, Dec 6, 2024 at 11:45 AM Kefeng Wang <wangkefeng.wang@xxxxxxxxxx> wrote: > > > > On 2024/12/6 17:25, Lorenzo Stoakes wrote: > > To be clear - I'm not accepting the export of __get_unmapped_area() so if > > you depend on this for this approach, you can't take this approach. > > > > It's an internal implementation detail. That you choose to make your > > filesystem possibly a module doesn't mean that mm is required to export > > internal impl details to you. Sorry. > > > > To rescind this would require a very strong argument, you have not provided > > it. > > > > On Fri, Dec 06, 2024 at 11:35:08AM +0800, Jinjiang Tu wrote: > >> > >> 在 2024/12/5 23:04, Lorenzo Stoakes 写道: > >>> + Matthew for large folio aspect > >>> > >>> On Thu, Dec 05, 2024 at 10:30:38PM +0800, Jinjiang Tu wrote: > >>>> During our tests in containers, there is a read-only file (i.e., shared > >>>> libraies) in the overlayfs filesystem, and the underlying filesystem is > >>>> ext4, which supports large folio. We mmap the file with PROT_READ prot, > >>>> and then call madvise(MADV_COLLAPSE) for it. However, the madvise call > >>>> fails and returns EINVAL. > >>>> > >>>> The reason is that the mapping address isn't aligned to PMD size. Since > >>>> overlayfs doesn't support large folio, __get_unmapped_area() doesn't call > >>>> thp_get_unmapped_area() to get a THP aligned address. > >>>> > >>>> To fix it, call get_unmapped_area() with the realfile. > >>> Isn't the correct solution to get overlayfs to support large folios? > >>> > >>>> Besides, since overlayfs may be built with CONFIG_OVERLAY_FS=m, we should > >>>> export get_unmapped_area(). > >>> Yeah, not in favour of this at all. This is an internal implementation > >>> detail. It seems like you're trying to hack your way into avoiding > >>> providing support for large folios and to hand it off to the underlying > >>> file system. > >>> > >>> Again, why don't you just support large folios in overlayfs? > >>> > >>> Literally no other file system or driver appears to make use of this > >>> directly in this manner. > >>> > >>> And there's absolutely no way this should be exported non-GPL as if it were > >>> unavoidable core functionality that everyone needs. Only you seem to... > >>> > >>>> Signed-off-by: Jinjiang Tu <tujinjiang@xxxxxxxxxx> > >>>> --- > >>>> fs/overlayfs/file.c | 20 ++++++++++++++++++++ > >>>> mm/mmap.c | 1 + > >>>> 2 files changed, 21 insertions(+) > >>>> > >>>> diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c > >>>> index 969b458100fe..d0dcf675ebe8 100644 > >>>> --- a/fs/overlayfs/file.c > >>>> +++ b/fs/overlayfs/file.c > >>>> @@ -653,6 +653,25 @@ static int ovl_flush(struct file *file, fl_owner_t id) > >>>> return err; > >>>> } > >>>> > >>>> +static unsigned long ovl_get_unmapped_area(struct file *file, > >>>> + unsigned long addr, unsigned long len, unsigned long pgoff, > >>>> + unsigned long flags) > >>>> +{ > >>>> + struct file *realfile; > >>>> + const struct cred *old_cred; > >>>> + unsigned long ret; > >>>> + > >>>> + realfile = ovl_real_file(file); > >>>> + if (IS_ERR(realfile)) > >>>> + return PTR_ERR(realfile); > >>>> + > >>>> + old_cred = ovl_override_creds(file_inode(file)->i_sb); > >>>> + ret = get_unmapped_area(realfile, addr, len, pgoff, flags); > >>>> + ovl_revert_creds(old_cred); > >>> Why are you overriding credentials, then reinstating them here? That > >>> seems... iffy? I knew nothing about overlayfs so this may just be a > >>> misunderstanding... > >> > >> I refer to other file operations in overlayfs (i.e., ovl_fallocate, backing_file_mmap). > >> Since get_unmapped_area() has security related operations (e.g., security_mmap_addr()), > >> We should call it with the cred of the underlying file. > >> > >>> > >>>> + > >>>> + return ret; > >>>> +} > >>>> + > >>>> const struct file_operations ovl_file_operations = { > >>>> .open = ovl_open, > >>>> .release = ovl_release, > >>>> @@ -661,6 +680,7 @@ const struct file_operations ovl_file_operations = { > >>>> .write_iter = ovl_write_iter, > >>>> .fsync = ovl_fsync, > >>>> .mmap = ovl_mmap, > >>>> + .get_unmapped_area = ovl_get_unmapped_area, > >>>> .fallocate = ovl_fallocate, > >>>> .fadvise = ovl_fadvise, > >>>> .flush = ovl_flush, > >>>> diff --git a/mm/mmap.c b/mm/mmap.c > >>>> index 16f8e8be01f8..60eb1ff7c9a8 100644 > >>>> --- a/mm/mmap.c > >>>> +++ b/mm/mmap.c > >>>> @@ -913,6 +913,7 @@ __get_unmapped_area(struct file *file, unsigned long addr, unsigned long len, > >>>> error = security_mmap_addr(addr); > >>>> return error ? error : addr; > >>>> } > >>>> +EXPORT_SYMBOL(__get_unmapped_area); > >>> We'll need a VERY good reason to export this internal implementation > >>> detail, and if that were provided we'd need a VERY good reason for it not > >>> to be GPL. > >>> > >>> This just seems to be a cheap way of invoking (), > >>> maybe, if it is being used by the underlying file system. > >> > >> But the underlying file system may not support large folio. In this case, > >> the mmap address doesn't need to be aligned with THP size. > > > > But it'd not cause any problems to just do that anyway right? I don't think > > many people think 'oh no I have a PMD aligned mapping now what will I do'? > > > > Again - the right solution here is to handle large folios in overlayfs as > > far as I can tell. > > I think this is not to handle large folio for overlayfs, it is about vma > alignment or vma allocation for memory mapped files, > > > 1) many fs support THP mapping, using thp_get_unmapped_area(), > > fs/bcachefs/fs.c: .get_unmapped_area = thp_get_unmapped_area, > fs/btrfs/file.c: .get_unmapped_area = thp_get_unmapped_area, > fs/erofs/data.c: .get_unmapped_area = thp_get_unmapped_area, > fs/ext2/file.c: .get_unmapped_area = thp_get_unmapped_area, > fs/ext4/file.c: .get_unmapped_area = thp_get_unmapped_area, > fs/fuse/file.c: .get_unmapped_area = thp_get_unmapped_area, > fs/xfs/xfs_file.c: .get_unmapped_area = thp_get_unmapped_area, > > 2) and some fs has own get_unmapped_area callback too, > > fs/cramfs/inode.c: .get_unmapped_area = cramfs_physmem_get_unmapped_area, > fs/hugetlbfs/inode.c: .get_unmapped_area = hugetlb_get_unmapped_area, > fs/ramfs/file-mmu.c: .get_unmapped_area = ramfs_mmu_get_unmapped_area, > fs/ramfs/file-nommu.c: .get_unmapped_area = ramfs_nommu_get_unmapped_area, > fs/romfs/mmap-nommu.c: .get_unmapped_area = romfs_get_unmapped_area, > mm/shmem.c: .get_unmapped_area = shmem_get_unmapped_area, > > They has own rules to get a vma area, but with overlayfs(tries to > present a filesystem which is the result over overlaying one filesystem > on top of the other), we now only use the default > mm_get_unmapped_area_vmflags() to get a vma area, since the overlayfs > has no '.get_unmapped_area' callback. > > do_mmap > __get_unmapped_area > // get_area = NULL > mm_get_unmapped_area_vmflags > mmap_region > mmap_file > ovl_mmap > > It looks wrong, so we need to get the readfile via ovl_real_file() > and use realfile' get_unmapped_area callback, and if the realfile > is not with the callback, fallback to the default > mm_get_unmapped_area(), > > > > > In any case as per the above, we're just not exporting > > __get_unmapped_area(), sorry. > > > > So maybe use mm_get_unmapped_area() instead of __get_unmapped_area(), > something like below, > > +static unsigned long ovl_get_unmapped_area(struct file *file, > + unsigned long addr, unsigned long len, unsigned long pgoff, > + unsigned long flags) > +{ > + struct file *realfile; > + const struct cred *old_cred; > + > + realfile = ovl_real_file(file); > + if (IS_ERR(realfile)) > + return PTR_ERR(realfile); > + > + if (realfile->f_op->get_unmapped_area) { > + unsigned long ret; > + > + old_cred = ovl_override_creds(file_inode(file)->i_sb); > + ret = realfile->f_op->get_unmapped_area(realfile, addr, len, > + pgoff, flags); > + ovl_revert_creds(old_cred); > + > + if (ret) > + return ret; > + } > + > + return mm_get_unmapped_area(current->mm, file, addr, len, pgoff, > flags); > +} > > Correct me If I'm wrong. > You just need to be aware of the fact that between ovl_get_unmapped_area() and ovl_mmap(), ovl_real_file(file) could change from the lower file, to the upper file due to another operation that initiated copy-up. Thanks, Amir.