Re: [RFC PATCH 0/1] Address hugetlbfs mmap behavior

Prakash Sangappa <prakash.sangappa@xxxxxxxxxx> · Fri, 10 May 2024 16:28:31 +0000

> On May 8, 2024, at 10:28 AM, David Hildenbrand <david@xxxxxxxxxx> wrote:
> 
> On 08.05.24 19:00, Prakash Sangappa wrote:
>>> On May 7, 2024, at 5:00 AM, David Hildenbrand <david@xxxxxxxxxx> wrote:
>>> 
>>> On 03.05.24 03:21, Prakash Sangappa wrote:
>>>> This patch proposes to fix hugetlbfs mmap behavior so that the
>>>> file size does not get updated in the mmap call.
>>>> The current behavior is that hugetlbfs file size will get extended by a
>>>> PROT_WRITE mmap(2) call if mmap size is greater then file size. This is
>>>> not normal filesystem behavior.
>>>> There seem to have been very little discussion about this. There was a
>>>> patch discussion[1] a while back, implying hugetlbfs file size needs
>>>> extending because of the hugetlb page reservations. Looks like this was
>>>> not merged.
>>>> It appears there is no correlation between file size and hugetlb page
>>>> reservations. Take the case of PROT_READ mmap, where the file size is
>>>> not extended even though hugetlb pages are reserved.
>>>> On the other hand ftruncate(2) to increase a file size does not reserve
>>>> hugetlb pages. Also, mmap with MAP_NORESERVE flag extends the file size
>>>> even though hugetlb pages are not reserved.
>>>> Hugetlb pages get reserved(if MAP_NORESERVE is not specified) when the
>>>> hugeltbfs file is mmapped, and it only covers the file's offset,length
>>>> range specified in the mmap call.
>>>> Issue:
>>>> Some applications would prefer to manage hugetlb page allocations explicity
>>>> with use of fallocate(2). The hugetlbfs file would be PROT_WRITE mapped with
>>>> MAP_NORESERVE flag, which is accessed only after allocating necessary pages
>>>> using fallocate(2) and release the pages by truncating the file size. Any stray
>>>> access beyond file size is expected to generate a signal. This does not
>>>> work properly due to current behavior which extends file size in mmap call.
>>> 
>>> Would a simple workaround be to mmap(PROT_READ) and then mprotect(PROT_READ|PROT_WRITE)?
>> Another workaround could be to ftruncate(2) the file after  mmap(PROT_READ|PROT_WRITE), if MAP_NORESERVE is used. But these will require application changes as a special case for hugetlbfs that can be considered.
> 
> I'd assume that most applications that mmap() hugetlb files need to
> special-case hugetlb because of the different logical page size
> granularity already. But yes, it's all unfortunate.

Will run this by out application/Database team regarding implementing workarounds. 

> 
>> However, should this mmap behavior  be addressed? Why mmap(PROT_WRITE) has to extend the file size needs clarification.
> 
> The issue is, as you write, that it's existing behavior and changing it
> could cause harm to other apps that rely on that. But I do wonder if really
> anybody relies on that ...
> 
> Let's explore the history:
> 
> The current VM_WRITE check was added in:
> 
> commit b6174df5eec9cdfd598c03d6d0807e344e109213
> Author: Zhang, Yanmin <yanmin.zhang@xxxxxxxxx>
> Date:   Mon Jul 10 04:44:49 2006 -0700
> 
>    [PATCH] mmap zero-length hugetlb file with PROT_NONE to protect a hugetlb virtual area
>        Sometimes, applications need below call to be successful although
>    "/mnt/hugepages/file1" doesn't exist.
>        fd = open("/mnt/hugepages/file1", O_CREAT|O_RDWR, 0755);
>    *addr = mmap(NULL, 0x1024*1024*256, PROT_NONE, 0, fd, 0);
>        As for regular pages (or files), above call does work, but as for huge
>    pages, above call would fail because hugetlbfs_file_mmap would fail if
>    (!(vma->vm_flags & VM_WRITE) && len > inode->i_size).
>        This capability on huge page is useful on ia64 when the process wants to
>    protect one area on region 4, so other threads couldn't read/write this
>    area.  A famous JVM (Java Virtual Machine) implementation on IA64 needs the
>    capability.
> 
> But it was only moved.
> 
> Before that patch:
> * mmap(PROT_WRITE) would have failed if the file size would be exceeded
> * mmap(PROT_READ/PROT_NONE) would have extended the file
> 
> After that patch
> * mmap(PROT_WRITE) will extend the file
> * mmap(PROT_READ/PROT_NONE) do not extend the file
> 
> The code before that predates git times.
> 
> Having a mount option to change that really is suboptimal IMHO ... we shouldn't add mount options to work
> around all hugetlbfs quirks.
> 
> I suggest either
> 
> (a) Document it, along with the workaround

At least needs documentation. 

> (b) Change it an cross fingers.
> 
> 
> In QEMU source code is a very interesting comment:
> 
>     * ftruncate is not supported by hugetlbfs in older
>     * hosts, so don't bother bailing out on errors.
>     * If anything goes wrong with it under other filesystems,
>     * mmap will fail.
> 
> So, was mmap() maybe the way to easily grow a hugetlbfs file before ftruncate() support
> was added?
> 
> QEMU will only call ftruncate() if the file size is empty, though. So if you'd have a
> smaller file QEMU would not try growing it, and mmap() would succeed and grow it. That's
> a rare case to happen, though, and likely also undesired here: we want it to behave just
> like ordinary files!

Ideally yes. 

Thanks for your feedback. 
-Prakash.

> 
> -- 
> Cheers,
> 
> David / dhildenb