> On May 8, 2024, at 10:28 AM, David Hildenbrand <david@xxxxxxxxxx> wrote: > > On 08.05.24 19:00, Prakash Sangappa wrote: >>> On May 7, 2024, at 5:00 AM, David Hildenbrand <david@xxxxxxxxxx> wrote: >>> >>> On 03.05.24 03:21, Prakash Sangappa wrote: >>>> This patch proposes to fix hugetlbfs mmap behavior so that the >>>> file size does not get updated in the mmap call. >>>> The current behavior is that hugetlbfs file size will get extended by a >>>> PROT_WRITE mmap(2) call if mmap size is greater then file size. This is >>>> not normal filesystem behavior. >>>> There seem to have been very little discussion about this. There was a >>>> patch discussion[1] a while back, implying hugetlbfs file size needs >>>> extending because of the hugetlb page reservations. Looks like this was >>>> not merged. >>>> It appears there is no correlation between file size and hugetlb page >>>> reservations. Take the case of PROT_READ mmap, where the file size is >>>> not extended even though hugetlb pages are reserved. >>>> On the other hand ftruncate(2) to increase a file size does not reserve >>>> hugetlb pages. Also, mmap with MAP_NORESERVE flag extends the file size >>>> even though hugetlb pages are not reserved. >>>> Hugetlb pages get reserved(if MAP_NORESERVE is not specified) when the >>>> hugeltbfs file is mmapped, and it only covers the file's offset,length >>>> range specified in the mmap call. >>>> Issue: >>>> Some applications would prefer to manage hugetlb page allocations explicity >>>> with use of fallocate(2). The hugetlbfs file would be PROT_WRITE mapped with >>>> MAP_NORESERVE flag, which is accessed only after allocating necessary pages >>>> using fallocate(2) and release the pages by truncating the file size. Any stray >>>> access beyond file size is expected to generate a signal. This does not >>>> work properly due to current behavior which extends file size in mmap call. >>> >>> Would a simple workaround be to mmap(PROT_READ) and then mprotect(PROT_READ|PROT_WRITE)? >> Another workaround could be to ftruncate(2) the file after mmap(PROT_READ|PROT_WRITE), if MAP_NORESERVE is used. But these will require application changes as a special case for hugetlbfs that can be considered. > > I'd assume that most applications that mmap() hugetlb files need to > special-case hugetlb because of the different logical page size > granularity already. But yes, it's all unfortunate. Will run this by out application/Database team regarding implementing workarounds. > >> However, should this mmap behavior be addressed? Why mmap(PROT_WRITE) has to extend the file size needs clarification. > > The issue is, as you write, that it's existing behavior and changing it > could cause harm to other apps that rely on that. But I do wonder if really > anybody relies on that ... > > Let's explore the history: > > The current VM_WRITE check was added in: > > commit b6174df5eec9cdfd598c03d6d0807e344e109213 > Author: Zhang, Yanmin <yanmin.zhang@xxxxxxxxx> > Date: Mon Jul 10 04:44:49 2006 -0700 > > [PATCH] mmap zero-length hugetlb file with PROT_NONE to protect a hugetlb virtual area > Sometimes, applications need below call to be successful although > "/mnt/hugepages/file1" doesn't exist. > fd = open("/mnt/hugepages/file1", O_CREAT|O_RDWR, 0755); > *addr = mmap(NULL, 0x1024*1024*256, PROT_NONE, 0, fd, 0); > As for regular pages (or files), above call does work, but as for huge > pages, above call would fail because hugetlbfs_file_mmap would fail if > (!(vma->vm_flags & VM_WRITE) && len > inode->i_size). > This capability on huge page is useful on ia64 when the process wants to > protect one area on region 4, so other threads couldn't read/write this > area. A famous JVM (Java Virtual Machine) implementation on IA64 needs the > capability. > > But it was only moved. > > Before that patch: > * mmap(PROT_WRITE) would have failed if the file size would be exceeded > * mmap(PROT_READ/PROT_NONE) would have extended the file > > After that patch > * mmap(PROT_WRITE) will extend the file > * mmap(PROT_READ/PROT_NONE) do not extend the file > > The code before that predates git times. > > Having a mount option to change that really is suboptimal IMHO ... we shouldn't add mount options to work > around all hugetlbfs quirks. > > I suggest either > > (a) Document it, along with the workaround At least needs documentation. > (b) Change it an cross fingers. > > > In QEMU source code is a very interesting comment: > > * ftruncate is not supported by hugetlbfs in older > * hosts, so don't bother bailing out on errors. > * If anything goes wrong with it under other filesystems, > * mmap will fail. > > So, was mmap() maybe the way to easily grow a hugetlbfs file before ftruncate() support > was added? > > QEMU will only call ftruncate() if the file size is empty, though. So if you'd have a > smaller file QEMU would not try growing it, and mmap() would succeed and grow it. That's > a rare case to happen, though, and likely also undesired here: we want it to behave just > like ordinary files! Ideally yes. Thanks for your feedback. -Prakash. > > -- > Cheers, > > David / dhildenb