On Thu, 26 Mar 2015, David Rientjes wrote: > On Thu, 26 Mar 2015, Davide Libenzi wrote: > > > > Yes, this munmap() behavior of lengths <= hugepage_size - PAGE_SIZE for a > > > hugetlb vma is long standing and there may be applications that break as a > > > result of changing the behavior: a database that reserves all allocated > > > hugetlb memory with mmap() so that it always has exclusive access to those > > > hugepages, whether they are faulted or not, and maintains its own hugepage > > > pool (which is common), may test the return value of munmap() and depend > > > on it returning -EINVAL to determine if it is freeing memory that was > > > either dynamically allocated or mapped from the hugetlb reserved pool. > > > > You went a long way to create such a case. > > But, in your case, that application will erroneously considering hugepage > > mmaped memory, as dynamically allocated, since it will always get EINVAL, > > unless it passes an aligned size. Aligned size, which a fix like the one > > posted in the patch will still leave as success. > > There was a patch proposed last week to add reserved pools to the > hugetlbfs mount option specifically for the case where a large database > wants sole reserved access to the hugepage pool. This is why hugetlbfs > pages become reserved on mmap(). In that case, the database never wants > to do munmap() and instead maintains its own hugepage pool. > > That makes the usual database case, mmap() all necessary hugetlb pages to > reserve them, even easier since they have historically had to maintain > this pool amongst various processes. > > Is there a process out there that tests for munmap(ptr) == EINVAL and, if > true, returns ptr to its hugepage pool? I can't say for certain that none > exist, that's why the potential for breakage exists. Such an application can use /proc/pid/smaps to determine the page size of a mapping. IMO, this is relying on broken behavior but I see where you are coming from that this behavior has been present for a long time. As I stated before, I think we should fix this bug and make munmap() behavior match what is described in the man page. > > > OTOH, an application, which might be more common than the one you posted, > > which calls munmap() to release a pointer which it validly got from a > > previous mmap(), will leak huge pages as all the issued munmaps will fail. > > > > That application would have to be ignoring an EINVAL return value. > > > > If we were to go back in time and decide this when the munmap() behavior > > > for hugetlb vmas was originally introduced, that would be valid. The > > > problem is that it could lead to userspace breakage and that's a > > > non-starter. > > > > > > What we can do is improve the documentation and man-page to clearly > > > specify the long-standing behavior so that nobody encounters unexpected > > > results in the future. > > > > This way you will leave the mmap API with broken semantics. > > In any case, I am done arguing. > > I will leave to Andrew to sort it out, and to Michael Kerrisk to update > > the mmap man pages with the new funny behaviour. > > > > The behavior is certainly not new, it has always been the case for > munmap() on hugetlb vmas. > > In a strict POSIX interpretation, it refers only to pages in the sense of > what is returned by sysconf(_SC_PAGESIZE). Such vmas are not backed by > any pages of size sysconf(_SC_PAGESIZE), so this behavior is undefined. > It would be best to modify the man page to explicitly state this for > MAP_HUGETLB.
Attachment:
signature.asc
Description: Digital signature