On 03/14/2017 11:37 AM, Andrea Arcangeli wrote:
> Hello,
>
> On Wed, Mar 08, 2017 at 05:30:55PM -0800, Mike Kravetz wrote:
>> On 01/10/2017 03:02 PM, Mike Kravetz wrote:
>>> Another more concrete topic is hugetlb reservations.  Michal Hocko
>>> proposed the topic "mm patches review bandwidth", and brought up the
>>> related subject of areas in need of attention from an architectural
>>> POV.  I suggested that hugetlb reservations was one such area.  I'm
>>> guessing it was introduced to solve a rather concrete problem.  However,
>>> over time additional hugetlb functionality was added and the
>>> capabilities of the reservation code were stretched to accommodate.
>>> It would be good to step back and take a look at the design of this
>>> code to determine if a rewrite/redesign is necessary.  Michal suggested
>>> documenting the current design/code as a first step.  If people think
>>> this is worth discussion at the summit, I could put together such a
>>> design before the gathering.
>>
>> I attempted to put together a design/overview of how hugetlb reservations
>> currently work.  Hopefully, this will be useful.
>
> Another area of hugetlbfs that is not clear is the status of
> MADV_REMOVE and the behavior of fallocate punch hole that deviates
> from more standard shmem semantics. That might also be a topic of
> interest related to your hugetlbfs topic and marginally related to
> userfaultfd.

Thanks Andrea, I was not aware qemu was carrying all this information.

> The current status for anon, shmem and hugetlbfs is like this:
>
> MADV_DONTNEED works: anon, !VM_SHARED shmem
> MADV_DONTNEED doesn't work: hugetlbfs VM_SHARED, hugetlbfs !VM_SHARED
> MADV_DONTNEED works but not guaranteed to fault: shmem VM_SHARED
>
> MADV_REMOVE works: shmem VM_SHARED, hugetlbfs VM_SHARED
> MADV_REMOVE doesn't work: anon, shmem !VM_SHARED, hugetlbfs !VM_SHARED
>
> fallocate punch hole works: hugetlbfs VM_SHARED, hugetlbfs !VM_SHARED,
>                             shmem VM_SHARED
> fallocate punch hole doesn't work: anon, shmem !VM_SHARED
>
> So what happens in qemu is:
>
> anon -> MADV_DONTNEED
>
> shmem !VM_SHARED -> MADV_DONTNEED (fallocate punch hole wouldn't zap
>                     private pages, but it does on hugetlbfs)
>
> shmem VM_SHARED -> fallocate punch hole (MADV_REMOVE would
>                    work too)
>
> hugetlbfs !VM_SHARED -> fallocate punch hole (works for hugetlbfs
>                         but not for shmem !VM_SHARED)
>
> hugetlbfs VM_SHARED -> fallocate punch hole (MADV_REMOVE would work too)
>
> This means qemu has to carry around information on the type of memory
> it got from the initial memblock setup, so at live migration time it
> can zap the memory with the right call. (NOTE: such memory is not
> generated by userfaultfd UFFDIO_COPY, but it was allocated and mapped
> and it must be zapped well before calling userfaultfd the first time).
>
> To do this qemu uses fstatfs and finds out which kind of memory it's
> dealing with to use the right call depending on which memory.
>
> In short it'd be better to have something like a generic MADV_REMOVE
> that guarantees a non-present fault after it succeeds, no matter what
> kind of memory is mapped in the virtual range that has to be
> zapped. The above is far from ideal from a userland developer
> perspective.

I think we will need to have a new generic MADV_REMOVE type of call as
you suggest.
Based on existing documentation for MADV_DONTNEED, MADV_REMOVE
and fallocate hole punch, each is designed not to work on at least one
of the desired memory mapping types.

> Overall fallocate punch hole covers the most cases so to keep the code
> simpler ironically MADV_REMOVE ends up being never used despite
> providing a more friendly API than fallocate to qemu. The files are
> always mapped and the older code only dealt with virtual addresses
> (before hugetlbfs and shmem entered the equation). Ideally qemu wants
> to call the same madvise regardless of whether the memory is anon,
> shmem or hugetlbfs, without having to carry around file descriptors,
> file offsets and superblock types.
>
> It's also not clear why MADV_DONTNEED doesn't work for hugetlbfs
> !VM_SHARED mappings and why fallocate punch hole is also zapping
> private cow-like pages from !VM_SHARED mappings (although if it
> didn't, it would be impossible to zap those... so it's good luck it
> does).

Yes, it is more like good luck than design.  fallocate hole punch for
hugetlbfs VM_SHARED was the original use case/design.  MADV_REMOVE was
added just because it could be, without additional effort.

Thanks for bringing this up.  We should definitely discuss this within
the scope of hugetlbfs and/or userfaultfd.

-- 
Mike Kravetz

>
> Thanks,
> Andrea
>
> PS. CC'ed also qemu-devel in case it may help clarify why things are
> implemented the way they are in the postcopy live migration
> hugetlbfs/shmem support and in the future patches for shmem/hugetlbfs
> share=on.
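
For anyone following along, the per-memory-type dispatch described in
the quoted message can be sketched roughly as below. This is only an
illustration and not qemu's actual code: the names mem_kind, zap_method,
classify_fd, pick_zap_method and zap_range are made up for this sketch,
and it assumes Linux with glibc. It uses fstatfs to tell hugetlbfs from
tmpfs/shmem, then picks madvise(MADV_DONTNEED) or fallocate punch hole
according to the "what happens in qemu" list above.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for fallocate() */
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/vfs.h>
#include <linux/falloc.h>       /* FALLOC_FL_PUNCH_HOLE, FALLOC_FL_KEEP_SIZE */
#include <linux/magic.h>        /* HUGETLBFS_MAGIC, TMPFS_MAGIC */

/* Hypothetical names for this sketch; qemu's own types differ. */
enum mem_kind   { MEM_ANON, MEM_SHMEM, MEM_HUGETLBFS, MEM_UNKNOWN };
enum zap_method { ZAP_MADV_DONTNEED, ZAP_PUNCH_HOLE, ZAP_UNSUPPORTED };

/* Classify the backing of a mapping from its fd (fd < 0 means anon). */
enum mem_kind classify_fd(int fd)
{
    struct statfs sfs;

    if (fd < 0)
        return MEM_ANON;
    if (fstatfs(fd, &sfs) != 0)
        return MEM_UNKNOWN;
    if ((unsigned long)sfs.f_type == HUGETLBFS_MAGIC)
        return MEM_HUGETLBFS;
    if ((unsigned long)sfs.f_type == TMPFS_MAGIC)
        return MEM_SHMEM;
    return MEM_UNKNOWN;
}

/* Which call drops the pages, following the list in the quoted message. */
enum zap_method pick_zap_method(enum mem_kind kind, bool shared)
{
    switch (kind) {
    case MEM_ANON:
        return ZAP_MADV_DONTNEED;
    case MEM_SHMEM:
        return shared ? ZAP_PUNCH_HOLE : ZAP_MADV_DONTNEED;
    case MEM_HUGETLBFS:
        return ZAP_PUNCH_HOLE;  /* both VM_SHARED and !VM_SHARED */
    default:
        return ZAP_UNSUPPORTED;
    }
}

/*
 * Zap [addr, addr + len).  fd and offset are only used on the
 * punch-hole path.  Returns 0 on success, -1 on failure.
 */
int zap_range(void *addr, size_t len, int fd, off_t offset, bool shared)
{
    assert(len > 0);

    switch (pick_zap_method(classify_fd(fd), shared)) {
    case ZAP_MADV_DONTNEED:
        return madvise(addr, len, MADV_DONTNEED);
    case ZAP_PUNCH_HOLE:
        return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                         offset, (off_t)len);
    default:
        return -1;
    }
}
```

The fd, offset and shared arguments are exactly the extra state the
caller has to carry today. A generic MADV_REMOVE type of call would
reduce zap_range to a single madvise on the virtual range.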