> The real control of hugetlbfs comes from the interfaces provided by
> the kernel. If the kernel provides similar interfaces to control page
> sizes of THPs, it should work the same as hugetlbfs. Mixing page sizes
> usually comes from system memory fragmentation and hugetlbfs does not
> have this mixture because of its special allocation pools, not because
> of the code

With hugetlbfs, you have a guarantee that all pages within your VMA
have the same page size. This is an important property. With THP, you
have the guarantee that any page can be operated on as if it were
base-page granularity.

Example: KVM on s390x

a) It cannot deal with THP. If you supply THP, the kernel will simply
   split up all THP and prohibit new ones from getting formed. All
   works well (well, no speedup because no THP).
b) It can deal with 1MB huge pages (in some configurations).
c) It cannot deal with 2G huge pages.

So user space really has to control which page size to use in the case
of hugetlbfs.

> itself. If THPs are allocated from the same pools, they would act
> the same as hugetlbfs. What am I missing here?

Did I mention that I dislike taking THP from the CMA pool? ;)

>
> I just do not get why hugetlbfs is so special that it can have
> fine-grained page size control when normal pages cannot. The "it
> should be invisible to userspace" argument suddenly does not hold
> for hugetlbfs.

It's not about "cannot get", it's about "do we need it". We do have a
trigger "THP yes/no". I wonder in which cases that wouldn't be
sufficient.

The name "Transparent" implies that they *should* be transparent to
user space. This, unfortunately, is not completely true:

1. Performance aspects: Breaking up THP is bad for performance. This
   can be observed fairly easily when using 4k-based memory ballooning
   in virtualized environments. If we stick to the current THP size
   (e.g., 2MB), we are mostly fine. Breaking up 1G THP into 2MB THP
   when required is completely acceptable.

2. Wasting memory: Touch a 4K page, get 2M populated. Somewhat
   acceptable / controllable. Touch 4K, get 1G populated is not
   desirable.

And I think we mostly agree that we should operate only on
fully-populated ranges to replace by 1G THP. But then, there is no
observable difference between 1G THP and 2M THP from a user space
point of view, except performance.

So we are debating about "Should the kernel tell us that we can use 1G
THP for a VMA". What if we were suddenly to support 2G THP (look at
arm64 and how it supports all kinds of huge page sizes for hugetlbfs)?
Do we really need *another* trigger?

What Michal proposed (IIUC) is rather user space telling the kernel
"this large memory range here is *really* important for performance,
please try to optimize the memory layout, give me the best you've
got". MADV_HUGEPAGE_1GB is just ugly.

-- 
Thanks,

David / dhildenb
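
For reference, a minimal sketch of the hugetlbfs side of this
contrast: the page size is an explicit part of user space's request,
encoded as log2(size) in the mmap() flags. This assumes Linux with
glibc; the #defines are fallbacks for older userspace headers that
lack the MAP_HUGE_* constants.

/*
 * Minimal sketch: with hugetlbfs, user space names the page size
 * explicitly at mmap() time -- there is no silent fallback to a
 * smaller size if that pool is empty.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT 26
#endif
#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB (30U << MAP_HUGE_SHIFT) /* log2(1G) == 30 */
#endif

int main(void)
{
	/* Ask for exactly one 1G huge page; fails (e.g., ENOMEM) if
	 * the hugetlbfs 1G pool has no pages reserved. */
	void *p = mmap(NULL, 1UL << 30, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB |
		       MAP_HUGE_1GB, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap(MAP_HUGETLB | MAP_HUGE_1GB)");
		return 1;
	}
	/* Every page within this VMA is guaranteed to be a 1G page. */
	munmap(p, 1UL << 30);
	return 0;
}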
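And, for comparison, a minimal sketch of the existing per-VMA "THP
yes/no" trigger via madvise(): user space never names a page size,
the kernel picks one (currently the PMD size, e.g., 2MB).

/*
 * Minimal sketch: the only per-VMA THP control is a yes/no hint;
 * which huge page size (if any) gets used is up to the kernel.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	void *p = mmap(NULL, 1UL << 30, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	/* "THP yes" for this range; fails with EINVAL on kernels
	 * built without THP support. */
	if (madvise(p, 1UL << 30, MADV_HUGEPAGE))
		perror("madvise(MADV_HUGEPAGE)");
	/* The corresponding "no" trigger would be:
	 * madvise(p, 1UL << 30, MADV_NOHUGEPAGE); */
	munmap(p, 1UL << 30);
	return 0;
}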