On 20.04.22 00:50, Mike Kravetz wrote:
> On 4/8/22 02:26, David Hildenbrand wrote:
>>>>
>>>> Let's assume a 4 TiB device and 2 MiB hugepage size. That's 2097152
>>>> huge pages. Each such PMD entry consumes 8 bytes. That's 16 MiB.
>>>>
>>>> Sure, with thousands of processes sharing that memory, the size of
>>>> page tables required would increase with each and every process. But
>>>> TBH, that's in no way different to other file systems where we're
>>>> even dealing with PTE tables.
>>>
>>> The numbers for a real use case I am frequently quoted are something
>>> like:
>>>   1TB shared mapping, 10,000 processes sharing the mapping
>>>   4K PMD Page per 1GB of shared mapping
>>>   4M saving for each shared process
>>>   9,999 * 4M ~= 39GB savings
>>
>> 3.7 % of all memory. Noticeable if the feature is removed? Yes. Do we
>> care about supporting such corner cases that result in a maintenance
>> burden? My take is a clear no.
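For reference, both calculations quoted above can be cross-checked with
a quick standalone sketch (assuming the usual x86-64 geometry: 4 KiB
base pages, 8-byte page table entries, so 512 entries per PMD page;
plain userspace arithmetic, not kernel code):

/*
 * pmd_math.c -- hypothetical standalone sketch re-deriving the page
 * table numbers quoted above. Assumes 2 MiB huge pages, each mapped by
 * one 8-byte PMD entry, and 4 KiB PMD pages holding 512 entries.
 */
#include <stdio.h>

int main(void)
{
        const unsigned long long GiB = 1ULL << 30, TiB = 1ULL << 40;
        const unsigned long long huge_page = 2ULL << 20; /* 2 MiB */
        const unsigned long long pmd_entry = 8;          /* bytes  */

        /* 4 TiB mapping: one PMD entry per 2 MiB huge page. */
        unsigned long long entries = 4 * TiB / huge_page;
        printf("4 TiB mapping: %llu PMD entries, %llu MiB of PMD tables\n",
               entries, entries * pmd_entry >> 20);

        /*
         * 1 TiB mapping: one 4 KiB PMD page maps 512 * 2 MiB = 1 GiB,
         * so each process needs 1024 PMD pages = 4 MiB of page tables.
         */
        unsigned long long per_process = (TiB / GiB) * 4096;
        printf("1 TiB mapping: %llu MiB of PMD pages per process\n",
               per_process >> 20);
        printf("9,999 extra copies without sharing: ~%llu GiB\n",
               9999 * per_process >> 30);
        return 0;
}

This should print 16 MiB of PMD tables for the 4 TiB case and ~39 GiB
of unshared PMD pages for the 10,000-process case, matching the figures
above.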
>>
>>>
>>> However, if you look at commit 39dde65c9940c which introduced huge
>>> pmd sharing, it states that performance rather than memory savings
>>> was the primary objective.
>>>
>>> "For hugetlb, the saving on page table memory is not the primary
>>> objective (as hugetlb itself already cuts down page table overhead
>>> significantly), instead, the purpose of using shared page table on
>>> hugetlb is to allow faster TLB refill and smaller cache pollution
>>> upon TLB miss.
>>>
>>> With PT sharing, pte entries are shared among hundreds of processes,
>>> the cache consumption used by all the page table is smaller and in
>>> return, application gets much higher cache hit ratio. One other
>>> effect is that cache hit ratio with hardware page walker hitting on
>>> pte in cache will be higher and this helps to reduce tlb miss
>>> latency. These two effects contribute to higher application
>>> performance."
>>>
>>> That 'makes sense', but I have never tried to measure any such
>>> performance benefit. It is easier to calculate the memory savings.
>>
>> It does make sense; but then again, what's specific here about
>> hugetlb?
>>
>> Most probably it was just easy to add to hugetlb in contrast to other
>> types of shared memory.
>>
>>>
>>>>
>>>> Which results in me wondering if
>>>>
>>>> a) We should simply use gigantic pages for such extreme use cases.
>>>>    Allows for freeing up more memory via vmemmap either way.
>>>
>>> The only problem with this is that many processors in use today have
>>> limited TLB entries for gigantic pages.
>>>
>>>> b) We should instead look into reclaiming reconstruct-able page
>>>>    tables. It's hard to imagine that each and every process accesses
>>>>    each and every part of the gigantic file all of the time.
>>>> c) We should instead establish a more generic page table sharing
>>>>    mechanism.
>>>
>>> Yes. I think that is the direction taken by the mshare() proposal. If
>>> we have a more generic approach we can certainly start deprecating
>>> hugetlb pmd sharing.
>>
>> My strong opinion is to remove it ASAP and get something proper into
>> place.
>>
>
> No arguments about the complexity of this code. However, there will be
> some people who will notice if it is removed.

Yes, it should never have been added that way -- unfortunately.

>
> Whether or not we remove huge pmd sharing support, I would still like
> to address the scalability issue. To do so, taking i_mmap_rwsem in read
> mode for fault processing needs to go away. With this gone, the issue
> of faults racing with truncation needs to be addressed, as it depended
> on the fault code taking the mutex. At a high level, this is fairly
> simple, but hugetlb reservations add to the complexity. This was not
> completely addressed in this series.

Okay.

>
> I will be sending out another RFC that more correctly addresses all the
> issues this series attempted to address. I am not discounting your
> opinion that we should get rid of huge pmd sharing. Rather, I would at
> least like to get some eyes on my approach to addressing the issue with
> reservations during fault and truncate races.

Makes sense to me. I agree that we should fix all that. What I
experienced is that the pmd sharing over-complicates the situation quite
a lot and makes the code hard to follow. [Huge page reservation is
another thing I dislike, especially because it's no good in NUMA setups
and we still have to preallocate huge pages to make it work reliably.]

-- 
Thanks,

David / dhildenb