On 09/30/23 20:55, riel@xxxxxxxxxxx wrote:
> From: Rik van Riel <riel@xxxxxxxxxxx>
>
> Replace the custom hugetlbfs VMA locking code with the recently
> introduced invalidate_lock. This greatly simplifies things.
>
> However, this is a large enough change that it should probably go in
> separately from the other changes.
>
> Suggested-by: Matthew Wilcox <willy@xxxxxxxxxxxxx>
> Signed-off-by: Rik van Riel <riel@xxxxxxxxxxx>
> ---
>  fs/hugetlbfs/inode.c    |  68 +-----------
>  include/linux/fs.h      |   6 ++
>  include/linux/hugetlb.h |  21 +---
>  mm/hugetlb.c            | 227 ++------------------------------------
>  4 files changed, 32 insertions(+), 290 deletions(-)

As noted elsewhere, there are issues with patch 2 of this series, and the
complete series does not pass the libhugetlbfs tests.

However, there were questions about the performance characteristics of
replacing the hugetlb vma lock with the invalidate_lock.

This is from commit 188a39725ad7, which describes the performance gains
from the hugetlb vma lock:

    The recent regression report [1] notes page fault and fork latency of
    shared hugetlb mappings.  To measure this, I created two simple programs:

    1) map a shared hugetlb area, write fault all pages, unmap area
       Do this in a continuous loop to measure faults per second

    2) map a shared hugetlb area, write fault a few pages, fork and exit
       Do this in a continuous loop to measure forks per second

    These programs were run on a 48 CPU VM with 320GB memory.  The shared
    mapping size was 250GB.  For comparison, a single instance of the
    program was run.  Then, multiple instances were run in parallel to
    introduce lock contention.  Changing the locking scheme results in a
    significant performance benefit.

    test             instances   unmodified     revert       vma
    ----------------------------------------------------------------------
    faults per sec       1          393043      395680      389932
    faults per sec      24           71405       81191       79048
    forks per sec        1            2802        2747        2725
    forks per sec       24             439         536         500
    Combined faults     24            1621       68070       53662
    Combined forks      24             358          67         142

    Combined test is when running both faulting program and forking
    program simultaneously.

This series was 'stable enough' to run the test, although I did see some
bad PMD state warnings and threw out those runs.  Here are the results:

test             instances   next-20230925   next-20230925+series
--------------------------------------------------------------------------
faults per sec       1           382994             386884
faults per sec      24            97959              75427
forks per sec        1             3105               3148
forks per sec       24              693                715
Combined faults     24            74506              31648
Combined forks      24              233                282

The significant measurement is 'Combined faults 24'.  There is a 50+% drop,
which is better than I expected.  It might be interesting to fix up all the
issues in the series and rerun these tests?

Do note that the performance issue was originally reported as an issue with
a database using hugetlbfs (not my employer).  I did not have access to the
actual DB to recreate the issue.  However, the user verified that changes in
the 'Combined faults 24' measurement reflected changes in their DB
performance.
-- 
Mike Kravetz
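
For reference, here is a minimal sketch of what test program 1 (faults per
second) could look like.  This is not the actual program used for the numbers
above; the 2 MiB huge page size, the 1 GiB mapping size, the 10-second
reporting interval, and the use of an anonymous MAP_HUGETLB mapping rather
than a file on a hugetlbfs mount are all assumptions made for illustration.

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>

#define HPAGE_SIZE (2UL << 20)   /* assumed 2 MiB default huge pages */
#define MAP_SIZE   (1UL << 30)   /* 1 GiB for illustration; report used 250GB */

int main(void)
{
    unsigned long pages = MAP_SIZE / HPAGE_SIZE;
    unsigned long faults = 0;
    struct timespec start, now;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (;;) {
        /* map a shared hugetlb area */
        char *area = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
                          MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (area == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        /* write fault every huge page in the mapping */
        for (unsigned long i = 0; i < pages; i++)
            area[i * HPAGE_SIZE] = 1;
        faults += pages;

        /* unmap the area and repeat */
        munmap(area, MAP_SIZE);

        clock_gettime(CLOCK_MONOTONIC, &now);
        double secs = (now.tv_sec - start.tv_sec) +
                      (now.tv_nsec - start.tv_nsec) / 1e9;
        if (secs >= 10.0) {
            printf("faults per sec: %.0f\n", faults / secs);
            faults = 0;
            start = now;
        }
    }
}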
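
Likewise, a sketch of test program 2 (forks per second), under the same
assumptions.  "fork and exit" is read here as the child exiting immediately
while the parent loops; that interpretation, like the sizes, is an assumption.

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#define HPAGE_SIZE (2UL << 20)   /* assumed 2 MiB default huge pages */
#define MAP_SIZE   (1UL << 30)   /* 1 GiB for illustration; report used 250GB */

int main(void)
{
    struct timespec start, now;
    unsigned long forks = 0;

    /* map a shared hugetlb area once */
    char *area = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (area == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* write fault only a few huge pages */
    for (unsigned long i = 0; i < 4; i++)
        area[i * HPAGE_SIZE] = 1;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (;;) {
        pid_t pid = fork();

        if (pid < 0) {
            perror("fork");
            return 1;
        }
        if (pid == 0)
            _exit(0);   /* child exits immediately */
        waitpid(pid, NULL, 0);
        forks++;

        clock_gettime(CLOCK_MONOTONIC, &now);
        double secs = (now.tv_sec - start.tv_sec) +
                      (now.tv_nsec - start.tv_nsec) / 1e9;
        if (secs >= 10.0) {
            printf("forks per sec: %.0f\n", forks / secs);
            forks = 0;
            start = now;
        }
    }
}

The 'Combined' rows in the tables simply correspond to running both programs
at the same time against shared hugetlb mappings, so the faulting and forking
paths contend for the same locks.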