A customer running large SMP systems (up to 16 sockets) with an
application that uses a large amount of static hugepages (~500-1500GB)
is experiencing random multisecond delays. These delays were caused by
the long time it took to scan the VMA interval tree with mmap_sem held.

To fix this problem while preserving existing behavior as much as
possible, we need to allow a timeout in down_write() and disable PMD
sharing when acquiring the lock takes too long. Since a transaction can
involve touching multiple huge pages, timing out for each huge page
interaction does not completely solve the problem. So a threshold is
set to completely disable PMD sharing if too many timeouts happen.

The first 4 patches of this 5-patch series add a new
down_write_timedlock() API which accepts a timeout argument and returns
true if locking is successful or false otherwise. It works more or less
like down_write_trylock(), but the calling thread may sleep.

The last patch implements the timeout mechanism as described above.
With the patched kernel installed, the customer confirmed that the
problem was gone.

Waiman Long (5):
  locking/rwsem: Add down_write_timedlock()
  locking/rwsem: Enable timeout check when spinning on owner
  locking/osq: Allow early break from OSQ
  locking/rwsem: Enable timeout check when staying in the OSQ
  hugetlbfs: Limit wait time when trying to share huge PMD

 include/linux/fs.h                |   7 ++
 include/linux/osq_lock.h          |  13 +--
 include/linux/rwsem.h             |   4 +-
 kernel/locking/lock_events_list.h |   1 +
 kernel/locking/mutex.c            |   2 +-
 kernel/locking/osq_lock.c         |  12 +-
 kernel/locking/rwsem.c            | 183 +++++++++++++++++++++++++-----
 mm/hugetlb.c                      |  24 +++-
 8 files changed, 201 insertions(+), 45 deletions(-)

-- 
2.18.1
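
For illustration only, the timeout-plus-threshold policy described in
this cover letter can be modelled in plain user-space C. This is a
minimal sketch and not the code in the patches: the names may_share(),
struct share_policy, timed_lock_fn and MAX_LOCK_TIMEOUTS are
hypothetical, the threshold value is made up, and the timed lock is
represented by a callback so the policy can be exercised without a real
rwsem.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical threshold: stop trying to share after this many timeouts. */
#define MAX_LOCK_TIMEOUTS 3

/*
 * Stand-in for a timed write-lock attempt (assumption: it returns true
 * if the lock was acquired within timeout_ms, false on timeout).
 */
typedef bool (*timed_lock_fn)(void *lock, unsigned int timeout_ms);

struct share_policy {
	unsigned int timeouts;	/* timed-out lock attempts so far */
	bool disabled;		/* set once the threshold is reached */
};

/*
 * Returns true if the lock was taken and sharing may be attempted,
 * false if the caller should fall back to an unshared page table.
 */
static bool may_share(struct share_policy *p, timed_lock_fn lock_fn,
		      void *lock, unsigned int timeout_ms)
{
	assert(p != NULL && lock_fn != NULL);

	if (p->disabled)
		return false;	/* too many timeouts: skip the lock entirely */

	if (lock_fn(lock, timeout_ms))
		return true;	/* lock acquired in time */

	/* Timed out: count it and disable sharing at the threshold. */
	if (++p->timeouts >= MAX_LOCK_TIMEOUTS)
		p->disabled = true;
	return false;
}

/* Test doubles that model a contended and an uncontended lock. */
static bool lock_always_times_out(void *lock, unsigned int timeout_ms)
{
	(void)lock;
	(void)timeout_ms;
	return false;
}

static bool lock_always_succeeds(void *lock, unsigned int timeout_ms)
{
	(void)lock;
	(void)timeout_ms;
	return true;
}
```

In this model a single timeout only skips sharing for that one attempt.
Once MAX_LOCK_TIMEOUTS attempts have timed out, later calls return
false without touching the lock at all, which is what bounds the total
delay when a transaction touches many huge pages.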