Re: [PATCH] mm/hugetlb: fix deadlock in hugetlb_cow error path

Mike Kravetz <mike.kravetz@xxxxxxxxxx> · Tue, 15 Dec 2020 14:39:38 -0800

On 12/14/20 5:06 PM, Mike Kravetz wrote:
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index d029d938d26d..8713f8ef0f4c 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4106,10 +4106,30 @@ static vm_fault_t hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
>  		 * may get SIGKILLed if it later faults.
>  		 */
>  		if (outside_reserve) {
> +			struct address_space *mapping = vma->vm_file->f_mapping;
> +			pgoff_t idx;
> +			u32 hash;
> +
>  			put_page(old_page);
>  			BUG_ON(huge_pte_none(pte));
> +			/*
> +			 * Drop hugetlb_fault_mutex and i_mmap_rwsem before
> +			 * unmapping.  unmapping needs to hold i_mmap_rwsem
> +			 * in write mode.  Dropping i_mmap_rwsem in read mode
> +			 * here is OK as COW mappings do not interact with
> +			 * PMD sharing.
> +			 *
> +			 * Reacquire both after unmap operation.
> +			 */
> +			idx = vma_hugecache_offset(h, vma, haddr);
> +			hash = hugetlb_fault_mutex_hash(mapping, idx);
> +			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> +			i_mmap_unlock_read(vma->vm_file->f_mapping);

The assignment 'mapping = vma->vm_file->f_mapping' is done at the beginning
of this block.  Silly that it is not used here.

> +
>  			unmap_ref_private(mm, vma, old_page, haddr);
> -			BUG_ON(huge_pte_none(pte));
> +
> +			i_mmap_lock_read(vma->vm_file->f_mapping);

and here.

> +			mutex_lock(&hugetlb_fault_mutex_table[hash]);
>  			spin_lock(ptl);
>  			ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
>  			if (likely(ptep &&
> 

Updated patch to use block local variable mapping.

>From aa450d80a63dc4533b2eca9f61c1acfb37587c06 Mon Sep 17 00:00:00 2001
From: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
Date: Mon, 14 Dec 2020 16:26:32 -0800
Subject: [PATCH v2] mm/hugetlb: fix deadlock in hugetlb_cow error path

syzbot reported the deadlock here [1].  The issue is in hugetlb cow
error handling when there are not enough huge pages for the faulting
task which took the original reservation.  It is possible that other
(child) tasks could have consumed pages associated with the reservation.
In this case, we want the task which took the original reservation to
succeed.  So, we unmap any associated pages in children so that they
can be used by the faulting task that owns the reservation.

The unmapping code needs to hold i_mmap_rwsem in write mode.  However,
due to commit c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd
sharing synchronization") we are already holding i_mmap_rwsem in read
mode when hugetlb_cow is called.  Technically, i_mmap_rwsem does not
need to be held in read mode for COW mappings as they can not share
pmd's.  Modifying the fault code to not take i_mmap_rwsem in read mode
for COW (and other non-sharable) mappings is too involved for a stable
fix.  Instead, we simply drop the hugetlb_fault_mutex and i_mmap_rwsem
before unmapping.  This is OK as it is technically not needed.  They
are reacquired after unmapping as expected by calling code.  Since this
is done in an uncommon error path, the overhead of dropping and
reacquiring mutexes is acceptable.

While making changes, remove redundant BUG_ON after unmap_ref_private.

[1] https://lkml.kernel.org/r/000000000000b73ccc05b5cf8558@xxxxxxxxxx

Reported-by: syzbot+5eee4145df3c15e96625@xxxxxxxxxxxxxxxxxxxxxxxxx
Fixes: c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization")
Cc: <stable@xxxxxxxxxxxxxxx>
Signed-off-by: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
---
 mm/hugetlb.c | 22 +++++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d029d938d26d..7e89f31d7ef8 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4106,10 +4106,30 @@ static vm_fault_t hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 		 * may get SIGKILLed if it later faults.
 		 */
 		if (outside_reserve) {
+			struct address_space *mapping = vma->vm_file->f_mapping;
+			pgoff_t idx;
+			u32 hash;
+
 			put_page(old_page);
 			BUG_ON(huge_pte_none(pte));
+			/*
+			 * Drop hugetlb_fault_mutex and i_mmap_rwsem before
+			 * unmapping.  unmapping needs to hold i_mmap_rwsem
+			 * in write mode.  Dropping i_mmap_rwsem in read mode
+			 * here is OK as COW mappings do not interact with
+			 * PMD sharing.
+			 *
+			 * Reacquire both after unmap operation.
+			 */
+			idx = vma_hugecache_offset(h, vma, haddr);
+			hash = hugetlb_fault_mutex_hash(mapping, idx);
+			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+			i_mmap_unlock_read(mapping);
+
 			unmap_ref_private(mm, vma, old_page, haddr);
-			BUG_ON(huge_pte_none(pte));
+
+			i_mmap_lock_read(mapping);
+			mutex_lock(&hugetlb_fault_mutex_table[hash]);
 			spin_lock(ptl);
 			ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
 			if (likely(ptep &&
-- 
2.29.2