On Wed, Sep 4, 2024 at 1:55 AM Carlos Llamas <cmllamas@xxxxxxxxxx> wrote:
>
> On Tue, Sep 03, 2024 at 07:45:12PM +0800, Barry Song wrote:
> > On Tue, Sep 3, 2024 at 7:01 PM Hillf Danton <hdanton@xxxxxxxx> wrote:
> > >
> > > On Tue, Sep 03, 2024 at 10:50:09AM +1200, Barry Song wrote:
> > > > From: Barry Song <v-songbaohua@xxxxxxxx>
> > > >
> > > > The mmap_write_lock() can block all access to the VMAs, for example page
> > > > faults. Performing memory allocation while holding this lock may trigger
> > > > direct reclamation, leading to others being queued in the rwsem for an
> > > > extended period.
> > > > We've observed that the allocation can sometimes take more than 300ms,
> > > > significantly blocking other threads. The user interface sometimes
> > > > becomes less responsive as a result. To prevent this, let's move the
> > > > allocation outside of the write lock.
>
> Thanks for you patch Barry. So, we are aware of this contention and I've
> been working on a fix for it. See more about this below.

Cool, Carlos!

>
> > >
> > > I suspect concurrent allocators make things better wrt response, cutting
> > > alloc latency down to 10ms for instance in your scenario. Feel free to
> > > show figures given Tangquan's 48-hour profiling.
> >
> > Likely.
> >
> > Concurrent allocators are quite common in PFs which occur
> > in the same PTE. whoever gets PTL sets PTE, others free the allocated
> > pages.
> >
> > >
> > > > A potential side effect could be an extra alloc_page() for the second
> > > > thread executing binder_install_single_page() while the first thread
> > > > has done it earlier. However, according to Tangquan's 48-hour profiling
> > > > using monkey, the likelihood of this occurring is minimal, with a ratio
> > > > of only 1 in 2400. Compared to the significantly costly rwsem, this is
> > > > negligible.
>
> This is not negligible. In fact, it is the exact reason for the page
> allocation to be done with the mmap sem.
> If the first thread sleeps on
> vm_insert_page(), then binder gets into a bad state of multiple threads
> trying to reclaim pages that won't really be used. Memory pressure goes
> from bad to worst pretty quick.
>
> FWIW, I believe this was first talked about here:
> https://lore.kernel.org/all/ZWmNpxPXZSxdmDE1@xxxxxxxxxx/

However, I'm not entirely convinced that this is a problem :-) Concurrent
allocations like this can occur in many places, especially in PFs.
Reclamation is not useless because it helps free up memory for others;
it's not without value. I also don't believe binder is one of the largest
users executing concurrent allocations.

>
>
> > > > On the other hand, holding a write lock without making any VMA
> > > > modifications appears questionable and likely incorrect. While this
> > > > patch focuses on reducing the lock duration, future updates may aim
> > > > to eliminate the write lock entirely.
> > >
> > > If spin, better not before taking a look at vm_insert_page().
> >
> > I have patch 2/3 transitioning to mmap_read_lock, and per_vma_lock is
> > currently in the testing queue. At the moment, alloc->spin is in place,
> > but I'm not entirely convinced it's the best replacement for the write
> > lock. Let's wait for Tangquan's test results.
> >
> > Patch 2 is detailed below, but it has only passed the build-test phase
> > so far, so its result is uncertain. I'm sharing it early in case you
> > find it interesting. And I am not convinced Commit d1d8875c8c13
> > ("binder: fix UAF of alloc->vma in race with munmap()") is a correct
> > fix to really avoid all UAF of alloc->vma.
> >
> > [PATCH] binder_alloc: Don't use mmap_write_lock for installing page
> >
> > Commit d1d8875c8c13 ("binder: fix UAF of alloc->vma in race with
> > munmap()") uses the mmap_rwsem write lock to protect against a race
> > condition with munmap, where the vma is detached by the write lock,
> > but pages are zapped by the read lock.
> > This approach is extremely
> > expensive for the system, though perhaps less so for binder itself,
> > as the write lock can block all other operations.
> >
> > As an alternative, we could hold only the read lock and re-check
> > that the vma hasn't been detached. To protect simultaneous page
> > installation, we could use alloc->lock instead.
> >
> > Signed-off-by: Barry Song <v-songbaohua@xxxxxxxx>
> > ---
> >  drivers/android/binder_alloc.c | 32 +++++++++++++++++---------------
> >  1 file changed, 17 insertions(+), 15 deletions(-)
> >
> > diff --git a/drivers/android/binder_alloc.c b/drivers/android/binder_alloc.c
> > index f20074e23a7c..a2281dfacbbc 100644
> > --- a/drivers/android/binder_alloc.c
> > +++ b/drivers/android/binder_alloc.c
> > @@ -228,24 +228,17 @@ static int binder_install_single_page(struct binder_alloc *alloc,
> >  		return -ESRCH;
> >
> >  	/*
> > -	 * Don't allocate page in mmap_write_lock, this can block
> > -	 * mmap_rwsem for a long time; Meanwhile, allocation failure
> > -	 * doesn't necessarily need to return -ENOMEM, if lru_page
> > -	 * has been installed, we can still return 0(success).
> > +	 * Allocation failure doesn't necessarily need to return -ENOMEM,
> > +	 * if lru_page has been installed, we can still return 0(success).
> > +	 * So, defer the !page check until after binder_get_installed_page()
> > +	 * is completed.
> >  	 */
> >  	page = alloc_page(GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO);
> >
> > -	/*
> > -	 * Protected with mmap_sem in write mode as multiple tasks
> > -	 * might race to install the same page.
> > -	 */
> > -	mmap_write_lock(alloc->mm);
> > -	if (binder_get_installed_page(lru_page)) {
> > -		ret = 1;
> > -		goto out;
> > -	}
> > +	mmap_read_lock(alloc->mm);
> >
> > -	if (!alloc->vma) {
> > +	/* vma might have been dropped or deattached */
> > +	if (!alloc->vma || !find_vma(alloc->mm, addr)) {
> >  		pr_err("%d: %s failed, no vma\n", alloc->pid, __func__);
> >  		ret = -ESRCH;
> >  		goto out;
> > @@ -257,18 +250,27 @@ static int binder_install_single_page(struct binder_alloc *alloc,
> >  		goto out;
> >  	}
> >
> > +	spin_lock(&alloc->lock);
>
> You can't hold a spinlock and then call vm_insert_page().

Thanks! This patch has only passed the build test so far. It seems like
we can hold off on further testing for now.

>
> > +	if (binder_get_installed_page(lru_page)) {
> > +		spin_unlock(&alloc->lock);
> > +		ret = 1;
> > +		goto out;
> > +	}
> > +
> >  	ret = vm_insert_page(alloc->vma, addr, page);
> >  	if (ret) {
> >  		pr_err("%d: %s failed to insert page at offset %lx with %d\n",
> >  		       alloc->pid, __func__, addr - alloc->buffer, ret);
> > +		spin_unlock(&alloc->lock);
> >  		ret = -ENOMEM;
> >  		goto out;
> >  	}
> >
> >  	/* Mark page installation complete and safe to use */
> >  	binder_set_installed_page(lru_page, page);
> > +	spin_unlock(&alloc->lock);
> >  out:
> > -	mmap_write_unlock(alloc->mm);
> > +	mmap_read_unlock(alloc->mm);
> >  	mmput_async(alloc->mm);
> >  	if (ret && page)
> >  		__free_page(page);
> > --
> > 2.39.3 (Apple Git-146)
> >
> Sorry, but as I mentioned, I've been working on fixing this contention
> by supporting concurrent "faults" in binder_install_single_page(). This
> is the appropriate fix. I should be sending a patch soon after working
> out the conflicts with the shrinker's callback.

Awesome! I’m eager to see your patch, and we’re ready to help with
testing. I strongly recommend dropping the write lock entirely. Using
`mmap_write_lock()` isn’t just a binder-specific concern; it has the
potential to affect the entire Android system.

In patch 3, I experimented with using `per_vma_lock` as well. I’m _not_
proposing it for merging since you’re already working on it, but I wanted
to share the idea. (Just like patch 2, it has only passed a build test.)

[PATCH] binder_alloc: Further move to per_vma_lock from mmap_read_lock

To further reduce the read lock duration, let's try using per_vma_lock
first. If that fails, we can take the read lock, similar to how page
fault handlers operate.

Signed-off-by: Barry Song <v-songbaohua@xxxxxxxx>
---
 drivers/android/binder_alloc.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/drivers/android/binder_alloc.c b/drivers/android/binder_alloc.c
index a2281dfacbbc..b40a5dd650c8 100644
--- a/drivers/android/binder_alloc.c
+++ b/drivers/android/binder_alloc.c
@@ -221,6 +221,8 @@ static int binder_install_single_page(struct binder_alloc *alloc,
 				      struct binder_lru_page *lru_page,
 				      unsigned long addr)
 {
+	struct vm_area_struct *vma;
+	bool per_vma_lock = true;
 	struct page *page;
 	int ret = 0;
 
@@ -235,10 +237,15 @@ static int binder_install_single_page(struct binder_alloc *alloc,
 	 */
 	page = alloc_page(GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO);
 
-	mmap_read_lock(alloc->mm);
+	vma = lock_vma_under_rcu(alloc->mm, addr);
+	if (!vma) {
+		per_vma_lock = false;
+		mmap_read_lock(alloc->mm);
+		vma = find_vma(alloc->mm, addr);
+	}
 
-	/* vma might have been dropped or deattached */
-	if (!alloc->vma || !find_vma(alloc->mm, addr)) {
+	/* vma might have been dropped, deattached or changed to new one */
+	if (!alloc->vma || !vma || vma != alloc->vma) {
 		pr_err("%d: %s failed, no vma\n", alloc->pid, __func__);
 		ret = -ESRCH;
 		goto out;
@@ -270,7 +277,10 @@ static int binder_install_single_page(struct binder_alloc *alloc,
 	binder_set_installed_page(lru_page, page);
 	spin_unlock(&alloc->lock);
 out:
-	mmap_read_unlock(alloc->mm);
+	if (per_vma_lock)
+		vma_end_read(vma);
+	else
+		mmap_read_unlock(alloc->mm);
 	mmput_async(alloc->mm);
 	if (ret && page)
 		__free_page(page);
-- 
2.39.3 (Apple Git-146)

>
> Thanks,
> --
> Carlos Llamas

Thanks
Barry
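
For anyone following the thread, the "allocate outside the lock, then
re-check under it" pattern debated above can be sketched in plain
userspace C. This is only an illustration under stated assumptions, not
binder code: `struct slot` and `install_slot()` are hypothetical names, the
pthread mutex stands in for whatever lock serializes installation, and
calloc() stands in for alloc_page().

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for one binder_lru_page slot. */
struct slot {
	pthread_mutex_t lock;	/* serializes installation into this slot */
	void *page;		/* NULL until some caller installs a page */
};

/*
 * Returns 0 if this call installed the page, 1 if another caller had
 * already installed one, and -1 if allocation failed and the slot is
 * still empty.
 */
static int install_slot(struct slot *s, size_t size)
{
	void *page;
	int ret = 0;

	assert(s);

	/* Allocate before taking the lock so a slow allocation blocks nobody. */
	page = calloc(1, size);

	pthread_mutex_lock(&s->lock);
	if (s->page) {
		ret = 1;	/* lost the race; our allocation is spare */
	} else if (!page) {
		ret = -1;	/* nothing installed and nothing to install */
	} else {
		s->page = page;
		page = NULL;	/* ownership moved to the slot */
	}
	pthread_mutex_unlock(&s->lock);

	free(page);		/* free(NULL) is a no-op */
	return ret;
}
```

In this shape the loser of a race pays for one wasted allocation, which
is the cost Carlos objects to; in exchange the lock is never held across
the allocation. As in the patch, a failed allocation still reports
success (1) when someone else has already installed the page.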