On Tue, Jul 8, 2014 at 6:43 PM, Laura Abbott <lauraa@xxxxxxxxxxxxxx> wrote:
>
> Hi,
>
> I have an arm64 target which has been observed hanging in
> __purge_vmap_area_lazy in vmalloc.c. The root cause of this 'hang' is that
> flush_tlb_kernel_range is attempting to flush 255GB of virtual address
> space. This takes ~2 seconds, and preemption is disabled for that time
> thanks to the purge lock. Disabling preemption for that long is enough to
> trigger a watchdog we have set up.
>
> Triggering this is fairly easy:
> 1) Early in bootup, vmalloc > lazy_max_pages. This gives an address near
>    the start of the vmalloc range.
> 2) load a module
> 3) vfree the vmalloc region from step 1
> 4) unload the module
>
> The arm64 virtual address layout looks like
> vmalloc : 0xffffff8000000000 - 0xffffffbbffff0000   (245759 MB)
> vmemmap : 0xffffffbc02400000 - 0xffffffbc03600000   (    18 MB)
> modules : 0xffffffbffc000000 - 0xffffffc000000000   (    64 MB)
>
> and the algorithm in __purge_vmap_area_lazy flushes between the lowest and
> highest addresses being purged. Essentially, if we are using a reasonable
> amount of vmalloc space and a module unload triggers a vmalloc purge, we
> will end up triggering our watchdog.
>
> A couple of options I thought of:
> 1) Increase the timeout of our watchdog to allow the flush to occur. Nobody
>    I suggested this to likes the idea, as the watchdog firing generally
>    catches behavior that results in poor system performance, and disabling
>    preemption for that long does seem like a problem.
> 2) Change __purge_vmap_area_lazy to do less work under a spinlock. This
>    would certainly have a performance impact, and I don't even know if it
>    is plausible.
> 3) Allow module unloading to trigger a vmalloc purge beforehand to help
>    avoid this case. This would still be racy if another vfree came in
>    between the purge and the vfree, but it might be good enough.
> 4) Add 'if size > threshold, flush entire tlb' (I haven't profiled this
>    yet)

We have the same problem. I'd agree with point 2 and point 4; points 1/3 do
not actually fix this issue, since purge_vmap_area_lazy() could be called in
other cases. W.r.t. the threshold for flushing the entire TLB instead of
doing it page by page, that could differ from platform to platform. And
considering the cost of a TLB flush on x86, I wonder why this isn't an issue
there.

The whole of __purge_vmap_area_lazy() is protected by a single spinlock; I
see no reason why a mutex cannot be used there, which would allow preemption
during this likely lengthy process. The rbtree removal seems heavy too - in
the worst case __free_vmap_area() is called lazy_max_pages times, all under
a single spinlock held for the whole traversal, which is not necessary.

CC+ Russell, Catalin, Will.
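For reference, the ~255GB figure follows directly from the layout quoted
above: the region vmalloc'd in step 1 (and vfreed in step 3) sits near the
bottom of the vmalloc area, while the module unloaded in step 4 sits in the
modules area at the very top, so the purge ends up with a flush range of
roughly

	0xffffffc000000000 - 0xffffff8000000000 = 0x4000000000 bytes = 256GB

which flush_tlb_kernel_range() on arm64 then invalidates one page at a
time - hence the ~2 seconds spent with preemption disabled.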
We have a patch as below:

============================ >8 =========================

From cabc0fc0bdeee3d7824660b17f25ab465e3cc76b Mon Sep 17 00:00:00 2001
From: Eric Miao <emiao@xxxxxxxxxx>
Date: Mon, 7 Jul 2014 18:19:09 -0700
Subject: [PATCH 2/2] mm: vmalloc: allow preemption in __purge_vmap_area_lazy()

Signed-off-by: Eric Miao <emiao@xxxxxxxxxx>
Signed-off-by: Richard Wiley <rwiley@xxxxxxxxxx>
---
 mm/vmalloc.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 35b19d4..35a2d36 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -597,7 +597,7 @@ void set_iounmap_nonlazy(void)
 static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
 					int sync, int force_flush)
 {
-	static DEFINE_SPINLOCK(purge_lock);
+	static DEFINE_MUTEX(purge_lock);
 	LIST_HEAD(valist);
 	struct vmap_area *va;
 	struct vmap_area *n_va;
@@ -609,10 +609,10 @@ static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
 	 * the case that isn't actually used at the moment anyway.
 	 */
 	if (!sync && !force_flush) {
-		if (!spin_trylock(&purge_lock))
+		if (!mutex_trylock(&purge_lock))
 			return;
 	} else
-		spin_lock(&purge_lock);
+		mutex_lock(&purge_lock);
 
 	if (sync)
 		purge_fragmented_blocks_allcpus();
@@ -644,12 +644,13 @@ static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end,
 	}
 
 	if (nr) {
-		spin_lock(&vmap_area_lock);
-		list_for_each_entry_safe(va, n_va, &valist, purge_list)
+		list_for_each_entry_safe(va, n_va, &valist, purge_list) {
+			spin_lock(&vmap_area_lock);
 			__free_vmap_area(va);
-		spin_unlock(&vmap_area_lock);
+			spin_unlock(&vmap_area_lock);
+		}
 	}
-	spin_unlock(&purge_lock);
+	mutex_unlock(&purge_lock);
 }
 
 /*
-- 
1.8.1.5
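For point 4, the change itself would be small - something along the lines of
the sketch below. This is only an illustration: the helper name and the
threshold value are made up here, and as noted above the break-even point
between a ranged flush and a full flush needs profiling and will likely
differ between platforms.

/*
 * Sketch only, not part of the patch above: fall back to a full TLB
 * flush once the range being purged is so large that flushing it page
 * by page costs more than refilling the TLB from scratch.
 */
#define LAZY_PURGE_FLUSH_ALL_THRESHOLD	(512UL << 20)	/* placeholder value */

static void purge_flush_tlb(unsigned long start, unsigned long end)
{
	if (end - start > LAZY_PURGE_FLUSH_ALL_THRESHOLD)
		flush_tlb_all();
	else
		flush_tlb_kernel_range(start, end);
}

__purge_vmap_area_lazy() would then call purge_flush_tlb(*start, *end) where
it currently calls flush_tlb_kernel_range(*start, *end).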