Re: PROBLEM: kernel crashes when running xfsdump since ~6.4

On Fri, 21. Jun 11:44, Uladzislau Rezki wrote:
> On Fri, Jun 21, 2024 at 03:07:16PM +0800, Baoquan He wrote:
> > On 06/21/24 at 11:30am, Hailong Liu wrote:
> > > On Thu, 20. Jun 14:02, Nick Bowler wrote:
> > > > On 2024-06-20 02:19, Nick Bowler wrote:
> > > > > After upgrading my sparc to 6.9.5 I noticed that attempting to run
> > > > > xfsdump instantly (within a couple seconds) and reliably crashes the
> > > > > kernel.  The same problem is also observed on 6.10-rc4.
> > > > [...]
> > > > >   062eacf57ad91b5c272f89dc964fd6dd9715ea7d is the first bad commit
> > > > >   commit 062eacf57ad91b5c272f89dc964fd6dd9715ea7d
> > > > >   Author: Uladzislau Rezki (Sony) <urezki@xxxxxxxxx>
> > > > >   Date:   Thu Mar 30 21:06:38 2023 +0200
> > > > >
> > > > >       mm: vmalloc: remove a global vmap_blocks xarray
> > > >
> > > > I think I might see what is happening here.
> > > >
> > > > On this machine, there are two CPUs numbered 0 and 2 (there is no CPU1).
> > > >
> > > +Baoquan
> >
> > Thanks for adding me, Hailong.
> >
> > >
> > > Ahh, I think you are right. addr_to_vb_xa() assumes that the CPU
> > > numbers are contiguous. I don't know much about CPU numbering.
> > > Technically, changing the implementation of addr_to_vb_xa() to
> > > return &per_cpu(vmap_block_queue, raw_smp_processor_id()).vmap_blocks;
> > > would also work, but it would break the load balancing. Waiting for
> > > the experts to reply.
> >
> > Yeah, I think so as you explained.
> >
> > >
> > > > The per-cpu variables in mm/vmalloc.c are initialized like this, in
> > > > vmalloc_init
> > > >
> > > >   for_each_possible_cpu(i) {
> > > >     /* ... */
> > > >     vbq = &per_cpu(vmap_block_queue, i);
> > > >     /* initialize stuff in vbq */
> > > >   }
> > > >
> > > > This loops over the set bits of cpu_possible_mask, bits 0 and 2 are set,
> > > > so it initializes stuff with i=0 and i=2, skipping i=1 (I added prints to
> > > > confirm this).
> > > >
> > > > Then, in vm_map_ram, with the problematic change it calls the new
> > > > function addr_to_vb_xa, which does this:
> > > >
> > > >   int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
> > > >   return &per_cpu(vmap_block_queue, index).vmap_blocks;
> > > >
> > > > The num_possible_cpus() function counts the number of set bits in
> > > > cpu_possible_mask, so it returns 2.  Thus, index is either 0 or 1, which
> > > > does not correspond to what was initialized (0 or 2).  The crash occurs
> > > > when the computed index is 1 in this function.  In this case, the
> > > > returned value appears to be garbage (I added prints to confirm this).
> >
> > This is a great catch.
> >
> Indeed :)
>
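
Just to make the mismatch concrete for myself, here is a tiny
user-space mock (not kernel code, purely illustrative) of the indexing
Nick describes above, with "possible" slots 0 and 2:

<snip>
#include <stdio.h>
#include <stdbool.h>

#define NR_SLOTS 4

static bool possible[NR_SLOTS] = { [0] = true, [2] = true };
static bool initialized[NR_SLOTS];

int main(void)
{
	int num_possible = 0;

	/* count the set "possible" slots, like num_possible_cpus() */
	for (int i = 0; i < NR_SLOTS; i++)
		num_possible += possible[i];

	/* initialize only the possible slots, like for_each_possible_cpu() */
	for (int i = 0; i < NR_SLOTS; i++)
		if (possible[i])
			initialized[i] = true;

	/* the hash produces 0 or 1, but only slots 0 and 2 were set up */
	for (unsigned long block = 0; block < 4; block++) {
		int index = block % num_possible;

		printf("block %lu -> slot %d (%s)\n", block, index,
		       initialized[index] ? "ok" : "UNINITIALIZED");
	}
	return 0;
}
</snip>
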
> > > >
> > > > If I change addr_to_vb_xa function to this:
> > > >
> > > >   int index = ((addr / VMAP_BLOCK_SIZE) & 1) << 1; /* 0 or 2 */
> > > >   return &per_cpu(vmap_block_queue, index).vmap_blocks;
> >
> > Yeah, but the above change is not generic, e.g. if the CPUs are CPU0
> > and CPU3. I think we should take the max possible CPU number as the
> > hash bucket count. The vb->va is also taken from the global
> > free_vmap_area, so no need to worry about the waste.
> >
> Agree.
>
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index be2dd281ea76..18e87cafbaf2 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -2542,7 +2542,7 @@ static DEFINE_PER_CPU(struct vmap_block_queue, vmap_block_queue);
> >  static struct xarray *
> >  addr_to_vb_xa(unsigned long addr)
> >  {
> > -	int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
> > +	int index = (addr / VMAP_BLOCK_SIZE) % nr_cpu_ids;
> >
> >  	return &per_cpu(vmap_block_queue, index).vmap_blocks;
> >  }
> >
> The problem I see is that the following loop does not initialize everything:
> <snip>
> 	for_each_possible_cpu(i) {
> 		struct vmap_block_queue *vbq;
> 		struct vfree_deferred *p;
>
> 		vbq = &per_cpu(vmap_block_queue, i);
> 		spin_lock_init(&vbq->lock);
> 		INIT_LIST_HEAD(&vbq->free);
> 		p = &per_cpu(vfree_deferred, i);
> 		init_llist_head(&p->list);
> 		INIT_WORK(&p->wq, delayed_vfree_work);
> 		xa_init(&vbq->vmap_blocks);
> 	}
> <snip>
>

Sorry to trouble you. Do you mean something like the following:
@@ -4480,7 +4480,7 @@ void __init vmalloc_init(void)
         */
        vmap_area_cachep = KMEM_CACHE(vmap_area, SLAB_PANIC);

-       for_each_possible_cpu(i) {
+       for (i = 0; i < nr_cpu_ids; i++) {
                struct vmap_block_queue *vbq;
                struct vfree_deferred *p;

That’s exactly what I was thinking for the first solution as well.
However, do we really need to initialize the per-CPU variables of the
impossible CPUs at all? How about using the index to look up the
corresponding possible CPU instead? Something like the following,
although it is not optimal and it burns extra cycles on every lookup :)
(a way to avoid the per-lookup loop is sketched after the diff).

@@ -1994,8 +1994,12 @@ static struct xarray *
 addr_to_vb_xa(unsigned long addr)
 {
        int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();
+       int cpu;

-       return &per_cpu(vmap_block_queue, index).vmap_blocks;
+       for_each_possible_cpu(cpu)
+               if (!index--)
+                       break;
+       return &per_cpu(vmap_block_queue, cpu).vmap_blocks;
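
One way to avoid the per-lookup loop (just a rough sketch, untested;
the table name vb_index_to_cpu is made up by me) would be to translate
the dense hash index to a possible CPU id once, at init time:

<snip>
/* filled once from vmalloc_init(): dense index -> possible CPU id */
static unsigned int vb_index_to_cpu[NR_CPUS] __read_mostly;

static void __init vb_init_index_table(void)
{
	unsigned int cpu, n = 0;

	for_each_possible_cpu(cpu)
		vb_index_to_cpu[n++] = cpu;
}

static struct xarray *
addr_to_vb_xa(unsigned long addr)
{
	int index = (addr / VMAP_BLOCK_SIZE) % num_possible_cpus();

	return &per_cpu(vmap_block_queue, vb_index_to_cpu[index]).vmap_blocks;
}
</snip>

But maybe an extra table is not worth it if simply hashing over a
different bound already works.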

BTW, using raw_smp_processor_id() in this manner is incorrect; that was my mistake.

> correctly or fully. It is my bad, I did not think that CPUs in the possible
> mask can be non-sequential :-/
>
> nr_cpu_ids is not the max possible CPU number. For example, in Nick's
> case, when he has two CPUs, num_possible_cpus() and nr_cpu_ids are the
> same.
>
> Or i missed something in your patch, Baoquan?
>
> --
> Uladzislau Rezki
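
Hmm, if I read setup_nr_cpu_ids() correctly, nr_cpu_ids is derived from
the position of the last set bit in cpu_possible_mask plus one, not from
the number of set bits, so on Nick's machine it would be 3 while
num_possible_cpus() is 2? A quick user-space mock of the two counts
(purely illustrative, not kernel code):

<snip>
#include <stdio.h>

int main(void)
{
	unsigned long possible_mask = (1UL << 0) | (1UL << 2);	/* CPUs 0, 2 */
	int nr_possible = __builtin_popcountl(possible_mask);	/* set bits */
	int last = 63 - __builtin_clzl(possible_mask);	/* highest bit, assuming 64-bit long */

	printf("num_possible_cpus() analogue: %d\n", nr_possible);	/* 2 */
	printf("nr_cpu_ids analogue:          %d\n", last + 1);	/* 3 */
	return 0;
}
</snip>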

--
help you, help me,
Hailong.



