Hi Sebastian,

On 11/11/19 11:21 AM, Sebastian Andrzej Siewior wrote:
> The phandle cache code allocates memory while holding devtree_lock which
> is a raw_spinlock_t. Memory allocation (and free()) is not possible on
> RT while a raw_spinlock_t is held.
> Invoke the kfree() and kcalloc() while the lock is dropped.

I thought the GFP_ATOMIC passed to kcalloc() in of_populate_phandle_cache()
was sufficient.  And I didn't realize (or remember) that kfree was not
allowed while a raw_spinlock_t is held.

Do you have a pointer to the preempt RT documentation that explains that?
I'd like to add that pointer to my personal notes about locking so that
I won't mis-remember this too often.

> Cc: Rob Herring <robh+dt@xxxxxxxxxx>
> Cc: Frank Rowand <frowand.list@xxxxxxxxx>
> Cc: devicetree@xxxxxxxxxxxxxxx
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@xxxxxxxxxxxxx>
> ---
>
> This is a repost of:
>    https://lore.kernel.org/linux-devicetree/20180910154227.xsbbqvw3cayro4gg@xxxxxxxxxxxxx/
>
> I mentioned this patch (briefly) to Frank, let me summarize:
>
> of_populate_phandle_cache() triggers a warning during boot on arm64 with
> RT enabled. By moving memory allocation/free outside of the locked
> section (which really disables interrupts on -RT) everything is fine
> again.
>
> The lock has been made a raw_spinlock_t in RT as part pSeries bring up.
> It then made its way upstream as:
>    28d0e36bf9686 ("OF: Fixup resursive locking code paths")
>    d6d3c4e656513 ("OF: convert devtree lock from rw_lock to raw spinlock")
>
> I've been looking into making devtree_lock a spinlock_t which would
> avoid this patch. I haven't seen an issue during boot on arm64 even
> with hotplug.
> However Power64/pSeries complained during boot:
>
> | smp: Bringing up secondary CPUs ...
> | BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:973
> | in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 0, name: swapper/1
> | 1 lock held by swapper/1/0:
> |  #0: c000000000def6e0 (devtree_lock){+.+.}, at: of_find_node_opts_by_path+0x1f4/0x230
> | Preemption disabled at:
> | [<c0000000000557a0>] start_secondary+0xd0/0x6a0
> |
> | Call Trace:
> | [c0000001f9667d10] [c000000000158e30] ___might_sleep+0x250/0x270
> | [c0000001f9667da0] [c000000000984f40] rt_spin_lock+0x70/0x90
> | [c0000001f9667de0] [c0000000007e3634] of_find_node_opts_by_path+0x1f4/0x230
> | [c0000001f9667e30] [c0000000007e3844] of_get_next_cpu_node+0x144/0x180
> | [c0000001f9667e70] [c0000000007e38d8] of_get_cpu_node+0x58/0x90
> | [c0000001f9667eb0] [c00000000002eb00] cpu_to_chip_id+0x20/0x70
> | [c0000001f9667ee0] [c000000000055858] start_secondary+0x188/0x6a0
> | [c0000001f9667f90] [c00000000000b554] start_secondary_prolog+0x10/0x14
>
> because cpu_to_chip_id() acquires devtree_lock() early in the CPU-bring
> up path.

I read too much into that sentence, and ran off on a tangent re-educating
myself on preempt RT lock stuff.

The issue in this path is that start_secondary() disables preemption
before going down the code path that ends up with an attempt by
of_find_node_opts_by_path() to lock devtree_lock.  It is ok to acquire a
raw spinlock with preemption disabled, but not ok to acquire a normal
spinlock with preemption disabled.

The calling path to cpu_to_chip_id() has an intervening call that does
not show up in the above trace, add_cpu_to_masks().  The first call of
cpu_to_chip_id() is "int chipid = cpu_to_chip_id(cpu)", which could be
moved out to start_secondary(), before preemption is disabled.
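Roughly something like this is what I have in mind (just a sketch from my
reading of arch/powerpc/kernel/smp.c, not compiled or tested, and the
extra parameter to add_cpu_to_masks() is only meant to show one way the
precomputed value could be handed down):

        void start_secondary(void *unused)
        {
                unsigned int cpu = smp_processor_id();
                /* devtree_lock could still be taken as a sleeping lock here */
                int chipid = cpu_to_chip_id(cpu);

                ...
                preempt_disable();
                ...
                /* pass the precomputed value instead of calling
                 * cpu_to_chip_id() with preemption disabled
                 */
                add_cpu_to_masks(cpu, chipid);
                ...
        }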
But at the end of add_cpu_to_masks() is:

        for_each_cpu(i, cpu_online_mask)
                if (cpu_to_chip_id(i) == chipid)
                        set_cpus_related(cpu, i, cpu_core_mask);

This use of cpu_to_chip_id() is a little harder to move to before
preemption is disabled, but it is possible.  A table of the chip ids for
all possible cpus could be created before disabling preemption, and the
table could be passed into add_cpu_to_masks().

This would allow devtree_lock to be changed to a spinlock_t.  I like this
approach because it removes the one known place that constrains what type
of lock devtree_lock is.

My second choice (and I am willing to accept this) is:

>
>  drivers/of/base.c | 19 +++++++++++++------
>  1 file changed, 13 insertions(+), 6 deletions(-)
>
> --- a/drivers/of/base.c
> +++ b/drivers/of/base.c
> @@ -138,31 +138,34 @@ static u32 phandle_cache_mask;
>  /*
>   * Caller must hold devtree_lock.
>   */

Add a one line comment here, to the effect of: kfree() can not occur
while a raw_spinlock_t is held, so the caller must do the kfree().

> -static void __of_free_phandle_cache(void)
> +static struct device_node** __of_free_phandle_cache(void)
>  {
>          u32 cache_entries = phandle_cache_mask + 1;
>          u32 k;
> +        struct device_node **shadow;
>
>          if (!phandle_cache)
> -                return;
> +                return NULL;
>
>          for (k = 0; k < cache_entries; k++)
>                  of_node_put(phandle_cache[k]);
>
> -        kfree(phandle_cache);
> +        shadow = phandle_cache;
>          phandle_cache = NULL;
> +        return shadow;
>  }
>
>  int of_free_phandle_cache(void)
>  {
>          unsigned long flags;
> +        struct device_node **shadow;
>
>          raw_spin_lock_irqsave(&devtree_lock, flags);
>
> -        __of_free_phandle_cache();
> +        shadow = __of_free_phandle_cache();
>
>          raw_spin_unlock_irqrestore(&devtree_lock, flags);
> -
> +        kfree(shadow);
>          return 0;
>  }
>  #if !defined(CONFIG_MODULES)
> @@ -197,10 +200,11 @@ void of_populate_phandle_cache(void)
>          u32 cache_entries;
>          struct device_node *np;
>          u32 phandles = 0;
> +        struct device_node **shadow;
>
>          raw_spin_lock_irqsave(&devtree_lock, flags);
>
> -        __of_free_phandle_cache();
> +        shadow = __of_free_phandle_cache();
>
>          for_each_of_allnodes(np)
>                  if (np->phandle && np->phandle != OF_PHANDLE_ILLEGAL)
> @@ -208,12 +212,14 @@ void of_populate_phandle_cache(void)
>
>          if (!phandles)
>                  goto out;

Add a one line comment here, to the effect of: a raw_spinlock_t can not
be held when calling kcalloc().

> +        raw_spin_unlock_irqrestore(&devtree_lock, flags);
>
>          cache_entries = roundup_pow_of_two(phandles);
>          phandle_cache_mask = cache_entries - 1;
>

Need to avoid a race with of_find_node_by_phandle().  So change the
following to:

        tmp_phandle_cache = kcalloc(...

>          phandle_cache = kcalloc(cache_entries, sizeof(*phandle_cache),
>                                  GFP_ATOMIC);
> +        raw_spin_lock_irqsave(&devtree_lock, flags);

Then here:

        phandle_cache = tmp_phandle_cache;

>          if (!phandle_cache)
>                  goto out;
>
> @@ -225,6 +231,7 @@ void of_populate_phandle_cache(void)
>
>  out:
>          raw_spin_unlock_irqrestore(&devtree_lock, flags);
> +        kfree(shadow);
>  }
>
>  void __init of_core_init(void)
>

The subtle race with of_find_node_by_phandle() is that if
of_find_node_by_phandle() added an entry to the cache, it also did an
of_node_get().  It is ok for of_populate_phandle_cache() to overwrite
the cache entry, but it would also do an additional of_node_get().

-Frank
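P.S. To make the tmp_phandle_cache part concrete, the kcalloc() section of
of_populate_phandle_cache() would end up looking something like this
(untested sketch only, with tmp_phandle_cache added as a local
struct device_node ** in the obvious way):

        if (!phandles)
                goto out;

        /* a raw_spinlock_t can not be held when calling kcalloc() */
        raw_spin_unlock_irqrestore(&devtree_lock, flags);

        cache_entries = roundup_pow_of_two(phandles);
        phandle_cache_mask = cache_entries - 1;

        tmp_phandle_cache = kcalloc(cache_entries, sizeof(*phandle_cache),
                                    GFP_ATOMIC);

        raw_spin_lock_irqsave(&devtree_lock, flags);

        /*
         * Publish the new array only after re-acquiring devtree_lock, so
         * that of_find_node_by_phandle() can not have inserted (and
         * of_node_get()'d) an entry that the fill loop below would then
         * overwrite, leaking that reference.
         */
        phandle_cache = tmp_phandle_cache;
        if (!phandle_cache)
                goto out;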