Hi Sebastian,

On 11/29/19 7:57 AM, Sebastian Andrzej Siewior wrote:
> On 2019-11-12 16:48:12 [-0600], Frank Rowand wrote:
>> Hi Sebastian,
> Hi Frank,
> 
>> On 11/11/19 11:21 AM, Sebastian Andrzej Siewior wrote:
>>> The phandle cache code allocates memory while holding devtree_lock which
>>> is a raw_spinlock_t. Memory allocation (and free()) is not possible on
>>> RT while a raw_spinlock_t is held.
>>> Invoke the kfree() and kcalloc() while the lock is dropped.
>>
>> I thought the GFP_ATOMIC passed to kcalloc() in of_populate_phandle_cache()
>> was sufficient.  And I didn't realize (or remember) that kfree was
>> not allowed while a raw_spinlock_t is held.  Do you have a
>> pointer to the preempt RT documentation that explains that?
>> I'd like to add that pointer to my personal notes about locking so
>> that I won't mis-remember this too often.
> 
> Unfortunately not yet. I have a mini document that needs polishing…
> However, as far as my RT memory goes:
> GFP_ATOMIC can be used in non-blocking context. On !RT that includes
> disabled preemption and/or interrupts, _usually_ as part of locking (like
> spin_lock_irqsave()) or an inherited context like an IRQ handler or
> softirq, for instance.
> On RT the former examples are no longer non-blocking contexts (sleeping
> spinlocks, threaded IRQs). Locking with a raw_spinlock_t still disables
> preemption and interrupts as part of the procedure. The GFP_ATOMIC in
> this case does not matter on RT in terms of blocking (the part with
> emergency pools and so on remains unchanged).
> The kfree() part within a raw_spinlock_t section has the same
> limitations on RT as kmalloc(), for the same reason. The SLUB part of the
> function is fine. The problem is once SLUB has to call into the page
> allocator in order to return memory or ask for more; that can block,
> which is not allowed while a raw_spinlock_t is held.

Thanks for writing this up.  And thanks for digging even deeper into the
relevant context than I did below.

-Frank

> 
> …
>>> | smp: Bringing up secondary CPUs ...
>>> | BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:973
>>> | in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 0, name: swapper/1
>>> | 1 lock held by swapper/1/0:
>>> |  #0: c000000000def6e0 (devtree_lock){+.+.}, at: of_find_node_opts_by_path+0x1f4/0x230
>>> | Preemption disabled at:
>>> | [<c0000000000557a0>] start_secondary+0xd0/0x6a0
>>> |
>>> | Call Trace:
>>> | [c0000001f9667d10] [c000000000158e30] ___might_sleep+0x250/0x270
>>> | [c0000001f9667da0] [c000000000984f40] rt_spin_lock+0x70/0x90
>>> | [c0000001f9667de0] [c0000000007e3634] of_find_node_opts_by_path+0x1f4/0x230
>>> | [c0000001f9667e30] [c0000000007e3844] of_get_next_cpu_node+0x144/0x180
>>> | [c0000001f9667e70] [c0000000007e38d8] of_get_cpu_node+0x58/0x90
>>> | [c0000001f9667eb0] [c00000000002eb00] cpu_to_chip_id+0x20/0x70
>>> | [c0000001f9667ee0] [c000000000055858] start_secondary+0x188/0x6a0
>>> | [c0000001f9667f90] [c00000000000b554] start_secondary_prolog+0x10/0x14
>>>
>>> because cpu_to_chip_id() acquires devtree_lock() early in the CPU
>>> bring-up path.
>>
>> I read too much into that sentence, and ran off on a tangent re-educating
>> myself on preempt RT lock stuff.
>>
>> The issue in this path is that start_secondary() disables preemption before
>> going down the code path that ends up with an attempt by of_find_node_opts_by_path()
>> to lock devtree_lock.  It is ok to acquire a raw spinlock with preemption
>> disabled, but not ok to acquire a normal spinlock with preemption disabled.
> That is correct. So
> 	spin_lock();
> 	spin_lock();
> 
> is okay. However
> 	preempt_disable();
> 	spin_lock();
> 
> is not okay. The same is true for s/preempt_disable/local_irq_disable()/.
> 
> Please note that even before the preempt_disable() in start_secondary()
> the whole code runs with interrupts disabled as part of CPU bring-up.
> 
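To make the contrast above concrete, here is a minimal kernel-style sketch of
the two cases, assuming PREEMPT_RT semantics; foo_lock, foo_raw_lock and the
two functions are hypothetical names used only for illustration:

static DEFINE_SPINLOCK(foo_lock);		/* a sleeping lock on RT */
static DEFINE_RAW_SPINLOCK(foo_raw_lock);	/* a true spinlock, even on RT */

static void foo_ok(void)
{
	preempt_disable();
	/* Fine: a raw_spinlock_t may be taken with preemption (or
	 * interrupts) already disabled; it only disables them further. */
	raw_spin_lock(&foo_raw_lock);
	raw_spin_unlock(&foo_raw_lock);
	preempt_enable();
}

static void foo_not_ok_on_rt(void)
{
	preempt_disable();
	/* Not okay on RT: spinlock_t is rt_mutex based underneath and may
	 * sleep, producing the "sleeping function called from invalid
	 * context" splat shown in the trace above. */
	spin_lock(&foo_lock);
	spin_unlock(&foo_lock);
	preempt_enable();
}

The same restriction is why kcalloc()/kfree() must not be called while
devtree_lock (a raw_spinlock_t) is held: they can end up in the page
allocator, which may sleep on RT.
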
>> The calling path to cpu_to_chip_id() has an intervening call that does not
>> show up in the above trace, add_cpu_to_masks().  The first call of cpu_to_chip_id()
>> is "int chipid = cpu_to_chip_id(cpu)", which could be moved out to start_secondary(),
>> before preemption is disabled.  But at the end of add_cpu_to_masks() is:
>>
>>	for_each_cpu(i, cpu_online_mask)
>>		if (cpu_to_chip_id(i) == chipid)
>>			set_cpus_related(cpu, i, cpu_core_mask);
>>
>> This use of cpu_to_chip_id() is a little harder to move to before preemption
>> is disabled, but it is possible.  A table of the chip ids for all possible cpus
>> could be created before disabling preemption, and the table could be passed into
>> add_cpu_to_masks().  This would allow devtree_lock to be changed to a
>> spinlock_t.
>>
>> I like this approach because it removes the one known place that constrains
>> what type of lock devtree_lock is.
> 
> So even if we move that whole section before preempt_disable(), we still
> run with disabled interrupts and can't use spinlock_t.
> 
>> My second choice (and I am willing to accept this) is:
>>
>>>
>>>  drivers/of/base.c | 19 +++++++++++++------
>>>  1 file changed, 13 insertions(+), 6 deletions(-)
>>>
>>> --- a/drivers/of/base.c
>>> +++ b/drivers/of/base.c
>>> @@ -138,31 +138,34 @@ static u32 phandle_cache_mask;
>>>  /*
>>>   * Caller must hold devtree_lock.
>>>   */
>>
>> Add a one line comment to the effect of: kfree() can not occur while a
>> raw_spinlock_t is held, so the caller must do the kfree().
> 
> okay.
> 
> …
>>> @@ -208,12 +212,14 @@ void of_populate_phandle_cache(void)
>>>  
>>>  	if (!phandles)
>>>  		goto out;
>>
>> Add a one line comment to the effect of: a raw_spinlock_t can not be held
>> when calling kcalloc().
> okay.
> 
>>> +	raw_spin_unlock_irqrestore(&devtree_lock, flags);
>>>  
>>>  	cache_entries = roundup_pow_of_two(phandles);
>>>  	phandle_cache_mask = cache_entries - 1;
>>>  
>>
>> Need to avoid a race with of_find_node_by_phandle().  So change the following
>> to tmp_phandle_cache = kcalloc(...
> okay.
> 
>>>  	phandle_cache = kcalloc(cache_entries, sizeof(*phandle_cache),
>>>  				GFP_ATOMIC);
>>> +	raw_spin_lock_irqsave(&devtree_lock, flags);
>>
>> Then here:
>>
>>	phandle_cache = tmp_phandle_cache;
>>
>>>  	if (!phandle_cache)
>>>  		goto out;
>>>  
>>> @@ -225,6 +231,7 @@ void of_populate_phandle_cache(void)
>>>  
>>>  out:
>>>  	raw_spin_unlock_irqrestore(&devtree_lock, flags);
>>> +	kfree(shadow);
>>>  }
>>>  
>>>  void __init of_core_init(void)
>>>
>>
>> The subtle race with of_find_node_by_phandle() is that if
>> of_find_node_by_phandle() added an entry to the cache, it
>> also did an of_node_get().  It is ok for of_populate_phandle_cache()
>> to overwrite the cache entry, but it would also do an additional
>> of_node_get().
> okay.
> 
>> -Frank
> 
> Sebastian
> 
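For reference, a rough sketch of how of_populate_phandle_cache() might end up
looking with the patch above plus the tmp_phandle_cache suggestion folded in.
The parts not visible in the quoted hunks (the counting and fill loops, and
__of_free_phandle_cache() returning the old cache) are reconstructed from
context and may not match the tree exactly, so treat this as an illustration
of the locking pattern rather than the final patch; the extra of_node_get()
question raised above is left open.

void of_populate_phandle_cache(void)
{
	unsigned long flags;
	u32 cache_entries;
	struct device_node *np;
	u32 phandles = 0;
	struct device_node **shadow, **tmp_phandle_cache;

	raw_spin_lock_irqsave(&devtree_lock, flags);

	/* Returns the old cache: kfree() can not occur while the
	 * raw_spinlock_t is held, so the caller frees it below. */
	shadow = __of_free_phandle_cache();

	for_each_of_allnodes(np)
		if (np->phandle && np->phandle != OF_PHANDLE_ILLEGAL)
			phandles++;

	if (!phandles)
		goto out;

	/* kcalloc() may reach the page allocator and block on RT, so the
	 * raw_spinlock_t is dropped around the allocation. */
	raw_spin_unlock_irqrestore(&devtree_lock, flags);

	cache_entries = roundup_pow_of_two(phandles);
	phandle_cache_mask = cache_entries - 1;

	tmp_phandle_cache = kcalloc(cache_entries, sizeof(*phandle_cache),
				    GFP_ATOMIC);

	raw_spin_lock_irqsave(&devtree_lock, flags);
	phandle_cache = tmp_phandle_cache;
	if (!phandle_cache)
		goto out;

	for_each_of_allnodes(np)
		if (np->phandle && np->phandle != OF_PHANDLE_ILLEGAL) {
			/* An entry inserted by of_find_node_by_phandle()
			 * during the unlocked window already holds a node
			 * reference; overwriting it here takes an additional
			 * of_node_get(), which is the subtle point raised
			 * above. */
			of_node_get(np);
			phandle_cache[np->phandle & phandle_cache_mask] = np;
		}

out:
	raw_spin_unlock_irqrestore(&devtree_lock, flags);
	kfree(shadow);
}

Keeping the kcalloc() result in tmp_phandle_cache and assigning the global
phandle_cache only after re-taking devtree_lock avoids publishing the new,
still empty cache to a concurrent of_find_node_by_phandle() while
of_populate_phandle_cache() is running outside the lock.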