Hi Sebastian,

On 11/29/19 7:57 AM, Sebastian Andrzej Siewior wrote:
> On 2019-11-12 16:48:12 [-0600], Frank Rowand wrote:
>> Hi Sebastian,
> Hi Frank,
> 
>> On 11/11/19 11:21 AM, Sebastian Andrzej Siewior wrote:
>>> The phandle cache code allocates memory while holding devtree_lock which
>>> is a raw_spinlock_t. Memory allocation (and free()) is not possible on
>>> RT while a raw_spinlock_t is held.
>>> Invoke the kfree() and kcalloc() while the lock is dropped.
>>
>> I thought the GFP_ATOMIC passed to kcalloc() in of_populate_phandle_cache()
>> was sufficient.  And I didn't realize (or remember) that kfree was
>> not allowed while a raw_spinlock_t is held.  Do you have a
>> pointer to the preempt RT documentation that explains that?
>> I'd like to add that pointer to my personal notes about locking so
>> that I won't mis-remember this too often.
> 
> Unfortunately not yet. I have a mini document that needs polishing…
> However, as far as my RT memory goes:
> GFP_ATOMIC can be used in non-blocking context. On !RT that includes
> disabled preemption and/or interrupts, _usually_ as part of locking (like
> spin_lock_irqsave()) or an inherited context like an IRQ handler or
> softirq, for instance.
> On RT the former examples are no longer non-blocking contexts (sleeping
> spinlocks, threaded IRQs). Locking with a raw_spinlock_t still disables
> preemption and interrupts as part of the procedure. The GFP_ATOMIC in
> this case does not matter on RT in terms of blocking (the part with
> emergency pools and so on remains unchanged).
> The kfree() part within a raw_spinlock_t section has the same
> limitations on RT as kmalloc(), for the same reason. The SLUB part of the
> function is fine. The problem is once SLUB has to call into the page
> allocator in order to return memory or ask for more; that can block,
> which is not allowed while a raw_spinlock_t is held.

Thanks for writing this up.  And thanks for digging even deeper into the
relevant context than I did below.

-Frank

> 
> …
>>> | smp: Bringing up secondary CPUs ...
>>> | BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:973
>>> | in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 0, name: swapper/1
>>> | 1 lock held by swapper/1/0:
>>> |  #0: c000000000def6e0 (devtree_lock){+.+.}, at: of_find_node_opts_by_path+0x1f4/0x230
>>> | Preemption disabled at:
>>> | [<c0000000000557a0>] start_secondary+0xd0/0x6a0
>>> |
>>> | Call Trace:
>>> | [c0000001f9667d10] [c000000000158e30] ___might_sleep+0x250/0x270
>>> | [c0000001f9667da0] [c000000000984f40] rt_spin_lock+0x70/0x90
>>> | [c0000001f9667de0] [c0000000007e3634] of_find_node_opts_by_path+0x1f4/0x230
>>> | [c0000001f9667e30] [c0000000007e3844] of_get_next_cpu_node+0x144/0x180
>>> | [c0000001f9667e70] [c0000000007e38d8] of_get_cpu_node+0x58/0x90
>>> | [c0000001f9667eb0] [c00000000002eb00] cpu_to_chip_id+0x20/0x70
>>> | [c0000001f9667ee0] [c000000000055858] start_secondary+0x188/0x6a0
>>> | [c0000001f9667f90] [c00000000000b554] start_secondary_prolog+0x10/0x14
>>>
>>> because cpu_to_chip_id() acquires devtree_lock() early in the CPU
>>> bring-up path.
>>
>> I read too much into that sentence, and ran off on a tangent re-educating
>> myself on preempt RT lock stuff.
>>
>> The issue in this path is that start_secondary() disables preemption before
>> going down the code path that ends up with an attempt by of_find_node_opts_by_path()
>> to lock devtree_lock.  It is ok to acquire a raw spinlock with preemption
>> disabled, but not ok to acquire a normal spinlock with preemption disabled.
> That is correct. So
> 	spin_lock();
> 	spin_lock();
> 
> is okay. However
> 	preempt_disable();
> 	spin_lock();
> 
> is not okay. The same is true for s/preempt_disable/local_irq_disable()/.
> 
> Please note that even before the preempt_disable() in start_secondary()
> the whole code runs with interrupts disabled as part of CPU bring-up.
> 
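To make the contrast above concrete, here is a minimal kernel-style sketch of
the two cases, assuming PREEMPT_RT semantics; foo_lock, foo_raw_lock and the
two functions are hypothetical names used only for illustration:

static DEFINE_SPINLOCK(foo_lock);		/* a sleeping lock on RT */
static DEFINE_RAW_SPINLOCK(foo_raw_lock);	/* a true spinlock, even on RT */

static void foo_ok(void)
{
	preempt_disable();
	/* Fine: a raw_spinlock_t may be taken with preemption (or
	 * interrupts) already disabled; it only disables them further. */
	raw_spin_lock(&foo_raw_lock);
	raw_spin_unlock(&foo_raw_lock);
	preempt_enable();
}

static void foo_not_ok_on_rt(void)
{
	preempt_disable();
	/* Not okay on RT: spinlock_t is rt_mutex based underneath and may
	 * sleep, producing the "sleeping function called from invalid
	 * context" splat shown in the trace above. */
	spin_lock(&foo_lock);
	spin_unlock(&foo_lock);
	preempt_enable();
}

The same restriction is why kcalloc()/kfree() must not be called while
devtree_lock (a raw_spinlock_t) is held: they can end up in the page
allocator, which may sleep on RT.
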
>> The calling path to cpu_to_chip_id() has an intervening call that does not
>> show up in the above trace, add_cpu_to_masks().  The first call of cpu_to_chip_id()
>> is "int chipid = cpu_to_chip_id(cpu)", which could be moved out to start_secondary(),
>> before preemption is disabled.  But at the end of add_cpu_to_masks() is:
>>
>>	for_each_cpu(i, cpu_online_mask)
>>		if (cpu_to_chip_id(i) == chipid)
>>			set_cpus_related(cpu, i, cpu_core_mask);
>>
>> This use of cpu_to_chip_id() is a little harder to move to before preemption
>> is disabled, but it is possible.  A table of the chip ids for all possible cpus
>> could be created before disabling preemption, and the table could be passed into
>> add_cpu_to_masks().  This would allow devtree_lock to be changed to a
>> spinlock_t.
>>
>> I like this approach because it removes the one known place that constrains
>> what type of lock devtree_lock is.
> 
> So even if we move that whole section before preempt_disable(), we still
> run with disabled interrupts and can't use spinlock_t.
> 
>> My second choice (and I am willing to accept this) is:
>>
>>>
>>>  drivers/of/base.c | 19 +++++++++++++------
>>>  1 file changed, 13 insertions(+), 6 deletions(-)
>>>
>>> --- a/drivers/of/base.c
>>> +++ b/drivers/of/base.c
>>> @@ -138,31 +138,34 @@ static u32 phandle_cache_mask;
>>>  /*
>>>   * Caller must hold devtree_lock.
>>>   */
>>
>> Add a one line comment to the effect of: kfree() can not occur while a
>> raw_spinlock_t is held, so the caller must do the kfree().
> 
> okay.
> 
> …
>>> @@ -208,12 +212,14 @@ void of_populate_phandle_cache(void)
>>>  
>>>  	if (!phandles)
>>>  		goto out;
>>
>> Add a one line comment to the effect of: a raw_spinlock_t can not be held
>> when calling kcalloc().
> okay.
> 
>>> +	raw_spin_unlock_irqrestore(&devtree_lock, flags);
>>>  
>>>  	cache_entries = roundup_pow_of_two(phandles);
>>>  	phandle_cache_mask = cache_entries - 1;
>>>  
>>
>> Need to avoid a race with of_find_node_by_phandle().  So change the following
>> to tmp_phandle_cache = kcalloc(...
> okay.
> 
>>>  	phandle_cache = kcalloc(cache_entries, sizeof(*phandle_cache),
>>>  				GFP_ATOMIC);
>>> +	raw_spin_lock_irqsave(&devtree_lock, flags);
>>
>> Then here:
>>
>>	phandle_cache = tmp_phandle_cache;
>>
>>>  	if (!phandle_cache)
>>>  		goto out;
>>>  
>>> @@ -225,6 +231,7 @@ void of_populate_phandle_cache(void)
>>>  
>>>  out:
>>>  	raw_spin_unlock_irqrestore(&devtree_lock, flags);
>>> +	kfree(shadow);
>>>  }
>>>  
>>>  void __init of_core_init(void)
>>>
>>
>> The subtle race with of_find_node_by_phandle() is that if
>> of_find_node_by_phandle() added an entry to the cache, it
>> also did an of_node_get().  It is ok for of_populate_phandle_cache()
>> to overwrite the cache entry, but it would also do an additional
>> of_node_get().
> okay.
> 
>> -Frank
> 
> Sebastian
> 
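For reference, a rough sketch of how of_populate_phandle_cache() might end up
looking with the patch above plus the tmp_phandle_cache suggestion folded in.
The parts not visible in the quoted hunks (the counting and fill loops, and
__of_free_phandle_cache() returning the old cache) are reconstructed from
context and may not match the tree exactly, so treat this as an illustration
of the locking pattern rather than the final patch; the extra of_node_get()
question raised above is left open.

void of_populate_phandle_cache(void)
{
	unsigned long flags;
	u32 cache_entries;
	struct device_node *np;
	u32 phandles = 0;
	struct device_node **shadow, **tmp_phandle_cache;

	raw_spin_lock_irqsave(&devtree_lock, flags);

	/* Returns the old cache: kfree() can not occur while the
	 * raw_spinlock_t is held, so the caller frees it below. */
	shadow = __of_free_phandle_cache();

	for_each_of_allnodes(np)
		if (np->phandle && np->phandle != OF_PHANDLE_ILLEGAL)
			phandles++;

	if (!phandles)
		goto out;

	/* kcalloc() may reach the page allocator and block on RT, so the
	 * raw_spinlock_t is dropped around the allocation. */
	raw_spin_unlock_irqrestore(&devtree_lock, flags);

	cache_entries = roundup_pow_of_two(phandles);
	phandle_cache_mask = cache_entries - 1;

	tmp_phandle_cache = kcalloc(cache_entries, sizeof(*phandle_cache),
				    GFP_ATOMIC);

	raw_spin_lock_irqsave(&devtree_lock, flags);
	phandle_cache = tmp_phandle_cache;
	if (!phandle_cache)
		goto out;

	for_each_of_allnodes(np)
		if (np->phandle && np->phandle != OF_PHANDLE_ILLEGAL) {
			/* An entry inserted by of_find_node_by_phandle()
			 * during the unlocked window already holds a node
			 * reference; overwriting it here takes an additional
			 * of_node_get(), which is the subtle point raised
			 * above. */
			of_node_get(np);
			phandle_cache[np->phandle & phandle_cache_mask] = np;
		}

out:
	raw_spin_unlock_irqrestore(&devtree_lock, flags);
	kfree(shadow);
}

Keeping the kcalloc() result in tmp_phandle_cache and assigning the global
phandle_cache only after re-taking devtree_lock avoids publishing the new,
still empty cache to a concurrent of_find_node_by_phandle() while
of_populate_phandle_cache() is running outside the lock.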