Re: [PATCH] xen: don't reschedule in preemption off sections

Jan Beulich <jbeulich@xxxxxxxx> · Fri, 10 Jul 2020 13:55:28 +0200

On 10.07.2020 12:50, Jürgen Groß wrote:
> On 10.07.20 11:49, Jan Beulich wrote:
>> On 10.07.2020 09:50, Juergen Gross wrote:
>>> For support of long running hypercalls xen_maybe_preempt_hcall() is
>>> calling cond_resched() in case a hypercall marked as preemptible has
>>> been interrupted.
>>>
>>> Normally this is no problem, as only hypercalls done via some ioctl()s
>>> are marked to be preemptible. In rare cases when during such a
>>> preemptible hypercall an interrupt occurs and any softirq action is
>>> started from irq_exit(), a further hypercall issued by the softirq
>>> handler will be regarded to be preemptible, too. This might lead to
>>> rescheduling in spite of the softirq handler potentially having set
>>> preempt_disable(), leading to splats like:
>>>
>>> BUG: sleeping function called from invalid context at drivers/xen/preempt.c:37
>>> in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 20775, name: xl
>>> INFO: lockdep is turned off.
>>> CPU: 1 PID: 20775 Comm: xl Tainted: G D W 5.4.46-1_prgmr_debug.el7.x86_64 #1
>>> Call Trace:
>>> <IRQ>
>>> dump_stack+0x8f/0xd0
>>> ___might_sleep.cold.76+0xb2/0x103
>>> xen_maybe_preempt_hcall+0x48/0x70
>>> xen_do_hypervisor_callback+0x37/0x40
>>> RIP: e030:xen_hypercall_xen_version+0xa/0x20
>>> Code: ...
>>> RSP: e02b:ffffc900400dcc30 EFLAGS: 00000246
>>> RAX: 000000000004000d RBX: 0000000000000200 RCX: ffffffff8100122a
>>> RDX: ffff88812e788000 RSI: 0000000000000000 RDI: 0000000000000000
>>> RBP: ffffffff83ee3ad0 R08: 0000000000000001 R09: 0000000000000001
>>> R10: 0000000000000000 R11: 0000000000000246 R12: ffff8881824aa0b0
>>> R13: 0000000865496000 R14: 0000000865496000 R15: ffff88815d040000
>>> ? xen_hypercall_xen_version+0xa/0x20
>>> ? xen_force_evtchn_callback+0x9/0x10
>>> ? check_events+0x12/0x20
>>> ? xen_restore_fl_direct+0x1f/0x20
>>> ? _raw_spin_unlock_irqrestore+0x53/0x60
>>> ? debug_dma_sync_single_for_cpu+0x91/0xc0
>>> ? _raw_spin_unlock_irqrestore+0x53/0x60
>>> ? xen_swiotlb_sync_single_for_cpu+0x3d/0x140
>>> ? mlx4_en_process_rx_cq+0x6b6/0x1110 [mlx4_en]
>>> ? mlx4_en_poll_rx_cq+0x64/0x100 [mlx4_en]
>>> ? net_rx_action+0x151/0x4a0
>>> ? __do_softirq+0xed/0x55b
>>> ? irq_exit+0xea/0x100
>>> ? xen_evtchn_do_upcall+0x2c/0x40
>>> ? xen_do_hypervisor_callback+0x29/0x40
>>> </IRQ>
>>> ? xen_hypercall_domctl+0xa/0x20
>>> ? xen_hypercall_domctl+0x8/0x20
>>> ? privcmd_ioctl+0x221/0x990 [xen_privcmd]
>>> ? do_vfs_ioctl+0xa5/0x6f0
>>> ? ksys_ioctl+0x60/0x90
>>> ? trace_hardirqs_off_thunk+0x1a/0x20
>>> ? __x64_sys_ioctl+0x16/0x20
>>> ? do_syscall_64+0x62/0x250
>>> ? entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>>
>>> Fix that by testing preempt_count() before calling cond_resched().
>>>
>>> In kernel 5.8 this can't happen any more due to the entry code rework.
>>>
>>> Reported-by: Sarah Newman <srn@xxxxxxxxx>
>>> Fixes: 0fa2f5cb2b0ecd8 ("sched/preempt, xen: Use need_resched() instead of should_resched()")
>>> Cc: Sarah Newman <srn@xxxxxxxxx>
>>> Signed-off-by: Juergen Gross <jgross@xxxxxxxx>
>>> ---
>>>   drivers/xen/preempt.c | 2 +-
>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/xen/preempt.c b/drivers/xen/preempt.c
>>> index 17240c5325a3..6ad87b5c95ed 100644
>>> --- a/drivers/xen/preempt.c
>>> +++ b/drivers/xen/preempt.c
>>> @@ -27,7 +27,7 @@ EXPORT_SYMBOL_GPL(xen_in_preemptible_hcall);
>>>   asmlinkage __visible void xen_maybe_preempt_hcall(void)
>>>   {
>>>   	if (unlikely(__this_cpu_read(xen_in_preemptible_hcall)
>>> -		     && need_resched())) {
>>> +		     && need_resched() && !preempt_count())) {
>>
>> Doesn't this have the at least latent risk of hiding issues in
>> other call trees (by no longer triggering logging like the one
>> that has propmted this change)? Wouldn't it be better to save,
>> clear, and restore the flag in one of xen_evtchn_do_upcall() or
>> xen_do_hypervisor_callback()?
> 
> First regarding "risk of hiding issues": it seems as if lots of kernels
> aren't even configured to trigger this logging. It would need
> CONFIG_DEBUG_ATOMIC_SLEEP to be enabled and at least SUSE kernels don't
> seem to have it on. I suspect the occasional xen_mc_flush() failures we
> have seen are related to this problem.
> 
> And in theory saving, clearing and restoring the flag would be fine, but
> it can't be done in a single function with the code flow as up to 5.7.
> What would need to be done is to save and clear the flag in e.g.
> __xen_evtchn_do_upcall() and to pass it to xen_maybe_preempt_hcall() as
> a parameter. In xen_maybe_preempt_hcall() the passed flag value would
> need to be used for the decision whether to call cond_resched() and then
> the flag could be restored (after the cond_resched() call).

I'm afraid I don't follow: If __xen_evtchn_do_upcall() cleared the flag,
xen_maybe_preempt_hcall() would amount to a no-op (up and until the
flag's prior value would get restored), wouldn't it? No need to pass
anything into there.

Jan