Hi Kai, On Tue, Aug 15, 2023 at 1:58 AM Huang, Kai <kai.huang@xxxxxxxxx> wrote: > > On Mon, 2023-08-14 at 06:37 +0200, Jack Wang wrote: > > We hit softlocup with following call trace: > > kernel: [ 2041.288210] Call Trace: > > kernel: [ 2041.288213] <IRQ> > > kernel: [ 2041.288216] ? watchdog_timer_fn+0x1b4/0x210 > > kernel: [ 2041.288222] ? lockup_detector_update_enable+0x50/0x50 > > kernel: [ 2041.288225] ? __hrtimer_run_queues+0x127/0x280 > > kernel: [ 2041.288230] ? hrtimer_interrupt+0xfc/0x210 > > kernel: [ 2041.288233] ? __sysvec_apic_timer_interrupt+0x59/0xd0 > > kernel: [ 2041.288237] ? sysvec_apic_timer_interrupt+0x6d/0x90 > > kernel: [ 2041.288242] </IRQ> > > kernel: [ 2041.288242] <TASK> > > kernel: [ 2041.288244] ? asm_sysvec_apic_timer_interrupt+0x16/0x20 > > kernel: [ 2041.288250] xa_erase+0x21/0xb0 > > kernel: [ 2041.288256] ? sgx_free_epc_page+0x20/0x50 > > kernel: [ 2041.288260] sgx_vepc_release+0x75/0x220 > > kernel: [ 2041.288263] __fput+0x89/0x250 > > kernel: [ 2041.288269] task_work_run+0x59/0x90 > > kernel: [ 2041.288275] do_exit+0x337/0x9a0 > > > > Similar like 8795359e35bc ("x86/sgx: Silence softlockup detection when releasing large enclaves") > > The test system has 64GB of enclave memory, and all assigned to a single > > VM. Release vepc take longer time and triggers the softlockup warning. > > > > Add a cond_resched() to give other tasks a chance to run and placate > > the softlockup detector. > > > > Cc: Jarkko Sakkinen <jarkko@xxxxxxxxxx> > > Cc: Haitao Huang <haitao.huang@xxxxxxxxxxxxxxx> > > Reported-by: Yu Zhang <yu.zhang@xxxxxxxxx> > > > > Signed-off-by: Jack Wang <jinpu.wang@xxxxxxxxx> > > --- > > arch/x86/kernel/cpu/sgx/virt.c | 2 ++ > > 1 file changed, 2 insertions(+) > > > > diff --git a/arch/x86/kernel/cpu/sgx/virt.c b/arch/x86/kernel/cpu/sgx/virt.c > > index c3e37eaec8ec..01d2795792cc 100644 > > --- a/arch/x86/kernel/cpu/sgx/virt.c > > +++ b/arch/x86/kernel/cpu/sgx/virt.c > > @@ -204,6 +204,7 @@ static int sgx_vepc_release(struct inode *inode, struct file *file) > > continue; > > > > xa_erase(&vepc->page_array, index); > > + cond_resched(); > > } > > > > /* > > @@ -222,6 +223,7 @@ static int sgx_vepc_release(struct inode *inode, struct file *file) > > list_add_tail(&epc_page->list, &secs_pages); > > > > xa_erase(&vepc->page_array, index); > > + cond_resched(); > > } > > > > /* > > This adds cond_resched() to the first two loops: releasing non-SECS pages and > releasing SECS pages which don't have children on remote vepc instance(s). > There is a third round loop to release SECS pages which have children on remote > vepc instance(s). I think it's better to add cond_resched() there too, although > it's extremely unlikely it could trigger softlockup w/o cond_resched(). I also think the third one are extremely unlikely could cause softlockup. And our tests do not hit it either. > > Also similar to 8795359e35bc ("x86/sgx: Silence softlockup detection when > releasing large enclaves"), this looks stable-kernel material. So perhaps add: > > Cc: stable@xxxxxxxxxxxxxxx I've added the fixed tag and cc stable in the v2 patch. Thx for the feedback and review.