Sven Schnelle <svens@xxxxxxxxxxxxx> writes: > Alex Bennée <alex.bennee@xxxxxxxxxx> writes: > >> Sven Schnelle <svens@xxxxxxxxxxxxx> writes: >> >>> Hi, >>> >>> David Hildenbrand <david@xxxxxxxxxx> writes: >>> >>>> On 04.05.22 09:37, Janosch Frank wrote: >>>>> I had a short look yesterday and the boot usually hangs in the raid6 >>>>> code. Disabling vector instructions didn't make a difference but a few >>>>> interruptions via GDB solve the problem for some reason. >>>>> >>>>> CCing David and Thomas for TCG >>>>> >>>> >>>> I somehow recall that KASAN was always disabled under TCG, I might be >>>> wrong (I thought we'd get a message early during boot that the HW >>>> doesn't support KASAN). >>>> >>>> I recall that raid code is a heavy user of vector instructions. >>>> >>>> How can I reproduce? Compile upstream (or -next?) with kasan support and >>>> run it under TCG? >>> >>> I spent some time looking into this. It's usually hanging in >>> s390vx8_gen_syndrome(). My first thought was that it is a problem with >>> the VX instructions, but turned out that it hangs even if i remove all >>> the code from s390vx8_gen_syndrome(). >>> >>> Tracing the execution of TB's, i see that the generated code is always >>> jumping between a few TB's, but never exiting the TB's to check for >>> interrupts (i.e. return to cpu_tb_exec(). I only see calls to >>> helper_lookup_tb_ptr to lookup the tb pointer for the next TB. >>> >>> The raid6 code is waiting for some time to expire by reading jiffies, >>> but interrupts are never processed and therefore jiffies doesn't change. >>> So the raid6 code hangs forever. >>> >>> As a test, i made a quick change to test: >>> >>> diff --git a/accel/tcg/cpu-exec.c b/accel/tcg/cpu-exec.c >>> index c997c2e8e0..35819fd5a7 100644 >>> --- a/accel/tcg/cpu-exec.c >>> +++ b/accel/tcg/cpu-exec.c >>> @@ -319,7 +319,8 @@ const void *HELPER(lookup_tb_ptr)(CPUArchState *env) >>> cpu_get_tb_cpu_state(env, &pc, &cs_base, &flags); >>> >>> cflags = curr_cflags(cpu); >>> - if (check_for_breakpoints(cpu, pc, &cflags)) { >>> + if (check_for_breakpoints(cpu, pc, &cflags) || >>> + unlikely(qatomic_read(&cpu->interrupt_request))) { >>> cpu_loop_exit(cpu); >>> } >>> >>> And that makes the problem go away. But i'm not familiar with the TCG >>> internals, so i can't say whether the generated code is incorrect or >>> something else is wrong. I have tcg log files of a failing + working run >>> if someone wants to take a look. They are rather large so i would have to >>> upload them somewhere. >> >> Whatever is setting cpu->interrupt_request should be calling >> cpu_exit(cpu) which sets the exit flag which is checked at the start of >> every TB execution (see gen_tb_start). > > Thanks, that was very helpful. I added debugging and it turned out > that the TB is left because of a pending irq. The code then calls > s390_cpu_exec_interrupt: > > bool s390_cpu_exec_interrupt(CPUState *cs, int interrupt_request) > { > if (interrupt_request & CPU_INTERRUPT_HARD) { > S390CPU *cpu = S390_CPU(cs); > CPUS390XState *env = &cpu->env; > > if (env->ex_value) { > /* Execution of the target insn is indivisible from > the parent EXECUTE insn. */ > return false; > } > if (s390_cpu_has_int(cpu)) { > s390_cpu_do_interrupt(cs); > return true; > } > if (env->psw.mask & PSW_MASK_WAIT) { > /* Woken up because of a floating interrupt but it has already > * been delivered. Go back to sleep. */ > cpu_interrupt(CPU(cpu), CPU_INTERRUPT_HALT); > } > } > return false; > } > > Note the 'if (env->ex_value) { }' check. It looks like this function > just returns false in case tcg is executing an EX instruction. After > that the information that the TB should be exited because of an > interrupt is gone. So the TB's are never exited again, although the > interrupt wasn't handled. At least that's my assumption now, if i'm > wrong please tell me. Looking at the code i see CF_NOIRQ to prevent TB's from getting interrupted. But i only see that used in the core tcg code. Would that be a possibility, or is there something else/better? Sorry for the dumb questions, i'm not often working on qemu ;-)