Re: [kvm-unit-tests PATCH v1] arch-run: Wait for incoming socket being removed

"Nicholas Piggin" <npiggin@xxxxxxxxx> · Tue, 12 Mar 2024 15:39:56 +1000

On Wed Mar 6, 2024 at 11:03 PM AEST, Nico Boehr wrote:
> Quoting Marc Hartmayer (2024-03-05 19:12:16)
> [...]
> > > diff --git a/scripts/arch-run.bash b/scripts/arch-run.bash
> > > index 2214d940cf7d..413f3eda8cb8 100644
> > > --- a/scripts/arch-run.bash
> > > +++ b/scripts/arch-run.bash
> > > @@ -237,12 +237,8 @@ do_migration ()
> > >       echo > ${dst_infifo}
> > >       rm ${dst_infifo}
> > >  
> > > -     # Ensure the incoming socket is removed, ready for next destination
> > > -     if [ -S ${dst_incoming} ] ; then
> > > -             echo "ERROR: Incoming migration socket not removed after migration." >& 2
> > > -             qmp ${dst_qmp} '"quit"'> ${dst_qmpout} 2>/dev/null
> > > -             return 2
> > > -     fi
> > > +     # Wait for the incoming socket being removed, ready for next destination
> > > +     while [ -S ${dst_incoming} ] ; do sleep 0.1 ; done
> > 
> > But now, you have removed the erroring out path completely. Maybe wait
> > max. 3s and then bail out?
>
> Well, I was considering that, but:
> - I'm not a huge fan of fine-grained timeouts. Fine-tuning a gazillion
>   timeouts is not a fun task, I think you know what I'm talking about :)
> - a number of other places that can potentially get stuck also don't have
>   proper timeouts (like waiting for the QMP socket or the migration
>   socket), so for a proper solution we'd need to touch a lot of other
>   places...
>
> What I think we really want is a migration timeout. That isn't quite simple
> since we can't easily pull $(timeout_cmd) before $(panic_cmd) and
> $(migration_cmd) in run-scripts...
>
> My suggestion: let's fix this issue and work on the timeout as a separate
> fix.

The migration tests as a whole have big trouble with timeouts already.
The problem is that timeouts are implemented with the 'timeout' command,
which only covers the QEMU process, so the migration harness in
particular, with its many loops, can easily hang.

I tried a few ways to address this, like starting a background 'sleep ;
kill' shell, but it gets very complicated to handle interrupts properly
so that they kill the stuck bits and the harness still reports the error
sanely. I'm thinking that running the entire test case in a subshell and
starting *that* with 'timeout' might be a better approach.
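
A rough sketch of that subshell idea, not tested against the real
harness. run_test_case and run_with_timeout are made-up names, and the
polling loop is only a stand-in for the migration harness. It assumes
GNU 'timeout', which exits with status 124 when the limit expires:

```shell
# Hypothetical stand-in for a whole test case: a harness-style polling
# loop that never finishes while the given path exists.
run_test_case ()
{
	while [ -e "$1" ] ; do sleep 0.1 ; done
}

# Hypothetical wrapper: run the entire test case in a child bash under
# 'timeout', so loops in the harness are covered, not just QEMU.
run_with_timeout ()
{
	local limit=$1
	shift

	# 'timeout' can only exec a command, so export the function and
	# call it from a child shell.
	export -f run_test_case
	timeout "$limit" bash -c 'run_test_case "$@"' _ "$@"
	local ret=$?

	# 124 is GNU timeout's exit status when the limit expired
	if [ $ret -eq 124 ] ; then
		echo "ERROR: test case timed out after $limit" >&2
	fi
	return $ret
}
```

The point is that the error report happens in one place, in the parent,
instead of in every loop.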

So I agree, let's take patches that fix behaviour when there are no
timeouts, and address the timeout problem as a whole as a separate
effort rather than worrying too much about individual loops just yet.
(It's a fair review comment to ask though).
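
For reference, the bounded wait asked for in review could look roughly
like the following. This is only an illustration: wait_socket_removed
and its retry count are hypothetical, with 30 polls of 0.1s giving the
suggested 3s, and the error path mirrors the check the patch removes:

```shell
# Hypothetical helper: poll until the socket is gone, give up after
# $2 polls of 0.1s each (default 30, i.e. about 3s).
wait_socket_removed ()
{
	local sock=$1
	local tries=${2:-30}

	while [ -S "$sock" ] ; do
		if [ "$tries" -le 0 ] ; then
			echo "ERROR: Incoming migration socket not removed after migration." >&2
			return 2
		fi
		tries=$((tries - 1))
		sleep 0.1
	done
	return 0
}
```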

Thanks,
Nick