On Mon, Sep 21, 2015 at 8:04 PM, John Ferlan <jferlan@xxxxxxxxxx> wrote:
>
>
> On 09/21/2015 05:09 AM, Shivaprasad bhat wrote:
>> Thanks John for the comments.
>>
>>
>> On Fri, Sep 18, 2015 at 10:34 PM, John Ferlan <jferlan@xxxxxxxxxx> wrote:
>>>
>>>
>>> On 09/14/2015 10:44 AM, Shivaprasad G Bhat wrote:
>>>> Tunnelled migration can hang if the destination qemu exits despite all the
>>>> ABI checks. This happens whenever the destination qemu exits before the
>>>> complete transfer is noticed by source qemu. The savevm state checks at
>>>> runtime can fail at destination and cause qemu to error out.
>>>> The source qemu can't notice it as the EPIPE is not propagated to it.
>>>> The qemuMigrationIOFunc() notices the stream being broken from virStreamSend()
>>>> and it cleans up the stream alone. The qemuMigrationWaitForCompletion() would
>>>> never get to 100% transfer completion.
>>>> The qemuMigrationWaitForCompletion() never breaks out as well since
>>>> the ssh connection to destination is healthy, and the source qemu also thinks
>>>> the migration is ongoing as the fd to which it transfers is never
>>>> closed or broken. So, the migration will hang forever. Even Ctrl-C on the
>>>> virsh migrate wouldn't be honoured. Close the source side FD when there is
>>>> an error in the stream. That way, the source qemu updates itself and
>>>> qemuMigrationWaitForCompletion() notices the failure.
>>>>
>>>> Close the FD for all kinds of errors to be sure. The error message is not
>>>> copied for EPIPE so that the destination error is copied instead later.
>>>>
>>>> Note:
>>>> Reproducible with repeated migrations between Power hosts running in different
>>>> subcores-per-core modes.
>>>>
>>>> Changes from v1 -> v2:
>>>> VIR_FORCE_CLOSE() was called twice for this use case which would log
>>>> unnecessary warnings. So, move the fd close to qemuMigrationIOFunc
>>>> so that there are no unnecessary duplicate attempts. (Would this trigger
>>>> a Coverity error? I don't have a setup to check.)
>>>>
>>>> Signed-off-by: Shivaprasad G Bhat <sbhat@xxxxxxxxxxxxxxxxxx>
>>>> ---
>>>>  src/qemu/qemu_migration.c |    8 ++++++--
>>>>  1 file changed, 6 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/src/qemu/qemu_migration.c b/src/qemu/qemu_migration.c
>>>> index ff89ab5..9602fb2 100644
>>>> --- a/src/qemu/qemu_migration.c
>>>> +++ b/src/qemu/qemu_migration.c
>>>> @@ -4012,6 +4012,7 @@ static void qemuMigrationIOFunc(void *arg)
>>>>      if (virStreamFinish(data->st) < 0)
>>>>          goto error;
>>>>
>>>> +    VIR_FORCE_CLOSE(data->sock);
>>>>      VIR_FREE(buffer);
>>>>
>>>>      return;
>>>> @@ -4029,7 +4030,11 @@ static void qemuMigrationIOFunc(void *arg)
>>>>      }
>>>>
>>>>   error:
>>>> -    virCopyLastError(&data->err);
>>>> +    /* Let the source qemu know that the transfer cant continue anymore.
>>>> +     * Don't copy the error for EPIPE as destination has the actual error. */
>>>> +    VIR_FORCE_CLOSE(data->sock);
>>>> +    if (!virLastErrorIsSystemErrno(EPIPE))
>>>> +        virCopyLastError(&data->err);
>>>>      virResetLastError();
>>>>      VIR_FREE(buffer);
>>>>  }
>>>> @@ -4397,7 +4402,6 @@ qemuMigrationRun(virQEMUDriverPtr driver,
>>>>          if (iothread && qemuMigrationStopTunnel(iothread, ret < 0) < 0)
>>>>              ret = -1;
>>>>      }
>>>> -    VIR_FORCE_CLOSE(fd);
>>>
>>> ^^^
>>>
>>> This causes Coverity to claim a RESOURCE_LEAK
>>>
>>> Feels like this was a mistake edit...
>>>
>>
>> The VIR_FORCE_CLOSE() inside qemuMigrationIOFunc() alone is sufficient.
>> Having this again here would lead to a warning in the logs. I too thought Coverity
>> would complain.
>> Is there a way to force-ignore a Coverity warning?
>>
>
> Typically a marker of sorts, such as
>
> /* coverity[leaked_handle] <some extra comment explaining why> */
>
> Although I'm not sure that's the best way to handle this either.
>
> The problem I see though is that this is an "error path" issue, and a
> question of when it's safe/required to close fd/io->sock/data->sock.
>
> Your commit comment notes that the issue is seen on a fairly specific
> event (virStreamSend failure) for a specific type of migration.

I believe the failure can be seen for all types of migration with a
savestate mismatch.

> As I read the code, that failure jumps to error (as does virStreamFinish). So
> now you have a fairly specific set of instances which perhaps would
> cause qemuMigrationWaitForCompletion to need to fail. The fix you have
> proposed is to close the data->sock (io->sock, fd). However, your
> proposal is a larger hammer. I assume closure of data->sock causes
> WaitForCompletion to fail (perhaps differently) and that's why you chose it.
>
> Going back to the beginning of qemuMigrationRun, setting the 'fd' and
> using StartTunnel seems to rely on MIGRATION_FWD_DIRECT, but if a tunnel
> is created, an (ugh) 'iothread' variable is non-NULL. What if 'iothread'
> were passed to qemuMigrationWaitForCompletion? Then again passed to
> qemuMigrationCompleted, which would check whether iothread is non-NULL
> and whether some new flag created in _qemuMigrationIOThread is true.
> If it were true, then that would cause failure so that
> WaitForCompletion would return error status. The only way the flag is
> set to true is in qemuMigrationIOFunc when the code jumps to error.
> (e.g. add bool stream_abort and data->stream_abort = true in "error:",
> then check iothread && iothread->stream_abort, force error).
>

I tried this. Until the fd is closed, 'info migrate' doesn't return from
qemu, so the migration still hangs. Checking the stream status
before/after fetching the job status would still leave a race. So,
doing the close in the separate thread (qemuMigrationIOFunc) seems
inevitable to me.
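For reference, the shape of what I tried is roughly the following (just a
sketch to show the idea; the existing members and arguments are elided, so
the code here is illustrative rather than the actual patch):

/* new flag in the IO thread state, set when the stream breaks */
struct _qemuMigrationIOThread {
    /* ... existing members (stream, socket, saved error, thread) ... */
    bool stream_abort;
};

static void
qemuMigrationIOFunc(void *arg)
{
    qemuMigrationIOThreadPtr data = arg;

    /* ... copy loop and virStreamFinish() unchanged ... */

 error:
    data->stream_abort = true;     /* tell the waiting thread the stream died */
    virCopyLastError(&data->err);
    virResetLastError();
    VIR_FREE(buffer);
}

/* 'iothread' gets passed from qemuMigrationRun() through
 * qemuMigrationWaitForCompletion() into qemuMigrationCompleted(),
 * which forces a failure once the flag is set: */

    if (iothread && iothread->stream_abort)
        goto error;                /* stream broke, abort the wait */

Even with that flag in place, the migration thread stays stuck inside the
monitor until the fd is closed, because the query-migrate ('info migrate')
reply never arrives. This is where it sits: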
#1  0x00003fff83ff593c in virCondWait (c=<optimized out>, m=<optimized out>)
    at util/virthread.c:154
#2  0x00003fff76235544 in qemuMonitorSend (mon=0x3fff54003970, msg=<optimized out>)
    at qemu/qemu_monitor.c:1035
#3  0x00003fff7624fb30 in qemuMonitorJSONCommandWithFd (mon=0x3fff54003970,
    cmd=0x3fff5c001420, scm_fd=-1, reply=0x3fff79fbd388) at qemu/qemu_monitor_json.c:293
#4  0x00003fff76254b90 in qemuMonitorJSONCommand (reply=0x3fff79fbd388,
    cmd=0x3fff5c001420, mon=0x3fff54003970) at qemu/qemu_monitor_json.c:323
#5  qemuMonitorJSONGetMigrationStatus (mon=0x3fff54003970, status=0x3fff79fbd538)
    at qemu/qemu_monitor_json.c:2620
#6  0x00003fff7623a664 in qemuMonitorGetMigrationStatus (mon=<optimized out>,
    status=<optimized out>) at qemu/qemu_monitor.c:2134
#7  0x00003fff76228e6c in qemuMigrationFetchJobStatus (driver=0x3fff6c118d80,
    vm=0x3fff6c27faf0, asyncJob=<optimized out>, jobInfo=0x3fff79fbd4f0)
    at qemu/qemu_migration.c:2528
#8  0x00003fff76228fb4 in qemuMigrationUpdateJobStatus (driver=0x3fff6c118d80,
    vm=0x3fff6c27faf0, asyncJob=QEMU_ASYNC_JOB_MIGRATION_OUT)
    at qemu/qemu_migration.c:2565
#9  0x00003fff76229200 in qemuMigrationCheckJobStatus (driver=0x3fff6c118d80,
    vm=0x3fff6c27faf0, asyncJob=QEMU_ASYNC_JOB_MIGRATION_OUT)
    at qemu/qemu_migration.c:2585
#10 0x00003fff76229408 in qemuMigrationCompleted (iothread=<optimized out>,

> The only question then in my mind is whether this is also something
> that should be done for the virStreamFinish failure path, which also
> jumps to error?
>

Yes, I too am not sure if it's appropriate for the virStreamFinish failure
case. But, given this is an error path, I feel we can safely go ahead and
close data->sock.

> Doing this means you shouldn't need the VIR_FORCE_CLOSE(data->sock) for
> either the normal or error path. Does this seem to be a reasonable
> approach and solve the issue you're facing?
>
> John

>> Thanks and Regards,
>> Shivaprasad
>>
>>> The rest of the patch looks reasonable; however, I'm far from the expert
>>> in these matters.
>>>
>>> John
>>>>
>>>>      if (priv->job.completed) {
>>>>          qemuDomainJobInfoUpdateTime(priv->job.completed);
>>>>
>>>> --
>>>> libvir-list mailing list
>>>> libvir-list@xxxxxxxxxx
>>>> https://www.redhat.com/mailman/listinfo/libvir-list

--
libvir-list mailing list
libvir-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/libvir-list