On Mon, Oct 12, 2015 at 8:03 PM, Jiri Denemark <jdenemar@xxxxxxxxxx> wrote: > On Thu, Oct 08, 2015 at 17:59:07 +0530, Shivaprasad G Bhat wrote: >> Tunnelled migration can hang if the destination qemu exits despite all the >> ABI checks. This happens whenever the destination qemu exits before the >> complete transfer is noticed by source qemu. The savevm state checks at >> runtime can fail at destination and cause qemu to error out. >> The source qemu cant notice it as the EPIPE is not propogated to it. >> The qemuMigrationIOFunc() notices the stream being broken from virStreamSend() >> and it cleans up the stream alone. The qemuMigrationWaitForCompletion() would >> never get to 100% transfer completion. >> The qemuMigrationWaitForCompletion() never breaks out as well since >> the ssh connection to destination is healthy, and the source qemu also thinks >> the migration is ongoing as the Fd to which it transfers, is never >> closed or broken. So, the migration will hang forever. Even Ctrl-C on the >> virsh migrate wouldn't be honoured. Close the source side FD when there is >> an error in the stream. That way, the source qemu updates itself and >> qemuMigrationWaitForCompletion() notices the failure. >> >> Close the FD for all kinds of errors to be sure. The error message is not >> copied for EPIPE so that the destination error is copied instead later. >> >> Note: >> Reproducible with repeated migrations between Power hosts running in different >> subcores-per-core modes. >> >> Changes from v2 -> v3: >> VIR_FORCE_CLOSE() needn't be called on the qemu source fd as that is closed >> by qemuMigrationIOFunc() for tunnelled migration. Set the fd to -1. >> >> Signed-off-by: Shivaprasad G Bhat <sbhat@xxxxxxxxxxxxxxxxxx> >> --- >> src/qemu/qemu_migration.c | 18 ++++++++++++++---- >> 1 file changed, 14 insertions(+), 4 deletions(-) >> >> diff --git a/src/qemu/qemu_migration.c b/src/qemu/qemu_migration.c >> index 7440108..c982838 100644 >> --- a/src/qemu/qemu_migration.c >> +++ b/src/qemu/qemu_migration.c >> @@ -4019,6 +4019,7 @@ static void qemuMigrationIOFunc(void *arg) >> if (virStreamFinish(data->st) < 0) >> goto error; >> >> + VIR_FORCE_CLOSE(data->sock); >> VIR_FREE(buffer); >> >> return; >> @@ -4036,7 +4037,11 @@ static void qemuMigrationIOFunc(void *arg) >> } >> >> error: >> - virCopyLastError(&data->err); >> + /* Let the source qemu know that the transfer cant continue anymore. >> + * Don't copy the error for EPIPE as destination has the actual error. */ >> + VIR_FORCE_CLOSE(data->sock); >> + if (!virLastErrorIsSystemErrno(EPIPE)) >> + virCopyLastError(&data->err); >> virResetLastError(); >> VIR_FREE(buffer); >> } > > Hmm, I don't think we need to filter out any error here. Were you > actually able to get the EPIPE error except for seeing it in the logs? > What I'm trying to say is: if we get an EPIPE in qemuMigrationIOFunc, we > close data->sock which will propagate to the source QEMU and > qemuMigrationWaitForCompletion should return a "migration unexpectedly > failed" error, which will get stored in orig_err and even though > qemuMigrationStopTunnel will set the thread local error to the one from > iothread->data, we will reset it back to the orig_err one. Only in case > qemuMigrationWaitForCompletion returned success but we still got error > on the tunnel, we'd actually see the error from iothread->err. > Yes. The qemuMigrationWaitForCompletion() returns success sometimes and StopTunnel propogating the failure in such a case. This happens inconsistently, but I do see it happening. Please let me know what I should be doing if this is not the way. Thanks, Shivaprasad > Otherwise the patch looks fine. > > Jirka -- libvir-list mailing list libvir-list@xxxxxxxxxx https://www.redhat.com/mailman/listinfo/libvir-list