From: Dexuan Cui <decui@xxxxxxxxxxxxx> Sent: Saturday, January 25, 2020 11:54 AM > > The state machine in the hv_utils driver can run out of order in some > corner cases, e.g. if the kvp daemon doesn't call write() fast enough > due to some reason, kvp_timeout_func() can run first and move the state > to HVUTIL_READY; next, when kvp_on_msg() is called it returns -EINVAL > since kvp_transaction.state is smaller than HVUTIL_USERSPACE_REQ; later, > the daemon's write() gets an error -EINVAL, and the daemon will exit(). > > We can reproduce the issue by sending a SIGSTOP signal to the daemon, wait > for 1 minute, and send a SIGCONT signal to the daemon: the daemon will > exit() quickly. > > We can fix the issue by forcing a reset of the device (which means the > daemon can close() and open() the device again) and doing extra necessary > clean-up. > > Signed-off-by: Dexuan Cui <decui@xxxxxxxxxxxxx> > > --- > Changes in v2: > This is actually a new patch that makes the daemons more robust. > > Changes in v3 (I addressed Michael's comments): > Don't reset target_fd, since that's unnecessary. > Reset target_fname by: target_fname[0] = '\0'; > Added the missing "fs_frozen = true;" in vss_operate(). > After reopen_vss_fd: if vss_operate(VSS_OP_THAW) can not clear > fs_frozen due to an error, we just exit(). > Added comments. > > tools/hv/hv_fcopy_daemon.c | 33 +++++++++++++++++++++---- > tools/hv/hv_kvp_daemon.c | 36 ++++++++++++++++------------ > tools/hv/hv_vss_daemon.c | 49 +++++++++++++++++++++++++++++--------- > 3 files changed, 88 insertions(+), 30 deletions(-) > > diff --git a/tools/hv/hv_fcopy_daemon.c b/tools/hv/hv_fcopy_daemon.c > index aea2d91ab364..48cfa032432c 100644 > --- a/tools/hv/hv_fcopy_daemon.c > +++ b/tools/hv/hv_fcopy_daemon.c > @@ -80,6 +80,8 @@ static int hv_start_fcopy(struct hv_start_fcopy *smsg) > > error = 0; > done: > + if (error) > + target_fname[0] = '\0'; > return error; > } > > @@ -108,15 +110,29 @@ static int hv_copy_data(struct hv_do_fcopy *cpmsg) > return ret; > } > > +/* > + * Reset target_fname to "" in the two below functions for hibernation: if > + * the fcopy operation is aborted by hibernation, the daemon should remove the > + * partially-copied file; to achieve this, the hv_utils driver always fakes a > + * CANCEL_FCOPY message upon suspend, and later when the VM resumes back, > + * the daemon calls hv_copy_cancel() to remove the file; if a file is copied > + * successfully before suspend, hv_copy_finished() must reset target_fname to > + * avoid that the file can be incorrectly removed upon resume, since the faked > + * CANCEL_FCOPY message is spurious in this case. > + */ > static int hv_copy_finished(void) > { > close(target_fd); > + target_fname[0] = '\0'; > return 0; > } > static int hv_copy_cancel(void) > { > close(target_fd); > - unlink(target_fname); > + if (strlen(target_fname) > 0) { > + unlink(target_fname); > + target_fname[0] = '\0'; > + } > return 0; > > } > @@ -141,7 +157,7 @@ int main(int argc, char *argv[]) > struct hv_do_fcopy copy; > __u32 kernel_modver; > } buffer = { }; > - int in_handshake = 1; > + int in_handshake; > > static struct option long_options[] = { > {"help", no_argument, 0, 'h' }, > @@ -170,6 +186,10 @@ int main(int argc, char *argv[]) > openlog("HV_FCOPY", 0, LOG_USER); > syslog(LOG_INFO, "starting; pid is:%d", getpid()); > > +reopen_fcopy_fd: > + /* Remove any possible partially-copied file on error */ > + hv_copy_cancel(); Since you have removed the calls to close(fcopy_fd) after a pread() or pwrite() failure that were in v2 of the patch, I was expecting to see if (fcopy_fd != -1) close(fcopy_fd) here, like you've done with the kvp and vss code. And remember to initialize fcopy_fd to -1. :-) Michael > + in_handshake = 1; > fcopy_fd = open("/dev/vmbus/hv_fcopy", O_RDWR); > > if (fcopy_fd < 0) { > @@ -196,7 +216,7 @@ int main(int argc, char *argv[]) > len = pread(fcopy_fd, &buffer, sizeof(buffer), 0); > if (len < 0) { > syslog(LOG_ERR, "pread failed: %s", strerror(errno)); > - exit(EXIT_FAILURE); > + goto reopen_fcopy_fd; > } > > if (in_handshake) { > @@ -231,9 +251,14 @@ int main(int argc, char *argv[]) > > } > > + /* > + * pwrite() may return an error due to the faked CANCEL_FCOPY > + * message upon hibernation. Ignore the error by resetting the > + * dev file, i.e. closing and re-opening it. > + */ > if (pwrite(fcopy_fd, &error, sizeof(int), 0) != sizeof(int)) { > syslog(LOG_ERR, "pwrite failed: %s", strerror(errno)); > - exit(EXIT_FAILURE); > + goto reopen_fcopy_fd; > } > } > }