RE: [PATCH v3 1/4] Tools: hv: Reopen the devices if read() or write() returns errors

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



From: Dexuan Cui <decui@xxxxxxxxxxxxx> Sent: Saturday, January 25, 2020 11:54 AM
> 
> The state machine in the hv_utils driver can run out of order in some
> corner cases, e.g. if the kvp daemon doesn't call write() fast enough
> due to some reason, kvp_timeout_func() can run first and move the state
> to HVUTIL_READY; next, when kvp_on_msg() is called it returns -EINVAL
> since kvp_transaction.state is smaller than HVUTIL_USERSPACE_REQ; later,
> the daemon's write() gets an error -EINVAL, and the daemon will exit().
> 
> We can reproduce the issue by sending a SIGSTOP signal to the daemon, wait
> for 1 minute, and send a SIGCONT signal to the daemon: the daemon will
> exit() quickly.
> 
> We can fix the issue by forcing a reset of the device (which means the
> daemon can close() and open() the device again) and doing extra necessary
> clean-up.
> 
> Signed-off-by: Dexuan Cui <decui@xxxxxxxxxxxxx>
> 
> ---
> Changes in v2:
>     This is actually a new patch that makes the daemons more robust.
> 
> Changes in v3 (I addressed Michael's comments):
>     Don't reset target_fd, since that's unnecessary.
>     Reset target_fname by: target_fname[0] = '\0';
>     Added the missing "fs_frozen = true;" in vss_operate().
>     After reopen_vss_fd: if vss_operate(VSS_OP_THAW) can not clear
>         fs_frozen due to an error, we just exit().
>     Added comments.
> 
>  tools/hv/hv_fcopy_daemon.c | 33 +++++++++++++++++++++----
>  tools/hv/hv_kvp_daemon.c   | 36 ++++++++++++++++------------
>  tools/hv/hv_vss_daemon.c   | 49 +++++++++++++++++++++++++++++---------
>  3 files changed, 88 insertions(+), 30 deletions(-)
> 
> diff --git a/tools/hv/hv_fcopy_daemon.c b/tools/hv/hv_fcopy_daemon.c
> index aea2d91ab364..48cfa032432c 100644
> --- a/tools/hv/hv_fcopy_daemon.c
> +++ b/tools/hv/hv_fcopy_daemon.c
> @@ -80,6 +80,8 @@ static int hv_start_fcopy(struct hv_start_fcopy *smsg)
> 
>  	error = 0;
>  done:
> +	if (error)
> +		target_fname[0] = '\0';
>  	return error;
>  }
> 
> @@ -108,15 +110,29 @@ static int hv_copy_data(struct hv_do_fcopy *cpmsg)
>  	return ret;
>  }
> 
> +/*
> + * Reset target_fname to "" in the two below functions for hibernation: if
> + * the fcopy operation is aborted by hibernation, the daemon should remove the
> + * partially-copied file; to achieve this, the hv_utils driver always fakes a
> + * CANCEL_FCOPY message upon suspend, and later when the VM resumes back,
> + * the daemon calls hv_copy_cancel() to remove the file; if a file is copied
> + * successfully before suspend, hv_copy_finished() must reset target_fname to
> + * avoid that the file can be incorrectly removed upon resume, since the faked
> + * CANCEL_FCOPY message is spurious in this case.
> + */
>  static int hv_copy_finished(void)
>  {
>  	close(target_fd);
> +	target_fname[0] = '\0';
>  	return 0;
>  }
>  static int hv_copy_cancel(void)
>  {
>  	close(target_fd);
> -	unlink(target_fname);
> +	if (strlen(target_fname) > 0) {
> +		unlink(target_fname);
> +		target_fname[0] = '\0';
> +	}
>  	return 0;
> 
>  }
> @@ -141,7 +157,7 @@ int main(int argc, char *argv[])
>  		struct hv_do_fcopy copy;
>  		__u32 kernel_modver;
>  	} buffer = { };
> -	int in_handshake = 1;
> +	int in_handshake;
> 
>  	static struct option long_options[] = {
>  		{"help",	no_argument,	   0,  'h' },
> @@ -170,6 +186,10 @@ int main(int argc, char *argv[])
>  	openlog("HV_FCOPY", 0, LOG_USER);
>  	syslog(LOG_INFO, "starting; pid is:%d", getpid());
> 
> +reopen_fcopy_fd:
> +	/* Remove any possible partially-copied file on error */
> +	hv_copy_cancel();

Since you have removed the calls to close(fcopy_fd) after a
pread() or pwrite() failure that were in v2 of the patch, I was
expecting to see

	if (fcopy_fd != -1)
		close(fcopy_fd)

here, like you've done with the kvp and vss code.  And
remember to initialize fcopy_fd to -1. :-)

Michael

> +	in_handshake = 1;
>  	fcopy_fd = open("/dev/vmbus/hv_fcopy", O_RDWR);
> 
>  	if (fcopy_fd < 0) {
> @@ -196,7 +216,7 @@ int main(int argc, char *argv[])
>  		len = pread(fcopy_fd, &buffer, sizeof(buffer), 0);
>  		if (len < 0) {
>  			syslog(LOG_ERR, "pread failed: %s", strerror(errno));
> -			exit(EXIT_FAILURE);
> +			goto reopen_fcopy_fd;
>  		}
> 
>  		if (in_handshake) {
> @@ -231,9 +251,14 @@ int main(int argc, char *argv[])
> 
>  		}
> 
> +		/*
> +		 * pwrite() may return an error due to the faked CANCEL_FCOPY
> +		 * message upon hibernation. Ignore the error by resetting the
> +		 * dev file, i.e. closing and re-opening it.
> +		 */
>  		if (pwrite(fcopy_fd, &error, sizeof(int), 0) != sizeof(int)) {
>  			syslog(LOG_ERR, "pwrite failed: %s", strerror(errno));
> -			exit(EXIT_FAILURE);
> +			goto reopen_fcopy_fd;
>  		}
>  	}
>  }




[Index of Archives]     [Linux Samsung SoC]     [Linux Rockchip SoC]     [Linux Actions SoC]     [Linux for Synopsys ARC Processors]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]


  Powered by Linux