> -----Original Message----- > From: Jason Wang [mailto:jasowang@xxxxxxxxxx] > Sent: Friday, November 28, 2014 18:13 PM > To: Dexuan Cui > Cc: gregkh@xxxxxxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; driverdev- > devel@xxxxxxxxxxxxxxxxxxxxxx; olaf@xxxxxxxxx; apw@xxxxxxxxxxxxx; KY > Srinivasan; vkuznets@xxxxxxxxxx; Haiyang Zhang > Subject: RE: [PATCH v3] hv: hv_fcopy: drop the obsolete message on transfer > failure > On Fri, Nov 28, 2014 at 4:36 PM, Dexuan Cui <decui@xxxxxxxxxxxxx> wrote: > >> -----Original Message----- > >> From: Jason Wang [mailto:jasowang@xxxxxxxxxx] > >> Sent: Friday, November 28, 2014 14:47 PM > >> To: Dexuan Cui > >> Cc: gregkh@xxxxxxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; > >> driverdev- > >> devel@xxxxxxxxxxxxxxxxxxxxxx; olaf@xxxxxxxxx; apw@xxxxxxxxxxxxx; KY > >> Srinivasan; vkuznets@xxxxxxxxxx; Haiyang Zhang > >> Subject: Re: [PATCH v3] hv: hv_fcopy: drop the obsolete message on > >> transfer > >> failure > >> On Thu, Nov 27, 2014 at 9:09 PM, Dexuan Cui <decui@xxxxxxxxxxxxx> > >> wrote: > >> > In the case the user-space daemon crashes, hangs or is killed, we > >> > need to down the semaphore, otherwise, after the daemon starts > >> next > >> > time, the obsolete data in fcopy_transaction.message or > >> > fcopy_transaction.fcopy_msg will be used immediately. > >> > > >> > Cc: Jason Wang <jasowang@xxxxxxxxxx> > >> > Cc: Vitaly Kuznetsov <vkuznets@xxxxxxxxxx> > >> > Cc: K. Y. Srinivasan <kys@xxxxxxxxxxxxx> > >> > Signed-off-by: Dexuan Cui <decui@xxxxxxxxxxxxx> > >> > --- > >> > > >> > v2: I removed the "FCP" prefix as Greg asked. > >> > > >> > I also updated the output message a little: > >> > "FCP: failed to acquire the semaphore" --> > >> > "can not acquire the semaphore: it is benign" > >> > > >> > v3: I added the code in fcopy_release() as Jason Wang suggested. > >> > I removed the pr_debug (it isn't so meaningful)and added a > >> > comment instead. > >> > > >> > drivers/hv/hv_fcopy.c | 19 +++++++++++++++++++ > >> > 1 file changed, 19 insertions(+) > >> > > >> > diff --git a/drivers/hv/hv_fcopy.c b/drivers/hv/hv_fcopy.c > >> > index 23b2ce2..faa6ba6 100644 > >> > --- a/drivers/hv/hv_fcopy.c > >> > +++ b/drivers/hv/hv_fcopy.c > >> > @@ -86,6 +86,18 @@ static void fcopy_work_func(struct work_struct > >> > *dummy) > >> > * process the pending transaction. > >> > */ > >> > fcopy_respond_to_host(HV_E_FAIL); > >> > + > >> > + /* In the case the user-space daemon crashes, hangs or is > >> killed, we > >> > + * need to down the semaphore, otherwise, after the daemon > >> starts > >> > next > >> > + * time, the obsolete data in fcopy_transaction.message or > >> > + * fcopy_transaction.fcopy_msg will be used immediately. > >> > + * > >> > + * NOTE: fcopy_read() happens to get the semaphore (very rare)? > >> > We're > >> > + * still OK, because we've reported the failure to the host. > >> > + */ > >> > + if (down_trylock(&fcopy_transaction.read_sema)) > >> > + ; > >> > >> Sorry, I'm not quite understand how if () ; can help here. > >> > >> Btw, a question not relate to this patch. > >> > >> What happens if a daemon is resume from SIGSTOP and expires the > >> check > >> here? > > Hi Jason, > > My idea is: here we need down_trylock(), but in case we can't get the > > semaphore, it's OK anyway: > > > > Scenario 1): > > 1.1: when the daemon is blocked on the pread(), the daemon receives > > SIGSTOP; > > 1.2: the host user runs the PowerShell Copy-VMFile command; > > 1.3.1: the driver reports the failure to the host user in 5s and > > 1.3.2: the driver down()-es the semaphore; > > 1.4: the daemon receives SIGCONT and it will be still blocked on the > > pread(). > > Without the down_trylock(), in 1.4, the daemon can receive an > > obsolete message. > > NOTE: in this scenario, the daemon is not killed. > > > > Scenario 2): > > In senario 1), if the daemon receives SIGCONT between 1.3.1 and 1.3.2 > > and > > do down() in fcopy_read(), it will receive the message but: the > > driver has > > reported the failure to the host user and the driver's 1.3.2 can't > > get the > > semaphore -- IMO this is acceptably OK, though in the VM, an > > incomplete > > file will be left there. > > BTW, I think in the daemon's hv_start_fcopy() we should add a > > close(target_fd) before open()-ing a new one. > > Right, but how about the case when resuming from SIGSTOP but no timeout? Sorry, I don't understand this: if no timeout, fcopy_read() will get the semaphore and fcopy_write() will try to cancel fcopy_work. > Looks like in this case userspace() may wait in down_interruptible() > until timeout. We probably need something like this: > > if (down_interruptible(&fcopy_transaction.read_sema)) { > up(&fcopy_transaction.read_sema); > return -EINTR; > } until "timeout"? if the daemon can't get the semaphore, it can only be wake by a signal(the daemon doesn't install handler, so by default most signals will kill the daemon). In case a signal waking up the daemon doesn't kill the daemon, why should we do up()? > > This should synchronize with the timeout work for sure. > But how about only schedule it after this? > It does not may sense to start the timer during interrupt > since the file may not even opened and it may take time > to handle signals? > > > > >> > > >> > + > >> > } > >> > > >> > static int fcopy_handle_handshake(u32 version) > >> > @@ -351,6 +363,13 @@ static int fcopy_release(struct inode *inode, > >> > struct file *f) > >> > */ > >> > in_hand_shake = true; > >> > opened = false; > >> > + > >> > + if (cancel_delayed_work_sync(&fcopy_work)) { > >> > + /* We haven't up()-ed the semaphore(very rare)? */ > >> > + if (down_trylock(&fcopy_transaction.read_sema)) > >> > + ; > >> > >> And this. > > > > Scenario 3): > > When the daemon exits(e.g., SIGKILL received), if there is a > > fcopy_work > > pending (scheduled but not start to run yet), we should cancel the > > work (as you suggested) and down() the semaphore, otherwise, the > > obsolete message will be received by the next instance of the daemon. > > Yes > > > > > > Scenario 4): in the driver's hv_fcopy_onchannelcallback(): > > schedule_delayed_work(&fcopy_work, 5*HZ); > > ----> if fcopy_release() is running on another vcpu, just > > before the next line? > > fcopy_send_data(); > > > > In this case, fcopy_release() can cancel fcopy_work, but > > can't get the semaphore since it hasn't been up()-ed. > > Hmm, in this case, fcopy_send_data() will do up() later, and we'll > > buffer an obsolete message in the driver, and the message will be > > fetched by the next instance of the daemon... > > > > Looks we need a spinlock here? > > Unless fcopy_release() can wait for all data for current transation > to be received. Spinlock won't help. > > But an idea is let the daemon the handle such cases. E.g make sure the > processing begins with START_COPY and end with COMPLETE/CANCEL_COPY. > Drop all requests that does not start with START_COPY. > > Thought? Good idea. I also think we should reinforce the concept of state machine in the daemon code. The daemon/driver communication has so many corner cases... > > > >> > > >> > + fcopy_respond_to_host(HV_E_FAIL); > >> > + } > >> > return 0; > >> > } > >> > > >> > -- > >> > 1.9.1 > >> > -- Dexuan _______________________________________________ devel mailing list devel@xxxxxxxxxxxxxxxxxxxxxx http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel