On Tue, Feb 15, 2011 at 04:22:20PM +0000, KY Srinivasan wrote: > > > > -----Original Message----- > > From: Greg KH [mailto:gregkh@xxxxxxx] > > Sent: Tuesday, February 15, 2011 9:03 AM > > To: KY Srinivasan > > Cc: Jiri Slaby; linux-kernel@xxxxxxxxxxxxxxx; devel@xxxxxxxxxxxxxxxxxxxxxx; > > virtualization@xxxxxxxxxxxxxx > > Subject: Re: [PATCH 2/3]: Staging: hv: Use native wait primitives > > > > On Tue, Feb 15, 2011 at 01:35:56PM +0000, KY Srinivasan wrote: > > > > > > > > > > -----Original Message----- > > > > From: Jiri Slaby [mailto:jirislaby@xxxxxxxxx] > > > > Sent: Tuesday, February 15, 2011 4:21 AM > > > > To: KY Srinivasan > > > > Cc: gregkh@xxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; > > > > devel@xxxxxxxxxxxxxxxxxxxxxx; virtualization@xxxxxxxxxxxxxx > > > > Subject: Re: [PATCH 2/3]: Staging: hv: Use native wait primitives > > > > > > > > On 02/11/2011 06:59 PM, K. Y. Srinivasan wrote: > > > > > In preperation for getting rid of the osd layer; change > > > > > the code to use native wait interfaces. As part of this, > > > > > fixed the buggy implementation in the osd_wait_primitive > > > > > where the condition was cleared potentially after the > > > > > condition was signalled. > > > > ... > > > > > @@ -566,7 +567,11 @@ int vmbus_establish_gpadl(struct vmbus_channel > > > > *channel, void *kbuffer, > > > > > > > > > > } > > > > > } > > > > > - osd_waitevent_wait(msginfo->waitevent); > > > > > + wait_event_timeout(msginfo->waitevent, > > > > > + msginfo->wait_condition, > > > > > + msecs_to_jiffies(1000)); > > > > > + BUG_ON(msginfo->wait_condition == 0); > > > > > > > > The added BUG_ONs all over the code look scary. These shouldn't be > > > > BUG_ONs at all. You should maybe warn and bail out, but not kill the > > > > whole machine. > > > > > > This is Linux code running as a guest on a Windows host; and so the guest > > cannot > > > tolerate a failure of the host. In the cases where I have chosen to BUG_ON, > > there > > > is no reasonable recovery possible when the host is non-functional (as > > determined > > > by a non-responsive host). > > > > If you have a non-responsive host, wouldn't that imply that this guest > > code wouldn't run at all? :) > > The fact that on a particular transaction the host has not responded within an expected > time interval does not necessarily mean that the guest code would not be running. There may be > issues on the host side that may be either transient or permanent that may cause problems like > this. Keep in mind, HyperV is a type 1 hypervisor that would schedule all VMs including the host > and so, guest would get scheduled. > > > > > Having BUG_ON() in drivers is not a good idea either way. Please remove > > these in future patches. > > In situations where there is not a reasonable rollback strategy (for > instance in one of the cases, we are granting access to the guest > physical pages to the host) we really have only 2 options: > > 1) Wait until the host responds. This wait could potentially be unbounded > and in fact this was the way the code was to begin with. One of the reviewers > had suggested that unbounded wait was to be corrected. > 2) Wait for a specific period and if the host does not respond > within a reasonable period, kill the guest since there is no recovery > possible. Killing the guest is a very serious thing, causing all sorts of possible problems with it, right? > I chose option 2, as part of addressing some of the prior review > comments. If the consensus now is to go back to option 1, I am fine with that; Unbounded waits aren't ok either, you need some sort of timeout. But, as this is a bit preferable to dieing, I suggest doing this, and comment the heck out of it to explain all of this for anyone who reads it. thanks, greg k-h _______________________________________________ devel mailing list devel@xxxxxxxxxxxxxxxxxxxxxx http://driverdev.linuxdriverproject.org/mailman/listinfo/devel