On Mon, Jun 29, 2020 at 07:20:35PM +0000, Anchal Agarwal wrote:
> On Fri, Jun 26, 2020 at 11:12:39AM +0200, Roger Pau Monné wrote:
> > So the frontend should do:
> >
> >  - Switch to Closed state (and cleanup everything required).
> >  - Wait for backend to switch to Closed state (must be done
> >    asynchronously, handled in blkback_changed).
> >  - Switch frontend to XenbusStateInitialising, that will in turn force
> >    the backend to switch to XenbusStateInitWait.
> >  - After that it should just follow the normal connection procedure.
> >
> > I think the part that's missing is the frontend doing the state change
> > to XenbusStateInitialising when the backend switches to the Closed
> > state.
> >
> > > I was of the view we may just want to mark frontend closed which should do
> > > the job of freeing resources and then following the same flow as
> > > blkfront_restore. That does not seems to work correctly 100% of the time.
> >
> > I think the missing part is that you must wait for the backend to
> > switch to the Closed state, or else the switch to
> > XenbusStateInitialising won't be picked up correctly by the backend
> > (because it's still doing it's cleanup).
> >
> > Using blkfront_restore might be an option, but you need to assert the
> > backend is in the initial state before using that path.
> >
> Yes, I agree and I make sure that XenbusStateInitialising only triggers
> on frontend once backend is disconnected. msleep in a loop not that graceful but
> works.
> Frontend only switches to XenbusStateInitialising once it sees backend
> as Closed. The issue here is and may require more debugging is:
> 1. Hibernate instance->Closing failed, artificially created situation by not
> marking frontend Closed in the first place during freezing.
> 2. System comes back up fine restored to 'backend connected'.

I'm not sure I'm following what is happening here, what should happen
IMO is that the backend will eventually reach the Closed state?
I.e.: the frontend has initiated the disconnection from the backend by
setting the Closing state, and the backend will have to eventually
reach the Closed state. At that point the frontend can initiate a
reconnection by switching to the Initialising state.

> 3. Re-run (1) again without reboot
> 4. (4) fails to recover basically freezing does not fail at all which is weird
> because it should timeout as it passes through same path. It hits a BUG in
> talk_to_blkback() and instance crashes.

It's hard to tell exactly. I guess you would have to figure out what
makes the frontend not get stuck at the same place as the first
attempt.

Roger.
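
For illustration only, the reconnection sequence discussed above could be
modelled as a small standalone state machine. This is a rough sketch, not
blkfront code: struct sim_link and the sim_*() helpers are hypothetical names
made up for the example, the enum is a local copy of the xenbus state values
so it builds outside the kernel, and the real driver would do the resource
cleanup and ring setup that is only hinted at in comments here. The point it
tries to show is that the switch to Initialising happens from the
backend-changed path, only once the backend has been seen in Closed.

```c
#include <stdbool.h>

/* Local copy of the xenbus state values so the sketch builds in userspace. */
enum xenbus_state {
    XenbusStateUnknown      = 0,
    XenbusStateInitialising = 1,
    XenbusStateInitWait     = 2,
    XenbusStateInitialised  = 3,
    XenbusStateConnected    = 4,
    XenbusStateClosing      = 5,
    XenbusStateClosed       = 6,
};

/* Hypothetical model of one frontend/backend pair; not a kernel structure. */
struct sim_link {
    enum xenbus_state frontend;
    enum xenbus_state backend;
    bool reconnecting; /* true while the frontend waits for backend Closed */
};

/* Step 1: the frontend cleans up (omitted here) and switches to Closed. */
static void sim_frontend_disconnect(struct sim_link *l)
{
    l->frontend = XenbusStateClosed;
    l->reconnecting = true;
}

/*
 * Stand-in for the backend-changed callback: it runs on every backend
 * state change, so nothing has to poll or sleep waiting for Closed.
 */
static void sim_backend_changed(struct sim_link *l, enum xenbus_state state)
{
    l->backend = state;

    switch (state) {
    case XenbusStateClosed:
        /* Steps 2 and 3: only now is it safe to restart the handshake. */
        if (l->reconnecting && l->frontend == XenbusStateClosed) {
            l->frontend = XenbusStateInitialising;
            l->reconnecting = false;
        }
        break;
    case XenbusStateInitWait:
        /* Step 4: normal connection procedure (ring setup omitted). */
        if (l->frontend == XenbusStateInitialising)
            l->frontend = XenbusStateInitialised;
        break;
    case XenbusStateConnected:
        if (l->frontend == XenbusStateInitialised)
            l->frontend = XenbusStateConnected;
        break;
    default:
        /* Closing and other states: keep waiting. */
        break;
    }
}

/* Drives one full disconnect/reconnect cycle and returns the frontend state. */
static enum xenbus_state sim_reconnect(struct sim_link *l)
{
    sim_frontend_disconnect(l);
    sim_backend_changed(l, XenbusStateClosing);   /* backend still cleaning up */
    sim_backend_changed(l, XenbusStateClosed);    /* frontend -> Initialising */
    sim_backend_changed(l, XenbusStateInitWait);  /* frontend -> Initialised */
    sim_backend_changed(l, XenbusStateConnected); /* frontend -> Connected */
    return l->frontend;
}
```

In this model a frontend that never saw the backend reach Closed simply stays
in Closed with reconnecting set, which is the condition worth checking for
before taking a restore-like path.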