On 9/14/20 5:47 PM, Anchal Agarwal wrote: > On Sun, Sep 13, 2020 at 11:43:30AM -0400, boris.ostrovsky@xxxxxxxxxx wrote: >> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe. >> >> >> >> On 8/21/20 6:25 PM, Anchal Agarwal wrote: >>> Though, accquirng pm_mutex is still right thing to do, we may >>> see deadlock if PM hibernation is interrupted by Xen suspend. >>> PM hibernation depends on xenwatch thread to process xenbus state >>> transactions, but the thread will sleep to wait pm_mutex which is >>> already held by PM hibernation context in the scenario. Xen shutdown >>> code may need some changes to avoid the issue. >> >> >> Is it Xen's shutdown or suspend code that needs to address this? (Or I >> may not understand what the problem is that you are describing) >> > Its Xen suspend code I think. If we do not take the system_transition_mutex > in do_suspend then if hibernation is triggered in parallel to xen suspend there > could be issues. But you *are* taking this mutex to avoid this exact race, aren't you? > Now this is still theoretical in my case and I havent been able > to reproduce such a race. So the approach the original author took was to take > this lock which to me seems right. > And its Xen suspend and not Xen Shutdown. So basically if this scenario > happens I am of the view one of other will fail to occur then how do we recover > or avoid this at all. > > Does that answer your question? > >>> + >>> +static int xen_setup_pm_notifier(void) >>> +{ >>> + if (!xen_hvm_domain() || xen_initial_domain()) >>> + return -ENODEV; >> >> I don't think this works anymore. > What do you mean? > The first check is for xen domain types and other is for architecture support. > The reason I put this check here is because I wanted to segregate the two. > I do not want to register this notifier at all for !hmv guest and also if its > an initial control domain. > The arm check only lands in notifier because once hibernate() api is called -> > calls pm_notifier_call_chain for PM_HIBERNATION_PREPARE this will fail for > aarch64. > Once we have support for aarch64 this notifier can go away altogether. > > Is there any other reason I may be missing why we should move this check to > notifier? Not registering this notifier is equivalent to having it return NOTIFY_OK. In your earlier versions just returning NOTIFY_OK was not sufficient for hibernation to proceed since the notifier would also need to set suspend_mode appropriately. But now your notifier essentially filters out unsupported configurations. And so if it is not called your configuration (e.g. PV domain) will be considered supported. >> In the past your notifier would set suspend_mode (or something) but now >> it really doesn't do anything except reports an error in some (ARM) cases. >> >> So I think you should move this check into the notifier. >> (And BTW I still think PM_SUSPEND_PREPARE should return an error too. >> The fact that we are using "suspend" in xen routine names is irrelevant) >> > I may have send "not-updated" version of the notifier's function change. > > + switch (pm_event) { > + case PM_HIBERNATION_PREPARE: > + /* Guest hibernation is not supported for aarch64 currently*/ > + if (IS_ENABLED(CONFIG_ARM64)) { > + ret = NOTIFY_BAD; > + break; > + } > + case PM_RESTORE_PREPARE: > + case PM_POST_RESTORE: > + case PM_POST_HIBERNATION: > + default: > + ret = NOTIFY_OK; > + } There is no difference on x86 between this code and what you sent earlier. In both instances PM_SUSPEND_PREPARE will return NOTIFY_OK. On ARM this code will allow suspend to proceed (which is not what we want). -boris > > With the above path PM_SUSPEND_PREPARE will go all together. Does that > resolves this issue? I wanted to get rid of all SUSPEND_* as they are not needed > here clearly. > The only reason I kept it there is if someone tries to trigger hibernation on > ARM instances they should get an error. As I am not sure about the current > behavior. There may be a better way to not invoke hibernation on ARM DomU's and > get rid of this block all together. > > Again, sorry for sending in the half baked fix. My workspace switch may have > caused the error. >> >> >> -boris >> > Anchal >> >>> + return register_pm_notifier(&xen_pm_notifier_block); >>> +} >>> +