[linux-pm] SMP suspend broken due to "swsusp: Change code ordering in disk.c" et al.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi,

On Friday, 23 February 2007 04:29, Johannes Berg wrote:
> Hi,
> 
> After first debugging a while and then bisecting I found out why my quad
> G5 won't suspend any longer.
> 
> Let me explain. The patch in question (committed as
> ed746e3b18f4df18afa3763155972c5835f284c5, but the other ones around that
> for other suspend methods will have the same problems) modifies the
> suspend sequence to be like this:
> 
> freeze_processes
> swsusp_shrink_memory
> platform_prepare
> device_suspend
> disable_nonboot_cpus
> [...]
> 
> while previously it was
> 
> disable_nonboot_cpus
> freeze_processes
> platform_prepare
> swsusp_shrink_memory
> [...]
> 
> 
> The only thing I'm worried about here is the ordering of
> freeze_processes vs. disable_nonboot_cpus. The problem with this new
> ordering is with workqueues, specifically freezable per-CPU workqueues
> which consist of one kthread per CPU, bound to a single CPU. Now, when
> CPUs are hot-unplugged, the workqueue code (by having a cpu notifier
> called) will kill the thread for the CPU that is being unplugged. If you
> look into kernel/workqueue.c, you'll notice that this is done by a
> regular kthread_stop() as one might expect.
> 
> However, and this is the problem, for any freezable workqueue, the
> workqueue kthread will be frozen at this point! Hence, kthread_stop()
> will wait forever for the thread to finish, blocking the suspend
> process.

Hm, the only freezable workqueues I was aware of were those in XFS.

Moreover, the patch has got _a_ _lot_ of testing on SMP on x86_64
and I believe it works for people on i386 too.  So the workqueues in question
seem to be architecture-specific.  Is that correct?

> Now, as for a solution, I don't really have a great idea yet. We have a
> bunch of things we could do:
>  (1) simply change the ordering to disable nonboot CPUs much earlier
>  (2a) teach kthread_stop() about frozen processes and that it doesn't
>       need to wait for them because they'll die once they wake up again
>  (2b) teach kthread_stop() about frozen processes modify the freezer to
>       allow waking up a process that is destined to die
>  (3) teach the workqueue code about suspend
> 
> Of these options,
> 
> (1) would work, but also only punts the problem until someone wants to
> do multi-threaded suspend (as if...).

It will also break symmetry with the resume code that has to be like this
because of ACPI-related issues.

> (2a) would sort-of work, but what if someone unplugs a CPU while the
> system is suspended [will that even work]? the thread would get really
> stuck there, bound to a CPU that no longer exists.

Right now we are working on using the task freezer for CPU hotplugging and if
that works, this won't be an issue.
 
> (2b) should be possible, but would require some sort of per-thread
> exit-the-freezer API
> 
> (3) is icky

The workqueue code knows about the suspend already, that's why we have
create_freezeable_worqueue(), for example.

I'd like to first understand why the workqueues in question here are freezable.

> I think I prefer (2b) or alternatively (1). In any case, with the commit
> mentioned above reverted, my quad G5 can suspend to disk again and I'm
> happy that it isn't my fault ;)

Could you please check if the appended patch (on top of the commit you have
reverted) changes anything?

Rafael

---
 kernel/power/disk.c |   10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

Index: linux-2.6.21-rc1/kernel/power/disk.c
===================================================================
--- linux-2.6.21-rc1.orig/kernel/power/disk.c
+++ linux-2.6.21-rc1/kernel/power/disk.c
@@ -132,9 +132,13 @@ int pm_suspend_disk(void)
 	if (error)
 		goto Thaw;
 
+	error = disable_nonboot_cpus();
+	if (error)
+		goto Enable_cpus;
+
 	error = platform_prepare();
 	if (error)
-		goto Thaw;
+		goto Enable_cpus;
 
 	suspend_console();
 	error = device_suspend(PMSG_FREEZE);
@@ -142,10 +146,6 @@ int pm_suspend_disk(void)
 		printk(KERN_ERR "PM: Some devices failed to suspend\n");
 		goto Resume_devices;
 	}
-	error = disable_nonboot_cpus();
-	if (error)
-		goto Enable_cpus;
-
 	if (pm_disk_mode == PM_DISK_TEST) {
 		printk("swsusp debug: Waiting for 5 seconds.\n");
 		mdelay(5000);


[Index of Archives]     [Linux ACPI]     [Netdev]     [Ethernet Bridging]     [Linux Wireless]     [CPU Freq]     [Kernel Newbies]     [Fedora Kernel]     [Security]     [Linux for Hams]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux RAID]     [Linux Admin]     [Samba]

  Powered by Linux