On Tue, 10 Nov 2009, Rafael J. Wysocki wrote: > > > > So somebody is trying to remove a list entry that was already free'd. > > The entry is being removed by run_workqueue(), so I wonder what the scenario is. If the workqueue code runs on the wrogn CPU, any random crap can easily happen. If a workqueue entry gets queued on the wrong CPU, people who try to unqueue it may not find it again, and will thus fail to remove it even though it's still listed. And the code really is pretty subtle. queue_delayed_work_on(), for example, uses raw_smp_processor_id() to basically pick a target CPU for the work ("whatever the CPU happens to be now"). But does that have to match the CPU that the timeout will trigger on? I dunno. I've walked through the code many times, but every time I end up forgetting all the subtleties a few days later. The workqueue code is very fragile in many ways. I'm adding Oleg and Tejun to the Cc, because Oleg is definitely one of the workqueue locking masters, and Tejun has worked on it to make it way more robust, so they may have ideas. Oleg, Tejun? > It appears that the problem is not reproducible with init=/bin/bash and the > minimal set of modules loaded, so perhaps one of the (modular) drivers > removes a work item from keventd_wq in a racy way. It's not necessarily that they remove a work item racily - maybe there is something that frees a work-item without removing it at all (or, equally possible: maybe it does remove it, but has a racy re-attach that means that it gets added back to the workqueue again and then freed). > In the meantime I got another trace, this time with a slab corruption involved. > Note that it crashed in exactly the same place as previously. I'm leaving your crash log appended for the new cc's, and I would not be at all surprised to hear that the slab corruption is related. The whole 6b6b6b6b pattern does imply a use-after-free on the workqueue, after all, and slab corruption could easily be another result. Sadly, the 'size-512' thing doesn't really say much. There's a hint that it may be sound-related, because the second redzone report (the firxt kzalloc) has the data being - 16 bytes of zero - followed by 16 bytes of "PCM Playback Vol" and the module information there also says [snd]. But because it's a size-512 thing (ie kmalloc/kfree), it may be that the sound driver was just the next user of the invalid 512-byte block, and wasn't the _cause_ of the problem. So I don't know, but it might be worthwhile to see if the problem goes away if you disable sound. Just to narrow down what has changed. Linus --- > [ 210.450676] usb 1-2: GSM modem (1-port) converter now attached to ttyUSB0 > [ 210.450725] option 1-2:1.1: GSM modem (1-port) converter detected > [ 210.450997] usb 1-2: GSM modem (1-port) converter now attached to ttyUSB1 > [ 210.451044] option 1-2:1.2: GSM modem (1-port) converter detected > [ 210.451422] usb 1-2: GSM modem (1-port) converter now attached to ttyUSB2 > [ 210.451470] option 1-2:1.3: GSM modem (1-port) converter detected > [ 210.451746] usb 1-2: GSM modem (1-port) converter now attached to ttyUSB3 > [ 210.452092] PM: Finishing wakeup. > [ 210.452095] Restarting tasks ... > [ 210.452293] usb 5-2: USB disconnect, address 2 > [ 210.470519] done. > [ 210.845547] Slab corruption: size-512 start=ffff88007f1182b0, len=512 > [ 210.845718] Redzone: 0x9f911029d74e35b/0x9f911029d74e35b. > [ 210.845880] Last user: [<ffffffff810c8bdb>](kfree+0x7c/0x88) > [ 210.846142] 050: 6b 6b 6b 6b 6b 6b 6b 6b 38 e7 60 01 00 88 ff ff > [ 210.847114] Prev obj: start=ffff88007f118098, len=512 > [ 210.847275] Redzone: 0xd84156c5635688c0/0xd84156c5635688c0. > [ 210.847436] Last user: [<ffffffffa01ffce0>](kzalloc+0xf/0x11 [snd]) > [ 210.847724] 000: 00 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00 > [ 210.850735] 010: 50 43 4d 20 50 6c 61 79 62 61 63 6b 20 56 6f 6c > [ 210.851844] Next obj: start=ffff88007f1184c8, len=512 > [ 210.852220] Redzone: 0xd84156c5635688c0/0xd84156c5635688c0. > [ 210.852384] Last user: [<ffffffff8106a4c8>](kzalloc+0xf/0x11) > [ 210.852655] 000: 17 ea 3b 81 ff ff ff ff 00 00 00 00 00 00 00 00 > [ 210.853981] 010: 24 01 00 00 00 00 00 00 0e a9 06 81 ff ff ff ff > [ 210.887295] e1000e 0000:01:00.0: irq 27 for MSI/MSI-X > [ 210.940224] e1000e 0000:01:00.0: irq 27 for MSI/MSI-X > [ 210.941776] ADDRCONF(NETDEV_UP): eth0: link is not ready > [ 211.061316] general protection fault: 0000 [#1] PREEMPT SMP > [ 211.061626] last sysfs file: /sys/class/pcmcia_socket/pcmcia_socket0/card_insert > [ 211.061891] CPU 0 > [ 211.062090] Modules linked in: ip6t_LOG xt_tcpudp xt_pkttype ipt_LOG xt_limit af_packet bnep sco rfcomm l2cap crc16 snd_pcm_oss snd_mixer_oss snd_seq binfmt_misc snd_seq_device ip6t_REJECT nf_conntrack_ipv6 ip6table_raw xt_NOTRACK ipt_REJECT xt_state iptable_raw iptable_filter ip6table_mangle nf_conntrack_netbios_ns nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 cpufreq_conservative ip_tables cpufreq_ondemand cpufreq_userspace cpufreq_powersave ip6table_filter acpi_cpufreq ip6_tables x_tables freq_table ipv6 microcode fuse loop sr_mod cdrom dm_mod btusb snd_hda_codec_realtek bluetooth arc4 snd_hda_intel ecb snd_hda_codec pcmcia snd_hwdep iwlcore snd_pcm mac80211 toshiba_acpi joydev snd_timer yenta_socket usbhid option video cfg80211 backlight rtc_cmos snd rsrc_nonstatic psmouse usb_storage hid output usbserial rtc_core iTCO_wdt soundcore battery pcmcia_core ac rfkill led_class button intel_agp rtc_lib crc_itu_t serio_raw e1000e iTCO_vendor_support snd_page_alloc uinput s! g ehci_hcd uhci_hcd sd_mod crc_t10dif usbcore ext3 jbd fan ahci libata thermal processor [last unloaded: iwlagn] > [ 211.064780] Pid: 9, comm: events/0 Not tainted 2.6.32-rc6-tst #169 PORTEGE R500 > [ 211.064780] RIP: 0010:[<ffffffff81054c07>] [<ffffffff81054c07>] worker_thread+0x15b/0x22a > [ 211.064780] RSP: 0018:ffff88007f0d9e40 EFLAGS: 00010046 > [ 211.064780] RAX: ffff88007f118308 RBX: ffff88007f09bd48 RCX: 6b6b6b6b6b6b6b6b > [ 211.064780] RDX: 6b6b6b6b6b6b6b6b RSI: 0000000000000000 RDI: ffff880001613d00 > [ 211.064780] RBP: ffff88007f0d9ee0 R08: ffff88007f0d9cf0 R09: ffff88007f0d9e02 > [ 211.064780] R10: ffff880001613d00 R11: 0000000000000000 R12: ffff88007f118300 > [ 211.064780] R13: ffff880001613d00 R14: ffff88007f0b9140 R15: ffff88007f0b9140 > [ 211.064780] FS: 0000000000000000(0000) GS:ffff880001600000(0000) knlGS:0000000000000000 > [ 211.064780] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b > [ 211.064780] CR2: 00007f7678984660 CR3: 000000007d3ce000 CR4: 00000000000006f0 > [ 211.064780] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > [ 211.064780] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 > [ 211.064780] Process events/0 (pid: 9, threadinfo ffff88007f0d8000, task ffff88007f0b9140) > [ 211.064780] Stack: > [ 211.064780] 000000000000c918 ffff88007f0b9578 ffff88007f0d9fd8 ffff88007f0b9140 > [ 211.064780] <0> ffff880001613d08 ffff88007f0b9140 ffff880001613d18 6b6b6b6b6b6b6b6b > [ 211.064780] <0> 0000000000000000 ffff88007f0b9140 ffffffff81058289 ffff88007f0d9e98 > [ 211.064780] Call Trace: > [ 211.064780] [<ffffffff81058289>] ? autoremove_wake_function+0x0/0x38 > [ 211.064780] [<ffffffff81054aac>] ? worker_thread+0x0/0x22a > [ 211.064780] [<ffffffff81058062>] kthread+0x69/0x71 > [ 211.064780] [<ffffffff8100c16a>] child_rip+0xa/0x20 > [ 211.064780] [<ffffffff81057ff9>] ? kthread+0x0/0x71 > [ 211.064780] [<ffffffff8100c160>] ? child_rip+0x0/0x20 > [ 211.064780] Code: 74 12 4c 89 e6 4c 89 f7 ff 13 48 83 c3 08 48 83 3b 00 eb ec e8 3d ef ff ff 49 8b 45 08 4d 89 65 30 4c 89 ef 48 8b 08 48 8b 50 08 <48> 89 51 08 48 89 0a 48 89 40 08 48 89 00 e8 f6 11 24 00 49 8b > [ 211.064780] RIP [<ffffffff81054c07>] worker_thread+0x15b/0x22a > [ 211.064780] RSP <ffff88007f0d9e40> > [ 211.064780] ---[ end trace 586a2f2b96b909e0 ]--- > [ 211.064780] note: events/0[9] exited with preempt_count 1 > > Rafael > _______________________________________________ linux-pm mailing list linux-pm@xxxxxxxxxxxxxxxxxxxxxxxxxx https://lists.linux-foundation.org/mailman/listinfo/linux-pm