On Tue, Jan 7, 2014 at 11:06 PM, Bradley Baetz <bbaetz@xxxxxxxxx> wrote:
> Hi,
>
> On Tue, Jan 7, 2014 at 4:13 AM, Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
>> Hi Bradley,
>>
>> On Fri, Dec 27, 2013 at 02:21:21PM +1100, Bradley Baetz wrote:
>>> Hi,
>>>
>>> I have a Dell laptop (Vostro 3560). When I boot Fedora 20 with the
>>> acpi_backlight=vendor option, the kernel locks up hard during the boot
>>> process, when systemd runs udevadm trigger. This is a hard lockup -
>>> magic-sysrq doesn't work, and neither does caps lock/vt-change/etc.
>>>
>>> I've bisected this to:
>>>
>>> commit 81c0a2bb515fd4daae8cab64352877480792b515
>>> Author: Johannes Weiner <hannes@xxxxxxxxxxx>
>>> Date: Wed Sep 11 14:20:47 2013 -0700
>>>
>>>     mm: page_alloc: fair zone allocator policy
>>>
>>> which seemed really unrelated, but I've confirmed that:
>>>
>>> - the commit before this patch doesn't cause the problem, and the commit
>>>   afterwards does
>>> - reverting that patch from 3.12.0 fixes the problem
>>> - reverting that patch (and the partial revert
>>>   fff4068cba484e6b0abe334ed6b15d5a215a3b25) from master also fixes the problem
>>> - reverting that patch from the fedora 3.12.5-302.fc20 kernel fixes the
>>>   problem
>>> - applying that patch to 3.11.0 causes the problem
>>>
>>> so I'm pretty sure that that is the patch that causes (or at least
>>> triggers) this issue
>>>
>>> I'm using the acpi_backlight option to get the backlight working - without
>>> this the backlight doesn't work at all. Removing 'acpi_backlight=vendor'
>>> (or blacklisting the dell-laptop module, which is effectively the same
>>> thing) fixes the issue.
>>>
>>> The lockup happens when systemd runs "udevadm trigger", not when the module
>>> is loaded - I can reproduce the issue by booting into emergency mode,
>>> remounting the filesystem as rw, starting up systemd-udevd and running
>>> udevadm trigger manually. It dies a few seconds after loading the
>>> dell-laptop module.
>>>
>>> This happens even if I don't boot into X (using
>>> systemd.unit=multi-user.target)
>>>
>>> Triggering udev individually for each item doesn't trigger the issue ie:
>>>
>>> for i in `udevadm --debug trigger --type=devices --action=add --dry-run
>>> --verbose`; do echo $i; udevadm --debug trigger --type=devices --action=add
>>> --verbose --parent-match=$i; sleep 1; done
>>>
>>> works, so I haven't been able to work out what specific combination of
>>> actions is causing this.
>>>
>>> With the acpi_backlight option, I can manually read/write to the sysfs
>>> dell-laptop backlight file, and it works (and changes the backlight as
>>> expected)
>>>
>>> This is 100% reproducible. I've also tested by powering off the laptop and
>>> pulling the battery just in case one of the previous boots with the bisect
>>> left the hardware in a strange state - no change.
>>
>> My patch aggressively spreads allocations over all zones in the
>> system, but it should still respect dell-laptop's requirements for
>> DMA32 memory.
>>
>> I wonder if the drastic change in allocation placement exposes an
>> existing memory corruption. In fact, the dell-laptop module is
>> confused when it comes to the page allocator interface, it does
>>
>>     free_page((unsigned long)bufferpage);
>>
>> in the error path, where bufferpage is a page pointer that came out of
>> alloc_page(), which will cause the page allocator to try to free the
>> mem_map(!) page that backs the bufferpage page struct. So one failed
>> load attempt of the module could plausibly corrupt internal state.
>>
>> Does the following resolve the problem? And if not, what are the
>> "dell-laptop:" lines in the good and the bad kernel, and does the bad
>> kernel trigger the WARNING?
>
> Nope, no luck. I added some more printk's around the use of SMI.
> I've transcribed the logs from a screenshot for the failing kernel (ie
> master+your patch). "Sending command" logs class, select, and
> &command.ebx (with the %pa format string):
>
> dell-laptop: bufferpage (ffffea000263c680) in node 0 zone 1 (DMA32)
> Sending command: 0, 2, 0x4253493198f1a000
> Command sent
> dell-laptop: getting intensity
> Sending command: 0, 2, 0x4253493198f1a000
> Command sent
> dell-laptop: got intensity
> dell-laptop: Setting intensity
> Sending command: 1, 2, 0x4253493198f1a000
>
> and then it locks up before returning from the SMI
>
> So some of the commands work, and they also return the same value for
> the brightness, AND have parsed the same value from the SMBIOS table
> for the ioport/value to use. (I added that later, but didn't take a
> photo - they all return brightness of 2, which is the at-boot default
> value)
>
> Without acpi_backlight=vendor:
>
> dell-laptop: bufferpage (ffffea0000fa0dc0) in node 0 zone 1 (DMA32)
>
> (no other logs, because the module's backlight interface isn't used
> without that boot param)
>
> With your mm patches reverted:
>
> [ 12.773884] dell-laptop: bufferpage (ffffea0000fe0180) in node 0 zone 1 (DMA32)
> [ 12.775502] Sending command: 0, 2, 0x425349313f806000
> [ 12.777293] Command sent
> [ 12.778950] dell-laptop: getting intensity
> [ 12.780589] Sending command: 0, 2, 0x425349313f806000
> [ 12.782185] Command sent
> [ 12.783679] dell-laptop: got intensity
> [ 12.785202] dell-laptop: Setting intensity
> [ 12.786715] Sending command: 1, 2, 0x425349313f806000
> [ 12.788892] Command sent
> [ 12.790379] dell-laptop: set intensity
>
> (with the get/set repeated a bit later when X starts up)
>
> And on the broken kernel, when I boot into 'emergency' mode and manually
> load dell-laptop, I get the same logs as the 'working' bit (including
> the getting/got/setting/set lines).
>
> Looking at the code, I notice a few things odd with the dcdbas code,
> although I don't think that they're the issue here
>
> 1.
>    dcdbas_smi_request does outb/inb, and marks eax as an input, but
>    doesn't mark it as clobbered (I think; I don't have much experience
>    with gcc's asm). In practice, I can't see that being an issue
> 2. dcdbas_smi_request says that it is "Called with smi_data_lock" but
>    that's only true for the calls *within* dcdbas.c. I think that that's
>    only a documentation issue, since it is protecting a buffer that isn't
>    used here. (Dell-laptop has its own buffer and mutex).
>
> I'm still unable to manually reproduce this - the only way to repro is
> 'try to boot normally', and while that's 100% reliable, it makes it a
> bit hard to narrow a trigger down...

So if I boot into 'emergency' mode and modprobe dell-laptop, it only
locks up about 50% of the time. And if I boot with init=/bin/bash, and
then load the module, it doesn't lock up at all (tried 5 times)

I also tried making dell-laptop use the DMA zone (instead of DMA32), and
that didn't help.

Bradley
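
For reference, the values in the quoted logs can be cross-checked against
each other. The following is only a userspace sketch and makes several
assumptions that are not stated in the thread: that %pa printed 8 bytes
starting at &command.ebx on a little-endian x86-64 kernel (so the low half
is ebx and the high half is whatever 32-bit field follows it, assumed here
to be ecx), that the vmemmap base is 0xffffea0000000000, and that
struct page is 64 bytes. The helper names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumptions (not from the thread): x86-64 vmemmap base, 64-byte
 * struct page, 4 KiB pages. */
#define VMEMMAP_BASE_ASSUMED     0xffffea0000000000ULL
#define STRUCT_PAGE_SIZE_ASSUMED 64ULL
#define PAGE_SHIFT_ASSUMED       12

/* Low 32 bits of the logged 64-bit value: command.ebx on little-endian. */
static uint32_t logged_ebx(uint64_t logged)
{
    return (uint32_t)(logged & 0xffffffffULL);
}

/* High 32 bits: the field that follows ebx (assumed to be ecx). */
static uint32_t logged_next_field(uint64_t logged)
{
    return (uint32_t)(logged >> 32);
}

/* Convert a logged struct page pointer to the physical address of the
 * page it describes, under the layout assumptions above. */
static uint64_t page_ptr_to_phys(uint64_t page_ptr)
{
    assert(page_ptr >= VMEMMAP_BASE_ASSUMED);
    uint64_t pfn = (page_ptr - VMEMMAP_BASE_ASSUMED) / STRUCT_PAGE_SIZE_ASSUMED;
    return pfn << PAGE_SHIFT_ASSUMED;
}

/* True if the physical address is addressable with 32 bits. */
static bool below_4g(uint64_t phys)
{
    return phys < (1ULL << 32);
}
```

Under those assumptions the bufferpage pointer and the low half of the
"Sending command" value agree in both logs: 0x98f1a000 (roughly 2.39 GiB)
on the failing kernel and 0x3f806000 (roughly 1016 MiB) with the mm patches
reverted. Both are below 4 GiB, which matches the DMA32 zone reported in
the logs; the buffer is simply placed much higher on the failing kernel.
This does not by itself explain the lockup.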