Hello,it looks like some hardlink-handling caused a problem with a list kernelspace, which caused the executing core to stall.
[2.] Full description of the problem/report:At 07:15:01 local time, a cronjob started some backup process. Essentially nothing else ran on the system. At 07:15:05 local time, the first kernel "WARNING" got logged, about list_add corruption, and to me it looks like 'do_unlinkat' might be involved in that. This is immediately followed by another "WARNING" about list_add double add, where 'do_unlinkat' is involved again. At 07:15:26 local time and every 22/23 seconds thereafter, rcu_sched self-detects a stall on the same CPU as the above "WARNING"s. At 10:something, I attempt to login (and fail because libpam doesn't respond to xscreensaver) and observe (via sshd, which works somehow) a mostly-hung system, as some processes hang, presumably those scheduled on the stalling core.
The tool "rsnapshot" should have immediately started "rm -rf"ing roughly 242560 hardlinked files, and "mv"ing around roughly 20 directories. I have no explanation for the 4 seconds gap.
[3.] Keywords (i.e., modules, networking, kernel):hardlink, __mnt_want_write, do_unlinkat, list_add corruption, list_add double add, fs.
[4.] Kernel version (from /proc/version):Kernel version: 4.9.0-1-amd64 (debian-kernel@xxxxxxxxxxxxxxxx) (gcc version 6.3.0 20161229 (Debian 6.3.0-2) ) #1 SMP Debian 4.9.2-2 (2017-01-12)
Source: During boot on 11.2., there was this entry in kern.log:Feb 11 16:57:13 xpected kernel: [ 0.000000] Linux version 4.9.0-1-amd64 (debian-kernel@xxxxxxxxxxxxxxxx) (gcc version 6.3.0 20161229 (Debian 6.3.0-2) ) #1 SMP Debian 4.9.2-2 (2017-01-12)
(Sanity check: yes, the machine ran continuously for nearly a month.)/proc/version isn't going to be helpful anymore, as I installed a newer Kernel towards the end of February, and I already restarted after the hang.
I'm aware that this is by far not up-to-date:The most recent version in Debian testing appears to be "4.9.6-3 (2017-01-28)", and kernel.org seems to list only 4.9.13 as the closest version. So if there have been lots of changes in this code path, feel free to close as INVALID or similar.
[5.] Output of Oops.. message (if applicable) with symbolic information resolved The most relevant lines of kern.log: [1781000.925727] [<ffffffff96328b84>] ? dump_stack+0x5c/0x78 [1781000.925729] [<ffffffff96076dbe>] ? __warn+0xbe/0xe0 [1781000.925731] [<ffffffff96076e3f>] ? warn_slowpath_fmt+0x5f/0x80 [1781000.925732] [<ffffffff9634746c>] ? __list_add+0x5c/0xb0 [1781000.925734] [<ffffffff9621c671>] ? d_alloc+0x51/0x70 [1781000.925736] [<ffffffff9620df6e>] ? __lookup_hash+0x3e/0x90 [1781000.925737] [<ffffffff96224441>] ? __mnt_want_write+0x51/0x60 [1781000.925739] [<ffffffff96212f9a>] ? do_unlinkat+0x1ca/0x330[1781000.925741] [<ffffffff965f9c3b>] ? system_call_fast_compare_end+0xc/0x9b
Please find the rest attached as khang_logs.tar.xz There's not much reason to assume that "do_unlinkat" is the cause.It just looks to me like a good place to start, since I expect "__list_add", "d_alloc" and "__lookup_hash" to be very well tested.
[6.] A small shell script or example program which triggers the problem (if possible)Not appliccable. I cannot reproduce this. This operation is part of a backup tool (rsnapshot) that runs roughly every <15 minutes. So the same scenario happened countless times before with kernel 4.9.2-2; and again several times in the time it takes me to write this mail with kernel 4.9.6-3.
[7.] Environment [7.1.] Software (add the output of the ver_linux script here)Not appliccable: Before the hang, the new kernel was already installed but not running, and after the hang I immediately rebooted.
There's only /var/log/kern.log.4.gz to tell me the old version number.For sake of privacy, I don't want to attach this file unreviewed, but made a copy. So if there's anything else I can look up, it'll still be there.
[7.2.] Processor information (from /proc/cpuinfo): vendor_id : AuthenticAMD cpu family : 21 model : 16 model name : AMD A8-5600K APU with Radeon(tm) HD Graphics stepping : 1 microcode : 0x6001119 cpu MHz : 1400.000 cpu cores : 4 Skipped the rest, as it is probably not too relevant here.In short: 1 CPU, 4 cores, x64 arch, setup includes no overclocking/overvolting/whatever.
System has been running stably for the last 4 years. [7.3.] Module information (from /proc/modules):See attached, as it is included in the stack trace in khang_kern.log (see file khang_logs.tar.xz)
[7.4.] Loaded driver and hardware information (/proc/ioports, /proc/iomem) [7.5.] PCI information ('lspci -vvv' as root) [7.6.] SCSI information (from /proc/scsi/scsi)Skipped, as I don't think this is going to be relevant. I'm not aware of any necessaity of non-standard drivers. The involved ext4 filesystem lies on a dmcrypt volume, so the physical harddrive shouldn't be too interesting, either. For sake of completeness: harddrive is a Western Digital Green, model WDC WD20EZRX-00DC0B0.
[7.7.] Other information that might be relevant to the problemSpecifically, the backup tool creates and removes a lot of hardlinks ("$ rsnapshot beta"), possibly even concurrently.
The concerned filesystem is just normal ext4:/dev/mapper/sdb1_crypt on /backup type ext4 (rw,relatime,errors=remount-ro,data=ordered)
Again, there's no "proof" that this is actually related to the backup process at all. It's just highly suspicious coincidence, and so I'm looking at filesystem-related calls more closely.
There *is* a small probability of a harddrive error, as the drive is somewhat old, with 22 k power-on hours. However, smartctl reports good values. Also, a harddrive error shouldn't cause a kernel hang anyway.
Still, that's enough doubtIf this turns out to be the wrong list for this, please tell me where to post instead.
I'm willing to provide more information if available. I'm not too optimistic about the root cause ever being found. Cheers, Ben Wiederhake
Attachment:
khang_logs.tar.xz
Description: application/xz