PROBLEM: list_add corruption, possibly related to do_unlinkat

Ben Wiederhake <ben.wiederhake@xxxxxxxxx> · Sat, 4 Mar 2017 13:34:29 +0100

Hello,

it looks like some hardlink-handling caused a problem with a list 
kernelspace, which caused the executing core to stall.

[2.] Full description of the problem/report:
At 07:15:01 local time, a cronjob started some backup process. 
Essentially nothing else ran on the system.
At 07:15:05 local time, the first kernel "WARNING" got logged, about 
list_add corruption, and to me it looks like 'do_unlinkat' might be 
involved in that.  This is immediately followed by another "WARNING" 
about list_add double add, where 'do_unlinkat' is involved again.
At 07:15:26 local time and every 22/23 seconds thereafter, rcu_sched 
self-detects a stall on the same CPU as the above "WARNING"s.
At 10:something, I attempt to login (and fail because libpam doesn't 
respond to xscreensaver) and observe (via sshd, which works somehow) a 
mostly-hung system, as some processes hang, presumably those scheduled 
on the stalling core.

The tool "rsnapshot" should have immediately started "rm -rf"ing roughly 
242560 hardlinked files, and "mv"ing around roughly 20 directories.  I 
have no explanation for the 4 seconds gap.

[3.] Keywords (i.e., modules, networking, kernel):
hardlink, __mnt_want_write, do_unlinkat, list_add corruption, list_add 
double add, fs.

[4.] Kernel version (from /proc/version):
Kernel version: 4.9.0-1-amd64 (debian-kernel@xxxxxxxxxxxxxxxx) (gcc 
version 6.3.0 20161229 (Debian 6.3.0-2) ) #1 SMP Debian 4.9.2-2 (2017-01-12)
Source: During boot on 11.2., there was this entry in kern.log:
Feb 11 16:57:13 xpected kernel: [    0.000000] Linux version 
4.9.0-1-amd64 (debian-kernel@xxxxxxxxxxxxxxxx) (gcc version 6.3.0 
20161229 (Debian 6.3.0-2) ) #1 SMP Debian 4.9.2-2 (2017-01-12)
(Sanity check: yes, the machine ran continuously for nearly a month.)
/proc/version isn't going to be helpful anymore, as I installed a newer 
Kernel towards the end of February, and I already restarted after the hang.

I'm aware that this is by far not up-to-date:
The most recent version in Debian testing appears to be "4.9.6-3 
(2017-01-28)", and kernel.org seems to list only 4.9.13 as the closest 
version.
So if there have been lots of changes in this code path, feel free to 
close as INVALID or similar.

[5.] Output of Oops.. message (if applicable) with symbolic information
        resolved

The most relevant lines of kern.log:

[1781000.925727]  [<ffffffff96328b84>] ? dump_stack+0x5c/0x78
[1781000.925729]  [<ffffffff96076dbe>] ? __warn+0xbe/0xe0
[1781000.925731]  [<ffffffff96076e3f>] ? warn_slowpath_fmt+0x5f/0x80
[1781000.925732]  [<ffffffff9634746c>] ? __list_add+0x5c/0xb0
[1781000.925734]  [<ffffffff9621c671>] ? d_alloc+0x51/0x70
[1781000.925736]  [<ffffffff9620df6e>] ? __lookup_hash+0x3e/0x90
[1781000.925737]  [<ffffffff96224441>] ? __mnt_want_write+0x51/0x60
[1781000.925739]  [<ffffffff96212f9a>] ? do_unlinkat+0x1ca/0x330
[1781000.925741]  [<ffffffff965f9c3b>] ? 
system_call_fast_compare_end+0xc/0x9b

Please find the rest attached as khang_logs.tar.xz

There's not much reason to assume that "do_unlinkat" is the cause.
It just looks to me like a good place to start, since I expect 
"__list_add", "d_alloc" and "__lookup_hash" to be very well tested.

[6.] A small shell script or example program which triggers the
        problem (if possible)

Not appliccable.  I cannot reproduce this.  This operation is part of a 
backup tool (rsnapshot) that runs roughly every <15 minutes.
So the same scenario happened countless times before with kernel 
4.9.2-2; and again several times in the time it takes me to write this 
mail with kernel 4.9.6-3.

[7.] Environment
[7.1.] Software (add the output of the ver_linux script here)

Not appliccable: Before the hang, the new kernel was already installed 
but not running, and after the hang I immediately rebooted.
There's only /var/log/kern.log.4.gz to tell me the old version number.
For sake of privacy, I don't want to attach this file unreviewed, but 
made a copy.  So if there's anything else I can look up, it'll still be 
there.

[7.2.] Processor information (from /proc/cpuinfo):

vendor_id	: AuthenticAMD
cpu family	: 21
model		: 16
model name	: AMD A8-5600K APU with Radeon(tm) HD Graphics
stepping	: 1
microcode	: 0x6001119
cpu MHz		: 1400.000
cpu cores	: 4

Skipped the rest, as it is probably not too relevant here.
In short: 1 CPU, 4 cores, x64 arch, setup includes no 
overclocking/overvolting/whatever.
System has been running stably for the last 4 years.

[7.3.] Module information (from /proc/modules):

See attached, as it is included in the stack trace in khang_kern.log 
(see file khang_logs.tar.xz)

[7.4.] Loaded driver and hardware information (/proc/ioports, /proc/iomem)
[7.5.] PCI information ('lspci -vvv' as root)
[7.6.] SCSI information (from /proc/scsi/scsi)

Skipped, as I don't think this is going to be relevant.  I'm not aware 
of any necessaity of non-standard drivers.  The involved ext4 filesystem 
lies on a dmcrypt volume, so the physical harddrive shouldn't be too 
interesting, either.
For sake of completeness: harddrive is a Western Digital Green, model 
WDC WD20EZRX-00DC0B0.

[7.7.] Other information that might be relevant to the problem

Specifically, the backup tool creates and removes a lot of hardlinks ("$ 
rsnapshot beta"), possibly even concurrently.
The concerned filesystem is just normal ext4:
/dev/mapper/sdb1_crypt on /backup type ext4 
(rw,relatime,errors=remount-ro,data=ordered)

Again, there's no "proof" that this is actually related to the backup 
process at all.  It's just highly suspicious coincidence, and so I'm 
looking at filesystem-related calls more closely.

There *is* a small probability of a harddrive error, as the drive is 
somewhat old, with 22 k power-on hours.  However, smartctl reports good 
values.  Also, a harddrive error shouldn't cause a kernel hang anyway.
Still, that's enough doubt

If this turns out to be the wrong list for this, please tell me where to 
post instead.
I'm willing to provide more information if available.
I'm not too optimistic about the root cause ever being found.

Cheers,
Ben Wiederhake
Attachment:
khang_logs.tar.xz

Description: application/xz