PROBLEM: list_add corruption, possibly related to do_unlinkat

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hello,

it looks like some hardlink-handling caused a problem with a list kernelspace, which caused the executing core to stall.


[2.] Full description of the problem/report:
At 07:15:01 local time, a cronjob started some backup process. Essentially nothing else ran on the system. At 07:15:05 local time, the first kernel "WARNING" got logged, about list_add corruption, and to me it looks like 'do_unlinkat' might be involved in that. This is immediately followed by another "WARNING" about list_add double add, where 'do_unlinkat' is involved again. At 07:15:26 local time and every 22/23 seconds thereafter, rcu_sched self-detects a stall on the same CPU as the above "WARNING"s. At 10:something, I attempt to login (and fail because libpam doesn't respond to xscreensaver) and observe (via sshd, which works somehow) a mostly-hung system, as some processes hang, presumably those scheduled on the stalling core.

The tool "rsnapshot" should have immediately started "rm -rf"ing roughly 242560 hardlinked files, and "mv"ing around roughly 20 directories. I have no explanation for the 4 seconds gap.


[3.] Keywords (i.e., modules, networking, kernel):
hardlink, __mnt_want_write, do_unlinkat, list_add corruption, list_add double add, fs.


[4.] Kernel version (from /proc/version):
Kernel version: 4.9.0-1-amd64 (debian-kernel@xxxxxxxxxxxxxxxx) (gcc version 6.3.0 20161229 (Debian 6.3.0-2) ) #1 SMP Debian 4.9.2-2 (2017-01-12)
Source: During boot on 11.2., there was this entry in kern.log:
Feb 11 16:57:13 xpected kernel: [ 0.000000] Linux version 4.9.0-1-amd64 (debian-kernel@xxxxxxxxxxxxxxxx) (gcc version 6.3.0 20161229 (Debian 6.3.0-2) ) #1 SMP Debian 4.9.2-2 (2017-01-12)
(Sanity check: yes, the machine ran continuously for nearly a month.)
/proc/version isn't going to be helpful anymore, as I installed a newer Kernel towards the end of February, and I already restarted after the hang.

I'm aware that this is by far not up-to-date:
The most recent version in Debian testing appears to be "4.9.6-3 (2017-01-28)", and kernel.org seems to list only 4.9.13 as the closest version. So if there have been lots of changes in this code path, feel free to close as INVALID or similar.


[5.] Output of Oops.. message (if applicable) with symbolic information
        resolved

The most relevant lines of kern.log:

[1781000.925727]  [<ffffffff96328b84>] ? dump_stack+0x5c/0x78
[1781000.925729]  [<ffffffff96076dbe>] ? __warn+0xbe/0xe0
[1781000.925731]  [<ffffffff96076e3f>] ? warn_slowpath_fmt+0x5f/0x80
[1781000.925732]  [<ffffffff9634746c>] ? __list_add+0x5c/0xb0
[1781000.925734]  [<ffffffff9621c671>] ? d_alloc+0x51/0x70
[1781000.925736]  [<ffffffff9620df6e>] ? __lookup_hash+0x3e/0x90
[1781000.925737]  [<ffffffff96224441>] ? __mnt_want_write+0x51/0x60
[1781000.925739]  [<ffffffff96212f9a>] ? do_unlinkat+0x1ca/0x330
[1781000.925741] [<ffffffff965f9c3b>] ? system_call_fast_compare_end+0xc/0x9b

Please find the rest attached as khang_logs.tar.xz

There's not much reason to assume that "do_unlinkat" is the cause.
It just looks to me like a good place to start, since I expect "__list_add", "d_alloc" and "__lookup_hash" to be very well tested.


[6.] A small shell script or example program which triggers the
        problem (if possible)

Not appliccable. I cannot reproduce this. This operation is part of a backup tool (rsnapshot) that runs roughly every <15 minutes. So the same scenario happened countless times before with kernel 4.9.2-2; and again several times in the time it takes me to write this mail with kernel 4.9.6-3.


[7.] Environment
[7.1.] Software (add the output of the ver_linux script here)

Not appliccable: Before the hang, the new kernel was already installed but not running, and after the hang I immediately rebooted.
There's only /var/log/kern.log.4.gz to tell me the old version number.
For sake of privacy, I don't want to attach this file unreviewed, but made a copy. So if there's anything else I can look up, it'll still be there.


[7.2.] Processor information (from /proc/cpuinfo):

vendor_id	: AuthenticAMD
cpu family	: 21
model		: 16
model name	: AMD A8-5600K APU with Radeon(tm) HD Graphics
stepping	: 1
microcode	: 0x6001119
cpu MHz		: 1400.000
cpu cores	: 4

Skipped the rest, as it is probably not too relevant here.
In short: 1 CPU, 4 cores, x64 arch, setup includes no overclocking/overvolting/whatever.
System has been running stably for the last 4 years.


[7.3.] Module information (from /proc/modules):

See attached, as it is included in the stack trace in khang_kern.log (see file khang_logs.tar.xz)


[7.4.] Loaded driver and hardware information (/proc/ioports, /proc/iomem)
[7.5.] PCI information ('lspci -vvv' as root)
[7.6.] SCSI information (from /proc/scsi/scsi)

Skipped, as I don't think this is going to be relevant. I'm not aware of any necessaity of non-standard drivers. The involved ext4 filesystem lies on a dmcrypt volume, so the physical harddrive shouldn't be too interesting, either. For sake of completeness: harddrive is a Western Digital Green, model WDC WD20EZRX-00DC0B0.


[7.7.] Other information that might be relevant to the problem

Specifically, the backup tool creates and removes a lot of hardlinks ("$ rsnapshot beta"), possibly even concurrently.
The concerned filesystem is just normal ext4:
/dev/mapper/sdb1_crypt on /backup type ext4 (rw,relatime,errors=remount-ro,data=ordered)

Again, there's no "proof" that this is actually related to the backup process at all. It's just highly suspicious coincidence, and so I'm looking at filesystem-related calls more closely.

There *is* a small probability of a harddrive error, as the drive is somewhat old, with 22 k power-on hours. However, smartctl reports good values. Also, a harddrive error shouldn't cause a kernel hang anyway.
Still, that's enough doubt


If this turns out to be the wrong list for this, please tell me where to post instead.
I'm willing to provide more information if available.
I'm not too optimistic about the root cause ever being found.

Cheers,
Ben Wiederhake

Attachment: khang_logs.tar.xz
Description: application/xz


[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [Samba]     [Device Mapper]     [CEPH Development]
  Powered by Linux