RE: [ANNOUNCE] 4.1.3-rt3 - xmit queue timeout, oops, rcu stalls

John Dulaney <j_dulaney@xxxxxxxx> · Thu, 6 Aug 2015 18:19:37 -0400

----------------------------------------
> Subject: Re: [ANNOUNCE] 4.1.3-rt3 - xmit queue timeout, oops, rcu stalls
> To: bigeasy@xxxxxxxxxxxxx; linux-rt-users@xxxxxxxxxxxxxxx
> CC: nando@xxxxxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; tglx@xxxxxxxxxxxxx; rostedt@xxxxxxxxxxx; jkacur@xxxxxxxxxx
> From: nando@xxxxxxxxxxxxxxxxxx
> Date: Thu, 6 Aug 2015 10:50:22 -0700
>
> On 07/25/2015 03:32 AM, Sebastian Andrzej Siewior wrote:
>> Dear RT folks!
>>
>> I'm pleased to announce the v4.1.3-rt3 patch set.
> ...
>
> I've had a few hangs with nothing left behind to debug... but today I
> find this:
>
> (NOTE: I'm attaching a file with the details, I don't know if my mailer
> will mangle these lines)
>
> ----
> Aug 5 10:46:18 localhost kernel: [ 2343.673560] WARNING: CPU: 3 PID: 43
> at net/sched/sch_generic.c:303 dev_watchdog+0x26f/0x280()
> Aug 5 10:46:18 localhost kernel: [ 2343.673561] NETDEV WATCHDOG: eth1
> (e1000e): transmit queue 0 timed out
> ----
>
> and then:
>
> ----
> Aug 5 10:46:18 localhost kernel: [ 2343.673679] e1000e 0000:04:00.0
> eth1: Reset adapter unexpectedly
> Aug 5 10:46:30 localhost kernel: [ 2355.706987] ata5.00: exception
> Emask 0x40 SAct 0x0 SErr 0x80800 action 0x6 frozen
> Aug 5 10:46:30 localhost kernel: [ 2355.706990] ata5: SError: { HostInt
> 10B8B }
> Aug 5 10:46:30 localhost kernel: [ 2355.707003] ata5.00: cmd
> a0/00:00:00:08:00/00:00:00:00:00/a0 tag 0 pio 16392 in
> Aug 5 10:46:30 localhost kernel: [ 2355.707003] Get event
> status notification 4a 01 00 00 10 00 00 00 08 00res
> 40/00:03:00:00:00/00:00:00:00:00/a0 Emask 0x44 (timeout)
> Aug 5 10:46:30 localhost kernel: [ 2355.707005] ata5.00: status: { DRDY }
> Aug 5 10:46:30 localhost kernel: [ 2355.707007] ata5: hard resetting link
> ----
>
> same one but later in the log:
>
> ----
> Aug 5 10:46:18 localhost kernel: WARNING: CPU: 3 PID: 43 at
> net/sched/sch_generic.c:303 dev_watchdog+0x26f/0x280()
> Aug 5 10:46:18 localhost kernel: NETDEV WATCHDOG: eth1 (e1000e):
> transmit queue 0 timed out
> ----
>
> Things apparently keep working and then:
>
> ----
> Aug 5 11:58:36 localhost kernel: [ 6678.122596] Network Receive[2409]:
> segfault at 28 ip 0000003c4c293ca9 sp 00007fb6f64dbb58 error 6 in
> libc-2.18.so[3c4c200000+1b4000]
> Aug 5 11:58:36 localhost kernel: Network Receive[2409]: segfault at 28
> ip 0000003c4c293ca9 sp 00007fb6f64dbb58 error 6 in
> libc-2.18.so[3c4c200000+1b4000]
> Aug 5 11:58:36 localhost kernel: timekeeping watchdog: Marking
> clocksource 'tsc' as unstable, because the skew is too large:
> Aug 5 11:58:36 localhost kernel: 'hpet' wd_now: 47ebf654 wd_last:
> c0debfe6 mask: ffffffff
> Aug 5 11:58:36 localhost kernel: 'tsc' cs_now: 154f6e564f7d cs_last:
> 7784d315c59 mask: ffffffffffffffff
> Aug 5 11:58:36 localhost systemd: Starting dnf makecache...
> Aug 5 11:58:36 localhost kernel: [ 6678.123233] timekeeping watchdog:
> Marking clocksource 'tsc' as unstable, because the skew is too large:
> Aug 5 11:58:36 localhost kernel: [ 6678.123237] 'hpet' wd_now:
> 47ebf654 wd_last: c0debfe6 mask: ffffffff
> Aug 5 11:58:36 localhost kernel: [ 6678.123238] 'tsc' cs_now:
> 154f6e564f7d cs_last: 7784d315c59 mask: ffffffffffffffff
> Aug 5 11:58:36 localhost kernel: [ 6678.146207] Switched to clocksource
> hpet
> Aug 5 11:58:36 localhost kernel: Switched to clocksource hpet
> Aug 5 11:58:36 localhost kernel: [ 6678.150087] BUG: unable to handle
> kernel NULL pointer dereference at 0000000000000ea0
> Aug 5 11:58:36 localhost kernel: [ 6678.150097] IP:
> [<ffffffffa05d922e>] nfs40_discover_server_trunking+0x5e/0x110 [nfsv4]
> Aug 5 11:58:36 localhost kernel: [ 6678.150098] PGD 7f3c83067 PUD
> 7f46fb067 PMD 0
> Aug 5 11:58:36 localhost kernel: [ 6678.150099] Oops: 0000 [#1] PREEMPT
> SMP
> ----
>
> And eventually (later) get a ton of these:
>
> ----
> Aug 5 11:59:36 localhost kernel: [ 6738.107181] INFO: rcu_preempt
> detected stalls on CPUs/tasks: {} (detected by 3, t=60002 jiffies,
> g=37092, c=37091, q=0)
> Aug 5 11:59:36 localhost kernel: [ 6738.107183] All QSes seen, last
> rcu_preempt kthread activity 1 (4301410925-4301410924),
> jiffies_till_next_fqs=3, root ->qsmask 0x0
> ----
>
> So something is left in a not good state...
>
> -- Fernando

Do you still have your box set up to capture a vmcore?  Also, is this my latest
build?  I've been having issues with LUKS.

If you do still have your system set up to capture a vmcore, maybe set:

kernel.panic_on_oops = 1
in your /etc/sysctl.conf and then reboot into this kernel.
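
With panic_on_oops set, an oops such as the NULL pointer dereference above
becomes a panic, which is what lets the crash kernel write out a vmcore.
To double-check that the setting is really in place before waiting for the
next hang, something like the following rough Python sketch could be used.
The function names are made up for illustration, and it only understands the
plain dotted "key = value" form, not every syntax sysctl accepts:

```python
from typing import Optional


def panic_on_oops_enabled(conf_text: str) -> bool:
    """Return True if the last kernel.panic_on_oops assignment in
    sysctl.conf-style text sets the value to 1."""
    value = None
    for raw in conf_text.splitlines():
        line = raw.strip()
        # Blank lines and lines starting with '#' or ';' are comments.
        if not line or line[0] in "#;":
            continue
        key, sep, val = line.partition("=")
        if sep and key.strip() == "kernel.panic_on_oops":
            # Later assignments override earlier ones, so keep the last.
            value = val.strip()
    return value == "1"


def running_value(path: str = "/proc/sys/kernel/panic_on_oops") -> Optional[str]:
    """Return the value the running kernel reports, or None if the
    file cannot be read (for example on a non-Linux machine)."""
    try:
        with open(path) as fh:
            return fh.read().strip()
    except OSError:
        return None
```

Feeding it the contents of /etc/sysctl.conf shows whether the file carries
the setting; running_value() shows what the booted kernel is actually using.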

John.