Re: generic/650 makes v6.0-rc client unusable

On Nov 09, 2022 / 10:36, Filipe Manana wrote:
> On Wed, Nov 9, 2022 at 4:22 AM Shinichiro Kawasaki
> <shinichiro.kawasaki@xxxxxxx> wrote:
> >
> > On Sep 04, 2022 / 21:15, Zorro Lang wrote:
> > > On Sat, Sep 03, 2022 at 06:43:29PM +0000, Chuck Lever III wrote:
> > > > While investigating some of the other issues that have been
> > > > reported lately, I've found that my v6.0-rc3 NFS/TCP client
> > > > goes off the rails often (but not always) during generic/650.
> > > >
> > > > This is the test that runs a workload while offlining and
> > > > onlining CPUs. My test client has 12 physical cores.
> > > >
> > > > The test appears to start normally, but then after a bit
> > > > the NFS server workload drops to zero and the NFS mount
> > > > disappears. I can't run programs (sudo, for example) on
> > > > the client. Can't log in, even on the console. The console
> > > > has a constant stream of "can't rotate log: Input/Output
> > > > error" type messages.
> >
> > I also observed this failure when I ran fstests using btrfs on my HDDs.
> > The failure is recreated almost always.
> 
> I'm wondering what do you get in dmesg, any traces?

The log I observed is at the end of this e-mail [1]. There is no BUG message.
The irqbalance WARNING "didn't collect load info for all cpus, balancing is
broken" is repeated, but I once observed the hang without this warning.

The last message logged was from XFS: "ctx ticket reservation ran out. Need to
up reservation". This is for the system disk, not for the test target file system.
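
To make the observations above easier to check against a longer journal, here
is a minimal sketch that tallies the CPU offline/online events and the
irqbalance warnings. The message patterns are copied from the excerpt in [1];
the function name summarize() is hypothetical, and the "still offline" list
only reflects the lines that are actually fed in, so it is not meaningful
across the elided part of the log.

```python
import re
from collections import Counter

# Patterns taken from the journal excerpt in [1].
OFFLINE = re.compile(r"smpboot: CPU (\d+) is now offline")
ONLINE = re.compile(r"smpboot: Booting Node \d+ Processor (\d+) APIC")
IRQ_WARN = "didn't collect load info for all cpus"


def summarize(lines):
    """Count hotplug events and irqbalance warnings in journal lines.

    Returns (counts, still_offline), where still_offline is the sorted list
    of CPUs whose last seen event in the given lines was an offline.
    """
    counts = Counter()
    offline = set()
    for line in lines:
        m = OFFLINE.search(line)
        if m:
            counts["offline"] += 1
            offline.add(int(m.group(1)))
            continue
        m = ONLINE.search(line)
        if m:
            counts["online"] += 1
            offline.discard(int(m.group(1)))
            continue
        if IRQ_WARN in line:
            counts["irqbalance_warn"] += 1
    return counts, sorted(offline)


if __name__ == "__main__":
    import sys

    # Example: journalctl -b | python3 summarize.py
    counts, still_offline = summarize(sys.stdin)
    print(dict(counts))
    print("CPUs offline at end of input:", still_offline)
```

Piping a saved journal into the script prints the three counters and the CPUs
that were still offline at the end of the input.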

> I've excluded the test from my runs for over a year now, due to some
> crash that I reported
> to the mm and cpu hotplug people here:
> 
> https://lore.kernel.org/linux-mm/CAL3q7H4AyrZ5erimDyO7mOVeppd5BeMw3CS=wGbzrMZrp56ktA@xxxxxxxxxxxxxx/
> 
> Unfortunately I had no reply from anyone who works or maintains those
> subsystems.
> 
> It didn't happen very often, and I haven't tested again with recent kernels.

Thanks for sharing your experience. Hmm, your failure symptom is different from
mine.


[1]

Nov 09 11:50:09 redsun40 root[3480]: run xfstest generic/650
Nov 09 11:50:09 redsun40 unknown: run fstests generic/650 at 2022-11-09 11:50:09
Nov 09 11:50:09 redsun40 systemd[1]: Started fstests-generic-650.scope - /usr/bin/bash -c test -w /proc/self/oom_score_adj && echo 250 > /proc/self/oom_score_adj; exec ./tests/generic/650.
Nov 09 11:50:11 redsun40 kernel: smpboot: CPU 10 is now offline
Nov 09 11:50:11 redsun40 kernel: MMIO Stale Data CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/processor_mmio_stale_data.html for more details.
Nov 09 11:50:11 redsun40 kernel: smpboot: CPU 14 is now offline
Nov 09 11:50:14 redsun40 kernel: smpboot: CPU 25 is now offline
Nov 09 11:50:15 redsun40 kernel: smpboot: Booting Node 0 Processor 14 APIC 0x1c
Nov 09 11:50:15 redsun40 kernel: x86/cpu: SGX disabled by BIOS.
Nov 09 11:50:15 redsun40 kernel: x86/tme: not enabled by BIOS
Nov 09 11:50:15 redsun40 kernel: CPU0: Thermal monitoring enabled (TM1)
Nov 09 11:50:15 redsun40 kernel: x86/cpu: User Mode Instruction Prevention (UMIP) activated
Nov 09 11:50:15 redsun40 kernel: smpboot: CPU 30 is now offline
Nov 09 11:50:17 redsun40 kernel: smpboot: CPU 2 is now offline
Nov 09 11:50:19 redsun40 kernel: smpboot: CPU 20 is now offline
Nov 09 11:50:22 redsun40 kernel: smpboot: CPU 31 is now offline
Nov 09 11:50:23 redsun40 kernel: smpboot: CPU 23 is now offline
Nov 09 11:50:24 redsun40 kernel: smpboot: Booting Node 0 Processor 10 APIC 0x14
Nov 09 11:50:26 redsun40 kernel: smpboot: CPU 10 is now offline
Nov 09 11:50:28 redsun40 kernel: smpboot: Booting Node 0 Processor 20 APIC 0x9
Nov 09 11:50:29 redsun40 kernel: smpboot: CPU 21 is now offline
Nov 09 11:50:30 redsun40 kernel: smpboot: CPU 16 is now offline
Nov 09 11:50:31 redsun40 /usr/sbin/irqbalance[1143]: WARNING, didn't collect load info for all cpus, balancing is broken
Nov 09 11:50:31 redsun40 kernel: smpboot: Booting Node 0 Processor 30 APIC 0x1d
Nov 09 11:50:32 redsun40 kernel: smpboot: CPU 18 is now offline
Nov 09 11:50:33 redsun40 kernel: smpboot: Booting Node 0 Processor 2 APIC 0x4
Nov 09 11:50:34 redsun40 kernel: smpboot: CPU 4 is now offline
Nov 09 11:50:35 redsun40 kernel: smpboot: CPU 19 is now offline
Nov 09 11:50:36 redsun40 kernel: smpboot: Booting Node 0 Processor 31 APIC 0x1f
Nov 09 11:50:37 redsun40 kernel: smpboot: CPU 27 is now offline
Nov 09 11:50:38 redsun40 kernel: smpboot: CPU 26 is now offline
Nov 09 11:50:39 redsun40 kernel: smpboot: CPU 11 is now offline
Nov 09 11:50:41 redsun40 /usr/sbin/irqbalance[1143]: WARNING, didn't collect load info for all cpus, balancing is broken

...

Nov 09 12:28:51 redsun40 kernel: smpboot: Booting Node 0 Processor 31 APIC 0x1f
Nov 09 12:28:52 redsun40 /usr/sbin/irqbalance[1143]: WARNING, didn't collect load info for all cpus, balancing is broken
Nov 09 12:28:52 redsun40 kernel: smpboot: Booting Node 0 Processor 14 APIC 0x1c
Nov 09 12:28:52 redsun40 /usr/sbin/irqbalance[1143]: WARNING, didn't collect load info for all cpus, balancing is broken
Nov 09 12:28:53 redsun40 kernel: smpboot: CPU 24 is now offline
Nov 09 12:28:55 redsun40 kernel: smpboot: Booting Node 0 Processor 26 APIC 0x15
Nov 09 12:28:57 redsun40 kernel: smpboot: CPU 29 is now offline
Nov 09 12:28:58 redsun40 kernel: smpboot: Booting Node 0 Processor 20 APIC 0x9
Nov 09 12:28:59 redsun40 kernel: smpboot: Booting Node 0 Processor 24 APIC 0x11
Nov 09 12:29:00 redsun40 kernel: x86: Booting SMP configuration:
Nov 09 12:29:00 redsun40 kernel: smpboot: Booting Node 0 Processor 1 APIC 0x2
Nov 09 12:29:01 redsun40 kernel: smpboot: CPU 19 is now offline
Nov 09 12:29:02 redsun40 /usr/sbin/irqbalance[1143]: WARNING, didn't collect load info for all cpus, balancing is broken
Nov 09 12:29:04 redsun40 kernel: smpboot: Booting Node 0 Processor 7 APIC 0xe
Nov 09 12:29:04 redsun40 kernel: smpboot: CPU 1 is now offline
Nov 09 12:29:04 redsun40 kernel: XFS (nvme0n1p3): ctx ticket reservation ran out. Need to up reservation


-- 
Shin'ichiro Kawasaki


