Re: generic/650 makes v6.0-rc client unusable

On Wed, Nov 09, 2022 at 10:36:04AM +0000, Filipe Manana wrote:
> On Wed, Nov 9, 2022 at 4:22 AM Shinichiro Kawasaki
> <shinichiro.kawasaki@xxxxxxx> wrote:
> >
> > On Sep 04, 2022 / 21:15, Zorro Lang wrote:
> > > On Sat, Sep 03, 2022 at 06:43:29PM +0000, Chuck Lever III wrote:
> > > > While investigating some of the other issues that have been
> > > > reported lately, I've found that my v6.0-rc3 NFS/TCP client
> > > > goes off the rails often (but not always) during generic/650.
> > > >
> > > > This is the test that runs a workload while offlining and
> > > > onlining CPUs. My test client has 12 physical cores.
> > > >
> > > > The test appears to start normally, but then after a bit
> > > > the NFS server workload drops to zero and the NFS mount
> > > > disappears. I can't run programs (sudo, for example) on
> > > > the client. Can't log in, even on the console. The console
> > > > has a constant stream of "can't rotate log: Input/Output
> > > > error" type messages.
> >
> > I also observed this failure when I ran fstests using btrfs on my HDDs.
> > The failure reproduces almost every time.
> 
> I'm wondering what you get in dmesg. Any traces?
> 
> I've excluded the test from my runs for over a year now, due to some
> crash that I reported
> to the mm and cpu hotplug people here:
> 
> https://lore.kernel.org/linux-mm/CAL3q7H4AyrZ5erimDyO7mOVeppd5BeMw3CS=wGbzrMZrp56ktA@xxxxxxxxxxxxxx/
> 
> Unfortunately I had no reply from anyone who works on or maintains those
> subsystems.
> 
> It didn't happen very often, and I haven't tested again with recent kernels.

I've been testing with xfs/btrfs/ext4 nightly, and haven't seen any
problems with the last two.  There's some very infrequent log accounting
problem that is probably a regression from Dave's recent round of log
refactorings, so once we're clear of the write race corruption problem,
I intend to inquire about that.

Granted, I also don't have hundreds-of-cpus machines to test this kind of
stuff, so I don't know how well hotplug mania fares on big iron.

I don't think it's valid to remove a test from the auto group because it
uncovers bugs.  If test runner folks want to put it in their own exclude
lists for their own convenience, that's fine with me.

--D

> >
> > > >
> > > > I haven't looked further into this yet. Actually I'm not
> > > > quite sure where to start looking.
> > > >
> > > > I recently switched this client from a local /home to an
> > > > NFS-mounted one, and that's where the xfstests are built
> > > > and run from, fwiw.
> > >
> > > If most users complain about generic/650, I'd like to exclude g/650 from the
> > > "auto" default run group. Any more points?
> >
> > +1. I wish to remove it from the "auto" group. Since I cannot log in to the test
> > machine after the failure, I suggest putting it in the "dangerous" group.
> >
> > --
> > Shin'ichiro Kawasaki


