Re: [PATCH 5.15 000/183] 5.15.134-rc1 review

On Sat, Oct 07, 2023 at 09:22:55PM -0400, Joel Fernandes wrote:
> On Fri, Oct 6, 2023 at 2:20 PM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> >
> > On Fri, Oct 06, 2023 at 01:57:14PM -0400, Liam R. Howlett wrote:
> > > * Paul E. McKenney <paulmck@xxxxxxxxxx> [231006 12:47]:
> > > > On Fri, Oct 06, 2023 at 12:20:38PM -0400, Liam R. Howlett wrote:
> > > > > * Naresh Kamboju <naresh.kamboju@xxxxxxxxxx> [231005 13:49]:
> > > > > > On Wed, 4 Oct 2023 at 23:33, Greg Kroah-Hartman
> > > > > > <gregkh@xxxxxxxxxxxxxxxxxxx> wrote:
> > > > > > >
> > > > > > > This is the start of the stable review cycle for the 5.15.134 release.
> > > > > > > There are 183 patches in this series, all will be posted as a response
> > > > > > > to this one.  If anyone has any issues with these being applied, please
> > > > > > > let me know.
> > > > > > >
> > > > > > > Responses should be made by Fri, 06 Oct 2023 17:51:12 +0000.
> > > > > > > Anything received after that time might be too late.
> > > > > > >
> > > > > > > The whole patch series can be found in one patch at:
> > > > > > >         https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.15.134-rc1.gz
> > > > > > > or in the git tree and branch at:
> > > > > > >         git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.15.y
> > > > > > > and the diffstat can be found below.
> > > > > > >
> > > > > > > thanks,
> > > > > > >
> > > > > > > greg k-h
> > > > > >
> > > > > > Results from Linaro’s test farm.
> > > > > > Regressions on x86.
> > > > > >
> > > > > > Following kernel warning noticed on x86 while booting stable-rc 5.15.134-rc1
> > > > > > with selftest merge config built kernel.
> > > > > >
> > > > > > Reported-by: Linux Kernel Functional Testing <lkft@xxxxxxxxxx>
> > > > > >
> > > > > > Anyone noticed this kernel warning ?
> > > > > >
> > > > > > This is always reproducible while booting x86 with a given config.
> > > > >
> > > > > From that config:
> > > > > #
> > > > > # RCU Subsystem
> > > > > #
> > > > > CONFIG_TREE_RCU=y
> > > > > # CONFIG_RCU_EXPERT is not set
> > > > > CONFIG_SRCU=y
> > > > > CONFIG_TREE_SRCU=y
> > > > > CONFIG_TASKS_RCU_GENERIC=y
> > > > > CONFIG_TASKS_RUDE_RCU=y
> > > > > CONFIG_TASKS_TRACE_RCU=y
> > > > > CONFIG_RCU_STALL_COMMON=y
> > > > > CONFIG_RCU_NEED_SEGCBLIST=y
> > > > > # end of RCU Subsystem
> > > > >
> > > > > #
> > > > > # RCU Debugging
> > > > > #
> > > > > CONFIG_PROVE_RCU=y
> > > > > # CONFIG_RCU_SCALE_TEST is not set
> > > > > # CONFIG_RCU_TORTURE_TEST is not set
> > > > > # CONFIG_RCU_REF_SCALE_TEST is not set
> > > > > CONFIG_RCU_CPU_STALL_TIMEOUT=21
> > > > > CONFIG_RCU_TRACE=y
> > > > > # CONFIG_RCU_EQS_DEBUG is not set
> > > > > # end of RCU Debugging
> > > > >
> > > > >
> > > > > >
> > > > > > x86 boot log:
> > > > > > -----
> > > > > > [    0.000000] Linux version 5.15.134-rc1 (tuxmake@tuxmake)
> > > > > > (x86_64-linux-gnu-gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils
> > > > > > for Debian) 2.40) #1 SMP @1696443178
> > > > > > ...
> > > > > > [    1.480701] ------------[ cut here ]------------
> > > > > > [    1.481296] WARNING: CPU: 0 PID: 13 at kernel/rcu/tasks.h:958
> > > > > > trc_inspect_reader+0x80/0xb0
> > > > > > [    1.481296] Modules linked in:
> > > > > > [    1.481296] CPU: 0 PID: 13 Comm: rcu_tasks_trace Not tainted 5.15.134-rc1 #1
> > > > > > [    1.481296] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
> > > > > > 2.5 11/26/2020
> > > > > > [    1.481296] RIP: 0010:trc_inspect_reader+0x80/0xb0
> > > > >
> > > > > This function has changed a lot, including the dropping of this
> > > > > WARN_ON_ONCE().  The warning was replaced in 897ba84dc5aa ("rcu-tasks:
> > > > > Handle idle tasks for recently offlined CPUs") with something that looks
> > > > > equivalent so I'm not sure why it would not trigger in newer revisions.
> > > > >
> > > > > Obviously the behaviour I changed was the test for the task being idle.
> > > > > I am not sure how best to short-circuit that test from happening during
> > > > > boot as I am not familiar with the RCU code.
> > > >
> > > > The usual test for RCU's notion of early boot being completed is
> > > > (rcu_scheduler_active != RCU_SCHEDULER_INIT).
> > > >
> > > > Except that "ofl" should always be false that early in boot, at least
> > > > in mainline.
> > >
> > > Is this still true in the final version of the patch where we set the
> > > boot task as !idle until just before the early boot is finished?  I
> > > wouldn't think of this as 'early in boot' anymore as much as the entire
> > > kernel setup.  Maybe we need to shorten the time we stay in !idle mode
> > > for earlier kernels?
> >
> > In mainline, the ofl variable is defined as cpu_is_offline(cpu), and
> > during boot, the boot CPU is guaranteed to be online.  (As opposed to
> > the boot CPU's idle-task state.)
> >
> > > How frequent is this function called?  We could check something for
> > > early boot... or track down where the cpu is put online and restore idle
> > > before that happens?
> >
> > Once per RCU Tasks Trace grace period per reader seen to be blocking
> > that grace period.  Its performance is an issue, but not to anywhere
> > near the same extent as (say) rcu_read_lock_trace().
> >
> > > > > It's also worth noting that the bug this fixes wasn't exposed until the
> > > > > maple tree (added in v6.1) was used for the IRQ descriptors (added in
> > > > > v6.5).
> > > >
> > > > Lots of latent bugs, to be sure, even with rcutorture.  :-/
> > >
> > > The Right Thing is to fix the bug all the way back to the introduction,
> > > but what fallout makes the backport less desirable than living with the
> > > unexposed bug?
> >
> > You are quite right that it is possible for the risk of a backport to
> > exceed the risk of the original bug.
> >
> > I defer to Joel (CCed) on how best to resolve this in -stable.
> 
> Maybe I am missing something but this issue should also be happening
> in mainline right?
> 
> Even though mainline has 897ba84dc5aa ("rcu-tasks: Handle idle tasks
> for recently offlined CPUs"), the warning should still be happening
> due to Liam's "kernel/sched: Modify initial boot task idle setup"
> because the warning is just rearranged a bit but essentially the same.
> 
> IMHO, the right thing to do then is to drop Liam's patch from 5.15 and
> fix it in mainline (using the ideas described in this thread), then
> backport both that new fix and Liam's patch to 5.15.
> 
> Or is there a reason this warning does not show up on the mainline?
> 
> My impression is that dropping Liam's patch for the stable release and
> revisiting it later is a better approach since tiny RCU is used way
> less in the wild than tree/tasks RCU. Thoughts?

I think that this one is strange enough that we need to write down the
situation in detail, make sure we have all the corner cases covered in
both mainline and -stable, and decide what to do from there.

Yes, I know, this email thread contains much of this information, but
a little organizing of it would be good.

Would you like to put that together, or should I?  If me, I will get
a draft out by the end of this coming Tuesday, Pacific Time.

							Thanx, Paul
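
As a starting point for writing the situation down, here is a minimal
user-space sketch of the two tests discussed in this thread: the early-boot
test (rcu_scheduler_active != RCU_SCHEDULER_INIT) and a warning condition
built from "ofl" and the task's idle state.  It is a model only, not the
code in kernel/rcu/tasks.h.  The enum values, the function names
past_rcu_scheduler_init(), reader_check_warns() and gated_check_warns(),
and the exact way "ofl" and the idle test are combined are assumptions made
for illustration.  The gated variant in particular is hypothetical and is
not a proposed or tested fix.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for the kernel names used in this thread; values are illustrative. */
enum rcu_sched_state {
	RCU_SCHEDULER_INACTIVE,
	RCU_SCHEDULER_INIT,
	RCU_SCHEDULER_RUNNING
};

/* The early-boot test quoted in the thread:
 * (rcu_scheduler_active != RCU_SCHEDULER_INIT). */
bool past_rcu_scheduler_init(enum rcu_sched_state rcu_scheduler_active)
{
	return rcu_scheduler_active != RCU_SCHEDULER_INIT;
}

/* ASSUMPTION: the warning fires when the reader's CPU is offline ("ofl")
 * while the task seen running there is not an idle task. */
bool reader_check_warns(bool ofl, bool task_is_idle)
{
	return ofl && !task_is_idle;
}

/* HYPOTHETICAL variant that skips the check while RCU is still in its
 * RCU_SCHEDULER_INIT phase.  Illustration only. */
bool gated_check_warns(enum rcu_sched_state state, bool ofl, bool task_is_idle)
{
	return past_rcu_scheduler_init(state) &&
	       reader_check_warns(ofl, task_is_idle);
}
```

In this model, an online boot CPU (ofl == false) never warns regardless of
the boot task's idle state, which matches the observation above that "ofl"
should be false that early in boot in mainline.  Whether that also holds for
5.15 with the backported patch is exactly the open question.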


