Re: [REGRESSION] 6.8-rc process is unable to exit and consumes a lot of cpu

Thorsten Leemhuis <regressions@xxxxxxxxxxxxx> · Sat, 24 Feb 2024 08:00:27 +0100

On 21.02.24 17:32, Linux regression tracking (Thorsten Leemhuis) wrote:
> [adding Al, Christian and a few lists to the list of recipients to
> ensure all affected parties are aware of this new report about a bug for
> which a fix is committed, but not yet mainlined]
> 
> Thread starts here:
> https://lore.kernel.org/all/6a150ddd-3267-4f89-81bd-6807700c57c1@xxxxxxxxxx/

[adding Linus now as well]

TWIMC, the quoted mail apparently did not get delivered to Al (I got a
"48 hours on the queue" warning from my hoster's MTA ~10 hours ago).

Ohh, and there is some suspicion that the problem Calvin[1] and Paul
(this thread, see quote below for the gist) encountered also causes
problems for bwrap (used by Flatpak)[2].
[1] https://lore.kernel.org/all/ZcKOGpTXnlmfplGR@xxxxxxxxx/
[2] https://github.com/containers/bubblewrap/issues/620

Christian, Linus, all that makes me wonder if it might be wise to pick
up the revert[3] Al queued directly in case Al does not submit a PR
today or tomorrow for -rc6.

[3]
https://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git/commit/?h=fixes&id=7e4a205fe56b9092f0143dad6aa5fee081139b09

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

#regzbot poke

> On 21.02.24 16:56, Paul Holzinger wrote:
>> Hi Thorsten,
>>
>> On 21/02/2024 15:42, Linux regression tracking (Thorsten Leemhuis) wrote:
>>> On 21.02.24 15:31, Paul Holzinger wrote:
>>>> On 21/02/2024 15:20, Paul Holzinger wrote:
>>>>> we are seeing problems with the 6.8-rc kernels[1] in our CI systems,
>>>>> we see random process timeouts across our test suite. It appears that
>>>>> sometimes a process is unable to exit, nothing happens even if we send
>>>>> SIGKILL and instead the process consumes a lot of cpu.
>>>> [...]
>>> Thx for the report.
>>>
>>> Warning, this is not my area of expertise, so this might send you in the
>>> totally wrong direction.
>>>
>>> I briefly checked lore for similar reports and noticed this one when I
>>> searched for shrink_dcache_parent:
>>>
>>> https://lore.kernel.org/all/ZcKOGpTXnlmfplGR@xxxxxxxxx/
>>
>>> Do you think that might be related? A fix for this is pending in vfs.git.
>>>
>> yes that does seem very relevant. Running the sysrq command I get the
>> same backtrace as the reporter there so I think it is fair to assume
>> this is the same bug. Looking forward to getting the fix into mainline.
> 
> FWIW, "the fix" afaics is 7e4a205fe56b90 ("Revert "get rid of
> DCACHE_GENOCIDE"") sitting in the 'fixes' branch of
> git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git for more than
> a week now.
> 
> I assume Al or Christian will send this to Linus soon. Christian in fact
> already mentioned that he plans to send another vfs fix to Linus, but
> that one iirc was sitting in another repo (but I might be mistaken there!).
> 
> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> --
> Everything you wanna know about Linux kernel regression tracking:
> https://linux-regtracking.leemhuis.info/about/#tldr
> If I did something stupid, please tell me, as explained on that page.
> 
> P.S.: let me update regzbot while at it:
> 
> #regzbot introduced 57851607326a2beef21e67f83f4f53a90df8445a.
> #regzbot fix: Revert "get rid of DCACHE_GENOCIDE"