Re: Intermittent storage (dm-crypt?) freeze - regression 6.4->6.5

Jan Kara <jack@xxxxxxx> · Tue, 31 Oct 2023 15:01:36 +0100

On Tue 31-10-23 04:48:44, Marek Marczykowski-Górecki wrote:
> On Mon, Oct 30, 2023 at 06:50:35PM +0100, Mikulas Patocka wrote:
> > On Mon, 30 Oct 2023, Marek Marczykowski-Górecki wrote:
> > > Then retried with order=PAGE_ALLOC_COSTLY_ORDER and
> > > PAGE_ALLOC_COSTLY_ORDER back at 3, and also got similar crash.
> > 
> > So, does it mean that even allocating with order=PAGE_ALLOC_COSTLY_ORDER 
> > isn't safe?
> 
> That seems to be another bug, see below.
> 
> > Try enabling CONFIG_DEBUG_VM (it also needs CONFIG_DEBUG_KERNEL) and try 
> > to provoke a similar crash. Let's see if it crashes on one of the 
> > VM_BUG_ON statements.
> 
> This was very interesting idea. With this, immediately after login I get
> the crash like below. Which makes sense, as this is when pulseaudio
> starts and opens /dev/snd/*. I then tried with the dm-crypt commit
> reverted and still got the crash! But, after blacklisting snd_pcm,
> there is no BUG splat, but the storage freeze still happens on vanilla
> 6.5.6.

OK, great. Thanks for testing.

<snip snd_pcm bug>

> Plain 6.5.6 (so order = MAX_ORDER - 1, and PAGE_ALLOC_COSTLY_ORDER=3), in frozen state:
> [  143.196106] task:blkdiscard      state:D stack:13672 pid:4884  ppid:2025   flags:0x00000002
> [  143.196130] Call Trace:
> [  143.196139]  <TASK>
> [  143.196147]  __schedule+0x30e/0x8b0
> [  143.196162]  schedule+0x59/0xb0
> [  143.196175]  schedule_timeout+0x14c/0x160
> [  143.196193]  io_schedule_timeout+0x4b/0x70
> [  143.196207]  wait_for_completion_io+0x81/0x130
> [  143.196226]  submit_bio_wait+0x5c/0x90
> [  143.196241]  blkdev_issue_discard+0x94/0xe0
> [  143.196260]  blkdev_common_ioctl+0x79e/0x9c0
> [  143.196279]  blkdev_ioctl+0xc7/0x270
> [  143.196293]  __x64_sys_ioctl+0x8f/0xd0
> [  143.196310]  do_syscall_64+0x3c/0x90

So this shows that a bio was submitted and never ran to completion.

> for f in $(grep -l crypt /proc/*/comm); do head $f ${f/comm/stack}; done
<snip some backtraces>

So this shows the dm-crypt layer isn't stuck anywhere. The allocation path
itself doesn't seem to be locking up, looping, or anything like that.
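
In case it is handy for collecting the same information into a report, below
is a rough Python equivalent of the shell one-liner quoted above. It is only
a sketch: the function name stacks_for() and its proc_root parameter are made
up for illustration, and reading /proc/<pid>/stack generally requires root.

```python
import os


def stacks_for(pattern, proc_root="/proc"):
    """Return {pid: (comm, stack)} for tasks whose comm contains pattern.

    stack is None when the stack file cannot be read (not root, kernel
    built without stack tracing, or the task exited in the meantime).
    """
    result = {}
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return result  # no procfs at this location
    for name in entries:
        if not name.isdigit():
            continue  # only look at per-PID directories
        task_dir = os.path.join(proc_root, name)
        try:
            with open(os.path.join(task_dir, "comm")) as f:
                comm = f.read().strip()
        except OSError:
            continue  # task went away while we were scanning
        if pattern not in comm:
            continue
        try:
            with open(os.path.join(task_dir, "stack")) as f:
                stack = f.read()
        except OSError:
            stack = None
        result[int(name)] = (comm, stack)
    return result


if __name__ == "__main__":
    for pid, (comm, stack) in sorted(stacks_for("crypt").items()):
        print(f"== {pid} {comm}")
        print(stack if stack is not None else "(stack not readable)")
```

Unlike the shell glob, this only walks the numeric per-PID directories, so
/proc/self and /proc/thread-self are not reported a second time.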

> Then tried:
>  - PAGE_ALLOC_COSTLY_ORDER=4, order=4 - cannot reproduce,
>  - PAGE_ALLOC_COSTLY_ORDER=4, order=5 - cannot reproduce,
>  - PAGE_ALLOC_COSTLY_ORDER=4, order=6 - freeze rather quickly
> 
> I've retried the PAGE_ALLOC_COSTLY_ORDER=4,order=5 case several times
> and I can't reproduce the issue there. I'm confused...

And this kind of confirms that allocations > PAGE_ALLOC_COSTLY_ORDER
causing hangs is most likely just a coincidence. Rather, something either in
the block layer or in the storage driver has problems handling bios with
sufficiently high-order pages attached. This is going to be a bit painful
to debug, I'm afraid. How long does it take for you to trigger the hang?
I'm asking to get a rough estimate of how heavy tracing we can afford so
that we don't overwhelm the system...

								Honza

-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR