Re: [REGRESSION] v5.17-rc1+: FIFREEZE ioctl system call hangs

Hi,

On 2022-08-17 08:19, Song Liu wrote:
> On Mon, Aug 15, 2022 at 8:46 AM Vishal Verma <vverma@xxxxxxxxxxxxxxxx> wrote:
>>
>> Just saw this. I’m trying to understand whether this happens only
>> on md array or individual nvme drives (without any raid) too? The
>> commit you pointed added REQ_NOWAIT for md based arrays, but if it
>> is happening on individual nvme drives then that could point to
>> something with REQ_NOWAIT I think.
>
> Agreed with this analysis.

I bisected again, this time I tested against the single nvme device.

I did it twice and ended up with the same result both times:

> git bisect start
> # good: [8bb7eca972ad531c9b149c0a51ab43a417385813] Linux 5.15
> git bisect good 8bb7eca972ad531c9b149c0a51ab43a417385813
> # bad: [df0cc57e057f18e44dac8e6c18aba47ab53202f9] Linux 5.16
> git bisect bad df0cc57e057f18e44dac8e6c18aba47ab53202f9
> # good: [2219b0ceefe835b92a8a74a73fe964aa052742a2] Merge tag 'soc-5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
> git bisect good 2219b0ceefe835b92a8a74a73fe964aa052742a2
> # good: [206825f50f908771934e1fba2bfc2e1f1138b36a] Merge tag 'mtd/for-5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux
> git bisect good 206825f50f908771934e1fba2bfc2e1f1138b36a
> # bad: [4e1fddc98d2585ddd4792b5e44433dcee7ece001] tcp_cubic: fix spurious Hystart ACK train detections for not-cwnd-limited flows
> git bisect bad 4e1fddc98d2585ddd4792b5e44433dcee7ece001
> # good: [dbf49896187fd58c577fa1574a338e4f3672b4b2] Merge branch 'akpm' (patches from Andrew)
> git bisect good dbf49896187fd58c577fa1574a338e4f3672b4b2
> # good: [0ecca62beb12eeb13965ed602905c8bf53ac93d0] Merge tag 'ceph-for-5.16-rc1' of git://github.com/ceph/ceph-client
> git bisect good 0ecca62beb12eeb13965ed602905c8bf53ac93d0
> # bad: [7d5775d49e4a488bc8a07e5abb2b71a4c28aadbb] Merge tag 'printk-for-5.16-fixup' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux
> git bisect bad 7d5775d49e4a488bc8a07e5abb2b71a4c28aadbb
> # good: [35c8fad4a703fdfa009ed274f80bb64b49314cde] Merge tag 'perf-tools-for-v5.16-2021-11-13' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux
> git bisect good 35c8fad4a703fdfa009ed274f80bb64b49314cde
> # good: [6ea45c57dc176dde529ab5d7c4b3f20e52a2bd82] Merge tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm
> git bisect good 6ea45c57dc176dde529ab5d7c4b3f20e52a2bd82
> # bad: [fa55b7dcdc43c1aa1ba12bca9d2dd4318c2a0dbf] Linux 5.16-rc1
> git bisect bad fa55b7dcdc43c1aa1ba12bca9d2dd4318c2a0dbf
> # good: [475c3f599582a34e189f047ed3fb7e90a295ea5b] sh: fix READ/WRITE redefinition warnings
> git bisect good 475c3f599582a34e189f047ed3fb7e90a295ea5b
> # good: [c3b68c27f58a07130382f3fa6320c3652ad76f15] Merge tag 'for-5.16/parisc-3' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
> git bisect good c3b68c27f58a07130382f3fa6320c3652ad76f15
> # good: [4a6b35b3b3f28df81fea931dc77c4c229cbdb5b2] xfs: sync xfs_btree_split macros with userspace libxfs
> git bisect good 4a6b35b3b3f28df81fea931dc77c4c229cbdb5b2
> # good: [dee2b702bcf067d7b6b62c18bdd060ff0810a800] kconfig: Add support for -Wimplicit-fallthrough
> git bisect good dee2b702bcf067d7b6b62c18bdd060ff0810a800
> # first bad commit: [fa55b7dcdc43c1aa1ba12bca9d2dd4318c2a0dbf] Linux 5.16-rc1

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=fa55b7dcdc43c1aa1ba12bca9d2dd4318c2a0dbf

...but this doesn't make any sense, right?

However, I cannot reproduce it with the preceding commit: dee2b702bcf0 didn't freeze during my 10 test runs. But with fa55b7dcdc (or any later commit), the system freezes on _every_ test run?!

I checked out 1bd297988b75 which never failed before, changed Makefile to PATCHLEVEL=16 and EXTRAVERSION=-rc1 and guess what: It's now failing, too.
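For anyone who wants to repeat this version experiment, here is a minimal sketch of the Makefile change. It assumes GNU sed and the usual `PATCHLEVEL = ...` / `EXTRAVERSION = ...` assignment lines in the top-level Makefile; `set_kernel_version` is a hypothetical helper name, not something from the kernel tree.

```shell
# Sketch only: rewrite the version fields of a kernel top-level Makefile.
# set_kernel_version is a hypothetical helper; assumes GNU sed (-i without suffix).
set_kernel_version() {
    makefile=$1
    patchlevel=$2
    extraversion=$3
    # Touch only the two assignment lines; VERSION and SUBLEVEL stay as they are.
    sed -i \
        -e "s/^PATCHLEVEL =.*/PATCHLEVEL = ${patchlevel}/" \
        -e "s/^EXTRAVERSION =.*/EXTRAVERSION = ${extraversion}/" \
        "$makefile"
}
```

Called as `set_kernel_version Makefile 16 -rc1` in the source tree, after which `make kernelversion` should print the faked version. Remember to undo the change (e.g. `git checkout Makefile`) before moving to the next revision.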

So this sounds like some code changes its behavior when the kernel version (KV) is >= 5.16-rc1. Is that possible?
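In principle it is possible: the release string is visible to userspace through uname, so any program could branch on it. The following is only an illustration of such a check, not a claim that anything in my setup actually does this; `kver_at_least` is a made-up name.

```shell
# Sketch only: compare a release string such as "5.16.0-rc1" against
# a wanted major.minor. kver_at_least is a hypothetical helper.
kver_at_least() {
    release=$1
    want_major=$2
    want_minor=$3
    major=${release%%.*}          # text before the first dot
    rest=${release#*.}            # text after the first dot
    minor=${rest%%[!0-9]*}        # leading digits of the remainder
    [ "$major" -gt "$want_major" ] ||
        { [ "$major" -eq "$want_major" ] && [ "$minor" -ge "$want_minor" ]; }
}
```

For example, `kver_at_least "$(uname -r)" 5 16 && echo "kernel reports >= 5.16"` would take a different path on a kernel whose Makefile was edited as described above, regardless of the code it was built from.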

Anyway, I started to test v5.10 (with PATCHLEVEL=16 and EXTRAVERSION=-rc1 set), which worked, so I started another bisect session in which I set the KV of every kernel to 5.16-rc1.

I'll post my findings when this session is completed.


> I am not able to reproduce this on 5.19+ kernel. I have:
>
> [root@eth50-1 ~]# lsblk
> NAME    MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINT
> sr0      11:0    1 1024M  0 rom
> vda     253:0    0   32G  0 disk
> ├─vda1  253:1    0    2G  0 part  /boot
> └─vda2  253:2    0   30G  0 part  /
> nvme0n1 259:0    0    4G  0 disk
> └─md0     9:0    0   12G  0 raid5 /root/mnt
> nvme2n1 259:1    0    4G  0 disk
> └─md0     9:0    0   12G  0 raid5 /root/mnt
> nvme3n1 259:2    0    4G  0 disk
> └─md0     9:0    0   12G  0 raid5 /root/mnt
> nvme1n1 259:3    0    4G  0 disk
> └─md0     9:0    0   12G  0 raid5 /root/mnt
> [root@eth50-1 ~]# for x in {1..100} ; do fsfreeze --unfreeze /root/mnt ; fsfreeze --freeze /root/mnt ; done
>
> Did I miss something?

Well, your reproducer doesn't work. As I wrote in my initial mail, executing `fsfreeze --freeze ...` directly after boot doesn't fail for me either. The device/array must have seen some I/O to trigger this.

To be more precise:

During my current bisect session (where I set KV to 5.16-rc1 for all kernels), I noticed that my 'reproducer' failed:

To trigger the problem, it is not enough to create random I/O by copying some files for example.

I am using mysqld (MariaDB 10.6.8) and restoring ~20GB of SQL dumps -- somehow this triggers the problem reliably. The mysqld is using O_DIRECT (https://mariadb.com/kb/en/innodb-system-variables/#innodb_flush_method) -- maybe Direct I/O is the trigger.

This process usually takes ~620s on my test system where I am experiencing the problem. After the import I called `fsfreeze --freeze ...` against the mount point used by mysqld. When this command did not return (= fsfreeze was hanging), I marked the revision as bad.
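The "did not return" check can be automated roughly like this. It is a sketch under assumptions, not the exact procedure I ran: `finished_within` is a hypothetical helper, and the deadline is arbitrary. The command runs in the background and is polled, so the test shell itself never blocks on it; a process stuck in the ioctl may not react to signals, so the helper does not try to kill it.

```shell
# Sketch only: finished_within SECONDS CMD [ARGS...]
# Returns 0 if CMD exits within SECONDS, 1 if it is still running
# at the deadline (treated as a hang). Hypothetical helper.
finished_within() {
    deadline=$1
    shift
    "$@" &                        # run in background so we never block on it
    pid=$!
    elapsed=0
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$elapsed" -ge "$deadline" ]; then
            return 1              # still alive at the deadline
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0                      # process has gone away in time
}
```

Usage would be something like `finished_within 60 fsfreeze --freeze /path/to/mountpoint` after the import has completed, with a non-zero status meaning "mark this revision bad". The mount point here is a placeholder.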

Since setting KV in all kernels to "5.16-rc1", I noticed that the import process sometimes "froze" -- mysqld was still running and responsive (that's not the case when fsfreeze hangs, for example) and `SHOW PROCESSLIST` showed the running imports with a still increasing time counter. However, no data was read or written anymore, although the fsfreeze command still works when this happens. Anyway, I marked revisions showing this behavior as bad, too.

I'll post my results once I have finished this bisect session.
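One way to confirm "no data read or written" from outside mysqld is to sample the per-device counters in /proc/diskstats twice and compare them. A minimal sketch follows; `io_sectors` is a hypothetical helper, and the optional second argument exists only so it can be pointed at a saved copy of the file.

```shell
# Sketch only: io_sectors DEVICE [STATS_FILE]
# Prints "sectors_read sectors_written" for DEVICE.
# In /proc/diskstats, field 3 is the device name, field 6 the sectors
# read and field 10 the sectors written. Hypothetical helper.
io_sectors() {
    dev=$1
    stats=${2:-/proc/diskstats}
    awk -v dev="$dev" '$3 == dev { print $6, $10 }' "$stats"
}
```

For example, `a=$(io_sectors nvme0n1); sleep 30; b=$(io_sectors nvme0n1); [ "$a" = "$b" ] && echo "no I/O completed"`; the device name and the 30 second interval are placeholders.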


--
Regards,
Thomas



