Hi Zhang,

I just posted the patch that fixes this issue. Could you please try it and let me know how it works for you?

In my testing, I don't see any excessive TURs issued with this patch in place. It takes around 12 minutes to run mkfs.ext4 on a freshly created dm-zoned device on top of a 14TB SCSI drive. The same test on top of a 14TB SATA drive takes around 10 minutes. These are direct-attached drives on a physical server.

I didn't test this patch on a 4.19 kernel. If you have any findings about how it behaves there, do let me know.
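To make the idea concrete, here is a rough sketch of the direction described in my mail quoted below: keep the per-bio liveness test cheap and run the check_events() based probe only once before each reclaim pass. This is illustrative only; the helper name and details are not necessarily those of the posted patch.

#include <linux/blkdev.h>
#include "dm-zoned.h"	/* struct dmz_dev, DMZ_BDEV_DYING, dmz_dev_warn() */

/* Cheap test used in the bio mapping path: no media-change probing here. */
bool dmz_bdev_is_dying(struct dmz_dev *dmz_dev)
{
	if (blk_queue_dying(bdev_get_queue(dmz_dev->bdev))) {
		dmz_dev_warn(dmz_dev, "Backing device queue dying");
		dmz_dev->flags |= DMZ_BDEV_DYING;
	}
	return dmz_dev->flags & DMZ_BDEV_DYING;
}

/*
 * Heavier check that may trigger a TEST UNIT READY on SCSI drives via
 * check_events(); called once before a reclaim run instead of once per bio.
 * Returns true if the backing device is still usable.
 */
bool dmz_check_bdev(struct dmz_dev *dmz_dev)
{
	struct gendisk *disk = dmz_dev->bdev->bd_disk;

	if (disk->fops->check_events &&
	    (disk->fops->check_events(disk, 0) & DISK_EVENT_MEDIA_CHANGE)) {
		dmz_dev_warn(dmz_dev, "Backing device offline");
		dmz_dev->flags |= DMZ_BDEV_DYING;
	}
	return !dmz_bdev_is_dying(dmz_dev);
}

With this kind of split, submitting a bio never issues a blocking TUR, and an offlined backing device is still noticed no later than the start of the next reclaim run.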
Regards,
Dmitry

On Thu, 2019-10-31 at 16:20 +0800, zhangxiaoxu (A) wrote:
> Hi Dmitry, thanks for your reply.
> 
> I also tested it with the mainline kernel; it also takes more than 1 hour.
> My machine has 64 CPU cores and the disk is SATA.
> 
> During mkfs.ext4, I found that 'scsi_test_unit_ready' runs more than 1000 times
> per second from different kworkers. Every 'scsi_test_unit_ready' takes more than
> 200us, and the interval between them is less than 20us.
> So, I think your guess is right.
> 
> But there is another question: why does the 4.19 branch take more than 10 hours?
> I will keep working on it; if I find anything, I will reply to you.
> 
> Thanks.
> 
> My script:
> dmzadm --format /dev/sdi
> echo "0 21485322240 zoned /dev/sdi" | dmsetup create dmz-sdi
> date; mkfs.ext4 /dev/mapper/dmz-sdi; date
> 
> Mainline:
> [root@localhost ~]# uname -a
> Linux localhost 5.4.0-rc5 #1 SMP Thu Oct 31 11:41:20 CST 2019 aarch64 aarch64 aarch64 GNU/Linux
> 
> Thu Oct 31 13:58:55 CST 2019
> mke2fs 1.43.6 (29-Aug-2017)
> Discarding device blocks: done
> Creating filesystem with 2684354560 4k blocks and 335544320 inodes
> Filesystem UUID: e0d8e01e-efa8-47fd-a019-b184e66f65b0
> Superblock backups stored on blocks:
>         32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
>         4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
>         102400000, 214990848, 512000000, 550731776, 644972544, 1934917632,
>         2560000000
> 
> Allocating group tables: done
> Writing inode tables: done
> Creating journal (262144 blocks): done
> Writing superblocks and filesystem accounting information: done
> 
> Thu Oct 31 15:01:01 CST 2019
> 
> After deleting the 'check_events' call, on mainline:
> [root@localhost ~]# uname -a
> Linux localhost 5.4.0-rc5+ #2 SMP Thu Oct 31 15:07:36 CST 2019 aarch64 aarch64 aarch64 GNU/Linux
> Thu Oct 31 15:19:56 CST 2019
> mke2fs 1.43.6 (29-Aug-2017)
> Discarding device blocks: done
> Creating filesystem with 2684354560 4k blocks and 335544320 inodes
> Filesystem UUID: 735198e8-9df0-49fc-aaa8-23b0869dfa05
> Superblock backups stored on blocks:
>         32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
>         4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
>         102400000, 214990848, 512000000, 550731776, 644972544, 1934917632,
>         2560000000
> 
> Allocating group tables: done
> Writing inode tables: done
> Creating journal (262144 blocks): done
> Writing superblocks and filesystem accounting information: done
> 
> Thu Oct 31 15:30:51 CST 2019
> 
> On 2019/10/27 10:56, Dmitry Fomichev wrote:
> > Zhang,
> > 
> > I just did some testing of this scenario with a recent kernel that includes this patch.
> > 
> > The log below is a run in QEMU with 8 CPUs and it took 18.5 minutes to create the FS on a
> > 14TB ATA drive. Doing the same thing on bare metal with 32 CPUs takes 10.5 minutes in my
> > environment. However, when doing the same test with a SAS drive, the run takes 43 minutes.
> > This is not quite the degradation you are observing, but still a big performance hit.
> > 
> > Is the disk that you are using SAS or SATA?
> > 
> > My current guess is that the sd driver may generate some TEST UNIT READY commands to check if
> > the drive is really online as part of check_events() processing. For ATA drives, this is
> > nearly a NOP since all TURs are completed internally in libata. But in the SCSI case, these
> > blocking TURs are issued to the drive and certainly may degrade performance.
> > 
> > The check_events() call has been added to dmz_bdev_is_dying() because simply calling
> > blk_queue_dying() doesn't cover the situation where the drive gets offlined in the SCSI layer.
> > It might be possible to only call check_events() once before every reclaim run and to avoid
> > calling it in the I/O mapping path. If this works, the overhead would likely be acceptable.
> > I am going to take a look into this.
> > 
> > Regards,
> > Dmitry
> > 
> > [root@xxx dmz]# uname -a
> > Linux xxx 5.4.0-rc1-DMZ+ #1 SMP Fri Oct 11 11:23:13 PDT 2019 x86_64 x86_64 x86_64 GNU/Linux
> > [root@xxx dmz]# lsscsi
> > [0:0:0:0] disk QEMU QEMU HARDDISK 2.5+ /dev/sda
> > [1:0:0:0] zbc ATA HGST HSH721415AL T240 /dev/sdb
> > [root@xxx dmz]# ./setup-dmz test /dev/sdb
> > [root@xxx dmz]# cat /proc/kallsyms | grep dmz_bdev_is_dying
> > (standard input):90782:ffffffffc070a401 t dmz_bdev_is_dying.cold [dm_zoned]
> > (standard input):90849:ffffffffc0706e10 t dmz_bdev_is_dying [dm_zoned]
> > [root@xxx dmz]# time mkfs.ext4 /dev/mapper/test
> > mke2fs 1.44.6 (5-Mar-2019)
> > Discarding device blocks: done
> > Creating filesystem with 3660840960 4k blocks and 457605120 inodes
> > Filesystem UUID: 4536bacd-cfb5-41b2-b0bf-c2513e6e3360
> > Superblock backups stored on blocks:
> >         32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
> >         4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
> >         102400000, 214990848, 512000000, 550731776, 644972544, 1934917632,
> >         2560000000
> > 
> > Allocating group tables: done
> > Writing inode tables: done
> > Creating journal (262144 blocks): done
> > Writing superblocks and filesystem accounting information: done
> > 
> > 
> > real    18m30.867s
> > user    0m0.172s
> > sys     0m11.198s
> > 
> > 
> > On Sat, 2019-10-26 at 09:56 +0800, zhangxiaoxu (A) wrote:
> > > Hi all, when I run 'mkfs.ext4' on a dmz device based on a 10T SMR disk,
> > > it takes more than 10 hours after applying 75d66ffb48efb3 ("dm zoned:
> > > properly handle backing device failure").
> > > 
> > > After deleting the 'check_events' call in 'dmz_bdev_is_dying', it
> > > takes less than 12 minutes.
> > > 
> > > I tested this based on the 4.19 branch.
> > > Must we do the 'check_events' in the mapping path, reclaim, or metadata I/O?
> > > 
> > > Thanks.

--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/dm-devel
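For reference, the per-bio check discussed in the thread above looks roughly like the sketch below after commit 75d66ffb48efb3 (a reconstruction for illustration, not the exact commit text). Because it runs from the bio mapping path (as well as reclaim and metadata I/O), every bio that mkfs.ext4 submits can end up in check_events(). On an sd-managed SAS/SCSI disk that means a blocking TEST UNIT READY (for ATA drives libata completes the TUR internally), which matches the scsi_test_unit_ready rate measured above.

#include <linux/blkdev.h>
#include "dm-zoned.h"	/* struct dmz_dev, DMZ_BDEV_DYING, dmz_dev_warn() */

/* Approximate pre-fix logic: called for every mapped bio. */
bool dmz_bdev_is_dying(struct dmz_dev *dmz_dev)
{
	struct gendisk *disk = dmz_dev->bdev->bd_disk;

	if (!(dmz_dev->flags & DMZ_BDEV_DYING)) {
		if (blk_queue_dying(bdev_get_queue(dmz_dev->bdev))) {
			dmz_dev_warn(dmz_dev, "Backing device queue dying");
			dmz_dev->flags |= DMZ_BDEV_DYING;
		} else if (disk->fops->check_events &&
			   (disk->fops->check_events(disk, 0) &
			    DISK_EVENT_MEDIA_CHANGE)) {
			/*
			 * For sd devices this polls the drive and may issue a
			 * blocking TEST UNIT READY; doing it for every mapped
			 * bio is what slows mkfs.ext4 down in this thread.
			 */
			dmz_dev_warn(dmz_dev, "Backing device offline");
			dmz_dev->flags |= DMZ_BDEV_DYING;
		}
	}
	return dmz_dev->flags & DMZ_BDEV_DYING;
}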