On 5/15/23 15:06, Hannes Reinecke wrote:
> On 5/12/23 19:41, Ming Lin wrote:
>> On Thu, May 11, 2023 at 11:56 AM Hannes Reinecke <hare@xxxxxxx> wrote:
>>> On 5/11/23 20:41, Ming Lin wrote:
>>>> Hi list,
>>>>
>>>> I have an application that needs to use buffered I/O to access an SMR disk for good performance.
>>>>
>>>> From "ZBD Support Restrictions" at https://zonedstorage.io/docs/linux/overview
>>>>
>>>> "Direct IO Writes: The kernel page cache does not guarantee that cached dirty pages will be flushed to a block device in sequential sector order. This can lead to unaligned write errors if an application uses buffered writes to write to the sequential write required zones of a device. To avoid this pitfall, applications that directly use a zoned block device without a file system should always use direct I/O operations to write to the sequential write required zones of a host-managed disk (that is, they should issue write() system calls with a block device file open that uses the O_DIRECT flag)."
>>>>
>>>> A raw zoned block device only supports direct I/O.
>>>>
>>>> Does dm-zoned support buffered I/O (without O_DIRECT)?
>>>>
>>> Yes. But I _think_ the above paragraph is ever so slightly outdated, as we've spent quite a lot of time fixing sequential writes (cf. blk-zoned etc.). So while dm-zoned is using buffered writes, there won't be any sequential write issues.
>>>
>>> At least, I have not uncovered any of those during testing.
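For reference, writing directly to the sequential zones of the raw device looks something like the commands below. This is only an illustration: it reuses /dev/sdx and the 256 MiB zone size reported further down in this thread, picks zone 1 arbitrarily, and writes straight to the raw disk, so do not run it on a device dm-zoned is already using.

  # Reset the write pointer of zone 1 (zone start = sector 524288), then fill
  # that zone with 256 MiB of zeros using sequential 1 MiB direct I/O writes.
  blkzone reset --offset 524288 --count 1 /dev/sdx
  dd if=/dev/zero of=/dev/sdx oflag=direct bs=1M seek=256 count=256

Doing the same thing with buffered writes can fail with unaligned write errors because the page cache may write the dirty pages back out of order, which is what the paragraph quoted above is warning about.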
>> Hi Hannes,
>>
>> I am using the 5.10.90 kernel and the SMR disk capacity is 24 TB.
>> I followed the guide below to create a dm-zoned device on top of the SMR disk:
>> https://zonedstorage.io/docs/linux/dm
>>
>> Then I ran mkfs.ext4 /dev/dm-0, but it seems to hang.
>> Any ideas?
>>
>> [37552.217472] device-mapper: uevent: version 1.0.3
>> [37552.217549] device-mapper: ioctl: 4.43.0-ioctl (2020-10-01) initialised: dm-devel@xxxxxxxxxx
>> [37575.608500] device-mapper: zoned metadata: (dmz-5000cca2bfc0db21): DM-Zoned metadata version 2
>> [37575.608502] device-mapper: zoned metadata: (sdx): Host-managed zoned block device
>> [37575.608503] device-mapper: zoned metadata: (sdx): 50782535680 512-byte logical sectors (offset 0)
>> [37575.608503] device-mapper: zoned metadata: (sdx): 96860 zones of 524288 512-byte logical sectors (offset 0)
>> [37575.608504] device-mapper: zoned metadata: (dmz-5000cca2bfc0db21): 96860 zones of 524288 512-byte logical sectors
>> [37575.609204] device-mapper: zoned: (dmz-5000cca2bfc0db21): Target device: 50771001344 512-byte logical sectors (6346375168 blocks)
>> [38101.543353] INFO: task mkfs.ext4:1411791 blocked for more than 122 seconds.
>> [38101.543380] Tainted: G OE 5.10.90.bm.1-amd64+ #2
>> [38101.543395] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [38101.543411] task:mkfs.ext4 state:D stack: 0 pid:1411791 ppid:1388660 flags:0x00004000
>> [38101.543415] Call Trace:
>> [38101.543422]  __schedule+0x3fd/0x760
>> [38101.543425]  schedule+0x46/0xb0
>> [38101.543426]  io_schedule+0x12/0x40
>> [38101.543429]  wait_on_page_bit+0x133/0x270
>> [38101.543431]  ? __page_cache_alloc+0xa0/0xa0
>> [38101.543432]  wait_on_page_writeback+0x25/0x70
>> [38101.543434]  __filemap_fdatawait_range+0x86/0xf0
>> [38101.543435]  file_write_and_wait_range+0x74/0xb0
>> [38101.543438]  blkdev_fsync+0x16/0x40
>> [38101.543441]  do_fsync+0x38/0x60
>> [38101.543442]  __x64_sys_fsync+0x10/0x20
>> [38101.543445]  do_syscall_64+0x2d/0x70
>> [38101.543446]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>
>> ===
>> Below are the steps I did:
>>
>> root@smr_dev:~# blkzone reset /dev/sdx
>>
>> root@smr_dev:~# dmzadm --format /dev/sdx
>> /dev/sdx: 50782535680 512-byte sectors (24215 GiB)
>> Host-managed device
>> 96860 zones, offset 0
>> 96860 zones of 524288 512-byte sectors (256 MiB)
>> 65536 4KB data blocks per zone
>> Resetting sequential zones
>> Writing primary metadata set
>> Writing mapping table
>> Writing bitmap blocks
>> Writing super block to sdx block 0
>> Writing secondary metadata set
>> Writing mapping table
>> Writing bitmap blocks
>> Writing super block to sdx block 196608
>> Syncing disk
>> Done.
>>
> Hmm. I don't actually see how many CMR zones the drive has.
>
>> root@smr_dev:~# dmzadm --start /dev/sdx
>> /dev/sdx: 50782535680 512-byte sectors (24215 GiB)
>> Host-managed device
>> 96860 zones, offset 0
>> 96860 zones of 524288 512-byte sectors (256 MiB)
>> 65536 4KB data blocks per zone
>> sdx: starting dmz-5000cca2bfc0db21, metadata ver. 2, uuid 7495e21a-23d9-49f4-832a-76b32136078b
>>
>> root@smr_dev:~# mkfs.ext4 /dev/dm-0
>> mke2fs 1.44.5 (15-Dec-2018)
>> Discarding device blocks: done
>> Creating filesystem with 6346375168 4k blocks and 396648448 inodes
>> Filesystem UUID: c47de06d-6cf6-4a85-9502-7830ca2f4526
>> Superblock backups stored on blocks:
>> 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
>> 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
>> 102400000, 214990848, 512000000, 550731776, 644972544, 1934917632,
>> 2560000000, 3855122432, 5804752896
>>
>> Allocating group tables: done
>> Writing inode tables: done
>> Creating journal (262144 blocks): done
>> Writing superblocks and filesystem accounting information:
>>
>> ===
>> At another terminal:
>>
>> root@smr_dev:~# ps aux | grep mkfs.ext4
>> root 1411791 2.8 0.0 30992 19864 pts/1 D+ 01:30 0:01 mkfs.ext4 /dev/dm-0
>> root 1413640 0.0 0.0 13972 2496 pts/0 S+ 01:31 0:00 grep mkfs.ext4
>>
>> root@smr_dev:~# cat /proc/1411791/stack
>> [<0>] wait_on_page_bit+0x133/0x270
>> [<0>] wait_on_page_writeback+0x25/0x70
>> [<0>] __filemap_fdatawait_range+0x86/0xf0
>> [<0>] file_write_and_wait_range+0x74/0xb0
>> [<0>] blkdev_fsync+0x16/0x40
>> [<0>] do_fsync+0x38/0x60
>> [<0>] __x64_sys_fsync+0x10/0x20
>> [<0>] do_syscall_64+0x2d/0x70
>> [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

Not sure if this is a bug, but doing a simple mkfs.ext4 on dm-zoned with a large SMR disk can take *a very loooooong* time. This is because mkfs.ext4 does a lot of random writes all over the place, so just running it pushes dm-zoned into heavy GC mode...

To speed things up (and improve runtime performance), use the packed-metadata format:

mkfs.ext4 -E packed_meta_blocks=1

Or do a mkfs.xfs to compare and see how much faster it is.
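For example, with the dm-zoned device name from your log (illustration only; adjust the device path as needed):

  # pack the allocation bitmaps, inode tables and journal at the start of the device
  mkfs.ext4 -E packed_meta_blocks=1 /dev/dm-0

  # or try XFS for comparison
  mkfs.xfs /dev/dm-0

packed_meta_blocks=1 places the filesystem metadata at the beginning of the device instead of spreading it across every block group, so the bulk of the mkfs writes become sequential and dm-zoned should have far less reclaim work to do.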
>
> But that just means that we're waiting for I/O to complete; there must be another thread processing the I/O.
> If this is the only active thread in your system, something is seriously hosed.
>
> But I guess I don't need to tell _you_ that :-)
>
> Cheers,
>
> Hannes

-- 
Damien Le Moal
Western Digital Research

--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://listman.redhat.com/mailman/listinfo/dm-devel