On 5/15/23 15:06, Hannes Reinecke wrote:
> On 5/12/23 19:41, Ming Lin wrote:
>> On Thu, May 11, 2023 at 11:56 AM Hannes Reinecke <hare@xxxxxxx> wrote:
>>> On 5/11/23 20:41, Ming Lin wrote:
>>>> Hi list,
>>>>
>>>> I have an application that needs to use buffered I/O to access an SMR disk for good performance.
>>>>
>>>> From "ZBD Support Restrictions" at https://zonedstorage.io/docs/linux/overview
>>>>
>>>> "Direct IO Writes: The kernel page cache does not guarantee that cached dirty pages will be flushed to a block device in sequential sector order. This can lead to unaligned write errors if an application uses buffered writes to write to the sequential write required zones of a device. To avoid this pitfall, applications that directly use a zoned block device without a file system should always use direct I/O operations to write to the sequential write required zones of a host-managed disk (that is, they should issue write() system calls with a block device file open that uses the O_DIRECT flag)."
>>>>
>>>> A raw zoned block device only supports direct I/O.
>>>>
>>>> Does dm-zoned support buffered I/O (without O_DIRECT)?
>>>>
>>> Yes. But I _think_ the above paragraph is ever so slightly outdated, as we've spent quite a lot of time fixing sequential writes (cf. blk-zoned etc.). So while dm-zoned is using buffered writes, there won't be any sequential write issues.
>>>
>>> At least, I have not uncovered any of those during testing.
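For reference, writing directly to the sequential zones of the raw device looks something like the commands below. This is only an illustration: it reuses /dev/sdx and the 256 MiB zone size reported further down in this thread, picks zone 1 arbitrarily, and writes straight to the raw disk, so do not run it on a device dm-zoned is already using.

  # Reset the write pointer of zone 1 (zone start = sector 524288), then fill
  # that zone with 256 MiB of zeros using sequential 1 MiB direct I/O writes.
  blkzone reset --offset 524288 --count 1 /dev/sdx
  dd if=/dev/zero of=/dev/sdx oflag=direct bs=1M seek=256 count=256

Doing the same thing with buffered writes can fail with unaligned write errors because the page cache may write the dirty pages back out of order, which is what the paragraph quoted above is warning about.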
>> Hi Hannes,
>>
>> I am using the 5.10.90 kernel and the SMR disk capacity is 24 TB.
>> I followed the guide below to create a dm-zoned device on top of the SMR disk:
>> https://zonedstorage.io/docs/linux/dm
>>
>> Then I ran mkfs.ext4 /dev/dm-0, but it seems to hang.
>> Any ideas?
>>
>> [37552.217472] device-mapper: uevent: version 1.0.3
>> [37552.217549] device-mapper: ioctl: 4.43.0-ioctl (2020-10-01) initialised: dm-devel@xxxxxxxxxx
>> [37575.608500] device-mapper: zoned metadata: (dmz-5000cca2bfc0db21): DM-Zoned metadata version 2
>> [37575.608502] device-mapper: zoned metadata: (sdx): Host-managed zoned block device
>> [37575.608503] device-mapper: zoned metadata: (sdx): 50782535680 512-byte logical sectors (offset 0)
>> [37575.608503] device-mapper: zoned metadata: (sdx): 96860 zones of 524288 512-byte logical sectors (offset 0)
>> [37575.608504] device-mapper: zoned metadata: (dmz-5000cca2bfc0db21): 96860 zones of 524288 512-byte logical sectors
>> [37575.609204] device-mapper: zoned: (dmz-5000cca2bfc0db21): Target device: 50771001344 512-byte logical sectors (6346375168 blocks)
>> [38101.543353] INFO: task mkfs.ext4:1411791 blocked for more than 122 seconds.
>> [38101.543380] Tainted: G OE 5.10.90.bm.1-amd64+ #2
>> [38101.543395] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [38101.543411] task:mkfs.ext4 state:D stack: 0 pid:1411791 ppid:1388660 flags:0x00004000
>> [38101.543415] Call Trace:
>> [38101.543422]  __schedule+0x3fd/0x760
>> [38101.543425]  schedule+0x46/0xb0
>> [38101.543426]  io_schedule+0x12/0x40
>> [38101.543429]  wait_on_page_bit+0x133/0x270
>> [38101.543431]  ? __page_cache_alloc+0xa0/0xa0
>> [38101.543432]  wait_on_page_writeback+0x25/0x70
>> [38101.543434]  __filemap_fdatawait_range+0x86/0xf0
>> [38101.543435]  file_write_and_wait_range+0x74/0xb0
>> [38101.543438]  blkdev_fsync+0x16/0x40
>> [38101.543441]  do_fsync+0x38/0x60
>> [38101.543442]  __x64_sys_fsync+0x10/0x20
>> [38101.543445]  do_syscall_64+0x2d/0x70
>> [38101.543446]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>
>> ===
>> Below are the steps I did:
>>
>> root@smr_dev:~# blkzone reset /dev/sdx
>>
>> root@smr_dev:~# dmzadm --format /dev/sdx
>> /dev/sdx: 50782535680 512-byte sectors (24215 GiB)
>> Host-managed device
>> 96860 zones, offset 0
>> 96860 zones of 524288 512-byte sectors (256 MiB)
>> 65536 4KB data blocks per zone
>> Resetting sequential zones
>> Writing primary metadata set
>> Writing mapping table
>> Writing bitmap blocks
>> Writing super block to sdx block 0
>> Writing secondary metadata set
>> Writing mapping table
>> Writing bitmap blocks
>> Writing super block to sdx block 196608
>> Syncing disk
>> Done.
>>
> Hmm. I don't actually see how many CMR zones the drive has.
>
>> root@smr_dev:~# dmzadm --start /dev/sdx
>> /dev/sdx: 50782535680 512-byte sectors (24215 GiB)
>> Host-managed device
>> 96860 zones, offset 0
>> 96860 zones of 524288 512-byte sectors (256 MiB)
>> 65536 4KB data blocks per zone
>> sdx: starting dmz-5000cca2bfc0db21, metadata ver. 2, uuid 7495e21a-23d9-49f4-832a-76b32136078b
>>
>> root@smr_dev:~# mkfs.ext4 /dev/dm-0
>> mke2fs 1.44.5 (15-Dec-2018)
>> Discarding device blocks: done
>> Creating filesystem with 6346375168 4k blocks and 396648448 inodes
>> Filesystem UUID: c47de06d-6cf6-4a85-9502-7830ca2f4526
>> Superblock backups stored on blocks:
>> 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
>> 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
>> 102400000, 214990848, 512000000, 550731776, 644972544, 1934917632,
>> 2560000000, 3855122432, 5804752896
>>
>> Allocating group tables: done
>> Writing inode tables: done
>> Creating journal (262144 blocks): done
>> Writing superblocks and filesystem accounting information:
>>
>> ===
>> At another terminal:
>>
>> root@smr_dev:~# ps aux | grep mkfs.ext4
>> root 1411791 2.8 0.0 30992 19864 pts/1 D+ 01:30 0:01 mkfs.ext4 /dev/dm-0
>> root 1413640 0.0 0.0 13972 2496 pts/0 S+ 01:31 0:00 grep mkfs.ext4
>>
>> root@smr_dev:~# cat /proc/1411791/stack
>> [<0>] wait_on_page_bit+0x133/0x270
>> [<0>] wait_on_page_writeback+0x25/0x70
>> [<0>] __filemap_fdatawait_range+0x86/0xf0
>> [<0>] file_write_and_wait_range+0x74/0xb0
>> [<0>] blkdev_fsync+0x16/0x40
>> [<0>] do_fsync+0x38/0x60
>> [<0>] __x64_sys_fsync+0x10/0x20
>> [<0>] do_syscall_64+0x2d/0x70
>> [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

Not sure if this is a bug, but doing a simple mkfs.ext4 on dm-zoned with a large SMR disk can take *a very loooooong* time. This is because mkfs.ext4 does a lot of random writes all over the place, so just running it pushes dm-zoned into heavy GC mode...

To speed things up (and improve runtime performance), use the packed-metadata format:

mkfs.ext4 -E packed_meta_blocks=1

Or do a mkfs.xfs to compare and see how much faster it is.
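For example, with the dm-zoned device name from your log (illustration only; adjust the device path as needed):

  # pack the allocation bitmaps, inode tables and journal at the start of the device
  mkfs.ext4 -E packed_meta_blocks=1 /dev/dm-0

  # or try XFS for comparison
  mkfs.xfs /dev/dm-0

packed_meta_blocks=1 places the filesystem metadata at the beginning of the device instead of spreading it across every block group, so the bulk of the mkfs writes become sequential and dm-zoned should have far less reclaim work to do.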
>
> But that just means that we're waiting for I/O to complete; there must be another thread processing the I/O.
> If this is the only active thread in your system, something is seriously hosed.
>
> But I guess I don't need to tell _you_ that :-)
>
> Cheers,
>
> Hannes

-- 
Damien Le Moal
Western Digital Research

--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://listman.redhat.com/mailman/listinfo/dm-devel