On 2024/10/23 20:13, Sedat Dilek wrote:
> On Tue, Oct 22, 2024 at 11:22 AM Zhang Yi <yi.zhang@xxxxxxxxxxxxxxx> wrote:
>>
>> On 2024/10/22 14:59, Sedat Dilek wrote:
>>> On Tue, Oct 22, 2024 at 5:13 AM Zhang Yi <yi.zhang@xxxxxxxxxxxxxxx> wrote:
>>>>
>>>> From: Zhang Yi <yi.zhang@xxxxxxxxxx>
>>>>
>>>> Hello!
>>>>
>>>> This patch series is the latest version based on my previous RFC
>>>> series[1], which converts the buffered I/O path of ext4 regular files
>>>> to iomap and enables large folios. After several months of work, almost
>>>> all preparatory changes have been upstreamed; thanks a lot for the
>>>> review and comments from Jan, Dave, Christoph, Darrick and Ritesh. Now
>>>> it is time for the main implementation of this conversion.
>>>>
>>>> This series is the main part of the iomap buffered I/O conversion. It
>>>> is based on 6.12-rc4, and its code context also depends on another
>>>> cleanup series of mine[1] (I've included that in this series so it can
>>>> be merged directly). It fixes all minor bugs found in my previous RFC
>>>> v4 series. Additionally, I've updated the change logs in each patch and
>>>> made some code modifications following Dave's suggestions. This series
>>>> implements the core iomap APIs for ext4 and introduces a mount option
>>>> called "buffered_iomap" to enable the iomap buffered I/O path. The
>>>> default features, default mount options and the bigalloc feature are
>>>> already supported. However, we do not yet support online
>>>> defragmentation, inline data, fs-verity, fscrypt, ext3, and
>>>> data=journal mode; ext4 falls back to the buffer_head I/O path
>>>> automatically if you use those features and options. Some of these
>>>> features should be supported gradually in the near future.
>>>>
>>>> Most of the implementation resembles the original buffer_head path;
>>>> however, there are four key differences.
>>>>
>>>> 1. The first is block allocation in the writeback path. The iomap
>>>> framework will invoke ->map_blocks() at least once for each dirty
>>>> folio. To ensure optimal writeback performance, we aim to allocate a
>>>> range of delalloc blocks that is as long as possible within the
>>>> writeback length for each invocation. In certain situations, we may
>>>> allocate a range of blocks that exceeds the amount we will actually
>>>> write back. Therefore,
>>>> 1) We cannot allocate a written extent for those blocks because it may
>>>> expose stale data in such short write cases. Instead, we should
>>>> allocate an unwritten extent, which means we must always enable the
>>>> dioread_nolock option. This change also brings many other benefits.
>>>> 2) We should postpone updating 'i_disksize' until the end of the I/O
>>>> process, based on the actual written length. This approach also
>>>> prevents the exposure of zeroed data, which may occur if there is a
>>>> power failure during an append write.
>>>> 3) We do not need to pre-split extents during writeback; we can
>>>> postpone this task to the end-of-I/O process, when unwritten extents
>>>> are converted.
>>>>
>>>> 2. The second is that since we always allocate unwritten space for new
>>>> blocks, there is no risk of exposing stale data. As a result, we do not
>>>> need to order the data, which allows us to disable data=ordered mode.
>>>> Consequently, we also do not require a reserved handle when converting
>>>> unwritten extents in the end-of-I/O worker; we can start directly with
>>>> a normal handle.
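
[For context, the new path is opt-in via an ordinary mount option. A
minimal sketch of how it is toggled; the device and mount point below are
only placeholders:

  # opt a filesystem into the iomap buffered I/O path (off by default)
  mount -o buffered_iomap /dev/nvme0n1p1 /mnt

  # stay on (or return to) the existing buffer_head path, which is also
  # used automatically for the unsupported features listed above
  mount -o nobuffered_iomap /dev/nvme0n1p1 /mnt
]
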
>>>>
>>>> Series details:
>>>>
>>>> Patches 1-10 are just another series of mine that refactors the
>>>> fallocate functions[1]. This series relies on the code context of that
>>>> one but has no logical dependencies on it; I put it here just for easy
>>>> access and merging.
>>>>
>>>> Patches 11-21 implement the iomap buffered read/write path, the dirty
>>>> folio writeback path and the mmap path for ext4 regular files.
>>>>
>>>> Patches 22-23 disable the unsupported online-defragmentation function
>>>> and disable changing of the inode journal flag to data=journal mode.
>>>> Please look at the corresponding patches for details.
>>>>
>>>> Patches 24-27 introduce the "buffered_iomap" mount option (not enabled
>>>> by default for now) to partially enable the iomap buffered I/O path and
>>>> also enable large folios.
>>>>
>>>>
>>>> About performance:
>>>>
>>>> Fio tests with psync on my machine with an Intel Xeon Gold 6240 CPU,
>>>> 400GB of system RAM, a 200GB ramdisk and a 4TB NVMe SSD.
>>>>
>>>> fio -directory=/mnt -direct=0 -iodepth=$iodepth -fsync=$sync -rw=$rw \
>>>>     -numjobs=${numjobs} -bs=${bs} -ioengine=psync -size=$size \
>>>>     -runtime=60 -norandommap=0 -fallocate=none -overwrite=$overwrite \
>>>>     -group_reporting -name=$name --output=/tmp/test_log
>>>>
>>>
>>> Hi Zhang Yi,
>>>
>>> can you clarify the FIO values for the various parameters?
>>>
>>
>> Hi Sedat,
>>
>> Sure, the test I present here is a simple single-thread and single-I/O
>> depth case with the psync ioengine. Most of the FIO parameters are shown
>> in the tables below.
>>
>
> Hi Zhang Yi,
>
> Thanks for your reply.
>
> Can you share a FIO config file with all (relevant) settings?
> Maybe it is in the below link?
>
> Link: https://packages.debian.org/sid/all/fio-examples/filelist

No, I don't have such a configuration file. I simply wrote two
straightforward scripts to do this test. They serve as a reference,
primarily used for performance analysis of basic read/write operations
with different backends; more complex cases should be adjusted based on
the actual circumstances. I have attached the scripts, feel free to use
them. I suggest adjusting the parameters according to your machine
configuration and service I/O model.

>
>> For the rest, 'iodepth' and 'numjobs' are always set to 1 and 'size' is
>> 40GB. During the write cache test, I also disable the writeback process
>> through:
>>
>> echo 0 > /proc/sys/vm/dirty_writeback_centisecs
>> echo 100 > /proc/sys/vm/dirty_background_ratio
>> echo 100 > /proc/sys/vm/dirty_ratio
>>
>
> ^^ Is this info in one of the patches? If not - can you add this info
> to the next version's cover-letter?
>
> Are the patchset and improvements valid only for powerful servers, or
> does a notebook user have any benefits from this?

The performance improvement is primarily attributed to the cost savings
in the kernel software stack with large I/O. Therefore, when the CPU
becomes the bottleneck, performance should improve, i.e. the faster the
disk, the more pronounced the benefits, regardless of whether the system
is a server or a notebook.

Thanks,
Yi.

> If you have benchmark data, please share this.
>
> I can NOT promise if I will give that patchset a try.
>
> Best thanks.
>
> Best regards,
> -Sedat-
>
>> Thanks,
>> Yi.
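
[For anyone who prefers a job file over the long command line, the quoted
parameters translate into roughly the following sketch; the file and job
names are made up here, and only the 4K buffered sequential write case is
shown:

cat > /tmp/ext4-iomap-write.fio <<'EOF'
; hedged sketch of a job file matching the fio command line above
[global]
directory=/mnt
direct=0
ioengine=psync
iodepth=1
numjobs=1
size=40G
runtime=60
fallocate=none
group_reporting

[seqwrite-4k]
rw=write
bs=4k
fsync=0
overwrite=0
EOF

fio /tmp/ext4-iomap-write.fio --output=/tmp/test_log
]
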
>>
>>>
>>>> == buffer read ==
>>>>
>>>>                   buffer_head         iomap + large folio
>>>> type     bs    IOPS   BW(MiB/s)      IOPS   BW(MiB/s)
>>>> -------------------------------------------------------
>>>> hole     4K    576k    2253          762k    2975    +32%
>>>> hole     64K   48.7k   3043          77.8k   4860    +60%
>>>> hole     1M    2960    2960          4942    4942    +67%
>>>> ramdisk  4K    443k    1732          530k    2069    +19%
>>>> ramdisk  64K   34.5k   2156          45.6k   2850    +32%
>>>> ramdisk  1M    2093    2093          2841    2841    +36%
>>>> nvme     4K    339k    1323          364k    1425    +8%
>>>> nvme     64K   23.6k   1471          25.2k   1574    +7%
>>>> nvme     1M    2012    2012          2153    2153    +7%
>>>>
>>>>
>>>> == buffer write ==
>>>>
>>>>                                            buffer_head        iomap + large folio
>>>> type     Overwrite  Sync  Writeback  bs    IOPS   BW(MiB/s)   IOPS   BW(MiB/s)
>>>> ----------------------------------------------------------------------
>>>> cache        N       N        N      4K    417k   1631        440k   1719   +5%
>>>> cache        N       N        N      64K   33.4k  2088        81.5k  5092   +144%
>>>> cache        N       N        N      1M    2143   2143        5716   5716   +167%
>>>> cache        Y       N        N      4K    449k   1755        469k   1834   +5%
>>>> cache        Y       N        N      64K   36.6k  2290        82.3k  5142   +125%
>>>> cache        Y       N        N      1M    2352   2352        5577   5577   +137%
>>>> ramdisk      N       N        Y      4K    365k   1424        354k   1384   -3%
>>>> ramdisk      N       N        Y      64K   31.2k  1950        74.2k  4640   +138%
>>>> ramdisk      N       N        Y      1M    1968   1968        5201   5201   +164%
>>>> ramdisk      N       Y        N      4K    9984   39          12.9k  51     +29%
>>>> ramdisk      N       Y        N      64K   5936   371         8960   560    +51%
>>>> ramdisk      N       Y        N      1M    1050   1050        1835   1835   +75%
>>>> ramdisk      Y       N        Y      4K    411k   1609        443k   1731   +8%
>>>> ramdisk      Y       N        Y      64K   34.1k  2134        77.5k  4844   +127%
>>>> ramdisk      Y       N        Y      1M    2248   2248        5372   5372   +139%
>>>> ramdisk      Y       Y        N      4K    182k   711         186k   730    +3%
>>>> ramdisk      Y       Y        N      64K   18.7k  1170        34.7k  2171   +86%
>>>> ramdisk      Y       Y        N      1M    1229   1229        2269   2269   +85%
>>>> nvme         N       N        Y      4K    373k   1458        387k   1512   +4%
>>>> nvme         N       N        Y      64K   29.2k  1827        70.9k  4431   +143%
>>>> nvme         N       N        Y      1M    1835   1835        4919   4919   +168%
>>>> nvme         N       Y        N      4K    11.7k  46          11.7k  46     0%
>>>> nvme         N       Y        N      64K   6453   403         8661   541    +34%
>>>> nvme         N       Y        N      1M    649    649         1351   1351   +108%
>>>> nvme         Y       N        Y      4K    372k   1456        433k   1693   +16%
>>>> nvme         Y       N        Y      64K   33.0k  2064        74.7k  4669   +126%
>>>> nvme         Y       N        Y      1M    2131   2131        5273   5273   +147%
>>>> nvme         Y       Y        N      4K    56.7k  222         56.4k  220    -1%
>>>> nvme         Y       Y        N      64K   13.4k  840         19.4k  1214   +45%
>>>> nvme         Y       Y        N      1M    714    714         1504   1504   +111%
>>>>
>>>> Thanks,
>>>> Yi.
>>>>
>>>> Major changes since RFC v4:
>>>>  - Disable unsupported online defragmentation, do not fall back to the
>>>>    buffer_head path.
>>>>  - Write and wait data back while doing a partial block truncate down,
>>>>    to fix a stale data problem.
>>>>  - Disable the online changing of the inode journal flag to
>>>>    data=journal mode.
>>>>  - Since iomap can zero out dirty pages with unwritten extents, do not
>>>>    write data before zeroing out in ext4_zero_range(), and also do not
>>>>    zero partial blocks under a started journal handle.
>>>>
>>>> [1] https://lore.kernel.org/linux-ext4/20241010133333.146793-1-yi.zhang@xxxxxxxxxx/
>>>>
>>>> ---
>>>> RFC v4: https://lore.kernel.org/linux-ext4/20240410142948.2817554-1-yi.zhang@xxxxxxxxxxxxxxx/
>>>> RFC v3: https://lore.kernel.org/linux-ext4/20240127015825.1608160-1-yi.zhang@xxxxxxxxxxxxxxx/
>>>> RFC v2: https://lore.kernel.org/linux-ext4/20240102123918.799062-1-yi.zhang@xxxxxxxxxxxxxxx/
>>>> RFC v1: https://lore.kernel.org/linux-ext4/20231123125121.4064694-1-yi.zhang@xxxxxxxxxxxxxxx/
>>>>
>>>>
>>>> Zhang Yi (27):
>>>>   ext4: remove writable userspace mappings before truncating page cache
>>>>   ext4: don't explicit update times in ext4_fallocate()
>>>>   ext4: don't write back data before punch hole in nojournal mode
>>>>   ext4: refactor ext4_punch_hole()
>>>>   ext4: refactor ext4_zero_range()
>>>>   ext4: refactor ext4_collapse_range()
>>>>   ext4: refactor ext4_insert_range()
>>>>   ext4: factor out ext4_do_fallocate()
>>>>   ext4: move out inode_lock into ext4_fallocate()
>>>>   ext4: move out common parts into ext4_fallocate()
>>>>   ext4: use reserved metadata blocks when splitting extent on endio
>>>>   ext4: introduce seq counter for the extent status entry
>>>>   ext4: add a new iomap aops for regular file's buffered IO path
>>>>   ext4: implement buffered read iomap path
>>>>   ext4: implement buffered write iomap path
>>>>   ext4: don't order data for inode with EXT4_STATE_BUFFERED_IOMAP
>>>>   ext4: implement writeback iomap path
>>>>   ext4: implement mmap iomap path
>>>>   ext4: do not always order data when partial zeroing out a block
>>>>   ext4: do not start handle if unnecessary while partial zeroing out a
>>>>     block
>>>>   ext4: implement zero_range iomap path
>>>>   ext4: disable online defrag when inode using iomap buffered I/O path
>>>>   ext4: disable inode journal mode when using iomap buffered I/O path
>>>>   ext4: partially enable iomap for the buffered I/O path of regular
>>>>     files
>>>>   ext4: enable large folio for regular file with iomap buffered I/O path
>>>>   ext4: change mount options code style
>>>>   ext4: introduce a mount option for iomap buffered I/O path
>>>>
>>>>  fs/ext4/ext4.h              |  17 +-
>>>>  fs/ext4/ext4_jbd2.c         |   3 +-
>>>>  fs/ext4/ext4_jbd2.h         |   8 +
>>>>  fs/ext4/extents.c           | 568 +++++++++++----------------
>>>>  fs/ext4/extents_status.c    |  13 +-
>>>>  fs/ext4/file.c              |  19 +-
>>>>  fs/ext4/ialloc.c            |   5 +
>>>>  fs/ext4/inode.c             | 755 ++++++++++++++++++++++++++++++------
>>>>  fs/ext4/move_extent.c       |   7 +
>>>>  fs/ext4/page-io.c           | 105 +++++
>>>>  fs/ext4/super.c             | 185 ++++-----
>>>>  include/trace/events/ext4.h |  57 +--
>>>>  12 files changed, 1153 insertions(+), 589 deletions(-)
>>>>
>>>> --
>>>> 2.46.1
>>>>
>>>>
>>
#!/bin/bash
# Sequential buffered read test: compare the buffer_head and iomap paths
# on a ramdisk and an NVMe device.

ramdev=$1
nvmedev=$2
MOUNT_OPT=""
test_size=40G

function run_fio() {
	local rw=read
	local sync=$1
	local bs=$2
	local iodepth=$3
	local numjobs=$4
	local overwrite=$5
	local name=1
	local size=$6

	fio -directory=/mnt -direct=0 -iodepth=$iodepth -fsync=$sync -rw=$rw \
	    -numjobs=${numjobs} -bs=${bs} -ioengine=psync -size=$size \
	    -runtime=60 -norandommap=0 -fallocate=none -overwrite=$overwrite \
	    -group_reporting -name=$name --output=/tmp/log
	cat /tmp/log >> /tmp/fio_result
}

function init_env() {
	local hole=$1
	local size=$2
	local dev=$3

	rm -rf /mnt/*
	if [[ "$hole" == "1" ]]; then
		truncate -s $size /mnt/1.0.0
	else
		xfs_io -f -c "pwrite 0 $size" /mnt/1.0.0
	fi
	umount /mnt
	mount -o $MOUNT_OPT $dev /mnt
}

function reset_env() {
	local dev=$1

	umount /mnt
	mount -o $MOUNT_OPT $dev /mnt
}

function do_one_test() {
	local sync=0
	local hole=$1
	local size=$2
	local dev=$3

	echo "-------------------" | tee -a /tmp/fio_result
	echo "=== 4K:" | tee -a /tmp/fio_result
	reset_env $dev
	run_fio $sync 4k 1 1 0 $size
	echo "=== 64K:" | tee -a /tmp/fio_result
	reset_env $dev
	run_fio $sync 64k 1 1 0 $size
	echo "=== 1M:" | tee -a /tmp/fio_result
	reset_env $dev
	run_fio $sync 1M 1 1 0 $size
	echo "-------------------" | tee -a /tmp/fio_result
}

function run_one_round() {
	local hole=$1
	local size=$2
	local dev=$3

	init_env $hole $size $dev
	do_one_test $hole $size $dev
}

function run_test() {
	echo "---- TEST RAMDEV ----" | tee -a /tmp/fio_result
	mount -o $MOUNT_OPT $ramdev /mnt
	echo "----- 1. READ HOLE" | tee -a /tmp/fio_result
	run_one_round 1 $test_size $ramdev
	echo "----- 2. READ RAM DATA" | tee -a /tmp/fio_result
	run_one_round 0 $test_size $ramdev
	umount /mnt

	echo "---- TEST NVMEDEV ----" | tee -a /tmp/fio_result
	echo "----- 3. READ NVME DATA" | tee -a /tmp/fio_result
	mount -o $MOUNT_OPT $nvmedev /mnt
	run_one_round 0 $test_size $nvmedev
	umount /mnt
}

if [ -z "$ramdev" ] || [ -z "$nvmedev" ]; then
	echo "$0 <ramdev> <nvmedev>"
	exit
fi

umount /mnt
mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0 -F $ramdev
mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0 -F $nvmedev

cp /tmp/fio_result /tmp/fio_result.old
rm -f /tmp/fio_result

## TEST base ramdev
echo "==== TEST BASE ====" | tee -a /tmp/fio_result
MOUNT_OPT="nobuffered_iomap"
run_test

## TEST iomap ramdev
echo "==== TEST IOMAP ====" | tee -a /tmp/fio_result
MOUNT_OPT="buffered_iomap"
run_test
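
[A possible invocation of the read script above; the script name and the
device paths are only placeholders, and note that both devices are
reformatted with mkfs.ext4, so any data on them is lost:

  chmod +x fio_read_test.sh
  ./fio_read_test.sh /dev/ram0 /dev/nvme0n1
]
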
#!/bin/bash
# Sequential buffered write test: compare the buffer_head and iomap paths
# on a ramdisk and an NVMe device, with and without overwrite and fsync.

ramdev=$1
nvmedev=$2
MOUNT_OPT=""
test_size=40G

function run_fio() {
	local rw=write
	local sync=$1
	local bs=$2
	local iodepth=$3
	local numjobs=$4
	local overwrite=$5
	local name=1
	local size=$6

	fio -directory=/mnt -direct=0 -iodepth=$iodepth -fsync=$sync -rw=$rw \
	    -numjobs=${numjobs} -bs=${bs} -ioengine=psync -size=$size \
	    -runtime=60 -norandommap=0 -fallocate=none -overwrite=$overwrite \
	    -group_reporting -name=$name --output=/tmp/log
	cat /tmp/log >> /tmp/fio_result
}

function init_env() {
	local dev=$1

	rm -rf /mnt/*
	umount /mnt
	mount -o $MOUNT_OPT $dev /mnt
}

function reset_env() {
	local overwrite=$1
	local dev=$2

	if [[ "$overwrite" == "0" ]]; then
		rm -rf /mnt/*
	fi
	umount /mnt
	mount -o $MOUNT_OPT $dev /mnt
}

function do_one_test() {
	local sync=$1
	local overwrite=$2
	local size=$3
	local dev=$4

	echo "-------------------" | tee -a /tmp/fio_result
	echo "=== 4K:" | tee -a /tmp/fio_result
	reset_env $overwrite $dev
	run_fio $sync 4k 1 1 $overwrite $size
	echo "=== 64K:" | tee -a /tmp/fio_result
	reset_env $overwrite $dev
	run_fio $sync 64k 1 1 $overwrite $size
	echo "=== 1M:" | tee -a /tmp/fio_result
	reset_env $overwrite $dev
	run_fio $sync 1M 1 1 $overwrite $size
	echo "-------------------" | tee -a /tmp/fio_result
}

function run_one_round() {
	local sync=$1
	local overwrite=$2
	local size=$3
	local dev=$4

	echo "Sync:$sync, Overwrite:$overwrite" | tee -a /tmp/fio_result
	init_env $dev
	do_one_test $sync $overwrite $size $dev
}

function run_test() {
	echo "---- TEST RAMDEV ----" | tee -a /tmp/fio_result
	mount -o $MOUNT_OPT $ramdev /mnt

	echo "----- 1. WRITE CACHE" | tee -a /tmp/fio_result
	# Stop writeback
	echo 0 > /proc/sys/vm/dirty_writeback_centisecs
	echo 30000 > /proc/sys/vm/dirty_expire_centisecs
	echo 100 > /proc/sys/vm/dirty_background_ratio
	echo 100 > /proc/sys/vm/dirty_ratio
	run_one_round 0 0 $test_size $ramdev
	run_one_round 0 1 $test_size $ramdev

	echo "----- 2. WRITE RAM DISK" | tee -a /tmp/fio_result
	# Restore writeback
	echo 500 > /proc/sys/vm/dirty_writeback_centisecs
	echo 3000 > /proc/sys/vm/dirty_expire_centisecs
	echo 10 > /proc/sys/vm/dirty_background_ratio
	echo 20 > /proc/sys/vm/dirty_ratio
	run_one_round 0 0 $test_size $ramdev
	run_one_round 0 1 $test_size $ramdev
	run_one_round 1 0 $test_size $ramdev
	run_one_round 1 1 $test_size $ramdev
	umount /mnt

	echo "---- TEST NVMEDEV ----" | tee -a /tmp/fio_result
	echo "----- 3. WRITE NVME DISK" | tee -a /tmp/fio_result
	mount -o $MOUNT_OPT $nvmedev /mnt
	run_one_round 0 0 $test_size $nvmedev
	run_one_round 0 1 $test_size $nvmedev
	run_one_round 1 0 $test_size $nvmedev
	run_one_round 1 1 $test_size $nvmedev
	umount /mnt
}

if [ -z "$ramdev" ] || [ -z "$nvmedev" ]; then
	echo "$0 <ramdev> <nvmedev>"
	exit
fi

umount /mnt
mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0 -F $ramdev
mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0 -F $nvmedev

cp /tmp/fio_result /tmp/fio_result.old
rm -f /tmp/fio_result

## TEST base
echo "==== TEST BASE ====" | tee -a /tmp/fio_result
MOUNT_OPT="nobuffered_iomap"
run_test

## TEST iomap
echo "==== TEST IOMAP ====" | tee -a /tmp/fio_result
MOUNT_OPT="buffered_iomap"
run_test
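
[And similarly for the write script. The aggregated fio logs end up in
/tmp/fio_result, so the headline numbers can be eyeballed afterwards; the
script name and devices are again placeholders, and the exact output line
format depends on the fio version:

  ./fio_write_test.sh /dev/ram0 /dev/nvme0n1
  grep -E 'IOPS=|BW=' /tmp/fio_result
]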