Re: [PATCH 00/27] ext4: use iomap for regular file's buffered I/O path and enable large folio

On 2024/10/23 20:13, Sedat Dilek wrote:
> On Tue, Oct 22, 2024 at 11:22 AM Zhang Yi <yi.zhang@xxxxxxxxxxxxxxx> wrote:
>>
>> On 2024/10/22 14:59, Sedat Dilek wrote:
>>> On Tue, Oct 22, 2024 at 5:13 AM Zhang Yi <yi.zhang@xxxxxxxxxxxxxxx> wrote:
>>>>
>>>> From: Zhang Yi <yi.zhang@xxxxxxxxxx>
>>>>
>>>> Hello!
>>>>
>>>> This patch series is the latest version based on my previous RFC
>>>> series[1], which converts the buffered I/O path of ext4 regular files to
>>>> iomap and enables large folios. After several months of work, almost all
>>>> preparatory changes have been upstreamed; thanks a lot for the reviews
>>>> and comments from Jan, Dave, Christoph, Darrick and Ritesh. Now it is
>>>> time for the main implementation of this conversion.
>>>>
>>>> This series is the main part of the iomap buffered I/O conversion. It
>>>> is based on 6.12-rc4, and its code context also depends on my other
>>>> cleanup series[1] (I've included that in this series so we can merge it
>>>> directly). It fixes all the minor bugs found in my previous RFC v4
>>>> series. Additionally, I've updated the change logs in each patch and
>>>> included some code modifications following Dave's suggestions. This
>>>> series implements the core iomap APIs in ext4 and introduces a mount
>>>> option called "buffered_iomap" to enable the iomap buffered I/O path.
>>>> We already support the default features, the default mount options and
>>>> the bigalloc feature. However, we do not yet support online
>>>> defragmentation, inline data, fs-verity, fscrypt, ext3, and
>>>> data=journal mode; ext4 will fall back to the buffer_head I/O path
>>>> automatically if you use those features and options. Some of these
>>>> features should be supported gradually in the near future.
>>>>
>>>> Most of the implementations resemble the original buffer_head path;
>>>> however, there are four key differences.
>>>>
>>>> 1. The first difference is the block allocation in the writeback path.
>>>>    The iomap framework will invoke ->map_blocks() at least once for
>>>>    each dirty folio. To ensure optimal writeback performance, we aim to
>>>>    allocate a range of delalloc blocks that is as long as possible
>>>>    within the writeback length for each invocation. In certain
>>>>    situations, we may allocate a range of blocks that exceeds the
>>>>    amount we will actually write back. Therefore,
>>>> 1) we cannot allocate a written extent for those blocks because it may
>>>>    expose stale data in such short write cases. Instead, we should
>>>>    allocate an unwritten extent, which means we must always enable the
>>>>    dioread_nolock option. This change could also bring many other
>>>>    benefits.
>>>> 2) We should postpone updating the 'i_disksize' until the end of the I/O
>>>>    process, based on the actual written length. This approach can also
>>>>    prevent the exposure of zero data, which may occur if there is a
>>>>    power failure during an append write.
>>>> 3) We do not need to pre-split extents during writeback; we can
>>>>    postpone this task to the end-of-I/O process, while converting
>>>>    unwritten extents.
>>>>
>>>> 2. The second difference is that since we always allocate unwritten
>>>>    space for new blocks, there is no risk of exposing stale data. As a
>>>>    result, we do not need to order the data, which allows us to disable
>>>>    the data=ordered mode. Consequently, we also do not require the
>>>>    reserved handle when converting the unwritten extent in the end I/O
>>>>    worker; we can directly start with a normal handle.
>>>>
>>>> Series details:
>>>>
>>>> Patches 1-10 are just another series of mine that refactors the
>>>> fallocate functions[1]. This series relies on its code context but has
>>>> no logical dependency on it. I put it here just for easy access and
>>>> merging.
>>>>
>>>> Patches 11-21 implement the iomap buffered read/write paths, the dirty
>>>> folio writeback path and the mmap path for ext4 regular files.
>>>>
>>>> Patches 22-23 disable the unsupported online defragmentation function
>>>> and disable changing the inode journal flag to data=journal mode.
>>>> Please see the corresponding patches for details.
>>>>
>>>> Patches 24-27 introduce the "buffered_iomap" mount option (not enabled
>>>> by default for now) to partially enable the iomap buffered I/O path and
>>>> also enable large folios.
>>>>
>>>>
>>>> About performance:
>>>>
>>>> Fio tests with psync on my machine with an Intel Xeon Gold 6240 CPU,
>>>> 400GB of system RAM, a 200GB ramdisk and a 4TB NVMe SSD.
>>>>
>>>>  fio -directory=/mnt -direct=0 -iodepth=$iodepth -fsync=$sync -rw=$rw \
>>>>      -numjobs=${numjobs} -bs=${bs} -ioengine=psync -size=$size \
>>>>      -runtime=60 -norandommap=0 -fallocate=none -overwrite=$overwrite \
>>>>      -group_reporting -name=$name --output=/tmp/test_log
>>>>
>>>
>>> Hi Zhang Yi,
>>>
>>> can you clarify the FIO values for the various parameters?
>>>
>>
>> Hi Sedat,
>>
>> Sure, the test I present here is a simple single-thread, single-I/O-depth
>> case with the psync ioengine. Most of the FIO parameters are shown in
>> the tables below.
>>
> 
> Hi Zhang Yi,
> 
> Thanks for your reply.
> 
> Can you share a FIO config file with all (relevant) settings?
> Maybe it is in the link below?
> 
> Link: https://packages.debian.org/sid/all/fio-examples/filelist

No, I don't have such a configuration file. I simply wrote two
straightforward scripts to run this test. They serve as a reference,
primarily for analyzing the performance of basic read/write operations with
different backends. More complex cases should be adjusted to the actual
circumstances.

I have attached the scripts; feel free to use them. I suggest adjusting the
parameters according to your machine configuration and the I/O model of
your workload.
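
For reference, both scripts take the ramdisk device and the NVMe device as
arguments, mkfs.ext4 both of them, mount them on /mnt in turn, and append
the fio results to /tmp/fio_result. A minimal invocation (assuming the read
and write scripts are saved as fio_read.sh and fio_write.sh, and that
/dev/ram0 and /dev/nvme0n1 are your test devices; adjust to your setup)
would be:

  # ./fio_read.sh /dev/ram0 /dev/nvme0n1
  # ./fio_write.sh /dev/ram0 /dev/nvme0n1

Note that the scripts reformat both devices and tune the vm.dirty_* sysctls,
so only run them on a dedicated test machine.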

> 
>> For the rest, the 'iodepth' and 'numjobs' are always set to 1 and the
>> 'size' is 40GB. During the write cache test, I also disable the
>> writeback process with:
>>
>>  echo 0 > /proc/sys/vm/dirty_writeback_centisecs
>>  echo 100 > /proc/sys/vm/dirty_background_ratio
>>  echo 100 > /proc/sys/vm/dirty_ratio
>>
> 
> ^^ Is this info in one of the patches? If not, can you add it to the
> next version's cover letter?
> 
> Are the patchset and its improvements only relevant for powerful servers,
> or does a notebook user benefit from this as well?

The performance improvement comes primarily from the cost savings in the
kernel software stack with large I/Os. Therefore, when the CPU becomes the
bottleneck, performance should improve; i.e. the faster the disk, the more
pronounced the benefits, regardless of whether the system is a server or a
notebook.

Thanks,
Yi.

> If you have benchmark data, please share it.
> 
> I can NOT promise that I will give that patchset a try.
> 
> Best thanks.
> 
> Best regards,
> -Sedat-
> 
>> Thanks,
>> Yi.
>>
>>>
>>>>  == buffer read ==
>>>>
>>>>                 buffer_head        iomap + large folio
>>>>  type     bs    IOPS    BW(MiB/s)  IOPS    BW(MiB/s)
>>>>  -------------------------------------------------------
>>>>  hole     4K    576k    2253       762k    2975     +32%
>>>>  hole     64K   48.7k   3043       77.8k   4860     +60%
>>>>  hole     1M    2960    2960       4942    4942     +67%
>>>>  ramdisk  4K    443k    1732       530k    2069     +19%
>>>>  ramdisk  64K   34.5k   2156       45.6k   2850     +32%
>>>>  ramdisk  1M    2093    2093       2841    2841     +36%
>>>>  nvme     4K    339k    1323       364k    1425     +8%
>>>>  nvme     64K   23.6k   1471       25.2k   1574     +7%
>>>>  nvme     1M    2012    2012       2153    2153     +7%
>>>>
>>>>
>>>>  == buffer write ==
>>>>
>>>>                                        buffer_head  iomap + large folio
>>>>  type   Overwrite Sync Writeback  bs   IOPS  BW(MiB/s) IOPS  BW(MiB/s)
>>>>  ----------------------------------------------------------------------
>>>>  cache      N    N    N    4K     417k    1631    440k    1719   +5%
>>>>  cache      N    N    N    64K    33.4k   2088    81.5k   5092   +144%
>>>>  cache      N    N    N    1M     2143    2143    5716    5716   +167%
>>>>  cache      Y    N    N    4K     449k    1755    469k    1834   +5%
>>>>  cache      Y    N    N    64K    36.6k   2290    82.3k   5142   +125%
>>>>  cache      Y    N    N    1M     2352    2352    5577    5577   +137%
>>>>  ramdisk    N    N    Y    4K     365k    1424    354k    1384   -3%
>>>>  ramdisk    N    N    Y    64K    31.2k   1950    74.2k   4640   +138%
>>>>  ramdisk    N    N    Y    1M     1968    1968    5201    5201   +164%
>>>>  ramdisk    N    Y    N    4K     9984    39      12.9k   51     +29%
>>>>  ramdisk    N    Y    N    64K    5936    371     8960    560    +51%
>>>>  ramdisk    N    Y    N    1M     1050    1050    1835    1835   +75%
>>>>  ramdisk    Y    N    Y    4K     411k    1609    443k    1731   +8%
>>>>  ramdisk    Y    N    Y    64K    34.1k   2134    77.5k   4844   +127%
>>>>  ramdisk    Y    N    Y    1M     2248    2248    5372    5372   +139%
>>>>  ramdisk    Y    Y    N    4K     182k    711     186k    730    +3%
>>>>  ramdisk    Y    Y    N    64K    18.7k   1170    34.7k   2171   +86%
>>>>  ramdisk    Y    Y    N    1M     1229    1229    2269    2269   +85%
>>>>  nvme       N    N    Y    4K     373k    1458    387k    1512   +4%
>>>>  nvme       N    N    Y    64K    29.2k   1827    70.9k   4431   +143%
>>>>  nvme       N    N    Y    1M     1835    1835    4919    4919   +168%
>>>>  nvme       N    Y    N    4K     11.7k   46      11.7k   46      0%
>>>>  nvme       N    Y    N    64K    6453    403     8661    541    +34%
>>>>  nvme       N    Y    N    1M     649     649     1351    1351   +108%
>>>>  nvme       Y    N    Y    4K     372k    1456    433k    1693   +16%
>>>>  nvme       Y    N    Y    64K    33.0k   2064    74.7k   4669   +126%
>>>>  nvme       Y    N    Y    1M     2131    2131    5273    5273   +147%
>>>>  nvme       Y    Y    N    4K     56.7k   222     56.4k   220    -1%
>>>>  nvme       Y    Y    N    64K    13.4k   840     19.4k   1214   +45%
>>>>  nvme       Y    Y    N    1M     714     714     1504    1504   +111%
>>>>
>>>> Thanks,
>>>> Yi.
>>>>
>>>> Major changes since RFC v4:
>>>>  - Disable unsupported online defragmentation, do not fall back to
>>>>    buffer_head path.
>>>>  - Write back and wait for data while doing a partial block truncate
>>>>    down to fix a stale data problem.
>>>>  - Disable the online changing of the inode journal flag to data=journal
>>>>    mode.
>>>>  - Since iomap can zero out dirty pages over unwritten extents, do not
>>>>    write data back before zeroing out in ext4_zero_range(), and also do
>>>>    not zero partial blocks under a started journal handle.
>>>>
>>>> [1] https://lore.kernel.org/linux-ext4/20241010133333.146793-1-yi.zhang@xxxxxxxxxx/
>>>>
>>>> ---
>>>> RFC v4: https://lore.kernel.org/linux-ext4/20240410142948.2817554-1-yi.zhang@xxxxxxxxxxxxxxx/
>>>> RFC v3: https://lore.kernel.org/linux-ext4/20240127015825.1608160-1-yi.zhang@xxxxxxxxxxxxxxx/
>>>> RFC v2: https://lore.kernel.org/linux-ext4/20240102123918.799062-1-yi.zhang@xxxxxxxxxxxxxxx/
>>>> RFC v1: https://lore.kernel.org/linux-ext4/20231123125121.4064694-1-yi.zhang@xxxxxxxxxxxxxxx/
>>>>
>>>>
>>>> Zhang Yi (27):
>>>>   ext4: remove writable userspace mappings before truncating page cache
>>>>   ext4: don't explicit update times in ext4_fallocate()
>>>>   ext4: don't write back data before punch hole in nojournal mode
>>>>   ext4: refactor ext4_punch_hole()
>>>>   ext4: refactor ext4_zero_range()
>>>>   ext4: refactor ext4_collapse_range()
>>>>   ext4: refactor ext4_insert_range()
>>>>   ext4: factor out ext4_do_fallocate()
>>>>   ext4: move out inode_lock into ext4_fallocate()
>>>>   ext4: move out common parts into ext4_fallocate()
>>>>   ext4: use reserved metadata blocks when splitting extent on endio
>>>>   ext4: introduce seq counter for the extent status entry
>>>>   ext4: add a new iomap aops for regular file's buffered IO path
>>>>   ext4: implement buffered read iomap path
>>>>   ext4: implement buffered write iomap path
>>>>   ext4: don't order data for inode with EXT4_STATE_BUFFERED_IOMAP
>>>>   ext4: implement writeback iomap path
>>>>   ext4: implement mmap iomap path
>>>>   ext4: do not always order data when partial zeroing out a block
>>>>   ext4: do not start handle if unnecessary while partial zeroing out a
>>>>     block
>>>>   ext4: implement zero_range iomap path
>>>>   ext4: disable online defrag when inode using iomap buffered I/O path
>>>>   ext4: disable inode journal mode when using iomap buffered I/O path
>>>>   ext4: partially enable iomap for the buffered I/O path of regular
>>>>     files
>>>>   ext4: enable large folio for regular file with iomap buffered I/O path
>>>>   ext4: change mount options code style
>>>>   ext4: introduce a mount option for iomap buffered I/O path
>>>>
>>>>  fs/ext4/ext4.h              |  17 +-
>>>>  fs/ext4/ext4_jbd2.c         |   3 +-
>>>>  fs/ext4/ext4_jbd2.h         |   8 +
>>>>  fs/ext4/extents.c           | 568 +++++++++++----------------
>>>>  fs/ext4/extents_status.c    |  13 +-
>>>>  fs/ext4/file.c              |  19 +-
>>>>  fs/ext4/ialloc.c            |   5 +
>>>>  fs/ext4/inode.c             | 755 ++++++++++++++++++++++++++++++------
>>>>  fs/ext4/move_extent.c       |   7 +
>>>>  fs/ext4/page-io.c           | 105 +++++
>>>>  fs/ext4/super.c             | 185 ++++-----
>>>>  include/trace/events/ext4.h |  57 +--
>>>>  12 files changed, 1153 insertions(+), 589 deletions(-)
>>>>
>>>> --
>>>> 2.46.1
>>>>
>>>>
>>
#!/bin/bash
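#
# Attached buffered read test script: runs sequential buffered reads with
# fio (psync, iodepth=1, numjobs=1) at 4K/64K/1M block sizes against a
# sparse (hole) file and a written file on the ramdisk, and a written file
# on the NVMe disk, once with the buffer_head path ("nobuffered_iomap") and
# once with the iomap path ("buffered_iomap"). Results are appended to
# /tmp/fio_result.
#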

ramdev=$1
nvmedev=$2

MOUNT_OPT=""
test_size=40G

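# run_fio <fsync> <bs> <iodepth> <numjobs> <overwrite> <size>
# Run one sequential buffered read job with the psync ioengine and append
# the fio output to /tmp/fio_result.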
function run_fio()
{
	local rw=read
	local sync=$1
	local bs=$2
	local iodepth=$3
	local numjobs=$4
	local overwrite=$5
	local name=1
	local size=$6

	fio -directory=/mnt -direct=0 -iodepth=$iodepth -fsync=$sync -rw=$rw \
	    -numjobs=${numjobs} -bs=${bs} -ioengine=psync -size=$size \
	    -runtime=60 -norandommap=0 -fallocate=none -overwrite=$overwrite \
	    -group_reporting -name=$name --output=/tmp/log

	cat /tmp/log >> /tmp/fio_result
}

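# init_env <hole> <size> <dev>
# Create the test file /mnt/1.0.0, either as a sparse hole or filled with
# data via xfs_io, then remount to drop the page cache before reading.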
function init_env()
{
	local hole=$1
	local size=$2
	local dev=$3

	rm -rf /mnt/*

	if [[ "$hole" == "1" ]]; then
		truncate -s $size /mnt/1.0.0
	else
		xfs_io -f -c "pwrite 0 $size" /mnt/1.0.0
	fi

	umount /mnt
	mount -o $MOUNT_OPT $dev /mnt
}

function reset_env()
{
	local dev=$1

	umount /mnt
	mount -o $MOUNT_OPT $dev /mnt
}

function do_one_test()
{
	local sync=0
	local hole=$1
	local size=$2
	local dev=$3

	echo "-------------------" | tee -a /tmp/fio_result

	echo "=== 4K:" | tee -a /tmp/fio_result
	reset_env $dev
	run_fio $sync 4k 1 1 0 $size

	echo "=== 64K:" | tee -a /tmp/fio_result
	reset_env $dev
	run_fio $sync 64k 1 1 0 $size

	echo "=== 1M:" | tee -a /tmp/fio_result
	reset_env $dev
	run_fio $sync 1M 1 1 0 $size

	echo "-------------------" | tee -a /tmp/fio_result
}

function run_one_round()
{
	local hole=$1
	local size=$2
	local dev=$3

	init_env $hole $size $dev
	do_one_test $hole $size $dev
}

function run_test()
{
	echo "---- TEST RAMDEV ----" | tee -a /tmp/fio_result
	mount -o $MOUNT_OPT $ramdev /mnt

	echo "----- 1. READ HOLE" | tee -a /tmp/fio_result
	run_one_round 1 $test_size $ramdev

	echo "----- 2. READ RAM DATA" | tee -a /tmp/fio_result
	run_one_round 0 $test_size $ramdev
	umount /mnt

	echo "---- TEST NVMEDEV ----" | tee -a /tmp/fio_result
	echo "----- 3. READ NVME DATA" | tee -a /tmp/fio_result
	mount -o $MOUNT_OPT $nvmedev /mnt
	run_one_round 0 $test_size $nvmedev
	umount /mnt
}

if [ -z "$ramdev" ] || [ -z "$nvmedev" ]; then
	echo "Usage: $0 <ramdev> <nvmedev>"
	exit 1
fi

umount /mnt
mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0 -F $ramdev
mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0 -F $nvmedev

cp /tmp/fio_result /tmp/fio_result.old
rm -f /tmp/fio_result

## TEST base (buffer_head path)
echo "==== TEST BASE ====" | tee -a /tmp/fio_result
MOUNT_OPT="nobuffered_iomap"
run_test

## TEST iomap (iomap buffered I/O path)
echo "==== TEST IOMAP ====" | tee -a /tmp/fio_result
MOUNT_OPT="buffered_iomap"
run_test
#!/bin/bash
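#
# Attached buffered write test script: runs sequential buffered writes with
# fio (psync, iodepth=1, numjobs=1) at 4K/64K/1M block sizes, covering
# cache-only writes (background writeback disabled via the vm.dirty_*
# sysctls), writeback to the ramdisk and to the NVMe disk, with and without
# overwrite and per-I/O fsync, once with the buffer_head path
# ("nobuffered_iomap") and once with the iomap path ("buffered_iomap").
# Results are appended to /tmp/fio_result.
#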

ramdev=$1
nvmedev=$2

MOUNT_OPT=""
test_size=40G

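# run_fio <fsync> <bs> <iodepth> <numjobs> <overwrite> <size>
# Run one sequential buffered write job with the psync ioengine and append
# the fio output to /tmp/fio_result.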
function run_fio()
{
	local rw=write
	local sync=$1
	local bs=$2
	local iodepth=$3
	local numjobs=$4
	local overwrite=$5
	local name=1
	local size=$6

	fio -directory=/mnt -direct=0 -iodepth=$iodepth -fsync=$sync -rw=$rw \
	    -numjobs=${numjobs} -bs=${bs} -ioengine=psync -size=$size \
	    -runtime=60 -norandommap=0 -fallocate=none -overwrite=$overwrite \
	    -group_reporting -name=$name --output=/tmp/log

	cat /tmp/log >> /tmp/fio_result
}

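# init_env <dev>
# Remove any old test files and remount so that each round starts with an
# empty filesystem and a clean page cache.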
function init_env()
{
	local dev=$1

	rm -rf /mnt/*
	umount /mnt
	mount -o $MOUNT_OPT $dev /mnt
}

function reset_env()
{
	local overwrite=$1
	local dev=$2

	if [[ "$overwrite" == "0" ]]; then
		rm -rf /mnt/*
	fi
	umount /mnt
	mount -o $MOUNT_OPT $dev /mnt
}

function do_one_test()
{
	local sync=$1
	local overwrite=$2
	local size=$3
	local dev=$4

	echo "-------------------" | tee -a /tmp/fio_result

	echo "=== 4K:" | tee -a /tmp/fio_result
	reset_env $overwrite $dev
	run_fio $sync 4k 1 1 $overwrite $size

	echo "=== 64K:" | tee -a /tmp/fio_result
	reset_env $overwrite $dev
	run_fio $sync 64k 1 1 $overwrite $size

	echo "=== 1M:" | tee -a /tmp/fio_result
	reset_env $overwrite $dev
	run_fio $sync 1M 1 1 $overwrite $size

	echo "-------------------" | tee -a /tmp/fio_result
}

function run_one_round()
{
	local sync=$1
	local overwrite=$2
	local size=$3
	local dev=$4

	echo "Sync:$sync, Overwrite:$overwrite" | tee -a /tmp/fio_result
	init_env $dev
	do_one_test $sync $overwrite $size $dev
}

function run_test()
{
	echo "---- TEST RAMDEV ----" | tee -a /tmp/fio_result
	mount -o $MOUNT_OPT $ramdev /mnt

	echo "----- 1. WRITE CACHE" | tee -a /tmp/fio_result
	# Stop writeback
	echo 0 > /proc/sys/vm/dirty_writeback_centisecs
	echo 30000 > /proc/sys/vm/dirty_expire_centisecs
	echo 100 > /proc/sys/vm/dirty_background_ratio
	echo 100 > /proc/sys/vm/dirty_ratio
	run_one_round 0 0 $test_size $ramdev
	run_one_round 0 1 $test_size $ramdev

	echo "----- 2. WRITE RAM DISK" | tee -a /tmp/fio_result
	# Restore writeback
	echo 500 > /proc/sys/vm/dirty_writeback_centisecs
	echo 3000 > /proc/sys/vm/dirty_expire_centisecs
	echo 10 > /proc/sys/vm/dirty_background_ratio
	echo 20 > /proc/sys/vm/dirty_ratio
	run_one_round 0 0 $test_size $ramdev
	run_one_round 0 1 $test_size $ramdev
	run_one_round 1 0 $test_size $ramdev
	run_one_round 1 1 $test_size $ramdev
	umount /mnt

	echo "---- TEST NVMEDEV ----" | tee -a /tmp/fio_result
	echo "----- 3. WRITE NVME DISK" | tee -a /tmp/fio_result
	mount -o $MOUNT_OPT $nvmedev /mnt
	run_one_round 0 0 $test_size $nvmedev
	run_one_round 0 1 $test_size $nvmedev
	run_one_round 1 0 $test_size $nvmedev
	run_one_round 1 1 $test_size $nvmedev
	umount /mnt
}

if [ -z "$ramdev" ] || [ -z "$nvmedev" ]; then
	echo "Usage: $0 <ramdev> <nvmedev>"
	exit 1
fi

umount /mnt
mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0 -F $ramdev
mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0 -F $nvmedev

cp /tmp/fio_result /tmp/fio_result.old
rm -f /tmp/fio_result

## TEST base
echo "==== TEST BASE ====" | tee -a /tmp/fio_result
MOUNT_OPT="nobuffered_iomap"
run_test

## TEST iomap
echo "==== TEST IOMAP ====" | tee -a /tmp/fio_result
MOUNT_OPT="buffered_iomap"
run_test
