On 2022/12/15 17:00, Jan Kara wrote: > On Thu 15-12-22 16:24:49, Zhang Yi wrote: >> On 2022/12/15 1:01, Jan Kara wrote: >>> On Sat 03-12-22 18:39:56, Zhang Yi wrote: >>>> In the dio write path, we only take shared inode lock for the case of >>>> aligned overwriting initialized blocks inside EOF. But for overwriting >>>> preallocated blocks, it may only need to split unwritten extents, this >>>> procedure has been protected under i_data_sem lock, it's safe to >>>> release the exclusive inode lock and take shared inode lock. >>>> >>>> This could give a significant speed up for multi-threaded writes. Test >>>> on Intel Xeon Gold 6140 and nvme SSD with below fio parameters. >>>> >>>> direct=1 >>>> ioengine=libaio >>>> iodepth=10 >>>> numjobs=10 >>>> runtime=60 >>>> rw=randwrite >>>> size=100G >>>> >>>> And the test result are: >>>> Before: >>>> bs=4k IOPS=11.1k, BW=43.2MiB/s >>>> bs=16k IOPS=11.1k, BW=173MiB/s >>>> bs=64k IOPS=11.2k, BW=697MiB/s >>>> >>>> After: >>>> bs=4k IOPS=41.4k, BW=162MiB/s >>>> bs=16k IOPS=41.3k, BW=646MiB/s >>>> bs=64k IOPS=13.5k, BW=843MiB/s >>>> >>>> Signed-off-by: Zhang Yi <yi.zhang@xxxxxxxxxx> >>>> --- >>>> It passed xfstests auto mode with 1k and 4k blocksize. >>> >>> Besides some naming nits (see below) I think this should work. But I have >>> to say I'm a bit uneasy about this because we will now be changing block >>> mapping from unwritten to written only with shared i_rwsem. OTOH that >>> happens during writeback as well so we should be fine and the gain is very >>> nice. >>> >> Thanks for advice, I will change the argument name to make it more reasonable. >> >>> Out of curiosity do you have a real usecase for this? >> >> No, I was just analyse the performance gap in our benchmark tests, and have >> some question and idea while reading the code. Maybe it could speed up the >> first time write in some database. :) >> >> Besides, I want to discuss it a bit more. I originally changed this code to >> switch to take the shared inode and also use ext4_iomap_overwrite_ops for >> the overwriting preallocated blocks case. It will postpone the splitting extent >> procedure to endio and could give a much more gain than this patch (+~27%). >> >> This patch: >> bs=4k IOPS=41.4k, BW=162MiB/s >> Postpone split: >> bs=4k IOPS=52.9k, BW=207MiB/s >> >> But I think it's maybe too radical. I looked at the history and notice in >> commit 0031462b5b39 ("ext4: Split uninitialized extents for direct I/O"), >> it introduce the flag EXT4_GET_BLOCKS_DIO(now it had been renamed to >> EXT4_GET_BLOCKS_PRE_IO) to make sure that the preallocated unwritten >> extent have been splitted before submitting the I/O, which is used to >> prevent the ENOSPC problem if the filesystem run out-of-space in the >> endio procedure. And 4 years later, commit 27dd43854227 ("ext4: introduce >> reserved space") reserve some blocks for metedata allocation. It looks >> like this commit could also slove the ENOSPC problem for most cases if we >> postpone extent splitting into the endio procedure. I don't find other >> side effect, so I think it may also fine if we do that. Do you have some >> advice or am I missing something? > > So you are right these days splitting of extents could be done only on IO > completion because we have a pool of blocks reserved for these cases. OTOH > this will make the pressure on the reserved pool higher and if we are > running out of space and there are enough operations running in parallel we > *could* run out of reserved blocks. So I wouldn't always defer extent > splitting to IO completion unless we have a practical and sufficiently > widespread usecase that would benefit from this optimization. > Sure, I think so, thanks for advice. Yi.