Re: [RFC 0/2] ext4: Improve locking sequence in DIO write path

On 9/24/19 8:40 PM, Jan Kara wrote:
Hi Joseph!

On Wed 18-09-19 14:35:15, Joseph Qi wrote:
On 19/9/17 18:32, Ritesh Harjani wrote:
Hello,

This patch series is based on the upstream discussion with Jan
& Joseph @ [1].
It is based on top of Matthew's v3 ext4 iomap patch series [2]

Patch-1: Adds the ext4_ilock/unlock APIs and also replaces all
inode_lock/unlock instances from fs/ext4/*

For now I have already accounted for the trylock/lock semantics issue
(which was discussed here [3]) in the same patch, since this whole
patch is around the inode_lock/unlock API, so I thought it would be
best to address that issue in the same patch.
However, kindly let me know if otherwise.
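
For reference, a minimal sketch of what such a wrapper could look like
(illustrative only; the EXT4_IOLOCK_* flag names below are my assumption,
not necessarily what the patch uses):

        /*
         * Illustrative sketch, not the actual patch. The lock-mode flags
         * are assumed names; the helpers just wrap the VFS inode lock calls.
         */
        #define EXT4_IOLOCK_EXCL        (1 << 0)
        #define EXT4_IOLOCK_SHARED      (1 << 1)

        static inline void ext4_ilock(struct inode *inode, unsigned int iolock)
        {
                if (iolock == EXT4_IOLOCK_EXCL)
                        inode_lock(inode);
                else
                        inode_lock_shared(inode);
        }

        static inline void ext4_iunlock(struct inode *inode, unsigned int iolock)
        {
                if (iolock == EXT4_IOLOCK_EXCL)
                        inode_unlock(inode);
                else
                        inode_unlock_shared(inode);
        }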

Patch-2: The commit message of this patch describes in detail
what it does.
In brief - we first try to take the shared lock (instead of the
exclusive lock), unless it is an unaligned_io or extend_io. Then in
ext4_dio_write_checks(), if we started with the shared lock, we check
whether we can really continue with the shared lock or not. If not,
we release the shared lock, acquire the exclusive lock
and restart ext4_dio_write_checks().
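
Roughly, the flow looks like this (just a sketch, not the actual patch
code; ext4_dio_write_can_do_shared() is a made-up helper name standing
in for the checks done inside ext4_dio_write_checks()):

        unsigned int iolock = (unaligned_io || extend) ?
                              EXT4_IOLOCK_EXCL : EXT4_IOLOCK_SHARED;
restart:
        ext4_ilock(inode, iolock);
        if (iolock == EXT4_IOLOCK_SHARED &&
            !ext4_dio_write_can_do_shared(iocb, from)) {
                /* Shared lock is not enough: retry with the exclusive lock. */
                ext4_iunlock(inode, EXT4_IOLOCK_SHARED);
                iolock = EXT4_IOLOCK_EXCL;
                goto restart;
        }
        /* ... do the DIO write, then ext4_iunlock(inode, iolock); */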


Tested against a few xfstests (with the dioread_nolock mount option);
those ran fine (ext4 & generic).

I tried measuring performance numbers on my VM (since I could not get
hold of any real h/w based test device). I could verify that earlier
we were doing a downgrade_write() of the lock, and that with this
patch that path is now avoided for the fio test case
(as reported by Joseph in [4]).
But for the actual results, I am not sure VM-based testing can
really give the reliable perf numbers we want to look at.
Though I do observe some perf improvement, I could not
get any reliable numbers (not even with the same set of with/without
patches with which Joseph posted his numbers [1]).


@Joseph,
Would it be possible for you to give your test case a run with these
patches? That would be really helpful.

Branch for this is hosted at below tree.

https://github.com/riteshharjani/linux/tree/ext4-ilock-RFC

I've tested your branch; the result is:
mounting with dioread_nolock, it behaves the same as reverting
parallel dio reads + dioread_nolock;
while mounting without dioread_nolock, there is no improvement, or it
is even worse. Please refer to the test data below.

fio -name=parallel_dio_reads_test -filename=/mnt/nvme0n1/testfile
-direct=1 -iodepth=1 -thread -rw=randrw -ioengine=psync -bs=$bs
-size=20G -numjobs=8 -runtime=600 -group_reporting

w/     = with parallel dio reads
w/o    = reverting parallel dio reads

This is with 16c54688592ce8 "ext4: Allow parallel DIO reads" reverted,
right?

He posted the same numbers when he posted the previous reverts too,
which I thought we had already noticed [1].
From [2] below, I assumed we knew this.

[2] - """
(note
that the patches actually improve performance of read-only DIO workload
when not using dioread_nolock as for that case, exclusive lock is replaced with a shared one)
"""


[1]  https://patchwork.ozlabs.org/patch/1153546/
[2] https://lore.kernel.org/linux-ext4/20190830153520.GB25069@xxxxxxxxxxxxxx/


w/o+   = reverting parallel dio reads + dioread_nolock
ilock  = ext4-ilock-RFC
ilock+ = ext4-ilock-RFC + dioread_nolock

bs=4k (each cell: bandwidth, IOPS, latency):
--------------------------------------------------------------
       |            READ           |           WRITE          |
--------------------------------------------------------------
w/    | 30898KB/s,7724,555.00us   | 30875KB/s,7718,479.70us  |
--------------------------------------------------------------
w/o   | 117915KB/s,29478,248.18us | 117854KB/s,29463,21.91us |
--------------------------------------------------------------

I'm really surprised by the numbers here. They would mean that when DIO

While testing my patches I noticed this again, but then when I saw [2]
above, I thought we were aware of this.
My bad, I should have brought this point up once more before going
ahead with implementing our discussed solution.


read takes i_rwsem exclusive lock instead of shared, it is a win for your
workload... Argh, now checking code in fs/direct-io.c I think I can see the
difference. The trick in do_blockdev_direct_IO() is:

         if (iov_iter_rw(iter) == READ && (dio->flags & DIO_LOCKING))
                 inode_unlock(dio->inode);
         if (dio->is_async && retval == 0 && dio->result &&
             (iov_iter_rw(iter) == READ || dio->result == count))
                 retval = -EIOCBQUEUED;
         else
                 dio_await_completion(dio);

So actually only direct IO read submission is protected by i_rwsem with
DIO_LOCKING. Actual waiting for sync DIO read happens with i_rwsem dropped.

After some thought I think the best solution for this is to just finally
finish the conversion of ext4 so that dioread_nolock is the only DIO path.

Sorry, I still didn't get this completely. Could you please explain a bit more?


With i_rwsem held in shared mode even for "unlocked" DIO, it should be
actually relatively simple and most of the dances with unwritten extents
shouldn't be needed anymore.

Again, maybe it's related to the above comment. Could you please give
some insights?


Or do you mean that we should do it like this -
As of now with dioread_nolock, we allocate blocks, mark the extent
entry as unwritten, then do the data IO, and then finally do the
conversion of unwritten to written extents.

So instead of that we would first only reserve the disk blocks (without
making any on-disk changes in the extent tree), do the data IO and then
finally make the entry in the on-disk extent tree. And going
forward keep only this as the default path.
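
Just to make sure I am reading it right, something like this (rough
sketch only; ext4_map_blocks()/ext4_convert_unwritten_extents() are the
existing pieces, while the "reserve"/"insert" helpers are purely
hypothetical names used for illustration):

        /* dioread_nolock today */
        ext4_map_blocks(handle, inode, &map, EXT4_GET_BLOCKS_IO_CREATE_EXT);
        /* ... submit data IO ... */
        ext4_convert_unwritten_extents(handle, inode, offset, len);

        /* proposed: no on-disk extent tree update before the data hits disk */
        ext4_reserve_blocks_in_memory(inode, &map);        /* hypothetical */
        /* ... submit data IO ... */
        ext4_insert_written_extent(handle, inode, &map);   /* hypothetical */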

The above is something I have been looking into for enabling
dioread_nolock for powerpc platforms where blocksize < page_size.
This is based upon an upstream discussion between Ted and you :)


But even with the above, in case of extending writes we will still
have to zero out those extending blocks, no? That will
require an exclusive inode lock anyway for zeroing
(the same is done in XFS too).

So going with the currently discussed solution of mounting with
dioread_nolock to provide performance scalability for mixed read/write
workloads should also be the right approach, no?
Also, looking at the numbers here [3] & [4], this patch series also
seems to improve performance with the dioread_nolock mount option.
Please help me understand your thoughts on this.

[3] - https://marc.info/?l=linux-ext4&m=156921748126221&w=2
[4] - https://raw.githubusercontent.com/riteshharjani/LinuxStudy/master/ext4/fio-output/vanilla-vs-ilocknew-randrw-dioread-nolock-4K.png


-ritesh



