Re: [RFC 0/2] ext4: Improve locking sequence in DIO write path

Ritesh Harjani <riteshh@xxxxxxxxxxxxx> · Thu, 26 Sep 2019 18:04:55 +0530

Hello Jan,

Thanks for answering it all and giving all the info.

On 9/25/19 2:53 PM, Jan Kara wrote:
On Wed 25-09-19 01:18:04, Ritesh Harjani wrote:
read takes i_rwsem exclusive lock instead of shared, it is a win for your
workload... Argh, now checking code in fs/direct-io.c I think I can see the
difference. The trick in do_blockdev_direct_IO() is:

          if (iov_iter_rw(iter) == READ && (dio->flags & DIO_LOCKING))
                  inode_unlock(dio->inode);
          if (dio->is_async && retval == 0 && dio->result &&
              (iov_iter_rw(iter) == READ || dio->result == count))
                  retval = -EIOCBQUEUED;
          else
                  dio_await_completion(dio);

So actually only direct IO read submission is protected by i_rwsem with
DIO_LOCKING. Actual waiting for sync DIO read happens with i_rwsem dropped.

After some thought I think the best solution for this is to just finally
finish the conversion of ext4 so that dioread_nolock is the only DIO path.

Sorry, I still didn't get this completely. Could you please explain a bit
more?

Well, currently we have two different locking schemes for DIO - the
"normal" case and the "dioread_nolock" case. And the "normal" case really
only exists because buffered writeback needed to be more careful (so that
nolock DIO cannot expose stale data) and nobody did the effort to make that
work when blocksize < pagesize. But having two different locking schemes
for DIO is really confusing to users and a maintenance burden so we want to
get rid of the old scheme once the "dioread_nolock" scheme works for all
the configurations.

Agreed.

With i_rwsem held in shared mode even for "unlocked" DIO, it should be
actually relatively simple and most of the dances with unwritten extents
shouldn't be needed anymore.

Again, maybe it's related to above comment. Could you please give some
insights?

Now that we hold i_rwsem in shared mode for all of DIO, it isn't really
"unlocked" anymore. Which actually very much limits the races with buffered
writes and thus page writeback (because we flush page cache before doing
DIO).

So looking at the code again based on your inputs from above, we should
be able to remove this condition <snip below> in ext4_dio_write_checks.

What I meant is, in DIO writes path we can start with shared lock
(which the current patch is doing), and only check for below conditions
<snip below> for it to continue using shared lock.
(i.e. we need not check for ext4_should_dioread_nolock(inode) anymore).

That means there should not be any race for non-extent based mode
too right?

Because for overwrites in DIO (for both extents & non-extents)-
(just reiterating)

1. Race against bufferedIO writes and DIO overwrites will be protected
via SHARED inode lock in DIO overwrites & EXCL lock in bufferedIO
writes. Plus we flush page cache before doing DIO.

2. Race against bufferedIO reads & DIO overwrites will be anyway 
protected since we don't do any block allocations during DIO overwrite.

3. Again race against DIO reads & DIO overwrites is not be a problem,
since no block allocation is done anyway. So no stale data will be
exposed.

<snip of change we should do in ext4_dio_write_checks>

        if (*iolock == EXT4_IOLOCK_SHARED &&
            (!IS_NOSEC(inode) || *unaligned_io || *extend ||
-            !ext4_should_dioread_nolock(inode) ||
             !ext4_overwrite_io(inode, offset, count))) {
                ext4_iunlock(inode, *iolock);
                *iolock = EXT4_IOLOCK_EXCL;
                ext4_ilock(inode, *iolock);
                goto restart;
        }

Or do you mean that we should do it like this-
So as of now in dioread_nolock, we allocate blocks, mark the entry into
extents as unwritten, then do the data IO, and then finally do the
conversion of unwritten to written extents.

So this allocation of extents as unwritten in dioread_nolock mode is now
mostly unnecessary. We hold i_rwsem for reading (or even for writing) while
submitting any DIO. Because we flush page cache before starting DIO and new
pages cannot be dirtied by buffered writes (i_rwsem), DIO cannot be racing
with page writeback and thus cannot expose stale block contents. There is
one exception though - which is why I wrote "mostly" above - pages can
still be dirtied through memory mappings (there's no i_rwsem protection for
mmap writes) and thus races with page writeback exposing stale data are still
theoretically possible. In fact the "normal" DIO mode has this kind of race
all the time ext4 exists.

Yes, now that you said I could see this below race even with current
code. Any other race too?

i.e.
ext4_dio_read			ext4_page_mkwrite()

    filemap_write_and_wait_range()
				     ext4_get_blocks()

    submit_bio
    // this could expose the stale data.

				     mark_buffer_dirty()

Ok. I guess we cannot use any exclusive inode lock in ext4_page_mkwrite,
because then we loose the parallelism that it offers right now.
Wonder how other FS protect this race (like XFS?)

I guess we could fix this by falling back to page writeback using unwritten
extents when we have dirty pages locked for writeback and see there's any
DIO in flight for the inode. Sadly that means we we cannot get rid of
writeback code using unwritten extents but at least it won't be hurting
performance in the common case.

1. So why to check for dirty pages locked for writeback (in
ext4_page_mkwrite)? we anyway lock the page which we modify or
allocate block for. So race against bufferedRead should not happen.

2. And even if we check for inode_dio_wait(), we still can't gurantee
that the next DIO may not snoopin while we are in ext4_page_mkwrite?

I am definitely missing something here.

So instead of that we first only reserve the disk blocks, (without
making any on-disk changes in extent tree), do the data IO and then
finally make an entry into extent tree on disk. And going
forward only keep this as the default path.

The above is something I have been looking into for enabling
dioread_nolock for powerpc platforms where blocksize < page_size.
This is based upon an upstream discussion between Ted and you :)

Yes, this is even better solution to dioread_nolock "problem" but it is
also more work 

Yes, I agree that we should do this incrementally.

then just dropping the old DIO locking mode and
fix writeback using unwritten extents for blocksize < pagesize.

So, I was looking into this (to support dioread_nolock for
blocksize < pagesize) but I could not find any history in git,
nor any thread which explains the problem.

I got below understanding while going over code -

1. This problem may be somehow related to bufferheads in the
extent conversion from unwritten to written in writeback path.
But I couldn't see what exactly is the issue there?

I see that the completion path happens via
ext4_end_io
   |- ext4_convert_unwritten_extents() // offset & size is properly 
taken care.
   |- ext4_release_io_end() (which is same in both DIO & writeback).

2. One thing which I noticed is the FIXME in
mpage_map_and_submit_buffers(). Where we
io->io_end->size we add the whole PAGE_SIZE.
I guess we should update the size here carefully
based on buffer_heads.

Could you please give some pointers as to what
is the limitation. Or some hint which I can go
and check by myself.

But even with above, in case of extending writes, we still
will have to zero out those extending blocks no? Which
will require an exclusive inode lock anyways for zeroing.
(same which has been done in XFS too).

Yes, extending writes are a different matter.

So going with current discussed solution of mounting with
dioread_nolock to provide performance scalability in mixed read/write
workload should be also the right approach, no?
Also looking at the numbers here [3] & [4], this patch also seems
to improve the performance with dioread_nolock mount option.
Please help me understand your thoughts on this.

Yes, your patches are definitely going in the right direction. What I'm
discussing is mostly how to make ext4 perform well for mixed read/write
workload by default without user having to use magic mount option.

Really thanks for your support here.

So can we get these patches upstream incrementally?
i.e.
1. As a first step, we can remove this condition
ext4_should_dioread_nolock from the current patchset
(as mentioned above too) &  get this patch rebased
on top of iomap pathset. We should be good to merge
this patchset then, right? Since this should be able
to improve the performance even without dioread_nolock
mount option.

2. Meanwhile I will continue to check for blocksize < pagesize
requirement for dioread_nolock. We can follow the plan
as you mentioned above then.

Your thoughts?

-ritesh