Re: [PATCH v6 15/28] btrfs: serialize data allocation and submit IOs

Josef Bacik <josef@xxxxxxxxxxxxxx> · Thu, 19 Dec 2019 09:01:35 -0500

On 12/19/19 1:54 AM, Naohiro Aota wrote:
On Tue, Dec 17, 2019 at 02:49:44PM -0500, Josef Bacik wrote:
On 12/12/19 11:09 PM, Naohiro Aota wrote:
To preserve the sequential write pattern on the drives, we must serialize
allocation and submit_bio. This commit adds a per-block group mutex
"zone_io_lock", which find_free_extent_zoned() takes. The lock is kept
even after returning from find_free_extent(). It is released once submission
of the IOs corresponding to the allocation has completed.

Implementing such behavior under __extent_writepage_io() is almost
impossible because once pages are unlocked we cannot tell whether submitting
IOs for an allocated region has finished. Instead, this commit adds
run_delalloc_hmzoned() to write out non-compressed data IOs at once using
extent_write_locked_range(). After the write, we can call
btrfs_hmzoned_data_io_unlock() to unlock the block group for new
allocation.
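
As a rough illustration of the locking scheme described above, here is a
minimal userspace sketch. It is not the kernel code: struct bg_model,
bg_init(), bg_alloc_locked() and bg_submit_and_unlock() are hypothetical
names, and a pthread mutex stands in for the kernel mutex. Only the name
zone_io_lock comes from the commit message.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for a zoned block group. */
struct bg_model {
    pthread_mutex_t zone_io_lock; /* name taken from the commit message */
    uint64_t alloc_offset;        /* next offset handed out by allocation */
    uint64_t write_pointer;       /* models the zone write pointer */
};

static void bg_init(struct bg_model *bg)
{
    pthread_mutex_init(&bg->zone_io_lock, NULL);
    bg->alloc_offset = 0;
    bg->write_pointer = 0;
}

/*
 * Allocate len units. Returns with zone_io_lock HELD, mirroring
 * "the lock is kept even after returning from find_free_extent()".
 */
static uint64_t bg_alloc_locked(struct bg_model *bg, uint64_t len)
{
    pthread_mutex_lock(&bg->zone_io_lock);
    uint64_t start = bg->alloc_offset;
    bg->alloc_offset += len;
    return start;
}

/*
 * "Submit" the write and release the lock. This must run in the same
 * thread that called bg_alloc_locked(), since a mutex has an owner.
 */
static void bg_submit_and_unlock(struct bg_model *bg, uint64_t start,
                                 uint64_t len)
{
    /* Holding the lock across allocation and submit keeps writes sequential. */
    assert(start == bg->write_pointer);
    bg->write_pointer += len;
    pthread_mutex_unlock(&bg->zone_io_lock);
}
```

In this model a second bg_alloc_locked() call blocks until the first
allocation has been submitted, which is the serialization the commit
message asks for.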

Signed-off-by: Naohiro Aota <naohiro.aota@xxxxxxx>

Have you actually tested these patches with lock debugging on?  The
submit_compressed_extents stuff is async, so the unlocker will not be
the lock owner, and that'll make all sorts of things blow up. This is just
straight up broken.

Yes, I have run xfstests on this patch series with lockdep and
KASAN. There was no problem with that.

For non-compressed writes, both allocation and submit are done in
run_delalloc_zoned(). Allocation is done in cow_file_range() and
submit is done in extent_write_locked_range(), so locking and
unlocking are both done by the same execution context.

For compressed writes, again, allocation/lock is done under
cow_file_range(), submit is done in extent_write_locked_range(), and
the unlock follows, all within submit_compressed_extents() (which is
called after compression), so they are all in the same context and the
lock owner does the unlock.

I would really rather see a hmzoned block scheduler that just doesn't submit 
the bio's until they are aligned with the WP, that way this intelligence
doesn't have to be dealt with at the file system layer. I get allocating in 
line with the WP, but this whole forcing us to allocate and submit the bio in 
lock step is just nuts, and broken in your subsequent patches.  This whole 
approach needs to be reworked. Thanks,

Josef

We tried this approach by modifying mq-deadline to wait if the first
queued request is not aligned at the write pointer of a zone. However,
running btrfs without the allocate+submit lock with this modified IO
scheduler did not work well at all. With write intensive workloads, we
observed that a very long wait time was very often necessary to get a
fully sequential stream of requests starting at the write pointer of a
zone. The wait time we observed was sometimes larger than 60 seconds,
at which point we gave up.
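
To make the waiting behavior concrete, here is a toy sketch of a queue
that only dispatches writes aligned with the write pointer. It is an
assumption-laden model, not the modified mq-deadline code: struct
zone_model, struct pending_write and dispatch_aligned() are hypothetical
names invented for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one zone: only the write pointer matters here. */
struct zone_model {
    uint64_t wp; /* next sector that may be written */
};

/* Hypothetical queued write: start sector and length in sectors. */
struct pending_write {
    uint64_t start;
    uint64_t len;
    bool done;
};

/*
 * Dispatch every queued write that can be issued sequentially.
 * A write is issued only if it starts exactly at the write pointer;
 * the queue is rescanned after each pass because a dispatch may make
 * another queued write eligible. Returns the number dispatched.
 */
static size_t dispatch_aligned(struct zone_model *z,
                               struct pending_write *q, size_t n)
{
    size_t dispatched = 0;
    bool progress = true;

    assert(z != NULL && (q != NULL || n == 0));

    while (progress) {
        progress = false;
        for (size_t i = 0; i < n; i++) {
            if (!q[i].done && q[i].start == z->wp) {
                z->wp += q[i].len; /* the device advances the WP */
                q[i].done = true;
                dispatched++;
                progress = true;
            }
        }
    }
    return dispatched;
}
```

If the write starting at the write pointer has not been queued yet,
dispatch_aligned() returns 0 and everything behind the gap stays queued.
That stall is the long wait described above.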

This is because we will only write out the pages we've been handed but do 
cow_file_range() for a possibly larger delalloc range, so as you say there can 
be a large gap in time between writing one part of the range and writing the 
next part.

You actually solve this with your patch, by doing the cow_file_range and then 
following it up with the extent_write_locked_range() for the range you just cow'ed.

There is no need for the locking in this case, you could simply do that and then 
have a modified block scheduler that keeps the bio's in the correct order.  I 
imagine if you just did this with your original block layer approach it would 
work fine.  Thanks,

Josef