Re: [PATCH v9 38/41] btrfs: extend zoned allocator to use dedicated tree-log block group

Naohiro Aota <naohiro.aota@xxxxxxx> · Tue, 10 Nov 2020 15:37:41 +0900

On Tue, Nov 03, 2020 at 03:47:33PM -0500, Josef Bacik wrote:
On 10/30/20 9:51 AM, Naohiro Aota wrote:
This is patch 1/3 to enable the tree log in ZONED mode.

The tree-log feature does not work on ZONED mode as is. Blocks for a
tree-log tree are allocated mixed with other metadata blocks, and btrfs
writes and syncs the tree-log blocks to devices at the time of fsync(),
which happens at a different time than a global transaction commit. As a result,
both writing tree-log blocks and writing other metadata blocks become
non-sequential writes that ZONED mode must avoid.

We can introduce a dedicated block group for tree-log blocks so that
tree-log blocks and other metadata blocks form separate write streams.
As a result, each write stream can be written to devices separately.
"fs_info->treelog_bg" tracks the dedicated block group, and btrfs assigns
"treelog_bg" on demand at tree-log block allocation time.

This commit extends the zoned block allocator to use the block group.
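
As a rough illustration of the on-demand assignment described above, here is
a toy userspace model, not the kernel code. Apart from the "treelog_bg" field
name, every struct and function here is hypothetical, and the policy (dedicate
the first empty block group, never replace it) is a simplifying assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy model only: these are NOT the real btrfs structures. */
struct toy_block_group {
	uint64_t start;     /* first byte covered by this block group */
	uint64_t length;    /* size in bytes */
	uint64_t alloc_off; /* write pointer: next sequential offset */
};

struct toy_fs_info {
	struct toy_block_group *groups;
	size_t nr_groups;
	/* Plays the role of fs_info->treelog_bg; NULL until first use. */
	struct toy_block_group *treelog_bg;
};

/* Sequential-only allocation inside one block group; 0 on success. */
static int toy_bg_alloc(struct toy_block_group *bg, uint64_t size,
			uint64_t *out)
{
	if (bg->length - bg->alloc_off < size)
		return -1;
	*out = bg->start + bg->alloc_off;
	bg->alloc_off += size;
	return 0;
}

/*
 * Allocate a metadata block. Tree-log blocks go only to the dedicated
 * block group; all other metadata skips it, so the two write streams
 * never interleave within a single block group.
 */
static int toy_alloc_metadata(struct toy_fs_info *fs, uint64_t size,
			      int for_treelog, uint64_t *out)
{
	size_t i;

	if (for_treelog) {
		if (fs->treelog_bg)
			return toy_bg_alloc(fs->treelog_bg, size, out);
		/* Assumption: dedicate the first still-empty block group. */
		for (i = 0; i < fs->nr_groups; i++) {
			struct toy_block_group *bg = &fs->groups[i];

			if (bg->alloc_off == 0 &&
			    toy_bg_alloc(bg, size, out) == 0) {
				fs->treelog_bg = bg;
				return 0;
			}
		}
		return -1;
	}

	for (i = 0; i < fs->nr_groups; i++) {
		struct toy_block_group *bg = &fs->groups[i];

		if (bg == fs->treelog_bg)
			continue; /* reserved for the tree-log stream */
		if (toy_bg_alloc(bg, size, out) == 0)
			return 0;
	}
	return -1;
}
```

In this model, once "treelog_bg" is set, regular metadata can no longer use
that block group even if it has free space, which is the accounting concern
raised in the review.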

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@xxxxxxx>
Signed-off-by: Naohiro Aota <naohiro.aota@xxxxxxx>

If you're going to remove an entire block group from being allowed to 
be used for metadata you are going to need to account for it in the 
space_info, otherwise we're going to end up with nasty ENOSPC corners 
here.

Indeed. I'll add a dedicated space_info for treelog or, at least, separate
the block group from the other metadata space_info. However, I'll address
this later, in v11.
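
To make the accounting concern concrete, here is a minimal sketch of moving a
block group's capacity out of the metadata counters. It is a toy model under
stated assumptions: the struct and helper names are hypothetical, and only the
idea of per-space_info total/used counters is taken from the discussion.

```c
#include <stdint.h>

/* Toy counters; NOT the real btrfs space_info structure. */
struct toy_space_info {
	uint64_t total_bytes; /* capacity of all block groups counted here */
	uint64_t bytes_used;  /* bytes already allocated from them */
};

static uint64_t toy_space_free(const struct toy_space_info *si)
{
	return si->total_bytes - si->bytes_used;
}

/* Returns nonzero if a reservation of 'bytes' appears to fit. */
static int toy_can_reserve(const struct toy_space_info *si, uint64_t bytes)
{
	return toy_space_free(si) >= bytes;
}

/*
 * Move one block group's accounting from 'from' to 'to', e.g. from the
 * metadata counters to a dedicated tree-log counter. Returns 0 on
 * success, -1 if the numbers are inconsistent.
 */
static int toy_move_bg_accounting(struct toy_space_info *from,
				  struct toy_space_info *to,
				  uint64_t bg_length, uint64_t bg_used)
{
	if (bg_used > bg_length || from->total_bytes < bg_length ||
	    from->bytes_used < bg_used)
		return -1;
	from->total_bytes -= bg_length;
	from->bytes_used -= bg_used;
	to->total_bytes += bg_length;
	to->bytes_used += bg_used;
	return 0;
}
```

Without the move, toy_can_reserve() on the metadata counters would keep
counting space that regular metadata can never actually use, so a reservation
could succeed and the later allocation fail with ENOSPC.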

But this raises the question, do we want the tree log for zoned?  We 
could just commit the transaction and call it good enough.  We lose 
performance, but zoned isn't necessarily about performance.

We see a large performance drop without tree-log (-o notreelog). Here is a
dbench (32 clients) result on an SMR HDD.

With treelog:    153.509  MB/s
Without treelog:  21.9651 MB/s

So, throughput drops by more than 85%. I think this degradation is too large.
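
For reference, the drop can be recomputed from the two dbench figures above.
The helper below is only an illustrative sketch (its name is made up); with
153.509 MB/s and 21.9651 MB/s it gives roughly 85.7%.

```c
/*
 * Relative throughput drop, in percent, going from 'baseline' to
 * 'reduced'. Both values must use the same unit (here MB/s), and
 * 'baseline' must be nonzero.
 */
static double throughput_drop_percent(double baseline, double reduced)
{
	return (baseline - reduced) / baseline * 100.0;
}
```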

If we do then at a minimum we're going to need to remove this block 
group from the space info counters for metadata.  Thanks,

Josef