Re: [PATCH v2 2/3] btrfs: zoned: move superblock logging zone location

Damien Le Moal <Damien.LeMoal@xxxxxxx> · Thu, 4 Mar 2021 23:00:04 +0000

On 2021/03/05 0:20, David Sterba wrote:
> On Wed, Mar 03, 2021 at 05:55:47PM +0900, Naohiro Aota wrote:
>> This commit moves the location of superblock logging zones basing on the
>> fixed address instead of the fixed zone number.
>>
>> By locating the superblock zones using fixed addresses, we can scan a
>> dumped file system image without the zone information. And, no drawbacks
>> exist.
>>
>> The following zones are reserved as the circular buffer on zoned btrfs.
>>   - The primary superblock: zone at LBA 0 and the next zone
>>   - The first copy: zone at LBA 16G and the next zone
>>   - The second copy: zone at LBA 256G and the next zone
>>
>> If the location of the zones are outside of disk, we don't record the
>> superblock copy.
>>
>> The addresses are much larger than the usual superblock copies locations.
>> The copies' locations are decided to support possible future larger zone
>> size, not to overlap the log zones. We support zone size up to 8GB.
> 
> One thing I don't see is that the reserved space for superblock is fixed
> regardless of the actual device zone size. In exclude_super_stripes.
> 
> 0-16G for primary
> ... and now what, 16G would be the next copy thus reserving 16 up to 32G
> 
> So the 64G offset for the 1st copy is more suitable:
> 
> 0    -  16G primary
> 64G  -  80G 1st copy
> 256G - 272G 2nd copy
> 
> This still does not sound great because it just builds on the original
> offsets from 10 years ago.  The device sizes are expected to be in
> terabytes but all the superblocks are in the first terabyte.

I do not see an issue with that. For HDDs, one would ideally want each copy
under a different head, but determining which head serves which LBA is not
possible with standard commands. LBAs are generally distributed first across
one head (platter side) for one or more zones, then continue backward on the
next head (the other side of the same platter), and then on to the following
head/platter. So the distribution is first vertical, then goes inward (and when
reaching the middle of the platter, everything starts again from the spindle
outward).

0/64G/256G likely gives you different heads. No way to tell for certain though.

> What if we do that like
> 
> 0   -  16G
> 1T  -  1T+16G
> 8T  -  8T+16G
> 
> The HDD sizes start somewhere at 4T so the first two copies cover the
> small sizes, larger have all three copies. But we could go wild even
> more, like 0/4T/16T.

That would work for HDDs. We are at 20T with SMR now and the lowest SMR capacity
is 14T. For regular disks, yes, 4T is kind of the starting point for enterprise
drives. Consumer/NAS drives can start lower though, at 1T or 2T.
To be able to cover all cases nicely, I would suggest not exceeding 1T for
the SB copies.
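
To make the capacity argument concrete, here is a rough sketch that counts how
many superblock copies fit on a device for each layout discussed in this thread.
It is only an illustration: the 16G reserved span per copy is taken from the
ranges quoted above, capacities are treated as binary TiB for simplicity (real
drives are marketed in decimal TB), and the helper name copies_that_fit is made
up for this example, not something from the patch.

```python
GiB = 1 << 30
TiB = 1 << 40

# Start offsets of (primary, 1st copy, 2nd copy) for the layouts in this thread.
LAYOUTS = {
    "0/64G/256G": (0, 64 * GiB, 256 * GiB),
    "0/1T/8T": (0, 1 * TiB, 8 * TiB),
    "0/4T/16T": (0, 4 * TiB, 16 * TiB),
}

# Assumption: each copy reserves a 16G span, as in the ranges quoted above.
RESERVED = 16 * GiB


def copies_that_fit(offsets, capacity, reserved=RESERVED):
    """Count the copies whose whole reserved span lies inside the device."""
    return sum(1 for off in offsets if off + reserved <= capacity)


if __name__ == "__main__":
    # Capacities mentioned in the thread: 1T/2T consumer, 4T enterprise,
    # 14T and 20T SMR.
    for cap in (1, 2, 4, 14, 20):
        counts = {
            name: copies_that_fit(offsets, cap * TiB)
            for name, offsets in LAYOUTS.items()
        }
        print(f"{cap}T: {counts}")
```

Under these assumptions, only the layout that stays within the first TB keeps
all three copies on every capacity listed.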

> I'm not sure if the capacities for non-HDD are going to be also that
> large, I could not find anything specific, the only existing ZNS is some
> DC ZN540 but no details.

That one is sampling at 2T capacity now. This likely will be a lower boundary
and higher capacities will be available. Not sure yet up to what point. Likely,
different models at 4T, 6T, 8T, 16T... will be available. So kind of the same
story as for HDDs. Keeping the SB copies within the first TB will allow
supporting all models.

So I kind of like your initial suggestion:

0    -  16G primary
64G  -  80G 1st copy
256G - 272G 2nd copy

And we could even do:

0    -  16G primary
128G - 160G 1st copy
512G - 544G 2nd copy

Which would also safely allow larger zone sizes beyond 8G for ZNS too.
(I do not think this will happen anytime soon though, but with these values, we
are safer).
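
As a quick sanity check on the zone size limits, the sketch below assumes each
superblock copy uses two consecutive zones (the zone at the fixed address and
the next one, as described in the patch), so the largest zone a reserved range
can hold is half its span. The ranges are taken literally from the two layouts
above; the function names are made up for this example.

```python
GiB = 1 << 30

# (start, end) of the reserved ranges, taken literally from the two layouts.
LAYOUT_A = [(0, 16 * GiB), (64 * GiB, 80 * GiB), (256 * GiB, 272 * GiB)]
LAYOUT_B = [(0, 16 * GiB), (128 * GiB, 160 * GiB), (512 * GiB, 544 * GiB)]


def max_zone_size(start, end):
    """Largest zone size for which two consecutive zones fit in [start, end)."""
    return (end - start) // 2


def ranges_overlap(ranges):
    """Return True if any two reserved ranges overlap."""
    ordered = sorted(ranges)
    return any(
        prev_end > next_start
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    )


if __name__ == "__main__":
    for name, layout in (("A", LAYOUT_A), ("B", LAYOUT_B)):
        limits = [max_zone_size(s, e) // GiB for s, e in layout]
        print(f"layout {name}: max zone size per copy (GiB) = {limits}, "
              f"overlap = {ranges_overlap(layout)}")
```

With the ranges as written, the two copies in the second layout leave room for
zones up to 16G, while the 0 - 16G primary range still corresponds to 8G zones.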

> 
> We need to get this right (best effort), so I'll postpone this patch
> until it's all sorted.
> 

-- 
Damien Le Moal
Western Digital Research