Re: [PATCH v10 11/41] btrfs: implement log-structured superblock for ZONED mode

Johannes Thumshirn <Johannes.Thumshirn@xxxxxxx> · Tue, 24 Nov 2020 09:30:28 +0000

On 23/11/2020 18:49, David Sterba wrote:
> On Tue, Nov 10, 2020 at 08:26:14PM +0900, Naohiro Aota wrote:
>> Superblock (and its copies) is the only data structure in btrfs which has a
>> fixed location on a device. Since we cannot overwrite in a sequential write
>> required zone, we cannot place superblock in the zone. One easy solution is
>> limiting superblock and copies to be placed only in conventional zones.
>> However, this method has two downsides: one is a reduced number of
>> superblock copies. The location of the second copy of superblock is 256GB,
>> which is in a sequential write required zone on typical devices in the
>> market today. So, the number of superblock copies, including the primary,
>> is limited to two. The second downside is that we cannot support devices
>> which have no conventional zones at all.
>>
>> To solve these two problems, we employ superblock log writing. It uses two
>> zones as a circular buffer to write updated superblocks. Once the first
>> zone is filled up, it starts writing into the second zone. Then, when
>> both zones are filled up and before it starts writing to the first zone
>> again, it resets the first zone.
>>
>> We can determine the position of the latest superblock by reading write
>> pointer information from a device. One corner case is when both zones
>> are full. For this situation, we read out the last superblock of each
>> zone, and compare them to determine which zone is older.
>>
>> The following zones are reserved as the circular buffer on ZONED btrfs.
>>
>> - The primary superblock: zones 0 and 1
>> - The first copy: zones 16 and 17
>> - The second copy: zone 1024 or the zone at 256GB, whichever is lower,
>>   and the zone next to it
> 
> I was thinking about that, again. We need a specification. The above is
> too vague.
> 
> - supported zone sizes
>   eg. if device has 256M, how does it work? I think we can support
>   zones from some range (256M-1G), where filling the zone will start
>   filling the other zone, leaving the remaining space empty if needed,
>   effectively reserving the logical range [0..2G] for superblock
> 
> - related to the above, is it necessary to fill the whole zone?
>   if both zones are filled, assuming 1G zone size, do we really expect
>   the user to wait until 2G of data are read?
>   with average reading speed 150MB/s, reading 2G will take about 13
>   seconds, just to find the latest copy of the superblock(!)
> 
> - what are exact offsets of the superblocks
>   primary (64K), ie. not from the beginning
>   as partitioning is not supported, nor bootloaders, we don't need to
>   worry about overwriting them
> 
> - what is an application supposed to do when there's garbage after a
>   sequence of valid superblocks (all zeros can be considered a valid
>   termination block)
> 
> The idea is to provide enough information for a 3rd party tool to read
> the superblock (blkid, progs) and decouple the format from current
> hardware capabilities. If the zones are going to be large in the future
> we might consider allowing further flexibility, or fix the current zone
> maximum to 1G and in the future add a separate incompat bit that would
> extend the maximum to say 10G.
> 

We don't need to do that. All we need to do to find the valid superblock
is issue a report zones call, get the write pointer and then read from
write-pointer - sizeof(struct btrfs_super_block). There is no need to scan
a whole zone. The last thing that was written will be right before the write
pointer.
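
To make the lookup concrete, here is a rough sketch of how a reader could pick
the latest superblock from the two log zones using only write pointers. It is
not the code from the patch: struct zone_info, latest_sb_offset() and the
sb_lookup codes are hypothetical names, the write pointer is taken as a byte
offset on the device, and the superblock size is assumed to be 4096 bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed on-disk superblock size in bytes. */
#define SB_SIZE 4096ULL

/* Hypothetical, simplified view of one zone from a report zones call. */
struct zone_info {
	uint64_t start; /* byte offset of the zone on the device */
	uint64_t len;   /* zone size in bytes */
	uint64_t wp;    /* write pointer, as a byte offset on the device */
};

enum sb_lookup {
	SB_FOUND,     /* *off holds the offset of the latest superblock */
	SB_NONE,      /* both zones empty, nothing written yet */
	SB_BOTH_FULL, /* caller must read the last block of each zone and compare */
	SB_INVALID    /* both zones partially written, unexpected for the log */
};

static int zone_empty(const struct zone_info *z)
{
	return z->wp == z->start;
}

static int zone_full(const struct zone_info *z)
{
	return z->wp == z->start + z->len;
}

static enum sb_lookup latest_sb_offset(const struct zone_info *z0,
				       const struct zone_info *z1,
				       uint64_t *off)
{
	const struct zone_info *z;
	int partial0, partial1;

	/* The write pointer must lie inside its zone (or at its end). */
	assert(z0->wp >= z0->start && z0->wp <= z0->start + z0->len);
	assert(z1->wp >= z1->start && z1->wp <= z1->start + z1->len);

	if (zone_empty(z0) && zone_empty(z1))
		return SB_NONE;
	if (zone_full(z0) && zone_full(z1))
		return SB_BOTH_FULL;

	partial0 = !zone_empty(z0) && !zone_full(z0);
	partial1 = !zone_empty(z1) && !zone_full(z1);
	if (partial0 && partial1)
		return SB_INVALID;

	if (partial0)
		z = z0;
	else if (partial1)
		z = z1;
	else
		z = zone_full(z0) ? z0 : z1; /* one full, one empty */

	/* The last superblock written sits right before the write pointer. */
	*off = z->wp - SB_SIZE;
	return SB_FOUND;
}
```

The only case that needs more than one read is SB_BOTH_FULL, where the last
superblock of each zone has to be read and compared, as the commit message
already describes. Every other case is a single read at the returned offset.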