Re: [LSF/MM/BPF TOPIC] bcachefs

Viacheslav Dubeyko <slava@xxxxxxxxxxx> · Thu, 4 Jan 2024 10:43:49 +0300

> On Jan 3, 2024, at 8:52 PM, Kent Overstreet <kent.overstreet@xxxxxxxxx> wrote:
> 
> On Wed, Jan 03, 2024 at 10:39:50AM +0300, Viacheslav Dubeyko wrote:
>> 
>> 
>>> On Jan 2, 2024, at 7:05 PM, Kent Overstreet <kent.overstreet@xxxxxxxxx> wrote:
>>> 
>>> On Tue, Jan 02, 2024 at 11:02:59AM +0300, Viacheslav Dubeyko wrote:
>>>> 
>>>> 
>>>>> On Jan 2, 2024, at 1:56 AM, Kent Overstreet <kent.overstreet@xxxxxxxxx> wrote:
>>>>> 
>>>>> LSF topic: bcachefs status & roadmap
>>>>> 
>>>> 
>>>> <skipped>
>>>> 
>>>>> 
>>>>> A delayed allocation for btree nodes mode is coming, which is the main
>>>>> piece needed for ZNS support
>>>>> 
>>>> 
>>>> I could miss some emails. But have you shared the vision of ZNS support
>>>> architecture for the case of bcachefs already? It will be interesting to hear
>>>> the high-level concept.
>>> 
>>> There's not a whole lot to it. bcache/bcachefs allocation is already
>>> bucket based, where the model is that we allocate a bucket, then write
>>> to it sequentially and never overwrite until the whole bucket is reused.
>>> 
>>> The main exception has been btree nodes, which are log structured and
>>> typically smaller than a bucket; that doesn't break the "no overwrites"
>>> property ZNS wants, but it does mean writes within a bucket aren't
>>> happening sequentially.
>>> 
>>> So I'm adding a mode where every time we do a btree node write we write
>>> out the whole node to a new location, instead of appending at an
>>> existing location. It won't be as efficient for random updates across a
>>> large working set, but in practice that doesn't happen too much; average
>>> btree write size has always been quite high on any filesystem I've
>>> looked at.
>>> 
>>> Aside from that, it's mostly just plumbing and integration; bcachefs on
>>> ZNS will work pretty much just the same as bcachefs on regular block devices.
>> 
>> I assume that you are aware about limited number of open/active zones
>> on ZNS device. It means that you can open for write operations
>> only N zones simultaneously (for example, 14 zones for the case of WDC
>> ZNS device). Can bcachefs survive with such limitation? Can you limit the number
>> of buckets for write operations?
> 
> Yes, open/active zones correspond to write points in the bcachefs
> allocator. The default number of write points is 32 for user writes plus
> a few for internal ones, but it's not a problem to run with fewer.
> 

Frankly speaking, the limit of 14 open/active zones is an extreme case.
Samsung provides a larger number of open/active zones in its ZNS SSDs.
Even WDC has made some promise to increase this number. But what is the
minimum number of write points with which bcachefs can still work and
survive in an environment with a limited number of open/active zones?

So, does changing from the default 32 write points to a smaller number require
modifying the file system driver logic (or maybe even the on-disk layout)?
Or is this a configurable parameter of the file system? Is it an internal
configuration parameter, or can the end user configure the number of write points?
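
To make the question concrete, here is a rough sketch of the constraint I have
in mind. It is only an illustration: the function name, the constant name and
the count of 4 internal write points are my assumptions, not anything taken
from the bcachefs code. It also assumes that every write point holds exactly
one open zone.

```python
# Illustrative only: none of these names exist in bcachefs.
DEFAULT_USER_WRITE_POINTS = 32  # default mentioned earlier in this thread


def user_write_points(max_open_zones: int, internal_write_points: int) -> int:
    """Cap user write points so that user + internal fits the zone limit."""
    available = max_open_zones - internal_write_points
    if available < 1:
        raise ValueError("zone limit leaves no room for user write points")
    return min(DEFAULT_USER_WRITE_POINTS, available)


# 14 open zones (the WDC example) with an assumed 4 internal write points.
print(user_write_points(14, 4))  # prints 10
```

If the real logic is close to this, then the interesting part is what happens
as the available number approaches 1.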

I see from the documentation that the expected bucket size is 128KB - 8MB.
Will bcachefs digest a 256MB - 2GB bucket size without any modifications?
Or could it require some modification of the logic (or even the on-disk layout)?
That is a pretty significant change in bucket size for my taste.
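
To show the scale of the change, here is some quick arithmetic. The 1 TiB
device capacity is just an assumed example, and I treat the sizes as binary
units. The helper name is hypothetical.

```python
KIB, MIB, GIB, TIB = 2**10, 2**20, 2**30, 2**40


def bucket_count(device_bytes: int, bucket_bytes: int) -> int:
    """Number of whole buckets that fit on the device."""
    return device_bytes // bucket_bytes


device = 1 * TIB  # assumed capacity, for illustration only
sizes = [("128KB", 128 * KIB), ("8MB", 8 * MIB),
         ("256MB", 256 * MIB), ("2GB", 2 * GIB)]
for label, size in sizes:
    print(f"{label:>5} bucket: {bucket_count(device, size)} buckets")
```

On such a device, going from 8MB buckets to 2GB buckets reduces the bucket
count from 131072 to 512, and the smallest zone-sized bucket (256MB) is 2048
times larger than the smallest documented bucket (128KB).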

Thanks,
Slava.