Hi cephers!
Let me share some ideas on Bluestore onode extent map format
refactoring. The primary drivers for this proposal are some design
flaws and complexities in extent map sharding, e.g. the one causing
https://tracker.ceph.com/issues/38272
1) I think we can consider introducing a logical extent size
granularity. Let's align it with the device block size (usually 4KB).
This change will bring some issues in handling small and/or unaligned
write and zero ops, though:
- For such writes we'll need to read the head and tail extents to
build whole 4KB blocks (see the sketch below), which we currently do
for non-shared, non-compressed extents anyway. As a benefit, having a
merged extent pointing to a single disk block might improve subsequent
reads.
- For small/unaligned zero op handling we'll need to perform an actual
write rather than update the extent map only. And (without special
care) a sequence of such ops wouldn't actually release the space
allocated by the zeroed extents, as it may do now.
The major benefit is a significant reduction in the number of extents
one can expect within a given logical span; see below for details.
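
To illustrate the head/tail handling, a minimal sketch (all names and
helpers here are mine, not actual Bluestore code):

  #include <cstdint>

  constexpr uint64_t EXTENT_GRANULARITY = 4096; // device block size

  struct AlignedSpan {
    uint64_t offset; // rounded down to the 4KB boundary
    uint64_t length; // rounded up to cover the tail block
  };

  // Expand [offset, offset + length) to 4KB boundaries; the head and
  // tail blocks would have to be read first so the whole span can be
  // rewritten as granularity-aligned extents.
  AlignedSpan align_write(uint64_t offset, uint64_t length) {
    uint64_t head = offset & ~(EXTENT_GRANULARITY - 1);
    uint64_t tail = (offset + length + EXTENT_GRANULARITY - 1)
                    & ~(EXTENT_GRANULARITY - 1);
    return AlignedSpan{head, tail - head};
  }

E.g. a 0x200-byte write at offset 0x1100 expands to [0x1000, 0x2000),
so the head/tail data within that block gets read and merged first.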
2) We can prohibit extents/blobs protruding beyond
bluestore_max_blob_size boundaries (64KB for SSD and 512KB for HDD by
default). Spanning writes would be split if needed (see the sketch
below).
The disadvantages are potentially less benefit from compression and a
higher number of blob entries.
But as a result we'll have permanently available sharding points: one
can always split the extent map at known logical offsets and be sure
that no blobs span over several resulting shards.
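
A sketch of the write splitting (the do_write callback is just a
stand-in for the real per-chunk write path):

  #include <algorithm>
  #include <cstdint>
  #include <functional>

  // Split a logical write so that no resulting piece crosses a
  // max_blob_size-aligned boundary.
  void split_at_blob_boundaries(
      uint64_t offset, uint64_t length, uint64_t max_blob_size,
      const std::function<void(uint64_t, uint64_t)>& do_write) {
    while (length > 0) {
      // distance to the next max_blob_size-aligned boundary
      uint64_t boundary = (offset / max_blob_size + 1) * max_blob_size;
      uint64_t chunk = std::min(length, boundary - offset);
      do_write(offset, chunk);
      offset += chunk;
      length -= chunk;
    }
  }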
The above constraints also reduce the maximum number of extent/shard
entries a single onode might have, which probably allows encoding them
more efficiently, e.g. using bitmaps.
With byte granularity and a 128MB maximum onode size one could have up
to 128M entries.
In the proposed model this reduces to 32K (= 128MB / 4KB) for extents
and 2K (= 128MB / 64KB (= max_blob_size for SSD)) for shards.
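
The bounds as compile-time arithmetic:

  #include <cstdint>

  constexpr uint64_t MAX_ONODE_SIZE = 128ull << 20; // 128MB
  constexpr uint64_t EXTENT_GRAN    = 4ull << 10;   // 4KB granularity
  constexpr uint64_t MAX_BLOB_SSD   = 64ull << 10;  // 64KB max blob (ssd)

  constexpr uint64_t max_extents = MAX_ONODE_SIZE / EXTENT_GRAN;  // 32K
  constexpr uint64_t max_shards  = MAX_ONODE_SIZE / MAX_BLOB_SSD; // 2K
  static_assert(max_extents == 32 * 1024, "");
  static_assert(max_shards  == 2 * 1024, "");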
The maximum number of extents/blobs per the smallest shard will be 128
(= 512KB / 4KB). This might still produce a rather heavily encoded
shard, so the suggestion is to perform a sort of garbage collection in
this case (when the encoded shard size reaches some threshold):
collect all the relevant data for the max_blob_size span (including
data in compressed and shared blobs) and rewrite it to a single
resulting blob.
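
A rough sketch of that GC path; all the hooks here are hypothetical
stand-ins for the real read/write machinery:

  #include <cstddef>
  #include <cstdint>
  #include <functional>
  #include <vector>

  using ReadSpanFn =
      std::function<std::vector<uint8_t>(uint64_t, uint64_t)>;
  using RewriteBlobFn =
      std::function<void(uint64_t, const std::vector<uint8_t>&)>;

  void maybe_gc_shard(uint64_t shard_offset, size_t encoded_size,
                      size_t threshold, uint64_t max_blob_size,
                      const ReadSpanFn& read_span,
                      const RewriteBlobFn& rewrite_as_single_blob) {
    if (encoded_size < threshold)
      return; // shard is still cheap enough to encode
    // Collect all current data for the max_blob_size span, including
    // data sitting in compressed and shared blobs.
    auto data = read_span(shard_offset, max_blob_size);
    // Overwrite the span with a single resulting blob; the fragmented
    // extents/blobs it replaces can then be released.
    rewrite_as_single_blob(shard_offset, data);
  }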
Generally the above changes are probably enough to fix the spanning
blob issue and simplify sharding. But it might be worth considering
the following onode substructure format changes as well.
3) Refactor shard_info encoding. Currently it takes around 6 bytes per
entry (3-4 bytes for the "offset" field and 1-2 bytes for the "bytes"
field).
3.1) Firstly, we don't need the "bytes" field any more, since we no
longer have to estimate the average extent size to make decisions
during resharding.
Without the "bytes" field we'll need at most 8KB (= 2K (max shards) *
4 bytes) to keep all possible shard_info entries.
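
The reduced shard_info could then be as trivial as (a sketch, not the
actual struct):

  #include <cstdint>
  #include <vector>

  struct shard_info_v2 {
    uint32_t offset; // logical shard offset, 64K/512K aligned
  };
  static_assert(sizeof(shard_info_v2) == 4, "");

  // <= 2048 entries => <= 8KB worst case, matching the estimate above
  using shard_table = std::vector<shard_info_v2>;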
3.2) Additionally, since we have "granular" shard boundaries (64K/512K
aligned), we can use a bitmap to track them instead: bit = 1 marks a
shard start, bit = 0 a shard continuation. For 2K shards one needs 256
bytes (= 2048 / 8) to keep the full map, i.e. this scheme becomes
beneficial over the current one at 64+ shards. To keep low shard
counts efficient as well, we can use a mixed scheme that utilizes
offset enumeration for low-density cases and evolves to a bitmap for
higher ones.
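
A sketch of the mixed scheme's selection logic, using the sizes
estimated above:

  #include <bitset>
  #include <cstddef>
  #include <cstdint>

  constexpr size_t MAX_SHARDS = 2048;

  // bit i == 1: a shard starts at slot i (offset i * max_blob_size);
  // bit i == 0: slot i continues the previous shard.
  using ShardBitmap = std::bitset<MAX_SHARDS>; // 2048 / 8 = 256 bytes

  // Pick whichever representation encodes smaller.
  bool bitmap_is_cheaper(size_t num_shards) {
    size_t enumerated = num_shards * sizeof(uint32_t); // 4B per offset
    size_t bitmap     = MAX_SHARDS / 8;                // flat 256 bytes
    return bitmap < enumerated; // break-even around 64 shards
  }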
4) Replace the existing logical extent + blob encoding scheme with a
new one.
The current encoding:
(<extent info: offs, len, blob_offs> + (<blob_id> | <full blob>))*
can probably be replaced with:
(<full blob> + <new_extent_info>)*
where new_extent_info is either
= a bitmap (bit length = max extents per shard = 128) => 128 / 8 = 16
bytes max
or
= (<extent info: offs, len, blob_offs>)*, i.e. the set of
previous-style extent infos relevant to this specific blob.
The encoding format choice is made by comparing the estimated sizes of
both methods. If we estimate the current average extent_info size at 8
bytes, we can see that the bitmap is beneficial for 3+ extents per
blob.
This way we cap the full set of extents at 16 bytes per blob, compared
to 512 bytes (= 8 (avg. extent_info) * 64 (= 128/2 max standalone
extents per blob)) for the original approach. And we're still on par
with the original approach when only 1-2 extents per blob are present;
see the sketch below.
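
A sketch of how new_extent_info might look, assuming the bitmap marks
which of the shard's 4KB slots map to the blob (names illustrative):

  #include <array>
  #include <cstddef>
  #include <cstdint>
  #include <variant>
  #include <vector>

  struct extent_info {   // previous-style entry, ~8 bytes on average
    uint32_t offset;     // logical offset
    uint16_t length;
    uint16_t blob_offset;
  };

  // 128 possible extents per shard => 128 bits => 16 bytes flat.
  // Bit i set: the shard's i-th 4KB slot belongs to this blob
  // (lengths/blob offsets presumably derivable from the slot runs).
  using extent_bitmap = std::array<uint8_t, 16>;

  using new_extent_info = std::variant<extent_bitmap,
                                       std::vector<extent_info>>;

  // Encoding choice: use the bitmap once it is the smaller of the
  // two, i.e. at 3+ extents per blob given ~8 bytes per extent_info.
  bool use_bitmap(size_t extents_in_blob) {
    return extents_in_blob * sizeof(extent_info)
           > sizeof(extent_bitmap);
  }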
What do you think?
Thanks,
Igor