Bluestore onode improvements suggestion.

Hi cephers!

Let me share some ideas on refactoring the Bluestore onode extent map format. The primary drivers for this proposal are some design flaws and complexities in extent map sharding, e.g. the one causing https://tracker.ceph.com/issues/38272

1) I think we can consider introducing a logical extent size granularity, aligned with the device block size (usually 4KB).

This change introduces some complications in handling small and/or unaligned write and zero ops though:

- For such writes we'll need to read the head and tail extents to build whole 4K blocks, which we currently do for non-shared and non-compressed extents anyway. As a benefit, having a merged extent pointing to a single disk block might improve subsequent reads (see the sketch right after this list).

- For small/unaligned zero op handling we'll need to perform an actual write rather than just updating the extent map. And (without special care) a sequence of such ops wouldn't actually release the space allocated by the zeroed extents, as it can do now.

The major benefit is a significant reduction in the number of extents one can expect within a given logical span; see below for details.
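
To illustrate the read-modify-write case from the first bullet, here is a minimal standalone sketch of the alignment math under the proposed 4K granularity (names and helpers are made up for illustration, this is not existing Bluestore code):

#include <cstdint>

constexpr uint64_t BLOCK_SIZE = 4096;  // assumed device block size / extent granularity

struct AlignedWrite {
  uint64_t aligned_offset;  // write start rounded down to BLOCK_SIZE
  uint64_t aligned_length;  // length rounded up to cover the tail block
  uint64_t head_fill;       // bytes to read back before the user data
  uint64_t tail_fill;       // bytes to read back after the user data
};

inline AlignedWrite align_write(uint64_t offset, uint64_t length) {
  uint64_t end = offset + length;
  uint64_t aligned_offset = offset & ~(BLOCK_SIZE - 1);
  uint64_t aligned_end = (end + BLOCK_SIZE - 1) & ~(BLOCK_SIZE - 1);
  return AlignedWrite{aligned_offset,
                      aligned_end - aligned_offset,
                      offset - aligned_offset,   // head portion to read-modify-write
                      aligned_end - end};        // tail portion to read-modify-write
}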

2) We can prohibit extents/blobs from crossing bluestore_max_blob_size boundaries (64KB for SSD and 512KB for HDD by default). Spanning writes would be split as needed.

As disadvantages one can name a potentially smaller benefit from compression and a higher number of blob entries.

But as a result we'll have permanently available sharding points - one can always split the extent map at known logical offsets and be sure that no blobs span several resulting shards.
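
A minimal sketch of the splitting rule, assuming the 64K/512K max_blob_size values above (illustrative code only, not existing Bluestore logic):

#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Chop a logical write into pieces that never cross a max_blob_size
// boundary, so no resulting extent/blob spans two shards.
std::vector<std::pair<uint64_t, uint64_t>>   // (offset, length) pieces
split_at_blob_boundaries(uint64_t offset, uint64_t length,
                         uint64_t max_blob_size /* 64K ssd / 512K hdd */) {
  std::vector<std::pair<uint64_t, uint64_t>> pieces;
  while (length > 0) {
    uint64_t boundary = (offset / max_blob_size + 1) * max_blob_size;
    uint64_t l = std::min(length, boundary - offset);
    pieces.emplace_back(offset, l);
    offset += l;
    length -= l;
  }
  return pieces;
}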

The above constraints also reduce the maximum number of extent/shard entries a single onode might have, which probably allows encoding them more efficiently, e.g. using bitmaps.

With byte granularity and a 128MB maximum onode size one could have up to 128M entries.

In the proposed model this reduces to 32K (= 128MB / 4KB) for extents and 2K (= 128MB / 64KB (= max_blob_size_ssd)) for shards.
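
For reference, the same arithmetic expressed as compile-time checks (constants taken from the text above, nothing Bluestore-specific):

#include <cstdint>

constexpr uint64_t MAX_ONODE_SIZE     = 128ull << 20;  // 128MB
constexpr uint64_t EXTENT_GRANULARITY = 4ull << 10;    // 4KB
constexpr uint64_t MAX_BLOB_SIZE_SSD  = 64ull << 10;   // 64KB

static_assert(MAX_ONODE_SIZE / EXTENT_GRANULARITY == 32 * 1024, "32K extents max");
static_assert(MAX_ONODE_SIZE / MAX_BLOB_SIZE_SSD  == 2 * 1024,  "2K shards max");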

The maximum number of extents/blobs in the smallest shard will be 128 (512K / 4K). This might still produce a rather heavily encoded shard - the suggestion is to perform a sort of garbage collection in this case (when the encoded shard size reaches some threshold): collect all the relevant data for the max_blob_size span (including data in compressed and shared blobs) and overwrite it with a single resulting blob.
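
A rough sketch of that compaction trigger, with hypothetical names just to show the intent (not existing Bluestore code):

#include <cstddef>
#include <cstdint>
#include <optional>

struct CompactionRequest {
  uint64_t span_offset;  // max_blob_size-aligned start of the span to rewrite
  uint64_t span_length;  // whole max_blob_size span, rewritten as one plain blob
};

std::optional<CompactionRequest>
maybe_compact(uint64_t span_offset, uint64_t max_blob_size,
              size_t encoded_shard_size, size_t encoded_size_threshold) {
  if (encoded_shard_size <= encoded_size_threshold)
    return std::nullopt;
  // The caller would then read the whole span back (including data held in
  // compressed and shared blobs) and overwrite it as a single new blob.
  return CompactionRequest{span_offset, max_blob_size};
}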

Generally the above changes are probably enough to fix the spanning blob issue and simplify sharding. But it might be worth considering the following changes to the onode substructures' formats.

3) Refactor shard_info encoding. Currently it takes around 6 bytes per entry (3-4 bytes for the "offset" field and 1-2 bytes for the "bytes" field).

3.1) Firstly, we don't need the "bytes" field any more, as we no longer need to estimate the average extent size to make resharding decisions. Without the "bytes" field we'll need at most 8KB (= 2K (max shards) * 4 bytes) to keep all possible shard_info entries.

3.2) Additionally, as we now have "granular" shard boundaries (64K/512K aligned), we can use a bitmap to track them instead: bit = 1 - shard start, bit = 0 - shard continuation. For 2K shards one needs 256 bytes (= 2048 / 8) to keep the full map, i.e. this scheme becomes beneficial over the current one at 64+ shards. To handle low shard counts more efficiently we can use a mixed scheme that uses offset enumeration for low-density cases and evolves to a bitmap for higher ones.
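
A sketch of how the mixed scheme could pick between the two encodings, assuming ~4 bytes per offset entry after dropping the "bytes" field (all names illustrative):

#include <cstddef>

constexpr size_t BYTES_PER_OFFSET_ENTRY = 4;  // shard_info without the "bytes" field

size_t offset_list_size(size_t num_shards) {   // explicit offset enumeration
  return num_shards * BYTES_PER_OFFSET_ENTRY;
}

size_t bitmap_size(size_t max_shards) {        // start/continuation bitmap
  return (max_shards + 7) / 8;                 // e.g. 2048 slots -> 256 bytes
}

bool prefer_bitmap(size_t num_shards, size_t max_shards) {
  // With 4-byte entries and 2048 slots the break-even point is 64 shards.
  return bitmap_size(max_shards) <= offset_list_size(num_shards);
}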

4) Replace the existing logical extent + blob encoding scheme with a new one:

Current encoding:

(<extent info: offs, len, blob_offs> + (<blob_id> | <full blob>))*

can probably be replaced with:

(<full blob> + <new_extent_info>)*

where new_extent_info is either

= bitmap (bit length = max extents per shard = 128) => 128 / 8 = max 16 bytes

or

= (<extent info: offs, len, blob_offs>)*, i.e. the set of previous extent infos relevant to this specific blob.

The encoding format choice is made by comparing the estimated sizes of the two methods. If we estimate the current average extent_info size at 8 bytes, we can see that the bitmap is beneficial for 3+ extents per blob.

This way we cap the full set of extents to 16 bytes per blob, compared to 512 bytes (= 8 (avg. extent_info size) * 64 (= 128/2 max standalone extents per blob)) for the original approach. And we're still on par with the original approach when only 1-2 extents per blob are present.
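
A sketch of that per-blob choice, using the size estimates above (illustrative code, not existing Bluestore structures):

#include <cstddef>

constexpr size_t MAX_EXTENTS_PER_SHARD = 128;                        // 512K / 4K
constexpr size_t EXTENT_BITMAP_BYTES   = MAX_EXTENTS_PER_SHARD / 8;  // 16 bytes
constexpr size_t AVG_EXTENT_INFO_BYTES = 8;                          // rough current average

enum class ExtentEncoding { ExplicitList, Bitmap };

ExtentEncoding choose_extent_encoding(size_t extents_in_blob) {
  // 1-2 extents: an explicit extent_info list costs at most 16 bytes;
  // 3+ extents: the fixed 16-byte bitmap wins.
  return extents_in_blob * AVG_EXTENT_INFO_BYTES <= EXTENT_BITMAP_BYTES
             ? ExtentEncoding::ExplicitList
             : ExtentEncoding::Bitmap;
}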


What do you think?


Thanks,

Igor




