Hi Sage, et al.
Let me share some ideas on the subject.
The main rationale is to let BlueFS allocate space from the main device
on demand when it runs out of available space, instead of relying on the
background rebalance procedure. This applies to the gifting part only;
the reclaiming procedure can run as usual.
The (major?) issue with that, IMO, is the circular dependency that
appears if we request the allocation from within BlueFS. BlueFS might
need to allocate additional space in response to some RocksDB activity
(e.g. compaction). That would invoke the BlueStore allocator, followed
by a sync write to the DB to save FreelistManager (FM) data for the
newly allocated space. And I doubt RocksDB will handle this sequence
properly.
Hence if we go this way we should break the circle. IMO one can do that
by eliminating DB usage for BlueFS space tracking. In fact we already
have a full replica of the BlueFS extent list within BlueFS itself,
available via the BlueFS::get_block_extents() call. So let's start using
it for BlueStore FM/Allocator init (for BlueFS extents only, of course)
rather than reading them from the DB. Then there is no need to update
FM/DB on a BlueFS gifting allocation.
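
To make the init idea concrete, here is a rough sketch (not real Ceph
code). Extent, FreeMap, mark_used() and init_free_map() are hypothetical
stand-ins, and the vector passed in merely plays the role of whatever
BlueFS::get_block_extents() would report. It assumes the DB-backed
freelist no longer knows about BlueFS space at all, so the BlueFS-owned
extents are subtracted at init time:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical, simplified stand-ins; not the real Ceph types.
struct Extent {
  uint64_t offset;
  uint64_t length;
};

// Free space as offset -> length, with non-overlapping ranges.
using FreeMap = std::map<uint64_t, uint64_t>;

// Remove [off, off + len) from the free map.
// The range must lie entirely inside a single free range.
void mark_used(FreeMap& free, uint64_t off, uint64_t len) {
  auto it = free.upper_bound(off);
  assert(it != free.begin());
  --it;
  const uint64_t f_off = it->first;
  const uint64_t f_end = it->first + it->second;
  assert(off >= f_off && off + len <= f_end);
  free.erase(it);
  if (off > f_off)
    free[f_off] = off - f_off;            // keep the part before the range
  if (off + len < f_end)
    free[off + len] = f_end - (off + len); // keep the part after the range
}

// Build the allocator's initial free map: start from what the freelist
// reports as free, then subtract the extents that BlueFS itself reports
// as owned. No DB read is involved for the BlueFS part.
FreeMap init_free_map(const FreeMap& fm_free,
                      const std::vector<Extent>& bluefs_extents) {
  FreeMap free = fm_free;
  for (const auto& e : bluefs_extents)
    mark_used(free, e.offset, e.length);
  return free;
}
```

The point is only that the BlueFS extent list becomes the single source
of truth for BlueFS-owned space; how this maps onto the actual
FM/Allocator interfaces is left open.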
There is an additional mechanism
(BlueStore::_reconcile_bluefs_freespace()) to sync the BlueFS extent
lists tracked by both FM and BlueFS. It handles a potential unexpected
termination after a BlueStore allocation but before the BlueFS log
commit. But I think it's probably overkill and we can rely exclusively
on the BlueFS log. Indeed, on recovery we can simply behave as if the
corresponding BlueStore allocations never happened when the BlueFS log
wasn't committed. It looks like there can be no valid BlueFS data in
such an area, so BlueStore can treat it as free.
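
A toy model of that recovery rule, under the assumption that each gift
is a log record that either reached disk or did not. LogRecord, its
committed flag and replay_owned_extents() are made up for illustration;
the real BlueFS log format and replay code look different:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-ins; not the real Ceph types.
struct Extent {
  uint64_t offset;
  uint64_t length;
};

struct LogRecord {
  Extent gift;     // extent gifted to BlueFS by the main-device allocator
  bool committed;  // modeling assumption: did this record reach disk?
};

// Replay the log and return the extents BlueFS owns after recovery.
// Replay stops at the first record that was never committed, so a gift
// that was allocated but not logged is simply not owned by BlueFS.
std::vector<Extent> replay_owned_extents(const std::vector<LogRecord>& log) {
  std::vector<Extent> owned;
  for (const auto& r : log) {
    if (!r.committed)
      break;
    owned.push_back(r.gift);
  }
  return owned;
}
```

Everything outside the returned list is then treated as free by
BlueStore, which is why no separate reconcile pass would be needed in
this model.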
Once this is resolved, the remaining redesign sub-tasks seem to be more
or less straightforward: on an internal BlueFS allocation failure, call
the BlueStore allocator, update the log and flush it in the regular
manner (it looks like there is not even a need for an immediate flush
after such an allocation).
What do you think? Have I missed something crucial?
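
The fallback path could look roughly like the sketch below. MiniBlueFS,
GiftFn and pending_log() are hypothetical names for illustration only;
the callback stands in for the BlueStore allocator, and the first-fit
pool is just the simplest thing that shows the control flow:

```cpp
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-ins; not the real Ceph types.
struct Extent {
  uint64_t offset;
  uint64_t length;
};

class MiniBlueFS {
public:
  // Stand-in for the main-device allocator: returns true and fills
  // *out with an extent of at least `want` bytes on success.
  using GiftFn = std::function<bool(uint64_t want, Extent* out)>;

  explicit MiniBlueFS(GiftFn gift) : gift_(std::move(gift)) {}

  void add_space(const Extent& e) { pool_.push_back(e); }

  bool allocate(uint64_t want, Extent* out) {
    if (take_from_pool(want, out))
      return true;
    // Internal allocation failed: ask the main device on demand.
    Extent gifted{0, 0};
    if (!gift_(want, &gifted))
      return false;
    // Record the gift in the log only; it goes out with the next
    // regular flush. No FM/DB update happens on this path.
    pending_log_.push_back(gifted);
    pool_.push_back(gifted);
    return take_from_pool(want, out);
  }

  const std::vector<Extent>& pending_log() const { return pending_log_; }

private:
  // First-fit: carve `want` bytes off the front of the first extent
  // that is large enough.
  bool take_from_pool(uint64_t want, Extent* out) {
    for (auto& e : pool_) {
      if (e.length >= want) {
        *out = Extent{e.offset, want};
        e.offset += want;
        e.length -= want;
        return true;
      }
    }
    return false;
  }

  GiftFn gift_;
  std::vector<Extent> pool_;
  std::vector<Extent> pending_log_;
};
```

Since the gift path touches only the BlueFS log and never the DB, the
circular dependency on RocksDB does not arise in this model.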
Thanks,
Igor