some thoughts on BlueFS space gift redesign

Hi Sage, et al.

Let me share some ideas on the subject.

The main rationale is to let BlueFS allocate space from the main device on demand when it runs out of free space, instead of relying on the background rebalance procedure. This applies to the gifting part only; the reclaim procedure can run as usual.

The (major?) issue with that, IMO, is the circular dependency that appears if we request the allocation from within BlueFS. BlueFS might need to allocate additional space in response to some RocksDB activity (e.g. compaction). That invokes the BlueStore allocator, followed by a sync write to the DB to save the FreelistManager (FM) data for the newly allocated space. I doubt RocksDB will handle this sequence properly.

Hence, if we go this way, we should break the cycle. IMO one can do that by eliminating DB usage for BlueFS space tracking. In fact, BlueFS itself already holds a full replica of the BlueFS extent list, available via the BlueFS::get_block_extents() call. So let's use it for BlueStore FM/Allocator init (for BlueFS extents only, of course) rather than reading them from the DB. Then there is no need to update FM/DB on a BlueFS gifting allocation.
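
To illustrate the init idea, here is a toy Python model, not Ceph code. The function name, the (offset, length) tuples and the inputs are all hypothetical; the only point is that the free map can be derived from FM data for BlueStore's own allocations plus the extent list reported by BlueFS, without any BlueFS records in the DB.

```python
def init_free_extents(device_size, fm_allocated, bluefs_extents):
    """Toy model: compute free (offset, length) extents on the main device.

    fm_allocated   -- extents BlueStore itself allocated (as read from FM/DB)
    bluefs_extents -- extents owned by BlueFS, as BlueFS itself reports them
                      (stand-in for what BlueFS::get_block_extents() returns)
    """
    used = sorted(list(fm_allocated) + list(bluefs_extents))
    free = []
    cursor = 0
    for offset, length in used:
        if offset > cursor:
            # Gap between the previous used extent and this one is free.
            free.append((cursor, offset - cursor))
        cursor = max(cursor, offset + length)
    if cursor < device_size:
        free.append((cursor, device_size - cursor))
    return free


free = init_free_extents(
    device_size=100,
    fm_allocated=[(0, 10)],
    bluefs_extents=[(20, 10), (50, 5)],
)
```

In this model a gifting allocation changes only the BlueFS extent list, so nothing has to be written to FM/DB at allocation time.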

There is an additional mechanism (BlueStore::_reconcile_bluefs_freespace()) that syncs the BlueFS extent lists tracked by both FM and BlueFS. It handles a potential unexpected termination after the BlueStore allocation and before the BlueFS log commit. But I think it's probably overkill, and we can rely exclusively on the BlueFS log. Indeed, on recovery we can simply behave as if the corresponding BlueStore allocations never happened when the BlueFS log wasn't committed. It looks like there should be no valid BlueFS data in such an area, so BlueStore can treat it as free.
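
A rough sketch of that recovery rule, again as a toy Python model rather than the real BlueFS log format. The record layout, the "alloc_add" op name and the "committed" flag are assumptions made up for the illustration.

```python
def recover_bluefs_extents(log_records):
    """Toy model: rebuild the BlueFS-owned extent list from its log.

    Only records up to the last committed one are replayed. Space that was
    handed out by the main allocator but never made it into a committed log
    record is simply not listed, so the caller can treat it as free.
    """
    owned = []
    for rec in log_records:
        if not rec["committed"]:
            break  # ignore everything past the last committed record
        if rec["op"] == "alloc_add":
            owned.append((rec["offset"], rec["length"]))
    return owned


log = [
    {"op": "alloc_add", "offset": 20, "length": 10, "committed": True},
    {"op": "alloc_add", "offset": 50, "length": 5, "committed": True},
    # Crash happened before this record was committed.
    {"op": "alloc_add", "offset": 70, "length": 8, "committed": False},
]
owned = recover_bluefs_extents(log)
```

Here the extent at offset 70 is absent from the recovered list, which is exactly the "allocation never happened" behaviour described above.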

Once this is resolved, the remaining redesign sub-tasks seem more or less straightforward: on an internal BlueFS allocation failure, call the BlueStore allocator, update the log, and flush it in the regular manner (it looks like there is no need even for an immediate flush after such an allocation).
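
The fallback path could look roughly like this. It is a toy Python model that tracks byte counts only; the class names, the "gift" log entry and the return values are hypothetical and do not mirror the actual BlueFS or BlueStore interfaces.

```python
class ToyAllocator:
    """Tracks only the number of free bytes, not actual extents."""

    def __init__(self, free_bytes):
        self.free_bytes = free_bytes

    def allocate(self, want):
        if want > self.free_bytes:
            return False
        self.free_bytes -= want
        return True


class ToyBlueFS:
    def __init__(self, own_alloc, main_alloc):
        self.own_alloc = own_alloc    # BlueFS-internal allocator
        self.main_alloc = main_alloc  # stand-in for the BlueStore allocator
        self.pending_log = []         # log entries awaiting a regular flush

    def allocate(self, want):
        if self.own_alloc.allocate(want):
            return "bluefs"
        # Internal allocation failed: take the space from the main device.
        if not self.main_alloc.allocate(want):
            raise RuntimeError("no space left on the main device")
        # Record the gift in the log; it is flushed later in the usual
        # manner, with no immediate flush and no FM/DB update.
        self.pending_log.append(("gift", want))
        return "gift"


fs = ToyBlueFS(ToyAllocator(10), ToyAllocator(100))
first = fs.allocate(8)   # served from BlueFS's own space
second = fs.allocate(8)  # BlueFS has 2 bytes left, so this is gifted
```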

What do you think? Have I missed something crucial?


Thanks,

Igor




