Hi folks,
let me share some ideas on improving BlueStore DB/WAL volume usage.
The current BlueStore/BlueFS design requires *standalone* DB/WAL volume(s)
for *each* OSD (aka BlueStore instance).
This results in a need for static physical volume partitioning prior to
OSD deployment, which always raises questions about proper sizing and
tends to be error-prone and/or inefficient (we either have to allocate
plenty of spare space or risk spillover to the slow device). The ability
to resize such volumes (or add new ones) could help here, but that looks
neither simple nor elegant.
I've got an idea how one can eliminate the above 'standalone volume per
instance' requirement and proceed with a single DB/WAL volume shared
among all (or some) OSD instances on a given node.
Preface: BlueFS is currently able to operate on scattered extents from
multiple devices, which significantly simplifies the development effort
for the proposal below.
So the idea is to introduce some arbitration mechanism that allows
multiple OSDs to allocate/release space from a single shared volume.
The simplest approach might be something like this:
1) Logically split the space into pretty large chunks (e.g. 1 GiB) which
serve as allocation units.
2) Put a table at the beginning of the volume (or in a separate file)
which tracks the usage of these chunks. Two bytes per chunk, containing
either an OSD id or a free-block signature, seem sufficient.
3) Each BlueStore (BlueFS) instance that needs more space locks the
table, looks up free chunk(s), marks them busy and then unlocks the
table (see the sketch after this list). I'm not sure about the best
locking mechanism yet - exclusive access to some file? Anything else?
Under Windows one can use named mutexes for such purposes; on Linux the
closest equivalents are probably advisory file locks (flock/fcntl) or
POSIX named semaphores.
Since such allocation requests happen quite seldom and are fast enough,
there is no need for a high-performance or super-efficient mechanism
here. Even simple polling on a locked file seems to be enough.
4) If BlueFS is smart enough to completely empty a chunk one day, it can
release it using the same approach.
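
Just to make the above more concrete, here is a rough sketch of what the
chunk table and a flock()-based allocate/release path could look like.
This is illustrative only - the table layout, the 0xFFFF free marker and
all helper names are made up for the example, nothing here is existing
BlueFS code:

  // chunk_table_sketch.cc -- illustrative only, not BlueStore code.
  // Assumes a shared DB/WAL volume split into fixed 1 GiB chunks and a
  // small on-disk table (here a plain file) with one 16-bit entry per
  // chunk: 0xFFFF = free, anything else = owning OSD id.
  #include <cstdint>
  #include <cerrno>
  #include <vector>
  #include <sys/file.h>   // flock()
  #include <unistd.h>     // pread/pwrite/fsync/close
  #include <fcntl.h>      // open

  static const uint16_t FREE_CHUNK = 0xFFFF;

  // Open the table file and take an exclusive advisory lock on it.
  // flock() blocks until the lock is granted, which is fine here since
  // allocations are rare and the critical section is short.
  int lock_table(const char* path) {
    int fd = open(path, O_RDWR);
    if (fd < 0) return -errno;
    if (flock(fd, LOCK_EX) < 0) { int r = -errno; close(fd); return r; }
    return fd;
  }

  void unlock_table(int fd) {
    flock(fd, LOCK_UN);
    close(fd);
  }

  // Scan the table for the first free chunk, mark it with our OSD id
  // and return its index, or -1 if the shared volume is full.
  ssize_t allocate_chunk(int fd, uint16_t osd_id, size_t num_chunks) {
    std::vector<uint16_t> table(num_chunks);
    size_t bytes = table.size() * sizeof(uint16_t);
    if (pread(fd, table.data(), bytes, 0) != (ssize_t)bytes)
      return -1;
    for (size_t i = 0; i < num_chunks; ++i) {
      if (table[i] == FREE_CHUNK) {
        uint16_t owner = osd_id;
        pwrite(fd, &owner, sizeof(owner), i * sizeof(uint16_t));
        fsync(fd);
        return (ssize_t)i;
      }
    }
    return -1;   // no free space left on the shared volume
  }

  // Release a chunk that BlueFS has completely emptied.
  void release_chunk(int fd, size_t chunk) {
    uint16_t owner = FREE_CHUNK;
    pwrite(fd, &owner, sizeof(owner), chunk * sizeof(uint16_t));
    fsync(fd);
  }

The point is simply that a plain advisory lock around a tiny table scan
should be cheap enough given how rarely allocations happen.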
Two additional issues should be solved:
1) Initialize the table. Can we do that using ceph-volume?
2) Release the corresponding space when removing a specific OSD. We need
to unmark all the related chunks. Some CLI interface that locks the
table using the same means and does the cleanup (see the sketch below)?
Where should the actual code performing the cleanup reside?
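
For issue 2, the cleanup could be as small as the following (again just
a hypothetical helper, reusing the lock_table()/unlock_table() and
FREE_CHUNK names from the sketch above), which a CLI tool could invoke
when an OSD is destroyed:

  // Hypothetical cleanup for issue 2: walk the table and free every
  // chunk still owned by the OSD being removed.
  int release_osd_chunks(const char* table_path, uint16_t osd_id,
                         size_t num_chunks) {
    int fd = lock_table(table_path);
    if (fd < 0) return fd;
    std::vector<uint16_t> table(num_chunks);
    size_t bytes = table.size() * sizeof(uint16_t);
    if (pread(fd, table.data(), bytes, 0) != (ssize_t)bytes) {
      unlock_table(fd);
      return -EIO;
    }
    for (size_t i = 0; i < num_chunks; ++i) {
      if (table[i] == osd_id) {
        uint16_t owner = FREE_CHUNK;
        pwrite(fd, &owner, sizeof(owner), i * sizeof(uint16_t));
      }
    }
    fsync(fd);
    unlock_table(fd);
    return 0;
  }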
Any thoughts/comments?
Thanks,
Igor