Re: Single DB/WAL volume shareable among multiple BlueStore instances

On 5/15/2018 6:57 PM, Sage Weil wrote:
On Tue, 15 May 2018, Igor Fedotov wrote:
Hi folks,

let me share some ideas on improving BlueStore DB/WAL volume usage.

Current BlueStore/BlueFS design requires *standalone* DB/WAL volume(s) per
*each* OSD (aka BlueStore instance).

This requires static partitioning of the physical volume prior to OSD
deployment, which always raises questions about proper sizing and tends to be
error-prone and/or inefficient (we either have to allocate a lot of spare
space or risk spilling over to the slow device). An ability to resize such
volumes (or add additional ones) could help here, but it looks neither simple
nor elegant.

I've got an idea for how to eliminate the above 'standalone volume per
instance' requirement and proceed with a single DB/WAL volume shared among
all (or some) OSD instances on a given node.

Preface: BlueFS is currently able to operate using scattered extents from
multiple devices, which significantly simplifies the development effort for
my proposal below.

So the idea is to introduce an arbitration mechanism that allows multiple
OSDs to allocate/release space from a single volume.

The simplest approach might be something like this:

1) Logically split the space into fairly large chunks (e.g. 1 GB), which
serve as the allocation units.

2) Put a table at the beginning of the volume (or in a separate file) which
tracks the usage of these chunks. Two bytes per chunk, holding either an OSD
id or a 'free chunk' signature, seem sufficient.

3) Each BlueStore (BlueFS) instance that needs more space locks the table,
looks up free chunk(s), marks them busy, and then unlocks the table. I'm not
sure about the best implementation of the locking mechanics yet - exclusive
access to some file? Anything else? Under Windows one can use named mutexes
for such purposes, but AFAIK Linux lacks that ability...

Since such allocation requests are rare and complete quickly, there is no
need for a high-performance or highly efficient mechanism here. Even simple
polling on a locked file seems to be enough.

4) If BlueFS one day becomes smart enough to free up entire chunks, it can
release them using the same approach.
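
To make steps 1-4 more concrete, here is a minimal sketch in Python. It is
not BlueFS code, just an illustration under a few assumptions: the table
lives in a separate file, the 'free chunk' signature is 0xFFFF (so OSD ids
must be smaller than that), and the lock is an advisory flock() on the table
file, polled in non-blocking mode. All names (init_table, locked_table,
allocate, release) are hypothetical.

```python
import contextlib
import fcntl
import os
import struct
import time

CHUNK_SIZE = 1 << 30         # 1 GiB allocation unit, as in step 1
ENTRY = struct.Struct("<H")  # two bytes per chunk, as in step 2
FREE = 0xFFFF                # assumed 'free chunk' signature (hypothetical)


def init_table(path, volume_size):
    """Create a table with every chunk marked free; return the chunk count."""
    nchunks = volume_size // CHUNK_SIZE
    with open(path, "wb") as f:
        f.write(ENTRY.pack(FREE) * nchunks)
    return nchunks


@contextlib.contextmanager
def locked_table(path, poll_interval=0.05):
    """Open the table and poll until an exclusive flock() is obtained."""
    with open(path, "r+b") as f:
        while True:
            try:
                fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
                break
            except BlockingIOError:
                time.sleep(poll_interval)  # someone else holds the lock
        try:
            yield f
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)


def _store(f, table):
    """Write the whole table back and flush it to stable storage."""
    f.seek(0)
    f.write(table)
    f.flush()
    os.fsync(f.fileno())


def allocate(path, osd_id, count=1):
    """Step 3: mark `count` free chunks as owned by osd_id, all or nothing."""
    with locked_table(path) as f:
        table = bytearray(f.read())
        free = [i for i in range(len(table) // ENTRY.size)
                if ENTRY.unpack_from(table, i * ENTRY.size)[0] == FREE]
        if len(free) < count:
            return []  # not enough space; the table is left untouched
        picked = free[:count]
        for i in picked:
            ENTRY.pack_into(table, i * ENTRY.size, osd_id)
        _store(f, table)
        return picked


def release(path, osd_id, chunks):
    """Step 4: hand chunks back, refusing any that osd_id does not own."""
    with locked_table(path) as f:
        table = bytearray(f.read())
        for i in chunks:
            if ENTRY.unpack_from(table, i * ENTRY.size)[0] != osd_id:
                raise ValueError(f"chunk {i} is not owned by OSD {osd_id}")
            ENTRY.pack_into(table, i * ENTRY.size, FREE)
        _store(f, table)
```

With 1 GiB chunks and two bytes per entry the table stays tiny (a 1 TiB
volume needs 1024 entries, i.e. 2 KiB), so rewriting it as a whole under the
lock looks acceptable for requests that are this rare.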


Two additional issues should be solved:

1) Initialize the table. Can we do that using ceph-volume?

2) Release the corresponding space when removing a specific OSD. We need to
unmark all the related extents. Some CLI tool that locks the table using the
same means and does the cleanup? Where should the actual code performing the
cleanup reside?
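
For the second issue, the cleanup itself could be as small as the sketch
below (Python, purely illustrative). It assumes the same layout as above: a
table file with two bytes per chunk, 0xFFFF as a made-up 'free chunk'
signature, and an advisory flock() on the table file as the lock. The name
release_all is hypothetical, and the sketch does not answer where such code
should live.

```python
import fcntl
import os
import struct

ENTRY = struct.Struct("<H")  # two bytes per chunk, as in the proposal
FREE = 0xFFFF                # assumed 'free chunk' signature (hypothetical)


def release_all(table_path, osd_id):
    """Free every chunk owned by osd_id; return the freed chunk indices."""
    freed = []
    with open(table_path, "r+b") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # blocks until the table is unlocked
        try:
            table = bytearray(f.read())
            # Assumes the table length is a multiple of the entry size.
            for off in range(0, len(table), ENTRY.size):
                if ENTRY.unpack_from(table, off)[0] == osd_id:
                    ENTRY.pack_into(table, off, FREE)
                    freed.append(off // ENTRY.size)
            if freed:
                f.seek(0)
                f.write(table)
                f.flush()
                os.fsync(f.fileno())
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
    return freed
```

Running it a second time for the same OSD id finds nothing to free and leaves
the table untouched, so a CLI wrapper around it would be safe to retry.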


Any thoughts/comments?
I'm still questioning how valuable this would be.  It seems important if
the number of OSDs (HDDs) sharing the SSD device changes over time, but
that seems like it's not something that happens in real life.  Assuming
you *do* have a fixed 5:1 ratio of OSDs to the SSD (or whatever the ratio
is), I'm not sure when you would significantly diverge from an even split
5 ways.  Usually the OSDs on the same node will participate in the same
pool(s), so you'd expect the utilization and metadata usage to be
relatively uniform.
The rationale behind this proposal is to eliminate that fixed-ratio assumption at initial deployment. Currently we might easily face a case where the initial DB partition sizing isn't good enough
- either too small or too large.
And it looks like we're unable to recover from that without a full node redeployment. With the proposed approach you can tune the system over time - either remove a single OSD and share the released space among the rest, or bring up yet another OSD to better utilize
the resources.
Secondly, IMO it's just simpler for management purposes to have a single volume rather than multiple partitions.
But certainly that's just a tiny pleasant bonus...

The second thought is that we could instead make bluefs (and bluestore)
deal with resizing gracefully and do the separation of the device into
constituent pieces using LVM.  Which, incidentally, does exactly the 1GB
chunking thing internally (if I'm remembering correctly)?  Having offline
resizing of these devices is something we're probably going to want
anyway, so going this path would let us tick off a few other scenarios as
well.  (I think I already added bluefs device expansion in
ceph-bluestore-tool, but it is very primitive.)
Indeed, LVM might be a good choice for the above issues too. Should we recommend (enforce?) its usage
for DB/WAL?
Speaking of graceful resizing, do you mean it should eventually be done on demand, online, from a running BlueStore?
Or is offline resizing the most we're planning to support?

Thanks,
Igor