Re: mon upgrades and leveldb->rocksdb conversion

Sage Weil <sweil@xxxxxxxxxx> · Thu, 11 Oct 2018 21:48:56 +0000 (UTC)

On Thu, 11 Oct 2018, Joao Eduardo Luis wrote:
> On 10/11/2018 07:00 PM, Gregory Farnum wrote:
> > On Thu, Oct 11, 2018 at 7:11 AM Joao Eduardo Luis <joao@xxxxxxx> wrote:
> >>
> >> On 10/11/2018 02:58 PM, Sage Weil wrote:
> >>>
> >>> I'm worried the above is a lot of complexity and opportunity for bugs
> >>> (and work to implement) for not a lot of gain.  What if we instead make
> >>> ceph-monstore-tool have a 'convert' function that will do a conversion
> >>> offline?  The admin can take each mon down in turn, convert it, and bring
> >>> it back up.  Provisioning tools could automate this process.
> >>>
> >>> This will require ~2x the disk space for the conversion.  OTOH, if space
> >>> is tight, the user can also just blow away the mon entirely and create
> >>> it, and let the normal sync bring it back into quorum...
> >>
> >> The problem with both approaches is that, during this period, the quorum
> >> is degraded.
> >>
> >> We can argue that the way to prevent that is to add a new monitor, let
> >> it sync, and then remove an old mon, but we may not have spare hardware
> >> to make this work.
> >>
> >> I do agree that this would be a complex solution for something that
> >> would be used 3, maybe 5 times in the lifespan of a cluster, but this is
> >> also the sort of thing that shouldn't make the user jump through hoops
> >> to accomplish.
> > 
> > Seems to me that if your cluster is in that much danger from a
> > degraded mon cluster, you've designed your mon cluster failure
> > tolerances badly?
> 
> Well, I don't think it's that uncommon for clusters to be running with 3
> monitors. Drop one for offline conversion, and we can't tolerate a
> single failure without loss of quorum. It only takes chance for
> something this trivial to become a bad day for someone.

FWIW the offline conversion should only take a minute or two, even for 
large clusters and fat mons... much less in most cases.  Even if they do 
have a second mon failure I don't think it would lead to a significant 
outage.
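
For reference, the arithmetic behind the 3-mon concern can be sketched as below. This is only an illustration, assuming quorum means a strict majority of the mons in the monmap; the function names are made up for this example and are not part of any Ceph tool.

```python
def quorum_size(n_mons: int) -> int:
    """Smallest strict majority of n_mons monitors."""
    return n_mons // 2 + 1


def failures_tolerated(n_mons: int, n_down: int = 0) -> int:
    """Additional mon failures that can be survived while n_down mons
    are already offline (for example, one taken down for conversion)."""
    return max(n_mons - n_down - quorum_size(n_mons), 0)


if __name__ == "__main__":
    for n in (3, 5):
        print(
            f"{n} mons: quorum={quorum_size(n)}, "
            f"tolerated={failures_tolerated(n)}, "
            f"tolerated with one down={failures_tolerated(n, n_down=1)}"
        )
```

With 3 mons and one down for conversion there is no slack left for the duration of the conversion; with 5 mons one further failure is still survivable.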

My vote is for simple!

...and, FWIW, given that my initial diagnosis is probably wrong, simply 
changing the kv_type from leveldb -> rocksdb and restarting the mon should 
be sufficient.  The mons could even do this on their own if they wanted 
to, without external tooling.  We could make a 'tell' command to do it...
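
A rough sketch of how a provisioning tool might drive the "take each mon down in turn" procedure, whichever per-mon step ends up being used (an offline convert, or flipping kv_type and restarting). Everything here is hypothetical: `rolling_convert` and the four callables are placeholders supplied by the caller, no real Ceph command is invoked, and `start` is assumed to block until the mon has rejoined quorum.

```python
from typing import Callable, Iterable, List


def rolling_convert(
    mons: Iterable[str],
    in_quorum: Callable[[], List[str]],  # hypothetical: names of mons currently in quorum
    stop: Callable[[str], None],         # hypothetical: stop one mon
    convert: Callable[[str], None],      # hypothetical: per-mon conversion step
    start: Callable[[str], None],        # hypothetical: start mon, block until it rejoins
) -> List[str]:
    """Convert mons one at a time, refusing to take one down if the
    remaining mons would no longer form a strict majority."""
    mons = list(mons)
    majority = len(mons) // 2 + 1
    done: List[str] = []
    for mon in mons:
        remaining = [m for m in in_quorum() if m != mon]
        if len(remaining) < majority:
            raise RuntimeError(
                f"stopping {mon} would leave {len(remaining)} mons in quorum, "
                f"need {majority}"
            )
        stop(mon)
        # If convert() raises, the mon is deliberately left stopped so an
        # operator can inspect the store before anything restarts it.
        convert(mon)
        start(mon)
        done.append(mon)
    return done
```

The quorum check runs before every stop, so a mon that failed to come back after its own conversion blocks the next one from being taken down.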

sage

> > I guess we should be more explicit about goals here though: I've just
> > realized that while so far rocksdb conversions are not mandatory, we
> > may want to drop support for leveldb in a future release? In which
> > case some kind of automated conversion is definitely a higher priority
> > — in the past, the only users of functionality like this have been
> > those with clusters old enough to be on leveldb and large-scale enough
> > that rocksdb is necessary to handle the compaction
> > stress.
> 
> Hadn't thought of the "drop support for leveldb" argument, but that's
> something that could be in the cards I suppose.
> 
> As for my initial motivation, it mostly stems from deployments that see
> their stores grow abnormally, and at some point I find myself
> contemplating recreating them with rocksdb. It's never a fun experience
> for anyone. *sigh*
> 
>   -Joao