Re: mon upgrades and leveldb->rocksdb conversion

Gregory Farnum <gfarnum@xxxxxxxxxx> · Thu, 11 Oct 2018 11:00:45 -0700

On Thu, Oct 11, 2018 at 7:11 AM Joao Eduardo Luis <joao@xxxxxxx> wrote:
>
> On 10/11/2018 02:58 PM, Sage Weil wrote:
> >
> > I'm worried the above is a lot of complexity and opportunity for bugs
> > (and work to implement) for not a lot of gain.  What if we instead make
> > ceph-monstore-tool have a 'convert' function that will do a conversion
> > offline?  The admin can take each mon down in turn, convert it, and bring
> > it back up.  Provisioning tools could automate this process.
> >
> > This will require ~2x the disk space for the conversion.  OTOH, if space
> > is tight, the user can also just blow away the mon entirely and recreate
> > it, and let the normal sync bring it back into quorum...
>
> The problem with both approaches is that, during this period, the quorum
> is degraded.
>
> We can argue that the way to prevent that is to add a new monitor, let
> it sync, and then remove an old mon, but we may not have spare hardware
> to make this work.
>
> I do agree that this would be a complex solution for something that
> would be used 3, maybe 5 times in the lifespan of a cluster, but this is
> also the sort of thing that shouldn't make the user jump through hoops
> to accomplish.

Seems to me that if your cluster is in that much danger from a
degraded mon cluster, you've designed your mon cluster failure
tolerances badly?
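
To put rough numbers on that, here is a minimal sketch, assuming the
usual strict-majority quorum rule for the mons. The function names are
made up for illustration and are not part of any Ceph tool.

```python
def quorum_size(n_mons: int) -> int:
    """Smallest strict majority of n_mons monitors."""
    return n_mons // 2 + 1


def failures_tolerated(n_mons: int, n_down: int = 0) -> int:
    """Further mon failures survivable while n_down mons are already offline.

    A negative result means quorum is already lost.
    """
    return (n_mons - n_down) - quorum_size(n_mons)


if __name__ == "__main__":
    # Compare normal operation with one mon taken down for conversion.
    for n in (3, 5):
        print(
            f"{n} mons: quorum={quorum_size(n)}, "
            f"tolerated normally={failures_tolerated(n)}, "
            f"tolerated during conversion={failures_tolerated(n, n_down=1)}"
        )
```

Under that assumption, a 3-mon cluster has no slack left while one mon
is offline for conversion, whereas a 5-mon cluster can still lose one
more.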

I guess we should be more explicit about goals here though: I've just
realized that while so far rocksdb conversions are not mandatory, we
may want to drop support for leveldb in a future release? In which
case some kind of automated conversion is definitely a higher priority.
In the past, the only users of functionality like this have been
those with clusters old enough to be on leveldb and large enough that
rocksdb is necessary to handle the compaction stress.
-Greg