On 10/11/2018 07:00 PM, Gregory Farnum wrote:
> On Thu, Oct 11, 2018 at 7:11 AM Joao Eduardo Luis <joao@xxxxxxx> wrote:
>>
>> On 10/11/2018 02:58 PM, Sage Weil wrote:
>>>
>>> I'm worried the above is a lot of complexity and opportunity for bugs
>>> (and work to implement) for not a lot of gain. What if we instead make
>>> ceph-monstore-tool have a 'convert' function that will do a conversion
>>> offline? The admin can take each mon down in turn, convert it, and bring
>>> it back up. Provisioning tools could automate this process.
>>>
>>> This will require ~2x the disk space for the conversion. OTOH, if space
>>> is tight, the user can also just blow away the mon entirely and create
>>> it, and let the normal sync bring it back into quorum...
>>
>> The problem with both approaches is that, during this period, the quorum
>> is degraded.
>>
>> We can argue that the way to prevent that is to add a new monitor, let
>> it sync, and then remove an old mon, but we may not have spare hardware
>> to make this work.
>>
>> I do agree that this would be a complex solution for something that
>> would be used 3, maybe 5 times in the lifespan of a cluster, but this is
>> also the sort of thing that shouldn't make the user jump through hoops
>> to accomplish.
>
> Seems to me that if your cluster is in that much danger from a
> degraded mon cluster, you've designed your mon cluster failure
> tolerances badly?

Well, I don't think it's that uncommon for clusters to be running with 3
monitors. Drop one for offline conversion, and we can't tolerate a
single failure without loss of quorum. It only takes a bit of bad luck
for something this trivial to become a bad day for someone.

> I guess we should be more explicit about goals here though: I've just
> realized that while so far rocksdb conversions are not mandatory, we
> may want to drop support for leveldb in a future release?
> In which case some kind of automated conversion is definitely a higher
> priority; in the past, the only users of functionality like this have
> been those with clusters old enough to be on leveldb and large-scale
> enough that rocksdb is necessary to handle the compaction stress.

Hadn't thought of the "drop support for leveldb" argument, but that's
something that could be in the cards I suppose.

As for my initial motivation, it mostly stems from deployments that see
their stores grow abnormally, and at some point I find myself
contemplating recreating them with rocksdb. It's never a fun experience
for anyone. *sigh*

  -Joao
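
The 3-monitor concern raised in the reply above comes down to
majority-quorum arithmetic. A minimal sketch, assuming the usual
strict-majority rule (more than half of the configured monitors must be
up); the function names are hypothetical and this is not Ceph code:

```python
def quorum_size(total_mons: int) -> int:
    """Smallest number of monitors that forms a strict majority."""
    return total_mons // 2 + 1


def failures_tolerated(total_mons: int, offline_for_conversion: int = 0) -> int:
    """Unplanned failures the cluster can absorb while some monitors
    are deliberately down (e.g. for an offline store conversion).

    Returns 0 both when there is no headroom left and when quorum is
    already lost.
    """
    up = total_mons - offline_for_conversion
    return max(up - quorum_size(total_mons), 0)


if __name__ == "__main__":
    for n in (3, 5):
        print(
            f"{n} mons: tolerates {failures_tolerated(n)} failure(s) normally, "
            f"{failures_tolerated(n, 1)} with one mon down for conversion"
        )
```

Under that assumption, a 3-monitor cluster has no failure headroom at
all while one monitor is offline, whereas a 5-monitor cluster can still
lose one more.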