Re: storing pg logs outside of rocksdb

Sage Weil <sweil@xxxxxxxxxx> · Thu, 29 Mar 2018 20:04:51 +0000 (UTC)

On Wed, 28 Mar 2018, Matt Benjamin wrote:
> On Wed, Mar 28, 2018 at 1:44 PM, Mark Nelson <mnelson@xxxxxxxxxx> wrote:
> > On 03/28/2018 12:21 PM, Adam C. Emerson wrote:
> >
> > 2) It sure feels like conceptually the pglog should be represented as a
> > per-pg ring buffer rather than key/value data.  Maybe there are really
> > important reasons that it shouldn't be, but I don't currently see them.  As
> > far as the objectstore is concerned, it seems to me like there are valid
> > reasons to provide some kind of log interface and perhaps that should be
> > used for pg_log.  That sort of opens the door for different object store
> > implementations fulfilling that functionality in whatever ways the author
> > deems fit.
> 
> In the reddit lingo, pretty much this.  We should be concentrating on
> this direction, or ruling it out.

Yeah, +1

It seems like step 1 is a proof of concept branch that encodes 
pg_log_entry_t's and writes them to a simple ring buffer.  The first 
questions to answer are (a) whether this does in fact improve things 
significantly and (b) whether we want to have an independent ring buffer 
for each PG or try to mix them into one big one for the whole OSD (or 
maybe per shard).
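
To make the idea concrete, here is a toy sketch of the kind of ring buffer 
such a branch might start from.  It is Python for brevity, not Ceph code: 
RingLog, its 4-byte length prefix, and the in-memory bytearray standing in 
for the device are all assumptions for illustration, and the payloads are 
just opaque bytes where encoded pg_log_entry_t's would go.

```python
import struct


class RingLog:
    """Toy fixed-size ring buffer of length-prefixed opaque entries."""

    HDR = struct.Struct("<I")  # hypothetical 4-byte little-endian length prefix

    def __init__(self, capacity):
        self.buf = bytearray(capacity)  # stand-in for a region on disk
        self.capacity = capacity
        self.head = 0  # offset of the oldest entry
        self.tail = 0  # offset where the next entry is written
        self.used = 0  # bytes currently occupied

    def _write(self, pos, data):
        # Copy data into the buffer, wrapping around at the end.
        first = min(len(data), self.capacity - pos)
        self.buf[pos:pos + first] = data[:first]
        self.buf[0:len(data) - first] = data[first:]
        return (pos + len(data)) % self.capacity

    def _read(self, pos, n):
        # Read n bytes starting at pos, wrapping around at the end.
        first = min(n, self.capacity - pos)
        return bytes(self.buf[pos:pos + first] + self.buf[0:n - first])

    def append(self, payload):
        need = self.HDR.size + len(payload)
        if need > self.capacity - self.used:
            raise BufferError("log full; trim first")
        self.tail = self._write(self.tail, self.HDR.pack(len(payload)) + payload)
        self.used += need

    def trim_oldest(self):
        if self.used == 0:
            raise IndexError("log is empty")
        (n,) = self.HDR.unpack(self._read(self.head, self.HDR.size))
        payload = self._read((self.head + self.HDR.size) % self.capacity, n)
        self.head = (self.head + self.HDR.size + n) % self.capacity
        self.used -= self.HDR.size + n
        return payload
```

For (b), the per-PG variant would be one such object per PG; the shared 
variant would be a single one per OSD (or shard) with the pgid carried in 
each entry, at which point trimming one PG no longer just advances head.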

The second question is how that fares on HDDs.  My guess is that the 
current rocksdb strategy is better because it reduces the number of IOs 
and the additional data getting compacted (and CPU usage) isn't the 
limiting factor on HDD performance (IOPS are).  (But maybe we'll get lucky 
and the new strategy will be best for both HDD and SSD..)

Then we have to modify PGLog to be a complete implementation.  A strict 
ring buffer probably won't work because the PG log might not trim and 
because log entries are variable length, so there'll probably need to be 
some simple mapping table (vs a trivial start/end ring buffer position) to 
deal with that.  We have to trim the log periodically, so every so many 
entries we may want to realign with a min_alloc_size boundary.  We 
sometimes have to back up and rewrite divergent portions of the log (during 
peering) so we'll need to sort out whether that is a complete 
reencode/rewrite or whether we keep encoded entries in ram (individually 
or in chunks), etc etc.
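
Very roughly, the mapping-table variant might look like the sketch below.  
Again this is toy Python, not a proposal for the actual structure: 
IndexedLog, MIN_ALLOC_SIZE = 4096, REALIGN_EVERY = 4, and the integer 
versions are all made up for illustration, and a dict stands in for the 
disk.  It takes the "complete rewrite" side of the divergent-entry question 
by simply dropping everything after the rewind point.

```python
MIN_ALLOC_SIZE = 4096  # assumed allocation unit
REALIGN_EVERY = 4      # assumed: start on a fresh boundary every N entries


class IndexedLog:
    """Toy log of variable-length entries located via a mapping table."""

    def __init__(self):
        self.index = {}    # version -> (offset, length, seq)
        self.data = {}     # offset -> encoded bytes (stand-in for disk)
        self.end = 0       # next write offset
        self.next_seq = 0  # position of the next entry in the append order

    def append(self, version, encoded):
        if not encoded:
            raise ValueError("entries must be non-empty")
        if self.index and version <= max(self.index):
            raise ValueError("versions must increase")
        if self.next_seq and self.next_seq % REALIGN_EVERY == 0:
            # Round the write offset up to the next allocation boundary.
            self.end = -(-self.end // MIN_ALLOC_SIZE) * MIN_ALLOC_SIZE
        self.index[version] = (self.end, len(encoded), self.next_seq)
        self.data[self.end] = encoded
        self.end += len(encoded)
        self.next_seq += 1

    def read(self, version):
        offset, _, _ = self.index[version]
        return self.data[offset]

    def trim(self, up_to):
        """Drop entries <= up_to; return the offset below which space is free."""
        for v in [v for v in self.index if v <= up_to]:
            offset, _, _ = self.index.pop(v)
            del self.data[offset]
        low = min((off for off, _, _ in self.index.values()), default=self.end)
        return low // MIN_ALLOC_SIZE * MIN_ALLOC_SIZE

    def rewind(self, to_version):
        """Drop divergent entries > to_version so they can be rewritten."""
        dropped = [v for v in self.index if v > to_version]
        for v in dropped:
            offset, _, seq = self.index.pop(v)
            del self.data[offset]
            self.end = min(self.end, offset)
            self.next_seq = min(self.next_seq, seq)
        return dropped
```

The periodic realignment is what lets trim() hand back whole 
min_alloc_size units instead of leaving a partially live block behind.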

sage