Re: storing pg logs outside of rocksdb

On 03/29/2018 01:04 PM, Sage Weil wrote:
On Wed, 28 Mar 2018, Matt Benjamin wrote:
On Wed, Mar 28, 2018 at 1:44 PM, Mark Nelson <mnelson@xxxxxxxxxx> wrote:
On 03/28/2018 12:21 PM, Adam C. Emerson wrote:

2) It sure feels like conceptually the pglog should be represented as a
per-pg ring buffer rather than key/value data.  Maybe there are really
important reasons that it shouldn't be, but I don't currently see them.  As
far as the objectstore is concerned, it seems to me like there are valid
reasons to provide some kind of log interface and perhaps that should be
used for pg_log.  That sort of opens the door for different object store
implementations fulfilling that functionality in whatever ways the author
deems fit.
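
(As a strawman, such an ObjectStore-level log interface might look
something like the sketch below; PGLogStore, LogEntryBlob, and the
method names are made up purely for illustration, not anything that
exists today.)

#include <cstdint>
#include <string>
#include <vector>

// Hypothetical per-PG log interface on the object store.  Each
// backend decides how appended entries land on disk (omap keys,
// a ring buffer, ...).
struct LogEntryBlob {
  uint64_t version;            // pg log version of the entry
  std::vector<uint8_t> bytes;  // already-encoded pg_log_entry_t
};

class PGLogStore {
public:
  virtual ~PGLogStore() = default;
  // Append encoded entries for one PG's log.
  virtual int append(const std::string& pgid,
                     const std::vector<LogEntryBlob>& entries) = 0;
  // Read the whole log back (done once when the PG is loaded).
  virtual int read_all(const std::string& pgid,
                       std::vector<LogEntryBlob>* out) = 0;
  // Drop everything at or below `version` (log trimming).
  virtual int trim_to(const std::string& pgid, uint64_t version) = 0;
};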

In the reddit lingo, pretty much this.  We should be concentrating on
this direction, or ruling it out.

Yeah, +1

It seems like step 1 is a proof of concept branch that encodes
pg_log_entry_t's and writes them to a simple ring buffer.  The first
questions to answer are (a) whether this does in fact improve things
significantly and (b) whether we want to have an independent ring buffer
for each PG or try to mix them into one big one for the whole OSD (or
maybe per shard).
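
(To make the proof of concept concrete: a toy, in-memory stand-in for
that ring buffer could look like the sketch below.  It assumes
length-prefixed records in a fixed-size region and ignores durability
and trimming; nothing here is real bluestore code.)

#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

class SimpleRingLog {
  std::vector<uint8_t> buf;  // stands in for a raw on-disk region
  uint64_t head = 0;         // next write offset (monotonic)
  uint64_t tail = 0;         // oldest live byte; trimming would advance this
public:
  explicit SimpleRingLog(size_t size) : buf(size) {}

  // Append one already-encoded pg_log_entry_t blob, prefixed with its
  // length so variable-sized entries can be read back later.
  void append(const std::vector<uint8_t>& encoded) {
    uint32_t len = static_cast<uint32_t>(encoded.size());
    if (sizeof(len) + len > free_space())
      throw std::runtime_error("ring full; trim first");
    write_bytes(reinterpret_cast<const uint8_t*>(&len), sizeof(len));
    write_bytes(encoded.data(), len);
  }

  uint64_t free_space() const { return buf.size() - (head - tail); }

private:
  void write_bytes(const uint8_t* p, size_t n) {
    for (size_t i = 0; i < n; ++i)
      buf[(head + i) % buf.size()] = p[i];
    head += n;
  }
};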

The second question is how that fares on HDDs.  My guess is that the
current rocksdb strategy is better because it reduces the number of IOs
and the additional data getting compacted (and CPU usage) isn't the
limiting factor on HDD performance (IOPS are).  (But maybe we'll get lucky
and the new strategy will be best for both HDD and SSD..)

This is what we discussed in the perf call today. It seems like keeping
an omap-based implementation for HDD, for seek-optimization, makes
sense. We could move the current read/write PGLog logic into a new
ObjectStore, and then bluestore could use its own SSD-optimized
implementation when on SSD, while HDD and FileStore keep the old logic.
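
(In other words the backend choice would just be keyed off the device
class, something like the trivial sketch below; the names are
hypothetical.)

// Pick the pg log encoding per device class: rotational media keeps
// the current omap path, flash gets the new layout.
enum class PGLogBackend { Omap, RingBuffer };

PGLogBackend choose_pglog_backend(bool device_is_rotational) {
  return device_is_rotational ? PGLogBackend::Omap
                              : PGLogBackend::RingBuffer;
}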

I think there's agreement that we need to rewrite the PGLog disk
encoding in terms of a new non-key-value interface, though what that
interface looks like isn't exactly clear yet. The more important
question in my mind is how to do this most efficiently in bluestore on
SSD.

Then we have to modify PGLog to be a complete implementation.  A strict
ring buffer probably won't work because the PG log might not trim and
because log entries are variable length, so there'll probably need to be
some simple mapping table (vs a trivial start/end ring buffer position) to
deal with that.  We have to trim the log periodically, so every so many
entries we may want to realign with a min_alloc_size boundary.  We
sometimes have to back up and rewrite divergent portions of the log (during
peering) so we'll need to sort out whether that is a complete
reencode/rewrite or whether we keep encoded entries in ram (individually
or in chunks), etc etc.
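
(A rough sketch of what such a mapping table could look like, with
hypothetical names and the actual on-disk I/O left out: it tracks a
version-to-extent index, can rewind past a divergent tail so it gets
rewritten, and realigns the write position to a min_alloc_size
boundary after a trim.)

#include <cstdint>
#include <map>

struct Extent { uint64_t off; uint32_t len; };

class MappedPGLog {
  std::map<uint64_t, Extent> index;  // log version -> on-disk extent
  uint64_t write_pos = 0;
  uint64_t min_alloc_size;
public:
  explicit MappedPGLog(uint64_t alloc) : min_alloc_size(alloc) {}

  // Record where an entry of `len` encoded bytes was just written.
  void note_append(uint64_t version, uint32_t len) {
    index[version] = Extent{write_pos, len};
    write_pos += len;
  }

  // Peering found a divergent tail: forget every entry newer than
  // `version` and rewind the write position so they get rewritten.
  void rewind_to(uint64_t version) {
    auto it = index.upper_bound(version);
    if (it != index.end())
      write_pos = it->second.off;
    index.erase(it, index.end());
  }

  // After trimming old entries, snap the next write to an
  // allocation-unit boundary so whole units can be reclaimed.
  void realign_after_trim() {
    write_pos = (write_pos + min_alloc_size - 1)
                / min_alloc_size * min_alloc_size;
  }
};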

Yes, I brought this up too - rewriting the whole thing is fine for
prototyping and for finding the best non-peering performance, but for
the larger logs we'll want on faster devices we'll need to do smaller
overwrites. Hence, the interface can't be a strict FIFO.

I'm not sure we need a mapping table on-disk though - we read in the
entire log into memory at start up, and could generate an in-memory
mapping of on-disk offsets at that point. We could also enforce
an upper limit on log entry size and add padding to simplify things.
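
(For illustration, rebuilding that in-memory map during the startup
scan could look like the sketch below.  MAX_ENTRY_SIZE, PAD, and the
length-prefixed layout are assumptions for the sketch, not an
agreed-on format; real code would also decode each entry's version
out of its bytes.)

#include <cstdint>
#include <cstring>
#include <vector>

static constexpr uint32_t MAX_ENTRY_SIZE = 4096;  // hypothetical cap
static constexpr uint32_t PAD = 256;              // hypothetical alignment

// Walk the raw log region once and return the offset of every entry
// found, stopping at the first slot that doesn't look like a record.
std::vector<uint64_t> scan_log_offsets(const std::vector<uint8_t>& region) {
  std::vector<uint64_t> offsets;
  uint64_t pos = 0;
  while (pos + sizeof(uint32_t) <= region.size()) {
    uint32_t len;
    std::memcpy(&len, region.data() + pos, sizeof(len));
    if (len == 0 || len > MAX_ENTRY_SIZE ||
        pos + sizeof(len) + len > region.size())
      break;  // hit unwritten or truncated space
    offsets.push_back(pos);
    pos += sizeof(len) + len;
    pos = (pos + PAD - 1) / PAD * PAD;  // entries are padded for overwrite
  }
  return offsets;
}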

Another aspect that hasn't come up yet is keeping a strict limit on the
log size, so that we can keep a bounded ring buffer instead of letting
it grow and incurring extra overhead during recovery/backfill.
Right now we set min_last_complete_ondisk based on the
acting_recovery_backfill set, so we end up not trimming the log during
backfill and async recovery.

Is there any reason not to trim the logs on the acting set at least? The
async recovery and backfill shards need the longer log to stay
contiguous with the acting set/not restart backfill on interval change,
but among the acting set I don't see what issues this would cause.

Josh


