On Thu, Mar 29, 2018 at 2:08 PM, Josh Durgin <jdurgin@xxxxxxxxxx> wrote:
> On 03/29/2018 01:04 PM, Sage Weil wrote:
>>
>> On Wed, 28 Mar 2018, Matt Benjamin wrote:
>>>
>>> On Wed, Mar 28, 2018 at 1:44 PM, Mark Nelson <mnelson@xxxxxxxxxx> wrote:
>>>>
>>>> On 03/28/2018 12:21 PM, Adam C. Emerson wrote:
>>>>
>>>> 2) It sure feels like conceptually the pglog should be represented
>>>> as a per-pg ring buffer rather than key/value data. Maybe there are
>>>> really important reasons that it shouldn't be, but I don't currently
>>>> see them. As far as the objectstore is concerned, it seems to me
>>>> like there are valid reasons to provide some kind of log interface,
>>>> and perhaps that should be used for pg_log. That sort of opens the
>>>> door for different object store implementations fulfilling that
>>>> functionality in whatever ways the author deems fit.
>>>
>>> In the reddit lingo, pretty much this. We should be concentrating on
>>> this direction, or ruling it out.
>>
>> Yeah, +1
>>
>> It seems like step 1 is a proof-of-concept branch that encodes
>> pg_log_entry_t's and writes them to a simple ring buffer. The first
>> questions to answer are (a) whether this does in fact improve things
>> significantly and (b) whether we want to have an independent ring
>> buffer for each PG or try to mix them into one big one for the whole
>> OSD (or maybe per shard).
>>
>> The second question is how that fares on HDDs. My guess is that the
>> current rocksdb strategy is better because it reduces the number of
>> IOs, and the additional data getting compacted (and CPU usage) isn't
>> the limiting factor on HDD performance (IOPS are). (But maybe we'll
>> get lucky and the new strategy will be best for both HDD and SSD..)
>
> This is what we discussed in the perf call today. It seems like keeping
> an omap-based implementation for HDD, for seek optimization, makes
> sense. We could move the current read/write PGLog logic into a new
> ObjectStore interface, and then bluestore could use its own
> SSD-optimized implementation when on SSD, while HDD and FileStore keep
> the old logic.
>
> I think there's agreement that we need to rewrite the PGLog disk
> encoding in terms of a new non-key-value interface, though what that
> interface looks like isn't exactly clear yet. The more important
> question in my mind is how to do this most efficiently in bluestore on
> SSD.
>
>> Then we have to modify PGLog to be a complete implementation. A strict
>> ring buffer probably won't work because the PG log might not trim and
>> because log entries are variable length, so there'll probably need to
>> be some simple mapping table (vs. a trivial start/end ring buffer
>> position) to deal with that. We have to trim the log periodically, so
>> every so many entries we may want to realign with a min_alloc_size
>> boundary. We sometimes have to back up and rewrite divergent portions
>> of the log (during peering), so we'll need to sort out whether that is
>> a complete re-encode/rewrite or whether we keep encoded entries in RAM
>> (individually or in chunks), etc.
>
> Yes, I brought this up too - rewriting the whole thing is fine for
> prototyping and for finding the best non-peering performance, but for
> the larger logs we'll want on faster devices, we'll need to do some
> smaller overwrites. Hence, the interface can't be a strict FIFO.
>
> I'm not sure we need a mapping table on disk, though - we read the
> entire log into memory at startup, and could generate an in-memory
> mapping of on-disk offsets at that point. We could also enforce an
> upper limit on log entry size and add padding to simplify things.
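A minimal sketch of the direction discussed above, for concreteness (this
is not Ceph code: the class name, the length-prefix encoding, and the
zero-length end marker are all invented for illustration). It shows
variable-size encoded entries appended to a flat per-PG region, an offset
index that lives only in memory and is rebuilt by a scan at startup, and a
crude rewind for rewriting a divergent tail during peering. Trimming,
padding/alignment to min_alloc_size, and real I/O are left out.

  // Sketch only, not Ceph code: names and on-disk format are invented
  // for illustration.
  #include <cstddef>
  #include <cstdint>
  #include <cstring>
  #include <string>
  #include <vector>

  class LogRegion {
  public:
    explicit LogRegion(size_t bytes) : buf_(bytes, 0) {}

    // Append one encoded entry (a stand-in for an encoded
    // pg_log_entry_t); returns false when the region is full; a real
    // implementation would trim or wrap here.
    bool append(const std::string& encoded) {
      size_t need = sizeof(uint32_t) + encoded.size();
      if (tail_ + need + sizeof(uint32_t) > buf_.size())
        return false;
      uint32_t len = static_cast<uint32_t>(encoded.size());
      std::memcpy(&buf_[tail_], &len, sizeof(len));
      std::memcpy(&buf_[tail_ + sizeof(len)], encoded.data(), len);
      index_.push_back(tail_);                         // in-memory only
      tail_ += need;
      std::memset(&buf_[tail_], 0, sizeof(uint32_t));  // end marker
      return true;
    }

    // Startup path: scan the region and regenerate the offset index,
    // so no mapping table needs to be persisted.
    void rebuild_index() {
      index_.clear();
      size_t off = 0;
      while (off + sizeof(uint32_t) <= buf_.size()) {
        uint32_t len;
        std::memcpy(&len, &buf_[off], sizeof(len));
        if (len == 0 || off + sizeof(len) + len > buf_.size())
          break;                          // zero length == end of log
        index_.push_back(off);
        off += sizeof(len) + len;
      }
      tail_ = off;
    }

    // Peering may need to rewrite a divergent tail: drop every entry
    // from position i on and let later appends overwrite them.
    void rewind_to(size_t i) {
      tail_ = index_.at(i);
      index_.resize(i);
      std::memset(&buf_[tail_], 0, sizeof(uint32_t));  // new end marker
    }

    size_t size() const { return index_.size(); }

  private:
    std::vector<uint8_t> buf_;   // stand-in for the on-disk region
    std::vector<size_t> index_;  // entry offsets, rebuilt at startup
    size_t tail_ = 0;            // next write position
  };

Whether something along these lines actually beats the current omap path,
and on which media, is exactly what the proof-of-concept branch Sage
describes would need to show.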
>
> Another aspect that hasn't come up yet is keeping a strict limit on the
> log size, so that we can keep a bounded ring buffer instead of growing
> it extensively and incurring extra overhead during recovery/backfill.
> Right now we set min_last_complete_ondisk based on the
> acting_recovery_backfill set, so we end up not trimming the log during
> backfill and async recovery.
>
> Is there any reason not to trim the logs on the acting set at least?
> The async recovery and backfill shards need the longer log to stay
> contiguous with the acting set and to avoid restarting backfill on
> interval change, but among the acting set I don't see what issues this
> would cause.

If we do that, any node which goes down temporarily has a good chance of
no longer being contiguous, and having to shift from recovery to
backfill, or restart backfill.
-Greg
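To make the trimming question concrete, a toy illustration (hypothetical
names, not the actual PGLog/PeeringState code): the point up to which the
log can be trimmed is simply the minimum completed version across
whichever shard set we choose to respect, so the trade-off is which
shards get a say.

  // Toy only: version_t and trim_upper_bound are invented for this
  // example, not taken from the Ceph source.
  #include <algorithm>
  #include <cstdint>
  #include <vector>

  using version_t = uint64_t;

  // The log may be trimmed up to, but not past, the lowest
  // last-complete version among the shards we decide to wait for.
  version_t trim_upper_bound(const std::vector<version_t>& last_complete) {
    if (last_complete.empty())
      return 0;
    return *std::min_element(last_complete.begin(), last_complete.end());
  }

For example, with the acting set at versions 120, 118, and 119 and an
async-recovery shard still at 40: including all four shards pins trimming
at 40 and the log keeps growing through recovery (today's behaviour, as
Josh describes); restricting it to the acting set allows trimming to 118,
at which point the lagging shard may no longer be contiguous with the log
and has to fall back to backfill, which is Greg's objection.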