On Thu, Mar 29, 2018 at 2:08 PM, Josh Durgin <jdurgin@xxxxxxxxxx> wrote:
> On 03/29/2018 01:04 PM, Sage Weil wrote:
>>
>> On Wed, 28 Mar 2018, Matt Benjamin wrote:
>>>
>>> On Wed, Mar 28, 2018 at 1:44 PM, Mark Nelson <mnelson@xxxxxxxxxx> wrote:
>>>>
>>>> On 03/28/2018 12:21 PM, Adam C. Emerson wrote:
>>>>
>>>> 2) It sure feels like conceptually the pglog should be represented
>>>> as a per-pg ring buffer rather than key/value data. Maybe there are
>>>> really important reasons that it shouldn't be, but I don't currently
>>>> see them. As far as the objectstore is concerned, it seems to me
>>>> like there are valid reasons to provide some kind of log interface,
>>>> and perhaps that should be used for pg_log. That sort of opens the
>>>> door for different object store implementations fulfilling that
>>>> functionality in whatever ways the author deems fit.
>>>
>>> In the reddit lingo, pretty much this. We should be concentrating on
>>> this direction, or ruling it out.
>>
>> Yeah, +1
>>
>> It seems like step 1 is a proof-of-concept branch that encodes
>> pg_log_entry_t's and writes them to a simple ring buffer. The first
>> questions to answer are (a) whether this does in fact improve things
>> significantly and (b) whether we want to have an independent ring
>> buffer for each PG or try to mix them into one big one for the whole
>> OSD (or maybe per shard).
>>
>> The second question is how that fares on HDDs. My guess is that the
>> current rocksdb strategy is better because it reduces the number of
>> IOs, and the additional data getting compacted (and CPU usage) isn't
>> the limiting factor on HDD performance (IOPS are). (But maybe we'll
>> get lucky and the new strategy will be best for both HDD and SSD..)
>
> This is what we discussed in the perf call today. It seems like keeping
> an omap-based implementation for HDD, for seek optimization, makes
> sense. We could move the current read/write PGLog logic into a new
> ObjectStore interface, and then bluestore could use its own
> SSD-optimized implementation when on SSD, while HDD and FileStore keep
> the old logic.
>
> I think there's agreement that we need to rewrite the PGLog disk
> encoding in terms of a new non-key-value interface, though what that
> interface looks like isn't exactly clear yet. The more important
> question in my mind is how to do this most efficiently in bluestore on
> SSD.
>
>> Then we have to modify PGLog to be a complete implementation. A strict
>> ring buffer probably won't work because the PG log might not trim and
>> because log entries are variable length, so there'll probably need to
>> be some simple mapping table (vs. a trivial start/end ring buffer
>> position) to deal with that. We have to trim the log periodically, so
>> every so many entries we may want to realign with a min_alloc_size
>> boundary. We sometimes have to back up and rewrite divergent portions
>> of the log (during peering), so we'll need to sort out whether that is
>> a complete re-encode/rewrite or whether we keep encoded entries in RAM
>> (individually or in chunks), etc.
>
> Yes, I brought this up too - rewriting the whole thing is fine for
> prototyping and for finding the best non-peering performance, but for
> the larger logs we'll want on faster devices, we'll need to do some
> smaller overwrites. Hence, the interface can't be a strict FIFO.
>
> I'm not sure we need a mapping table on disk, though - we read the
> entire log into memory at startup, and could generate an in-memory
> mapping of on-disk offsets at that point. We could also enforce an
> upper limit on log entry size and add padding to simplify things.
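A minimal sketch of the direction discussed above, for concreteness (this
is not Ceph code: the class name, the length-prefix encoding, and the
zero-length end marker are all invented for illustration). It shows
variable-size encoded entries appended to a flat per-PG region, an offset
index that lives only in memory and is rebuilt by a scan at startup, and a
crude rewind for rewriting a divergent tail during peering. Trimming,
padding/alignment to min_alloc_size, and real I/O are left out.

  // Sketch only, not Ceph code: names and on-disk format are invented
  // for illustration.
  #include <cstddef>
  #include <cstdint>
  #include <cstring>
  #include <string>
  #include <vector>

  class LogRegion {
  public:
    explicit LogRegion(size_t bytes) : buf_(bytes, 0) {}

    // Append one encoded entry (a stand-in for an encoded
    // pg_log_entry_t); returns false when the region is full; a real
    // implementation would trim or wrap here.
    bool append(const std::string& encoded) {
      size_t need = sizeof(uint32_t) + encoded.size();
      if (tail_ + need + sizeof(uint32_t) > buf_.size())
        return false;
      uint32_t len = static_cast<uint32_t>(encoded.size());
      std::memcpy(&buf_[tail_], &len, sizeof(len));
      std::memcpy(&buf_[tail_ + sizeof(len)], encoded.data(), len);
      index_.push_back(tail_);                         // in-memory only
      tail_ += need;
      std::memset(&buf_[tail_], 0, sizeof(uint32_t));  // end marker
      return true;
    }

    // Startup path: scan the region and regenerate the offset index,
    // so no mapping table needs to be persisted.
    void rebuild_index() {
      index_.clear();
      size_t off = 0;
      while (off + sizeof(uint32_t) <= buf_.size()) {
        uint32_t len;
        std::memcpy(&len, &buf_[off], sizeof(len));
        if (len == 0 || off + sizeof(len) + len > buf_.size())
          break;                          // zero length == end of log
        index_.push_back(off);
        off += sizeof(len) + len;
      }
      tail_ = off;
    }

    // Peering may need to rewrite a divergent tail: drop every entry
    // from position i on and let later appends overwrite them.
    void rewind_to(size_t i) {
      tail_ = index_.at(i);
      index_.resize(i);
      std::memset(&buf_[tail_], 0, sizeof(uint32_t));  // new end marker
    }

    size_t size() const { return index_.size(); }

  private:
    std::vector<uint8_t> buf_;   // stand-in for the on-disk region
    std::vector<size_t> index_;  // entry offsets, rebuilt at startup
    size_t tail_ = 0;            // next write position
  };

Whether something along these lines actually beats the current omap path,
and on which media, is exactly what the proof-of-concept branch Sage
describes would need to show.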
>
> Another aspect that hasn't come up yet is keeping a strict limit on the
> log size, so that we can keep a bounded ring buffer instead of growing
> it extensively and incurring extra overhead during recovery/backfill.
> Right now we set min_last_complete_ondisk based on the
> acting_recovery_backfill set, so we end up not trimming the log during
> backfill and async recovery.
>
> Is there any reason not to trim the logs on the acting set at least?
> The async recovery and backfill shards need the longer log to stay
> contiguous with the acting set and to avoid restarting backfill on
> interval change, but among the acting set I don't see what issues this
> would cause.

If we do that, any node which goes down temporarily has a good chance of
no longer being contiguous, and having to shift from recovery to
backfill, or restart backfill.
-Greg
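To make the trimming question concrete, a toy illustration (hypothetical
names, not the actual PGLog/PeeringState code): the point up to which the
log can be trimmed is simply the minimum completed version across
whichever shard set we choose to respect, so the trade-off is which
shards get a say.

  // Toy only: version_t and trim_upper_bound are invented for this
  // example, not taken from the Ceph source.
  #include <algorithm>
  #include <cstdint>
  #include <vector>

  using version_t = uint64_t;

  // The log may be trimmed up to, but not past, the lowest
  // last-complete version among the shards we decide to wait for.
  version_t trim_upper_bound(const std::vector<version_t>& last_complete) {
    if (last_complete.empty())
      return 0;
    return *std::min_element(last_complete.begin(), last_complete.end());
  }

For example, with the acting set at versions 120, 118, and 119 and an
async-recovery shard still at 40: including all four shards pins trimming
at 40 and the log keeps growing through recovery (today's behaviour, as
Josh describes); restricting it to the acting set allows trimming to 118,
at which point the lagging shard may no longer be contiguous with the log
and has to fall back to backfill, which is Greg's objection.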