Adding ceph-devel.

This is a discussion Allen and I were having earlier about what the
future post-BlueStore native SPDK, nvme-focused ObjectStore backend could
look like. The short description is something like "log-structured
btree", the idea being that we would do the cleaning in the software
layer so that the nvme's garbage collection generally has nothing to do.

I would like to start thinking about this sooner rather than later so
that we have a good idea of what we want to build and, hopefully, can
time having a testable backend prototype with the futures-ified OSD IO
path.

sage


On Thu, 21 Dec 2017, Allen Samuels wrote:
> Yes. But you mix the data AND the b-tree into the same log.
>
> Essentially, each ObjectStore transaction generates a single contiguous
> write to the log. Data and metadata are intermixed and formatted such
> that you can re-read the last open stripe on a restart to restore. In
> the log you have data pages, whole metadata pages (b-tree pages) and
> metadata deltas (transactional edits) intermixed such that you can
> always read the stripe from the beginning and discern which is which (as
> well as torn stripe-writes....).
>
> Logically, a write consists of modifying a number of in-memory blocks
> (some data and some metadata). The data written to the log is ALWAYS the
> deltas, thus conceptually after the log write we are left with some
> number of "dirty" memory blocks that can only be reconstructed by going
> back to the last non-delta version and applying the deltas from the log
> since that time. [As a key optimization, whole block writes/deltas leave
> you with a memory block that's NOT dirty once the log is written ;) ]
> When memory is "full", you can bulk-up a log write with some
> un-associated dirty blocks which allows them to be cleaned and
> discarded.
> By writing deltas, you automatically get the "combining"
> behavior (this is a strength of LSM algos) but only on the individual
> B-tree blocks (not the entire tree). In essence, if you have multiple
> transactions on metadata that's within the same B-tree block [a VERY
> frequent situation] this system automatically combines those like what
> you get with Rocks-ish LSM stuff.
>
> You need to limit how long the delta chain for a metadata block becomes
> (to improve restart times by bounding the amount of data that needs to
> be read in to reconstruct the dirty block from its accumulated deltas).
> The cheapest thing to do is to simply ensure that all blocks get flushed
> across a backing store stripe change. That's relatively easy to do by
> regulating the maximum amount of dirty blocks and then simply writing
> all dirty blocks at the start of each stripe (when you close one stripe
> and open up the next one).
>
> Log cleaning/compaction is relatively easy: you just read each stripe
> from front to end, parsing as you go. You'll need to look up each object
> to determine if this is an old version of the object or if it's the
> "live" object. If it's live, you just mark it dirty (in the in-memory
> buffer) and go on; it'll get dumped with the next write (which you might
> have to artificially trigger).
>
> One key issue is how you address blocks. Some schemes use physical
> addresses for blocks; this forces the cleaning activity to dirty
> additional metadata (the pointer to this object), however for data this
> tends to be localized (the pointer to the data is typically written in
> the same stripe). Alternatively, you can create a logical/physical
> address mapping table so that you can move blocks and update this
> micro-table without affecting the logical addresses in the upper level
> metadata blocks.
> However, this does create an additional mapping table
> which needs to be checkpointed/recovered (separate from the other data)
> and is frequently implemented as an in-memory table, costing 6-ish bytes
> of DRAM per page.
>
> Allen Samuels
> R&D Engineering Fellow
>
> Western Digital(r)
> Email: allen.samuels@xxxxxxx
> Office: +1-408-801-7030
> Mobile: +1-408-780-6416
>
>
> -----Original Message-----
> From: Sage Weil [mailto:sweil@xxxxxxxxxx]
> Sent: Thursday, December 21, 2017 9:45 AM
> To: Allen Samuels <Allen.Samuels@xxxxxxx>
> Subject: RE: New ObjectStore
>
> log structured b-tree thing?
>
> On Thu, 21 Dec 2017, Allen Samuels wrote:
>
> > Whew.
> >
> > Stay away from LSM. They optimize for developer time at the expense
> > of run-time resource consumption.
> >
> > -----Original Message-----
> > From: Sage Weil [mailto:sweil@xxxxxxxxxx]
> > Sent: Thursday, December 21, 2017 8:41 AM
> > To: Allen Samuels <Allen.Samuels@xxxxxxx>
> > Subject: Re: New ObjectStore
> >
> > On Thu, 21 Dec 2017, Allen Samuels wrote:
> > > w.r.t. today's discussion about a new ObjectStore. Moving the
> > > fast-paths of the OSD into a reactive framework (SeaStar) is great and
> > > will certainly improve performance significantly for small object I/O.
> > >
> > > However, IMO, using RocksDB (LSM) as a foundational building block is
> > > a severely sub-optimal choice and will likely serve to severely limit
> > > the performance boost that you'll achieve with the re-write.
> > > For large-scale systems, I'm going to make a wild-speculation and
> > > suggest that you won't see any actual throughput improvement from the
> > > re-write, because the write-amp for metadata will end up being the
> > > limiting factor - you'll have PLENTY of CPU that's idle waiting for
> > > your I/O subsystem and end up running at the same speed as today.
> >
> > Yeah, completely agree.. the seastar objectstore (SeaStore for maximum
> > confusion) won't use rocksdb. We're talking about the futures kv
> > interface possibility as a glue layer to allow a mid-term
> > semi-futures-based bluestore. Not sure it'll make sense, but it's an
> > option.
> >
> > So the new thing won't use rocksdb at all. There don't appear to be
> > any existing kv choices, but that's probably a good thing as it'll
> > force us to build metadata capabilities specific to our needs. That
> > might be something LSM-ish, might not, I don't have a very clear
> > picture of this yet (beyond that it should be log structured :).
> >
> > sage
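
The scheme Allen describes in the thread above (per-block delta records
in the log, a flush of all dirty blocks at each stripe change, and
cleaning through a logical/physical mapping table) can be sketched as a
toy in-memory model. This is only an illustration under loud
assumptions: ToyLogStore, Record, stripe_capacity and l2p are
hypothetical names, not Ceph or SeaStore code; a "stripe" is a Python
list, a block is a dict, nothing is written to a device, and torn-write
detection and checkpointing of the mapping table are left out.

```python
from dataclasses import dataclass, field

FULL, DELTA = "full", "delta"


@dataclass
class Record:
    kind: str      # FULL = whole block image, DELTA = key/value edits
    block_id: int  # logical block address (hypothetical scheme)
    payload: dict  # for DELTA, a value of None deletes the key


def apply_delta(image, edits):
    """Apply one delta's edits to a block image in place."""
    for key, value in edits.items():
        if value is None:
            image.pop(key, None)
        else:
            image[key] = value


@dataclass
class ToyLogStore:
    stripe_capacity: int = 4                    # toy value: records per stripe
    stripes: list = field(default_factory=lambda: [[]])
    blocks: dict = field(default_factory=dict)  # in-memory block images
    dirty: set = field(default_factory=set)     # blocks with deltas since last FULL
    l2p: dict = field(default_factory=dict)     # block_id -> (stripe, index) of last FULL

    def _append(self, rec):
        tail = self.stripes[-1]
        tail.append(rec)
        return len(self.stripes) - 1, len(tail) - 1

    def flush_dirty(self):
        """Write whole images of all dirty blocks; they are clean afterwards."""
        for block_id in sorted(self.dirty):
            rec = Record(FULL, block_id, dict(self.blocks[block_id]))
            self.l2p[block_id] = self._append(rec)
        self.dirty.clear()

    def commit(self, edits_by_block):
        """One transaction: append one delta record per touched block."""
        if len(self.stripes[-1]) >= self.stripe_capacity:
            self.stripes.append([])  # stripe change: open the next stripe...
            self.flush_dirty()       # ...and bound every delta chain to one stripe
        for block_id, edits in edits_by_block.items():
            apply_delta(self.blocks.setdefault(block_id, {}), edits)
            self._append(Record(DELTA, block_id, dict(edits)))
            self.dirty.add(block_id)

    def clean(self, stripe_no):
        """Reclaim a closed stripe: live blocks are marked dirty and rewritten."""
        assert stripe_no < len(self.stripes) - 1, "only closed stripes are cleaned"
        for index, rec in enumerate(self.stripes[stripe_no]):
            if rec.kind == FULL and self.l2p.get(rec.block_id) == (stripe_no, index):
                self.dirty.add(rec.block_id)
        self.flush_dirty()            # the artificially triggered write
        self.stripes[stripe_no] = []  # stripe can now be reused

    def reconstruct(self, block_id):
        """Rebuild a block from its last FULL plus later deltas in the open stripe."""
        last = len(self.stripes) - 1
        base = self.l2p.get(block_id)
        image = dict(self.stripes[base[0]][base[1]].payload) if base else {}
        start = base[1] + 1 if base and base[0] == last else 0
        for rec in self.stripes[last][start:]:
            if rec.kind == DELTA and rec.block_id == block_id:
                apply_delta(image, rec.payload)
        return image
```

Because every dirty block is rewritten whole at a stripe change,
reconstruct() only ever has to scan the open stripe for deltas, which is
the restart-time bound discussed above. Cleaning moves a live block by
updating only l2p, so in this model no upper-level metadata has to be
dirtied; with physical addressing the parent pointer would have to be
rewritten as well.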