RE: New ObjectStore

I completely agree with Greg about this being essentially YALSSS (yet another log-structured storage system). And I completely agree that the historical problem with these has been the cost of compaction/garbage collection (hereinafter, GC).

What's different NOW is the choice of media to optimize for -- flash (now) vs. HDD (in the before-time). {Note: SMR HDDs look more like flash, and much of the following applies to them too.} That changes the picture significantly, because the underlying flash media is already log structured. Historically, we've thought about the placement of data by manipulating the address (i.e., by LBA), but that's meaningless in an FTL (flash-translation-layer) world. Data is actually placed in temporal order (i.e., in the order you write it) regardless of the notional LBA that you assign; the FTL maintains a mapping table that translates the host's LBA into the physical location on the media. Hence, as you write and re-write data, there's going to be a GC process going on regardless of how LBAs are assigned -- you can't stop it, it's how the media works.
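To make the "temporal placement" point concrete, here's a toy model (all names hypothetical, nothing like a real FTL's complexity): writes land on media in arrival order regardless of LBA, and overwrites simply re-point the mapping table, leaving dead pages behind for GC.

```python
# Toy model of an FTL: data is placed in write (temporal) order, not LBA
# order, and the mapping table translates LBAs to physical locations.

class ToyFTL:
    def __init__(self):
        self.log = []    # physical media: pages in temporal order
        self.l2p = {}    # LBA -> physical page index (the mapping table)

    def write(self, lba, data):
        # Every write appends; an overwrite just re-points the LBA,
        # leaving the old physical page dead (garbage to collect later).
        self.log.append((lba, data))
        self.l2p[lba] = len(self.log) - 1

    def read(self, lba):
        return self.log[self.l2p[lba]][1]

    def dead_pages(self):
        live = set(self.l2p.values())
        return [i for i in range(len(self.log)) if i not in live]

ftl = ToyFTL()
ftl.write(100, "a")   # LBA 100 lands at physical page 0
ftl.write(5, "b")     # LBA 5 lands at page 1 -- temporal, not LBA, order
ftl.write(100, "c")   # overwrite: page 2 holds LBA 100 now; page 0 is dead
assert ftl.read(100) == "c"
assert ftl.dead_pages() == [0]
```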

The choice you have is whether to let the drive manage this process itself OR have the host manage it (as I proposed). Regardless of your choice, the amount of traffic to the backing media is the same (well, almost -- see below). Most of the performance variance with SSDs can be directly traced to the collision of the controller's background GC process with the foreground read/write activity of the host. By doing the GC in the host, you have a hope of choreographing this dance to your advantage; if you leave it to the controller, you will simply have a much larger storage performance variance (since Murphy's law will inevitably cause a drive GC event right when you least want it).

Some of the disadvantages of host-managed GC: 

(1) Increased PCIe usage, since GC traffic is now present.
(2) Decreased CPU performance due to collisions on the DRAM bus with the traffic from (1).
(3) Increased host CPU usage (to manage the GC process).
(4) More complicated host code, with a longer time to market.

Some of the advantages of host-managed GC are:

(5) Reduced background GC operations. Since the host knows which blocks are live and which are NOT, you don't have to re-write dead data. (Yes, you can simulate this with TRIM operations, but don't forget to disadvantage the drive-managed case with this extra CPU and controller cost -- TRIM operations are often implemented as exceptions in the drive controller and hence often aren't nearly as cheap as you would think. I'm also guessing that the timing of the arrival of the TRIM operation itself can be significant.) Reduced GC traffic will provide increased host performance (fewer GC writes means more room for foreground operations) AND increased drive lifetime. Of course, you can only realize these benefits if the cost of determining whether a page is live or dead is sufficiently inexpensive (this is likely access-pattern dependent too :().
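A back-of-the-envelope illustration of advantage (5), with invented numbers: if the host tracks liveness itself, GC relocates only live pages, whereas a controller that saw no (timely) TRIM must assume every previously written page is live and copy dead data too.

```python
# Hypothetical sketch: GC copy cost with and without host-side liveness.

def gc_copy_cost(segment, live_set):
    """Pages that must be relocated when reclaiming `segment`."""
    return [p for p in segment if p in live_set]

segment = ["A", "B", "C", "D"]   # pages in one erase unit
host_live = {"A", "D"}           # host knows B and C are dead

# Host-managed: copy 2 of 4 pages.
assert gc_copy_cost(segment, host_live) == ["A", "D"]

# Drive-managed without TRIM: the controller saw no deallocation, so its
# live set is everything -- copy 4 of 4 pages, i.e., re-write dead data.
assert gc_copy_cost(segment, set(segment)) == segment
```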

(6) Reduced storage performance variance. IMO, the host is better able to micro-schedule the GC activity by "gearing" it to front-end activity. This smooths out the bursts of background activity that are the source of the performance variance. It's micro-level storage QoS, but its biggest effect is a decrease in worst-case latencies.
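One way to picture the "gearing" in (6) -- purely a sketch, with invented watermarks: interleave a small, load-proportional number of GC page moves with every foreground write instead of letting cleaning run in bursts.

```python
# Hypothetical gearing policy: the fuller the device, the more GC work is
# charged against each foreground write; with plenty of free space, none.

def gc_moves_per_write(free_fraction, low=0.10, high=0.30, max_moves=4):
    """More free space -> no cleaning; near-full -> clean aggressively."""
    if free_fraction >= high:
        return 0
    if free_fraction <= low:
        return max_moves
    # Linear ramp between the two watermarks.
    pressure = (high - free_fraction) / (high - low)
    return round(pressure * max_moves)

assert gc_moves_per_write(0.50) == 0      # plenty of free space: no GC tax
assert gc_moves_per_write(0.05) == 4      # nearly full: maximum gearing
assert 0 < gc_moves_per_write(0.20) < 4   # in between: proportional
```

Because the GC "tax" is paid in small, predictable installments, the worst-case latency shrinks even though the total GC work is unchanged.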

(7) Increased performance on consumer-grade SSDs. A key difference in the performance of enterprise-grade vs. consumer-grade SSDs is the hardware cost associated with the FTL. Generally speaking, an enterprise-class SSD dedicates sufficient DRAM that the FTL translation of the host's LBA into the physical location is done entirely in DRAM, meaning that a single front-end operation generates one back-end operation. In a consumer-grade SSD, by contrast, some/much/all of this translation table is stored in flash, not DRAM, meaning that a single front-end host operation causes multiple operations on the back-end media. With the proposed log-structured architecture, FTL operations in the controller are reduced in that:

(a) writes are always sequential -- usually the highest performing usage pattern :) 
(b) reads are much cheaper, because the FTL translation structure built from all-sequential writes is much cheaper to search -- possibly requiring NO additional media accesses, i.e., the same performance level as an enterprise-class drive.

(There are other issues with consumer-grade SSDs that would need to be addressed too.)
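A rough illustration of why point (b) holds (entry sizes are invented for the sake of arithmetic): sequential writes let the FTL describe a whole LBA run with one extent entry (start, physical start, length) instead of one entry per 4K page, so the translation structure can stay resident even with a small DRAM/SRAM budget.

```python
# Hypothetical comparison of mapping-table size: page-granular table vs.
# extent table built from purely sequential writes.

def map_size_pages(n_pages, entry_bytes=4):
    return n_pages * entry_bytes                 # one entry per 4K page

def map_size_extents(extents, entry_bytes=12):
    return len(extents) * entry_bytes            # (lba, phys, len) per run

n = 1_000_000                                    # ~4 GB of 4K pages
random_map = map_size_pages(n)                   # 4 MB of mapping state
seq_map = map_size_extents([(0, 0, n)])          # one extent covers it all

assert random_map == 4_000_000
assert seq_map == 12
```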

More grist for the mill......


Allen Samuels  
R&D Engineering Fellow 

Western Digital® 
Email:  allen.samuels@xxxxxxx 
Office:  +1-408-801-7030
Mobile: +1-408-780-6416 


-----Original Message-----
From: Gregory Farnum [mailto:gfarnum@xxxxxxxxxx] 
Sent: Thursday, January 04, 2018 2:22 PM
To: ceph-devel <ceph-devel@xxxxxxxxxxxxxxx>
Cc: Allen Samuels <Allen.Samuels@xxxxxxx>; Sage Weil <sweil@xxxxxxxxxx>
Subject: Re: New ObjectStore

On Thu, Jan 4, 2018 at 8:21 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> Adding ceph-devel.
>
> This is a discussion Allen and I were having earlier about what the 
> future post-BlueStore native SPDK, nvme-focused ObjectStore backend 
> could look like.  The short description is something like 
> "log-structured btree", the idea being that we would do the cleaning 
> in the software layer so that the nvme's garbage collection generally has nothing to do.
>
> I would like to start thinking about this sooner rather than later so 
> that we have a good idea of what we want to build and, hopefully, can 
> time having a testable backend prototype with the futures-ified OSD IO path.

The below sounds like a pretty standard log-structured storage system to me. That has its benefits, but the main drawback of any log-structured system persists — it's a data fragmentation nightmare.

Now, I agree this is the way to go, and I think we can control the input and IO scheduling enough to handle the issues. But we have to figure out how to do so -- the successful log-structured systems I know of are carefully restricted and designed so that the fragmentation isn't an issue. I'm not sure how we can make that happen against, e.g., RBD, where we know we're going to get a lot of random 4K IOs, some of which overwrite repeatedly but a bunch of which won't on any useful timescale. When I've blue-skied this myself in the past, it's involved fundamental changes to RADOS like making all objects append-only, with compaction rules that both clients and OSDs know about and can coordinate on.

Taking a step back, I'm wary of doing this too early. As I mentioned in CDM, I'm not sure we'll have a good idea what kind of interface we want until we're farther along with working on the OSD, and I think that may have a big impact on how the storage system should look. And I'm not just talking about whether we want callbacks or polling, whether we want both readable and committed notifications, etc. If we're *really* targeting this to NVMe and faster storage, we should think about how we'll use persistent memory technologies. (This may sound familiar to a few of you...) Maybe they become an integral part of our persistence story, and in NVRAM we maintain a very simple journal of client IO in the form of incoming messages that is used for "commit", and we flush that out to a backing NVMe device in giant efficient batches, and the allowed ObjectStore write latency is in the range of a second. A system for doing that has very different requirements from one where we need to persist data via the ObjectStore and send back a client commit within 100 microseconds.
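[Editorial sketch of the NVRAM-journal idea Greg describes above -- all names and thresholds are hypothetical, not a proposed Ceph interface: a write is acknowledged ("committed") as soon as it is appended to a small persistent-memory journal, and accumulated entries are flushed to the NVMe backend in large batches.]

```python
# Toy model: commit on journal append, flush to the backend in big batches.

class JournaledStore:
    def __init__(self, batch_bytes=1 << 20):
        self.journal = []          # stands in for an NVRAM append log
        self.journal_bytes = 0
        self.backend = []          # stands in for the NVMe ObjectStore
        self.batch_bytes = batch_bytes

    def write(self, payload):
        # Commit point: the journal append, not the backend write.
        self.journal.append(payload)
        self.journal_bytes += len(payload)
        if self.journal_bytes >= self.batch_bytes:
            self.flush()
        return True                # acknowledged to the client

    def flush(self):
        # One big, efficient backend write; journal space is reclaimed.
        self.backend.append(b"".join(self.journal))
        self.journal.clear()
        self.journal_bytes = 0

store = JournaledStore(batch_bytes=8)
store.write(b"abcd")               # committed, still only in "NVRAM"
assert store.backend == []
store.write(b"efgh")               # crosses the batch threshold -> flush
assert store.backend == [b"abcdefgh"]
assert store.journal == []
```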

>
> sage
>
>
> On Thu, 21 Dec 2017, Allen Samuels wrote:
>> Yes. But you mix the data AND the b-tree into the same log.
>>
>> Essentially, each ObjectStore transaction generates a single 
>> contiguous write to the log. Data and metadata are intermixed and 
>> formatted such that you can re-read the last open stripe on a restart 
>> to restore. In the log you have data pages, whole metadata pages 
>> (b-tree pages) and metadata deltas (transactional edits) intermixed 
>> such that you can always read the stripe from the beginning and 
>> discern which is which (as well as torn stripe-writes....).
>>
>> Logically, a write consists of modifying a number of in-memory blocks 
>> (some data and some metadata). The data written to the log is ALWAYS 
>> the deltas, thus conceptually after the log write we are left with 
>> some number of "dirty" memory blocks that can only be reconstructed 
>> by going back to the last non-delta version and applying the deltas 
>> from the log since that time. [As a key optimization, whole block 
>> writes/deltas leave you with a memory block that's NOT dirty once the 
>> log is written ;) ] When memory is "full", you can bulk-up a log 
>> write with some un-associated dirty blocks which allows them to be 
>> cleaned and discarded. By writing deltas, you automatically get the "combining"
>> behavior (this is a strength of LSM algos) but only on the individual 
>> B-tree blocks (not the entire tree), in essence, if you have multiple 
>> transactions on metadata that's within the same B-tree block [a VERY 
>> frequent situation] this system automatically combines those like 
>> what you get with Rocks-ish LSM stuff.
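[Editorial sketch of the delta-log scheme Allen describes above -- a hypothetical simplification, not the actual design: each transaction appends records mixing whole metadata pages and deltas to one shared log, and a dirty block is reconstructed by replaying the deltas on top of its last whole image. Note how multiple edits to the same B-tree block combine automatically.]

```python
# One shared log of (kind, block_id, payload) records; dirty blocks are
# rebuilt from the last whole image plus all later deltas, in log order.

log = []

def write_whole(block_id, page):
    log.append(("whole", block_id, dict(page)))

def write_delta(block_id, edits):
    log.append(("delta", block_id, dict(edits)))

def reconstruct(block_id):
    """Replay: find the last whole image, then apply later deltas in order."""
    page, seen_whole = {}, False
    for kind, bid, payload in log:
        if bid != block_id:
            continue
        if kind == "whole":
            page, seen_whole = dict(payload), True
        elif seen_whole:
            page.update(payload)
    return page

write_whole("btree-7", {"k1": "a", "k2": "b"})
write_delta("btree-7", {"k2": "b2"})   # small transactional edit
write_delta("btree-7", {"k3": "c"})    # another edit, same block: combined
assert reconstruct("btree-7") == {"k1": "a", "k2": "b2", "k3": "c"}
```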
>>
>> You need to limit how long the delta chain for a metadata block 
>> becomes (to improve restart times by bounding the amount of data that 
>> needs to be read in to reconstruct the dirty block from its accumulated deltas).
>> The cheapest thing to do is to simply ensure that all blocks get 
>> flushed across a backing store stripe change. That's relatively easy 
>> to do by regulating the maximum amount of dirty blocks and then 
>> simply writing all dirty blocks at the start of each stripe (when you 
>> close one stripe and open up the next one).
>>
>> Log cleaning/compaction is relatively easy: you just read each stripe
>> from front to end, parsing as you go. You'll need to look up each
>> object to determine whether this is an old version of the object or
>> the "live" object. If it's live, you just mark it dirty (in the
>> in-memory
>> buffer) and go on, it'll get dumped with the next write (which you 
>> might have to artificially trigger).
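[Editorial sketch of the cleaning pass described above, with hypothetical structures: walk a stripe, look each entry up in the index, and mark live versions dirty so the next log write relocates them; stale versions are simply skipped and their space is reclaimed for free.]

```python
# One cleaning pass over a stripe of (object_id, version, data) entries.

def clean_stripe(stripe, index, dirty):
    """index: oid -> live version; dirty: in-memory buffer of blocks to re-dump."""
    for oid, version, data in stripe:
        if index.get(oid) == version:     # this copy is the live one
            dirty[oid] = data             # re-dump it with the next write
        # else: stale version -- nothing to do

index = {"obj-a": 2, "obj-b": 1}
stripe = [("obj-a", 1, "old"), ("obj-b", 1, "keep"), ("obj-a", 2, "new")]
dirty = {}
clean_stripe(stripe, index, dirty)
assert dirty == {"obj-a": "new", "obj-b": "keep"}
```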
>>
>> One key issue is how you address blocks. Some schemes use physical 
>> addresses for blocks, this forces the cleaning activity to dirty 
>> additional metadata (the pointer to this object), however for data 
>> this tends to be localized (the pointer to the data is typically
>> written in the same stripe). Alternatively, you can create a
>> logical/physical address mapping table so that you can move blocks 
>> and update this micro-table without affecting the logical addresses 
>> in the upper level metadata blocks. However, this does create an 
>> additional mapping table which needs to be checkpointed/recovered 
>> (separate from the other data) and is frequently implemented as an 
>> in-memory table, costing 6-ish bytes of DRAM per page.
>>
>> Allen Samuels
>> R&D Engineering Fellow
>>
>> Western Digital(r)
>> Email:  allen.samuels@xxxxxxx
>> Office:  +1-408-801-7030
>> Mobile: +1-408-780-6416
>>
>>
>> -----Original Message-----
>> From: Sage Weil [mailto:sweil@xxxxxxxxxx]
>> Sent: Thursday, December 21, 2017 9:45 AM
>> To: Allen Samuels <Allen.Samuels@xxxxxxx>
>> Subject: RE: New ObjectStore
>>
>> log structured b-tree thing?
>>
>> On Thu, 21 Dec 2017, Allen Samuels wrote:
>>
>> > Whew.
>> >
>> > Stay away from LSM. They optimize for developer time at the expense of run-time resource consumption.
>> >
>> > Allen Samuels
>> > R&D Engineering Fellow
>> >
>> > Western Digital(r)
>> > Email:  allen.samuels@xxxxxxx
>> > Office:  +1-408-801-7030
>> > Mobile: +1-408-780-6416
>> >
>> >
>> > -----Original Message-----
>> > From: Sage Weil [mailto:sweil@xxxxxxxxxx]
>> > Sent: Thursday, December 21, 2017 8:41 AM
>> > To: Allen Samuels <Allen.Samuels@xxxxxxx>
>> > Subject: Re: New ObjectStore
>> >
>> > On Thu, 21 Dec 2017, Allen Samuels wrote:
>> > > w.r.t. today's discussion about a new ObjectStore. Moving the 
>> > > fast-paths of the OSD into a reactive framework (SeaStar) is 
>> > > great and will certainly improve performance significantly for small object I/O.
>> > >
>> > > However, IMO, using RocksDB (LSM) as a foundational building 
>> > > block is a severely sub-optimal choice and will likely serve to 
>> > > severely limit the performance boost that you'll achieve with the 
>> > > re-write. For large-scale systems, I'm going to make a 
>> > > wild-speculation and suggest that you won't see any actual 
>> > > throughput improvement from the re-write, because the write-amp 
>> > > for metadata will end up being the limiting factor - you'll have 
>> > > PLENTY of CPU that's idle waiting for your I/O subsystem and end up running at the same speed as today.

Just a note, while this would obviously never be an endpoint, I would consider that outcome a HUGE success. Idling CPUs mean fewer people running into recovery cost inflation issues (and probably better-scaling recovery from our having made everything discrete and countable anyway), the ability to pack more drives into a single system for better economics, etc. Let's not discourage ourselves from climbing Mount Denali first just because we want to make it to Mt Everest! ;) -Greg

>> >
>> > Yeah, completely agree.. the seastar objectstore (SeaStore for 
>> > maximum
>> > confusion) won't use rocksdb.  We're talking about the futures kv 
>> > interface possibility as a glue layer to allow a mid-term 
>> > semi-futures-based bluestore.  Not sure it'll make sense, but it's 
>> > an option.
>> >
>> > So the new thing won't use rocksdb at all.  There don't appear to 
>> > be any existing kv choices, but that's probably a good thing as 
>> > it'll force us to build metadata capabilities specific to our 
>> > needs.  That might be something LSM-ish, might not, I don't have a 
>> > very clear picture of this yet (beyond that it should be log structured :).
>> >
>> > sage
>> >
>> >
>>
>>



