> -----Original Message-----
> From: Sage Weil [mailto:sweil@xxxxxxxxxx]
> Sent: Monday, August 22, 2016 6:09 PM
> To: Allen Samuels <Allen.Samuels@xxxxxxxxxxx>
> Cc: ceph-devel@xxxxxxxxxxxxxxx
> Subject: RE: bluestore blobs REVISITED
>
> On Mon, 22 Aug 2016, Allen Samuels wrote:
> > Another possibility is to "bin" the lextent table into known, fixed,
> > offset ranges. Suppose each oNode had a fixed range of LBA keys
> > associated with the lextent table: say [0..128K), [128K..256K), ...
>
> Yeah, I think that's the way to do it. Either a set<uint32_t>
> lextent_key_offsets or uint64_t lextent_map_chunk_size to specify the
> granularity.

We need to do some actual estimates on this scheme to make sure we're
actually landing on a real solution and not just another band-aid that we
have to rip off (painfully) at some future time.

> > It eliminates the need to put a "lower_bound" into the KV Store
> > directly. Though it'll likely be a bit more complex above and somewhat
> > less efficient.
>
> FWIW this is basically what the iterator does, I think. It's a separate
> rocksdb operation to create a snapshot to iterate over (and we don't rely
> on that feature anywhere). It's still more expensive than raw gets just
> because it has to have fingers in all the levels to ensure that it finds
> all the keys for the given range, while get can stop once it finds a key
> or a tombstone (either in a cache or a higher level of the tree).
>
> But I'm not super thrilled about this complexity. I still hope
> (wishfully?) we can get the lextent map small enough that we can leave it
> in the onode. Otherwise we really are going to net +1 kv fetch for every
> operation (3, in the clone case... onode, then lextent, then shared blob).

I don't see the feasibility of leaving the lextent map in the oNode. It's
just too big for the 4K random write case. I know that's not indicative of
real-world usage, but it's what people use to measure systems.

BTW, w.r.t. lower-bound, my reading of the BloomFilter / Prefix stuff
suggests that it would be relatively trivial to ensure that bloom filters
properly ignore the "offset" portion of an lextent key. Meaning that I
believe a "lower_bound" operator ought to be relatively easy to implement
without triggering the overhead of a snapshot, again, provided that you
supply a custom bloom-filter implementation that does the right thing.

So the question is: do we want to avoid monkeying with RocksDB and go with
a binning approach (TBD: ensuring that this is a real solution to the
problem), OR do we bite the bullet and solve the lower-bound lookup
problem?

BTW, on the "binning" scheme, perhaps the solution is just to put the
current "binning value" [presumably some log2 thing] into the oNode itself
-- it'll just be a byte. Then you're only stuck with the complexity of
deciding when to do a "split" if the current bin has gotten too large
(some arbitrary limit on the size of the encoded lextent-bin). Rough
sketches of both approaches follow below.

>
> sage
>
> >
> >
> > > -----Original Message-----
> > > From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-
> > > owner@xxxxxxxxxxxxxxx] On Behalf Of Allen Samuels
> > > Sent: Sunday, August 21, 2016 10:27 AM
> > > To: Sage Weil <sweil@xxxxxxxxxx>
> > > Cc: ceph-devel@xxxxxxxxxxxxxxx
> > > Subject: Re: bluestore blobs REVISITED
> > >
> > > I wonder how hard it would be to add a "lower-bound" fetch like stl.
> > > That would allow the kv store to do the fetch without incurring the
> > > overhead of a snapshot for the iteration scan.
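To make the binning idea concrete, here is a minimal sketch of the bookkeeping
it implies. Everything below is an assumption for illustration (onode_binning_t,
lextent_bin_key, bin_needs_split, the 128K default granularity, and the 4K
split limit are not existing BlueStore code): the oNode carries a single byte
holding log2 of the current bin size, the KV key for a logical offset is
derived from that byte, and a "split" is triggered when an encoded bin exceeds
an arbitrary size limit.

    // Sketch only: hypothetical names, not actual BlueStore structures.
    #include <cstddef>
    #include <cstdint>
    #include <string>

    struct onode_binning_t {
      uint8_t bin_shift = 17;   // log2 of the bin granularity; 17 -> 128K bins
    };

    // KV key for the lextent bin covering 'logical_offset':
    //   <onode_key> '.' <bin number, big-endian so bins sort in offset order>
    std::string lextent_bin_key(const std::string& onode_key,
                                const onode_binning_t& b,
                                uint64_t logical_offset) {
      uint64_t bin = logical_offset >> b.bin_shift;
      std::string key = onode_key;
      key.push_back('.');
      for (int i = 7; i >= 0; --i)          // big-endian encode of the bin number
        key.push_back(char((bin >> (i * 8)) & 0xff));
      return key;
    }

    // "Split" policy: once the encoded bin outgrows some arbitrary limit,
    // decrement bin_shift (halving the bin size) and rewrite the affected bins.
    bool bin_needs_split(size_t encoded_bin_bytes, size_t limit = 4096) {
      return encoded_bin_bytes > limit;
    }

With this layout a read needs only the oNode (for bin_shift) plus one exact-key
get for the covering bin -- no lower_bound in the KV store -- at the cost of the
rewrite work whenever a bin overflows and the shift changes.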
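On the lower-bound side, the RocksDB pieces being referred to (a prefix
extractor plus prefix-based bloom filters) do exist; below is a sketch of how
they could be wired so the filter keys off the oNode portion of a lextent key
and ignores the trailing offset, letting Iterator::Seek() act as a cheap
lower_bound within a single object's keys. The fixed 32-byte prefix, the key
layout, and the path are assumptions for illustration; real oNode keys are
variable length and would need a custom SliceTransform rather than
NewFixedPrefixTransform.

    // Sketch only: prefix bloom filters so Seek() behaves like a lower_bound
    // within one onode's lextent keys.  Key layout and lengths are assumed.
    #include <cassert>
    #include <memory>
    #include <string>
    #include <rocksdb/db.h>
    #include <rocksdb/filter_policy.h>
    #include <rocksdb/options.h>
    #include <rocksdb/slice_transform.h>
    #include <rocksdb/table.h>

    int main() {
      const size_t kOnodeKeyLen = 32;   // assumed fixed-length onode-key prefix

      rocksdb::Options opt;
      opt.create_if_missing = true;
      // Bloom filters and prefix hashing look only at the first kOnodeKeyLen
      // bytes, so the per-offset suffix never defeats the filter.
      opt.prefix_extractor.reset(rocksdb::NewFixedPrefixTransform(kOnodeKeyLen));

      rocksdb::BlockBasedTableOptions tbl;
      tbl.filter_policy.reset(rocksdb::NewBloomFilterPolicy(10, false));
      tbl.whole_key_filtering = false;  // build filters on prefixes, not whole keys
      opt.table_factory.reset(rocksdb::NewBlockBasedTableFactory(tbl));

      rocksdb::DB* db = nullptr;
      rocksdb::Status s = rocksdb::DB::Open(opt, "/tmp/lextent-prefix-test", &db);
      assert(s.ok());

      // lower_bound-style lookup: first lextent key at or after (onode, offset),
      // without a snapshot-backed total-order iterator.
      rocksdb::ReadOptions ro;
      ro.prefix_same_as_start = true;   // stay within this onode's key range
      std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(ro));

      std::string seek_key(kOnodeKeyLen, 'o');   // stand-in onode key
      seek_key.append("\x00\x00\x01\x00", 4);    // stand-in encoded logical offset
      it->Seek(seek_key);
      if (it->Valid() &&
          it->key().starts_with(rocksdb::Slice(seek_key.data(), kOnodeKeyLen))) {
        // it->key()/it->value() is the lextent entry at or after the offset
      }

      it.reset();
      delete db;
      return 0;
    }

The caveat above still applies: even with the prefix bloom filter the Seek has
to consult every level that may contain the prefix, so this is cheaper than a
snapshot-backed total-order iterator but still not as cheap as a point get.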
> > > > > > Shared blobs were always going to trigger an extra kv fetch no matter > what. > > > > > > Sent from my iPhone. Please excuse all typos and autocorrects. > > > > > > > On Aug 21, 2016, at 12:08 PM, Sage Weil <sweil@xxxxxxxxxx> wrote: > > > > > > > >> On Sat, 20 Aug 2016, Allen Samuels wrote: > > > >> I have another proposal (it just occurred to me, so it might not > > > >> survive more scrutiny). > > > >> > > > >> Yes, we should remove the blob-map from the oNode. > > > >> > > > >> But I believe we should also remove the lextent map from the > > > >> oNode and make each lextent be an independent KV value. > > > >> > > > >> However, in the special case where each extent --exactly-- maps > > > >> onto a blob AND the blob is not referenced by any other extent > > > >> (which is the typical case, unless you're doing compression with > > > >> strange-ish > > > >> overlaps) > > > >> -- then you encode the blob in the lextent itself and there's no > > > >> separate blob entry. > > > >> > > > >> This is pretty much exactly the same number of KV fetches as what > > > >> you proposed before when the blob isn't shared (the typical case) > > > >> -- except the oNode is MUCH MUCH smaller now. > > > > > > > > I think this approach makes a lot of sense! The only thing I'm > > > > worried about is that the lextent keys are no longer known when > > > > they are being fetched (since they will be a logical offset), > > > > which means we'll have to use an iterator instead of a simple get. > > > > The former is quite a bit slower than the latter (which can make > > > > use of the rocksdb caches and/or key bloom filters more easily). > > > > > > > > We could augment your approach by keeping *just* the lextent > > > > offsets in the onode, so that we know exactly which lextent key to > > > > fetch, but then I'm not sure we'll get much benefit (lextent > > > > metadata size goes down by ~1/3, but then we have an extra get for > cloned objects). > > > > > > > > Hmm, the other thing to keep in mind is that for RBD the common > > > > case is that lots of objects have clones, and many of those > > > > objects' blobs will be shared. > > > > > > > > sage > > > > > > > >> So for the non-shared case, you fetch the oNode which is > > > >> dominated by the xattrs now (so figure a couple of hundred bytes > > > >> and not much CPU cost to deserialize). And then fetch from the KV > > > >> for the lextent (which is 1 fetch -- unless it overlaps two > > > >> previous lextents). If it's the optimal case, the KV fetch is > > > >> small (10-15 bytes) and trivial to deserialize. If it's an > > > >> unshared/local blob then you're ready to go. If the blob is > > > >> shared (locally or globally) then you'll have to go fetch that one too. > > > >> > > > >> This might lead to the elimination of the local/global blob thing > > > >> (I think you've talked about that before) as now the only "local" > > > >> blobs are the unshared single extent blobs which are stored > > > >> inline with the lextent entry. You'll still have the special > > > >> cases of promoting unshared > > > >> (inline) blobs to global blobs -- which is probably similar to > > > >> the current promotion "magic" on a clone operation. > > > >> > > > >> The current refmap concept may require some additional work. I > > > >> believe that we'll have to do a reconstruction of the refmap, but > > > >> fortunately only for the range of the current I/O. 
That will be a > > > >> bit more expensive, but still less expensive than reconstructing > > > >> the entire refmap for every oNode deserialization, Fortunately I > > > >> believe the refmap is only really needed for compression cases or > > > >> RBD cases > > > without "trim" > > > >> (this is the case to optimize -- it'll make trim really important > > > >> for performance). > > > >> > > > >> Best of both worlds???? > > > >> > > > >> Allen Samuels > > > >> SanDisk |a Western Digital brand > > > >> 2880 Junction Avenue, Milpitas, CA 95134 > > > >> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@xxxxxxxxxxx > > > >> > > > >> > > > >>> -----Original Message----- > > > >>> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel- > > > >>> owner@xxxxxxxxxxxxxxx] On Behalf Of Allen Samuels > > > >>> Sent: Friday, August 19, 2016 7:16 AM > > > >>> To: Sage Weil <sweil@xxxxxxxxxx> > > > >>> Cc: ceph-devel@xxxxxxxxxxxxxxx > > > >>> Subject: RE: bluestore blobs > > > >>> > > > >>>> -----Original Message----- > > > >>>> From: Sage Weil [mailto:sweil@xxxxxxxxxx] > > > >>>> Sent: Friday, August 19, 2016 6:53 AM > > > >>>> To: Allen Samuels <Allen.Samuels@xxxxxxxxxxx> > > > >>>> Cc: ceph-devel@xxxxxxxxxxxxxxx > > > >>>> Subject: RE: bluestore blobs > > > >>>> > > > >>>> On Fri, 19 Aug 2016, Allen Samuels wrote: > > > >>>>>> -----Original Message----- > > > >>>>>> From: Sage Weil [mailto:sweil@xxxxxxxxxx] > > > >>>>>> Sent: Thursday, August 18, 2016 8:10 AM > > > >>>>>> To: Allen Samuels <Allen.Samuels@xxxxxxxxxxx> > > > >>>>>> Cc: ceph-devel@xxxxxxxxxxxxxxx > > > >>>>>> Subject: RE: bluestore blobs > > > >>>>>> > > > >>>>>> On Thu, 18 Aug 2016, Allen Samuels wrote: > > > >>>>>>>> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph- > devel- > > > >>>>>>>> owner@xxxxxxxxxxxxxxx] On Behalf Of Sage Weil > > > >>>>>>>> Sent: Wednesday, August 17, 2016 7:26 AM > > > >>>>>>>> To: ceph-devel@xxxxxxxxxxxxxxx > > > >>>>>>>> Subject: bluestore blobs > > > >>>>>>>> > > > >>>>>>>> I think we need to look at other changes in addition to the > > > >>>>>>>> encoding performance improvements. Even if they end up > > > >>>>>>>> being good enough, these changes are somewhat orthogonal > > > >>>>>>>> and at > > > least > > > >>>>>>>> one of them should give us something that is even faster. > > > >>>>>>>> > > > >>>>>>>> 1. I mentioned this before, but we should keep the encoding > > > >>>>>>>> bluestore_blob_t around when we load the blob map. If it's > > > >>>>>>>> not changed, don't reencode it. There are no blockers for > > > >>>>>>>> implementing this > > > >>>>>> currently. > > > >>>>>>>> It may be difficult to ensure the blobs are properly marked > dirty... > > > >>>>>>>> I'll see if we can use proper accessors for the blob to > > > >>>>>>>> enforce this at compile time. We should do that anyway. > > > >>>>>>> > > > >>>>>>> If it's not changed, then why are we re-writing it? I'm > > > >>>>>>> having a hard time thinking of a case worth optimizing where > > > >>>>>>> I want to re-write the oNode but the blob_map is unchanged. > > > >>>>>>> Am I missing > > > >>>> something obvious? > > > >>>>>> > > > >>>>>> An onode's blob_map might have 300 blobs, and a single write > > > >>>>>> only updates one of them. The other 299 blobs need not be > > > >>>>>> reencoded, just > > > >>>> memcpy'd. > > > >>>>> > > > >>>>> As long as we're just appending that's a good optimization. > > > >>>>> How often does that happen? It's certainly not going to help > > > >>>>> the RBD 4K random write problem. 
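As an illustration of the "keep the encoded bluestore_blob_t around" idea from
point 1 above, here is a minimal sketch (cached_blob_t, modify(), and the
string-based encoding are illustrative assumptions, not the real BlueStore
types): the encoded bytes are cached next to the in-memory blob, every mutation
goes through an accessor that drops the cache, and re-serializing the map
re-encodes only the blobs that actually changed.

    // Sketch only: illustrative types, not the real bluestore_blob_t machinery.
    #include <cstdint>
    #include <map>
    #include <optional>
    #include <string>

    struct blob_t {
      // ... extents, csum data, compression info ...
      std::string encode() const { return "<encoded blob bytes>"; }  // placeholder
    };

    struct cached_blob_t {
      blob_t blob;
      std::optional<std::string> encoded;   // bytes as originally loaded/encoded
      // All mutation goes through this accessor, so a stale cached encoding
      // cannot outlive a change to the blob (the "proper accessors" point).
      blob_t& modify() { encoded.reset(); return blob; }
      const blob_t& get() const { return blob; }
    };

    // Re-serialize the whole map: untouched blobs are a straight byte copy of
    // their cached encoding; only modified blobs pay the encoding cost.
    std::string encode_blob_map(std::map<int64_t, cached_blob_t>& blob_map) {
      std::string out;
      for (auto& [id, cb] : blob_map) {
        (void)id;                           // id/length framing elided here
        if (!cb.encoded)
          cb.encoded = cb.blob.encode();
        out += *cb.encoded;
      }
      return out;
    }

For the 300-blob onode in the example above, a 4k write that dirties one blob
would re-encode that single blob and copy the other 299 cached buffers verbatim.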
> > > >>>> > > > >>>> It won't help the (l)extent_map encoding, but it avoids almost > > > >>>> all of the blob reencoding. A 4k random write will update one > > > >>>> blob out of > > > >>>> ~100 (or whatever it is). > > > >>>> > > > >>>>>>>> 2. This turns the blob Put into rocksdb into two memcpy > stages: > > > >>>>>>>> one to assemble the bufferlist (lots of bufferptrs to each > > > >>>>>>>> untouched > > > >>>>>>>> blob) into a single rocksdb::Slice, and another memcpy > > > >>>>>>>> somewhere inside rocksdb to copy this into the write buffer. > > > >>>>>>>> We could extend the rocksdb interface to take an iovec so > > > >>>>>>>> that the first memcpy isn't needed (and rocksdb will > > > >>>>>>>> instead iterate over our buffers and copy them directly into its > write buffer). > > > >>>>>>>> This is probably a pretty small piece of the overall time... > > > >>>>>>>> should verify with a profiler > > > >>>>>> before investing too much effort here. > > > >>>>>>> > > > >>>>>>> I doubt it's the memcpy that's really the expensive part. > > > >>>>>>> I'll bet it's that we're transcoding from an internal to an > > > >>>>>>> external representation on an element by element basis. If > > > >>>>>>> the iovec scheme is going to help, it presumes that the > > > >>>>>>> internal data structure essentially matches the external > > > >>>>>>> data structure so that only an iovec copy is required. I'm > > > >>>>>>> wondering how compatible this is with the current concepts > > > >>>>>>> of > > > lextext/blob/pextent. > > > >>>>>> > > > >>>>>> I'm thinking of the xattr case (we have a bunch of strings to > > > >>>>>> copy > > > >>>>>> verbatim) and updated-one-blob-and-kept-99-unchanged case: > > > >>>>>> instead of memcpy'ing them into a big contiguous buffer and > > > >>>>>> having rocksdb memcpy > > > >>>>>> *that* into it's larger buffer, give rocksdb an iovec so that > > > >>>>>> they smaller buffers are assembled only once. > > > >>>>>> > > > >>>>>> These buffers will be on the order of many 10s to a couple > > > >>>>>> 100s of > > > >>> bytes. > > > >>>>>> I'm not sure where the crossover point for constructing and > > > >>>>>> then traversing an iovec vs just copying twice would be... > > > >>>>> > > > >>>>> Yes this will eliminate the "extra" copy, but the real problem > > > >>>>> is that the oNode itself is just too large. I doubt removing > > > >>>>> one extra copy is going to suddenly "solve" this problem. I > > > >>>>> think we're going to end up rejiggering things so that this > > > >>>>> will be much less of a problem than it is now -- time will tell. > > > >>>> > > > >>>> Yeah, leaving this one for last I think... until we see memcpy > > > >>>> show up in the profile. > > > >>>> > > > >>>>>>>> 3. Even if we do the above, we're still setting a big (~4k > > > >>>>>>>> or > > > >>>>>>>> more?) key into rocksdb every time we touch an object, even > > > >>>>>>>> when a tiny > > > >>>>> > > > >>>>> See my analysis, you're looking at 8-10K for the RBD random > > > >>>>> write case > > > >>>>> -- which I think everybody cares a lot about. > > > >>>>> > > > >>>>>>>> amount of metadata is getting changed. This is a > > > >>>>>>>> consequence of embedding all of the blobs into the onode > > > >>>>>>>> (or bnode). That seemed like a good idea early on when > > > >>>>>>>> they were tiny (i.e., just an extent), but now I'm not so > > > >>>>>>>> sure. I see a couple of different > > > >>>> options: > > > >>>>>>>> > > > >>>>>>>> a) Store each blob as ($onode_key+$blobid). 
When we load > > > >>>>>>>> the onode, load the blobs too. They will hopefully be > > > >>>>>>>> sequential in rocksdb (or definitely sequential in zs). > > > >>>>>>>> Probably go back to using an > > > >>>> iterator. > > > >>>>>>>> > > > >>>>>>>> b) Go all in on the "bnode" like concept. Assign blob ids > > > >>>>>>>> so that they are unique for any given hash value. Then > > > >>>>>>>> store the blobs as $shard.$poolid.$hash.$blobid (i.e., > > > >>>>>>>> where the bnode is now). Then when clone happens there is > > > >>>>>>>> no onode->bnode migration magic happening--we've already > > > >>>>>>>> committed to storing blobs in separate keys. When we load > > > >>>>>>>> the onode, keep the conditional bnode loading we already > > > >>>>>>>> have.. but when the bnode is loaded load up all the blobs > > > >>>>>>>> for the hash key. (Okay, we could fault in blobs > > > >>>>>>>> individually, but that code will be more > > > >>>>>>>> complicated.) > > > >>>>> > > > >>>>> I like this direction. I think you'll still end up demand > > > >>>>> loading the blobs in order to speed up the random read case. > > > >>>>> This scheme will result in some space-amplification, both in > > > >>>>> the lextent and in the blob-map, it's worth a bit of study too > > > >>>>> see how bad the metadata/data ratio becomes (just as a guess, > > > >>>>> $shard.$poolid.$hash.$blobid is probably 16 + > > > >>>>> 16 + 8 + 16 bytes in size, that's ~60 bytes of key for each > > > >>>>> Blob > > > >>>>> -- unless your KV store does path compression. My reading of > > > >>>>> RocksDB sst file seems to indicate that it doesn't, I > > > >>>>> *believe* that ZS does [need to confirm]). I'm wondering if > > > >>>>> the current notion > > > of local vs. > > > >>>>> global blobs isn't actually beneficial in that we can give > > > >>>>> local blobs different names that sort with their associated > > > >>>>> oNode (which probably makes the space-amp worse) which is an > > > >>>>> important optimization. We do need to watch the space amp, > > > >>>>> we're going to be burning DRAM to make KV accesses cheap and > > > >>>>> the amount of DRAM > > > is > > > >>> proportional to the space amp. > > > >>>> > > > >>>> I got this mostly working last night... just need to sort out > > > >>>> the clone case (and clean up a bunch of code). It was a > > > >>>> relatively painless transition to make, although in its current > > > >>>> form the blobs all belong to the bnode, and the bnode if > > > >>>> ephemeral but remains in > > > >>> memory until all referencing onodes go away. > > > >>>> Mostly fine, except it means that odd combinations of clone > > > >>>> could leave lots of blobs in cache that don't get trimmed. > > > >>>> Will address that > > > later. > > > >>>> > > > >>>> I'll try to finish it up this morning and get it passing tests and posted. > > > >>>> > > > >>>>>>>> In both these cases, a write will dirty the onode (which is > > > >>>>>>>> back to being pretty small.. just xattrs and the lextent > > > >>>>>>>> map) and 1-3 blobs (also > > > >>>>>> now small keys). > > > >>>>> > > > >>>>> I'm not sure the oNode is going to be that small. Looking at > > > >>>>> the RBD random 4K write case, you're going to have 1K entries > > > >>>>> each of which has an offset, size and a blob-id reference in > > > >>>>> them. In my current oNode compression scheme this compresses > > > >>>>> to about 1 byte > > > per entry. 
> > > >>>>> However, this optimization relies on being able to cheaply > > > >>>>> renumber the blob-ids, which is no longer possible when the > > > >>>>> blob-ids become parts of a key (see above). So now you'll have > > > >>>>> a minimum of 1.5-3 bytes extra for each blob-id (because you > > > >>>>> can't assume that the blob-ids > > > >>>> become "dense" > > > >>>>> anymore) So you're looking at 2.5-4 bytes per entry or about > > > >>>>> 2.5-4K Bytes of lextent table. Worse, because of the variable > > > >>>>> length encoding you'll have to scan the entire table to > > > >>>>> deserialize it (yes, we could do differential editing when we > > > >>>>> write but that's another > > > >>> discussion). > > > >>>>> Oh and I forgot to add the 200-300 bytes of oNode and xattrs :). > > > >>>>> So while this looks small compared to the current ~30K for the > > > >>>>> entire thing oNode/lextent/blobmap, it's NOT a huge gain over > > > >>>>> 8-10K of the compressed oNode/lextent/blobmap scheme that I > > > published earlier. > > > >>>>> > > > >>>>> If we want to do better we will need to separate the lextent > > > >>>>> from the oNode also. It's relatively easy to move the lextents > > > >>>>> into the KV store itself (there are two obvious ways to deal > > > >>>>> with this, either use the native offset/size from the lextent > > > >>>>> itself OR create 'N' buckets of logical offset into which we > > > >>>>> pour entries -- both of these would add somewhere between 1 > > > >>>>> and 2 KV look-ups > > > per > > > >>>>> operation > > > >>>>> -- here is where an iterator would probably help. > > > >>>>> > > > >>>>> Unfortunately, if you only process a portion of the lextent > > > >>>>> (because you've made it into multiple keys and you don't want > > > >>>>> to load all of > > > >>>>> them) you no longer can re-generate the refmap on the fly > > > >>>>> (another key space optimization). The lack of refmap screws up > > > >>>>> a number of other important algorithms -- for example the > > > >>>>> overlapping blob-map > > > >>> thing, etc. > > > >>>>> Not sure if these are easy to rewrite or not -- too > > > >>>>> complicated to think about at this hour of the evening. > > > >>>> > > > >>>> Yeah, I forgot about the extent_map and how big it gets. I > > > >>>> think, though, that if we can get a 4mb object with 1024 4k > > > >>>> lextents to encode the whole onode and extent_map in under 4K > > > >>>> that will be good enough. The blob update that goes with it > > > >>>> will be ~200 bytes, and benchmarks aside, the 4k random write > > > >>>> 100% fragmented object is a worst > > > >>> case. > > > >>> > > > >>> Yes, it's a worst-case. But it's a > > > >>> "worst-case-that-everybody-looks-at" vs. a > > > >>> "worst-case-that-almost- > > > nobody-looks-at". > > > >>> > > > >>> I'm still concerned about having an oNode that's larger than a 4K > block. > > > >>> > > > >>> > > > >>>> > > > >>>> Anyway, I'll get the blob separation branch working and we can > > > >>>> go from there... 
> > > >>>>
> > > >>>> sage

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html