Another possibility is to "bin" the lextent table into known, fixed, offset ranges. Suppose each oNode had a fixed range of LBA keys associated with the lextent table: say [0..128K), [128K..256K), ... It eliminates the need to put a "lower_bound" into the KV Store directly. Though it'll likely be a bit more complex above and somewhat less efficient. > -----Original Message----- > From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel- > owner@xxxxxxxxxxxxxxx] On Behalf Of Allen Samuels > Sent: Sunday, August 21, 2016 10:27 AM > To: Sage Weil <sweil@xxxxxxxxxx> > Cc: ceph-devel@xxxxxxxxxxxxxxx > Subject: Re: bluestore blobs REVISITED > > I wonder how hard it would be to add a "lower-bound" fetch like stl. That > would allow the kv store to do the fetch without incurring the overhead of a > snapshot for the iteration scan. > > Shared blobs were always going to trigger an extra kv fetch no matter what. > > Sent from my iPhone. Please excuse all typos and autocorrects. > > > On Aug 21, 2016, at 12:08 PM, Sage Weil <sweil@xxxxxxxxxx> wrote: > > > >> On Sat, 20 Aug 2016, Allen Samuels wrote: > >> I have another proposal (it just occurred to me, so it might not > >> survive more scrutiny). > >> > >> Yes, we should remove the blob-map from the oNode. > >> > >> But I believe we should also remove the lextent map from the oNode > >> and make each lextent be an independent KV value. > >> > >> However, in the special case where each extent --exactly-- maps onto > >> a blob AND the blob is not referenced by any other extent (which is > >> the typical case, unless you're doing compression with strange-ish > >> overlaps) > >> -- then you encode the blob in the lextent itself and there's no > >> separate blob entry. > >> > >> This is pretty much exactly the same number of KV fetches as what you > >> proposed before when the blob isn't shared (the typical case) -- > >> except the oNode is MUCH MUCH smaller now. > > > > I think this approach makes a lot of sense! The only thing I'm > > worried about is that the lextent keys are no longer known when they > > are being fetched (since they will be a logical offset), which means > > we'll have to use an iterator instead of a simple get. The former is > > quite a bit slower than the latter (which can make use of the rocksdb > > caches and/or key bloom filters more easily). > > > > We could augment your approach by keeping *just* the lextent offsets > > in the onode, so that we know exactly which lextent key to fetch, but > > then I'm not sure we'll get much benefit (lextent metadata size goes > > down by ~1/3, but then we have an extra get for cloned objects). > > > > Hmm, the other thing to keep in mind is that for RBD the common case > > is that lots of objects have clones, and many of those objects' blobs > > will be shared. > > > > sage > > > >> So for the non-shared case, you fetch the oNode which is dominated by > >> the xattrs now (so figure a couple of hundred bytes and not much CPU > >> cost to deserialize). And then fetch from the KV for the lextent > >> (which is 1 fetch -- unless it overlaps two previous lextents). If > >> it's the optimal case, the KV fetch is small (10-15 bytes) and > >> trivial to deserialize. If it's an unshared/local blob then you're > >> ready to go. If the blob is shared (locally or globally) then you'll > >> have to go fetch that one too. > >> > >> This might lead to the elimination of the local/global blob thing (I > >> think you've talked about that before) as now the only "local" blobs > >> are the unshared single extent blobs which are stored inline with the > >> lextent entry. You'll still have the special cases of promoting > >> unshared > >> (inline) blobs to global blobs -- which is probably similar to the > >> current promotion "magic" on a clone operation. > >> > >> The current refmap concept may require some additional work. I > >> believe that we'll have to do a reconstruction of the refmap, but > >> fortunately only for the range of the current I/O. That will be a bit > >> more expensive, but still less expensive than reconstructing the > >> entire refmap for every oNode deserialization, Fortunately I believe > >> the refmap is only really needed for compression cases or RBD cases > without "trim" > >> (this is the case to optimize -- it'll make trim really important for > >> performance). > >> > >> Best of both worlds???? > >> > >> Allen Samuels > >> SanDisk |a Western Digital brand > >> 2880 Junction Avenue, Milpitas, CA 95134 > >> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@xxxxxxxxxxx > >> > >> > >>> -----Original Message----- > >>> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel- > >>> owner@xxxxxxxxxxxxxxx] On Behalf Of Allen Samuels > >>> Sent: Friday, August 19, 2016 7:16 AM > >>> To: Sage Weil <sweil@xxxxxxxxxx> > >>> Cc: ceph-devel@xxxxxxxxxxxxxxx > >>> Subject: RE: bluestore blobs > >>> > >>>> -----Original Message----- > >>>> From: Sage Weil [mailto:sweil@xxxxxxxxxx] > >>>> Sent: Friday, August 19, 2016 6:53 AM > >>>> To: Allen Samuels <Allen.Samuels@xxxxxxxxxxx> > >>>> Cc: ceph-devel@xxxxxxxxxxxxxxx > >>>> Subject: RE: bluestore blobs > >>>> > >>>> On Fri, 19 Aug 2016, Allen Samuels wrote: > >>>>>> -----Original Message----- > >>>>>> From: Sage Weil [mailto:sweil@xxxxxxxxxx] > >>>>>> Sent: Thursday, August 18, 2016 8:10 AM > >>>>>> To: Allen Samuels <Allen.Samuels@xxxxxxxxxxx> > >>>>>> Cc: ceph-devel@xxxxxxxxxxxxxxx > >>>>>> Subject: RE: bluestore blobs > >>>>>> > >>>>>> On Thu, 18 Aug 2016, Allen Samuels wrote: > >>>>>>>> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel- > >>>>>>>> owner@xxxxxxxxxxxxxxx] On Behalf Of Sage Weil > >>>>>>>> Sent: Wednesday, August 17, 2016 7:26 AM > >>>>>>>> To: ceph-devel@xxxxxxxxxxxxxxx > >>>>>>>> Subject: bluestore blobs > >>>>>>>> > >>>>>>>> I think we need to look at other changes in addition to the > >>>>>>>> encoding performance improvements. Even if they end up being > >>>>>>>> good enough, these changes are somewhat orthogonal and at > least > >>>>>>>> one of them should give us something that is even faster. > >>>>>>>> > >>>>>>>> 1. I mentioned this before, but we should keep the encoding > >>>>>>>> bluestore_blob_t around when we load the blob map. If it's not > >>>>>>>> changed, don't reencode it. There are no blockers for > >>>>>>>> implementing this > >>>>>> currently. > >>>>>>>> It may be difficult to ensure the blobs are properly marked dirty... > >>>>>>>> I'll see if we can use proper accessors for the blob to enforce > >>>>>>>> this at compile time. We should do that anyway. > >>>>>>> > >>>>>>> If it's not changed, then why are we re-writing it? I'm having a > >>>>>>> hard time thinking of a case worth optimizing where I want to > >>>>>>> re-write the oNode but the blob_map is unchanged. Am I missing > >>>> something obvious? > >>>>>> > >>>>>> An onode's blob_map might have 300 blobs, and a single write only > >>>>>> updates one of them. The other 299 blobs need not be reencoded, > >>>>>> just > >>>> memcpy'd. > >>>>> > >>>>> As long as we're just appending that's a good optimization. How > >>>>> often does that happen? It's certainly not going to help the RBD > >>>>> 4K random write problem. > >>>> > >>>> It won't help the (l)extent_map encoding, but it avoids almost all > >>>> of the blob reencoding. A 4k random write will update one blob out > >>>> of > >>>> ~100 (or whatever it is). > >>>> > >>>>>>>> 2. This turns the blob Put into rocksdb into two memcpy stages: > >>>>>>>> one to assemble the bufferlist (lots of bufferptrs to each > >>>>>>>> untouched > >>>>>>>> blob) into a single rocksdb::Slice, and another memcpy > >>>>>>>> somewhere inside rocksdb to copy this into the write buffer. > >>>>>>>> We could extend the rocksdb interface to take an iovec so that > >>>>>>>> the first memcpy isn't needed (and rocksdb will instead iterate > >>>>>>>> over our buffers and copy them directly into its write buffer). > >>>>>>>> This is probably a pretty small piece of the overall time... > >>>>>>>> should verify with a profiler > >>>>>> before investing too much effort here. > >>>>>>> > >>>>>>> I doubt it's the memcpy that's really the expensive part. I'll > >>>>>>> bet it's that we're transcoding from an internal to an external > >>>>>>> representation on an element by element basis. If the iovec > >>>>>>> scheme is going to help, it presumes that the internal data > >>>>>>> structure essentially matches the external data structure so > >>>>>>> that only an iovec copy is required. I'm wondering how > >>>>>>> compatible this is with the current concepts of > lextext/blob/pextent. > >>>>>> > >>>>>> I'm thinking of the xattr case (we have a bunch of strings to > >>>>>> copy > >>>>>> verbatim) and updated-one-blob-and-kept-99-unchanged case: > >>>>>> instead of memcpy'ing them into a big contiguous buffer and > >>>>>> having rocksdb memcpy > >>>>>> *that* into it's larger buffer, give rocksdb an iovec so that > >>>>>> they smaller buffers are assembled only once. > >>>>>> > >>>>>> These buffers will be on the order of many 10s to a couple 100s > >>>>>> of > >>> bytes. > >>>>>> I'm not sure where the crossover point for constructing and then > >>>>>> traversing an iovec vs just copying twice would be... > >>>>> > >>>>> Yes this will eliminate the "extra" copy, but the real problem is > >>>>> that the oNode itself is just too large. I doubt removing one > >>>>> extra copy is going to suddenly "solve" this problem. I think > >>>>> we're going to end up rejiggering things so that this will be much > >>>>> less of a problem than it is now -- time will tell. > >>>> > >>>> Yeah, leaving this one for last I think... until we see memcpy show > >>>> up in the profile. > >>>> > >>>>>>>> 3. Even if we do the above, we're still setting a big (~4k or > >>>>>>>> more?) key into rocksdb every time we touch an object, even > >>>>>>>> when a tiny > >>>>> > >>>>> See my analysis, you're looking at 8-10K for the RBD random write > >>>>> case > >>>>> -- which I think everybody cares a lot about. > >>>>> > >>>>>>>> amount of metadata is getting changed. This is a consequence > >>>>>>>> of embedding all of the blobs into the onode (or bnode). That > >>>>>>>> seemed like a good idea early on when they were tiny (i.e., > >>>>>>>> just an extent), but now I'm not so sure. I see a couple of > >>>>>>>> different > >>>> options: > >>>>>>>> > >>>>>>>> a) Store each blob as ($onode_key+$blobid). When we load the > >>>>>>>> onode, load the blobs too. They will hopefully be sequential > >>>>>>>> in rocksdb (or definitely sequential in zs). Probably go back > >>>>>>>> to using an > >>>> iterator. > >>>>>>>> > >>>>>>>> b) Go all in on the "bnode" like concept. Assign blob ids so > >>>>>>>> that they are unique for any given hash value. Then store the > >>>>>>>> blobs as $shard.$poolid.$hash.$blobid (i.e., where the bnode is > >>>>>>>> now). Then when clone happens there is no onode->bnode > >>>>>>>> migration magic happening--we've already committed to storing > >>>>>>>> blobs in separate keys. When we load the onode, keep the > >>>>>>>> conditional bnode loading we already have.. but when the bnode > >>>>>>>> is loaded load up all the blobs for the hash key. (Okay, we > >>>>>>>> could fault in blobs individually, but that code will be more > >>>>>>>> complicated.) > >>>>> > >>>>> I like this direction. I think you'll still end up demand loading > >>>>> the blobs in order to speed up the random read case. This scheme > >>>>> will result in some space-amplification, both in the lextent and > >>>>> in the blob-map, it's worth a bit of study too see how bad the > >>>>> metadata/data ratio becomes (just as a guess, > >>>>> $shard.$poolid.$hash.$blobid is probably 16 + > >>>>> 16 + 8 + 16 bytes in size, that's ~60 bytes of key for each Blob > >>>>> -- unless your KV store does path compression. My reading of > >>>>> RocksDB sst file seems to indicate that it doesn't, I *believe* > >>>>> that ZS does [need to confirm]). I'm wondering if the current notion > of local vs. > >>>>> global blobs isn't actually beneficial in that we can give local > >>>>> blobs different names that sort with their associated oNode (which > >>>>> probably makes the space-amp worse) which is an important > >>>>> optimization. We do need to watch the space amp, we're going to be > >>>>> burning DRAM to make KV accesses cheap and the amount of DRAM > is > >>> proportional to the space amp. > >>>> > >>>> I got this mostly working last night... just need to sort out the > >>>> clone case (and clean up a bunch of code). It was a relatively > >>>> painless transition to make, although in its current form the blobs > >>>> all belong to the bnode, and the bnode if ephemeral but remains in > >>> memory until all referencing onodes go away. > >>>> Mostly fine, except it means that odd combinations of clone could > >>>> leave lots of blobs in cache that don't get trimmed. Will address that > later. > >>>> > >>>> I'll try to finish it up this morning and get it passing tests and posted. > >>>> > >>>>>>>> In both these cases, a write will dirty the onode (which is > >>>>>>>> back to being pretty small.. just xattrs and the lextent map) > >>>>>>>> and 1-3 blobs (also > >>>>>> now small keys). > >>>>> > >>>>> I'm not sure the oNode is going to be that small. Looking at the > >>>>> RBD random 4K write case, you're going to have 1K entries each of > >>>>> which has an offset, size and a blob-id reference in them. In my > >>>>> current oNode compression scheme this compresses to about 1 byte > per entry. > >>>>> However, this optimization relies on being able to cheaply > >>>>> renumber the blob-ids, which is no longer possible when the > >>>>> blob-ids become parts of a key (see above). So now you'll have a > >>>>> minimum of 1.5-3 bytes extra for each blob-id (because you can't > >>>>> assume that the blob-ids > >>>> become "dense" > >>>>> anymore) So you're looking at 2.5-4 bytes per entry or about > >>>>> 2.5-4K Bytes of lextent table. Worse, because of the variable > >>>>> length encoding you'll have to scan the entire table to > >>>>> deserialize it (yes, we could do differential editing when we > >>>>> write but that's another > >>> discussion). > >>>>> Oh and I forgot to add the 200-300 bytes of oNode and xattrs :). > >>>>> So while this looks small compared to the current ~30K for the > >>>>> entire thing oNode/lextent/blobmap, it's NOT a huge gain over > >>>>> 8-10K of the compressed oNode/lextent/blobmap scheme that I > published earlier. > >>>>> > >>>>> If we want to do better we will need to separate the lextent from > >>>>> the oNode also. It's relatively easy to move the lextents into the > >>>>> KV store itself (there are two obvious ways to deal with this, > >>>>> either use the native offset/size from the lextent itself OR > >>>>> create 'N' buckets of logical offset into which we pour entries -- > >>>>> both of these would add somewhere between 1 and 2 KV look-ups > per > >>>>> operation > >>>>> -- here is where an iterator would probably help. > >>>>> > >>>>> Unfortunately, if you only process a portion of the lextent > >>>>> (because you've made it into multiple keys and you don't want to > >>>>> load all of > >>>>> them) you no longer can re-generate the refmap on the fly (another > >>>>> key space optimization). The lack of refmap screws up a number of > >>>>> other important algorithms -- for example the overlapping blob-map > >>> thing, etc. > >>>>> Not sure if these are easy to rewrite or not -- too complicated to > >>>>> think about at this hour of the evening. > >>>> > >>>> Yeah, I forgot about the extent_map and how big it gets. I think, > >>>> though, that if we can get a 4mb object with 1024 4k lextents to > >>>> encode the whole onode and extent_map in under 4K that will be good > >>>> enough. The blob update that goes with it will be ~200 bytes, and > >>>> benchmarks aside, the 4k random write 100% fragmented object is a > >>>> worst > >>> case. > >>> > >>> Yes, it's a worst-case. But it's a > >>> "worst-case-that-everybody-looks-at" vs. a "worst-case-that-almost- > nobody-looks-at". > >>> > >>> I'm still concerned about having an oNode that's larger than a 4K block. > >>> > >>> > >>>> > >>>> Anyway, I'll get the blob separation branch working and we can go > >>>> from there... > >>>> > >>>> sage > >>> -- > >>> To unsubscribe from this list: send the line "unsubscribe > >>> ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx > >>> More majordomo info at http://vger.kernel.org/majordomo-info.html > >> -- > >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" > >> in the body of a message to majordomo@xxxxxxxxxxxxxxx More > majordomo > >> info at http://vger.kernel.org/majordomo-info.html > >> > >> > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the > body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at > http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html