> -----Original Message----- > From: Igor Fedotov [mailto:ifedotov@xxxxxxxxxxxx] > Sent: Monday, March 21, 2016 11:31 AM > To: Allen Samuels <Allen.Samuels@xxxxxxxxxxx>; Sage Weil > <sage@xxxxxxxxxxxx> > Cc: ceph-devel <ceph-devel@xxxxxxxxxxxxxxx> > Subject: Re: Adding compression support for bluestore. > > > > On 21.03.2016 20:14, Allen Samuels wrote: > > After reading your proposal and Sage's response, I have a hybrid proposal > that I think is BOBW... > > > > This proposal is based on Sage's comments about there being two different > structures. > > > > Leveraging my proposal for a unified extent (included by reference). For > our purpose each extent is fully self-contained, i.e., logical address/size, > physical address/size, checksums, flags, etc.... > It seems we don't need logical addr/size in this structure any more - it should > be within the logical block map. See below... > > The onode value (i.e., the one in the KV storage) contains only LIST of > extents (yes a list, read on!). The list is ordered temporally, this provides the > necessary ordering information to disambiguate between two extents that > overlap/occlude the same logical address range. > And I'm not sure if we need any temporal order in this extent list If logical > block map is maintained properly, i.e. proper entries are inserted on > overwrite and overwritten ones are updated. We just need a > (implicit?) collection of such extents and an ability to reference its' > specific entries. > Under "implicit" collection I mean we don't need to collect them within a > single data structure ( map, list, array, whatever..) - we just need some > reference to the entry from logical block map and corresponding reference > counting mechanics. > Serializing to be considered though... Yes, if you serialize both data structures you can eliminate the duplicate information and remove any ordering constraint on the extent list/map. It's not clear to me if there are any space or time advantages/disadvantages to either scheme. I have a mild preference for my rebuilding scheme because it simplifies the on-disk (well, in-KV ;-) data structure, eliminating a lot of complicated internal links and consistency checks that will be required for the "serialize both structures" approach (check all reference counts, check for orphan extents, etc.). But it seems to me that the choice is primarily esthetic and should be made by whoever actually does the implementation. > > > The in-memory onode contains an auxilliary map that's ordered by logical > address. Entries in the map point to the extent that provides the data for > that logical address range. There can be multiple entries in the map that > reference the same element of the list. This map cheaply disambiguates > between multiple extents that overlap in logical address space. This is the > case that Sage pointed out causes problems for the data structure that you > propose. In my proposal, you'll have two entries in the logical map pointing to > the same entry in the extent map. > > > > We reference count between the logical map and the extent list to know > when an extent no longer contributes data to this object (and hence can be > removed from the list and potentially have it's space recovered). > Yes. And such a map ( AKA logical block map ) has to contain (logical?) > address/size ( I'd prefer to name it a pointer or descriptor for a region within > a physical extent ) and a reference to an extent. Exactly as in Sage's > comment. > > Thus Sage proposal to be simply extended with reference counting for raw > extents. > > > Costs for this dual data structure are: > > > > Lookup, insert and delete are all O(log2 N) and are essentially equivalent to > the interval_set (with some extra logic for the reference counted extents). > > > > Construction of the logical map when an onode is de-serialized is just O(N) > inserts of an empty list -- something like O(log N+1), but not exactly. I think > it's something like log(1) + log(2) + log(3) + log(4)..... Not really sure what that > adds up to, but it's waaay less than O(N). > > > > There may be additional potential benefits to having the extent list be > temporally ordered. You might be able to infer all sorts of behavioral > information by examining that list... > > > > > >> -----Original Message----- > >> From: Igor Fedotov [mailto:ifedotov@xxxxxxxxxxxx] > >> Sent: Monday, March 21, 2016 9:36 AM > >> To: Allen Samuels <Allen.Samuels@xxxxxxxxxxx>; Sage Weil > >> <sage@xxxxxxxxxxxx> > >> Cc: ceph-devel <ceph-devel@xxxxxxxxxxxxxxx> > >> Subject: Re: Adding compression support for bluestore. > >> > >> > >> > >> On 21.03.2016 18:14, Allen Samuels wrote: > >>>> That's an interesting proposal but I can see following caveats here > >>>> (I beg pardon I misunderstood something): > >>>> 1) Potentially uncontrolled extent map growth when extensive > >>>> (over)writing takes place. > >>> Yes, a naïve insertion policy could lead to uncontrolled growth, but > >>> I don't > >> think this needs to be the case. I assume that when you add an > >> "extent", you won't increase the size of the array unnecessarily, > >> i.e., if the new extent doesn't overlap an existing extent then > >> there's no reason to increase the size of the map array -- actually > >> you want to insert the new extent at the <smallest> array index that > >> doesn't overlap, only increasing the array size when that's not > >> possible. I'm not 100% certain of the worst case, but I believe that > >> it's limited to the ratio between the largest extent and the smallest > >> extent. (i.e., if we assume writes are no larger than -- say -- 1MB and the > smallest are 4K, then I think the max depth of the array is 1M/4K => 2^8, 256. > >> Which is ugly but not awful -- since this is probably a contrived > >> case. This might be a reason to limit the largest extent size to > >> something a bit smaller (say 256K)... > >> It looks like I misunderstood something... It seemed to me that your > >> array grows depending on the maximum amount of block versions. > >> Imagine you have 1000 writes 0~4K and 1000 writes 8K~4K I supposed > >> that this will create following array: > >> [ > >> 0: <0:{...,4K}, 8K:{...4K}>, > >> ... > >> 999: <0:{...,4K}, 8K:{...4K}>, > >> ] > >> > >> what's happening in your case? > >> > >>>> 2) Read/Lookup algorithmic complexity. To find valid block (or > >>>> detect > >>>> overwrite) one should sequentially enumerate the full array. Given > >>>> 1) that might be very ineffective. > >>> Only requires one log2 lookup for each index of the array. > >> This depends on 1) thus still unclear at the moment. > >>>> 3) It's not dealing with unaligned overwrites. What happens when > >>>> some block is partially overwritten? > >>> I'm not sure I understand what cases you're referring to. Can you > >>> give an > >> example? > >> Well, as far as I understand in the proposal above you were operating > >> the entire blocks (i.e. 4K data) Thus overwriting the block is a > >> simple case - you just need to create a new block "version" and insert it > into an array. > >> But real user writes seems to be unaligned to block size. > >> E.g. > >> write 0~2048 > >> write 1024~3072 > >> > >> you have to either track both blocks or merge them. The latter is a > >> bit tricky for the compression case. > >> > >> > >> ��.n��������+%������w��{.n����z��u���ܨ}���Ơz�j:+v�����w����ޙ��&�)ߡ�a����z�ޗ���ݢj��w�f