> -----Original Message----- > From: Igor Fedotov [mailto:ifedotov@xxxxxxxxxxxx] > Sent: Monday, March 21, 2016 7:08 AM > To: Allen Samuels <Allen.Samuels@xxxxxxxxxxx>; Sage Weil > <sage@xxxxxxxxxxxx> > Cc: ceph-devel <ceph-devel@xxxxxxxxxxxxxxx> > Subject: Re: Adding compression support for bluestore. > > Allen, > > please find my comment inline > > On 19.03.2016 6:14, Allen Samuels wrote: > > If we're going to both allow compression and delayed overwrite we simply > have to handle the case where new data actually overlaps with previous data > -- recursively. If I understand the current code, it handles exactly one layer of > overlay which is always stored in KV store. We need to generalize this data > structure. I'm going to outline a proposal, which If I get wrong, I beg > forgiveness -- I'm not as familiar with this code as I would like, especially the > ref-counted shared extent stuff. But I'm going to blindly dive in and assume > that Sage will correct me when I go off the tracks -- and therefore end up > learning how all of this stuff REALLY works. > > > > I propose that the current bluestore_extent_t and bluestore_overlay_t be > essentially unified into a single structure with a typemark to distinguish > between being in KV store or in raw block storage. Here's an example: (for > this discussion, BLOCK_SIZE is 4K and is the minimum physical I/O size). > That's a good idea to have uniform structure. > However it's not cleat to me if you are suggesting to have single container for > both locations. I think you aren't. > In this case most probably you don't need to have 'location' inside the > structure. It's rather an uncommon case when instances from different > locations are mixed and processed togather. And code that deals with it > usually is aware of structure instance's origin. Even if that's not true > - additional external parameter can be provided. > > Struct bluestore_extent_t { > > Uint64_t logical_size; // size of data before any > compression. MUST BE AN INTEGER MULTIPLE of BLOCK_SIZE (and != 0) > > Uint64_t physical_size; // size of data on physical media > (yes, this is unneeded when location == KV, the serialize/deserialize could > compress this out -- but this is an unneeded optimization > > Uint64_t location:1; // values (in ENUM form) are "KV" > and "BLOCK" > > Uint64_t compression_alg:4; // compression algorithm... > > Uint64_t otherflags:xx; // round it out. > > Uint64_t media_address; // forms Key when location == KV > block address when location == BLOCK > > Vector<uint32_t> checksums; // Media checksums. See > commentary on this below. > > }; > > > > This allows any amount of compressed or uncompressed data to be > identified in either a KV key or a block store. > > > > W.r.t. the vector of checksums. There is lots of flexibility here which > probably isn't worth dealing with. The simplest solution is to always generate > a checksum for each BLOCK_SIZE-sized I/O independent of whether the data > is compressed or not and independent of any physical I/O size. This means > that data integrity on a read is checked AFTER decompression. This has a > side-effect of requiring the decompressor to be "safe" in the presence of > incorrect/corrupted data [not all decompressors have this property]. > Alternatively for compressed data we use a single checksum (regardless of > size) that covers only the compressed data. This allows the checksum to be > checked before decompression. The second scheme has somewhat better > performance as it checksums data AFTER compression (i.e., less data). > Another potentially important flag here is to suppress bluestore checksum > generation and checking -- some compression/decompression algorithms > already do their own data integrity checks and it's silly to pay for that twice. > Also, for data where a low BER is tolerable you might entertain the notion of > skipping checksum generation. A null vector of checksums would certainly > describe these situations (a flag could be added too). > Sounds good in general. But I'd prefer to start checksum implementation > discussions from the goals and use cases - for what purposes we are planning > to use them? What procedures benefit from that use? And how they do > that? > May be we should have a separate topic for that? Making a separate discussion thread is a good idea. I was simply pointing out some of the interactions between checksums and compression. I agree that having a goals and use cases discussion is the best place to start. > > > Once this unification is complete, you no longer need both "block_map" > and "overlap_map" in onode_t. you just have an extent map. But in order to > handle the lazy-recovery overlap schemes described above you must > provide some kind of "priority" information so that when you have two > extents that overlap you know which has the live data and which has the > dead data. This is very easy to express in the onode_t data structure simply > by making the extent map be an array of maps. The indexes in the array > implicitly provide the priority ordering that you need. Now, when an write > operation that generates the lazy-recovery scheme happens it just expands > the size of the array and inserts the new extent there. Here's an example. > > > > First we write blocks 0..10. This leaves the map[0] with one extent(0..10). > > Now we lazy-overwrite blocks 2 and 3. We create map[1] and insert > extent(2..3). But map[0] still contains extent(0..10). > > We can even lazy-overwrite block 3. We create map[2] and insert > > extent(3). Map[1] still has extent(2..3) and Map[0] still has extent(0..10); > Now we write block 50. That extent could technically go into any index of the > map. > > > > Yes, searching this data structure for the correct information IS more > > complicated than the current simple maps. But it's not hopelessly > > complicated and ought to be something that a good unit test can pretty > > much exhaustively cover with a little bit of thought. [I've skipped > > over thinking about refcounted extents as I don't fully understand > > those yet -- but they shouldn't be a fundamental problem] > That's an interesting proposal but I can see following caveats here (I beg > pardon I misunderstood something): > 1) Potentially uncontrolled extent map growth when extensive (over)writing > takes place. Yes, a naïve insertion policy could lead to uncontrolled growth, but I don't think this needs to be the case. I assume that when you add an "extent", you won't increase the size of the array unnecessarily, i.e., if the new extent doesn't overlap an existing extent then there's no reason to increase the size of the map array -- actually you want to insert the new extent at the <smallest> array index that doesn't overlap, only increasing the array size when that's not possible. I'm not 100% certain of the worst case, but I believe that it's limited to the ratio between the largest extent and the smallest extent. (i.e., if we assume writes are no larger than -- say -- 1MB and the smallest are 4K, then I think the max depth of the array is 1M/4K => 2^8, 256. Which is ugly but not awful -- since this is probably a contrived case. This might be a reason to limit the largest extent size to something a bit smaller (say 256K)... > 2) Read/Lookup algorithmic complexity. To find valid block (or detect > overwrite) one should sequentially enumerate the full array. Given 1) that > might be very ineffective. Only requires one log2 lookup for each index of the array. > 3) It's not dealing with unaligned overwrites. What happens when some > block is partially overwritten? I'm not sure I understand what cases you're referring to. Can you give an example? > > I'm going to publish a bit different approach that hopefully handles these > issues promptly. Stay tuned.. Great! Looking forward to seeing it. > > > > This data structure fully enables any of the combinations of compression > and overwriting that we've discussed. In particular it allows free intermixing > of compressed and non-compressed data and fully supports any kind of > delay-merging of compressed data (lazy space recovery) with data being > written to either KV store or block store. > > > > BTW, I'm not sure how the current in-memory WAL queue is recovered > after a crash/restart. Presumably there's an entry in the KV store to denote > the presence of the WAL queue entry. That logical may require modification > for cases where we are delaying the merge but the overlay data is actually in > block storage. This might require some minor re-tweaking with this proposal. > > > > A question for Sage: > > > > Does the current WAL logic attempt to induce delay for the merging? It > seems like there are potentially lots of situations where multiple overwrite > are happening to the same object but are dispersed in time (slightly). Do we > attempt to merge this in the WAL queue? > > > > > > Allen Samuels > > Software Architect, Fellow, Systems and Software Solutions > > > > 2880 Junction Avenue, San Jose, CA 95134 > > T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@xxxxxxxxxxx > > > >> -----Original Message----- > >> From: Igor Fedotov [mailto:ifedotov@xxxxxxxxxxxx] > >> Sent: Friday, March 18, 2016 10:54 AM > >> To: Sage Weil <sage@xxxxxxxxxxxx> > >> Cc: Allen Samuels <Allen.Samuels@xxxxxxxxxxx>; ceph-devel <ceph- > >> devel@xxxxxxxxxxxxxxx> > >> Subject: Re: Adding compression support for bluestore. > >> > >> > >> > >> On 17.03.2016 18:33, Sage Weil wrote: > >>> I'd say "maybe". It's easy to say we should focus on read > >>> performance now, but as soon as we have "support for compression" > >>> everybody is going to want to turn it on on all of their clusters to > >>> spend less money on hard disks. That will definitely include RBD > >>> users, where write latency is very important. I'm hesitant to take > >>> an architectural direction that locks us in. With something layered > >>> over BlueStore I think we're forced to do it all in the initial > >>> phase; with the monolithic approach that integrates it into > >>> BlueStore's write path we have the option to do either one--perhaps > >>> based on the particular request or hints or whatever. > >>>>>>> What do you think? > >>>>>>> > >>>>>>> It would be nice to choose a simpler strategy for the first pass > >>>>>>> that handles a subset of write patterns (i.e., sequential > >>>>>>> writes, possibly > >>>>>>> unaligned) that is still a step in the direction of the more > >>>>>>> robust strategy we expect to implement after that. > >>>>>>> > >>>>>> I'd probably agree but.... I don't see a good way how one can > >>>>>> implement compression for specific write patterns only. > >>>>>> We need to either ensure that these patterns are used exclusively > >>>>>> ( append only / sequential only flags? ) or provide some means to > >>>>>> fall back to regular mode when inappropriate write occurs. > >>>>>> Don't think both are good and/or easy enough. > >>>>> Well, if we simply don't implement a garbage collector, then for > >>>>> sequential+aligned writes we don't end up with stuff that needs > >>>>> sequential+garbage > >>>>> collection. Even the sequential case might be doable if we make > >>>>> it possible to fill the extent with a sequence of compressed > >>>>> strings (as long as we haven't reached the compressed length, try > >>>>> to restart the decompression stream). > >>>> It's still unclear to me if such specific patterns should be > >>>> exclusively applied to the object. E.g. by using specific object > >>>> creation > >> mode mode. > >>>> Or we should detect them automatically and be able to fall back to > >>>> regular write ( i.e. disable compression ) when write doesn't > >>>> conform to the supported pattern. > >>> I think initially supporting only the append workload is a simple > >>> check for whether the offset == the object size (and maybe whether > >>> it is aligned). No persistent flags or hints needed there. > >> Well, but issues appear immediately after some overwrite request > >> takes place. > >> How to handle overwrites? To do compression for the overwritten or not? > >> If not - we need some way to be able to merge compressed and > >> uncompressed blocks. And so on and so forth IMO it's hard (or even > >> impossible) to apply compression for specific write patterns only > >> unless you prohibit other ones. > >> We can support a subset of compression policies ( i.e. ways how we > >> resolve compression issues: RMW at init phase, lazy overwrite, WAL > >> use, etc ) but not a subset of write patterns. > >> > >>>> And I'm not following the idea about "a sequence of compressed > >>>> strings". Could you please elaborate? > >>> Let's say we have 32KB compressed_blocks, and the client is doing > >>> 1000 byte appends. We will allocate a 32 chunk on disk, and only > >>> fill it with say ~500 bytes of compressed data. When the next write > >>> comes around, we could compress it too and append it to the block > >>> without decompressing the previous string. > >>> > >>> By string I mean that each compression cycle looks something like > >>> > >>> start(...) > >>> while (more data) > >>> compress_some_stuff(...) > >>> finish(...) > >>> > >>> i.e., there's a header and maybe a footer in the compressed string. > >>> If we are decompressing and the decompressor says "done" but there > >>> is more data in our compressed block, we could repeat the process > >>> until we get to the end of the compressed data. > >> Got it, thanks for clarification > >>> But it might not matter or be worth it. If the compressed blocks > >>> are smallish then decompressing, appending, and recompressing isn't > >>> going to be that expensive anyway. I'm mostly worried about small > >>> appends, e.g. by rbd mirroring (imaging 4 KB writes + some metadata) > >>> or the MDS > >> journal. > >> That's mainly about small appends not small writes, right? > >> > >> At this point I agree with Allen that we need variable policies to > >> handle compression. Most probably we wouldn't be able to create > >> single one that fits perfect for any write pattern. > >> The only concern about that is the complexity of such a task... > >>> sage > >> Thanks, > >> Igor > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html