> -----Original Message-----
> From: Vikas Sinha-SSI [mailto:v.sinha@xxxxxxxxxxxxxxx]
> Sent: Friday, March 18, 2016 12:18 PM
> To: Igor Fedotov <ifedotov@xxxxxxxxxxxx>; Sage Weil <sage@xxxxxxxxxxxx>
> Cc: Allen Samuels <Allen.Samuels@xxxxxxxxxxx>; ceph-devel <ceph-devel@xxxxxxxxxxxxxxx>
> Subject: RE: Adding compression support for bluestore.
>
> Hi Igor,
> Thanks a lot for this. Do you also consider supporting offline compression
> (via a background task, or at least something not in the main IO path)?
> Will the current proposal allow this, and do you consider it a useful
> option at all? My concern is the performance impact of compression,
> though I don't yet know whether it will be significant. I'm also
> concerned about adding more complexity.
> I would love to know your thoughts on this.
> Thanks,
> Vikas

The revised extent map proposal that I sent earlier would directly support this capability. There's no reason a policy of doing NO inline compression couldn't be implemented, followed by a background (WAL-based or even deep-scrub-based) compression pass. This is yet another reason why separating policy from mechanism is important.

> > -----Original Message-----
> > From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx]
> > On Behalf Of Igor Fedotov
> > Sent: Friday, March 18, 2016 8:54 AM
> > To: Sage Weil
> > Cc: Allen Samuels; ceph-devel
> > Subject: Re: Adding compression support for bluestore.
> >
> > On 17.03.2016 18:33, Sage Weil wrote:
> > > I'd say "maybe". It's easy to say we should focus on read
> > > performance now, but as soon as we have "support for compression"
> > > everybody is going to want to turn it on on all of their clusters
> > > to spend less money on hard disks. That will definitely include
> > > RBD users, where write latency is very important. I'm hesitant to
> > > take an architectural direction that locks us in.
> > > With something layered over BlueStore I think we're forced to do
> > > it all in the initial phase; with the monolithic approach that
> > > integrates it into BlueStore's write path we have the option to do
> > > either one--perhaps based on the particular request or hints or
> > > whatever.
> > >>>>> What do you think?
> > >>>>>
> > >>>>> It would be nice to choose a simpler strategy for the first
> > >>>>> pass that handles a subset of write patterns (i.e., sequential
> > >>>>> writes, possibly unaligned) that is still a step in the
> > >>>>> direction of the more robust strategy we expect to implement
> > >>>>> after that.
> > >>>>>
> > >>>> I'd probably agree, but... I don't see a good way to implement
> > >>>> compression for specific write patterns only.
> > >>>> We need to either ensure that these patterns are used
> > >>>> exclusively (append only / sequential only flags?) or provide
> > >>>> some means to fall back to regular mode when an inappropriate
> > >>>> write occurs.
> > >>>> I don't think either option is good and/or easy enough.
> > >>> Well, if we simply don't implement a garbage collector, then for
> > >>> sequential+aligned writes we don't end up with stuff that needs
> > >>> garbage collection. Even the sequential case might be doable if
> > >>> we make it possible to fill the extent with a sequence of
> > >>> compressed strings (as long as we haven't reached the compressed
> > >>> length, try to restart the decompression stream).
> > >> It's still unclear to me whether such specific patterns should be
> > >> exclusively applied to the object, e.g. by using a specific
> > >> object creation mode, or whether we should detect them
> > >> automatically and fall back to a regular write (i.e. disable
> > >> compression) when a write doesn't conform to the supported
> > >> pattern.
> > > I think initially supporting only the append workload is a simple
> > > check for whether the offset == the object size (and maybe whether
> > > it is aligned). No persistent flags or hints needed there.
> > Well, but issues appear immediately after some overwrite request
> > takes place. How do we handle overwrites? Do we compress the
> > overwritten data or not? If not, we need some way to merge
> > compressed and uncompressed blocks. And so on and so forth. IMO it's
> > hard (or even impossible) to apply compression for specific write
> > patterns only unless you prohibit the other ones.
> > We can support a subset of compression policies (i.e. ways to
> > resolve compression issues: RMW at init phase, lazy overwrite, WAL
> > use, etc.) but not a subset of write patterns.
> >
> > >> And I'm not following the idea about "a sequence of compressed
> > >> strings". Could you please elaborate?
> > > Let's say we have 32KB compressed_blocks, and the client is doing
> > > 1000 byte appends. We will allocate a 32KB chunk on disk, and only
> > > fill it with, say, ~500 bytes of compressed data. When the next
> > > write comes around, we could compress it too and append it to the
> > > block without decompressing the previous string.
> > >
> > > By string I mean that each compression cycle looks something like
> > >
> > >   start(...)
> > >   while (more data)
> > >     compress_some_stuff(...)
> > >   finish(...)
> > >
> > > i.e., there's a header and maybe a footer in the compressed
> > > string. If we are decompressing and the decompressor says "done"
> > > but there is more data in our compressed block, we could repeat
> > > the process until we get to the end of the compressed data.
> > Got it, thanks for the clarification.
> > > But it might not matter or be worth it. If the compressed blocks
> > > are smallish then decompressing, appending, and recompressing
> > > isn't going to be that expensive anyway.
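A minimal sketch of this "sequence of compressed strings" idea: each append is compressed as an independent, self-framed stream and concatenated onto the block, and the reader restarts decompression each time the decompressor reports end-of-stream while bytes remain. Python's zlib is used purely for illustration (its streams carry the header and checksum trailer described above); the function names are invented and this is not BlueStore code.

```python
import zlib

def append_compressed(block: bytearray, data: bytes) -> None:
    # Compress this append as a self-contained zlib stream (its own
    # header and checksum) and concatenate it onto the block, without
    # touching the previously written streams.
    block += zlib.compress(data)

def read_block(block: bytes) -> bytes:
    # Decompress a block holding a sequence of concatenated streams:
    # when the decompressor reports end-of-stream but bytes remain
    # (unused_data), restart decompression on the remainder.
    out = bytearray()
    rest = bytes(block)
    while rest:
        d = zlib.decompressobj()
        out += d.decompress(rest)
        out += d.flush()
        rest = d.unused_data
    return bytes(out)
```

zlib's `unused_data` attribute exposes exactly the condition Sage describes: the decompressor says "done" but there is more data left in the compressed block.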
> > > I'm mostly worried about small appends, e.g. by rbd mirroring
> > > (imagine 4KB writes + some metadata) or the MDS journal.
> > That's mainly about small appends, not small writes, right?
> >
> > At this point I agree with Allen that we need variable policies to
> > handle compression. Most probably we won't be able to create a
> > single one that fits every write pattern perfectly.
> > The only concern is the complexity of such a task...
> > > sage
> > Thanks,
> > Igor
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > in the body of a message to majordomo@xxxxxxxxxxxxxxx
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
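Sage's append-only check earlier in the thread (compress only when the write offset equals the current object size, and perhaps is aligned), combined with Igor's point about falling back to the regular path for any other write, amounts to a per-write policy decision. A toy illustration with invented names, not actual BlueStore logic:

```python
def choose_write_mode(offset: int, object_size: int,
                      min_alloc: int = 4096) -> str:
    # Compress only a "simple append": the write starts exactly at the
    # current end of the object and on an allocation-aligned boundary.
    # Anything else takes the regular uncompressed path, so no
    # persistent flags or hints are needed.
    if offset == object_size and offset % min_alloc == 0:
        return "compress"
    return "plain"
```

An overwrite (offset below the object size) or an unaligned append both fall back to "plain", which sidesteps the merge-compressed-and-uncompressed problem Igor raises at the cost of leaving those writes uncompressed.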