Hi Igor,

Thanks a lot for this. Do you also consider supporting offline compression (via a background task, or at least something not in the main IO path)? Will the current proposal allow this, and do you consider this to be a useful option at all?

My concern is the performance impact of compression, though obviously I don't yet know whether it will be significant. I'm also concerned about adding more complexity.

I would love to know your thoughts on this.

Thanks,
Vikas

> -----Original Message-----
> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Igor Fedotov
> Sent: Friday, March 18, 2016 8:54 AM
> To: Sage Weil
> Cc: Allen Samuels; ceph-devel
> Subject: Re: Adding compression support for bluestore.
>
> On 17.03.2016 18:33, Sage Weil wrote:
> > I'd say "maybe". It's easy to say we should focus on read performance
> > now, but as soon as we have "support for compression" everybody is
> > going to want to turn it on on all of their clusters to spend less
> > money on hard disks. That will definitely include RBD users, where
> > write latency is very important. I'm hesitant to take an architectural
> > direction that locks us in. With something layered over BlueStore I
> > think we're forced to do it all in the initial phase; with the
> > monolithic approach that integrates it into BlueStore's write path we
> > have the option to do either one--perhaps based on the particular
> > request or hints or whatever.
> >>>>> What do you think?
> >>>>>
> >>>>> It would be nice to choose a simpler strategy for the first pass that
> >>>>> handles a subset of write patterns (i.e., sequential writes, possibly
> >>>>> unaligned) that is still a step in the direction of the more robust
> >>>>> strategy we expect to implement after that.
> >>>>>
> >>>> I'd probably agree, but... I don't see a good way to implement
> >>>> compression for specific write patterns only.
> >>>> We need to either ensure that these patterns are used exclusively
> >>>> (append-only / sequential-only flags?) or provide some means to fall
> >>>> back to regular mode when an inappropriate write occurs.
> >>>> I don't think either option is good and/or easy enough.
> >>> Well, if we simply don't implement a garbage collector, then for
> >>> sequential+aligned writes we don't end up with stuff that needs garbage
> >>> collection. Even the sequential case might be doable if we make it
> >>> possible to fill the extent with a sequence of compressed strings (as long
> >>> as we haven't reached the compressed length, try to restart the
> >>> decompression stream).
> >> It's still unclear to me whether such specific patterns should be exclusively
> >> applied to the object (e.g. by using a specific object creation mode), or
> >> whether we should detect them automatically and fall back to a regular
> >> write (i.e. disable compression) when a write doesn't conform to the
> >> supported pattern.
> > I think initially supporting only the append workload is a simple check
> > for whether the offset == the object size (and maybe whether it is
> > aligned). No persistent flags or hints needed there.
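[For illustration: a minimal sketch in C++ of the append-only check Sage describes above. The names offset, onode_size, and min_alloc_size are assumptions made here for readability, not BlueStore's actual fields.

    #include <cstdint>

    // Hypothetical helper: under a first-pass "append-only" compression
    // policy, a write is eligible for compression only if it starts exactly
    // at the current end of the object (a pure append) and, optionally, is
    // aligned to the allocation unit.
    static bool may_compress_append(uint64_t offset, uint64_t onode_size,
                                    uint64_t min_alloc_size) {
      bool is_append  = (offset == onode_size);          // not an overwrite
      bool is_aligned = (offset % min_alloc_size == 0);  // optional extra check
      return is_append && is_aligned;
    }

A write that fails such a check would simply go down the uncompressed path, which is why no persistent flags or hints would be needed.]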
> Well, but issues appear immediately after some overwrite request takes
> place. How do we handle overwrites? Do we compress the overwritten data
> or not? If not, we need some way to merge compressed and uncompressed
> blocks. And so on and so forth.
> IMO it's hard (or even impossible) to apply compression for specific
> write patterns only unless you prohibit the other ones.
> We can support a subset of compression policies (i.e. the ways we
> resolve compression issues: RMW at init phase, lazy overwrite, WAL use,
> etc.) but not a subset of write patterns.
>
> >> And I'm not following the idea about "a sequence of compressed strings".
> >> Could you please elaborate?
> > Let's say we have 32KB compressed_blocks, and the client is doing
> > 1000-byte appends. We will allocate a 32KB chunk on disk, and only fill
> > it with say ~500 bytes of compressed data. When the next write comes
> > around, we could compress it too and append it to the block without
> > decompressing the previous string.
> >
> > By string I mean that each compression cycle looks something like
> >
> >   start(...)
> >   while (more data)
> >     compress_some_stuff(...)
> >   finish(...)
> >
> > i.e., there's a header and maybe a footer in the compressed string. If we
> > are decompressing and the decompressor says "done" but there is more data
> > in our compressed block, we could repeat the process until we get to the
> > end of the compressed data.
> Got it, thanks for the clarification.
> > But it might not matter or be worth it. If the compressed blocks are
> > smallish then decompressing, appending, and recompressing isn't going to
> > be that expensive anyway. I'm mostly worried about small appends, e.g. by
> > rbd mirroring (imagine 4 KB writes + some metadata) or the MDS journal.
> That's mainly about small appends, not small writes, right?
>
> At this point I agree with Allen that we need variable policies to handle
> compression. Most probably we won't be able to create a single one that
> fits every write pattern perfectly.
> The only concern about that is the complexity of such a task...
>
> > sage
>
> Thanks,
> Igor
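[For reference, a rough sketch in C++ of the "sequence of compressed strings" idea discussed above. The Compressor interface below is purely hypothetical (it is not BlueStore's actual compressor plugin API); it only illustrates appending an independently compressed string to a partially filled block, and restarting decompression on read until the stored data is exhausted.

    #include <cstddef>
    #include <string>

    // Hypothetical compressor interface, assumed for this sketch only.
    struct Compressor {
      // Compress 'in' into a self-delimiting stream and append it to 'out'.
      virtual void compress(const std::string& in, std::string* out) = 0;
      // Decompress the single stream starting at 'pos', append the plaintext
      // to 'out', and return the position just past the end of that stream.
      virtual size_t decompress_one(const std::string& in, size_t pos,
                                    std::string* out) = 0;
      virtual ~Compressor() {}
    };

    // Append path: compress the new data and tack the resulting string onto
    // the partially filled compressed block, without decompressing what is
    // already stored there.
    void append_to_compressed_block(Compressor& c, const std::string& data,
                                    std::string* compressed_block) {
      c.compress(data, compressed_block);
    }

    // Read path: the block is a concatenation of independent compressed
    // strings, so keep restarting the decompression stream until we reach
    // the end of the stored compressed data.
    std::string read_compressed_block(Compressor& c,
                                      const std::string& compressed_block) {
      std::string plain;
      size_t pos = 0;
      while (pos < compressed_block.size())
        pos = c.decompress_one(compressed_block, pos, &plain);
      return plain;
    }

Whether this is worth doing depends, as Sage notes, on the size of the compressed blocks: for smallish blocks, decompressing, appending, and recompressing may be cheap enough anyway.]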