Hi Igor,

Thanks a lot for this. Do you also consider supporting offline compression (via a background task, or at least something not in the main IO path)? Will the current proposal allow this, and do you consider this to be a useful option at all?

My concern is the performance impact of compression, though obviously I don't yet know whether it will be significant. I'm also concerned about adding more complexity.

I would love to know your thoughts on this.

Thanks,
Vikas

> -----Original Message-----
> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Igor Fedotov
> Sent: Friday, March 18, 2016 8:54 AM
> To: Sage Weil
> Cc: Allen Samuels; ceph-devel
> Subject: Re: Adding compression support for bluestore.
>
> On 17.03.2016 18:33, Sage Weil wrote:
> > I'd say "maybe". It's easy to say we should focus on read performance
> > now, but as soon as we have "support for compression" everybody is
> > going to want to turn it on on all of their clusters to spend less
> > money on hard disks. That will definitely include RBD users, where
> > write latency is very important. I'm hesitant to take an architectural
> > direction that locks us in. With something layered over BlueStore I
> > think we're forced to do it all in the initial phase; with the
> > monolithic approach that integrates it into BlueStore's write path we
> > have the option to do either one--perhaps based on the particular
> > request or hints or whatever.
> >>>>> What do you think?
> >>>>>
> >>>>> It would be nice to choose a simpler strategy for the first pass that
> >>>>> handles a subset of write patterns (i.e., sequential writes, possibly
> >>>>> unaligned) that is still a step in the direction of the more robust
> >>>>> strategy we expect to implement after that.
> >>>>>
> >>>> I'd probably agree, but... I don't see a good way to implement
> >>>> compression for specific write patterns only.
> >>>> We need to either ensure that these patterns are used exclusively
> >>>> (append-only / sequential-only flags?) or provide some means to fall
> >>>> back to regular mode when an inappropriate write occurs.
> >>>> I don't think either option is good and/or easy enough.
> >>> Well, if we simply don't implement a garbage collector, then for
> >>> sequential+aligned writes we don't end up with stuff that needs garbage
> >>> collection. Even the sequential case might be doable if we make it
> >>> possible to fill the extent with a sequence of compressed strings (as long
> >>> as we haven't reached the compressed length, try to restart the
> >>> decompression stream).
> >> It's still unclear to me whether such specific patterns should be exclusively
> >> applied to the object (e.g. by using a specific object creation mode), or
> >> whether we should detect them automatically and fall back to a regular
> >> write (i.e. disable compression) when a write doesn't conform to the
> >> supported pattern.
> > I think initially supporting only the append workload is a simple check
> > for whether the offset == the object size (and maybe whether it is
> > aligned). No persistent flags or hints needed there.
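[For illustration: a minimal sketch in C++ of the append-only check Sage describes above. The names offset, onode_size, and min_alloc_size are assumptions made here for readability, not BlueStore's actual fields.

    #include <cstdint>

    // Hypothetical helper: under a first-pass "append-only" compression
    // policy, a write is eligible for compression only if it starts exactly
    // at the current end of the object (a pure append) and, optionally, is
    // aligned to the allocation unit.
    static bool may_compress_append(uint64_t offset, uint64_t onode_size,
                                    uint64_t min_alloc_size) {
      bool is_append  = (offset == onode_size);          // not an overwrite
      bool is_aligned = (offset % min_alloc_size == 0);  // optional extra check
      return is_append && is_aligned;
    }

A write that fails such a check would simply go down the uncompressed path, which is why no persistent flags or hints would be needed.]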
> Well, but issues appear immediately after some overwrite request takes
> place. How do we handle overwrites? Do we compress the overwritten data
> or not? If not, we need some way to merge compressed and uncompressed
> blocks. And so on and so forth.
> IMO it's hard (or even impossible) to apply compression for specific
> write patterns only unless you prohibit the other ones.
> We can support a subset of compression policies (i.e. the ways we
> resolve compression issues: RMW at init phase, lazy overwrite, WAL use,
> etc.) but not a subset of write patterns.
>
> >> And I'm not following the idea about "a sequence of compressed strings".
> >> Could you please elaborate?
> > Let's say we have 32KB compressed_blocks, and the client is doing
> > 1000-byte appends. We will allocate a 32KB chunk on disk, and only fill
> > it with say ~500 bytes of compressed data. When the next write comes
> > around, we could compress it too and append it to the block without
> > decompressing the previous string.
> >
> > By string I mean that each compression cycle looks something like
> >
> >   start(...)
> >   while (more data)
> >     compress_some_stuff(...)
> >   finish(...)
> >
> > i.e., there's a header and maybe a footer in the compressed string. If we
> > are decompressing and the decompressor says "done" but there is more data
> > in our compressed block, we could repeat the process until we get to the
> > end of the compressed data.
> Got it, thanks for the clarification.
> > But it might not matter or be worth it. If the compressed blocks are
> > smallish then decompressing, appending, and recompressing isn't going to
> > be that expensive anyway. I'm mostly worried about small appends, e.g. by
> > rbd mirroring (imagine 4 KB writes + some metadata) or the MDS journal.
> That's mainly about small appends, not small writes, right?
>
> At this point I agree with Allen that we need variable policies to handle
> compression. Most probably we won't be able to create a single one that
> fits every write pattern perfectly.
> The only concern about that is the complexity of such a task...
>
> > sage
>
> Thanks,
> Igor
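[For reference, a rough sketch in C++ of the "sequence of compressed strings" idea discussed above. The Compressor interface below is purely hypothetical (it is not BlueStore's actual compressor plugin API); it only illustrates appending an independently compressed string to a partially filled block, and restarting decompression on read until the stored data is exhausted.

    #include <cstddef>
    #include <string>

    // Hypothetical compressor interface, assumed for this sketch only.
    struct Compressor {
      // Compress 'in' into a self-delimiting stream and append it to 'out'.
      virtual void compress(const std::string& in, std::string* out) = 0;
      // Decompress the single stream starting at 'pos', append the plaintext
      // to 'out', and return the position just past the end of that stream.
      virtual size_t decompress_one(const std::string& in, size_t pos,
                                    std::string* out) = 0;
      virtual ~Compressor() {}
    };

    // Append path: compress the new data and tack the resulting string onto
    // the partially filled compressed block, without decompressing what is
    // already stored there.
    void append_to_compressed_block(Compressor& c, const std::string& data,
                                    std::string* compressed_block) {
      c.compress(data, compressed_block);
    }

    // Read path: the block is a concatenation of independent compressed
    // strings, so keep restarting the decompression stream until we reach
    // the end of the stored compressed data.
    std::string read_compressed_block(Compressor& c,
                                      const std::string& compressed_block) {
      std::string plain;
      size_t pos = 0;
      while (pos < compressed_block.size())
        pos = c.decompress_one(compressed_block, pos, &plain);
      return plain;
    }

Whether this is worth doing depends, as Sage notes, on the size of the compressed blocks: for smallish blocks, decompressing, appending, and recompressing may be cheap enough anyway.]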