> -----Original Message-----
> From: Vikas Sinha-SSI [mailto:v.sinha@xxxxxxxxxxxxxxx]
> Sent: Friday, March 18, 2016 12:18 PM
> To: Igor Fedotov <ifedotov@xxxxxxxxxxxx>; Sage Weil <sage@xxxxxxxxxxxx>
> Cc: Allen Samuels <Allen.Samuels@xxxxxxxxxxx>; ceph-devel <ceph-devel@xxxxxxxxxxxxxxx>
> Subject: RE: Adding compression support for bluestore.
>
> Hi Igor,
> Thanks a lot for this. Do you also consider supporting offline compression
> (via a background task, or at least something not in the main IO path)?
> Will the current proposal allow this, and do you consider it a useful
> option at all? My concern is the performance impact of compression,
> though I don't yet know whether it will be significant. I'm also
> concerned about adding more complexity.
> I would love to know your thoughts on this.
> Thanks,
> Vikas

The revised extent map proposal that I sent earlier would directly support this capability. There's no reason a policy of doing NO inline compression couldn't be implemented, followed by a background (WAL-based or even deep-scrub-based) compression pass. This is yet another reason why separating policy from mechanism is important.

> > -----Original Message-----
> > From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx]
> > On Behalf Of Igor Fedotov
> > Sent: Friday, March 18, 2016 8:54 AM
> > To: Sage Weil
> > Cc: Allen Samuels; ceph-devel
> > Subject: Re: Adding compression support for bluestore.
> >
> > On 17.03.2016 18:33, Sage Weil wrote:
> > > I'd say "maybe". It's easy to say we should focus on read
> > > performance now, but as soon as we have "support for compression"
> > > everybody is going to want to turn it on on all of their clusters
> > > to spend less money on hard disks. That will definitely include
> > > RBD users, where write latency is very important. I'm hesitant to
> > > take an architectural direction that locks us in.
> > > With something layered over BlueStore I think we're forced to do
> > > it all in the initial phase; with the monolithic approach that
> > > integrates it into BlueStore's write path we have the option to do
> > > either one--perhaps based on the particular request or hints or
> > > whatever.
> > >>>>> What do you think?
> > >>>>>
> > >>>>> It would be nice to choose a simpler strategy for the first
> > >>>>> pass that handles a subset of write patterns (i.e., sequential
> > >>>>> writes, possibly unaligned) that is still a step in the
> > >>>>> direction of the more robust strategy we expect to implement
> > >>>>> after that.
> > >>>>>
> > >>>> I'd probably agree, but... I don't see a good way to implement
> > >>>> compression for specific write patterns only.
> > >>>> We need to either ensure that these patterns are used
> > >>>> exclusively (append only / sequential only flags?) or provide
> > >>>> some means to fall back to regular mode when an inappropriate
> > >>>> write occurs.
> > >>>> I don't think either option is good and/or easy enough.
> > >>> Well, if we simply don't implement a garbage collector, then for
> > >>> sequential+aligned writes we don't end up with stuff that needs
> > >>> garbage collection. Even the sequential case might be doable if
> > >>> we make it possible to fill the extent with a sequence of
> > >>> compressed strings (as long as we haven't reached the compressed
> > >>> length, try to restart the decompression stream).
> > >> It's still unclear to me whether such specific patterns should be
> > >> exclusively applied to the object, e.g. by using a specific
> > >> object creation mode, or whether we should detect them
> > >> automatically and fall back to a regular write (i.e. disable
> > >> compression) when a write doesn't conform to the supported
> > >> pattern.
> > > I think initially supporting only the append workload is a simple
> > > check for whether the offset == the object size (and maybe whether
> > > it is aligned). No persistent flags or hints needed there.
> > Well, but issues appear immediately after some overwrite request
> > takes place. How do we handle overwrites? Do we compress the
> > overwritten data or not? If not, we need some way to merge
> > compressed and uncompressed blocks. And so on and so forth. IMO it's
> > hard (or even impossible) to apply compression for specific write
> > patterns only unless you prohibit the other ones.
> > We can support a subset of compression policies (i.e. ways to
> > resolve compression issues: RMW at init phase, lazy overwrite, WAL
> > use, etc.) but not a subset of write patterns.
> >
> > >> And I'm not following the idea about "a sequence of compressed
> > >> strings". Could you please elaborate?
> > > Let's say we have 32KB compressed_blocks, and the client is doing
> > > 1000 byte appends. We will allocate a 32KB chunk on disk, and only
> > > fill it with, say, ~500 bytes of compressed data. When the next
> > > write comes around, we could compress it too and append it to the
> > > block without decompressing the previous string.
> > >
> > > By string I mean that each compression cycle looks something like
> > >
> > >   start(...)
> > >   while (more data)
> > >     compress_some_stuff(...)
> > >   finish(...)
> > >
> > > i.e., there's a header and maybe a footer in the compressed
> > > string. If we are decompressing and the decompressor says "done"
> > > but there is more data in our compressed block, we could repeat
> > > the process until we get to the end of the compressed data.
> > Got it, thanks for the clarification.
> > > But it might not matter or be worth it. If the compressed blocks
> > > are smallish then decompressing, appending, and recompressing
> > > isn't going to be that expensive anyway.
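A minimal sketch of this "sequence of compressed strings" idea: each append is compressed as an independent, self-framed stream and concatenated onto the block, and the reader restarts decompression each time the decompressor reports end-of-stream while bytes remain. Python's zlib is used purely for illustration (its streams carry the header and checksum trailer described above); the function names are invented and this is not BlueStore code.

```python
import zlib

def append_compressed(block: bytearray, data: bytes) -> None:
    # Compress this append as a self-contained zlib stream (its own
    # header and checksum) and concatenate it onto the block, without
    # touching the previously written streams.
    block += zlib.compress(data)

def read_block(block: bytes) -> bytes:
    # Decompress a block holding a sequence of concatenated streams:
    # when the decompressor reports end-of-stream but bytes remain
    # (unused_data), restart decompression on the remainder.
    out = bytearray()
    rest = bytes(block)
    while rest:
        d = zlib.decompressobj()
        out += d.decompress(rest)
        out += d.flush()
        rest = d.unused_data
    return bytes(out)
```

zlib's `unused_data` attribute exposes exactly the condition Sage describes: the decompressor says "done" but there is more data left in the compressed block.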
> > > I'm mostly worried about small appends, e.g. by rbd mirroring
> > > (imagine 4KB writes + some metadata) or the MDS journal.
> > That's mainly about small appends, not small writes, right?
> >
> > At this point I agree with Allen that we need variable policies to
> > handle compression. Most probably we won't be able to create a
> > single one that fits every write pattern perfectly.
> > The only concern is the complexity of such a task...
> > > sage
> > Thanks,
> > Igor
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > in the body of a message to majordomo@xxxxxxxxxxxxxxx
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
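Sage's append-only check earlier in the thread (compress only when the write offset equals the current object size, and perhaps is aligned), combined with Igor's point about falling back to the regular path for any other write, amounts to a per-write policy decision. A toy illustration with invented names, not actual BlueStore logic:

```python
def choose_write_mode(offset: int, object_size: int,
                      min_alloc: int = 4096) -> str:
    # Compress only a "simple append": the write starts exactly at the
    # current end of the object and on an allocation-aligned boundary.
    # Anything else takes the regular uncompressed path, so no
    # persistent flags or hints are needed.
    if offset == object_size and offset % min_alloc == 0:
        return "compress"
    return "plain"
```

An overwrite (offset below the object size) or an unaligned append both fall back to "plain", which sidesteps the merge-compressed-and-uncompressed problem Igor raises at the cost of leaving those writes uncompressed.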