This time without html (thanks gmail)!

On 17 March 2016 at 09:43, Blair Bethwaite <blair.bethwaite@xxxxxxxxx> wrote:
> Hi Igor, Allen, Sage,
>
> Apologies for the interjection into the technical back-and-forth here, but
> I want to ask a question / make a request from the user/operator
> perspective (possibly relevant to other advanced bluestore features
> too)...
>
> Can a feature like this expose metrics (e.g., compression ratio) back up
> to higher layers such as rados that could then be used to automate use of
> the feature? As a user/operator, implicit compression support in the
> backend is exciting, but it's something I'd want rados/librbd capable of
> toggling on/off automatically based on a threshold (e.g., librbd could
> toggle compression off at the image level if the first n rados objects
> written/edited since turning compression on are compressed less than c%).
> This sort of thing would obviously help to avoid unnecessary overheads and
> would cater to mixed use-cases (e.g. cloud provider block storage) where,
> in general, the operator wants compression on but has no idea what users
> are doing with their internal filesystems. It would also mesh nicely with
> any future "distributed" compression implemented at the librbd client side
> (which would again likely be an rbd toggle).
>
> Cheers,
>
> On 17 March 2016 at 06:41, Allen Samuels <Allen.Samuels@xxxxxxxxxxx> wrote:
>>
>> > -----Original Message-----
>> > From: Sage Weil [mailto:sage@xxxxxxxxxxxx]
>> > Sent: Wednesday, March 16, 2016 2:28 PM
>> > To: Igor Fedotov <ifedotov@xxxxxxxxxxxx>
>> > Cc: Allen Samuels <Allen.Samuels@xxxxxxxxxxx>; ceph-devel
>> > <ceph-devel@xxxxxxxxxxxxxxx>
>> > Subject: Re: Adding compression support for bluestore.
>> >
>> > On Wed, 16 Mar 2016, Igor Fedotov wrote:
>> > > On 15.03.2016 20:12, Sage Weil wrote:
>> > > > My current thinking is that we do something like:
>> > > >
>> > > > - add a bluestore_extent_t flag for FLAG_COMPRESSED
>> > > > - add uncompressed_length and compression_alg fields
>> > > >   (- add a checksum field while we are at it, I guess)
>> > > >
>> > > > - in _do_write, when we are writing a new extent, we need to
>> > > > compress it in memory (up to the max compression block), and feed
>> > > > that size into _do_allocate so we know how much disk space to
>> > > > allocate. This is probably reasonably tricky to do, and handles
>> > > > just the simplest case (writing a new extent to a new object, or
>> > > > appending to an existing one, and writing the new data compressed).
>> > > > The current _do_allocate interface and responsibilities will
>> > > > probably need to change quite a bit here.
>> > > Sounds good so far.
>> > > > - define the general (partial) overwrite strategy. I would like for
>> > > > this to be part of the WAL strategy. That is, we do the
>> > > > read/modify/write as deferred work for the partial regions that
>> > > > overlap existing extents. Then _do_wal_op would read the compressed
>> > > > extent, merge it with the new piece, and write out the new
>> > > > (compressed) extents. The problem is that right now the WAL path
>> > > > *just* does IO--it doesn't do any kv metadata updates, which would
>> > > > be required here to do the final allocation (we won't know how big
>> > > > the resulting extent will be until we decompress the old thing,
>> > > > merge it with the new thing, and recompress).
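
For reference, a rough sketch of the extent metadata change described above.
Only FLAG_COMPRESSED, uncompressed_length, compression_alg and the optional
checksum come from the proposal itself; the struct name, remaining fields and
types are illustrative guesses, not actual BlueStore code:

    #include <cstdint>

    // Illustrative only: what a compression-aware extent record might carry.
    struct bluestore_extent_sketch_t {
      static const uint32_t FLAG_COMPRESSED = 1;  // extent stores compressed data

      uint64_t offset = 0;               // physical offset of the stored bytes
      uint32_t length = 0;               // bytes actually stored on disk
      uint32_t flags = 0;                // FLAG_COMPRESSED, etc.

      // Only meaningful when FLAG_COMPRESSED is set:
      uint32_t uncompressed_length = 0;  // logical length before compression
      uint8_t  compression_alg = 0;      // e.g. an enum value for zlib/snappy
      uint32_t csum = 0;                 // optional checksum of the stored bytes

      bool is_compressed() const { return flags & FLAG_COMPRESSED; }
    };
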
>> > > >
>> > > > But, we need to address this anyway to support CRCs (where we will
>> > > > similarly do a read/modify/write, calculate a new checksum, and
>> > > > need to update the onode). I think the answer here is just that
>> > > > _do_wal_op updates some in-memory state attached to the wal
>> > > > operation that gets applied when the wal entry is cleaned up in
>> > > > _kv_sync_thread (wal_cleaning list).
>> > > >
>> > > > Calling into the allocator in the WAL path will be more complicated
>> > > > than just updating the checksum in the onode, but I think it's
>> > > > doable.
>> > > Could you please name the issues with calling the allocator in the
>> > > WAL path? Proper locking? What else?
>> >
>> > I think this bit isn't so bad... we need to add another field to the
>> > in-memory wal_op struct that includes space allocated in the WAL stage,
>> > and make sure that gets committed by the kv thread for all of the
>> > wal_cleaning txc's.
>> >
>> > > A potential issue with using WAL for compressed block overwrites is a
>> > > significant increase in WAL data volume. IIUC, currently a WAL record
>> > > can have up to 2*bluestore_min_alloc_size (i.e. 128K) of client data
>> > > per single write request - the overlapped head and tail. In the case
>> > > of compressed blocks this will be up to
>> > > 2*bluestore_max_compressed_block (i.e. 8MB), as you can't simply
>> > > overwrite fully overlapped extents - one has to operate on whole
>> > > compression blocks now...
>> > >
>> > > Seems attractive otherwise...
>> >
>> > I think the way to address this is to make
>> > bluestore_max_compressed_block *much* smaller. Like, 4x or 8x
>> > min_alloc_size, but no more. That gives us a smallish rounding error of
>> > "lost" efficiency, but keeps the size of extents we have to
>> > read+decompress in the overwrite or small read cases reasonable.
>> >
>>
>> Yes, this is generally what people do. It's very hard to have a large
>> compression window without having the CPU times balloon up.
>>
>> > The tradeoff is the onode_t's block_map gets bigger... but for a ~4MB
>> > object it's still only 5-10 records, which sounds fine to me.
>> >
>> > > > The alternative is that we either
>> > > >
>> > > > a) do the read side of the overwrite in the first phase of the op,
>> > > > before we commit it. That will mean a higher commit latency and
>> > > > will slow down the pipeline, but would avoid the double-write of
>> > > > the overlap/wal regions. Or,
>> > > This is probably the simplest approach, with no hidden caveats other
>> > > than the latency increase.
>> > > >
>> > > > b) we could just leave the overwritten extents alone and structure
>> > > > the block_map so that they are occluded. This will 'leak' space for
>> > > > some write patterns, but that might be okay given that we can come
>> > > > back later and clean it up, or refine our strategy to be smarter.
>> > > Just to clarify that I understand the idea properly: are you
>> > > suggesting simply writing the new block out to a new extent and
>> > > updating the block map (and read procedure) to use either that new
>> > > extent or the remains of the overwritten extents, depending on the
>> > > read offset? And the overwritten extents are preserved intact until
>> > > they are fully hidden or some background cleanup procedure merges
>> > > them.
>> > > If so, I can see the following pros and cons:
>> > > + write is faster
>> > > - compressed data read is potentially slower, as you might need to
>> > >   decompress more compressed blocks
>> > > - space usage is higher
>> > > - need for a garbage collector, i.e. additional complexity
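
To make the occlusion idea in (b) concrete, here is an illustrative sketch
(not BlueStore code) of how a read could resolve offsets against occluded
extents; the names and the flat, write-ordered vector layout are assumptions
made purely for the example:

    #include <cstdint>
    #include <vector>
    #include <optional>

    struct lextent_sketch {
      uint64_t logical_off;  // logical offset within the object
      uint64_t length;       // logical bytes covered by this extent
      uint64_t blob_id;      // which (possibly compressed) on-disk blob backs it
    };

    // Extents are appended in write order; later entries occlude earlier ones.
    using block_map_sketch = std::vector<lextent_sketch>;

    // Resolve a logical offset: the newest extent covering it wins.  Older,
    // partially occluded extents stay on disk until a cleanup pass merges them.
    std::optional<lextent_sketch>
    resolve(const block_map_sketch& m, uint64_t off) {
      for (auto it = m.rbegin(); it != m.rend(); ++it) {
        if (off >= it->logical_off && off < it->logical_off + it->length)
          return *it;  // may require decompressing a larger, older blob
      }
      return std::nullopt;  // hole
    }

This is exactly where the cons above come from: a read may have to decompress
an older, larger blob just to serve a few bytes, and blobs that become fully
occluded leak space until some garbage collection merges or drops them.
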
>> > >
>> > > Thus the question is which use patterns are in the foreground and
>> > > should be made the most effective.
>> > > IMO read performance and space saving are more important for the
>> > > cases where compression is needed.
>> > >
>> > > > What do you think?
>> > > >
>> > > > It would be nice to choose a simpler strategy for the first pass
>> > > > that handles a subset of write patterns (i.e., sequential writes,
>> > > > possibly unaligned) that is still a step in the direction of the
>> > > > more robust strategy we expect to implement after that.
>> > > >
>> > > I'd probably agree, but... I don't see a good way to implement
>> > > compression for specific write patterns only.
>> > > We need to either ensure that these patterns are used exclusively
>> > > (append-only / sequential-only flags?) or provide some means to fall
>> > > back to regular mode when an inappropriate write occurs.
>> > > I don't think either is good and/or easy enough.
>> >
>> > Well, if we simply don't implement a garbage collector, then for
>> > sequential+aligned writes we don't end up with stuff that needs garbage
>> > collection. Even the sequential case might be doable if we make it
>> > possible to fill the extent with a sequence of compressed strings (as
>> > long as we haven't reached the compressed length, try to restart the
>> > decompression stream).
>> >
>> > > In this respect my original proposal to have the compression engine
>> > > more or less segregated from bluestore seems more attractive - there
>> > > is no need to refactor bluestore internals in this case. One can
>> > > easily start using compression or drop it and fall back to the
>> > > current code state. No significant modifications in run-time data
>> > > structures and algorithms...
>> >
>> > It sounds simpler in theory, but when I try to sort out how it would
>> > actually work, it seems like you either have to expose all of the
>> > block_map metadata up to this layer, at which point you may as well do
>> > it down in BlueStore and have the option of deferred WAL work, or you
>> > do something really simple with fixed compression block sizes and get a
>> > weak final result. Not to mention the EC problems (although some of
>> > that will go away when EC overwrites come along)...
>> >
>> > sage
>
> --
> Cheers,
> ~Blairo

--
Cheers,
~Blairo
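
As a rough illustration of the heuristic proposed at the top of the thread
(toggle compression off per image if the first n objects compress worse than
c%), something along these lines could live on the client side. Every name
and parameter here is hypothetical; no such interface exists in librbd or
rados today, and it presupposes the backend exposing per-object compression
metrics upward:

    #include <cstdint>
    #include <cstddef>

    // Hypothetical client-side helper: sample the compression ratio reported
    // by the backend for the first N objects written after enabling
    // compression, and switch it off if the saving stays below a threshold.
    class CompressionToggle {
      std::size_t sample_objects;   // "first n rados objects"
      double      min_saving;       // required saving, e.g. 0.10 for c% = 10%
      std::size_t n_seen = 0;
      uint64_t    raw_bytes = 0;
      uint64_t    stored_bytes = 0;
      bool        enabled = true;

    public:
      CompressionToggle(std::size_t n, double c)
        : sample_objects(n), min_saving(c) {}

      // Fed with per-object metrics the store would need to expose upward.
      void observe(uint64_t raw, uint64_t stored) {
        if (!enabled || n_seen >= sample_objects)
          return;
        raw_bytes += raw;
        stored_bytes += stored;
        if (++n_seen == sample_objects) {
          double saving = 1.0 - double(stored_bytes) / double(raw_bytes);
          if (saving < min_saving)
            enabled = false;  // data barely compresses; stop paying the cost
        }
      }

      bool compression_enabled() const { return enabled; }
    };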