Hi Sage, et al.
Let me share some ideas about possible compression burden reduction on
the cluster.
As you know, we currently perform block compression at the BlueStore level
for each replica independently. With 3x replication this triples the
compression CPU overhead for the cluster, which looks like a significant
waste of CPU resources IMHO.
We could probably eliminate this overhead by introducing write request
preprocessing performed synchronously at the ObjectStore level. This
preprocessing parses the transaction, detects write requests, and
transforms them into new ones aligned with the current store allocation
unit (AU). At the same time, resulting extents that span more than a
single AU are compressed if needed. I.e. the preprocessing does some of
the job currently performed in BlueStore::_do_write_data, which splits a
write request into _do_write_small/_do_write_big calls. But after the
split and big-blob compression, the preprocessor simply updates the
transaction with the new write requests.
E.g.
with AU = 0x1000
Write Request (1~0xfffe) is transformed into the following sequence:
WriteX 1~0xfff (uncompressed)
WriteX 0x1000~0xe000 (compressed if needed)
WriteX 0xf000~0xfff (uncompressed)
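For illustration, here is a minimal sketch of that split step (hypothetical helper names, not actual Ceph code); it carves a write into an unaligned head, an AU-aligned middle that becomes the compression candidate, and an unaligned tail:

```cpp
// Hypothetical sketch of AU-aligned write splitting; split_on_au and
// SubWrite are illustration-only names, not part of the Ceph codebase.
#include <cassert>
#include <cstdint>
#include <vector>

struct SubWrite {
  uint64_t offset;
  uint64_t length;
  bool compress;  // true when the extent covers whole allocation units
};

// Split [offset, offset+length) on AU boundaries: unaligned head,
// AU-aligned middle (compression candidate), unaligned tail.
std::vector<SubWrite> split_on_au(uint64_t offset, uint64_t length,
                                  uint64_t au) {
  std::vector<SubWrite> out;
  uint64_t end = offset + length;
  uint64_t head_end = (offset + au - 1) / au * au;  // round up to AU
  uint64_t tail_begin = end / au * au;              // round down to AU
  if (head_end >= end) {  // request fits within a single AU
    out.push_back({offset, length, false});
    return out;
  }
  if (offset < head_end)
    out.push_back({offset, head_end - offset, false});
  if (head_end < tail_begin)
    out.push_back({head_end, tail_begin - head_end, true});
  if (tail_begin < end)
    out.push_back({tail_begin, end - tail_begin, false});
  return out;
}
```

E.g. split_on_au(0x1, 0xfffe, 0x1000) yields 1~0xfff (uncompressed), 0x1000~0xe000 (compressed), and 0xf000~0xfff (uncompressed).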
The updated transaction is then passed to all replicas, including the
master one, using the regular apply_/queue_transaction mechanics.
As a bonus, one gets automatic payload compression when transporting the
request to remote store replicas.
The regular write request path should be preserved for EC pools and other
needs as well.
Please note that almost no extra latency is introduced into request
handling. Replicas receive the modified transaction a bit later, but they
no longer spend time on the split/compress work.
There is a potential conflict with the current garbage collection stuff,
though: we can't perform GC during preprocessing due to a possible race
with preceding unfinished transactions, and consequently we're unable to
merge and compress the merged data. Well, we could do that when applying
the transaction, but this would produce a sequence like this at each
replica: decompress the original request + decompress the data to merge
-> compress the merged data.
Probably this limitation isn't that bad - IMHO it's better to have
compressed blobs aligned with the original write requests anyway.
Moreover, I have some ideas on how to get rid of the blob_depth notion,
which would make life a bit easier. Will share shortly.
Any thoughts/comments?
Thanks,
Igor
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html