RE: Adding compression/checksum support for bluestore.

> -----Original Message-----
> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-
> owner@xxxxxxxxxxxxxxx] On Behalf Of Igor Fedotov
> Sent: Thursday, March 31, 2016 7:18 PM
> 
> On 31.03.2016 19:32, Allen Samuels wrote:
> >> But do we really need to store checksums as metadata? What about
> >> prefixing (or postfixing) the 4K-4(?) blob with the checksum and
> >> storing that pair to disk? IMO we always need the checksum values
> >> along with the blob data, so let's store and read them together. This
> >> immediately eliminates the question about granularity and the
> >> corresponding overhead... Have I missed something?
> > If you store them inline with the data, then nothing lines up on the boundaries
> that the HW designers expect, and you end up doing things like extra copying
> of every data buffer. This will kill performance.
> 
> Perhaps you are right.
> 
> But I'm not sure I fully understand which HW designers you mean here. Are
> you considering the case where Ceph is embedded into some hardware, and
> incoming RW requests always operate on aligned data and are expected to
> keep the same alignment for data saved to disk?

> IMHO proper data alignment in the incoming requests is just a particular
> case; in general we have no such guarantee. Moreover, compression
> destroys whatever alignment existed. Thus in many cases we could easily
> append an additional data portion containing a checksum.

Devices, in general, store data in sectors (or blocks) of particular sizes: 512 bytes for most HDDs, and 4096 bytes for many SSDs and many large-capacity HDDs ("advanced-format", as one company calls it). For that reason, when you're doing direct I/O, you read/write in multiples of those sectors/blocks, and anything smaller results in bad performance: on a read you must fetch the entire sector anyway (there is no partial-sector read), discard the unneeded data, and realign the rest to match the destination; on a write you must pad the buffer out to a full sector, which means copying data from the caller into a temporary buffer before issuing the write.
Putting the checksum in front of the data prevents direct-to-destination reads (i.e. read(..dest, 4096)) and zero-copy writes, because the front of dest would contain the checksum: you need to read into a temporary buffer, move the checksum into its own storage, and move the actual data to the destination, which means one read and up to two memmoves. If you put the checksum *past* the data, you can read directly into the destination buffer, but you must make sure the destination buffer is sizeof(checksum) bytes larger than the consumer expects, and finally move the checksum somewhere else -- better, but bug-prone.
Keeping checksums in separate storage may incur extra I/O (as Allen wrote), but it avoids both of the issues above.

With best regards / Pozdrawiam
Piotr Dałek

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


