On Wed, Apr 26, 2023 at 11:50 PM Sam Just <sjust@xxxxxxxxxx> wrote:
>
> This came up again in the dev summit at cephalocon, so I figure it's
> worth reviving this thread.
>
> First, I'll try to recap the situation (Ilya, feel free to correct me
> here). My understanding of the issue is that rbd has features (most
> notably encryption) which depend on the librados SPARSE_READ operation
> reflecting accurately which ranges have been written or trimmed at a
> 4k granularity. This appears to work correctly on replicated pools on
> bluestore, but erasure coded pools always return the full object
> contents up to the object size including regions the client has not
> written to.

Hi Sam,

As Jeff said in another email, fscrypt support in kcephfs has a hard
dependency on accurate allocation information. librbd wants to grow a
similar dependency to enhance its built-in LUKS encryption support
(currently reads from unallocated areas on encrypted images are handled
inconsistently: if the underlying object doesn't exist, zeroes are
returned; if it does exist, we are at the mercy of sparse-read behavior
and can return random garbage obtained by decrypting zeroes).

> I don't think this was originally a guarantee of the interface. I
> think the original guarantee was simply that SPARSE_READ would return
> any non-zero regions, not that it was guaranteed not to return
> unwritten or trimmed regions. The OSD does not track this state above
> the ObjectStore layer -- SPARSE_READ and MAPEXT both rely directly on
> ObjectStore::fiemap. MAPEXT actually returns -ENOTSUPP on erasure
> coded pools.
>
> Adam: the observed behavior is that fiemap on bluestore does
> accurately reflect the client's written extents at a 4k granularity.
> Is that reliable, or is it a property of only some bluestore
> configurations?
>
> As it appears desirable that we actually guarantee this, we probably
> want to do two things:
> 1) codify this guarantee in the ObjectStore interface (4k in all
> cases?), and ensure that all configurations satisfy it going forward
> (including seastore)
> 2) update the ec implementation to track allocation at the granularity
> of an EC stripe. HashInfo is the natural place to put the
> information, probably? We'll need to also implement ZERO. Radek: I
> know you're looking into EC for crimson, perhaps you can evaluate how
> much work would be required here?

The EC stripe that is referred to here is configurable on a per-pool
basis with the default taken from osd_pool_erasure_code_stripe_unit,
right? If the user configures it to e.g. 16k for a particular pool
(EC profile), how would that interact with the 4k guarantee at the
ObjectStore layer?

Thanks,

                Ilya
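P.S. To make the sparse-read behavior concrete, here is a rough,
untested librados sketch. The pool name "ecpool" and object name
"sparse-test" are placeholders; what it shows is just a client writing
two discontiguous 4k chunks and then issuing a sparse-read, so the
returned extent map can be compared between a replicated bluestore
pool and an EC pool.

// Sketch only: pool/object names are placeholders, error handling is
// minimal. Build against librados (e.g. -lrados, C++17).
#include <rados/librados.hpp>
#include <iostream>
#include <map>
#include <string>

int main() {
  librados::Rados cluster;
  cluster.init("admin");           // client.admin, default keyring
  cluster.conf_read_file(nullptr); // default ceph.conf search path
  if (cluster.connect() < 0) {
    std::cerr << "connect failed" << std::endl;
    return 1;
  }

  librados::IoCtx ioctx;
  if (cluster.ioctx_create("ecpool", ioctx) < 0) { // placeholder pool
    std::cerr << "no such pool" << std::endl;
    return 1;
  }

  const std::string oid = "sparse-test"; // placeholder object name

  // Write two discontiguous 4k chunks: one at offset 0, one at 64k.
  librados::bufferlist chunk;
  chunk.append(std::string(4096, 'a'));
  ioctx.write(oid, chunk, chunk.length(), 0);
  ioctx.write(oid, chunk, chunk.length(), 64 * 1024);

  // Sparse-read the first 128k and dump the returned extent map.
  std::map<uint64_t, uint64_t> extents;
  librados::bufferlist data, out;
  int rval = 0;
  librados::ObjectReadOperation op;
  op.sparse_read(0, 128 * 1024, &extents, &data, &rval);
  ioctx.operate(oid, &op, &out);

  for (const auto& [off, len] : extents)
    std::cout << "extent: off=" << off << " len=" << len << std::endl;
  // Replicated bluestore pool: I'd expect two 4k extents (0 and 65536).
  // EC pool, per the behavior described above: the map covers the full
  // object contents up to the written object size (here 68k).

  cluster.shutdown();
  return 0;
}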