Re: sparse-read OSD op guarantees

EC wouldn't be able to directly use the ObjectStore's fiemap
implementation.  I think we'd need to build that metadata into either
the object_info or the ECUtil::hash_info.
-Sam
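
To make the idea of building allocation metadata into the EC layer a bit more concrete, here is a minimal Python sketch of a per-object map kept at stripe granularity. Everything in it is hypothetical (the class name StripeAllocMap, its methods, and the choice of a set of stripe indexes); it is not the layout of object_info or ECUtil::hash_info, only an illustration of what such metadata would need to answer for SPARSE_READ and ZERO.

```python
class StripeAllocMap:
    """Hypothetical per-object allocation map at EC stripe granularity."""

    def __init__(self, stripe_width):
        self.stripe_width = stripe_width
        self.allocated = set()  # indexes of stripes holding client data

    def mark_written(self, offset, length):
        # A write touching any byte of a stripe allocates the whole stripe.
        if length <= 0:
            return
        first = offset // self.stripe_width
        last = (offset + length - 1) // self.stripe_width
        self.allocated.update(range(first, last + 1))

    def mark_zeroed(self, offset, length):
        # ZERO/trim can only deallocate stripes it covers completely.
        first = -(-offset // self.stripe_width)       # round up
        end = (offset + length) // self.stripe_width  # round down, exclusive
        self.allocated.difference_update(range(first, end))

    def extents(self):
        # Merge adjacent allocated stripes into (offset, length) pairs,
        # which is the shape a sparse read would report.
        out = []
        for idx in sorted(self.allocated):
            off = idx * self.stripe_width
            if out and out[-1][0] + out[-1][1] == off:
                out[-1] = (out[-1][0], out[-1][1] + self.stripe_width)
            else:
                out.append((off, self.stripe_width))
        return out
```

With a 16k stripe width, a 4k write at offset 4k would be reported as the whole first 16k stripe, and a ZERO that only partially covers a stripe leaves that stripe allocated.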

On Thu, Apr 27, 2023 at 1:16 PM Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>
> On Wed, Apr 26, 2023 at 11:50 PM Sam Just <sjust@xxxxxxxxxx> wrote:
> >
> > This came up again in the dev summit at Cephalocon, so I figure it's
> > worth reviving this thread.
> >
> > First, I'll try to recap the situation (Ilya, feel free to correct me
> > here).  My understanding of the issue is that rbd has features (most
> > notably encryption) which depend on the librados SPARSE_READ operation
> > reflecting accurately which ranges have been written or trimmed at a
> > 4k granularity.  This appears to work correctly on replicated pools on
> > BlueStore, but erasure coded pools always return the full object
> > contents up to the object size, including regions the client has not
> > written to.
>
> Hi Sam,
>
> As Jeff said in another email, fscrypt support in kcephfs has a hard
> dependency on accurate allocation information.  librbd wants to grow
> a similar dependency to enhance its built-in LUKS encryption support
> (currently reads from unallocated areas on encrypted images are handled
> inconsistently: if the underlying object doesn't exist, zeroes are
> returned; if it does exist, we are at the mercy of sparse-read behavior
> and can return random garbage obtained by decrypting zeroes).
>
> >
> > I don't think this was originally a guarantee of the interface.  I
> > think the original guarantee was simply that SPARSE_READ would return
> > any non-zero regions, not that it was guaranteed not to return
> > unwritten or trimmed regions.  The OSD does not track this state above
> > the ObjectStore layer -- SPARSE_READ and MAPEXT both rely directly on
> > ObjectStore::fiemap.  MAPEXT actually returns -EOPNOTSUPP on erasure
> > coded pools.
> >
> > Adam: the observed behavior is that fiemap on bluestore does
> > accurately reflect the client's written extents at a 4k granularity.
> > Is that reliable, or is it a property of only some bluestore
> > configurations?
> >
> > As it appears desirable that we actually guarantee this, we probably
> > want to do two things:
> > 1) codify this guarantee in the ObjectStore interface (4k in all
> > cases?), and ensure that all configurations satisfy it going forward
> > (including seastore)
> > 2) update the EC implementation to track allocation at the granularity
> > of an EC stripe.  HashInfo is probably the natural place to put the
> > information.  We'll also need to implement ZERO.  Radek: I
> > know you're looking into EC for Crimson, perhaps you can evaluate how
> > much work would be required here?
>
> The EC stripe that is referred to here is configurable on a per-pool
> basis with the default taken from osd_pool_erasure_code_stripe_unit,
> right?  If the user configures it to e.g. 16k for a particular pool
> (EC profile), how would that interact with the 4k guarantee at the
> ObjectStore layer?
>
> Thanks,
>
>                 Ilya
>
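
On the stripe unit question quoted above, the arithmetic can be sketched as follows. This is only an illustration under stated assumptions: the helper reported_extent is hypothetical, K = 4 data chunks is an arbitrary example value, and whether allocation would be tracked per stripe unit or per full stripe (K * stripe unit) is exactly the open design question, so both are shown next to the 4k ObjectStore granule.

```python
def reported_extent(offset, length, granule):
    """Round a written byte range out to `granule` boundaries."""
    start = (offset // granule) * granule
    end = -(-(offset + length) // granule) * granule  # round up
    return start, end - start


K = 4                            # hypothetical number of data chunks
STRIPE_UNIT = 16 * 1024          # the 16k example from the question
STRIPE_WIDTH = K * STRIPE_UNIT   # one full stripe of client data

write = (20 * 1024, 4 * 1024)    # a 4k client write at offset 20k

for name, granule in (("4k", 4096),
                      ("stripe unit", STRIPE_UNIT),
                      ("stripe width", STRIPE_WIDTH)):
    off, length = reported_extent(*write, granule)
    print(f"{name:>12}: offset={off} length={length}")
```

The coarser the tracking granule, the more unwritten bytes around a small write would be reported as allocated, so a client relying on 4k accuracy would see a weaker guarantee on such a pool.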



