EC wouldn't be able to directly use the ObjectStore's fiemap
implementation. I think we'd need to build that metadata into either
the object_info or the ECUtil::hash_info.
-Sam

On Thu, Apr 27, 2023 at 1:16 PM Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>
> On Wed, Apr 26, 2023 at 11:50 PM Sam Just <sjust@xxxxxxxxxx> wrote:
> >
> > This came up again in the dev summit at Cephalocon, so I figure it's
> > worth reviving this thread.
> >
> > First, I'll try to recap the situation (Ilya, feel free to correct me
> > here). My understanding of the issue is that rbd has features (most
> > notably encryption) which depend on the librados SPARSE_READ operation
> > accurately reflecting which ranges have been written or trimmed at
> > a 4k granularity. This appears to work correctly on replicated pools
> > on bluestore, but erasure-coded pools always return the full object
> > contents up to the object size, including regions the client has not
> > written to.
>
> Hi Sam,
>
> As Jeff said in another email, fscrypt support in kcephfs has a hard
> dependency on accurate allocation information. librbd wants to grow
> a similar dependency to enhance its built-in LUKS encryption support
> (currently, reads from unallocated areas on encrypted images are
> handled inconsistently: if the underlying object doesn't exist, zeroes
> are returned; if it does exist, we are at the mercy of sparse-read
> behavior and can return random garbage obtained by decrypting zeroes).
>
> > I don't think this was originally a guarantee of the interface. I
> > think the original guarantee was simply that SPARSE_READ would return
> > any non-zero regions, not that it was guaranteed not to return
> > unwritten or trimmed regions. The OSD does not track this state above
> > the ObjectStore layer -- SPARSE_READ and MAPEXT both rely directly on
> > ObjectStore::fiemap. MAPEXT actually returns -ENOTSUPP on
> > erasure-coded pools.
> >
> > Adam: the observed behavior is that fiemap on bluestore does
> > accurately reflect the client's written extents at a 4k granularity.
> > Is that reliable, or is it a property of only some bluestore
> > configurations?
> >
> > As it appears desirable that we actually guarantee this, we probably
> > want to do two things:
> > 1) codify this guarantee in the ObjectStore interface (4k in all
> > cases?) and ensure that all configurations satisfy it going forward
> > (including seastore)
> > 2) update the EC implementation to track allocation at the
> > granularity of an EC stripe. HashInfo is probably the natural place
> > to put the information. We'll need to also implement ZERO. Radek:
> > I know you're looking into EC for crimson; perhaps you can evaluate
> > how much work would be required here?
>
> The EC stripe that is referred to here is configurable on a per-pool
> basis with the default taken from osd_pool_erasure_code_stripe_unit,
> right? If the user configures it to e.g. 16k for a particular pool
> (EC profile), how would that interact with the 4k guarantee at the
> ObjectStore layer?
>
> Thanks,
>
>                 Ilya
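
For concreteness, a minimal client-side sketch of the sparse-read
behavior under discussion, using the public librados C++ API. This is
an illustration, not a definitive test: the pool name "testpool" and
object name "obj" are placeholders, and error handling is abbreviated.
It writes 4k at a 64k offset, then sparse-reads the whole object and
prints the extent map the OSD returns.

    // Sketch only: "testpool" and "obj" are placeholder names.
    #include <rados/librados.hpp>
    #include <iostream>
    #include <map>

    int main() {
      librados::Rados cluster;
      cluster.init("admin");             // connect as client.admin
      cluster.conf_read_file(nullptr);   // default ceph.conf search path
      if (cluster.connect() < 0) return 1;

      librados::IoCtx ioctx;
      if (cluster.ioctx_create("testpool", ioctx) < 0) return 1;

      librados::bufferlist wbl;
      wbl.append(std::string(4096, 'x'));
      ioctx.write("obj", wbl, wbl.length(), 65536);  // 4k write at 64k

      std::map<uint64_t, uint64_t> extents;          // offset -> length
      librados::bufferlist data;
      ioctx.sparse_read("obj", extents, data, 1 << 22, 0);

      // Per the thread: a replicated bluestore pool should print just
      // the written extent (65536~4096), while an EC pool currently
      // reports the full object range instead.
      for (const auto& [off, len] : extents)
        std::cout << off << "~" << len << std::endl;

      ioctx.remove("obj");
      cluster.shutdown();
      return 0;
    }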
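
To illustrate point (2) and Ilya's granularity question, here is a
hypothetical, self-contained model of tracking allocation at EC-stripe
granularity. This is not Ceph code and all names are invented; in a
real implementation the written-stripe set would live alongside
ECUtil::HashInfo as Sam suggests. The key consequence it demonstrates:
a 4k write dirties its entire containing stripe, so with a 16k stripe
the reported extents can only ever be 16k-granular, regardless of any
4k guarantee at the ObjectStore layer.

    // Hypothetical sketch, not Ceph code.
    #include <cstdint>
    #include <iostream>
    #include <set>
    #include <utility>
    #include <vector>

    struct StripeAllocTracker {
      uint64_t stripe_width;       // e.g. k * stripe_unit
      std::set<uint64_t> stripes;  // indices of stripes ever written

      void mark_written(uint64_t off, uint64_t len) {
        if (len == 0) return;
        for (uint64_t s = off / stripe_width;
             s <= (off + len - 1) / stripe_width; ++s)
          stripes.insert(s);
      }

      // ZERO/trim would clear only fully covered stripes.
      void mark_zeroed(uint64_t off, uint64_t len) {
        uint64_t first = (off + stripe_width - 1) / stripe_width;
        uint64_t end = (off + len) / stripe_width;
        for (uint64_t s = first; s < end; ++s)
          stripes.erase(s);
      }

      // Clip a sparse-read range to written stripes, coalescing
      // adjacent ones into single extents.
      std::vector<std::pair<uint64_t, uint64_t>>
      extents(uint64_t off, uint64_t len) const {
        std::vector<std::pair<uint64_t, uint64_t>> out;
        for (uint64_t s = off / stripe_width;
             s * stripe_width < off + len; ++s) {
          if (!stripes.count(s)) continue;
          uint64_t so = s * stripe_width;
          if (!out.empty() && out.back().first + out.back().second == so)
            out.back().second += stripe_width;
          else
            out.emplace_back(so, stripe_width);
        }
        return out;
      }
    };

    int main() {
      StripeAllocTracker t{16384, {}};  // 16k stripe, as in Ilya's example
      t.mark_written(65536, 4096);      // 4k write marks the whole stripe
      for (auto [o, l] : t.extents(0, 1 << 20))
        std::cout << o << "~" << l << "\n";  // prints 65536~16384
    }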