On Wed, Apr 26, 2023 at 11:50 PM Sam Just <sjust@xxxxxxxxxx> wrote:
>
> This came up again in the dev summit at cephalocon, so I figure it's
> worth reviving this thread.
>
> First, I'll try to recap the situation (Ilya, feel free to correct me
> here). My understanding of the issue is that rbd has features (most
> notably encryption) which depend on the librados SPARSE_READ operation
> reflecting accurately which ranges have been written or trimmed at a
> 4k granularity. This appears to work correctly on replicated pools on
> bluestore, but erasure coded pools always return the full object
> contents up to the object size including regions the client has not
> written to.

Hi Sam,

As Jeff said in another email, fscrypt support in kcephfs has a hard
dependency on accurate allocation information. librbd wants to grow a
similar dependency to enhance its built-in LUKS encryption support
(currently reads from unallocated areas on encrypted images are handled
inconsistently: if the underlying object doesn't exist, zeroes are
returned; if it does exist, we are at the mercy of sparse-read behavior
and can return random garbage obtained by decrypting zeroes).

> I don't think this was originally a guarantee of the interface. I
> think the original guarantee was simply that SPARSE_READ would return
> any non-zero regions, not that it was guaranteed not to return
> unwritten or trimmed regions. The OSD does not track this state above
> the ObjectStore layer -- SPARSE_READ and MAPEXT both rely directly on
> ObjectStore::fiemap. MAPEXT actually returns -ENOTSUPP on erasure
> coded pools.
>
> Adam: the observed behavior is that fiemap on bluestore does
> accurately reflect the client's written extents at a 4k granularity.
> Is that reliable, or is it a property of only some bluestore
> configurations?
>
> As it appears desirable that we actually guarantee this, we probably
> want to do two things:
> 1) codify this guarantee in the ObjectStore interface (4k in all
> cases?), and ensure that all configurations satisfy it going forward
> (including seastore)
> 2) update the ec implementation to track allocation at the granularity
> of an EC stripe. HashInfo is the natural place to put the
> information, probably? We'll need to also implement ZERO. Radek: I
> know you're looking into EC for crimson, perhaps you can evaluate how
> much work would be required here?

The EC stripe that is referred to here is configurable on a per-pool
basis with the default taken from osd_pool_erasure_code_stripe_unit,
right? If the user configures it to e.g. 16k for a particular pool
(EC profile), how would that interact with the 4k guarantee at the
ObjectStore layer?

Thanks,

                Ilya
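P.S. To make the sparse-read behavior concrete, here is a rough,
untested librados sketch. The pool name "ecpool" and object name
"sparse-test" are placeholders; what it shows is just a client writing
two discontiguous 4k chunks and then issuing a sparse-read, so the
returned extent map can be compared between a replicated bluestore
pool and an EC pool.

// Sketch only: pool/object names are placeholders, error handling is
// minimal. Build against librados (e.g. -lrados, C++17).
#include <rados/librados.hpp>
#include <iostream>
#include <map>
#include <string>

int main() {
  librados::Rados cluster;
  cluster.init("admin");           // client.admin, default keyring
  cluster.conf_read_file(nullptr); // default ceph.conf search path
  if (cluster.connect() < 0) {
    std::cerr << "connect failed" << std::endl;
    return 1;
  }

  librados::IoCtx ioctx;
  if (cluster.ioctx_create("ecpool", ioctx) < 0) { // placeholder pool
    std::cerr << "no such pool" << std::endl;
    return 1;
  }

  const std::string oid = "sparse-test"; // placeholder object name

  // Write two discontiguous 4k chunks: one at offset 0, one at 64k.
  librados::bufferlist chunk;
  chunk.append(std::string(4096, 'a'));
  ioctx.write(oid, chunk, chunk.length(), 0);
  ioctx.write(oid, chunk, chunk.length(), 64 * 1024);

  // Sparse-read the first 128k and dump the returned extent map.
  std::map<uint64_t, uint64_t> extents;
  librados::bufferlist data, out;
  int rval = 0;
  librados::ObjectReadOperation op;
  op.sparse_read(0, 128 * 1024, &extents, &data, &rval);
  ioctx.operate(oid, &op, &out);

  for (const auto& [off, len] : extents)
    std::cout << "extent: off=" << off << " len=" << len << std::endl;
  // Replicated bluestore pool: I'd expect two 4k extents (0 and 65536).
  // EC pool, per the behavior described above: the map covers the full
  // object contents up to the written object size (here 68k).

  cluster.shutdown();
  return 0;
}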