Re: sparse-read OSD op guarantees

Sam Just <sjust@xxxxxxxxxx> · Wed, 26 Apr 2023 14:50:06 -0700

This came up again at the dev summit at Cephalocon, so I figure it's
worth reviving this thread.

First, I'll try to recap the situation (Ilya, feel free to correct me
here).  My understanding of the issue is that rbd has features (most
notably encryption) which depend on the librados SPARSE_READ operation
reflecting accurately which ranges have been written or trimmed at a
4k granularity.  This appears to work correctly on replicated pools on
bluestore, but erasure coded pools always return the full object
contents up to the object size, including regions the client has not
written to.

I don't think this was originally a guarantee of the interface.  I
think the original guarantee was simply that SPARSE_READ would return
any non-zero regions, not that it was guaranteed not to return
unwritten or trimmed regions.  The OSD does not track this state above
the ObjectStore layer -- SPARSE_READ and MAPEXT both rely directly on
ObjectStore::fiemap.  MAPEXT actually returns -ENOTSUPP on erasure
coded pools.
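
To make the difference between the two guarantees concrete, here is a
minimal Python sketch. It is not librados code: the (offset, length)
extent list plus packed data buffer is an assumed model of a sparse-read
reply, and expand_sparse_read, reported_blocks and BLOCK are hypothetical
names used only for illustration. Both replies expand to the same bytes,
which is all the weaker guarantee promises, but only the first one tells
the client which 4k blocks were actually written.

```python
BLOCK = 4096  # assumed tracking granularity, per the 4k figure above


def expand_sparse_read(extents, data, length):
    """Rebuild a dense buffer from (offset, length) extents and packed data.

    Ranges not covered by any extent are left as zeroes.
    """
    out = bytearray(length)
    pos = 0
    for off, ln in extents:
        if off < 0 or ln < 0 or off + ln > length:
            raise ValueError("extent outside requested range")
        chunk = data[pos:pos + ln]
        if len(chunk) != ln:
            raise ValueError("data buffer shorter than extent map")
        out[off:off + ln] = chunk
        pos += ln
    return bytes(out)


def reported_blocks(extents, block=BLOCK):
    """Return the set of block indices the reply claims contain data."""
    blocks = set()
    for off, ln in extents:
        if ln > 0:
            blocks.update(range(off // block, (off + ln - 1) // block + 1))
    return blocks


payload = b"\xab" * BLOCK

# Reply A: only the written block (index 2) is reported.
exact_extents = [(2 * BLOCK, BLOCK)]
exact_data = payload

# Reply B: the whole range comes back as one extent, zeroes included.
full_extents = [(0, 3 * BLOCK)]
full_data = bytes(2 * BLOCK) + payload
```

A client that decrypts every reported block would touch only block 2 for
reply A, but would also "decrypt" the unwritten blocks 0 and 1 for reply B.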

Adam: the observed behavior is that fiemap on bluestore does
accurately reflect the client's written extents at a 4k granularity.
Is that reliable, or is it a property of only some bluestore
configurations?

As it appears desirable that we actually guarantee this, we probably
want to do two things:
1) codify this guarantee in the ObjectStore interface (4k in all
cases?), and ensure that all configurations satisfy it going forward
(including seastore)
2) update the EC implementation to track allocation at the granularity
of an EC stripe.  HashInfo is the natural place to put the
information, probably?  We'll need to also implement ZERO.  Radek: I
know you're looking into EC for crimson, perhaps you can evaluate how
much work would be required here?
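
As a rough illustration of point 2, the following Python sketch tracks
which stripes of an object have been written. It is only a sketch under
assumptions: StripeAllocMap is a hypothetical name, it is not HashInfo or
any existing Ceph structure, and the stripe width used below is an
arbitrary example value. A write marks every stripe it touches, while a
ZERO clears only the stripes it covers completely, so reported extents
never hide written data.

```python
class StripeAllocMap:
    """Hypothetical per-object record of which EC stripes hold client data."""

    def __init__(self, stripe_width):
        if stripe_width <= 0:
            raise ValueError("stripe_width must be positive")
        self.stripe_width = stripe_width
        self.written = set()  # indices of stripes touched by a write

    def mark_written(self, off, length):
        """Mark every stripe overlapping [off, off + length) as written."""
        if length <= 0:
            return
        first = off // self.stripe_width
        last = (off + length - 1) // self.stripe_width
        self.written.update(range(first, last + 1))

    def mark_zeroed(self, off, length):
        """Clear only the stripes fully covered by [off, off + length)."""
        if length <= 0:
            return
        w = self.stripe_width
        first = -(-off // w)        # first stripe starting at or after off
        end = (off + length) // w   # exclusive: first stripe not fully covered
        self.written.difference_update(range(first, end))

    def extents(self):
        """Return merged (offset, length) ranges of written stripes."""
        w = self.stripe_width
        out = []
        for s in sorted(self.written):
            off = s * w
            if out and out[-1][0] + out[-1][1] == off:
                out[-1] = (out[-1][0], out[-1][1] + w)
            else:
                out.append((off, w))
        return out
```

Note the trade-off this exposes: with a stripe wider than 4k, a 4k write
is reported as a whole stripe, so this alone would not give clients the
4k granularity discussed above.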
-Sam

On Mon, May 2, 2022 at 5:21 PM Sam Just <sjust@xxxxxxxxxx> wrote:
>
> I don't think fiemap was ever intended as anything more than an
> optimization to permit a user to avoid transferring unnecessary
> zeroes.  SeaStore will probably not track sparseness at more than a 4k
> granularity.  I don't think the EC implementation is clever about
> sparse reads/writes at all since that information would probably need
> to be duplicated above the objectstore in the object_info.
> -Sam
>
> On Mon, May 2, 2022 at 7:47 AM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> >
> > On Mon, 2022-05-02 at 16:41 +0200, Ilya Dryomov wrote:
> > > On Mon, May 2, 2022 at 4:22 PM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > > >
> > > > (sorry for the resend, but the first message got rejected by the list because it was from an unsubscribed address)
> > > >
> > > > On Mon, 2022-05-02 at 14:05 +0200, Ilya Dryomov wrote:
> > > > > Hi Sam,
> > > > >
> > > > > I wanted to clarify ObjectStore::fiemap API and sparse-read OSD op
> > > > > guarantees as this came up in Jeff's fscrypt work and just recently in
> > > > > RBD as well.
> > > > >
> > > > > In fscrypt for kcephfs, Jeff has opted to use sparse-read to ensure
> > > > > that file holes (which must contain all zeroes logically) don't get
> > > > > "decrypted" into seemingly random junk.  (Unlike ecryptfs, fscrypt
> > > > > framework doesn't attempt to protect the information about existence
> > > > > and location of holes in files, so logical holes generally correspond
> > > > > to physical holes.)
> > > > >
> > > >
> > > > The fscrypt client infrastructure generally prevents you from reading a
> > > > file when you don't have the key, but you could always analyze the
> > > > backing device and determine where the holes are. The situation with
> > > > cephfs is analogous.
> > >
> > > Yup.
> > >
> > > >
> > > > I imagine this is the same with ecryptfs though. I don't believe it
> > > > fills in the holes when you do a write past the EOF either. Were you
> > > > thinking of LUKS? That operates at the device level, so finding holes
> > > > there is a much different matter.
> > >
> > > I'm pretty sure ecryptfs always fills holes by encrypting logical zeroes and
> > > writing the resulting ciphertext out to the backing filesystem.  Quoting the
> > > FAQ:
> > >
> > >     eCryptfs does not currently support sparse files. Sequences of encrypted
> > >     extents with all 0's could be interpreted as sparse regions in eCryptfs
> > >     without too much implementation complexity. However, this would open up
> > >     a possible attack vector, since the fact that certain segments of data are
> > >     all 0's could betray strategic information that the user does not
> > >     necessarily want to reveal to an attacker. For instance, if the attacker
> > >     knows that a certain database file with patient medical data keeps
> > >     information about viral infections in one region of the file and
> > >     information about diabetes in another section of the file, then the very
> > >     fact that the segment for viral infection data is populated with data at
> > >     all would reveal that the patient has a viral infection.
> > >
> >
> > I stand corrected then! That tends to be pretty horrible for performance
> > though. Prepare to wait for a while if you do create a file and then
> > start writing at the 2G offset.
> >
> > In principle, we could also have the client fill in holes instead. It
> > may be worthwhile to have a mode where it does that. That might also give
> > us a way to support this on non-bluestore pools (if it's not feasible to
> > allow for sparseness there).
> > --
> > Jeff Layton <jlayton@xxxxxxxxxx>
> >