Re: sparse-read OSD op guarantees

Jeff Layton <jlayton@xxxxxxxxxx> · Thu, 27 Apr 2023 13:46:13 -0400

On Wed, 2023-04-26 at 14:50 -0700, Sam Just wrote:
> This came up again in the dev summit at cephalocon, so I figure it's
> worth reviving this thread.
> 
> First, I'll try to recap the situation (Ilya, feel free to correct me
> here).  My understanding of the issue is that rbd has features (most
> notably encryption) which depend on the librados SPARSE_READ operation
> reflecting accurately which ranges have been written or trimmed at a
> 4k granularity.  This appears to work correctly on replicated pools on
> bluestore, but erasure coded pools always return the full object
> contents up to the object size including regions the client has not
> written to.
> 
> I don't think this was originally a guarantee of the interface.  I
> think the original guarantee was simply that SPARSE_READ would return
> any non-zero regions, not that it was guaranteed not to return
> unwritten or trimmed regions.  The OSD does not track this state above
> the ObjectStore layer -- SPARSE_READ and MAPEXT both rely directly on
> ObjectStore::fiemap.  MAPEXT actually returns -ENOTSUPP on erasure
> coded pools.
> 
> Adam: the observed behavior is that fiemap on bluestore does
> accurately reflect the client's written extents at a 4k granularity.
> Is that reliable, or is it a property of only some bluestore
> configurations?
> 
> As it appears desirable that we actually guarantee this, we probably
> want to do two things:
> 1) codify this guarantee in the ObjectStore interface (4k in all
> cases?), and ensure that all configurations satisfy it going forward
> (including seastore)
> 2) update the ec implementation to track allocation at the granularity
> of an EC stripe.  HashInfo is the natural place to put the
> information, probably?  We'll need to also implement ZERO.  Radek: I
> know you're looking into EC for crimson, perhaps you can evaluate how
> much work would be required here?
> -Sam
> 

The reason we need this is that with the advent of fscrypt, ceph becomes
(for all intents and purposes) a block-based filesystem. We have to
encrypt and decrypt the data in blocks. fscrypt does allow you to choose
the blocksize though.

We chose 4k blocks for the fscrypt prototype because...well, no real
reason really other than it lines up nicely with 4k page-based buffered
I/O that the VFS layer is handing down on commodity hw.

In theory we could handle other blocksizes, though ideally we'd want to
keep the blocksize a power of two.
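
To make the dependency concrete, here is a rough sketch of what a client
has to do with a sparse-read reply. It is not the kcephfs code: the extent
list shape, the function names and the XOR "cipher" are all made up for
illustration. The point is only that blocks not covered by a returned
extent are left as zeroes and never decrypted, and that a power-of-two
blocksize turns the offset-to-block mapping into a shift.

```python
BLOCK_SIZE = 4096  # assumed crypto block size; must be a power of two


def blocks_to_decrypt(extents, length, block_size=BLOCK_SIZE):
    """Return sorted indices of blocks touched by any extent in [0, length)."""
    if block_size <= 0 or block_size & (block_size - 1):
        raise ValueError("block_size must be a power of two")
    shift = block_size.bit_length() - 1  # offset >> shift == block index
    blocks = set()
    for off, ln in extents:
        end = min(off + ln, length)
        if ln <= 0 or off >= end:
            continue
        blocks.update(range(off >> shift, ((end - 1) >> shift) + 1))
    return sorted(blocks)


def assemble_plaintext(extents, data, length, decrypt_block,
                       block_size=BLOCK_SIZE):
    """Rebuild a read buffer from a (hypothetical) sparse-read reply.

    extents: list of (offset, length) pairs, in the order their bytes
             appear in `data` (the concatenated extent contents).
    Holes stay zero-filled; only blocks backed by an extent are decrypted.
    """
    buf = bytearray(length)
    pos = 0
    for off, ln in extents:
        chunk = data[pos:pos + ln]
        if len(chunk) != ln:
            raise ValueError("data is shorter than the extent map claims")
        pos += ln
        keep = max(0, min(ln, length - off))  # clip to the requested range
        buf[off:off + keep] = chunk[:keep]
    for idx in blocks_to_decrypt(extents, length, block_size):
        start = idx * block_size
        end = min(start + block_size, length)
        buf[start:end] = decrypt_block(idx, bytes(buf[start:end]))
    return bytes(buf)


def toy_xor_block(index, block):
    """Stand-in for a real block cipher, for illustration only."""
    return bytes(b ^ 0xFF for b in block)
```

If the OSD instead hands back an unwritten region as an extent full of
zeroes (which is what Sam describes for erasure coded pools), that block
lands in blocks_to_decrypt() and gets "decrypted" into junk.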

> On Mon, May 2, 2022 at 5:21 PM Sam Just <sjust@xxxxxxxxxx> wrote:
> > 
> > I don't think fiemap was ever intended as anything more than an
> > optimization to permit a user to avoid transferring unnecessary
> > zeroes.  SeaStore will probably not track sparseness at more than a 4k
> > granularity.  I don't think the EC implementation is clever about
> > sparse reads/writes at all since that information would probably need
> > to be duplicated above the objectstore in the object_info.
> > -Sam
> > 
> > On Mon, May 2, 2022 at 7:47 AM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > > 
> > > On Mon, 2022-05-02 at 16:41 +0200, Ilya Dryomov wrote:
> > > > On Mon, May 2, 2022 at 4:22 PM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > > > > 
> > > > > (sorry for the resend, but the first message got rejected by the list because it was from an unsubscribed address)
> > > > > 
> > > > > On Mon, 2022-05-02 at 14:05 +0200, Ilya Dryomov wrote:
> > > > > > Hi Sam,
> > > > > > 
> > > > > > I wanted to clarify ObjectStore::fiemap API and sparse-read OSD op
> > > > > > guarantees as this came up in Jeff's fscrypt work and just recently in
> > > > > > RBD as well.
> > > > > > 
> > > > > > In fscrypt for kcephfs, Jeff has opted to use sparse-read to ensure
> > > > > > that file holes (which must contain all zeroes logically) don't get
> > > > > > "decrypted" into seemingly random junk.  (Unlike ecryptfs, fscrypt
> > > > > > framework doesn't attempt to protect the information about existence
> > > > > > and location of holes in files, so logical holes generally correspond
> > > > > > to physical holes.)
> > > > > > 
> > > > > 
> > > > > The fscrypt client infrastructure generally prevents you from reading a
> > > > > file when you don't have the key, but you could always analyze the
> > > > > backing device and determine where the holes are. The situation with
> > > > > cephfs is analogous.
> > > > 
> > > > Yup.
> > > > 
> > > > > 
> > > > > I imagine this is the same with ecryptfs though. I don't believe it
> > > > > fills in the holes when you do a write past the EOF either. Were you
> > > > > thinking of LUKS? That operates at the device level, so finding holes
> > > > > there is a much different matter.
> > > > 
> > > > I'm pretty sure ecryptfs always fills holes by encrypting logical zeroes and
> > > > writing the resulting ciphertext out to the backing filesystem.  Quoting the
> > > > FAQ:
> > > > 
> > > >     eCryptfs does not currently support sparse files. Sequences of encrypted
> > > >     extents with all 0's could be interpreted as sparse regions in eCryptfs
> > > >     without too much implementation complexity. However, this would open up
> > > >     a possible attack vector, since the fact that certain segments of data are
> > > >     all 0's could betray strategic information that the user does not
> > > >     necessarily want to reveal to an attacker. For instance, if the attacker
> > > >     knows that a certain database file with patient medical data keeps
> > > >     information about viral infections in one region of the file and
> > > >     information about diabetes in another section of the file, then the very
> > > >     fact that the segment for viral infection data is populated with data at
> > > >     all would reveal that the patient has a viral infection.
> > > > 
> > > 
> > > I stand corrected then! That tends to be pretty horrible for performance
> > > though. Prepare to wait for a while if you do create a file and then
> > > start writing at the 2G offset.
> > > 
> > > In principle, we could also have the client fill in holes instead. It
> > > may be worthwhile to have a mode where it does that. That might also give
> > > us a way to support this on non-bluestore pools (if it's not feasible to
> > > allow for sparseness there).
> > > --
> > > Jeff Layton <jlayton@xxxxxxxxxx>
> > > 
> 

-- 
Jeff Layton <jlayton@xxxxxxxxxx>