Re: Sparse file info in filestore not propagated to other OSDs

On Wed, 21 Jun 2017, Piotr Dałek wrote:
> > > > > I tested on a few of our production images and it seems that about
> > > > > 30% is sparse. This will be lost on any cluster-wide event
> > > > > (add/remove nodes, PG grow, recovery).
> > > > > 
> > > > > How is this (or will this be) handled in BlueStore?
> > > > 
> > > > BlueStore exposes the same sparseness metadata that enabling the
> > > > filestore seek hole or fiemap options does, so it won't be a problem
> > > > there.
> > > > 
> > > > I think the only thing that we could potentially add is zero detection
> > > > on writes (so that explicitly writing zeros consumes no space).  We'd
> > > > have to be a bit careful measuring the performance impact of that check
> > > > on non-zero writes.
> > > 
> > > I saw that RBD (librbd) does that - replacing writes with discards when
> > > the buffer contains only zeros. Some code that does the same in librados
> > > could be added, and it shouldn't impact performance much: the current
> > > implementation of mem_is_zero is fast and shouldn't be a big problem.
> > 
> > I'd rather not have librados silently translating requests; I think it
> > makes more sense to do any zero checking in bluestore.  _do_write_small
> > and _do_write_big already break writes into (aligned) chunks; that would
> > be an easy place to add the check.
> 
> That leaves out filestore.
> 
> And while I get your point, doing it at the librados level would reduce
> network usage for zeroed-out regions as well, and the check could be done
> just once, not replica_size times...

In the librbd case I think a client-side check makes sense.

For librados, it's a low level interface with complicated semantics.  
Silently translating a write op to a zero op feels dangerous to me.  
Would a zero range extend the object size, for example?  Or implicitly 
create an object that doesn't exist?  I can't remember.  (It would need to 
match write perfectly for this to be safe.)  The user might also have a 
compound op of multiple operations, which would make swapping one out in 
the middle stranger.  And probably half the librados unit tests would 
stop testing what we thought they were testing.  Etc.
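
To make the "match write perfectly" concern concrete, here is a toy sketch.
ToyObject, its zero() method and same_result() are all made up for
illustration; this is not Ceph's object model, and the zero behaviour (no
create, no extend) is an assumption picked only to show how the two ops
could diverge, not a statement about what the OSD actually does.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Toy in-memory object. Hypothetical; NOT how Ceph stores objects.
struct ToyObject {
    bool exists = false;
    std::string data;

    // A write creates the object and extends it as needed.
    void write(std::size_t off, const std::string& buf) {
        exists = true;
        if (data.size() < off + buf.size())
            data.resize(off + buf.size(), '\0');
        std::copy(buf.begin(), buf.end(), data.begin() + off);
    }

    // ASSUMPTION for illustration: zero neither creates nor extends.
    void zero(std::size_t off, std::size_t len) {
        if (!exists || off >= data.size())
            return;
        std::size_t end = std::min(data.size(), off + len);
        std::fill(data.begin() + off, data.begin() + end, '\0');
    }
};

// True if writing `len` zero bytes at `off` and zeroing the same range
// leave copies of `start` in the same visible state.
bool same_result(const ToyObject& start, std::size_t off, std::size_t len) {
    ToyObject a = start, b = start;
    a.write(off, std::string(len, '\0'));
    b.zero(off, len);
    return a.exists == b.exists && a.data == b.data;
}
```

Under those assumed semantics the swap is only safe when the range lies
inside an object that already exists; for a missing object, or a range past
the end, same_result() returns false.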

It seems more natural to do this a layer up in librbd or rgw...
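
Roughly what I have in mind for the higher layer, as a minimal sketch only.
Extent, all_zero() and classify() are hypothetical names, not existing
librbd or rgw code, and the chunking here is relative to the start of the
buffer rather than to any on-disk alignment.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// One run of the caller's buffer, tagged as all-zero or carrying data.
struct Extent {
    std::size_t off;
    std::size_t len;
    bool zero;
};

// True if every byte in [p, p + len) is zero.
bool all_zero(const unsigned char* p, std::size_t len) {
    return std::all_of(p, p + len, [](unsigned char c) { return c == 0; });
}

// Scan the buffer in `chunk`-sized pieces and merge neighbouring pieces
// of the same kind, so the caller gets the fewest possible extents.
std::vector<Extent> classify(const unsigned char* buf, std::size_t len,
                             std::size_t chunk) {
    std::vector<Extent> out;
    if (chunk == 0)
        return out;  // nothing sensible to do with a zero chunk size
    for (std::size_t off = 0; off < len; off += chunk) {
        std::size_t n = std::min(chunk, len - off);
        bool z = all_zero(buf + off, n);
        if (!out.empty() && out.back().zero == z)
            out.back().len += n;
        else
            out.push_back(Extent{off, n, z});
    }
    return out;
}
```

The caller would then issue a discard for each extent marked zero and a
normal write for the rest. That layer knows its own semantics (image size,
whether the object must exist), which is why the translation seems safe
there but not in librados.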

sage
