On Jul 25, 2016, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:

> * Right now, we allow users to rename snapshots. (This is newish, so
> you may not be aware of it if you've been using snapshots for a
> while.) Is that an important ability to preserve?

I recall wishing for it back in the early days (0.2?.*), when I tried
to use this feature.

> * If you create a snapshot at "/1/2/foo", you can't delete "/1/2/foo"
> without removing the snapshot. Is that a good interface?

No opinion on that.  I never thought of deleting the roots that I used
to snapshot.

> * If you create a hard link at "/1/2/foo/bar" pointing at "/1/3/bar"
> and then take a snapshot at "/1/2/foo", it *will not* capture the
> file data in bar. Is that okay?

It really isn't.  It was probably the primary reason I decided ceph
snapshots wouldn't work for my purposes (taking snapshots for
long-term backup/archival/reference purposes).  Though back then the
implementation of hardlinks in cephfs was significantly different.

> Doing otherwise is *exceedingly* difficult.

*nod*

I guess it would be easier to handle hardlinks properly if their
implementation were changed again, so that a hard-linked inode ceases
to have a "primary" holding directory and instead turns into a
"directory" of (dir-inode#, name, snap) backlink tuples, plus the
file's inode info proper.  Snapshot-taking could then just snapshot
that "directory", in pretty much the same way it snapshots other
directories, except that we might want to filter out entries
corresponding to directories outside the snapshot once the
snapshot-taking is otherwise complete.

This arrangement would have the extra benefit of bringing hard-link
metadata back to the metadata pools, at the slight cost of one extra
indirect access when accessing the file through the earliest of its
names (nothing guarantees that's the most-used access path anyway; in
my case it almost never is, and data disks are much slower and less
reliable than metadata disks).

> * Creating snapshots is really fast right now: you issue a mkdir, the
> MDS commits one log entry to disk, and we return that it's done. Part
> of that is because we asynchronously notify clients about the new
> snapshot, and they and the MDS asynchronously flush out data for the
> snapshot. Is that good? There's a trade-off with durability (buffered
> data which you might expect to be in the snapshot gets lost if a
> client crashes, despite the snapshot "completing") and with external
> communication channels (you could have multiple clients write data
> they want in the snapshot, take the snapshot, then have a client
> write data it *doesn't* want in the snapshot get written quickly
> enough to be included as part of the snap). Would you rather creating
> a snapshot be slower but force a synchronous write-out of all data to
> disk?

This leads to a more general issue that bugs me a bit, namely the lack
of memory-synchronization primitives such as acquire and release in
the filesystem interface, at least from the command line.  I often
wish there were a way to force a refresh of all filesystem (meta)data
visible from a mountpoint, so that I can make sure what I see there is
what I've just "sync"ed on another client, without having to remount
it.  For example, it's not clear to me whether, if I take a snapshot
from the client with this "stale" view, there's any risk that the
snapshot will NOT contain the data and metadata previously synced on
another client.  (The happens-before relationship might not be visible
to software, as in, it's implemented by a human :-)
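Just to make that wish concrete, here's a rough, untested sketch of
what I'd like to be able to rely on from a single client: flush that
client's own buffered data and metadata, then take the snapshot by
mkdir'ing under the hidden .snap directory.  The mountpoint path and
snapshot name below are made up, and I'm assuming syncfs(2) on the
kernel client is enough to flush the local dirty state; it obviously
says nothing about other clients' buffers, which is the gap I'm
describing.

  #define _GNU_SOURCE             /* for syncfs() and O_DIRECTORY */
  #include <fcntl.h>
  #include <stdio.h>
  #include <sys/stat.h>
  #include <unistd.h>

  int main (void)
  {
    /* Hypothetical cephfs mountpoint and snapshot name.  */
    int dirfd = open ("/mnt/cephfs/1/2/foo", O_RDONLY | O_DIRECTORY);
    if (dirfd < 0)
      { perror ("open"); return 1; }

    /* Flush this client's dirty data and metadata for the filesystem.
       Other clients' buffers are untouched; that's the gap above.  */
    if (syncfs (dirfd) < 0)
      { perror ("syncfs"); return 1; }

    /* mkdir under the hidden .snap directory is how cephfs creates a
       snapshot of this subtree.  */
    if (mkdirat (dirfd, ".snap/pre-backup", 0755) < 0)
      { perror ("mkdirat"); return 1; }

    close (dirfd);
    return 0;
  }

The missing piece is the acquire side: something I could run on
another client to make sure its view includes everything up to that
snapshot, short of remounting.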
As for the unrelated client, it appears from your description that its
write is strictly unordered WRT the snapshot-taking, which is
unfortunate, since it makes the presence or absence of the write in
the snapshot unpredictable.  Ideally, it would at least be atomic:
this might even be required by POSIX semantics, since individual
write()s are atomic, at least up to a certain size.  It should also
obey local happens-before: if one local write makes it into the
snapshot (let's call it the last local write to make the snapshot, for
the sake of the argument), then so do all previously-completed local
writes.  Or, to account for the lack of internal synchronization, at
least those that might have been externally visible before the last
local write that made it (the rocket-launch problem in durable
transactions; it could be the observation of the launch that triggers
the snapshot), and those that locally happen-before that last write.
Other than that, I can't really think of other requirements for
unsynchronized writes to make it into snapshots.

As for things that should NOT go in the snapshot, it's kind of easy
for the client that takes it, but for others, I suggest that observing
the snapshot from such a client ought to be a way to ensure that
whatever happens after the observation doesn't make it in.

Now, I have very little insight into how these informally-stated
requirements would map to ceph's protocols and client implementations,
so they might turn out to be infeasible or too costly, but I hope
they're useful for something ;-)

--
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist|Red Hat Brasil   GNU Toolchain Engineer