On Thu, Jul 28, 2016 at 2:44 PM, Alexandre Oliva <oliva@xxxxxxx> wrote:
> On Jul 25, 2016, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>
>> * Right now, we allow users to rename snapshots. (This is newish, so
>> you may not be aware of it if you've been using snapshots for a
>> while.) Is that an important ability to preserve?
>
> I recall wishing for it back in the early days (0.2?.*), when I tried to
> use this feature.
>
>> * If you create a snapshot at "/1/2/foo", you can't delete "/1/2/foo"
>> without removing the snapshot. Is that a good interface?
>
> No opinion on that. I never thought of deleting the roots that I used
> to snapshot.
>
>> * If you create a hard link at "/1/2/foo/bar" pointing at "/1/3/bar"
>> and then take a snapshot at "/1/2/foo", it *will not* capture the file
>> data in bar. Is that okay?
>
> It really isn't. It was probably the primary reason I decided ceph
> snapshots wouldn't work for my purposes (taking snapshots for long-term
> backup/archival/reference purposes). Though back then the
> implementation of hardlinks in cephfs was significantly different.
>
>> Doing otherwise is *exceedingly* difficult.
>
> *nod*
>
> I guess it would be easier to handle hardlinks properly if their
> implementation was changed again, so that a hard-linked inode ceases to
> have a "primary" holding directory, and turns into a "directory" of
> (dir-inode#, name, snap) backlink tuples, plus the file's inode info
> proper. Snapshot-taking could then just snapshot that "directory", in
> pretty much the same way it snapshots other directories; except we might
> want to filter out entries corresponding to directories outside the
> snapshot once the snapshot taking is otherwise completed.
>
> This arrangement would have the extra benefit of bringing hard-link
> metadata back to metadata pools, at the slight cost of one extra
> indirect access when accessing the file with the earliest of its names
> (nothing guarantees that's the most used access path anyway; in my case,
> it almost never is, and data disks are much slower and less reliable
> than metadata disks).

Yeah... My instinct is that hard links need to be snapshotted however
the internal implementation works. It's just *really hard*. The very
best solution I've been able to nod towards so far basically brings
back all the non-working parts of the "past parent" logic I talk about
over on ceph-devel, which we're going to be totally eliminating. :/ It
*might* work better for hard links, but... I wouldn't count on it.

>
>> * Creating snapshots is really fast right now: you issue a mkdir, the
>> MDS commits one log entry to disk, and we return that it's done. Part
>> of that is because we asynchronously notify clients about the new
>> snapshot, and they and the MDS asynchronously flush out data for the
>> snapshot. Is that good? There's a trade-off with durability (buffered
>> data which you might expect to be in the snapshot gets lost if a
>> client crashes, despite the snapshot "completing") and with external
>> communication channels (you could have multiple clients write data
>> they want in the snapshot, take the snapshot, then have a client write
>> data it *doesn't* want in the snapshot get written quickly enough to
>> be included as part of the snap). Would you rather creating a snapshot
>> be slower but force a synchronous write-out of all data to disk?
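For context, the mkdir-based interface discussed above is the hidden
".snap" directory inside the directory being snapshotted. A minimal
sketch of the operations mentioned in this thread (create, rename,
remove), assuming a CephFS mount with the default ".snap" snapdir name;
the /mnt/cephfs paths and snapshot names here are hypothetical:

    import os

    # Hypothetical CephFS directory to snapshot; assumes snapshots are
    # enabled on the cluster and the default ".snap" snapdir name.
    target = "/mnt/cephfs/1/2/foo"

    # Taking a snapshot is a mkdir inside the hidden ".snap" directory.
    os.mkdir(os.path.join(target, ".snap", "before-cleanup"))

    # Renaming a snapshot (the newish ability mentioned above) is a
    # rename of that entry under ".snap".
    os.rename(os.path.join(target, ".snap", "before-cleanup"),
              os.path.join(target, ".snap", "2016-07-28"))

    # Removing a snapshot is an rmdir of the entry; as noted above, the
    # snapshotted directory itself can't be removed while snapshots of
    # it still exist.
    os.rmdir(os.path.join(target, ".snap", "2016-07-28"))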
>
> This leads to a more general issue that bugs me a bit, which is the lack
> of such memory synchronization primitives as acquire and release in the
> filesystem interface, at least from a command-line interface. I often
> wish there was a way to force a refresh of all filesystem (meta)data
> visible from a mountpoint, so that I can make sure what I see there is
> what I've just "sync"ed on another client, without having to remount it.
> It's not clear to me whether, for example, if I take a snapshot from
> the client with this "stale" view, there's any risk that the snapshot
> will NOT contain the data and metadata previously synced on another
> client. (the happens-before relationship might not be visible to
> software, as in, it's implemented by a human :-)
>
> As for the unrelated client, it appears from your description that it's
> strictly unordered WRT the snapshot-taking, which is unfortunate, since
> it does make the presence or absence of the write in the snapshot
> unpredictable. Ideally, it would at least be atomic. This might even
> be required by POSIX semantics, since individual write()s are atomic,
> at least up to a certain size. It should obey local happens-before: if
> one local write makes it (let's call it the last local write to make
> the snapshot, for the sake of the argument), then so do all
> previously-completed local writes. Or, to account for lack of internal
> synchronization, at least those that might have been externally visible
> before the last local write that made it (the rocket-launch problem in
> durable transactions; it could be the observation of the launch that
> triggers the snapshot) and those that locally happen-before that last
> write. Other than that, I can't really think of other requirements for
> unsynchronized writes to make it into snapshots.

Well, anything that's been synced to disk prior to a snapshot being
created definitely ends up in the snapshot. And you shouldn't need to
do any kind of synchronization of your own in order to see the latest
updates -- just do an "ls" or similar.

> As for things that should NOT go in the snapshot, it's kind of easy for
> the client that takes it, but for others, I suggest observing the
> snapshot from the client ought to be a way to ensure whatever
> happens-after the observation doesn't make it.

That actually is the case -- if you manage to view the existence of a
snapshot, anything you do will reflect that snapshot's existence. This
is more in the way of a theoretical problem that could be worked around
by anybody who knows and cares.

-Greg
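A minimal sketch of the sync-before-snapshot guarantee described above
(data fsync'ed before the snapshot mkdir is captured in it), again
assuming a CephFS mount with the default ".snap" snapdir name; the
paths, file name, and snapshot name are hypothetical:

    import os

    # Hypothetical CephFS paths.
    target = "/mnt/cephfs/1/2/foo"
    data_path = os.path.join(target, "important.dat")

    # Write and explicitly sync the data first ...
    fd = os.open(data_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, b"state we want captured in the snapshot\n")
        os.fsync(fd)   # synced to disk before the snapshot is created
    finally:
        os.close(fd)

    # ... then take the snapshot. Per the guarantee above, the fsync'ed
    # data is included; buffered-but-unsynced writes (here or on other
    # clients) may or may not be.
    os.mkdir(os.path.join(target, ".snap", "after-sync"))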