On Jul 25, 2016, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:

> * Right now, we allow users to rename snapshots. (This is newish, so
> you may not be aware of it if you've been using snapshots for a
> while.) Is that an important ability to preserve?

I recall wishing for it back in the early days (0.2?.*), when I tried
to use this feature.

> * If you create a snapshot at "/1/2/foo", you can't delete "/1/2/foo"
> without removing the snapshot. Is that a good interface?

No opinion on that.  I never thought of deleting the roots that I used
to snapshot.

> * If you create a hard link at "/1/2/foo/bar" pointing at "/1/3/bar"
> and then take a snapshot at "/1/2/foo", it *will not* capture the
> file data in bar. Is that okay?

It really isn't.  It was probably the primary reason I decided ceph
snapshots wouldn't work for my purposes (taking snapshots for
long-term backup/archival/reference purposes).  Though back then the
implementation of hardlinks in cephfs was significantly different.

> Doing otherwise is *exceedingly* difficult.

*nod*

I guess it would be easier to handle hardlinks properly if their
implementation were changed again, so that a hard-linked inode ceases
to have a "primary" holding directory and instead turns into a
"directory" of (dir-inode#, name, snap) backlink tuples, plus the
file's inode info proper.  Snapshot-taking could then just snapshot
that "directory", in pretty much the same way it snapshots other
directories, except that we might want to filter out entries
corresponding to directories outside the snapshot once the
snapshot-taking is otherwise complete.

This arrangement would have the extra benefit of bringing hard-link
metadata back to the metadata pools, at the slight cost of one extra
indirect access when accessing the file through the earliest of its
names (nothing guarantees that's the most-used access path anyway; in
my case it almost never is, and data disks are much slower and less
reliable than metadata disks).

> * Creating snapshots is really fast right now: you issue a mkdir, the
> MDS commits one log entry to disk, and we return that it's done. Part
> of that is because we asynchronously notify clients about the new
> snapshot, and they and the MDS asynchronously flush out data for the
> snapshot. Is that good? There's a trade-off with durability (buffered
> data which you might expect to be in the snapshot gets lost if a
> client crashes, despite the snapshot "completing") and with external
> communication channels (you could have multiple clients write data
> they want in the snapshot, take the snapshot, then have a client
> write data it *doesn't* want in the snapshot get written quickly
> enough to be included as part of the snap). Would you rather creating
> a snapshot be slower but force a synchronous write-out of all data to
> disk?

This leads to a more general issue that bugs me a bit, namely the lack
of memory-synchronization primitives such as acquire and release in
the filesystem interface, at least from the command line.  I often
wish there were a way to force a refresh of all filesystem (meta)data
visible from a mountpoint, so that I can make sure what I see there is
what I've just "sync"ed on another client, without having to remount
it.  For example, it's not clear to me whether, if I take a snapshot
from the client with this "stale" view, there's any risk that the
snapshot will NOT contain the data and metadata previously synced on
another client.  (The happens-before relationship might not be visible
to software, as in, it's implemented by a human :-)
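Just to make that wish concrete, here's a rough, untested sketch of
what I'd like to be able to rely on from a single client: flush that
client's own buffered data and metadata, then take the snapshot by
mkdir'ing under the hidden .snap directory.  The mountpoint path and
snapshot name below are made up, and I'm assuming syncfs(2) on the
kernel client is enough to flush the local dirty state; it obviously
says nothing about other clients' buffers, which is the gap I'm
describing.

  #define _GNU_SOURCE             /* for syncfs() and O_DIRECTORY */
  #include <fcntl.h>
  #include <stdio.h>
  #include <sys/stat.h>
  #include <unistd.h>

  int main (void)
  {
    /* Hypothetical cephfs mountpoint and snapshot name.  */
    int dirfd = open ("/mnt/cephfs/1/2/foo", O_RDONLY | O_DIRECTORY);
    if (dirfd < 0)
      { perror ("open"); return 1; }

    /* Flush this client's dirty data and metadata for the filesystem.
       Other clients' buffers are untouched; that's the gap above.  */
    if (syncfs (dirfd) < 0)
      { perror ("syncfs"); return 1; }

    /* mkdir under the hidden .snap directory is how cephfs creates a
       snapshot of this subtree.  */
    if (mkdirat (dirfd, ".snap/pre-backup", 0755) < 0)
      { perror ("mkdirat"); return 1; }

    close (dirfd);
    return 0;
  }

The missing piece is the acquire side: something I could run on
another client to make sure its view includes everything up to that
snapshot, short of remounting.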
As for the unrelated client, it appears from your description that its
write is strictly unordered WRT the snapshot-taking, which is
unfortunate, since it makes the presence or absence of the write in
the snapshot unpredictable.  Ideally, it would at least be atomic:
this might even be required by POSIX semantics, since individual
write()s are atomic, at least up to a certain size.  It should also
obey local happens-before: if one local write makes it into the
snapshot (let's call it the last local write to make the snapshot, for
the sake of the argument), then so do all previously-completed local
writes.  Or, to account for the lack of internal synchronization, at
least those that might have been externally visible before the last
local write that made it (the rocket-launch problem in durable
transactions; it could be the observation of the launch that triggers
the snapshot), and those that locally happen-before that last write.
Other than that, I can't really think of other requirements for
unsynchronized writes to make it into snapshots.

As for things that should NOT go in the snapshot, it's kind of easy
for the client that takes it, but for others, I suggest that observing
the snapshot from such a client ought to be a way to ensure that
whatever happens after the observation doesn't make it in.

Now, I have very little insight into how these informally-stated
requirements would map to ceph's protocols and client implementations,
so they might turn out to be infeasible or too costly, but I hope
they're useful for something ;-)

--
Alexandre Oliva, freedom fighter    http://FSFLA.org/~lxoliva/
You must be the change you wish to see in the world. -- Gandhi
Be Free! -- http://FSFLA.org/   FSF Latin America board member
Free Software Evangelist|Red Hat Brasil   GNU Toolchain Engineer