On Thu, Jul 28, 2016 at 2:44 PM, Alexandre Oliva <oliva@xxxxxxx> wrote:
> On Jul 25, 2016, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>
>> * Right now, we allow users to rename snapshots. (This is newish, so
>> you may not be aware of it if you've been using snapshots for a
>> while.) Is that an important ability to preserve?
>
> I recall wishing for it back in the early days (0.2?.*), when I tried to
> use this feature.
>
>> * If you create a snapshot at "/1/2/foo", you can't delete "/1/2/foo"
>> without removing the snapshot. Is that a good interface?
>
> No opinion on that. I never thought of deleting the roots that I used
> to snapshot.
>
>> * If you create a hard link at "/1/2/foo/bar" pointing at "/1/3/bar"
>> and then take a snapshot at "/1/2/foo", it *will not* capture the file
>> data in bar. Is that okay?
>
> It really isn't. It was probably the primary reason I decided ceph
> snapshots wouldn't work for my purposes (taking snapshots for long-term
> backup/archival/reference purposes). Though back then the
> implementation of hardlinks in cephfs was significantly different.
>
>> Doing otherwise is *exceedingly* difficult.
>
> *nod*
>
> I guess it would be easier to handle hardlinks properly if their
> implementation was changed again, so that a hard-linked inode ceases to
> have a "primary" holding directory, and turns into a "directory" of
> (dir-inode#, name, snap) backlink tuples, plus the file's inode info
> proper. Snapshot-taking could then just snapshot that "directory", in
> pretty much the same way it snapshots other directories; except we might
> want to filter out entries corresponding to directories outside the
> snapshot once the snapshot taking is otherwise completed.
>
> This arrangement would have the extra benefit of bringing hard-link
> metadata back to metadata pools, at the slight cost of one extra
> indirect access when accessing the file with the earliest of its names
> (nothing guarantees that's the most used access path anyway; in my case,
> it almost never is, and data disks are much slower and less reliable
> than metadata disks).

Yeah... My instinct is that hard links need to be snapshotted however
the internal implementation works. It's just *really hard*. The very
best solution I've been able to nod towards so far basically brings
back all the non-working parts of the "past parent" logic I talk about
over on ceph-devel, which we're going to be totally eliminating. :/ It
*might* work better for hard links, but... I wouldn't count on it.

>
>> * Creating snapshots is really fast right now: you issue a mkdir, the
>> MDS commits one log entry to disk, and we return that it's done. Part
>> of that is because we asynchronously notify clients about the new
>> snapshot, and they and the MDS asynchronously flush out data for the
>> snapshot. Is that good? There's a trade-off with durability (buffered
>> data which you might expect to be in the snapshot gets lost if a
>> client crashes, despite the snapshot "completing") and with external
>> communication channels (you could have multiple clients write data
>> they want in the snapshot, take the snapshot, then have a client write
>> data it *doesn't* want in the snapshot get written quickly enough to
>> be included as part of the snap). Would you rather creating a snapshot
>> be slower but force a synchronous write-out of all data to disk?
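For context, the mkdir-based interface discussed above is the hidden
".snap" directory inside the directory being snapshotted. A minimal
sketch of the operations mentioned in this thread (create, rename,
remove), assuming a CephFS mount with the default ".snap" snapdir name;
the /mnt/cephfs paths and snapshot names here are hypothetical:

    import os

    # Hypothetical CephFS directory to snapshot; assumes snapshots are
    # enabled on the cluster and the default ".snap" snapdir name.
    target = "/mnt/cephfs/1/2/foo"

    # Taking a snapshot is a mkdir inside the hidden ".snap" directory.
    os.mkdir(os.path.join(target, ".snap", "before-cleanup"))

    # Renaming a snapshot (the newish ability mentioned above) is a
    # rename of that entry under ".snap".
    os.rename(os.path.join(target, ".snap", "before-cleanup"),
              os.path.join(target, ".snap", "2016-07-28"))

    # Removing a snapshot is an rmdir of the entry; as noted above, the
    # snapshotted directory itself can't be removed while snapshots of
    # it still exist.
    os.rmdir(os.path.join(target, ".snap", "2016-07-28"))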
>
> This leads to a more general issue that bugs me a bit, which is the lack
> of such memory synchronization primitives as acquire and release in the
> filesystem interface, at least from a command-line interface. I often
> wish there was a way to force a refresh of all filesystem (meta)data
> visible from a mountpoint, so that I can make sure what I see there is
> what I've just "sync"ed on another client, without having to remount it.
> It's not clear to me whether, for example, if I take a snapshot from
> the client with this "stale" view, there's any risk that the snapshot
> will NOT contain the data and metadata previously synced on another
> client. (the happens-before relationship might not be visible to
> software, as in, it's implemented by a human :-)
>
> As for the unrelated client, it appears from your description that it's
> strictly unordered WRT the snapshot-taking, which is unfortunate, since
> it does make the presence or absence of the write in the snapshot
> unpredictable. Ideally, it would at least be atomic. This might even
> be required by POSIX semantics, since individual write()s are atomic,
> at least up to a certain size. It should obey local happens-before: if
> one local write makes it (let's call it the last local write to make
> the snapshot, for the sake of the argument), then so do all
> previously-completed local writes. Or, to account for lack of internal
> synchronization, at least those that might have been externally visible
> before the last local write that made it (the rocket-launch problem in
> durable transactions; it could be the observation of the launch that
> triggers the snapshot) and those that locally happen-before that last
> write. Other than that, I can't really think of other requirements for
> unsynchronized writes to make it into snapshots.

Well, anything that's been synced to disk prior to a snapshot being
created definitely ends up in the snapshot. And you shouldn't need to
do any kind of synchronization of your own in order to see the latest
updates -- just do an "ls" or similar.

> As for things that should NOT go in the snapshot, it's kind of easy for
> the client that takes it, but for others, I suggest observing the
> snapshot from the client ought to be a way to ensure whatever
> happens-after the observation doesn't make it.

That actually is the case -- if you manage to view the existence of a
snapshot, anything you do will reflect that snapshot's existence. This
is more in the way of a theoretical problem that could be worked around
by anybody who knows and cares.

-Greg
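A minimal sketch of the sync-before-snapshot guarantee described above
(data fsync'ed before the snapshot mkdir is captured in it), again
assuming a CephFS mount with the default ".snap" snapdir name; the
paths, file name, and snapshot name are hypothetical:

    import os

    # Hypothetical CephFS paths.
    target = "/mnt/cephfs/1/2/foo"
    data_path = os.path.join(target, "important.dat")

    # Write and explicitly sync the data first ...
    fd = os.open(data_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, b"state we want captured in the snapshot\n")
        os.fsync(fd)   # synced to disk before the snapshot is created
    finally:
        os.close(fd)

    # ... then take the snapshot. Per the guarantee above, the fsync'ed
    # data is included; buffered-but-unsynced writes (here or on other
    # clients) may or may not be.
    os.mkdir(os.path.join(target, ".snap", "after-sync"))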