On Wed, 28 Aug 2013, Yan, Zheng wrote:
> On Wed, Aug 28, 2013 at 6:21 AM, Sage Weil <sage@xxxxxxxxxxx> wrote:
> > Hi,
> >
> > On Sat, 24 Aug 2013, Alexandre Oliva wrote:
> >> On Aug 23, 2013, Sage Weil <sage@xxxxxxxxxxx> wrote:
> >>
> >> > FWIW Alexandre, this feature was never really complete. For it to work,
> >> > we also need to snapshot the monitors, and roll them back as well.
> >>
> >> That depends on what's expected from the feature, actually.
> >>
> >> One use is to roll back a single osd, and for that, the feature works
> >> just fine. Of course, for that one doesn't need the multi-osd snapshots
> >> to be mutually consistent, but it's still convenient to be able to take
> >> a global snapshot with a single command.
> >>
> >> Another use is to roll back the entire cluster to an earlier state, and
> >> for that, you *probably* want to roll back the monitors too, although it
> >> doesn't seem like this is strictly necessary, unless some significant
> >> configuration changes occurred in the cluster since the snapshot was
> >> taken, and you want to roll back those too.
> >>
> >> In my experience, rolling back only osds has worked just fine, with the
> >> exception of cases in which the snapshot is much too old, and the mons
> >> have already expired osdmaps after the last one the osd got when the
> >> snapshot was taken. For this one case, I have a patch that enables the
> >> osd to rejoin the cluster in spite of the expired osdmaps, which has
> >> always worked for me, but I understand there may be exceptional cluster
> >> reconfigurations in which this wouldn't have worked.
> >>
> >> As for snapshotting monitors... I suppose the way to go is to start a
> >> store.db dump in background, instead of taking a btrfs snapshot, since
> >> the store.db is not created as a subvolume. That said, it would make
> >> some sense to make it so, to make it trivially snapshottable.
> >>
> >> Anyway, I found a problem in the earlier patch: when I added a new disk
> >> to my cluster this morning, it tried to iterate over osdmaps that were
> >> not available (e.g. the long-gone osdmap 1), and crashed.
> >>
> >> Here's a fixed version, that makes sure we don't start the iteration
> >> before m->get_first().
> >
> > In principle, we can add this back in. I think it needs a few changes,
> > though.
> >
> > First, FileStore::snapshot() needs to pause and drain the workqueue before
> > taking the snapshot, similar to what is done with the sync sequence.
> > Otherwise it isn't a transactionally consistent snapshot and may tear some
> > update. Because it is draining the work queue, it *might* also need to
> > drop some locks, but I'm hopeful that that isn't necessary.
> >
> > Second, the call in handle_osd_map() should probably go in the loop a bit
> > further down that is consuming maps. It probably won't matter most of the
> > time, but I'm paranoid about corner conditions. It also avoids iterating
> > over the new OSDMaps multiple times in the common case where there is no
> > cluster_snap (a minor win).
> >
> > Finally, eventually we should make this do a checkpoint on the mons too.
> > We can add the osd snapping back in first, but before this can/should
> > really be used the mons need to be snapshotted as well. Probably that's
> > just adding in a snapshot() method to MonitorStore.h and doing either a
> > leveldb snap or making a full copy of store.db... I forget what leveldb is
> > capable of here.
>
> I think we also need to snapshot the osd journal.

If the snapshot does a sync (draining op_tp before taking the snap), that puts the filestore subvolume in a consistent state on its own. To actually use the checkpoint, ceph-osd rolls back to it on startup; I haven't checked that code, but I think what it should do at that point is ignore/reset the journal. This is annoying code to test, unfortunately.

sage
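For reference, here is a rough standalone sketch of the pause/drain-then-snapshot ordering discussed above. This is illustrative C++ only, not FileStore code: the WorkQueue class, pause_and_drain(), take_checkpoint(), and the transaction lambdas are simplified stand-ins for the real op_tp/work-queue machinery and the btrfs snapshot ioctl.

// snapshot_sketch.cc -- standalone illustration only, not FileStore code.
// Shows the "pause intake, drain the work queue, take the checkpoint,
// resume" ordering so the snapshot never tears an in-flight transaction.
#include <condition_variable>
#include <functional>
#include <iostream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

class WorkQueue {
public:
  // Submit a transaction; blocks while a checkpoint is draining.
  void queue(std::function<void()> op) {
    std::unique_lock<std::mutex> l(lock_);
    unpaused_.wait(l, [this] { return !paused_; });
    pending_.push(std::move(op));
    cond_.notify_all();
  }

  // Pause intake and wait until every queued and in-flight op has been
  // applied -- the same idea as draining before the sync sequence.
  void pause_and_drain() {
    std::unique_lock<std::mutex> l(lock_);
    paused_ = true;
    drained_.wait(l, [this] { return pending_.empty() && in_flight_ == 0; });
  }

  void resume() {
    std::lock_guard<std::mutex> l(lock_);
    paused_ = false;
    unpaused_.notify_all();
  }

  void worker() {
    for (;;) {
      std::function<void()> op;
      {
        std::unique_lock<std::mutex> l(lock_);
        cond_.wait(l, [this] { return stop_ || !pending_.empty(); });
        if (stop_ && pending_.empty())
          return;
        op = std::move(pending_.front());
        pending_.pop();
        ++in_flight_;
      }
      op();  // apply the transaction to the store
      std::lock_guard<std::mutex> l(lock_);
      if (--in_flight_ == 0 && pending_.empty())
        drained_.notify_all();
    }
  }

  void shutdown() {
    { std::lock_guard<std::mutex> l(lock_); stop_ = true; }
    cond_.notify_all();
  }

private:
  std::mutex lock_;
  std::condition_variable cond_, drained_, unpaused_;
  std::queue<std::function<void()>> pending_;
  int in_flight_ = 0;
  bool paused_ = false;
  bool stop_ = false;
};

// Stand-in for the btrfs snapshot ioctl on the filestore subvolume.
static void take_checkpoint(const std::string& name) {
  std::cout << "checkpoint taken: " << name << "\n";
}

int main() {
  WorkQueue wq;
  std::thread t([&] { wq.worker(); });

  for (int i = 0; i < 10; ++i)
    wq.queue([i] { std::cout << "applied txn " << i << "\n"; });

  wq.pause_and_drain();                 // nothing is mid-apply past here
  take_checkpoint("cluster_snap_foo");  // sees only whole transactions
  wq.resume();

  wq.shutdown();
  t.join();
}

(Builds with g++ -std=c++11 -pthread; the point is only the quiesce/checkpoint/resume ordering, not the mechanism.)

The monitor-side checkpoint could take the same shape in its simplest "full copy of store.db" form, along these lines. Again purely illustrative: the paths, the snapshot suffix, and the assumption that the monitor is quiesced for the duration of the copy are mine, and a leveldb-level snapshot may well turn out to be the better option.

// mon_checkpoint_sketch.cc -- illustration only; paths and the quiesce
// requirement are assumptions, not the monitor's actual interfaces.
#include <filesystem>
#include <iostream>

int main() {
  namespace fs = std::filesystem;
  // Assumed monitor data layout (leveldb files under store.db).
  const fs::path store = "/var/lib/ceph/mon/ceph-a/store.db";
  const fs::path snap  = "/var/lib/ceph/mon/ceph-a/store.db.cluster_snap_foo";

  // The monitor would have to stop applying transactions (its own drain
  // step) for the duration of the copy, or the copy can tear an update.
  std::error_code ec;
  fs::copy(store, snap, fs::copy_options::recursive, ec);
  if (ec) {
    std::cerr << "checkpoint copy failed: " << ec.message() << "\n";
    return 1;
  }
  std::cout << "copied " << store << " -> " << snap << "\n";
}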