On Wed, Aug 28, 2013 at 6:21 AM, Sage Weil <sage@xxxxxxxxxxx> wrote:
> Hi,
>
> On Sat, 24 Aug 2013, Alexandre Oliva wrote:
>> On Aug 23, 2013, Sage Weil <sage@xxxxxxxxxxx> wrote:
>>
>> > FWIW Alexandre, this feature was never really complete. For it to work,
>> > we also need to snapshot the monitors, and roll them back as well.
>>
>> That depends on what's expected from the feature, actually.
>>
>> One use is to roll back a single osd, and for that, the feature works
>> just fine. Of course, for that one doesn't need the multi-osd snapshots
>> to be mutually consistent, but it's still convenient to be able to take
>> a global snapshot with a single command.
>>
>> Another use is to roll back the entire cluster to an earlier state, and
>> for that, you *probably* want to roll back the monitors too, although it
>> doesn't seem like this is strictly necessary, unless some significant
>> configuration changes occurred in the cluster since the snapshot was
>> taken, and you want to roll back those too.
>>
>> In my experience, rolling back only osds has worked just fine, with the
>> exception of cases in which the snapshot is much too old, and the mons
>> have already expired osdmaps after the last one the osd got when the
>> snapshot was taken. For this one case, I have a patch that enables the
>> osd to rejoin the cluster in spite of the expired osdmaps, which has
>> always worked for me, but I understand there may be exceptional cluster
>> reconfigurations in which this wouldn't have worked.
>>
>>
>> As for snapshotting monitors... I suppose the way to go is to start a
>> store.db dump in the background, instead of taking a btrfs snapshot,
>> since the store.db is not created as a subvolume. That said, it would
>> make some sense to make it so, to make it trivially snapshottable.
>>
>>
>> Anyway, I found a problem in the earlier patch: when I added a new disk
>> to my cluster this morning, it tried to iterate over osdmaps that were
>> not available (e.g. the long-gone osdmap 1), and crashed.
>>
>> Here's a fixed version that makes sure we don't start the iteration
>> before m->get_first().
>
> In principle, we can add this back in. I think it needs a few changes,
> though.
>
> First, FileStore::snapshot() needs to pause and drain the workqueue before
> taking the snapshot, similar to what is done with the sync sequence.
> Otherwise it isn't a transactionally consistent snapshot and may tear some
> update. Because it is draining the work queue, it *might* also need to
> drop some locks, but I'm hopeful that that isn't necessary.
>
> Second, the call in handle_osd_map() should probably go in the loop a bit
> further down that is consuming maps. It probably won't matter most of the
> time, but I'm paranoid about corner conditions. It also avoids iterating
> over the new OSDMaps multiple times in the common case where there is no
> cluster_snap (a minor win).
>
> Finally, eventually we should make this do a checkpoint on the mons too.
> We can add the osd snapping back in first, but before this can/should
> really be used the mons need to be snapshotted as well. Probably that's
> just adding in a snapshot() method to MonitorStore.h and doing either a
> leveldb snap or making a full copy of store.db... I forget what leveldb is
> capable of here.
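For illustration only, the clamp Alexandre describes presumably boils down
to something like the sketch below (hypothetical names; MapRange stands in
for the incoming MOSDMap, and newest_known for whatever epoch the OSD
already has on disk):

  #include <algorithm>
  #include <cstdint>

  typedef uint32_t epoch_t;

  struct MapRange {                     // stand-in for the MOSDMap message
    epoch_t first, last;
    epoch_t get_first() const { return first; }
    epoch_t get_last() const { return last; }
  };

  void scan_new_maps(const MapRange& m, epoch_t newest_known)
  {
    // Never start the walk below m.get_first(): anything older may already
    // have been trimmed by the monitors and simply isn't there to decode.
    epoch_t start = std::max<epoch_t>(newest_known + 1, m.get_first());
    for (epoch_t e = start; e <= m.get_last(); e++) {
      // ... decode osdmap e, take the checkpoint if it names a cluster_snap ...
    }
  }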
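On Sage's first point, the ordering is presumably "stop admitting new ops,
wait for the in-flight ones to drain, cut the btrfs snapshot, resume", much
like the sync sequence. A stripped-down sketch of that shape (hypothetical
class and member names, not the real FileStore code):

  #include <condition_variable>
  #include <mutex>
  #include <string>

  class SnapshottableStore {
    std::mutex lock;
    std::condition_variable cond;
    int ops_in_flight = 0;
    bool paused = false;

  public:
    // Worker threads call these around each queued transaction.
    void start_op() {
      std::unique_lock<std::mutex> l(lock);
      cond.wait(l, [this]{ return !paused; });   // block while a snapshot is cutting
      ++ops_in_flight;
    }
    void finish_op() {
      std::lock_guard<std::mutex> l(lock);
      if (--ops_in_flight == 0)
        cond.notify_all();                       // wake a waiting snapshot
    }

    bool take_snapshot(const std::string& name) {
      std::unique_lock<std::mutex> l(lock);
      paused = true;                                        // pause the queue
      cond.wait(l, [this]{ return ops_in_flight == 0; });   // drain in-flight ops
      bool ok = do_snapshot(name);               // subvolume is quiescent here
      paused = false;
      cond.notify_all();                         // resume the workers
      return ok;
    }

  private:
    bool do_snapshot(const std::string&) {
      // e.g. BTRFS_IOC_SNAP_CREATE on the current/ subvolume
      return true;
    }
  };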
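On the last point: plain leveldb has no on-disk checkpoint, but
DB::GetSnapshot() does give a consistent read view, which is enough to take
a full copy of store.db in the background while the monitor keeps writing.
Roughly like this (the helper name and error handling are made up):

  #include <string>
  #include <leveldb/db.h>

  // Copy every key/value from 'src' into a fresh DB at 'dst_path', reading
  // through a leveldb snapshot so the copy is consistent even while the
  // monitor keeps writing to the source store.
  bool dump_store_copy(leveldb::DB* src, const std::string& dst_path)
  {
    const leveldb::Snapshot* snap = src->GetSnapshot();

    leveldb::DB* dst = NULL;
    leveldb::Options opts;
    opts.create_if_missing = true;
    opts.error_if_exists = true;
    if (!leveldb::DB::Open(opts, dst_path, &dst).ok()) {
      src->ReleaseSnapshot(snap);
      return false;
    }

    leveldb::ReadOptions ro;
    ro.snapshot = snap;                        // read the frozen view
    leveldb::Iterator* it = src->NewIterator(ro);
    for (it->SeekToFirst(); it->Valid(); it->Next())
      dst->Put(leveldb::WriteOptions(), it->key(), it->value());

    bool ok = it->status().ok();
    delete it;
    delete dst;
    src->ReleaseSnapshot(snap);
    return ok;
  }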
I think we also need to snapshot the osd journal.

Regards
Yan, Zheng