On Wed, 11 Oct 2017, Gregory Farnum wrote:
> On Wed, Oct 11, 2017 at 12:02 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> > On Wed, 11 Oct 2017, Gregory Farnum wrote:
> >> On Wed, Oct 11, 2017 at 9:13 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> >> > I'm working on removing the removed_snaps field from pg_pool_t (and thus the OSDMap) as this can get very large for clusters that have aged and use snapshots.
> >> >
> >> > https://github.com/ceph/ceph/blob/master/src/osd/osd_types.h#L1320
> >> >
> >> > The short version of the plan is to only include recently removed snaps in the map. Once all PGs in a pool have reported that the snap has been trimmed, we can safely retire the snapid from that set.
> >> >
> >> > There are a couple of possible problems related to the fact that the OSD currently sanitizes the SnapContext with every write by removing snapids that appear in removed_snaps. This is meant to deal with race conditions where the IO was submitted before the snap was removed, but the OSD has already logically removed it (or scheduled it for removal).
> >> >
> >> > I see two categories of problems:
> >> >
> >> > 1. The IO was prepared at the librados client before the snap was removed, the IO is delayed (e.g., the client can't connect, or the PG is inactive, etc.) until after the snap is deleted and retired from removed_snaps, and then the IO is processed. This could trigger a clone on the OSD that will then never get cleaned up.
> >> >
> >> > Specifically, a librbd example:
> >> >
> >> > a. librbd prepares an IOContext that includes snap S
> >> > b. librbd initiates a librados IO on that ioc. That request is delayed.
> >> > c. Snap S is removed (someone tells the mon to delete it)
> >> >    - either our librbd client did it, or another one did and we get a notify telling us to refresh our header; doesn't matter which.
> >> > d. S is included in OSDMap removed_snaps.
> >> > e. all OSDs prune S
> >> > f. S is removed from removed_snaps some time later
> >> > g. The request from (b) finally reaches the OSD and triggers a clone S (which will never get cleaned up)
> >> >
> >> > I think we can fix this problem by making Objecter slightly smarter: it can watch OSDMaps it receives and prune snapcs for in-flight requests. In the above scenario, sometime between d and g, when it finally gets a recent OSDMap, it will have pruned S from the request's snapc. If it didn't get the map, and the request is still tagged with an old OSDMap, the OSD can kill the OSD session to force a reconnect/resend. It can do this for any incoming client request tagged with an OSDMap prior to a low-water-mark epoch in the OSDMap, below which older removed_snaps have been pruned (e.g. removed_snaps_pruned_epoch, or some better name).
> >> >
> >> > In the extreme case where a client is disconnected for so long that it can't advance its map to the current one because the mon has trimmed maps, and it has outstanding writes with snapcs, the client can choose to fail the requests with EIO or ESTALE or something, or crash itself, or otherwise behave as if it has been blacklisted/fenced (if it's RBD, it probably has anyway).
> >>
> >> This all sounds good, except... how on earth do we process these snapc trims in a way that doesn't bring us back down to effectively single-threaded transfer speeds?
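For concreteness, a minimal sketch of the Objecter-side pruning described above, written against simplified stand-in types; Op, prune_inflight_snapcs, and new_removed are assumptions for illustration and not the real Objecter op-tracking structures:

    // Illustrative only: when a new OSDMap arrives, drop newly-removed
    // snapids out of the snap contexts of in-flight write ops.  The real
    // Objecter already walks all in-flight ops on every map change.
    #include <algorithm>
    #include <cstdint>
    #include <map>
    #include <set>
    #include <vector>

    using snapid_t = uint64_t;

    struct Op {
      snapid_t snap_seq;             // seq of the op's SnapContext
      std::vector<snapid_t> snaps;   // snapids in the op's SnapContext
    };

    void prune_inflight_snapcs(std::map<uint64_t, Op>& inflight_ops,
                               const std::set<snapid_t>& new_removed)
    {
      if (new_removed.empty())
        return;
      for (auto& [tid, op] : inflight_ops) {
        // drop any snapid the new map says was deleted; the OSD would
        // otherwise have done this same sanitization itself
        op.snaps.erase(
          std::remove_if(op.snaps.begin(), op.snaps.end(),
                         [&](snapid_t s) { return new_removed.count(s) > 0; }),
          op.snaps.end());
      }
    }

Because the snapc lists are short and the per-epoch removed set is small, this walk adds very little on top of the CRUSH remapping check already done per op.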
> >>
> >> We might be able to do something with publishing a list of trimmed snaps and having the dispatch threads check their op snapcs against that shared data, but I'm not sure and we'd need to performance test very carefully...
> >
> > The only Objecter change here is that when we get a new osdmap we would apply removed_snaps_this_epoch to the current in-flight requests. We already walk all in-flight ops on map changes to recheck the CRUSH mapping, and the snapc lists are short, and we can construct the OSDMap so that it has the list of newly-deleted snaps for just that epoch in a new field, so I don't think this will have any measurable impact on performance.
> >
> > Is there a different effect I'm missing?
>
> I was worried about the map walk. I hadn't thought about the CRUSH mapping changes so I guess I don't know how that's implemented, but if it works that's good!
>
> >> > 2. There is a bug, and the librbd image is out of sync: it thinks that snap S still exists but in reality it has been pruned. If this happens, then the librbd client may use S and it may trigger a clone. However, that snap is still referenced from the image, so it will presumably eventually get deleted and cleaned up.
> >> >
> >> > Greg suggested the possibility of a similar CephFS bug, where for example a CephFS client gets confused and continues sending snapcs with removed snaps. I think we can catch this category of bug (and probably #2 as well) if we make the OSD log an error/warning to the cluster log if it gets an incoming request including a deleted snapid when the request is marked with an epoch later than the one in which the snap was deleted. Doing this properly would mean adding the deleted_epoch for each removed snapid to pg_pool_t; maybe worth it, maybe not?
> >>
> >> I don't quite understand; can you expand on this?
> >
> > The annoyingly confusing thing about this is that snapids are kind of like time points, but the set of existing snapids has nothing to do with what we've trimmed from removed_snaps... it's not like we can just drop low snapids out of the set, since the snapid is the snap creation "when" and not the deletion "when."
> >
> > So, if we want to catch buggy clients that try to use snapcontexts with deleted snapids, we need to know which epoch each snapid was deleted in so that we can compare that to the request epoch.
> >
> > I'm not so sure this will work out. For one, there is a sort of race condition when librbd is setting up the snapc itself, since a concurrent client might delete the snap before it refreshes, so that it constructs a snapc with a bad snap after librados sees the osdmap deleting it. :/ Maybe it's not worth worrying about this piece.
>
> Oh, so instead of having just "interval_set<snapid> deleted_snaps", we'd have to add "map<snapid, epoch_t> deleted_snap_epochs" (or equivalent) and then compare those (plus a floor epoch for oldest-known-deletion) to any incoming snapcontext?
>
> And for the warning, we'd have some of the snapids that should have been deleted still around for comparison. So we wouldn't have any way of reliably identifying them all, but we'd expect to see some of them turning up if it's an ongoing problem?

Yeah, exactly. I think this check isn't so bad to implement, but not strictly necessary.
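A rough sketch of what that check could look like, assuming hypothetical deleted_snap_epochs and oldest_deletion_floor fields (neither exists in pg_pool_t today):

    // Illustrative only: warn when a client sends a snapc containing a
    // snapid we know was deleted in an epoch older than the epoch the
    // request itself is marked with.
    #include <cstdint>
    #include <iostream>
    #include <map>
    #include <vector>

    using snapid_t = uint64_t;
    using epoch_t = uint32_t;

    struct pool_snap_deletion_info {
      std::map<snapid_t, epoch_t> deleted_snap_epochs;  // snapid -> epoch deleted in
      epoch_t oldest_deletion_floor = 0;                // older deletions already trimmed
    };

    void check_snapc_against_deletions(const pool_snap_deletion_info& info,
                                       const std::vector<snapid_t>& snapc_snaps,
                                       epoch_t request_epoch)
    {
      for (snapid_t s : snapc_snaps) {
        auto it = info.deleted_snap_epochs.find(s);
        if (it != info.deleted_snap_epochs.end() && request_epoch > it->second) {
          // the client's map was new enough to know s was gone, yet it
          // still sent it: almost certainly a client-side bug
          std::cerr << "client snapc includes snap " << s
                    << " deleted in epoch " << it->second
                    << " but request is marked epoch " << request_epoch << "\n";
        }
        // a snapid missing from deleted_snap_epochs may still have been
        // deleted before oldest_deletion_floor; those we simply cannot
        // check, which is the caveat about not catching them all
      }
    }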
> >> Let me also suggest one other idea that came up when I was discussing snapshots with Patrick, by first laying out the problems we're aware of:
> >> 1) the OSDMap itself might grow too large
> >> 2) comparing very large interval sets of snapIDs is computationally expensive for the OSD. We do this
> >>    a) when processing a new OSDMap against each PG's local list of deleted snaps
> >>    b) when comparing the deleted snaps against incoming snapcontexts (although this is usually not an issue because incoming snapcontexts tend to be pretty small)
> >>
> >> So we want to limit the size of the OSDMap, and we also want to avoid very large comparisons. Why don't we attack these problems individually?
> >>
> >> 1) Switch from a deleted_snaps set to a deleting_snaps set in the OSDMap, and trim it based on per-PG feedback to the manager.
> >> 2) Maintain a *full* deleted_snaps set in the PG info on the OSDs.
> >> 3) Only load into memory the deleted_snaps we reasonably expect to see, *along* with a boundary snapid indicating what range the set is valid for.
> >> 4) If we get a snapid smaller than the range is valid for (in a client's SnapContext, or I suppose in the deleting_snaps map), load more deleted snapids off disk to do the comparison.
> >>
> >> I haven't sketched out all the code paths, but when I skimmed things over I think that snapcontext check (and possible off-disk loading) is actually in an easy-to-wait location. This avoids us needing to change the client wire protocol or introduce more ordering dependencies.
> >>
> >> The biggest downside I can see is that it adds a pretty obvious DoS/resource consumption attack for malicious clients, but we're not exactly immune to those in general. Thoughts?
> >
> > I think this would definitely work. I'm still hoping that we can avoid having the full set of deleted snaps maintained on the OSD at all, though--that will cut out a lot of complexity (and memory (and storage)).
> >
> > FWIW what I'm currently trying is adding new fields to the map for snapids deleted *this* epoch. That makes it easy for Objecter to adjust its requests accordingly. On the OSD, it also makes the interval_set calculations on map updates faster (we already know what the new snapids are).
> >
> > And I think it will open up the possibility of removing removed_snaps entirely from pg_pool_t if the PGs track the same set locally. In fact, I think in the limit we'll end up with each osdmap epoch just containing (1) newly deleted snapids this epoch, and (2) newly purged snapids this epoch. The PGs will maintain the set of deleting snapids (in pg_info_t or similar, sharing it during peering etc.), and we'll keep that window short-ish to prevent it from getting too big. I'm thinking a 1-5 day window... basically, whatever window of laggy clients getting the boot you can tolerate. The OSDs will only be pruning incoming snapcs against that deleting_snaps set (small!), and OSDMaps themselves won't be big (each one will only include newly deleted or purged ids). If a PG falls behind and jumps a map gap, it will backfill anyway and get mopped up that way...
> >
> > But... the cost of eliminating all of that crap is losing the client's ability to have an in-flight request that is days old reconnect and proceed. Fair trade?
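A minimal sketch of that per-epoch delta scheme; the new_removed_snaps/new_purged_snaps fields and the PG-local deleting set are assumptions for illustration, not the actual pg_pool_t/pg_info_t layout:

    // Illustrative only: each map epoch carries two small deltas, and the
    // PG folds them into a local "deleting" set that stays small because
    // purged snaps drop back out of it.
    #include <algorithm>
    #include <cstdint>
    #include <set>
    #include <vector>

    using snapid_t = uint64_t;

    struct pool_epoch_delta {
      std::set<snapid_t> new_removed_snaps;  // deleted by the mon in this epoch
      std::set<snapid_t> new_purged_snaps;   // reported fully trimmed in this epoch
    };

    struct pg_snap_state {
      std::set<snapid_t> deleting;  // deleted but not yet purged (kept small)

      // fold one epoch's delta into the PG-local state
      void apply_delta(const pool_epoch_delta& d) {
        deleting.insert(d.new_removed_snaps.begin(), d.new_removed_snaps.end());
        for (snapid_t s : d.new_purged_snaps)
          deleting.erase(s);
      }

      // sanitize an incoming write's snapc against the (small) deleting set
      void prune_snapc(std::vector<snapid_t>& snapc_snaps) const {
        snapc_snaps.erase(
          std::remove_if(snapc_snaps.begin(), snapc_snaps.end(),
                         [&](snapid_t s) { return deleting.count(s) > 0; }),
          snapc_snaps.end());
      }
    };

The point is that the OSD only ever compares a snapc against the small deleting set, and each OSDMap epoch only carries the two deltas rather than the full removed_snaps history.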
>
> Well, thinking about it more I'm not sure how we could sensibly store a good boundary for keeping some of the deleted snaps out of memory anyway. So I think we have to go this direction. But I want to think about it a bit more because I'm still concerned about the buggy resurrection of snapids.
>
> Maybe we store the full set on disk, cache what we see in-memory, and check validity against the disk on any snapid we don't have cached? :/

Just had another thought last night: the mon can preserve the full history of deletions, by epoch. When the Objecter encounters a map gap it can request the removed_snaps over the gap period from the mon at the same time it's getting the next map (i.e., the oldest full map stored by the mon). Since this is a pretty rare/exceptional thing, I don't worry much about the extra work for the mon, and it avoids the ugly client-must-crash behavior... a laggy client will always be able to catch up.

I updated the plan B notes in the pad to include this bit!

sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html