I'm working on removing the removed_snaps field from pg_pool_t (and thus the OSDMap), as this can get very large for clusters that have aged and use snapshots.

https://github.com/ceph/ceph/blob/master/src/osd/osd_types.h#L1320

The short version of the plan is to only include recently removed snaps in the map. Once all PGs in a pool have reported that a snap has been trimmed, we can safely retire that snapid from the set.

There are a couple of possible problems related to the fact that the OSD currently sanitizes the SnapContext on every write by removing snapids that appear in removed_snaps. This is meant to deal with race conditions where the IO was submitted before the snap was removed, but the OSD has already logically removed it (or scheduled it for removal). I see two categories of problems:

1. The IO was prepared at the librados client before the snap was removed, the IO is delayed (e.g., the client can't connect, the PG is inactive, etc.) until after the snap is deleted and retired from removed_snaps, and then the IO is processed. This could trigger a clone on the OSD that will then never get cleaned up. Specifically, a librbd example:

a. librbd prepares an IOContext that includes snap S
b. librbd initiates a librados IO on that ioc. That request is delayed.
c. Snap S is removed (someone tells the mon to delete it) - either our librbd client did it, or another one did and we get a notify telling us to refresh our header; doesn't matter which.
d. S is included in the OSDMap removed_snaps.
e. All OSDs prune S.
f. S is removed from removed_snaps some time later.
g. The request from (b) finally reaches the OSD and triggers a clone for S (which will never get cleaned up).

I think we can fix this problem by making the Objecter slightly smarter: it can watch the OSDMaps it receives and prune the snapcs of in-flight requests (rough sketch at the end of this mail). In the above scenario, sometime between (d) and (g), when it finally gets a recent OSDMap, it will have pruned S from the request's snapc.

If it didn't get the map, and the request is still tagged with an old OSDMap, the OSD can kill the OSD session to force a reconnect/resend. It can do this for any incoming client request tagged with an OSDMap prior to a low-water-mark epoch in the OSDMap up to which older removed_snaps have been pruned (e.g. removed_snaps_pruned_epoch, or some better name).

In the extreme case where a client is disconnected for so long that it can't advance its map to the current one because the mon has trimmed maps, and it has outstanding writes with snapcs, the client can choose to fail those requests with EIO or ESTALE or something, or crash itself, or otherwise behave as if it has been blacklisted/fenced (if it's RBD, it probably has been anyway).

2. There is a bug and the librbd image is out of sync: it thinks that snap S still exists but in reality it has been pruned. If this happens, the librbd client may use S and may trigger a clone. However, that snap is still referenced from the image, so it will presumably eventually get deleted and cleaned up.

Greg suggested the possibility of a similar CephFS bug, where for example a CephFS client gets confused and continues sending snapcs with removed snaps.

I think we can catch this category of bug (and probably #1 as well) if we make the OSD log an error/warning to the cluster log when it gets an incoming request that includes a deleted snapid and the request is marked with an epoch after the one in which the snap was deleted.

Some rough sketches of these pieces follow.
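First, a toy sketch of the bookkeeping the plan implies, just to make "retire the snapid once all PGs have trimmed it" concrete. All names here are hypothetical and I'm using std:: containers instead of interval_set; this is not a proposal for the actual pg_pool_t/OSDMap representation.

  // Toy sketch: a snap enters recently_removed when it is deleted, and is
  // retired from that set once every PG in the pool reports trimming it.
  // Names and containers are hypothetical stand-ins, not Ceph types.
  #include <cstdint>
  #include <map>
  #include <set>

  using snapid_t = uint64_t;
  using pg_id_t = uint32_t;

  struct PoolRemovedSnapTracker {
    std::set<pg_id_t> all_pgs;                      // PGs in the pool
    std::map<snapid_t, std::set<pg_id_t>> pending;  // snap -> PGs still to trim
    std::set<snapid_t> recently_removed;            // what the map would carry

    void snap_removed(snapid_t s) {
      recently_removed.insert(s);
      pending[s] = all_pgs;        // every PG must report the trim
    }

    void pg_reported_trimmed(pg_id_t pg, snapid_t s) {
      auto it = pending.find(s);
      if (it == pending.end())
        return;
      it->second.erase(pg);
      if (it->second.empty()) {    // all PGs trimmed it -> retire the snapid
        pending.erase(it);
        recently_removed.erase(s);
      }
    }
  };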
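Second, the Objecter-side pruning for category 1. A minimal, self-contained sketch of the idea (again hypothetical names, std::set instead of the real interval_set): when a new OSDMap arrives, walk the in-flight writes and drop any snapid the map says was removed, so a delayed request can no longer resurrect a deleted snap.

  // Sketch: prune removed snapids from in-flight requests' snapcs when a
  // new OSDMap arrives.  Types are stand-ins for the real Ceph structures.
  #include <cstdint>
  #include <set>
  #include <vector>
  #include <algorithm>

  using snapid_t = uint64_t;

  struct SnapContext {
    snapid_t seq = 0;              // highest snap seq the client has seen
    std::vector<snapid_t> snaps;   // existing snaps, newest first
  };

  struct PoolSnapInfo {
    std::set<snapid_t> removed_snaps;  // recently removed snaps from the map
  };

  // Drop any snapid the new map says was removed; return true if modified.
  bool prune_snapc(SnapContext& snapc, const PoolSnapInfo& pool) {
    auto new_end = std::remove_if(
        snapc.snaps.begin(), snapc.snaps.end(),
        [&](snapid_t s) { return pool.removed_snaps.count(s) > 0; });
    bool changed = (new_end != snapc.snaps.end());
    snapc.snaps.erase(new_end, snapc.snaps.end());
    return changed;
  }

  // On OSDMap update, prune the snapc of every in-flight write op.
  void handle_osdmap_update(std::vector<SnapContext*>& inflight_writes,
                            const PoolSnapInfo& pool) {
    for (auto* snapc : inflight_writes) {
      prune_snapc(*snapc, pool);
    }
  }

The real removed_snaps is an interval_set rather than a std::set, but the filtering idea is the same.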
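Third, the OSD-side gate for clients that never got the newer map. This is only the decision logic; the request struct, epoch_t, and the session handling are stand-ins. The point is that a write carrying a snapc and tagged with a map older than the low-water-mark epoch can't be trusted, so the OSD kicks the session and forces a resend against a newer map.

  // Sketch of the low-water-mark check described above.
  #include <cstdint>

  using epoch_t = uint32_t;

  enum class ClientWriteAction {
    PROCESS,       // epoch recent enough; snapc can be sanitized and trusted
    KILL_SESSION,  // force reconnect/resend so the client gets a newer map
  };

  // removed_snaps_pruned_epoch: epoch before which removed_snaps entries may
  // already have been retired, so old snapcs can no longer be sanitized.
  ClientWriteAction check_client_write(epoch_t request_map_epoch,
                                       bool has_snap_context,
                                       epoch_t removed_snaps_pruned_epoch) {
    if (has_snap_context && request_map_epoch < removed_snaps_pruned_epoch) {
      return ClientWriteAction::KILL_SESSION;
    }
    return ClientWriteAction::PROCESS;
  }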
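Finally, the category 2 detection. A sketch of what the check might look like if we had a per-snap deleted_epoch (the pg_pool_t addition I mention just below): the OSD only complains when the client's map epoch is at or after the epoch in which the snap was deleted, i.e., the client should already have known the snap was gone.

  // Sketch: warn when a client that should know a snap is gone still sends
  // it in a snapc.  deleted_epoch is the hypothetical pg_pool_t addition.
  #include <cstdint>
  #include <map>
  #include <vector>
  #include <iostream>

  using snapid_t = uint64_t;
  using epoch_t = uint32_t;

  struct PoolSnapHistory {
    // snapid -> epoch in which it was deleted (the proposed new metadata)
    std::map<snapid_t, epoch_t> deleted_epoch;
  };

  void check_snapc_against_deleted(const std::vector<snapid_t>& request_snaps,
                                   epoch_t request_map_epoch,
                                   const PoolSnapHistory& pool) {
    for (snapid_t s : request_snaps) {
      auto it = pool.deleted_epoch.find(s);
      if (it != pool.deleted_epoch.end() && it->second <= request_map_epoch) {
        std::cerr << "WARNING: client sent snapc with snap " << s
                  << " deleted in epoch " << it->second
                  << " but request is marked epoch " << request_map_epoch
                  << "; client is likely out of sync (possible bug)\n";
      }
    }
  }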
Doing that last check properly would mean adding a deleted_epoch for each removed snapid to pg_pool_t; maybe worth it, maybe not?

Thoughts?
sage