On 2017-09-08 01:59 PM, Gregory Farnum wrote:
> On Fri, Sep 8, 2017 at 1:45 AM, Florian Haas <florian@xxxxxxxxxxx> wrote:
>>> In our use case, we are severely hampered by the size of removed_snaps
>>> (50k+) in the OSDMap, to the point where ~80% of ALL CPU time is spent
>>> in PGPool::update and its interval calculation code. We have a cluster
>>> of around 100k RBDs, with each RBD having up to 25 snapshots and only
>>> a small portion of our RBDs mapped at a time (~500-1000). For
>>> size/performance reasons we try to keep the number of snapshots low
>>> (<25) and need to prune snapshots. Since in our use case RBDs 'age' at
>>> different rates, snapshot pruning creates holes, to the point where
>>> the size of the removed_snaps interval set in the osdmap is 50k-100k
>>> in many of our Ceph clusters. I think in general around 2 snapshot
>>> removal operations currently happen per minute, just because of the
>>> volume of snapshots and users we have.
>> Right. Greg, this is what I was getting at: 25 snapshots per RBD is
>> firmly in "one snapshot per day per RBD" territory — this is something
>> that a cloud operator might do, for example, offering daily snapshots
>> going back one month. But it still wrecks the cluster simply by having
>> lots of images (even though only a fraction of them, less than 1%, are
>> ever in use). That's rather counter-intuitive: it doesn't hit you
>> until you have lots of images, and once you're affected by it there's
>> no practical way out — where "out" is defined as "restoring overall
>> cluster performance to something acceptable".
>>
>>> We found PGPool::update and the interval calculation code to be quite
>>> inefficient. Some small changes made it a lot faster, giving us more
>>> breathing room; we shared these and most have already been applied:
>>> https://github.com/ceph/ceph/pull/17088
>>> https://github.com/ceph/ceph/pull/17121
>>> https://github.com/ceph/ceph/pull/17239
>>> https://github.com/ceph/ceph/pull/17265
>>> https://github.com/ceph/ceph/pull/17410 (not yet merged, needs more fixes)
>>>
>>> These patches helped our use case, but overall CPU usage in this area
>>> is still high (>70% or so), making the Ceph cluster slow and causing
>>> blocked requests and many operations (e.g. rbd map) to take a long
>>> time.
>> I think this makes this very much a practical issue, not a
>> hypothetical/theoretical one.
>>
>>> We are trying to work around these issues by changing our snapshot
>>> strategy. In the short term we are manually defragmenting the interval
>>> set: we scan for holes and delete the snapids that sit between holes
>>> so that they coalesce into larger ones. This is not so nice to do. In
>>> some cases we employ strategies to 'recreate' old snapshots (as we
>>> need to keep them) at higher snapids. For our use case a 'snapid
>>> rename' feature would have been quite helpful.
>>>
>>> I hope this shines some light on practical Ceph clusters in which
>>> performance is bottlenecked not by I/O but by snapshot removal.
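
The fragmentation described above is easy to reproduce in a toy model.
The sketch below is not Ceph's interval_set<snapid_t>; it just collapses
a set of removed snapids into (start, length) runs, the same shape as the
removed_snaps [start~length,...] entries that 'ceph osd dump' prints per
pool (if I recall the format correctly). The point is only to show how
pruning order drives the interval count, and why deleting the snapids
that sit between two holes coalesces them:

    # Toy model of removed-snapshot fragmentation; illustration only.
    def to_intervals(snapids):
        runs = []
        for s in sorted(snapids):
            if runs and s == runs[-1][0] + runs[-1][1]:
                runs[-1] = (runs[-1][0], runs[-1][1] + 1)  # extends the last run
            else:
                runs.append((s, 1))                        # gap -> new run
        return runs

    # Trimming strictly oldest-first keeps the set tiny:
    print(len(to_intervals(set(range(1, 1001)))))          # 1 interval

    # Trimming every other snapid (images ageing at different rates)
    # leaves one interval per hole:
    every_other = set(range(1, 1001, 2))
    print(len(to_intervals(every_other)))                  # 500 intervals

    # Deleting the still-live snapids that sit between the holes (the
    # manual "defragmentation" described above) coalesces the runs again:
    print(len(to_intervals(every_other | set(range(2, 1001, 2)))))  # 1 interval

At the scale described above, with on the order of 100k images pruned
independently, the middle case is the one you get, which is how the set
ends up with 50k-100k intervals.
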
>> For others following this thread or retrieving it from the list
>> archive some time down the road, I'd rephrase that as "bottlenecked
>> not by I/O but by CPU utilization associated with snapshot removal".
>> Is that fair to say, Patrick? Please correct me if I'm
>> misrepresenting.
>>
>> Greg (or Josh/Jason/Sage/anyone really :) ), can you provide
>> additional insight as to how these issues can be worked around or
>> mitigated, besides the PRs that Patrick and his colleagues have
>> already sent?
> Yeah. Like I said, we have a proposed solution for this (that we can
> probably backport to Luminous stable?), but that's the sort of thing I
> haven't heard about before. And the issue is indeed with the raw size
> of the removed_snaps member, which will be a problem for cloud
> operators of a certain scale.
>
> Theoretically, I'd expect you could control it if you are careful:
> 1) take all snapshots on your RBD images for a single time unit
> together; don't intersperse them (i.e., don't create daily snapshots
> on some images at the same time as hourly snapshots on others)
> 2) trim all snapshots from the same time unit on the same schedule
> 3) limit the number of live time units you keep around

That is basically our long-term strategy, but it does involve some
re-architecting of our code, which will take some time.

> There are obvious downsides to those steps, and it's a problem I look
> forward to us resolving soonish. But if you follow those I'd expect
> the removed_snaps interval_set to be proportional in size to the
> number of live time units you have, rather than the number of RBD
> volumes or anything else.
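
As a rough sanity check on that, here is a small simulation comparing
the cohort approach above with independently scheduled per-image
snapshots. The image count, retention, snapid allocation, and random
scheduling are all invented for illustration; only the shape of the
result matters:

    # Cohort trimming vs. independent per-image trimming; illustration only.
    import random
    random.seed(0)

    def count_intervals(removed):
        runs, prev = 0, None
        for s in sorted(removed):
            if prev is None or s != prev + 1:
                runs += 1
            prev = s
        return runs

    IMAGES, ROUNDS, KEEP = 1000, 60, 25

    # Cohort schedule: every image snapshots in the same time unit, and
    # whole time units are retired together.
    next_id, units, removed = 1, [], set()
    for _ in range(ROUNDS):
        units.append(set(range(next_id, next_id + IMAGES)))
        next_id += IMAGES
        if len(units) > KEEP:
            removed |= units.pop(0)          # retire the oldest unit in one go
    print("cohort:", count_intervals(removed))       # 1 interval

    # Independent schedules: each snapshot event hits an arbitrary image,
    # which trims its own oldest snapshot once it exceeds KEEP.
    next_id, per_image, removed = 1, {i: [] for i in range(IMAGES)}, set()
    for _ in range(IMAGES * ROUNDS):
        img = random.randrange(IMAGES)
        per_image[img].append(next_id)
        next_id += 1
        if len(per_image[img]) > KEEP:
            removed.add(per_image[img].pop(0))
    print("independent:", count_intervals(removed))  # thousands of intervals

The cohort run stays at a single interval no matter how many images
there are, which is the "proportional to the number of live time units"
property described above; the independent run fragments wherever a
still-live snap sits between trimmed ones.
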
>
> On Wed, Sep 6, 2017 at 8:44 AM, Florian Haas <florian@xxxxxxxxxxx> wrote:
>> Hi Greg,
>>
>> thanks for your insight! I do have a few follow-up questions.
>>
>> On 09/05/2017 11:39 PM, Gregory Farnum wrote:
>>>> It seems to me that there still isn't a good recommendation along the
>>>> lines of "try not to have more than X snapshots per RBD image" or
>>>> "try not to have more than Y snapshots in the cluster overall". Or is
>>>> the "correct" recommendation actually "create as many snapshots as
>>>> you might possibly want, none of that is allowed to create any
>>>> instability nor performance degradation and if it does, that's a
>>>> bug"?
>>> I think we're closer to "as many snapshots as you want", but there are
>>> some known shortcomings there.
>>>
>>> First of all, if you haven't seen my talk from the last OpenStack
>>> summit on snapshots and you want a bunch of details, go watch that. :p
>>> https://www.openstack.org/videos/boston-2017/ceph-snapshots-for-fun-and-profit-1
>> OK, so I just rewatched that to see if I had missed anything regarding
>> recommendations for how many snapshots are sane. For anyone else
>> following this thread, there are two items I could make out, and I'm
>> taking the liberty to include the direct links here:
>>
>> - From the talk itself: https://youtu.be/rY0OWtllkn8?t=26m29s
>>
>> This says don't do a snapshot every minute on each RBD, but one per day
>> is probably OK. That is still *very* vague, unfortunately, since as you
>> point out in the talk the overhead associated with snapshots is
>> strongly related to how many RADOS-level snapshots there are in the
>> cluster overall, and clearly it makes a big difference whether you're
>> taking one daily snapshot of 10 RBD images or of 100,000.
>>
>> So, can you refine that estimate a bit? As in, can you give at least an
>> order-of-magnitude estimate for "this many snapshots overall is
>> probably OK, but multiply by 10 and you're in trouble"?
>>
>> - From the Q&A: https://youtu.be/rY0OWtllkn8?t=36m58s
>>
>> Here, you talk about how having many holes in the interval set
>> governing the snap trim queue can be a problem. That one is rather
>> tricky too, because as far as I can tell there is really no way for
>> users to influence this (other than, of course, deleting *all*
>> snapshots or never creating or deleting any at all).
>>
>>> There are a few dimensions in which there can be failures with
>>> snapshots:
>>> 1) Right now the way we mark snapshots as deleted is suboptimal — when
>>> deleted they go into an interval_set in the OSDMap. So if you have a
>>> bunch of holes in your deleted snapshots, it is possible to inflate
>>> the osdmap to a size which causes trouble. But I'm not sure if we've
>>> actually seen this be an issue yet — it requires both a large cluster
>>> and a large map, and probably some other failure causing osdmaps to be
>>> generated very rapidly.
>> Can you give an estimate as to what a "large" map is in this context?
>> In other words, when is a map sufficiently inflated with that interval
>> set to be a problem?
>>
>>> 2) There may be issues with how rbd records which snapshots it is
>>> associated with? No idea about this; I haven't heard of any.
>>>
>>> 3) Trimming snapshots requires IO. This is where most (all?) of the
>>> issues I've seen have come from: either it is unscheduled IO that the
>>> rest of the system doesn't account for or throttle (as in the links
>>> you highlighted), or admins overwhelm the IO capacity of their
>>> clusters.
>> Again, I think (correct me if I'm wrong here) that trimming does factor
>> into your "one snapshot per RBD image per day" recommendation, but
>> would you be able to express that in terms of overall RADOS-level
>> snapshots?
> The problem with these bounds (which are apparently not the only ones,
> given that whole CPU discussion) is that they're all based on the
> throughput capacity of your system, so there are no broadly-applicable
> rules of thumb. If you're running a slow-hard-drive 1GigE monitor
> system, you will have trouble with OSDMaps much smaller than somebody
> with SSDs and 10GigE. Trimming snapshots is just normal IO to the Ceph
> system at this point, but it's not free, so if you generate snapshots
> at 1-minute intervals and delete them just as quickly, that may exceed
> your available IOPS (although with up-to-date code that just means your
> snapshot trimming queue will get longer).
> -Greg

Our system is all 10GigE, with SSD journal disks and spinning disks for
the data. When our clusters are falling over, the IOPS are not high
enough to be saturating our disks, and the network links are nowhere
near saturation. We are, however, seeing all of the CPUs on the OSDs
completely pegged.
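
For rough perspective on those pegged CPUs: as I understand it,
PGPool::update walks the pool's removed_snaps on every new OSDMap epoch,
for every PG an OSD hosts, so the work scales roughly with intervals x
PGs x map churn. The numbers below are assumptions picked only to show
how quickly those terms multiply; none of them are measurements:

    # Back-of-the-envelope arithmetic, not a measurement. All four inputs
    # are assumptions chosen for illustration; only the multiplication matters.
    intervals = 75_000       # removed_snaps intervals (50k-100k reported above)
    pgs_per_osd = 200        # assumed PGs hosted per OSD
    epochs_per_min = 4       # assumed OSDMap churn from snap create/delete
    ns_per_interval = 100    # assumed cost of touching one interval

    ops = intervals * pgs_per_osd * epochs_per_min
    cpu_seconds = ops * ns_per_interval / 1e9
    print(f"{ops:,} interval steps/min ~ {cpu_seconds:.1f} CPU-seconds/min per OSD")
    # 60,000,000 interval steps/min ~ 6.0 CPU-seconds/min per OSD, and that
    # assumes a linear model and steady map churn; a backlog of epochs to
    # catch up on, or a costlier per-interval path, multiplies it directly.

With several OSDs per host sharing cores and any burst of map epochs to
process, that kind of multiplication is consistent with CPU, rather than
disk or network, being the first wall you hit.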