On Sun, Sep 21, 2014 at 4:26 PM, Dan van der Ster <daniel.vanderster@xxxxxxx> wrote:
> Hi Florian,
>
> September 21 2014 3:33 PM, "Florian Haas" <florian@xxxxxxxxxxx> wrote:
>> That said, I'm not sure that wip-9487-dumpling is the final fix to the
>> issue. On the system where I am seeing the issue, even with the fix
>> deployed, OSDs still not only go crazy snap trimming (which by itself
>> would be understandable, as the system has indeed recently had
>> thousands of snapshots removed), but they also still produce the
>> previously seen ENOENT messages indicating they're trying to trim
>> snaps that aren't there.
>
> You should be able to tell exactly how many snaps need to be trimmed. Check the current purged_snaps with
>
> ceph pg x.y query
>
> and also check the snap_trimq from debug_osd=10. The problem fixed in wip-9487 is the (mis)communication of purged_snaps to a new OSD. But if in your cluster purged_snaps is "correct" (which it should be after the fix from Sage), and it still has lots of snaps to trim, then I believe the only thing to do is let those snaps all get trimmed. (My other patch, linked earlier in this thread, might help by breaking that trimming work into smaller pieces, but it was never tested.)

Yes, it does indeed look like the system has thousands of snapshots left to trim. That said, since the PGs are locked during this time, the cluster becomes unusable with no way for the user to recover.

> Entering the realm of speculation, I wonder if your OSDs are getting interrupted, marked down, out, or crashing before they have the opportunity to persist purged_snaps? purged_snaps is updated in ReplicatedPG::WaitingOnReplicas::react, but if the primary is too busy to actually send that transaction to its peers, then eventually it (or the new primary) needs to start over, and no progress is ever made. If this is what is happening on your cluster, then again, perhaps my osd_snap_trim_max patch could be a solution.

Since the snap trimmer immediately jacks the affected OSDs up to 100% CPU utilization, and they stop even responding to heartbeats, yes, they do get marked down, and that makes the issue much worse. Even with nodown set, the affected OSDs still spin practically indefinitely.

So even with the patch for 9487, which fixes *your* issue of the cluster trying to trim tons of snaps when in fact it should be trimming only a handful, the user is still in a world of pain when they really do have tons of snaps to trim. And obviously, neither osd max backfills nor osd recovery max active helps here, because even a single backfill/recovery makes the OSD go nuts.

There is the silly option of setting osd_snap_trim_sleep to, say, 61 minutes, and restarting the ceph-osd daemons before the snap trim can kick in, i.e. hourly, via a cron job (a rough sketch is in the P.S. below). Of course, while this prevents the OSD from going into a death spin, it only postpones the problem until a proper patch is available, because snap trimming never even runs, let alone completes.

This is particularly bad because users can render their cluster non-functional simply by trying to delete a few thousand snapshots at once. Consider a tiny virtualization cluster of just 100 persistent VMs, each snapshotted once an hour: deleting one month's worth of snapshots puts you well beyond that mark. So we're not talking about outrageous numbers here.
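To put a number on it (back-of-the-envelope, assuming one snapshot per VM per hour and a 30-day month):

    100 VMs * 24 snapshots/day * 30 days = 72,000 snapshots

In other words, a single monthly cleanup on even a small cluster queues tens of thousands of snap removals.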
I don't think anyone can fault a user for attempting this. What makes the situation even worse is that there is no cluster-wide limit on the number of snapshots, no limit on snapshots per RBD volume or per PG, and no limit on the number of snapshots deleted concurrently.

So yes, I think your patch absolutely still has merit, as would any other means of reducing the number of snapshots an OSD will trim in one go. As it stands, the situation looks really bad, particularly considering that RBD and RADOS are meant to be rock solid, as opposed to, say, CephFS, which is explicitly experimental. And unlike CephFS snapshots, I can't recall any documentation warning that RBD snapshots can break your system.

Cheers,
Florian
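P.S.: For completeness, here is roughly what I mean by the osd_snap_trim_sleep/cron stopgap above. This is only a sketch, not a recommendation: I'm assuming the sleep value is given in seconds (3660 s = 61 minutes) and a sysvinit-based install where "service ceph restart osd" restarts all OSDs on the node; adjust the restart command for your init system and packaging.

    # ceph.conf, [osd] section: make the trimmer sleep longer than the
    # restart interval, so it never gets around to doing any work
    osd snap trim sleep = 3660

    # /etc/cron.d/restart-ceph-osds: restart all local OSDs hourly,
    # before the trim sleep expires
    0 * * * * root service ceph restart osd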