Hi Florian,

> On 17 Sep 2014, at 17:09, Florian Haas <florian at hastexo.com> wrote:
>
> Hi Craig,
>
> just dug this up in the list archives.
>
> On Fri, Mar 28, 2014 at 2:04 AM, Craig Lewis <clewis at centraldesktop.com> wrote:
>> In the interest of removing variables, I removed all snapshots on all pools,
>> then restarted all ceph daemons at the same time. This brought up osd.8 as
>> well.
>
> So just to summarize this: your 100% CPU problem at the time went away
> after you removed all snapshots, and the actual cause of the issue was
> never found?
>
> I am seeing a similar issue now, and have filed
> http://tracker.ceph.com/issues/9503 to make sure it doesn't get lost
> again. Can you take a look at that issue and let me know if anything
> in the description sounds familiar?

Could your ticket be related to the snap trimming issue I've finally narrowed down in the past couple days?

http://tracker.ceph.com/issues/9487

Bump up debug_osd to 20 then check the log during one of your incidents. If it is busy logging the snap_trimmer messages, then it's the same issue. (The issue is that rbd pools have many purged_snaps, but sometimes after backfilling a PG the purged_snaps list is lost and thus the snap trimmer becomes very busy whilst re-trimming thousands of snaps. During that time (a few minutes on my cluster) the OSD is blocked.)

Cheers,
Dan

>
> You mentioned in a later message in the same thread that you would
> keep your snapshot script running and "repeat the experiment". Did the
> situation change in any way after that? Did the issue come back? Or
> did you just stop using snapshots altogether?
>
> Cheers,
> Florian
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
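The check Dan suggests (raise debug_osd to 20 during an incident, then see whether the OSD log is dominated by snap_trimmer messages) can be roughly sketched as a small Python script. The level can typically be raised at runtime with `ceph tell osd.<id> injectargs '--debug-osd 20'`. This is a minimal sketch under assumptions: it only matches the literal substring "snap_trimmer", it assumes each log line starts with a date field and a time field, and the function names are hypothetical. The exact message text and log path are not given in the thread.

```python
import sys
from collections import Counter


def snap_trimmer_share(lines):
    """Return (snap_trimmer_lines, total_lines) for an iterable of log lines.

    Assumption: the relevant debug messages contain the literal
    substring "snap_trimmer".
    """
    total = 0
    trimmer = 0
    for line in lines:
        total += 1
        if "snap_trimmer" in line:
            trimmer += 1
    return trimmer, total


def trimmer_lines_per_second(lines):
    """Count snap_trimmer lines per whole second.

    Assumption: each line begins with "YYYY-MM-DD HH:MM:SS.ffffff".
    """
    counts = Counter()
    for line in lines:
        if "snap_trimmer" not in line:
            continue
        parts = line.split()
        if len(parts) < 2:
            continue
        # Drop the fractional seconds so lines group by whole second.
        stamp = parts[0] + " " + parts[1].split(".")[0]
        counts[stamp] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    path = sys.argv[1]  # path to the OSD log you want to inspect
    with open(path, errors="replace") as f:
        lines = f.readlines()
    hits, total = snap_trimmer_share(lines)
    pct = 100.0 * hits / total if total else 0.0
    print(f"{hits} of {total} lines ({pct:.1f}%) mention snap_trimmer")
    # Show the seconds with the heaviest snap_trimmer logging.
    for stamp, n in trimmer_lines_per_second(lines).most_common(5):
        print(stamp, n)
```

If a large share of the lines logged during a blocked period mention snap_trimmer, that points towards the re-trimming behaviour described in issue 9487; a low share suggests looking elsewhere.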