RGW hung, 2 OSDs using 100% CPU

daniel.vanderster@xxxxxxx (Dan Van Der Ster) · Wed, 17 Sep 2014 15:24:42 +0000

Hi Florian,

> On 17 Sep 2014, at 17:09, Florian Haas <florian at hastexo.com> wrote:
> 
> Hi Craig,
> 
> just dug this up in the list archives.
> 
> On Fri, Mar 28, 2014 at 2:04 AM, Craig Lewis <clewis at centraldesktop.com> wrote:
>> In the interest of removing variables, I removed all snapshots on all pools,
>> then restarted all ceph daemons at the same time.  This brought up osd.8 as
>> well.
> 
> So just to summarize this: your 100% CPU problem at the time went away
> after you removed all snapshots, and the actual cause of the issue was
> never found?
> 
> I am seeing a similar issue now, and have filed
> http://tracker.ceph.com/issues/9503 to make sure it doesn't get lost
> again. Can you take a look at that issue and let me know if anything
> in the description sounds familiar?

Could your ticket be related to the snap trimming issue I've finally narrowed down in the past couple days?

  http://tracker.ceph.com/issues/9487

Bump up debug_osd to 20, then check the log during one of your incidents. If it is busy logging the snap_trimmer messages, then it's the same issue. (The issue is that rbd pools have many purged_snaps, but sometimes after backfilling a PG the purged_snaps list is lost and thus the snap trimmer becomes very busy whilst re-trimming thousands of snaps. During that time (a few minutes on my cluster) the OSD is blocked.)
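
To make the "is it busy logging snap_trimmer messages" check a bit less eyeball-driven, here is a rough sketch of a log scan. It is only an illustration, not a tested tool: the function name snap_trimmer_share is made up, the plain substring match on "snap_trimmer" is an assumption about how those messages appear in the log, and the timestamp pattern assumes lines start with "YYYY-MM-DD HH:MM:SS". Adjust both to whatever your OSD log actually looks like.

```python
import re
import sys
from collections import Counter

# Assumption: each log line starts with "YYYY-MM-DD HH:MM:SS.ffffff ...".
# Only the part up to the minute is captured, to bucket matches per minute.
TS_MINUTE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2})")


def snap_trimmer_share(lines, needle="snap_trimmer"):
    """Count log lines containing `needle`.

    Returns (matching, total, per_minute), where per_minute is a Counter
    keyed by the "YYYY-MM-DD HH:MM" prefix of each matching line.
    Blank lines are ignored.
    """
    total = 0
    matching = 0
    per_minute = Counter()
    for line in lines:
        if not line.strip():
            continue
        total += 1
        if needle in line:
            matching += 1
            m = TS_MINUTE.match(line)
            per_minute[m.group(1) if m else "unknown"] += 1
    return matching, total, per_minute


# Hypothetical usage: python snaptrim_scan.py /path/to/osd.log
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], errors="replace") as f:
        matching, total, per_minute = snap_trimmer_share(f)
    print(f"{matching} of {total} lines mention snap_trimmer")
    for minute, count in sorted(per_minute.items()):
        print(f"{minute}  {count}")
```

If the per-minute counts jump into the thousands for the few minutes the OSD is pegged, and drop back afterwards, that would be consistent with the re-trimming behaviour described above.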

Cheers, Dan

> 
> You mentioned in a later message in the same thread that you would
> keep your snapshot script running and "repeat the experiment". Did the
> situation change in any way after that? Did the issue come back? Or
> did you just stop using snapshots altogether?
> 
> Cheers,
> Florian
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com