Re: RGW hung, 2 OSDs using 100% CPU

Craig Lewis <clewis@xxxxxxxxxxxxxxxxxx> · Fri, 28 Mar 2014 10:42:08 -0700



      On 3/27/14 18:04 , Craig Lewis wrote:

    
        I'm trying to use strace on osd.4:

        strace -tt -f -ff -o ./ceph-osd.4.strace -x
          /usr/bin/ceph-osd --cluster=ceph -i 4 -f

        
        So far, strace is running, and the process isn't hung.  After I
        ran this, the cluster finally finished backfilling the last of
        the PGs (all on osd.4).

        
        Since the cluster is healthy again, I killed the strace and
        started the daemon normally (start ceph-osd id=4).  Things seem
        fine now.  I'm going to let it scrub and deep-scrub overnight.
        I'll restart radosgw-agent tomorrow.

        
    This seems to have resolved the issue.  The cluster completed
    recovery while I was strace'ing osd.4, and hasn't had any issues
    since then.  I restarted radosgw-agent, and it's running fine.

    
    I don't think the snapshots are related, but I don't know for sure.
    The snapshots I deleted were taken over a 2-week period, during
    which the cluster's data size grew by 40%.

    
    The snapshot cron is still active, so I guess I'll repeat the
    experiment.  If the issue comes back in a couple of weeks, I'll try
    the strace without removing the snapshots.
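
For that follow-up run, one option would be to attach strace to the already-running OSD with -p instead of relaunching ceph-osd under strace, so a daemon restart doesn't muddy the result. A minimal sketch, assuming the same flags as the run above; the helper name build_strace_cmd and the PID 12345 are hypothetical, and the function only prints the command rather than running it:

```shell
#!/bin/sh
# Hypothetical helper: build the strace command line for attaching to an
# OSD that is already running. It prints the command; it does not run it.
build_strace_cmd() {
    osd_id="$1"   # OSD number, e.g. 4 for osd.4
    pid="$2"      # PID of the running ceph-osd process (look it up yourself)
    # Same flags as the earlier run (-tt -f -ff -o ... -x), but -p attaches
    # to an existing PID instead of starting ceph-osd in the foreground.
    printf 'strace -tt -f -ff -o ./ceph-osd.%s.strace -x -p %s\n' "$osd_id" "$pid"
}

# Example with a made-up PID; substitute the real PID of osd.4.
build_strace_cmd 4 12345
```

With -ff and -o, strace writes one output file per traced process, named after the -o value with the PID appended, so the traces should end up next to each other in the current directory.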

  