Re: RGW hung, 2 OSDs using 100% CPU

I forgot to mention:
root@ceph1c:~# uname -a
Linux ceph1c 3.5.0-46-generic #70~precise1-Ubuntu SMP Thu Jan 9 23:55:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

root@ceph1c:~# ceph -v
ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60)

root@ceph1c:~# uptime
 15:05:22 up  3:55,  4 users,  load average: 4.84, 3.87, 2.95

Craig Lewis
Senior Systems Engineer
Office +1.714.602.1309
Email clewis@xxxxxxxxxxxxxxxxxx


On 3/26/14 15:04 , Craig Lewis wrote:
I noticed that my RGW replication wasn't doing anything.  The RGW replication logs were full of:
2014-03-26T11:29:01.802 18698:ERROR:root:Could not retrieve region map from destination
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/radosgw_agent/cli.py", line 262, in main
    region_map = client.get_region_map(dest_conn)
<snip>

I tried to get a list of buckets in the slave zone, and it hangs.  The simplest operation I can run that hangs is:
radosgw-admin --name=client.radosgw.ceph1c regions list
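
(A side note in case it helps anyone reproduce this: any Ceph client accepts debug overrides on the command line, so the hang can also be watched from the client side with something like the following. The debug levels are just a guess at something useful; the messenger log should at least show which OSD the stuck op was sent to.)

radosgw-admin --name=client.radosgw.ceph1c --debug-ms=1 --debug-objecter=20 regions list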

This cluster has 2 nodes.  Watching ceph -w, I see 2 OSDs get kicked out of the cluster.  OSDs 4 and 8 are both flapping.  An excerpt from ceph -w:
2014-03-26 11:24:33.582757 osd.9 [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.838374 secs
2014-03-26 11:24:33.582762 osd.9 [WRN] slow request 30.838374 seconds old, received at 2014-03-26 11:24:02.744331: osd_op(client.41087064.0:8 notify.0 [watch add cookie 1 ver 0] 9.4322fa9f e8270) v4 currently waiting for subops from [4]
2014-03-26 11:24:35.583064 osd.9 [WRN] 2 slow requests, 1 included below; oldest blocked for > 32.838703 secs
2014-03-26 11:24:35.583068 osd.9 [WRN] slow request 30.012828 seconds old, received at 2014-03-26 11:24:05.570206: osd_op(client.41087080.0:8 notify.0 [watch add cookie 1 ver 0] 9.4322fa9f e8270) v4 currently waiting for subops from [4]
2014-03-26 11:28:10.612684 osd.9 [WRN] 4 slow requests, 1 included below; oldest blocked for > 247.868328 secs
2014-03-26 11:28:10.612692 osd.9 [WRN] slow request 240.276922 seconds old, received at 2014-03-26 11:24:10.335737: osd_op(client.41087104.0:8 notify.0 [watch add cookie 1 ver 0] 9.4322fa9f e8270) v4 currently waiting for subops from [4]
2014-03-26 11:31:08.198483 mon.0 [INF] osd.4 marked itself down
2014-03-26 11:31:08.266406 mon.0 [INF] osdmap e8271: 16 osds: 15 up, 16 in
2014-03-26 11:31:21.294695 mon.0 [INF] osd.4 10.193.0.6:6812/2215 boot
2014-03-26 11:33:55.856801 mon.0 [INF] osd.8 marked down after no pg stats for 900.157048seconds
2014-03-26 11:33:55.894864 mon.0 [INF] osdmap e8275: 16 osds: 15 up, 16 in
2014-03-26 11:47:30.952634 mon.0 [INF] osd.8 out (down for 815.058447)
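
For anyone following along, the blocked ops can also be dumped straight from the OSDs over the admin socket (the paths below assume the default socket locations):

# default admin socket paths; adjust if yours differ
ceph --admin-daemon /var/run/ceph/ceph-osd.9.asok dump_ops_in_flight
ceph --admin-daemon /var/run/ceph/ceph-osd.4.asok dump_ops_in_flight
ceph --admin-daemon /var/run/ceph/ceph-osd.9.asok dump_historic_ops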

I set the noout flag.
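
(For reference, that's just:

ceph osd set noout

with ceph osd unset noout to clear it once things are stable again.)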

This is a production slave zone.  I can play with the cluster as long as replication is off.  By playing with OSDs 4 and 8, I found that the cluster works fine with OSD 4 up and OSD 8 down.  If I start OSD 8 and stop OSD 4, RGW is still hung.  If I start both, RGW is still hung.
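
These OSDs are managed by upstart on precise, so starting and stopping them is just, for example:

start ceph-osd id=4
stop ceph-osd id=8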

I saw some mailing list posts where somebody had a similar issue, and fixing the time solved it.  I verified that my times are correct on both machines.
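
Checked with something along these lines on both nodes (this assumes ntpd is the time source; swap in whatever you use):

date -u     # wall clock, UTC
ntpq -p     # NTP peers and offsets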

I haven't seen any XFS issues, but just to be sure, I unmounted both devices and ran xfs_check.  No errors were reported.
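
Roughly, per OSD, with the device and mount point as placeholders:

stop ceph-osd id=4
umount /var/lib/ceph/osd/ceph-4
xfs_check /dev/sdX1      # or: xfs_repair -n /dev/sdX1 (read-only check)

and then the reverse to bring the OSD back.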


Since all the easy things appear to be in order, I bumped up logging on both nodes:
[osd]
  osd journal size = 6144
  osd max backfills = 1
  osd mkfs options xfs = -l size=1024m -n size=64k -i size=2048 -s size=4096
  osd mkfs type = xfs
  osd mount options xfs = rw,noatime,nodiratime,nosuid,noexec,inode64
  osd recovery op priority = 1
  debug osd = 5
  debug filestore = 5
  debug journal = 1
  debug monc = 20
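
Side note: the same levels can also be applied without restarting the daemons, e.g.:

ceph tell osd.4 injectargs '--debug-osd 5 --debug-filestore 5 --debug-journal 1 --debug-monc 20'
ceph tell osd.8 injectargs '--debug-osd 5 --debug-filestore 5 --debug-journal 1 --debug-monc 20'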

At 14:14:00, I started OSD 4, and waited for ceph -w to stabilize.  CPU usage was normal.
At 14:15:10, I ran radosgw-admin --name=client.radosgw.ceph1c regions list && radosgw-admin --name=client.radosgw.ceph1c regionmap get.  It returned successfully.
At 14:16:00, I started OSD 8, and waited for ceph -w to stabilize.  CPU usage started out normal, but went to 100% before 14:16:40.
At 14:17:25, I ran radosgw-admin --name=client.radosgw.ceph1c regions list && radosgw-admin --name=client.radosgw.ceph1c regionmap get.  regions list hung, and I killed it.
At 14:18:15, I stopped ceph-osd id=8.
At 14:18:45, I ran radosgw-admin --name=client.radosgw.ceph1c regions list && radosgw-admin --name=client.radosgw.ceph1c regionmap get.  It returned successfully.
At 14:19:10, I stopped ceph-osd id=8.
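
If anyone wants to try catching what the busy OSD is doing while it spins, roughly (the grep pattern assumes the stock upstart command line, perf needs linux-tools installed, and <pid> is a placeholder):

ps aux | grep ceph-osd | grep ' -i 8 '    # grab the PID of osd.8
top -H -p <pid>                           # per-thread CPU usage
perf top -p <pid>                         # hot functions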

Looking at the logs, nothing jumps out at me between 14:16:00 and 14:16:40.  The logs are busy, since OSD 8 is backfilling.


I don't think it's related, but I'd like to point out that all of the RGW pools have snapshots:
root@ceph1c:/var/log/ceph# rados lssnap -p .us-central-1.rgw.root
3    2014-03-15T02:00:02    2014.03.15 02:00:02
4    2014-03-16T02:00:01    2014.03.16 02:00:02
5    2014-03-17T02:00:01    2014.03.17 02:00:01
6    2014-03-18T02:00:01    2014.03.18 02:00:01
7    2014-03-19T02:00:01    2014.03.19 02:00:01
8    2014-03-20T02:00:01    2014.03.20 02:00:01
9    2014-03-21T02:00:01    2014.03.21 02:00:01
10    2014-03-22T02:00:01    2014.03.22 02:00:01
11    2014-03-23T02:00:02    2014.03.23 02:00:02
12    2014-03-24T02:00:02    2014.03.24 02:00:02
13    2014-03-25T02:00:02    2014.03.25 02:00:02
14    2014-03-26T02:00:01    2014.03.26 02:00:01
12 snaps

That's only in my slave zone.  The master zone (which is working fine) has no snapshots.
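
In case anyone wants to check their own pools, a quick way to walk all of the zone's pools (the egrep pattern matches my pool naming; adjust to taste):

for pool in $(rados lspools | egrep 'rgw|us-central-1'); do echo "== $pool =="; rados lssnap -p $pool; done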


For now, I've removed the debugging, started OSD 4, and started radosgw-agent.  OSD 8 remains off. 

There are 43 PGs that live on OSDs 4 and 8:
root@ceph1c:/var/log/ceph# ceph pg dump | fgrep '[4]' | cut -d"." -f1 | sort | uniq -c
     36 11
      2 13
      1 14
      1 18
      1 19
      1 20
      1 21
root@ceph1c:/var/log/ceph# ceph osd dump | grep pool | egrep 'pool (11|13|14|18|19|20|21)'
pool 11 '.rgw.buckets' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 2048 pgp_num 2048 last_change 8014 owner 0 flags hashpspool
pool 13 '.us-central-1.log' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 32 pgp_num 32 last_change 8007 owner 0 flags hashpspool
pool 14 '.us-central-1.intent-log' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 32 pgp_num 32 last_change 8006 owner 0 flags hashpspool
pool 18 '.us-central-1.users.swift' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 32 pgp_num 32 last_change 8010 owner 0 flags hashpspool
pool 19 '.us-central-1.users.uid' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 32 pgp_num 32 last_change 8011 owner 0 flags hashpspool
pool 20 '.rgw.root' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 32 pgp_num 32 last_change 207 owner 0 flags hashpspool
pool 21 '.rgw.control' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 32 pgp_num 32 last_change 208 owner 0 flags hashpspool
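
If more detail on one of those shared PGs turns out to be useful, this is roughly how I'd poke at one (the pg id below is just a placeholder for one of the 36 in .rgw.buckets):

ceph pg map 11.1a     # up/acting OSDs for that PG
ceph pg 11.1a query   # peering and recovery state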



The two OSD logs from 14:14:00 to 14:19:10 can be found at
OSD 4: https://cd.centraldesktop.com/p/eAAAAAAADL0vAAAAAABMSio
OSD 8: https://cd.centraldesktop.com/p/eAAAAAAADL05AAAAAFAVWRg


Any suggestions or ideas are appreciated.


--

Craig Lewis
Senior Systems Engineer
Office +1.714.602.1309
Email clewis@xxxxxxxxxxxxxxxxxx




_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
