RGW hung, 2 OSDs using 100% CPU

I noticed that my RGW replication wasn't doing anything.  The replication (radosgw-agent) logs were full of:
2014-03-26T11:29:01.802 18698:ERROR:root:Could not retrieve region map from destination
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/radosgw_agent/cli.py", line 262, in main
    region_map = client.get_region_map(dest_conn)
<snip>

I tried to get a list of buckets in the slave zone, and it hung.  The simplest operation I can run that hangs is:
radosgw-admin --name=client.radosgw.ceph1c regions list
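
If it helps, re-running that hung command with client-side debug overrides (standard Ceph debug options, nothing radosgw-specific) should show which OSD the request is stuck waiting on:
radosgw-admin --name=client.radosgw.ceph1c regions list --debug-ms=1 --debug-objecter=20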

This cluster has 2 nodes.  Watching ceph -w, I see 2 OSDs get kicked out of the cluster.  OSDs 4 and 8 are both flapping.  An excerpt from ceph -w:
2014-03-26 11:24:33.582757 osd.9 [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.838374 secs
2014-03-26 11:24:33.582762 osd.9 [WRN] slow request 30.838374 seconds old, received at 2014-03-26 11:24:02.744331: osd_op(client.41087064.0:8 notify.0 [watch add cookie 1 ver 0] 9.4322fa9f e8270) v4 currently waiting for subops from [4]
2014-03-26 11:24:35.583064 osd.9 [WRN] 2 slow requests, 1 included below; oldest blocked for > 32.838703 secs
2014-03-26 11:24:35.583068 osd.9 [WRN] slow request 30.012828 seconds old, received at 2014-03-26 11:24:05.570206: osd_op(client.41087080.0:8 notify.0 [watch add cookie 1 ver 0] 9.4322fa9f e8270) v4 currently waiting for subops from [4]
2014-03-26 11:28:10.612684 osd.9 [WRN] 4 slow requests, 1 included below; oldest blocked for > 247.868328 secs
2014-03-26 11:28:10.612692 osd.9 [WRN] slow request 240.276922 seconds old, received at 2014-03-26 11:24:10.335737: osd_op(client.41087104.0:8 notify.0 [watch add cookie 1 ver 0] 9.4322fa9f e8270) v4 currently waiting for subops from [4]
2014-03-26 11:31:08.198483 mon.0 [INF] osd.4 marked itself down
2014-03-26 11:31:08.266406 mon.0 [INF] osdmap e8271: 16 osds: 15 up, 16 in
2014-03-26 11:31:21.294695 mon.0 [INF] osd.4 10.193.0.6:6812/2215 boot
2014-03-26 11:33:55.856801 mon.0 [INF] osd.8 marked down after no pg stats for 900.157048seconds
2014-03-26 11:33:55.894864 mon.0 [INF] osdmap e8275: 16 osds: 15 up, 16 in
2014-03-26 11:47:30.952634 mon.0 [INF] osd.8 out (down for 815.058447)
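
Since the slow requests above are all waiting for subops from osd.4, I can also dump the in-flight ops on that OSD the next time it wedges (assuming the default admin socket path on that node):
ceph --admin-daemon /var/run/ceph/ceph-osd.4.asok dump_ops_in_flight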

I set the noout flag.
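For the record, that was just the standard command (with ceph osd unset noout to undo it later):
ceph osd set noout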

This is a production slave zone.  I can play with the cluster as long as replication is off.  By playing with OSDs 4 and 8, I found that the cluster works fine with OSD 4 up and OSD 8 down.  If I start OSD 8 and stop OSD 4, RGW is still hung.  If I start both, RGW is still hung.

I saw some mailing list posts where somebody had a similar issue, and fixing clock skew solved it.  I verified that the clocks are correct on both machines.
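Specifically, I checked NTP sync on each node and looked for skew warnings from the monitors; nothing exotic:
ntpq -p
ceph health detail | grep -i skew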

I haven't seen any XFS issues, but just to be sure, I unmounted both devices and ran xfs_check.  No errors were reported.
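Roughly, what I ran for each of the two OSDs before remounting and restarting it (the device name here is a placeholder):
stop ceph-osd id=4
umount /var/lib/ceph/osd/ceph-4
xfs_check /dev/sdX1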


Since all the easy things appear to be in order, I bumped up logging in the [osd] section on both nodes:
[osd]
  osd journal size = 6144
  osd max backfills = 1
  osd mkfs options xfs = -l size=1024m -n size=64k -i size=2048 -s size=4096
  osd mkfs type = xfs
  osd mount options xfs = rw,noatime,nodiratime,nosuid,noexec,inode64
  osd recovery op priority = 1
  debug osd = 5
  debug filestore = 5
  debug journal = 1
  debug monc = 20
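
For reference, I believe the same debug levels can also be injected at runtime without restarting the OSDs:
ceph tell osd.4 injectargs '--debug-osd 5 --debug-filestore 5 --debug-journal 1 --debug-monc 20'
ceph tell osd.8 injectargs '--debug-osd 5 --debug-filestore 5 --debug-journal 1 --debug-monc 20'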

At 14:14:00, I started OSD 4, and waited for ceph -w to stabilize.  CPU usage was normal.
At 14:15:10, I ran radosgw-admin --name=client.radosgw.ceph1c regions list && radosgw-admin --name=client.radosgw.ceph1c regionmap get.  It returned successfully.
At 14:16:00, I started OSD 8, and waited for ceph -w to stabilize.  CPU usage started out normal, but went to 100% before 14:16:40.
At 14:17:25, I ran radosgw-admin --name=client.radosgw.ceph1c regions list && radosgw-admin --name=client.radosgw.ceph1c regionmap get.  regions list hung, and I killed it.
At 14:18:15, I stopped ceph-osd id=8.
At 14:18:45, I ran radosgw-admin --name=client.radosgw.ceph1c regions list && radosgw-admin --name=client.radosgw.ceph1c regionmap get.  It returned successfully.
At 14:19:10, I stopped ceph-osd id=4.

Looking at the logs, nothing jumps out at me between 14:16:00 and 14:16:40.  The logs are busy, since OSD 8 is backfilling.
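If it helps anyone reading them, the window where the CPU ramps up can be pulled out with a simple time-prefix match, e.g.:
egrep ' 14:16:[0-4]' /var/log/ceph/ceph-osd.8.log | less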


I don't think it's related, but I'd like to point out that all of the RGW pools have snapshots:
root@ceph1c:/var/log/ceph# rados lssnap -p .us-central-1.rgw.root
3    2014-03-15T02:00:02    2014.03.15 02:00:02
4    2014-03-16T02:00:01    2014.03.16 02:00:02
5    2014-03-17T02:00:01    2014.03.17 02:00:01
6    2014-03-18T02:00:01    2014.03.18 02:00:01
7    2014-03-19T02:00:01    2014.03.19 02:00:01
8    2014-03-20T02:00:01    2014.03.20 02:00:01
9    2014-03-21T02:00:01    2014.03.21 02:00:01
10    2014-03-22T02:00:01    2014.03.22 02:00:01
11    2014-03-23T02:00:02    2014.03.23 02:00:02
12    2014-03-24T02:00:02    2014.03.24 02:00:02
13    2014-03-25T02:00:02    2014.03.25 02:00:02
14    2014-03-26T02:00:01    2014.03.26 02:00:01
12 snaps

That's only in my slave zone.  The master zone (which is working fine) has no snapshots.
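For context, those are plain RADOS pool snapshots from a nightly job on the slave side; the commands involved are just the standard ones, something along the lines of (the rmsnap target here is a placeholder):
rados mksnap -p .us-central-1.rgw.root "$(date +%Y-%m-%dT%H:%M:%S)"
rados rmsnap -p .us-central-1.rgw.root <oldest-snap-name>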


For now, I've removed the debugging, started OSD 4, and started radosgw-agent.  OSD 8 remains off. 

There are 43 PGs that live on OSDs 4 and 8 (with osd.8 down, their acting set shows up as just [4]):
root@ceph1c:/var/log/ceph# ceph pg dump | fgrep '[4]' | cut -d"." -f1 | sort | uniq -c
     36 11
      2 13
      1 14
      1 18
      1 19
      1 20
      1 21
root@ceph1c:/var/log/ceph# ceph osd dump | grep pool | egrep 'pool (11|13|14|18|19|20|21)'
pool 11 '.rgw.buckets' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 2048 pgp_num 2048 last_change 8014 owner 0 flags hashpspool
pool 13 '.us-central-1.log' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 32 pgp_num 32 last_change 8007 owner 0 flags hashpspool
pool 14 '.us-central-1.intent-log' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 32 pgp_num 32 last_change 8006 owner 0 flags hashpspool
pool 18 '.us-central-1.users.swift' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 32 pgp_num 32 last_change 8010 owner 0 flags hashpspool
pool 19 '.us-central-1.users.uid' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 32 pgp_num 32 last_change 8011 owner 0 flags hashpspool
pool 20 '.rgw.root' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 32 pgp_num 32 last_change 207 owner 0 flags hashpspool
pool 21 '.rgw.control' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 32 pgp_num 32 last_change 208 owner 0 flags hashpspool
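
If it's useful, any individual PG from that set can be mapped to its up/acting OSDs with ceph pg map; e.g. for a made-up PG id in the .rgw.buckets pool:
ceph pg map 11.2f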



The two OSD logs from 14:14:00 to 14:19:10 can be found at
OSD 4: https://cd.centraldesktop.com/p/eAAAAAAADL0vAAAAAABMSio
OSD 8: https://cd.centraldesktop.com/p/eAAAAAAADL05AAAAAFAVWRg


Any suggestions or ideas are appreciated.


--

Craig Lewis
Senior Systems Engineer
Office +1.714.602.1309
Email clewis@xxxxxxxxxxxxxxxxxx


