Re: Infernalis .rgw.buckets.index objects becoming corrupted on RHEL 7.2 during recovery


 



I was able to get this back to HEALTH_OK by doing the following:

1. Let ceph-objectstore-tool run over a weekend to export the PG.  Judging by the timestamps, the export took approximately 6 hours to complete successfully.
2. Import the PG into an unused OSD and start that OSD up + out (rough export/import commands are sketched below).
3. Let the cluster detect the imported PG and backfill the exported object / keys that were stuck and unfound.
4. Start OSD 388 and let the cluster query it for any other recovery information.  OSD 388 no longer hit the suicide timeout after the backfill was complete.
5. Wait for the cluster to finish recovering the PG to 3 new OSDs.
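
For reference, the ceph-objectstore-tool invocations were roughly along the lines below.  The export file path is illustrative rather than copied from our shell history, and <spare> stands for whichever unused OSD you import into; both OSD daemons are stopped while the tool runs.

    | # export the stuck PG from the broken OSD (daemon stopped)
    | ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-388 \
    |     --journal-path /var/lib/ceph/osd/ceph-388/journal \
    |     --pgid 24.197 --op export --file /root/pg24.197.export
    |
    | # import it into an unused OSD (also stopped), then start that OSD up + out
    | ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<spare> \
    |     --journal-path /var/lib/ceph/osd/ceph-<spare>/journal \
    |     --op import --file /root/pg24.197.export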


I am still not sure what caused it to get into this state, or whether there is a better way to resolve it.  As I stated before, this is the second time this has happened while the .rgw.buckets.index pool is recovering.

Has anyone else run into this or is it just our specific deployment recipe?

Thanks,

Brandon


On Thu, Mar 17, 2016 at 4:37 PM, Brandon Morris, PMP <brandon.morris.pmp@xxxxxxxxx> wrote:
List,

We have stood up an Infernalis 9.2.0 cluster on RHEL 7.2.  We are using the radosgw to store potentially billions of small to medium sized objects (64 KB - 1 MB).

We have run into an issue twice thus far where .rgw.buckets.index placement groups become corrupt during recovery after a drive failure.  This corruption causes the OSD to crash with a suicide_timeout error when it tries to backfill the corrupted index object to a different OSD.  Exporting the corrupted placement group with ceph-objectstore-tool also hangs.  When this first came up, we were able to simply rebuild the .rgw pools and start from scratch.  There were no underlying XFS issues.

Before we put this cluster into full operation, we are looking to determine what caused this and whether there is a hard limit on the number of objects in a bucket.  We are currently putting all objects into a single bucket, but should probably divide them up.
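
If bucket index sharding is available in 9.2.0 (I believe it was added in Hammer, but I have not confirmed that), spreading the index of new buckets over several shards may also be worth testing, e.g. with something like the following in ceph.conf on the gateway nodes.  The section name below is just a placeholder for the actual RGW instance, and my understanding is the setting only applies to buckets created after it is set.

    | [client.radosgw.gateway]
    |     rgw override bucket index max shards = 8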

I have uploaded the OSD and ceph-objectstore-tool debug files here: https://github.com/garignack/ceph_misc/raw/master/ceph-osd.388.zip.  Any help would be greatly appreciated.

I'm not a Ceph expert by any means, but here is where I've gotten to thus far (and I may be way off base).

The PG in question only has one object: .dir.default.808642.1.163
    | [root@node13 ~]# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-388/ --journal-path /var/lib/ceph/osd/ceph-388/journal --pgid 24.197 --op list
    | SG_IO: bad/missing sense data, sb[]:  70 00 05 00 00 00 00 18 00 00 00 00 20 00 00 00 00 00 83 1c 00 00 00 00 00 00 00 00 00 00 00 00
    | SG_IO: bad/missing sense data, sb[]:  70 00 05 00 00 00 00 18 00 00 00 00 20 00 00 00 00 00 83 1c 00 00 00 00 00 00 00 00 00 00 00 00
    | ["24.197",{"oid":".dir.default.808642.1.163","key":"","snapid":-2,"hash":491874711,"max":0,"pool":24,"namespace":"","max":0}]

Here are the final lines of the ceph-objectstore-tool before it hangs:

    | e140768: 570 osds: 558 up, 542 in
    | Read 24/1d516997/.dir.default.808642.1.163/head
    | size=0
    | object_info: 24/1d516997/.dir.default.808642.1.163/head(139155'2197754 client.1137891.0:20837319 dirty|omap|data_digest s 0 uv 2197754 dd ffffffff)
    | attrs size 2

This leads me to suspect that something between line 564 and line 576 of the tool is hanging: https://github.com/ceph/ceph/blob/master/src/tools/ceph_objectstore_tool.cc#L564.  My current suspect is the objectstore read call:
 
    | ret = store->read(cid, obj, offset, len, rawdatabl); 
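
If it would help, I can attach gdb to the hung tool and grab thread backtraces to confirm exactly where it is blocked; a quick sketch (debuginfo packages would be needed for readable symbols):

    | # dump backtraces of every thread in the hung ceph-objectstore-tool process
    | gdb -p $(pidof ceph-objectstore-tool) -batch -ex "thread apply all bt"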
 
Looking through the OSD debug logs, I also see a strange size (18446744073709551615) on the recovery operation for the 24/1d516997/.dir.default.808642.1.163/head object:

    | 2016-03-17 12:12:29.753446 7f972ca3d700 10 osd.388 154849 dequeue_op 0x7f97580d3500 prio 2 cost 1049576 latency 0.000185 MOSDPGPull(24.197 154849 [PullOp(24/1d516997/.dir.default.808642.1.163/head, recovery_info: ObjectRecoveryInfo(24/1d516997/.dir.default.808642.1.163/head@139155'2197754, size: 18446744073709551615, copy_subset: [0~18446744073709551615], clone_subset: {}), recovery_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false))]) v2 pg pg[24.197( v 139155'2197754 (139111'2194700,139155'2197754] local-les=154480 n=1 ec=128853 les/c/f 154268/138679/0 154649/154650/154650) [179,443,517]/[306,441] r=-1 lpr=154846 pi=138674-154649/37 crt=139155'2197752 lcod 0'0 inactive NOTIFY NIBBLEWISE]
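
That size is exactly 2^64 - 1, i.e. a uint64_t holding -1, which makes me suspect the size field was never set rather than it being a real object length; quick sanity check of the arithmetic:

    | python -c 'print(2**64 - 1)'
    | 18446744073709551615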

This eventually causes the thread to hang long enough to trigger the suicide timeout:
    | 2016-03-17 12:12:45.541528 7f973524e700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f972ca3d700' had timed out after 15
    | 2016-03-17 12:12:45.541533 7f973524e700 20 heartbeat_map is_healthy = NOT HEALTHY, total workers: 29, number of unhealthy: 1
    | 2016-03-17 12:12:45.541534 7f973524e700 10 osd.388 154849 internal heartbeat not healthy, dropping ping request

    | 2016-03-17 12:15:02.148193 7f973524e700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f972ca3d700' had timed out after 15
    | 2016-03-17 12:15:02.148195 7f973524e700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f972ca3d700' had suicide timed out after 150


    | ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
    |  1: (()+0x7e6ab2) [0x7f9753701ab2]
    |  2: (()+0xf100) [0x7f9751893100]
    |  3: (gsignal()+0x37) [0x7f97500705f7]
    |  4: (abort()+0x148) [0x7f9750071ce8]
    |  5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f97509749d5]
    |  6: (()+0x5e946) [0x7f9750972946]
    |  7: (()+0x5e973) [0x7f9750972973]
    |  8: (()+0x5eb93) [0x7f9750972b93]
    |  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27a) [0x7f97537f6dda]
    | 10: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, long)+0x2d9) [0x7f97537363b9]
    | 11: (ceph::HeartbeatMap::is_healthy()+0xd6) [0x7f9753736bf6]
    | 12: (OSD::handle_osd_ping(MOSDPing*)+0x933) [0x7f9753241593]
    | 13: (OSD::heartbeat_dispatch(Message*)+0x3bb) [0x7f975324278b]
    | 14: (DispatchQueue::entry()+0x649) [0x7f97538b2939]
    | 15: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f97537d6b1d]
    | 16: (()+0x7dc5) [0x7f975188bdc5]
    | 17: (clone()+0x6d) [0x7f975013121d]
    | NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

    | --- logging levels ---
    |    0/ 5 none
    |    0/ 0 lockdep
    |    0/ 0 context
    |    0/ 0 crush
    |    1/ 5 mds
    |    1/ 5 mds_balancer
    |    1/ 5 mds_locker
    |    1/ 5 mds_log
    |    1/ 5 mds_log_expire
    |    1/ 5 mds_migrator
    |    0/ 0 buffer
    |    0/ 0 timer
    |    0/ 0 filer
    |    0/ 1 striper
    |    0/ 0 objecter
    |    0/ 0 rados
    |    0/ 0 rbd
    |    0/ 5 rbd_replay
    |    0/ 0 journaler
    |    0/ 5 objectcacher
    |    0/ 0 client
    |   20/20 osd
    |    0/ 0 optracker
    |    0/ 0 objclass
    |    0/ 0 filestore
    |    1/ 3 keyvaluestore
    |   20/20 journal
    |   20/20 ms
    |    0/ 0 mon
    |    0/ 0 monc
    |    0/ 0 paxos
    |    0/ 0 tp
    |    0/ 0 auth
    |    1/ 5 crypto
    |    0/ 0 finisher
    |   20/20 heartbeatmap
    |    0/ 0 perfcounter
    |    0/ 0 rgw
    |    1/10 civetweb
    |    1/ 5 javaclient
    |    0/ 0 asok
    |    0/ 0 throttle
    |    0/ 0 refs
    |    1/ 5 xio
    |    1/ 5 compressor
    |    1/ 5 newstore
    |   -2/-2 (syslog threshold)
    |   -1/-1 (stderr threshold)
    |   max_recent     10000
    |   max_new         1000
    |   log_file /var/log/ceph/ceph-osd.388.log
    | --- end dump of recent events ---
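
As a possible stopgap while the root cause is unknown (I have not tried this on the affected OSD), the op thread suicide timeout can be raised at runtime so the OSD survives the slow pull a little longer; this only hides the hang rather than fixing it:

    | # in-memory only; reverts when the OSD restarts
    | ceph tell osd.388 injectargs '--osd-op-thread-suicide-timeout 600'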

Here is the output from ceph -s and ceph health detail:
   
cluster 040292e3-03f8-4c03-a6bb-ab0a13703c77
     health HEALTH_WARN
            1 pgs degraded
            1 pgs recovering
            1 pgs stuck degraded
            1 pgs stuck unclean
            1 pgs stuck undersized
            1 pgs undersized
            100 requests are blocked > 32 sec
            recovery 3/1009551476 objects degraded (0.000%)
            recovery 5/1009551476 objects misplaced (0.000%)
            recovery 1/336517158 unfound (0.000%)
     monmap e1: 3 mons at {MON01=192.168.2.20:6789/0,mon02=192.168.2.21:6789/0,mon03=192.168.2.22:6789/0}
            election epoch 218, quorum 0,1,2 mon01,mon02,mon03
     mdsmap e101: 1/1/1 up {0=mon03=up:active}, 2 up:standby
     osdmap e154854: 558 osds: 544 up, 529 in; 1 remapped pgs
            flags sortbitwise
      pgmap v3740299: 44352 pgs, 21 pools, 6979 GB data, 320 Mobjects
            29349 GB used, 2847 TB / 2876 TB avail
            3/1009551476 objects degraded (0.000%)
            5/1009551476 objects misplaced (0.000%)
            1/336517158 unfound (0.000%)
               44322 active+clean
                  27 active+clean+scrubbing
                   2 active+clean+scrubbing+deep
                   1 active+recovering+undersized+degraded+remapped

HEALTH_WARN 1 pgs degraded; 1 pgs recovering; 1 pgs stuck degraded; 1 pgs stuck unclean; 1 pgs stuck undersized; 1 pgs undersized; 100 requests are blocked > 32 sec; 1 osds have slow requests; recovery 3/1009551476 objects degraded (0.000%); recovery 5/1009551476 objects misplaced (0.000%); recovery 1/336517158 unfound (0.000%)
pg 24.197 is stuck unclean for 555746.710196, current state active+recovering+undersized+degraded+remapped, last acting [306,441]
pg 24.197 is stuck undersized for 94219.301529, current state active+recovering+undersized+degraded+remapped, last acting [306,441]
pg 24.197 is stuck degraded for 94219.302284, current state active+recovering+undersized+degraded+remapped, last acting [306,441]
pg 24.197 is active+recovering+undersized+degraded+remapped, acting [306,441], 1 unfound
100 ops are blocked > 4194.3 sec on osd.306
1 osds have slow requests
recovery 3/1009551476 objects degraded (0.000%)
recovery 5/1009551476 objects misplaced (0.000%)
recovery 1/336517158 unfound (0.000%)   
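
For completeness, the unfound object and the PG's recovery state can be inspected with the standard PG commands (nothing below is specific to this failure):

    | # list the missing/unfound objects in the PG and which OSDs might still hold them
    | ceph pg 24.197 list_missing
    | # full peering and recovery state of the PG
    | ceph pg 24.197 query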

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
