Hello,

We have had a cluster in HEALTH_ERR state for a while now. We are trying to figure out how to resolve it without having to remove the affected RBD image.

$ ceph -s
    cluster e94277ae-3d38-4547-8add-2cf3306f3efd
     health HEALTH_ERR
            1 pgs inconsistent
            5 scrub errors
            mon.ds2-mon1 low disk space
     monmap e5: 3 mons at {ds2-mon1=[2a00:c6c0:0:120::211]:6789/0,ds2-mon2=[2a00:c6c0:0:120::212]:6789/0,ds2-mon3=[2a00:c6c0:0:120::213]:6789/0}
            election epoch 2692, quorum 0,1,2 ds2-mon1,ds2-mon2,ds2-mon3
     osdmap e241870: 95 osds: 95 up, 95 in
      pgmap v37634467: 1408 pgs, 3 pools, 30203 GB data, 11366 kobjects
            90462 GB used, 99939 GB / 185 TB avail
                1402 active+clean
                   3 active+clean+scrubbing
                   2 active+clean+scrubbing+deep
                   1 active+clean+inconsistent

The problematic PG is 2.17e, with OSD set [74, 26]; osd.74 is the primary and also the OSD reporting issues with the object. The log of osd.74 contains the following errors:

/var/log/ceph/ceph-osd.74.log.1.gz:899:2016-11-23 09:36:26.521014 7f1b3db20700 -1 log_channel(cluster) log [ERR] : trim_object Snap 27e96 not in clones
/var/log/ceph/ceph-osd.74.log.1.gz:951:2016-11-23 12:01:59.581861 7f1b40325700 -1 log_channel(cluster) log [ERR] : 2.17e shard 74: soid 2:7ea8d0f2:::rbd_data.2cb1eb3f2e2dab.000000000002b0a9:2769b missing attr _
/var/log/ceph/ceph-osd.74.log.1.gz:952:2016-11-23 12:01:59.581948 7f1b40325700 -1 log_channel(cluster) log [ERR] : scrub 2.17e 2:7ea8d0f2:::rbd_data.2cb1eb3f2e2dab.000000000002b0a9:28a75 is an unexpected clone
/var/log/ceph/ceph-osd.74.log.1.gz:953:2016-11-23 12:01:59.581976 7f1b40325700 -1 log_channel(cluster) log [ERR] : scrub 2.17e 2:7ea8d0f2:::rbd_data.2cb1eb3f2e2dab.000000000002b0a9:287d4 is an unexpected clone
/var/log/ceph/ceph-osd.74.log.1.gz:954:2016-11-23 12:01:59.581997 7f1b40325700 -1 log_channel(cluster) log [ERR] : scrub 2.17e 2:7ea8d0f2:::rbd_data.2cb1eb3f2e2dab.000000000002b0a9:27e96 is an unexpected clone
/var/log/ceph/ceph-osd.74.log.1.gz:955:2016-11-23 12:01:59.582016 7f1b40325700 -1 log_channel(cluster) log [ERR] : scrub 2.17e 2:7ea8d0f2:::rbd_data.2cb1eb3f2e2dab.000000000002b0a9:2769b is an unexpected clone
/var/log/ceph/ceph-osd.74.log.1.gz:956:2016-11-23 12:03:08.824919 7f1b40325700 -1 log_channel(cluster) log [ERR] : 2.17e scrub 0 missing, 1 inconsistent objects
/var/log/ceph/ceph-osd.74.log.1.gz:957:2016-11-23 12:03:08.824934 7f1b40325700 -1 log_channel(cluster) log [ERR] : 2.17e scrub 5 errors
/var/log/ceph/ceph-osd.74.log.1.gz:958:2016-11-23 12:05:50.500581 7f1b3db20700 -1 log_channel(cluster) log [ERR] : trim_object Snap 27e96 not in clones

We ran "ceph pg repair" several times a while ago; the count went from 8 scrub errors to 6, then to 4, and then back up to 5, but the cluster never reached a healthy state. The error messages in the log have been changing as well, although the "unexpected clone" messages have always been there.

$ rbd snap ls imageXX
SNAPID NAME                 SIZE
155958 2016-10-01-00-50-21  976 GB
161092 2016-10-16-01-13-52  976 GB
163478 2016-10-23-00-58-53  976 GB
165844 2016-10-30-00-58-57  976 GB
166517 2016-11-01-01-39-25  976 GB

By converting the SNAPIDs to hexadecimal, we get the ids referred to in the log:

163478  27e96
165844  287d4
166517  28a75
161435  2769b  -- this one is not a snapshot of that image??

However, since a few days ago "rbd snap ls imageXX" no longer shows any snapshots for that image. (???)

We wonder whether it is safe to run another "ceph pg repair", since the primary seems to be the one holding the broken information. As far as I understand, the primary is the one that tries to fix the others with its own copy.
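For reference, the decimal-to-hex mapping above can be reproduced with printf, and the reverse lookup for the stray 2769b clone id from the log as well (just a quick shell sketch):

$ printf '%x\n' 163478
27e96
$ printf '%d\n' 0x2769b
161435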
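If the cluster is on a release recent enough to have it (as far as we know the subcommand was introduced in Jewel, so this is an assumption about our version), the inconsistencies recorded by the last scrub of that PG could also be dumped before deciding on another repair:

$ rados list-inconsistent-obj 2.17e --format=json-pretty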
Also, we wonder whether the manual fix that David suggests in this thread would work for us too:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-May/031307.html

Kind regards,

--
Ana Avilés