On 02/15/2018 05:33 PM, Gregory Farnum wrote:
On Thu, Feb 15, 2018 at 3:10 PM Graham Allan <gta@xxxxxxx> wrote:
A lot more in xattrs which I won't paste, though the keys are:
> root@cephmon1:~# ssh ceph03 find
/var/lib/ceph/osd/ceph-295/current/70.3d6s0_head -name '*1089213*'
-exec xattr {} +
> user.ceph._user.rgw.idtag
> user.cephos.spill_out
> user.ceph._
> user.ceph.snapset
> user.ceph._user.rgw.manifest
> user.ceph._@1
> user.ceph.hinfo_key
> user.ceph._user.rgw.manifest@1
> user.ceph._user.rgw.manifest@2
> user.ceph._user.rgw.acl
> user.ceph._user.rgw.x-amz-acl
> user.ceph._user.rgw.etag
> user.ceph._user.rgw.x-amz-date
> user.ceph._user.rgw.content_type
Not sure which among these would contain pointers to part files.
I believe it’s the manifest xattrs there.
Thanks, I see better now how this goes together. Rather than dumping out
the manifest xattrs, I guess I should also be able to understand this
from the output of "radosgw-admin object stat".
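Something like this should dump the manifest in readable form (a sketch,
with placeholder bucket/pool/object names; the ceph-dencoder route
assumes the head object is readable at all, and that this version's
ceph-dencoder knows the RGWObjManifest type):

radosgw-admin object stat --bucket=<bucket> --object='<object key>'

# or pull the raw manifest xattr off the head object and decode it
rados -p <rgw data pool> getxattr '<bucket marker>_<object key>' user.rgw.manifest > /tmp/manifest
ceph-dencoder type RGWObjManifest import /tmp/manifest decode dump_json

Either way, the manifest is what maps the s3 object onto its head and
tail/multipart rados objects.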
One example, yesterday pg 70.438 showed "has 169 objects unfound and
apparently lost". At that time the state was
active+recovery_wait+inconsistent. Today it's showing no unfound objects
but is active+clean+inconsistent, and objects which were inaccessible
via radosgw yesterday can now be downloaded. I'm not sure what changed. I
have asked ceph to perform another deep scrub and repair on the pg, but
it has yet to start. I'm really curious to see if it becomes consistent,
or discovers unfound objects again.
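For reference, this is roughly how to re-issue those and check whether
they have actually run yet (a sketch):

ceph pg deep-scrub 70.438
ceph pg repair 70.438
ceph pg 70.438 query | grep -E 'scrub_stamp|"state"'
ceph health detail | grep 70.438

The query output carries last_scrub_stamp/last_deep_scrub_stamp, which
is the easiest way to tell whether the scrub you requested has started.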
Actually now I notice that a pg reported as
active+recovery_wait+inconsistent by "ceph health detail" is shown as
active+recovering+inconsistent by "ceph pg ls". That makes more sense
to me - "recovery_wait" implied to me that it was waiting for recovery
to start, while "recovering" explains why the problem might clear
itself.
Right, “recovery_wait” means that the pg needs to do log-based recovery
but (at least) one of the participating OSDs doesn’t have a slot
available; that will resolve itself eventually.
It sounds like the scrubbing has detected some inconsistencies, but the
reason you weren't getting data is just that the request hit an object
which needed recovery and was blocked waiting on it.
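(If you want to see, or temporarily loosen, that throttle: as far as I
understand it, those recovery/backfill reservations are capped by
osd_max_backfills, with osd_recovery_max_active then limiting in-flight
ops within a pg. A sketch, values only as examples:

# on the OSD host, via the admin socket
ceph daemon osd.175 config get osd_max_backfills
ceph daemon osd.175 config get osd_recovery_max_active

# cluster-wide, not persistent across restarts
ceph tell 'osd.*' injectargs '--osd_max_backfills 2 --osd_recovery_max_active 4'

Bumping these makes recovery compete harder with client I/O, so it's a
trade-off.)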
Yes, though it seems to be stuck in a cycle. This morning, that same pg
70.438 is back in active+recovery_wait+inconsistent state with the same
169 unfound objects - and from s3, the objects which would download
successfully yesterday now stall. That probably makes sense while it is
in the "unfound" state, but the data is clearly there; ceph just isn't
succeeding in making the pg consistent. Each time it repairs, it claims
to fix more errors but then finds the same number of unfound objects
again.
/var/log/ceph/ceph.log.2.gz:2018-02-14 15:48:14.438357 osd.175 osd.175 10.31.0.71:6838/66301 2928 : cluster [ERR] 70.438s0 repair 0 missing, 169 inconsistent objects
/var/log/ceph/ceph.log.2.gz:2018-02-14 15:48:14.442875 osd.175 osd.175 10.31.0.71:6838/66301 2929 : cluster [ERR] 70.438 repair 169 errors, 845 fixed
/var/log/ceph/ceph.log.1.gz:2018-02-15 19:42:25.040196 osd.175 osd.175 10.31.0.71:6838/66301 2995 : cluster [ERR] 70.438s0 repair 0 missing, 169 inconsistent objects
/var/log/ceph/ceph.log.1.gz:2018-02-15 19:42:25.046028 osd.175 osd.175 10.31.0.71:6838/66301 2996 : cluster [ERR] 70.438 repair 169 errors, 685 fixed
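Something along these lines should show the per-object detail behind
those repair lines (a sketch; if it complains that no scrub information
is available, the pg needs a fresh deep-scrub first):

rados list-inconsistent-obj 70.438 --format=json-pretty
ceph pg 70.438 list_unfound
ceph pg 70.438 query   # recovery_state shows which OSDs were probed for the unfound objects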
I also now see that for these unfound objects "radosgw-admin object
stat" also hangs. Clearly makes sense since radosgw must perform this to
retrieve the object. Does it imply that ceph can't access the "head"
object in order to read the xattr data?
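A quick way to check where that head object lives and whether the
primary will answer for it at all, bypassing rgw entirely (pool name is
a placeholder for the rgw data pool; the '/' in the object name is
literal, the logs just show it escaped as %2f):

ceph osd map <rgw data pool> 'default.325674.85_bellplants_images/1042066.jpg'
rados -p <rgw data pool> stat 'default.325674.85_bellplants_images/1042066.jpg'

If the object is currently unfound I'd expect the rados stat to hang in
exactly the same way.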
If I set debug rgw=1 and debug ms=1 before running the "object stat"
command, it seems to stall in a loop of trying to communicate with OSDs for
pool 96, which is .rgw.control
10.32.16.93:0/2689814946 --> 10.31.0.68:6818/8969 -- osd_op(unknown.0.0:541 96.e 96:7759931f:::notify.3:head [watch ping cookie 139709246356176] snapc 0=[] ondisk+write+known_if_redirected e507695) v8 -- 0x7f10ac033610 con 0
10.32.16.93:0/2689814946 <== osd.38 10.31.0.68:6818/8969 59 ==== osd_op_reply(541 notify.3 [watch ping cookie 139709246356176] v0'0 uv3933745 ondisk = 0) v8 ==== 152+0+0 (2536111836 0 0) 0x7f1158003e20 con 0x7f117afd8390
Prior to that, probably more relevant, this was the only communication
logged with the primary osd of the pg:
10.32.16.93:0/1552085932 --> 10.31.0.71:6838/66301 -- osd_op(unknown.0.0:96 70.438s0 70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head [getxattrs,stat] snapc 0=[] ondisk+read+known_if_redirected e507695) v8 -- 0x7fab79889fa0 con 0
10.32.16.93:0/1552085932 <== osd.175 10.31.0.71:6838/66301 1 ==== osd_backoff(70.438s0 block id 1 [70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head,70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head) e507695) v1 ==== 209+0+0 (1958971312 0 0) 0x7fab5003d3c0 con 0x7fab79885980
10.32.16.93:0/1552085932 --> 10.31.0.71:6838/66301 -- osd_backoff(70.438s0 ack-block id 1 [70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head,70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head) e507695) v1 -- 0x7fab48065420 con 0
So I guess the backoff message above is saying the object is
unavailable. OK, that certainly makes sense, but I'm not sure it helps
me understand how to fix the inconsistencies...
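I guess the next things to look at, before considering anything
destructive, would be along these lines (a sketch, corrections
welcome):

ceph pg 70.438 query   # "might_have_unfound" in recovery_state shows which OSDs have/haven't been probed

and only as an absolute last resort, giving up on the 169 objects (as
far as I understand, erasure-coded pools only support "delete" here,
not "revert"):

ceph pg 70.438 mark_unfound_lost delete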
Graham
--
Graham Allan
Minnesota Supercomputing Institute - gta@xxxxxxx
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com