On 02/15/2018 05:33 PM, Gregory Farnum wrote:
On Thu, Feb 15, 2018 at 3:10 PM Graham Allan <gta@xxxxxxx> wrote:
A lot more in xattrs which I won't paste, though the keys are:
> root@cephmon1:~# ssh ceph03 find
/var/lib/ceph/osd/ceph-295/current/70.3d6s0_head -name '*1089213*'
-exec xattr {} +
> user.ceph._user.rgw.idtag
> user.cephos.spill_out
> user.ceph._
> user.ceph.snapset
> user.ceph._user.rgw.manifest
> user.ceph._@1
> user.ceph.hinfo_key
> user.ceph._user.rgw.manifest@1
> user.ceph._user.rgw.manifest@2
> user.ceph._user.rgw.acl
> user.ceph._user.rgw.x-amz-acl
> user.ceph._user.rgw.etag
> user.ceph._user.rgw.x-amz-date
> user.ceph._user.rgw.content_type
Not sure which among these would contain pointers to part files.
I believe it’s the manifest xattrs there.
Thanks, I see better now how this goes together. Rather than dumping out
the manifest xattrs, I guess I should also be able to understand this
from the output of "radosgw-admin object stat".
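Something like this should dump the manifest in readable form (a sketch,
with placeholder bucket/pool/object names; the ceph-dencoder route
assumes the head object is readable at all, and that this version's
ceph-dencoder knows the RGWObjManifest type):

radosgw-admin object stat --bucket=<bucket> --object='<object key>'

# or pull the raw manifest xattr off the head object and decode it
rados -p <rgw data pool> getxattr '<bucket marker>_<object key>' user.rgw.manifest > /tmp/manifest
ceph-dencoder type RGWObjManifest import /tmp/manifest decode dump_json

Either way, the manifest is what maps the s3 object onto its head and
tail/multipart rados objects.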
One example, yesterday pg 70.438 showed "has 169 objects unfound and
apparently lost". At that time the state was
active+recovery_wait+inconsistent. Today it's showing no unfound objects
but is active+clean+inconsistent, and objects which were inaccessible
via radosgw yesterday can now be downloaded. I'm not sure what changed. I
have asked ceph to perform another deep scrub and repair on the pg, but
it has yet to start. I'm really curious to see if it becomes consistent,
or discovers unfound objects again.
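For reference, this is roughly how to re-issue those and check whether
they have actually run yet (a sketch):

ceph pg deep-scrub 70.438
ceph pg repair 70.438
ceph pg 70.438 query | grep -E 'scrub_stamp|"state"'
ceph health detail | grep 70.438

The query output carries last_scrub_stamp/last_deep_scrub_stamp, which
is the easiest way to tell whether the scrub you requested has started.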
Actually now I notice that a pg reported as
active+recovery_wait+inconsistent by "ceph health detail" is shown as
active+recovering+inconsistent by "ceph pg ls". That makes more sense
to me - "recovery_wait" implied to me that it was waiting for recovery
to start, while "recovering" explains why the problem might clear
itself.
Right, “recovery_wait” means that the pg needs to do log-based recovery
but (at least) one of the participating OSDs doesn’t have a slot
available; that will resolve itself eventually.
It sounds like the scrubbing has detected some inconsistencies, but the
reason you weren't getting data is just that the request hit an object
which needed recovery and was blocked waiting on it.
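(If you want to see, or temporarily loosen, that throttle: as far as I
understand it, those recovery/backfill reservations are capped by
osd_max_backfills, with osd_recovery_max_active then limiting in-flight
ops within a pg. A sketch, values only as examples:

# on the OSD host, via the admin socket
ceph daemon osd.175 config get osd_max_backfills
ceph daemon osd.175 config get osd_recovery_max_active

# cluster-wide, not persistent across restarts
ceph tell 'osd.*' injectargs '--osd_max_backfills 2 --osd_recovery_max_active 4'

Bumping these makes recovery compete harder with client I/O, so it's a
trade-off.)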
Yes, though it seems to be stuck in a cycle. This morning, that same pg
70.438 is back in active+recovery_wait+inconsistent state with the same
169 unfound objects - and from s3, the objects which would download
successfully yesterday now stall. That probably makes sense while it is
in the "unfound" state, but the data is clearly there; ceph just isn't
succeeding in making the pg consistent. Each time it repairs, it claims
to fix more errors but then finds the same number of unfound objects
again.
/var/log/ceph/ceph.log.2.gz:2018-02-14 15:48:14.438357 osd.175 osd.175 10.31.0.71:6838/66301 2928 : cluster [ERR] 70.438s0 repair 0 missing, 169 inconsistent objects
/var/log/ceph/ceph.log.2.gz:2018-02-14 15:48:14.442875 osd.175 osd.175 10.31.0.71:6838/66301 2929 : cluster [ERR] 70.438 repair 169 errors, 845 fixed
/var/log/ceph/ceph.log.1.gz:2018-02-15 19:42:25.040196 osd.175 osd.175 10.31.0.71:6838/66301 2995 : cluster [ERR] 70.438s0 repair 0 missing, 169 inconsistent objects
/var/log/ceph/ceph.log.1.gz:2018-02-15 19:42:25.046028 osd.175 osd.175 10.31.0.71:6838/66301 2996 : cluster [ERR] 70.438 repair 169 errors, 685 fixed
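Something along these lines should show the per-object detail behind
those repair lines (a sketch; if it complains that no scrub information
is available, the pg needs a fresh deep-scrub first):

rados list-inconsistent-obj 70.438 --format=json-pretty
ceph pg 70.438 list_unfound
ceph pg 70.438 query   # recovery_state shows which OSDs were probed for the unfound objects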
I also now see that for these unfound objects "radosgw-admin object
stat" also hangs. Clearly makes sense since radosgw must perform this to
retrieve the object. Does it imply that ceph can't access the "head"
object in order to read the xattr data?
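A quick way to check where that head object lives and whether the
primary will answer for it at all, bypassing rgw entirely (pool name is
a placeholder for the rgw data pool; the '/' in the object name is
literal, the logs just show it escaped as %2f):

ceph osd map <rgw data pool> 'default.325674.85_bellplants_images/1042066.jpg'
rados -p <rgw data pool> stat 'default.325674.85_bellplants_images/1042066.jpg'

If the object is currently unfound I'd expect the rados stat to hang in
exactly the same way.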
If I set debug rgw=1 and debug ms=1 before running the "object stat"
command, it seems to stall in a loop of trying to communicate with OSDs for
pool 96, which is .rgw.control
10.32.16.93:0/2689814946 --> 10.31.0.68:6818/8969 -- osd_op(unknown.0.0:541 96.e 96:7759931f:::notify.3:head [watch ping cookie 139709246356176] snapc 0=[] ondisk+write+known_if_redirected e507695) v8 -- 0x7f10ac033610 con 0
10.32.16.93:0/2689814946 <== osd.38 10.31.0.68:6818/8969 59 ==== osd_op_reply(541 notify.3 [watch ping cookie 139709246356176] v0'0 uv3933745 ondisk = 0) v8 ==== 152+0+0 (2536111836 0 0) 0x7f1158003e20 con 0x7f117afd8390
Prior to that, probably more relevant, this was the only communication
logged with the primary osd of the pg:
10.32.16.93:0/1552085932 --> 10.31.0.71:6838/66301 -- osd_op(unknown.0.0:96 70.438s0 70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head [getxattrs,stat] snapc 0=[] ondisk+read+known_if_redirected e507695) v8 -- 0x7fab79889fa0 con 0
10.32.16.93:0/1552085932 <== osd.175 10.31.0.71:6838/66301 1 ==== osd_backoff(70.438s0 block id 1 [70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head,70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head) e507695) v1 ==== 209+0+0 (1958971312 0 0) 0x7fab5003d3c0 con 0x7fab79885980
10.32.16.93:0/1552085932 --> 10.31.0.71:6838/66301 -- osd_backoff(70.438s0 ack-block id 1 [70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head,70:1c20c157:::default.325674.85_bellplants_images%2f1042066.jpg:head) e507695) v1 -- 0x7fab48065420 con 0
So I guess the backoff message above is saying the object is
unavailable. OK, that certainly makes sense, but I'm not sure it helps
me understand how to fix the inconsistencies...
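I guess the next things to look at, before considering anything
destructive, would be along these lines (a sketch, corrections
welcome):

ceph pg 70.438 query   # "might_have_unfound" in recovery_state shows which OSDs have/haven't been probed

and only as an absolute last resort, giving up on the 169 objects (as
far as I understand, erasure-coded pools only support "delete" here,
not "revert"):

ceph pg 70.438 mark_unfound_lost delete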
Graham
--
Graham Allan
Minnesota Supercomputing Institute - gta@xxxxxxx
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com