Hi everyone,

disclaimer upfront: this was seen in the wild on Hammer, and on 0.94.7 no less. Reproducing this on 0.94.10 is still pending, and we'll update here with findings, but my goal with this post is really to establish whether the behavior as seen is expected, and if so, what the rationale for it is.

This is all about slow requests popping up on a rather large scale after a previously down OSD node is brought back into the cluster. So here's the sequence of events for the issue I'm describing, as seen in a test:

22:08:53 - OSD node stopped. OSDs 6, 17, 18, 22, 31, 32, 36, 45, 58 mark themselves down. The cluster has noout set, so all OSDs remain in. fio tests are running against RBD in a loop, so there is heavy client I/O activity generating lots of new objects.

22:36:57 - OSD node restarted. OSDs 6, 17, 18, 22, 31, 32, 36, 45, 58 boot. Recovery commences, client I/O continues.

22:36:58 - OSD 17 is marked up (this will be relevant later).

22:37:01 - Last of the booted OSDs is marked up. Client I/O continues.

22:37:28 - Slow request warnings appear. osd op complaint time is set to 30, so these are requests that were issued around the time of the node restart. In other words, the cluster gets slow immediately after the node comes back up.

22:49:28 - Last slow request warning seen.

23:00:43 - Recovery complete, all PGs active+clean.

Here is an example of such a slow request, as seen in ceph.log (and ceph -w):

2017-09-09 22:37:33.241347 osd.30 172.22.4.54:6808/8976 3274 : cluster [WRN] slow request 30.447920 seconds old, received at 2017-09-09 22:37:02.793382: osd_op(client.790810466.0:55974959 rbd_data.154cd34979e2a9e3.0000000000001980 [set-alloc-hint object_size 4194304 write_size 4194304,write 438272~24576] 277.bbddc5af snapc 4f37f=[4f37f] ack+ondisk+write+known_if_redirected e1353592) currently waiting for degraded object

Having bumped the OSD's log level to 10 beforehand, these are some of OSD 30's log entries correlating with this event, with a client attempting to access rbd_data.154cd34979e2a9e3.0000000000001980:

2017-09-09 22:36:50.251219 7fe6ed7e9700 10 osd.30 pg_epoch: 1353586 pg[277.5af( v 1353586'11578774 (1353421'11574953,1353586'11578774] local-les=1353563 n=1735 ec=862683 les/c 1353563/1353563 1353562/1353562/1333335) [30,94] r=0 lpr=1353562 luod=1353586'11578773 lua=1353586'11578773 crt=1353586'11578772 lcod 1353586'11578772 mlcod 1353586'11578772 active+undersized+degraded snaptrimq=[4f12c~1,4f136~2,4f139~1,4f13c~1,4f13f~1,4f142~1,4f146~1,4f3f6~1,4f3fa~2,4f3fd~2,4f400~3,4f404~1,4f406~1,4f40d~2,4f410~3,4f414~2,4f419~1,4f41b~1,4f41d~3,4f421~1,4f423~1,4f426~1,4f428~1,4f42a~1]] append_log: trimming to 1353586'11578772 entries 1353586'11578771 (1353585'11578560) modify 277/bbddc5af/rbd_data.154cd34979e2a9e3.0000000000001980/head by client.790810466.0:55974546 2017-09-09 22:36:35.644687,1353586'11578772 (1353586'11578761) modify 277/b6d355af/rbd_data.28ab56362eb141f2.0000000000003483/head by client.790921544.0:36806288 2017-09-09 22:36:37.940141

Note the 10-second gap in lines containing "rbd_data.154cd34979e2a9e3.0000000000001980" between this log line and the next. This cluster does have an unusually long osd snap trim sleep of 500 ms, though, so that might contribute to the delay here. Right in between, at 22:36:58, OSD 17 comes up.
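(In case anyone wants to eyeball gaps like that in their own logs: something along these lines does the trick. This is just a quick illustrative helper, not what produced the numbers above, and the log path and object name are placeholders for this particular test.)

#!/usr/bin/env python
# Print the time gap between consecutive OSD log lines mentioning an object.
# Illustrative helper only; log path and object name are placeholders.
import sys
from datetime import datetime

logfile = sys.argv[1] if len(sys.argv) > 1 else '/var/log/ceph/ceph-osd.30.log'
needle = 'rbd_data.154cd34979e2a9e3.0000000000001980'

prev = None
with open(logfile) as f:
    for line in f:
        if needle not in line:
            continue
        # OSD debug log lines start with "YYYY-MM-DD HH:MM:SS.ffffff"
        stamp = ' '.join(line.split()[:2])
        try:
            ts = datetime.strptime(stamp, '%Y-%m-%d %H:%M:%S.%f')
        except ValueError:
            continue
        if prev is not None:
            print('%8.3f s since previous match: %s' %
                  ((ts - prev).total_seconds(), line.rstrip()[:100]))
        prev = ts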
Next comes what I am most curious about:

2017-09-09 22:37:00.220011 7fe6edfea700 10 osd.30 pg_epoch: 1353589 pg[277.5af( v 1353586'11578776 (1353421'11574953,1353586'11578776] local-les=1353589 n=1735 ec=862683 les/c 1353563/1353563 1353588/1353588/1333335) [30,17,94] r=0 lpr=1353588 pi=1353562-1353587/1 crt=1353586'11578773 lcod 1353586'11578775 mlcod 0'0 inactive snaptrimq=[4f12c~1,4f136~2,4f139~1,4f13c~1,4f13f~1,4f142~1,4f146~1,4f3f6~1,4f3fa~2,4f3fd~2,4f400~3,4f404~1,4f406~1,4f40d~2,4f410~3,4f414~2,4f419~1,4f41b~1,4f41d~3,4f421~1,4f423~1,4f426~1,4f428~1,4f42a~1]] search_for_missing 277/bbddc5af/rbd_data.154cd34979e2a9e3.0000000000001980/head 1353586'11578774 is on osd.30

2017-09-09 22:37:00.220307 7fe6edfea700 10 osd.30 pg_epoch: 1353589 pg[277.5af( v 1353586'11578776 (1353421'11574953,1353586'11578776] local-les=1353589 n=1735 ec=862683 les/c 1353563/1353563 1353588/1353588/1333335) [30,17,94] r=0 lpr=1353588 pi=1353562-1353587/1 crt=1353586'11578773 lcod 1353586'11578775 mlcod 0'0 inactive snaptrimq=[4f12c~1,4f136~2,4f139~1,4f13c~1,4f13f~1,4f142~1,4f146~1,4f3f6~1,4f3fa~2,4f3fd~2,4f400~3,4f404~1,4f406~1,4f40d~2,4f410~3,4f414~2,4f419~1,4f41b~1,4f41d~3,4f421~1,4f423~1,4f426~1,4f428~1,4f42a~1]] search_for_missing 277/bbddc5af/rbd_data.154cd34979e2a9e3.0000000000001980/head 1353586'11578774 also missing on osd.17

2017-09-09 22:37:00.220599 7fe6edfea700 10 osd.30 pg_epoch: 1353589 pg[277.5af( v 1353586'11578776 (1353421'11574953,1353586'11578776] local-les=1353589 n=1735 ec=862683 les/c 1353563/1353563 1353588/1353588/1333335) [30,17,94] r=0 lpr=1353588 pi=1353562-1353587/1 crt=1353586'11578773 lcod 1353586'11578775 mlcod 0'0 inactive snaptrimq=[4f12c~1,4f136~2,4f139~1,4f13c~1,4f13f~1,4f142~1,4f146~1,4f3f6~1,4f3fa~2,4f3fd~2,4f400~3,4f404~1,4f406~1,4f40d~2,4f410~3,4f414~2,4f419~1,4f41b~1,4f41d~3,4f421~1,4f423~1,4f426~1,4f428~1,4f42a~1]] search_for_missing 277/bbddc5af/rbd_data.154cd34979e2a9e3.0000000000001980/head 1353586'11578774 is on osd.94

2017-09-09 22:37:02.793774 7fe6ed7e9700 7 osd.30 pg_epoch: 1353592 pg[277.5af( v 1353586'11578776 (1353421'11574953,1353586'11578776] local-les=1353589 n=1735 ec=862683 les/c 1353589/1353563 1353588/1353588/1333335) [30,17,94] r=0 lpr=1353588 pi=1353562-1353587/1 crt=1353586'11578773 lcod 1353586'11578775 mlcod 0'0 active+recovery_wait+degraded snaptrimq=[4f12c~1,4f136~2,4f139~1,4f13c~1,4f13f~1,4f142~1,4f146~1,4f3f6~1,4f3fa~2,4f3fd~2,4f400~3,4f404~1,4f406~1,4f40d~2,4f410~3,4f414~2,4f419~1,4f41b~1,4f41d~3,4f421~1,4f423~1,4f426~1,4f428~1,4f42a~1]] degraded 277/bbddc5af/rbd_data.154cd34979e2a9e3.0000000000001980/head, recovering

So we have a client that is talking to OSD 30. OSD 30 was never down; OSD 17 was. OSD 30 is also the preferred primary for this PG (via primary affinity). The OSD now says that

- it does itself have a copy of the object,
- so does OSD 94,
- but the object is "also" missing on OSD 17.

So I'd like to ask, firstly: what does "also" mean here?

Secondly, if the local copy is current, and we have no fewer than min_size copies of the object, and recovery is meant to be a background operation, why is recovery in the I/O path here at all? Specifically, why is that the case on a write, where the object is being modified anyway, and the modification then needs to be replicated out to OSDs 17 and 94?

Thirdly, (a) is this how things are meant to be in general, or (b) was this broken in hammer and fixed with the client/recovery op rewrite in jewel, or (c) was it fixed at some other point, or (d) is it still broken?
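To make the second question concrete, here is a toy model of what the log lines above look like to me. This is deliberately not the actual Ceph code, and all names in it are made up; it just mimics the observed behavior: the primary gates a client write on the object being present on every OSD in the acting set, and otherwise parks the op behind recovery of that one object.

#!/usr/bin/env python
# Toy model only -- not the Ceph implementation; all names are invented.
# OSD 30 has the object, OSD 94 has it, freshly rejoined OSD 17 is missing
# it, and the client write is queued behind recovery of that object.
from collections import defaultdict

class ToyPrimary(object):
    def __init__(self, acting, peer_missing):
        self.acting = acting                  # e.g. [30, 17, 94]
        self.peer_missing = peer_missing      # osd id -> set of missing objects
        self.waiting_for_degraded = defaultdict(list)

    def handle_write(self, oid, op):
        missing_on = [osd for osd in self.acting
                      if oid in self.peer_missing.get(osd, set())]
        if missing_on:
            # corresponds to "currently waiting for degraded object"
            print("%s: missing on %s, queueing op behind recovery" % (oid, missing_on))
            self.waiting_for_degraded[oid].append(op)
        else:
            print("%s: applying locally, replicating to %s" % (oid, self.acting[1:]))

pg = ToyPrimary(acting=[30, 17, 94],
                peer_missing={17: set(['rbd_data.154cd34979e2a9e3.0000000000001980'])})
pg.handle_write('rbd_data.154cd34979e2a9e3.0000000000001980', 'write 438272~24576')

If that is roughly what happens, then every write touching an object that the rejoined OSD is still missing becomes a recovery-first operation, which would line up with the burst of slow requests starting seconds after the node came back up.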
Looking forward to any insight that people might be able to share. Thanks in advance!

Cheers,
Florian