Hi everyone,

disclaimer upfront: this was seen in the wild on Hammer, and on 0.94.7 no less. Reproducing this on 0.94.10 is still pending, and we'll update here with findings, but my goal with this post is really to establish whether the behavior as seen is expected, and if so, what the rationale for it is.

This is all about slow requests popping up on a rather large scale after a previously down OSD node is brought back into the cluster. So here's the sequence of events for the issue I'm describing, as seen in a test:

22:08:53 - OSD node stopped. OSDs 6, 17, 18, 22, 31, 32, 36, 45, 58 mark themselves down. The cluster has noout set, so all OSDs remain in. fio tests are running against RBD in a loop, so there is heavy client I/O activity generating lots of new objects.

22:36:57 - OSD node restarted. OSDs 6, 17, 18, 22, 31, 32, 36, 45, 58 boot. Recovery commences, client I/O continues.

22:36:58 - OSD 17 is marked up (this will be relevant later).

22:37:01 - Last of the booted OSDs is marked up. Client I/O continues.

22:37:28 - Slow request warnings appear. osd op complaint time is set to 30, so these are requests that were issued around the time of the node restart. In other words, the cluster gets slow immediately after the node comes back up.

22:49:28 - Last slow request warning seen.

23:00:43 - Recovery complete, all PGs active+clean.

Here is an example of such a slow request, as seen in ceph.log (and ceph -w):

2017-09-09 22:37:33.241347 osd.30 172.22.4.54:6808/8976 3274 : cluster [WRN] slow request 30.447920 seconds old, received at 2017-09-09 22:37:02.793382: osd_op(client.790810466.0:55974959 rbd_data.154cd34979e2a9e3.0000000000001980 [set-alloc-hint object_size 4194304 write_size 4194304,write 438272~24576] 277.bbddc5af snapc 4f37f=[4f37f] ack+ondisk+write+known_if_redirected e1353592) currently waiting for degraded object

Having bumped the OSD's log level to 10 beforehand, these are some of OSD 30's log entries correlating with this event, with a client attempting to access rbd_data.154cd34979e2a9e3.0000000000001980:

2017-09-09 22:36:50.251219 7fe6ed7e9700 10 osd.30 pg_epoch: 1353586 pg[277.5af( v 1353586'11578774 (1353421'11574953,1353586'11578774] local-les=1353563 n=1735 ec=862683 les/c 1353563/1353563 1353562/1353562/1333335) [30,94] r=0 lpr=1353562 luod=1353586'11578773 lua=1353586'11578773 crt=1353586'11578772 lcod 1353586'11578772 mlcod 1353586'11578772 active+undersized+degraded snaptrimq=[4f12c~1,4f136~2,4f139~1,4f13c~1,4f13f~1,4f142~1,4f146~1,4f3f6~1,4f3fa~2,4f3fd~2,4f400~3,4f404~1,4f406~1,4f40d~2,4f410~3,4f414~2,4f419~1,4f41b~1,4f41d~3,4f421~1,4f423~1,4f426~1,4f428~1,4f42a~1]] append_log: trimming to 1353586'11578772 entries 1353586'11578771 (1353585'11578560) modify 277/bbddc5af/rbd_data.154cd34979e2a9e3.0000000000001980/head by client.790810466.0:55974546 2017-09-09 22:36:35.644687,1353586'11578772 (1353586'11578761) modify 277/b6d355af/rbd_data.28ab56362eb141f2.0000000000003483/head by client.790921544.0:36806288 2017-09-09 22:36:37.940141

Note the 10-second gap in lines containing "rbd_data.154cd34979e2a9e3.0000000000001980" between this log line and the next. This cluster does have an unusually long osd snap trim sleep of 500 ms, though, so that might contribute to the delay here. Right in between, at 22:36:58, OSD 17 comes up.
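(In case anyone wants to eyeball gaps like that in their own logs: something along these lines does the trick. This is just a quick illustrative helper, not what produced the numbers above, and the log path and object name are placeholders for this particular test.)

#!/usr/bin/env python
# Print the time gap between consecutive OSD log lines mentioning an object.
# Illustrative helper only; log path and object name are placeholders.
import sys
from datetime import datetime

logfile = sys.argv[1] if len(sys.argv) > 1 else '/var/log/ceph/ceph-osd.30.log'
needle = 'rbd_data.154cd34979e2a9e3.0000000000001980'

prev = None
with open(logfile) as f:
    for line in f:
        if needle not in line:
            continue
        # OSD debug log lines start with "YYYY-MM-DD HH:MM:SS.ffffff"
        stamp = ' '.join(line.split()[:2])
        try:
            ts = datetime.strptime(stamp, '%Y-%m-%d %H:%M:%S.%f')
        except ValueError:
            continue
        if prev is not None:
            print('%8.3f s since previous match: %s' %
                  ((ts - prev).total_seconds(), line.rstrip()[:100]))
        prev = ts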
Next comes what I am most curious about:

2017-09-09 22:37:00.220011 7fe6edfea700 10 osd.30 pg_epoch: 1353589 pg[277.5af( v 1353586'11578776 (1353421'11574953,1353586'11578776] local-les=1353589 n=1735 ec=862683 les/c 1353563/1353563 1353588/1353588/1333335) [30,17,94] r=0 lpr=1353588 pi=1353562-1353587/1 crt=1353586'11578773 lcod 1353586'11578775 mlcod 0'0 inactive snaptrimq=[4f12c~1,4f136~2,4f139~1,4f13c~1,4f13f~1,4f142~1,4f146~1,4f3f6~1,4f3fa~2,4f3fd~2,4f400~3,4f404~1,4f406~1,4f40d~2,4f410~3,4f414~2,4f419~1,4f41b~1,4f41d~3,4f421~1,4f423~1,4f426~1,4f428~1,4f42a~1]] search_for_missing 277/bbddc5af/rbd_data.154cd34979e2a9e3.0000000000001980/head 1353586'11578774 is on osd.30

2017-09-09 22:37:00.220307 7fe6edfea700 10 osd.30 pg_epoch: 1353589 pg[277.5af( v 1353586'11578776 (1353421'11574953,1353586'11578776] local-les=1353589 n=1735 ec=862683 les/c 1353563/1353563 1353588/1353588/1333335) [30,17,94] r=0 lpr=1353588 pi=1353562-1353587/1 crt=1353586'11578773 lcod 1353586'11578775 mlcod 0'0 inactive snaptrimq=[4f12c~1,4f136~2,4f139~1,4f13c~1,4f13f~1,4f142~1,4f146~1,4f3f6~1,4f3fa~2,4f3fd~2,4f400~3,4f404~1,4f406~1,4f40d~2,4f410~3,4f414~2,4f419~1,4f41b~1,4f41d~3,4f421~1,4f423~1,4f426~1,4f428~1,4f42a~1]] search_for_missing 277/bbddc5af/rbd_data.154cd34979e2a9e3.0000000000001980/head 1353586'11578774 also missing on osd.17

2017-09-09 22:37:00.220599 7fe6edfea700 10 osd.30 pg_epoch: 1353589 pg[277.5af( v 1353586'11578776 (1353421'11574953,1353586'11578776] local-les=1353589 n=1735 ec=862683 les/c 1353563/1353563 1353588/1353588/1333335) [30,17,94] r=0 lpr=1353588 pi=1353562-1353587/1 crt=1353586'11578773 lcod 1353586'11578775 mlcod 0'0 inactive snaptrimq=[4f12c~1,4f136~2,4f139~1,4f13c~1,4f13f~1,4f142~1,4f146~1,4f3f6~1,4f3fa~2,4f3fd~2,4f400~3,4f404~1,4f406~1,4f40d~2,4f410~3,4f414~2,4f419~1,4f41b~1,4f41d~3,4f421~1,4f423~1,4f426~1,4f428~1,4f42a~1]] search_for_missing 277/bbddc5af/rbd_data.154cd34979e2a9e3.0000000000001980/head 1353586'11578774 is on osd.94

2017-09-09 22:37:02.793774 7fe6ed7e9700 7 osd.30 pg_epoch: 1353592 pg[277.5af( v 1353586'11578776 (1353421'11574953,1353586'11578776] local-les=1353589 n=1735 ec=862683 les/c 1353589/1353563 1353588/1353588/1333335) [30,17,94] r=0 lpr=1353588 pi=1353562-1353587/1 crt=1353586'11578773 lcod 1353586'11578775 mlcod 0'0 active+recovery_wait+degraded snaptrimq=[4f12c~1,4f136~2,4f139~1,4f13c~1,4f13f~1,4f142~1,4f146~1,4f3f6~1,4f3fa~2,4f3fd~2,4f400~3,4f404~1,4f406~1,4f40d~2,4f410~3,4f414~2,4f419~1,4f41b~1,4f41d~3,4f421~1,4f423~1,4f426~1,4f428~1,4f42a~1]] degraded 277/bbddc5af/rbd_data.154cd34979e2a9e3.0000000000001980/head, recovering

So we have a client that is talking to OSD 30. OSD 30 was never down; OSD 17 was. OSD 30 is also the preferred primary for this PG (via primary affinity). The OSD now says that

- it does itself have a copy of the object,
- so does OSD 94,
- but the object is "also" missing on OSD 17.

So I'd like to ask, firstly: what does "also" mean here?

Secondly, if the local copy is current, and we have no fewer than min_size copies of the object, and recovery is meant to be a background operation, why is recovery in the I/O path here at all? Specifically, why is that the case on a write, where the object is being modified anyway, and the modification then needs to be replicated out to OSDs 17 and 94?

Thirdly, (a) is this how things are meant to be in general, or (b) was this broken in hammer and fixed with the client/recovery op rewrite in jewel, or (c) was it fixed at some other point, or (d) is it still broken?
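To make the second question concrete, here is a toy model of what the log lines above look like to me. This is deliberately not the actual Ceph code, and all names in it are made up; it just mimics the observed behavior: the primary gates a client write on the object being present on every OSD in the acting set, and otherwise parks the op behind recovery of that one object.

#!/usr/bin/env python
# Toy model only -- not the Ceph implementation; all names are invented.
# OSD 30 has the object, OSD 94 has it, freshly rejoined OSD 17 is missing
# it, and the client write is queued behind recovery of that object.
from collections import defaultdict

class ToyPrimary(object):
    def __init__(self, acting, peer_missing):
        self.acting = acting                  # e.g. [30, 17, 94]
        self.peer_missing = peer_missing      # osd id -> set of missing objects
        self.waiting_for_degraded = defaultdict(list)

    def handle_write(self, oid, op):
        missing_on = [osd for osd in self.acting
                      if oid in self.peer_missing.get(osd, set())]
        if missing_on:
            # corresponds to "currently waiting for degraded object"
            print("%s: missing on %s, queueing op behind recovery" % (oid, missing_on))
            self.waiting_for_degraded[oid].append(op)
        else:
            print("%s: applying locally, replicating to %s" % (oid, self.acting[1:]))

pg = ToyPrimary(acting=[30, 17, 94],
                peer_missing={17: set(['rbd_data.154cd34979e2a9e3.0000000000001980'])})
pg.handle_write('rbd_data.154cd34979e2a9e3.0000000000001980', 'write 438272~24576')

If that is roughly what happens, then every write touching an object that the rejoined OSD is still missing becomes a recovery-first operation, which would line up with the burst of slow requests starting seconds after the node came back up.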
Looking forward to any insight that people might be able to share. Thanks in advance!

Cheers,
Florian