On Wed, Sep 13, 2017 at 8:40 PM, Florian Haas <florian@xxxxxxxxxxx> wrote:
> Hi everyone,
>
>
> disclaimer upfront: this was seen in the wild on Hammer, and on 0.94.7
> no less. Reproducing this on 0.94.10 is a pending process, and we'll
> update here with findings, but my goal with this post is really to

Just making sure it's understood Hammer is a retired release.

> establish whether the behavior as seen is expected, and if so, what
> the rationale for it is. This is all about slow requests popping up on
> a rather large scale after a previously down OSD node is brought back
> into the cluster.
>
>
> So here's the sequence of events for the issue I'm describing, as seen
> in a test:
>
> 22:08:53 - OSD node stopped. OSDs 6, 17, 18, 22, 31, 32, 36, 45, 58
> mark themselves down. Cluster has noout set, so all OSDs remain in.
> fio tests are running against RBD in a loop, thus there is heavy
> client I/O activity generating lots of new objects.
>
> 22:36:57 - OSD node restarted. OSDs 6, 17, 18, 22, 31, 32, 36, 45, 58
> boot. Recovery commences, client I/O continues.
>
> 22:36:58 - OSD 17 is marked up (this will be relevant later).
>
> 22:37:01 - Last of the booted OSDs is marked up. Client I/O continues.
>
> 22:37:28 - Slow request warnings appear. osd op complaint time is set
> to 30, so these are requests that were issued around the time of the
> node restart. In other words, the cluster gets slow immediately after
> the node comes back up.
>
> 22:49:28 - Last slow request warning seen.
>
> 23:00:43 - Recovery complete, all PGs active+clean.
>
>
> Here is an example of such a slow request, as seen in ceph.log (and ceph -w):
> 2017-09-09 22:37:33.241347 osd.30 172.22.4.54:6808/8976 3274 : cluster
> [WRN] slow request 30.447920 seconds old, received at 2017-09-09
> 22:37:02.793382: osd_op(client.790810466.0:55974959
> rbd_data.154cd34979e2a9e3.0000000000001980 [set-alloc-hint object_size
> 4194304 write_size 4194304,write 438272~24576] 277.bbddc5af snapc
> 4f37f=[4f37f] ack+ondisk+write+known_if_redirected e1353592) currently
> waiting for degraded object
>
>
> Having bumped the OSD's log level to 10 beforehand, these are some of
> OSD 30's log entries correlating with this event, with a client
> attempting to access rbd_data.154cd34979e2a9e3.0000000000001980:
>
> 2017-09-09 22:36:50.251219 7fe6ed7e9700 10 osd.30 pg_epoch: 1353586
> pg[277.5af( v 1353586'11578774 (1353421'11574953,1353586'11578774]
> local-les=1353563 n=1735 ec=862683 les/c 1353563/1353563
> 1353562/1353562/1333335) [30,94] r=0 lpr=1353562 luod=1353586'11578773
> lua=1353586'11578773 crt=1353586'11578772 lcod 1353586'11578772 mlcod
> 1353586'11578772 active+undersized+degraded
> snaptrimq=[4f12c~1,4f136~2,4f139~1,4f13c~1,4f13f~1,4f142~1,4f146~1,4f3f6~1,4f3fa~2,4f3fd~2,4f400~3,4f404~1,4f406~1,4f40d~2,4f410~3,4f414~2,4f419~1,4f41b~1,4f41d~3,4f421~1,4f423~1,4f426~1,4f428~1,4f42a~1]]
> append_log: trimming to 1353586'11578772 entries 1353586'11578771
> (1353585'11578560) modify
> 277/bbddc5af/rbd_data.154cd34979e2a9e3.0000000000001980/head by
> client.790810466.0:55974546 2017-09-09
> 22:36:35.644687,1353586'11578772 (1353586'11578761) modify
> 277/b6d355af/rbd_data.28ab56362eb141f2.0000000000003483/head by
> client.790921544.0:36806288 2017-09-09 22:36:37.940141
>
> Note the 10-second gap in lines containing
> "rbd_data.154cd34979e2a9e3.0000000000001980" between this log line and
> the next. But this cluster has an unusually long osd snap trim sleep
> of 500ms, so that might contribute to the delay here.
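
For anyone who wants to retest this on a supported release, the setup described
above boils down to roughly the following. This is only a sketch: the pool name,
image name, rbd create size and fio job sizing are placeholders of mine, not
values from Florian's test, and the injectargs lines merely restate the two
tunables he mentions.

# keep the stopped node's OSDs "in" for the duration of the outage
ceph osd set noout

# the two tunables referenced above (30s is the default slow-request
# threshold; the 500ms snap trim sleep is this cluster's unusual setting)
ceph tell osd.* injectargs '--osd_op_complaint_time 30'
ceph tell osd.* injectargs '--osd_snap_trim_sleep 0.5'

# a test image and continuous client writes via fio's rbd engine
# (pool "rbd" and image "testimg" are placeholder names)
rbd create testimg --pool rbd --size 10240
cat > rbd-write.fio <<'EOF'
[rbd-write]
ioengine=rbd
clientname=admin
pool=rbd
rbdname=testimg
rw=randwrite
bs=64k
iodepth=16
time_based=1
runtime=3600
EOF
fio rbd-write.fio

# then stop the OSD node, wait ~30 minutes, restart it, and watch
# "ceph -w" for slow request warnings while recovery proceeds
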
> Right in between, at 22:36:58, OSD 17 comes up.
>
>
> Next comes what I am most curious about:
>
> 2017-09-09 22:37:00.220011 7fe6edfea700 10 osd.30 pg_epoch: 1353589
> pg[277.5af( v 1353586'11578776 (1353421'11574953,1353586'11578776]
> local-les=1353589 n=1735 ec=862683 les/c 1353563/1353563
> 1353588/1353588/1333335) [30,17,94] r=0 lpr=1353588
> pi=1353562-1353587/1 crt=1353586'11578773 lcod 1353586'11578775 mlcod
> 0'0 inactive snaptrimq=[4f12c~1,4f136~2,4f139~1,4f13c~1,4f13f~1,4f142~1,4f146~1,4f3f6~1,4f3fa~2,4f3fd~2,4f400~3,4f404~1,4f406~1,4f40d~2,4f410~3,4f414~2,4f419~1,4f41b~1,4f41d~3,4f421~1,4f423~1,4f426~1,4f428~1,4f42a~1]]
> search_for_missing
> 277/bbddc5af/rbd_data.154cd34979e2a9e3.0000000000001980/head
> 1353586'11578774 is on osd.30
> 2017-09-09 22:37:00.220307 7fe6edfea700 10 osd.30 pg_epoch: 1353589
> pg[277.5af( v 1353586'11578776 (1353421'11574953,1353586'11578776]
> local-les=1353589 n=1735 ec=862683 les/c 1353563/1353563
> 1353588/1353588/1333335) [30,17,94] r=0 lpr=1353588
> pi=1353562-1353587/1 crt=1353586'11578773 lcod 1353586'11578775 mlcod
> 0'0 inactive snaptrimq=[4f12c~1,4f136~2,4f139~1,4f13c~1,4f13f~1,4f142~1,4f146~1,4f3f6~1,4f3fa~2,4f3fd~2,4f400~3,4f404~1,4f406~1,4f40d~2,4f410~3,4f414~2,4f419~1,4f41b~1,4f41d~3,4f421~1,4f423~1,4f426~1,4f428~1,4f42a~1]]
> search_for_missing
> 277/bbddc5af/rbd_data.154cd34979e2a9e3.0000000000001980/head
> 1353586'11578774 also missing on osd.17
> 2017-09-09 22:37:00.220599 7fe6edfea700 10 osd.30 pg_epoch: 1353589
> pg[277.5af( v 1353586'11578776 (1353421'11574953,1353586'11578776]
> local-les=1353589 n=1735 ec=862683 les/c 1353563/1353563
> 1353588/1353588/1333335) [30,17,94] r=0 lpr=1353588
> pi=1353562-1353587/1 crt=1353586'11578773 lcod 1353586'11578775 mlcod
> 0'0 inactive snaptrimq=[4f12c~1,4f136~2,4f139~1,4f13c~1,4f13f~1,4f142~1,4f146~1,4f3f6~1,4f3fa~2,4f3fd~2,4f400~3,4f404~1,4f406~1,4f40d~2,4f410~3,4f414~2,4f419~1,4f41b~1,4f41d~3,4f421~1,4f423~1,4f426~1,4f428~1,4f42a~1]]
> search_for_missing
> 277/bbddc5af/rbd_data.154cd34979e2a9e3.0000000000001980/head
> 1353586'11578774 is on osd.94
> 2017-09-09 22:37:02.793774 7fe6ed7e9700 7 osd.30 pg_epoch: 1353592
> pg[277.5af( v 1353586'11578776 (1353421'11574953,1353586'11578776]
> local-les=1353589 n=1735 ec=862683 les/c 1353589/1353563
> 1353588/1353588/1333335) [30,17,94] r=0 lpr=1353588
> pi=1353562-1353587/1 crt=1353586'11578773 lcod 1353586'11578775 mlcod
> 0'0 active+recovery_wait+degraded
> snaptrimq=[4f12c~1,4f136~2,4f139~1,4f13c~1,4f13f~1,4f142~1,4f146~1,4f3f6~1,4f3fa~2,4f3fd~2,4f400~3,4f404~1,4f406~1,4f40d~2,4f410~3,4f414~2,4f419~1,4f41b~1,4f41d~3,4f421~1,4f423~1,4f426~1,4f428~1,4f42a~1]]
> degraded 277/bbddc5af/rbd_data.154cd34979e2a9e3.0000000000001980/head,
> recovering
>
>
> So we have a client that is talking to OSD 30. OSD 30 was never down;
> OSD 17 was. OSD 30 is also the preferred primary for this PG (via
> primary affinity). The OSD now says that
>
> - it does itself have a copy of the object,
> - so does OSD 94,
> - but that the object is "also" missing on OSD 17.
>
> So I'd like to ask firstly: what does "also" mean here?
>
> Secondly, if the local copy is current, and we have no fewer than
> min_size objects, and recovery is meant to be a background operation,
> then why is the recovery in the I/O path here? Specifically, why is
> that the case on a write, where the object is being modified anyway,
> and the modification then needs to be replicated out to OSDs 17 and
> 94?
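
Not an answer to the "also" question yet, but while a write is stuck
"waiting for degraded object" like this, the primary's view of the missing
sets can be inspected at runtime, which may help when you retest on 0.94.10
or later. Something along these lines (the PG id and OSD id are taken from
your log above; commands are the standard ones, adjust as needed):

# the primary's peering/recovery view of this PG, including per-peer
# missing counts in the recovery_state section
ceph pg 277.5af query

# per-object missing/unfound detail (most useful when objects are unfound)
ceph pg 277.5af list_missing

# the blocked op itself, with the point it is currently waiting at,
# via the primary OSD's admin socket
ceph daemon osd.30 dump_ops_in_flight
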
>
> Thirdly, (a) is this how things are meant to be, in general, or (b)
> was this broken in hammer and fixed with the client/recovery op
> rewrite in jewel, or (c) was it fixed at any other time, or (d) is it
> still broken?
>
>
> Looking forward to any insight that people might be able to share.
> Thanks in advance!
>
> Cheers,
> Florian

--
Cheers,
Brad
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com