On Wed, Sep 13, 2017 at 8:40 PM, Florian Haas <florian@xxxxxxxxxxx> wrote:
> Hi everyone,
>
>
> disclaimer upfront: this was seen in the wild on Hammer, and on 0.94.7
> no less. Reproducing this on 0.94.10 is a pending process, and we'll
> update here with findings, but my goal with this post is really to

Just making sure it's understood Hammer is a retired release.

> establish whether the behavior as seen is expected, and if so, what
> the rationale for it is. This is all about slow requests popping up on
> a rather large scale after a previously down OSD node is brought back
> into the cluster.
>
>
> So here's the sequence of events for the issue I'm describing, as seen
> in a test:
>
> 22:08:53 - OSD node stopped. OSDs 6, 17, 18, 22, 31, 32, 36, 45, 58
> mark themselves down. Cluster has noout set, so all OSDs remain in.
> fio tests are running against RBD in a loop, thus there is heavy
> client I/O activity generating lots of new objects.
>
> 22:36:57 - OSD node restarted. OSDs 6, 17, 18, 22, 31, 32, 36, 45, 58
> boot. Recovery commences, client I/O continues.
>
> 22:36:58 - OSD 17 is marked up (this will be relevant later).
>
> 22:37:01 - Last of the booted OSDs is marked up. Client I/O continues.
>
> 22:37:28 - Slow request warnings appear. osd op complaint time is set
> to 30, so these are requests that were issued around the time of the
> node restart. In other words, the cluster gets slow immediately after
> the node comes back up.
>
> 22:49:28 - Last slow request warning seen.
>
> 23:00:43 - Recovery complete, all PGs active+clean.
>
>
> Here is an example of such a slow request, as seen in ceph.log (and ceph -w):
> 2017-09-09 22:37:33.241347 osd.30 172.22.4.54:6808/8976 3274 : cluster
> [WRN] slow request 30.447920 seconds old, received at 2017-09-09
> 22:37:02.793382: osd_op(client.790810466.0:55974959
> rbd_data.154cd34979e2a9e3.0000000000001980 [set-alloc-hint object_size
> 4194304 write_size 4194304,write 438272~24576] 277.bbddc5af snapc
> 4f37f=[4f37f] ack+ondisk+write+known_if_redirected e1353592) currently
> waiting for degraded object
>
>
> Having bumped the OSD's log level to 10 beforehand, these are some of
> OSD 30's log entries correlating with this event, with a client
> attempting to access rbd_data.154cd34979e2a9e3.0000000000001980:
>
> 2017-09-09 22:36:50.251219 7fe6ed7e9700 10 osd.30 pg_epoch: 1353586
> pg[277.5af( v 1353586'11578774 (1353421'11574953,1353586'11578774]
> local-les=1353563 n=1735 ec=862683 les/c 1353563/1353563
> 1353562/1353562/1333335) [30,94] r=0 lpr=1353562 luod=1353586'11578773
> lua=1353586'11578773 crt=1353586'11578772 lcod 1353586'11578772 mlcod
> 1353586'11578772 active+undersized+degraded
> snaptrimq=[4f12c~1,4f136~2,4f139~1,4f13c~1,4f13f~1,4f142~1,4f146~1,4f3f6~1,4f3fa~2,4f3fd~2,4f400~3,4f404~1,4f406~1,4f40d~2,4f410~3,4f414~2,4f419~1,4f41b~1,4f41d~3,4f421~1,4f423~1,4f426~1,4f428~1,4f42a~1]]
> append_log: trimming to 1353586'11578772 entries 1353586'11578771
> (1353585'11578560) modify
> 277/bbddc5af/rbd_data.154cd34979e2a9e3.0000000000001980/head by
> client.790810466.0:55974546 2017-09-09
> 22:36:35.644687,1353586'11578772 (1353586'11578761) modify
> 277/b6d355af/rbd_data.28ab56362eb141f2.0000000000003483/head by
> client.790921544.0:36806288 2017-09-09 22:36:37.940141
>
> Note the 10-second gap in lines containing
> "rbd_data.154cd34979e2a9e3.0000000000001980" between this log line and
> the next. But this cluster has an unusually long osd snap trim sleep
> of 500ms, so that might contribute to the delay here.
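
For anyone who wants to retest this on a supported release, the setup described
above boils down to roughly the following. This is only a sketch: the pool name,
image name, rbd create size and fio job sizing are placeholders of mine, not
values from Florian's test, and the injectargs lines merely restate the two
tunables he mentions.

# keep the stopped node's OSDs "in" for the duration of the outage
ceph osd set noout

# the two tunables referenced above (30s is the default slow-request
# threshold; the 500ms snap trim sleep is this cluster's unusual setting)
ceph tell osd.* injectargs '--osd_op_complaint_time 30'
ceph tell osd.* injectargs '--osd_snap_trim_sleep 0.5'

# a test image and continuous client writes via fio's rbd engine
# (pool "rbd" and image "testimg" are placeholder names)
rbd create testimg --pool rbd --size 10240
cat > rbd-write.fio <<'EOF'
[rbd-write]
ioengine=rbd
clientname=admin
pool=rbd
rbdname=testimg
rw=randwrite
bs=64k
iodepth=16
time_based=1
runtime=3600
EOF
fio rbd-write.fio

# then stop the OSD node, wait ~30 minutes, restart it, and watch
# "ceph -w" for slow request warnings while recovery proceeds
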
> Right in between, at 22:36:58, OSD 17 comes up.
>
>
> Next comes what I am most curious about:
>
> 2017-09-09 22:37:00.220011 7fe6edfea700 10 osd.30 pg_epoch: 1353589
> pg[277.5af( v 1353586'11578776 (1353421'11574953,1353586'11578776]
> local-les=1353589 n=1735 ec=862683 les/c 1353563/1353563
> 1353588/1353588/1333335) [30,17,94] r=0 lpr=1353588
> pi=1353562-1353587/1 crt=1353586'11578773 lcod 1353586'11578775 mlcod
> 0'0 inactive snaptrimq=[4f12c~1,4f136~2,4f139~1,4f13c~1,4f13f~1,4f142~1,4f146~1,4f3f6~1,4f3fa~2,4f3fd~2,4f400~3,4f404~1,4f406~1,4f40d~2,4f410~3,4f414~2,4f419~1,4f41b~1,4f41d~3,4f421~1,4f423~1,4f426~1,4f428~1,4f42a~1]]
> search_for_missing
> 277/bbddc5af/rbd_data.154cd34979e2a9e3.0000000000001980/head
> 1353586'11578774 is on osd.30
> 2017-09-09 22:37:00.220307 7fe6edfea700 10 osd.30 pg_epoch: 1353589
> pg[277.5af( v 1353586'11578776 (1353421'11574953,1353586'11578776]
> local-les=1353589 n=1735 ec=862683 les/c 1353563/1353563
> 1353588/1353588/1333335) [30,17,94] r=0 lpr=1353588
> pi=1353562-1353587/1 crt=1353586'11578773 lcod 1353586'11578775 mlcod
> 0'0 inactive snaptrimq=[4f12c~1,4f136~2,4f139~1,4f13c~1,4f13f~1,4f142~1,4f146~1,4f3f6~1,4f3fa~2,4f3fd~2,4f400~3,4f404~1,4f406~1,4f40d~2,4f410~3,4f414~2,4f419~1,4f41b~1,4f41d~3,4f421~1,4f423~1,4f426~1,4f428~1,4f42a~1]]
> search_for_missing
> 277/bbddc5af/rbd_data.154cd34979e2a9e3.0000000000001980/head
> 1353586'11578774 also missing on osd.17
> 2017-09-09 22:37:00.220599 7fe6edfea700 10 osd.30 pg_epoch: 1353589
> pg[277.5af( v 1353586'11578776 (1353421'11574953,1353586'11578776]
> local-les=1353589 n=1735 ec=862683 les/c 1353563/1353563
> 1353588/1353588/1333335) [30,17,94] r=0 lpr=1353588
> pi=1353562-1353587/1 crt=1353586'11578773 lcod 1353586'11578775 mlcod
> 0'0 inactive snaptrimq=[4f12c~1,4f136~2,4f139~1,4f13c~1,4f13f~1,4f142~1,4f146~1,4f3f6~1,4f3fa~2,4f3fd~2,4f400~3,4f404~1,4f406~1,4f40d~2,4f410~3,4f414~2,4f419~1,4f41b~1,4f41d~3,4f421~1,4f423~1,4f426~1,4f428~1,4f42a~1]]
> search_for_missing
> 277/bbddc5af/rbd_data.154cd34979e2a9e3.0000000000001980/head
> 1353586'11578774 is on osd.94
> 2017-09-09 22:37:02.793774 7fe6ed7e9700 7 osd.30 pg_epoch: 1353592
> pg[277.5af( v 1353586'11578776 (1353421'11574953,1353586'11578776]
> local-les=1353589 n=1735 ec=862683 les/c 1353589/1353563
> 1353588/1353588/1333335) [30,17,94] r=0 lpr=1353588
> pi=1353562-1353587/1 crt=1353586'11578773 lcod 1353586'11578775 mlcod
> 0'0 active+recovery_wait+degraded
> snaptrimq=[4f12c~1,4f136~2,4f139~1,4f13c~1,4f13f~1,4f142~1,4f146~1,4f3f6~1,4f3fa~2,4f3fd~2,4f400~3,4f404~1,4f406~1,4f40d~2,4f410~3,4f414~2,4f419~1,4f41b~1,4f41d~3,4f421~1,4f423~1,4f426~1,4f428~1,4f42a~1]]
> degraded 277/bbddc5af/rbd_data.154cd34979e2a9e3.0000000000001980/head,
> recovering
>
>
> So we have a client that is talking to OSD 30. OSD 30 was never down;
> OSD 17 was. OSD 30 is also the preferred primary for this PG (via
> primary affinity). The OSD now says that
>
> - it does itself have a copy of the object,
> - so does OSD 94,
> - but that the object is "also" missing on OSD 17.
>
> So I'd like to ask firstly: what does "also" mean here?
>
> Secondly, if the local copy is current, and we have no fewer than
> min_size objects, and recovery is meant to be a background operation,
> then why is the recovery in the I/O path here? Specifically, why is
> that the case on a write, where the object is being modified anyway,
> and the modification then needs to be replicated out to OSDs 17 and
> 94?
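
Not an answer to the "also" question yet, but while a write is stuck
"waiting for degraded object" like this, the primary's view of the missing
sets can be inspected at runtime, which may help when you retest on 0.94.10
or later. Something along these lines (the PG id and OSD id are taken from
your log above; commands are the standard ones, adjust as needed):

# the primary's peering/recovery view of this PG, including per-peer
# missing counts in the recovery_state section
ceph pg 277.5af query

# per-object missing/unfound detail (most useful when objects are unfound)
ceph pg 277.5af list_missing

# the blocked op itself, with the point it is currently waiting at,
# via the primary OSD's admin socket
ceph daemon osd.30 dump_ops_in_flight
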
>
> Thirdly, (a) is this how things are meant to be, in general, or (b)
> was this broken in hammer and fixed with the client/recovery op
> rewrite in jewel, or (c) was it fixed at any other time, or (d) is it
> still broken?
>
>
> Looking forward to any insight that people might be able to share.
> Thanks in advance!
>
> Cheers,
> Florian

--
Cheers,
Brad
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com