On Sat, Mar 14, 2015 at 1:56 AM, Chris Murray <chrismurray84@xxxxxxxxx> wrote:
> Good evening all,
>
> Just had another quick look at this with some further logging on and thought I'd post the results in case anyone can keep me moving in the right direction.
>
> Long story short, some OSDs just don't appear to come up after failing one after another. Dealing with one in isolation, after a load of IO, it never starts. The last part of the log looks like this:
>
> ...
> 2015-03-13 18:43:11.875392 7f29d1e98780 10 filestore > header.spos 0.0.0
> 2015-03-13 18:43:11.876568 7f29d1e98780 15 filestore(/var/lib/ceph/osd/ceph-1) _omap_rmkeyrange meta/39e3fb/pglog_4.57c/0//-1 [0000000000.00000000000000000000,4294967295.18446744073709551615]
> 2015-03-13 18:43:11.876598 7f29d1e98780 15 filestore(/var/lib/ceph/osd/ceph-1) get_omap_iterator meta/39e3fb/pglog_4.57c/0//-1
> 2015-03-13 18:43:11.952511 7f29d1e98780 15 filestore(/var/lib/ceph/osd/ceph-1) _omap_rmkeys meta/39e3fb/pglog_4.57c/0//-1
> 2015-03-13 18:43:11.952878 7f29d1e98780 10 filestore oid: 39e3fb/pglog_4.57c/0//-1 not skipping op, *spos 13288339.0.3
> 2015-03-13 18:43:11.952892 7f29d1e98780 10 filestore > header.spos 0.0.0
> 2015-03-13 18:43:11.961127 7f29d1e98780 15 filestore(/var/lib/ceph/osd/ceph-1) _omap_rmkeys meta/39e3fb/pglog_4.57c/0//-1
> 2015-03-13 18:43:11.961516 7f29d1e98780 10 filestore oid: 39e3fb/pglog_4.57c/0//-1 not skipping op, *spos 13288339.0.4
> 2015-03-13 18:43:11.961529 7f29d1e98780 10 filestore > header.spos 0.0.0
> 2015-03-13 18:43:11.965687 7f29d1e98780 15 filestore(/var/lib/ceph/osd/ceph-1) _omap_setkeys meta/39e3fb/pglog_4.57c/0//-1
> 2015-03-13 18:43:11.966082 7f29d1e98780 10 filestore oid: 39e3fb/pglog_4.57c/0//-1 not skipping op, *spos 13288339.0.5
> 2015-03-13 18:43:11.966095 7f29d1e98780 10 filestore > header.spos 0.0.0
> 2015-03-13 18:43:11.989820 7f29d1e98780 10 journal op_apply_finish 13288339 open_ops 1 -> 0, max_applied_seq 13288338 -> 13288339
> 2015-03-13 18:43:11.989861 7f29d1e98780 3 journal journal_replay: r = 0, op_seq now 13288339
> 2015-03-13 18:43:11.989896 7f29d1e98780 2 journal read_entry 3951706112 : seq 13288340 1755 bytes
> 2015-03-13 18:43:11.989900 7f29d1e98780 3 journal journal_replay: applying op seq 13288340
> 2015-03-13 18:43:11.989903 7f29d1e98780 10 journal op_apply_start 13288340 open_ops 0 -> 1
> 2015-03-13 18:43:11.989906 7f29d1e98780 10 filestore(/var/lib/ceph/osd/ceph-1) _do_transaction on 0x2750480
> 2015-03-13 18:43:11.989919 7f29d1e98780 15 filestore(/var/lib/ceph/osd/ceph-1) _omap_setkeys meta/16ef7597/infos/head//-1
> 2015-03-13 18:43:11.990251 7f29d1e98780 10 filestore oid: 16ef7597/infos/head//-1 not skipping op, *spos 13288340.0.1
> 2015-03-13 18:43:11.990263 7f29d1e98780 10 filestore > header.spos 0.0.0
> 2015-03-13 18:43:15.404558 7f29c4439700 20 filestore(/var/lib/ceph/osd/ceph-1) sync_entry woke after 5.000217
> 2015-03-13 18:43:15.404600 7f29c4439700 10 journal commit_start max_applied_seq 13288339, open_ops 1
> 2015-03-13 18:43:15.404603 7f29c4439700 10 journal commit_start waiting for 1 open ops to drain
>
> What might this 'open op' mean when it never seems to finish 'draining'? Could my suspicions be true that it's somehow a BTRFS funny?

Well, this line:
> 2015-03-13 18:43:11.989820 7f29d1e98780 10 journal op_apply_finish 13288339 open_ops 1 -> 0, max_applied_seq 13288338 -> 13288339
is the system recognizing that an op has finished. Is that last one not followed by a similar line?
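For anyone following along, the "open_ops" in those lines is essentially a count of journal ops currently being applied, which the sync thread waits on before it can commit. A very rough sketch of that pattern is below; it is purely illustrative (class and method names are made up for this example, it is not Ceph's actual FileStore/journal code):

#include <condition_variable>
#include <cstdint>
#include <mutex>

// Illustrative sketch only: ops being applied are counted, and the
// sync thread's commit_start() blocks until that count drains to zero.
class ApplyTracker {
  std::mutex lock;
  std::condition_variable cond;
  uint32_t open_ops = 0;
  uint64_t max_applied_seq = 0;

public:
  // corresponds to "journal op_apply_start <seq> open_ops N -> N+1"
  void op_apply_start() {
    std::lock_guard<std::mutex> l(lock);
    ++open_ops;
  }

  // corresponds to "journal op_apply_finish <seq> open_ops N -> N-1, ..."
  void op_apply_finish(uint64_t seq) {
    std::lock_guard<std::mutex> l(lock);
    --open_ops;
    if (seq > max_applied_seq)
      max_applied_seq = seq;
    cond.notify_all();
  }

  // corresponds to "journal commit_start ... waiting for N open ops to drain"
  // If an op never reaches op_apply_finish (e.g. a write stuck in the
  // filesystem underneath), this wait never returns and the commit never starts.
  void commit_start() {
    std::unique_lock<std::mutex> l(lock);
    cond.wait(l, [this] { return open_ops == 0; });
  }
};

In your log, op seq 13288340 gets its op_apply_start but no matching op_apply_finish appears, so the counter stays at 1 and commit_start sits there "waiting for 1 open ops to drain".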
In any case, yes, btrfs misbehaving stupendously badly could cause things to hang, if, e.g., it's just not finishing a write that the OSD is trying to put in. Although I'd naively expect some dmesg output if that were the case.
-Greg