Re: More than 50% osds down, CPUs still busy; will the cluster recover without help?

"Chris Murray" <chrismurray84@xxxxxxxxx> · Wed, 18 Mar 2015 10:28:51 -0000

Hi again Greg :-)

No, it doesn't seem to progress past that point. I started the OSD again a couple of nights ago:

2015-03-16 21:34:46.221307 7fe4a8aa7780 10 journal op_apply_finish 13288339 open_ops 1 -> 0, max_applied_seq 13288338 -> 13288339
2015-03-16 21:34:46.221445 7fe4a8aa7780  3 journal journal_replay: r = 0, op_seq now 13288339
2015-03-16 21:34:46.221513 7fe4a8aa7780  2 journal read_entry 3951706112 : seq 13288340 1755 bytes
2015-03-16 21:34:46.221547 7fe4a8aa7780  3 journal journal_replay: applying op seq 13288340
2015-03-16 21:34:46.221579 7fe4a8aa7780 10 journal op_apply_start 13288340 open_ops 0 -> 1
2015-03-16 21:34:46.221610 7fe4a8aa7780 10 filestore(/var/lib/ceph/osd/ceph-1) _do_transaction on 0x3142480
2015-03-16 21:34:46.221651 7fe4a8aa7780 15 filestore(/var/lib/ceph/osd/ceph-1) _omap_setkeys meta/16ef7597/infos/head//-1
2015-03-16 21:34:46.222017 7fe4a8aa7780 10 filestore oid: 16ef7597/infos/head//-1 not skipping op, *spos 13288340.0.1
2015-03-16 21:34:46.222053 7fe4a8aa7780 10 filestore  > header.spos 0.0.0
2015-03-16 21:34:48.096002 7fe49a5ac700 20 filestore(/var/lib/ceph/osd/ceph-1) sync_entry woke after 5.000178
2015-03-16 21:34:48.096037 7fe49a5ac700 10 journal commit_start max_applied_seq 13288339, open_ops 1
2015-03-16 21:34:48.096040 7fe49a5ac700 10 journal commit_start waiting for 1 open ops to drain

There's the success line for 13288339, like you mentioned. But not one for 13288340. 

Intriguing. So, those same 1755 bytes seem problematic every time the journal is replayed? Interestingly, there is a lot (in time, not exactly data mass or IOPs, but still more than 1755 bytes!) of activity while the log is at this line:

2015-03-16 21:34:48.096040 7fe49a5ac700 10 journal commit_start waiting for 1 open ops to drain

... but then the IO ceases and the log still doesn't go any further. I wonder why 13288339 doesn't have that same  'waiting for ... open ops to drain' line. Or the 'woke after' one for that matter.

While there is activity on sdb, it 'pulses' every 10 seconds or so, like this:

sdb              20.00         0.00      3404.00          0       3404
sdb              16.00         0.00      2100.00          0       2100
sdb              10.00         0.00      1148.00          0       1148
sdb               0.00         0.00         0.00          0          0
sdb               0.00         0.00         0.00          0          0
sdb               0.00         0.00         0.00          0          0
sdb               0.00         0.00         0.00          0          0
sdb               0.00         0.00         0.00          0          0
sdb               0.00         0.00         0.00          0          0
sdb               0.00         0.00         0.00          0          0
sdb               1.00         0.00       496.00          0        496
sdb              32.00         0.00      4940.00          0       4940
sdb               8.00         0.00      1144.00          0       1144
sdb               1.00         0.00         4.00          0          4
sdb               0.00         0.00         0.00          0          0
sdb               0.00         0.00         0.00          0          0
sdb               0.00         0.00         0.00          0          0
sdb               0.00         0.00         0.00          0          0
sdb               0.00         0.00         0.00          0          0
sdb               0.00         0.00         0.00          0          0
sdb               0.00         0.00         0.00          0          0
sdb              17.00         0.00      3340.00          0       3340
sdb              23.00         0.00      3368.00          0       3368
sdb               1.00         0.00         4.00          0          4
sdb               0.00         0.00         0.00          0          0
sdb               0.00         0.00         0.00          0          0
sdb               0.00         0.00         0.00          0          0
sdb               0.00         0.00         0.00          0          0
sdb               0.00         0.00         0.00          0          0
sdb               0.00         0.00         0.00          0          0
sdb               0.00         0.00         0.00          0          0
sdb              13.00         0.00      3332.00          0       3332
sdb              18.00         0.00      2360.00          0       2360
sdb              59.00         0.00      7464.00          0       7464
sdb               0.00         0.00         0.00          0          0

I was hoping Google may have held some clues, but it seems I'm the only one :-)

https://www.google.co.uk/?gws_rd=ssl#q=%22journal+commit_start+waiting+for%22+%22open+ops+to+drain%22

I tried removing "compress-force=lzo" from "osd mount options btrfs" in ceph.conf, in case it was the compression algorithm which was 'blocking' the writes, but the same seems to be observed. Last entry in log, activity, activity stops, log doesn't progress further. 

On another note, I'm not getting my own e-mails back to myself from the list anymore. In case I should be troubleshooting this side, is/isn't this happening to others?

-----Original Message-----
From: Gregory Farnum [mailto:greg@xxxxxxxxxxx] 
Sent: 16 March 2015 20:40
To: Chris Murray
Cc: ceph-users
Subject: Re:  More than 50% osds down, CPUs still busy; will the cluster recover without help?

On Sat, Mar 14, 2015 at 1:56 AM, Chris Murray <chrismurray84@xxxxxxxxx> wrote:
> Good evening all,
>
> Just had another quick look at this with some further logging on and thought I'd post the results in case anyone can keep me moving in the right direction.
>
> Long story short, some OSDs just don't appear to come up after one failing after another. Dealing with one in isolation, after a load of IO, it never starts. The last in the log looks like this:
>
> ...
> 2015-03-13 18:43:11.875392 7f29d1e98780 10 filestore  > header.spos 
> 0.0.0
> 2015-03-13 18:43:11.876568 7f29d1e98780 15 
> filestore(/var/lib/ceph/osd/ceph-1) _omap_rmkeyrange 
> meta/39e3fb/pglog_4.57c/0//-1 
> [0000000000.00000000000000000000,4294967295.18446744073709551615]
> 2015-03-13 18:43:11.876598 7f29d1e98780 15 
> filestore(/var/lib/ceph/osd/ceph-1) get_omap_iterator 
> meta/39e3fb/pglog_4.57c/0//-1
> 2015-03-13 18:43:11.952511 7f29d1e98780 15 
> filestore(/var/lib/ceph/osd/ceph-1) _omap_rmkeys 
> meta/39e3fb/pglog_4.57c/0//-1
> 2015-03-13 18:43:11.952878 7f29d1e98780 10 filestore oid: 
> 39e3fb/pglog_4.57c/0//-1 not skipping op, *spos 13288339.0.3
> 2015-03-13 18:43:11.952892 7f29d1e98780 10 filestore  > header.spos 
> 0.0.0
> 2015-03-13 18:43:11.961127 7f29d1e98780 15 
> filestore(/var/lib/ceph/osd/ceph-1) _omap_rmkeys 
> meta/39e3fb/pglog_4.57c/0//-1
> 2015-03-13 18:43:11.961516 7f29d1e98780 10 filestore oid: 
> 39e3fb/pglog_4.57c/0//-1 not skipping op, *spos 13288339.0.4
> 2015-03-13 18:43:11.961529 7f29d1e98780 10 filestore  > header.spos 
> 0.0.0
> 2015-03-13 18:43:11.965687 7f29d1e98780 15 
> filestore(/var/lib/ceph/osd/ceph-1) _omap_setkeys 
> meta/39e3fb/pglog_4.57c/0//-1
> 2015-03-13 18:43:11.966082 7f29d1e98780 10 filestore oid: 
> 39e3fb/pglog_4.57c/0//-1 not skipping op, *spos 13288339.0.5
> 2015-03-13 18:43:11.966095 7f29d1e98780 10 filestore  > header.spos 
> 0.0.0
> 2015-03-13 18:43:11.989820 7f29d1e98780 10 journal op_apply_finish 
> 13288339 open_ops 1 -> 0, max_applied_seq 13288338 -> 13288339
> 2015-03-13 18:43:11.989861 7f29d1e98780  3 journal journal_replay: r = 
> 0, op_seq now 13288339
> 2015-03-13 18:43:11.989896 7f29d1e98780  2 journal read_entry 
> 3951706112 : seq 13288340 1755 bytes
> 2015-03-13 18:43:11.989900 7f29d1e98780  3 journal journal_replay: 
> applying op seq 13288340
> 2015-03-13 18:43:11.989903 7f29d1e98780 10 journal op_apply_start 
> 13288340 open_ops 0 -> 1
> 2015-03-13 18:43:11.989906 7f29d1e98780 10 
> filestore(/var/lib/ceph/osd/ceph-1) _do_transaction on 0x2750480
> 2015-03-13 18:43:11.989919 7f29d1e98780 15 
> filestore(/var/lib/ceph/osd/ceph-1) _omap_setkeys 
> meta/16ef7597/infos/head//-1
> 2015-03-13 18:43:11.990251 7f29d1e98780 10 filestore oid: 
> 16ef7597/infos/head//-1 not skipping op, *spos 13288340.0.1
> 2015-03-13 18:43:11.990263 7f29d1e98780 10 filestore  > header.spos 
> 0.0.0
> 2015-03-13 18:43:15.404558 7f29c4439700 20 
> filestore(/var/lib/ceph/osd/ceph-1) sync_entry woke after 5.000217
> 2015-03-13 18:43:15.404600 7f29c4439700 10 journal commit_start 
> max_applied_seq 13288339, open_ops 1
> 2015-03-13 18:43:15.404603 7f29c4439700 10 journal commit_start 
> waiting for 1 open ops to drain
>
> What might this 'open op' mean when it never seems to finish 'draining'? Could my suspicions be true that it's somehow a BTRFS funny?

Well, this line:

> 2015-03-13 18:43:11.989820 7f29d1e98780 10 journal op_apply_finish 
> 13288339 open_ops 1 -> 0, max_applied_seq 13288338 -> 13288339

is the system recognizing that an op has finished. Is that last one not followed by a similar line?

In any case, yes, btrfs misbehaving stupendously badly could cause things to hang, if eg it's just not finishing a write that the OSD is trying to put in. Although I'd naively expect some dmesg output if that were the case.
-Greg
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com