Hi again Greg :-)

No, it doesn't seem to progress past that point. I started the OSD again a couple of nights ago:

2015-03-16 21:34:46.221307 7fe4a8aa7780 10 journal op_apply_finish 13288339 open_ops 1 -> 0, max_applied_seq 13288338 -> 13288339
2015-03-16 21:34:46.221445 7fe4a8aa7780 3 journal journal_replay: r = 0, op_seq now 13288339
2015-03-16 21:34:46.221513 7fe4a8aa7780 2 journal read_entry 3951706112 : seq 13288340 1755 bytes
2015-03-16 21:34:46.221547 7fe4a8aa7780 3 journal journal_replay: applying op seq 13288340
2015-03-16 21:34:46.221579 7fe4a8aa7780 10 journal op_apply_start 13288340 open_ops 0 -> 1
2015-03-16 21:34:46.221610 7fe4a8aa7780 10 filestore(/var/lib/ceph/osd/ceph-1) _do_transaction on 0x3142480
2015-03-16 21:34:46.221651 7fe4a8aa7780 15 filestore(/var/lib/ceph/osd/ceph-1) _omap_setkeys meta/16ef7597/infos/head//-1
2015-03-16 21:34:46.222017 7fe4a8aa7780 10 filestore oid: 16ef7597/infos/head//-1 not skipping op, *spos 13288340.0.1
2015-03-16 21:34:46.222053 7fe4a8aa7780 10 filestore > header.spos 0.0.0
2015-03-16 21:34:48.096002 7fe49a5ac700 20 filestore(/var/lib/ceph/osd/ceph-1) sync_entry woke after 5.000178
2015-03-16 21:34:48.096037 7fe49a5ac700 10 journal commit_start max_applied_seq 13288339, open_ops 1
2015-03-16 21:34:48.096040 7fe49a5ac700 10 journal commit_start waiting for 1 open ops to drain

There's the success line for 13288339, like you mentioned, but not one for 13288340. Intriguing. So those same 1755 bytes seem to be problematic every time the journal is replayed?

Interestingly, there is a lot of activity (a lot in duration, not so much in data volume or IOPS, but still more than 1755 bytes' worth!) while the log sits at this line:

2015-03-16 21:34:48.096040 7fe49a5ac700 10 journal commit_start waiting for 1 open ops to drain

... but then the IO ceases and the log still doesn't go any further. I wonder why 13288339 doesn't have that same 'waiting for ... open ops to drain' line, or the 'woke after' one for that matter.

While there is activity on sdb, it 'pulses' every 10 seconds or so, like this:

sdb 20.00 0.00 3404.00 0 3404
sdb 16.00 0.00 2100.00 0 2100
sdb 10.00 0.00 1148.00 0 1148
sdb 0.00 0.00 0.00 0 0
sdb 0.00 0.00 0.00 0 0
sdb 0.00 0.00 0.00 0 0
sdb 0.00 0.00 0.00 0 0
sdb 0.00 0.00 0.00 0 0
sdb 0.00 0.00 0.00 0 0
sdb 0.00 0.00 0.00 0 0
sdb 1.00 0.00 496.00 0 496
sdb 32.00 0.00 4940.00 0 4940
sdb 8.00 0.00 1144.00 0 1144
sdb 1.00 0.00 4.00 0 4
sdb 0.00 0.00 0.00 0 0
sdb 0.00 0.00 0.00 0 0
sdb 0.00 0.00 0.00 0 0
sdb 0.00 0.00 0.00 0 0
sdb 0.00 0.00 0.00 0 0
sdb 0.00 0.00 0.00 0 0
sdb 0.00 0.00 0.00 0 0
sdb 17.00 0.00 3340.00 0 3340
sdb 23.00 0.00 3368.00 0 3368
sdb 1.00 0.00 4.00 0 4
sdb 0.00 0.00 0.00 0 0
sdb 0.00 0.00 0.00 0 0
sdb 0.00 0.00 0.00 0 0
sdb 0.00 0.00 0.00 0 0
sdb 0.00 0.00 0.00 0 0
sdb 0.00 0.00 0.00 0 0
sdb 0.00 0.00 0.00 0 0
sdb 13.00 0.00 3332.00 0 3332
sdb 18.00 0.00 2360.00 0 2360
sdb 59.00 0.00 7464.00 0 7464
sdb 0.00 0.00 0.00 0 0

I was hoping Google might hold some clues, but it seems I'm the only one :-)

https://www.google.co.uk/?gws_rd=ssl#q=%22journal+commit_start+waiting+for%22+%22open+ops+to+drain%22

I tried removing "compress-force=lzo" from "osd mount options btrfs" in ceph.conf, in case the compression algorithm was what was 'blocking' the writes, but the behaviour is the same: last entry in the log, activity, activity stops, log doesn't progress any further.

On another note, I'm not getting my own e-mails back from the list anymore. In case I should be troubleshooting at my end: is this happening to anyone else?
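In case it helps to confirm that it really is the same op stalling on every replay, a small log-watcher along these lines could flag any op seq that gets an op_apply_start but never a matching op_apply_finish. Just a sketch, not Ceph tooling; the log path and the 60-second 'stuck' threshold are assumptions to adjust:

# Sketch only (not Ceph tooling): tail an OSD log and flag any op seq that
# gets an op_apply_start but no op_apply_finish within STALL_SECS.
import re
import time

LOG = "/var/log/ceph/ceph-osd.1.log"    # assumed log location for osd.1
STALL_SECS = 60                         # assumed threshold before calling it stuck

start_re = re.compile(r"op_apply_start (\d+) ")
finish_re = re.compile(r"op_apply_finish (\d+) ")

open_ops = {}  # seq -> wall-clock time we saw op_apply_start

with open(LOG) as f:
    f.seek(0, 2)                        # start at the end, like tail -f
    while True:
        line = f.readline()
        if not line:
            now = time.time()
            for seq in [s for s, t in open_ops.items() if now - t > STALL_SECS]:
                print("seq %d has been applying for over %ds with no op_apply_finish"
                      % (seq, STALL_SECS))
                del open_ops[seq]       # report each stall once
            time.sleep(1)
            continue
        m = start_re.search(line)
        if m:
            open_ops[int(m.group(1))] = time.time()
        m = finish_re.search(line)
        if m:
            open_ops.pop(int(m.group(1)), None)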
-----Original Message-----
From: Gregory Farnum [mailto:greg@xxxxxxxxxxx]
Sent: 16 March 2015 20:40
To: Chris Murray
Cc: ceph-users
Subject: Re: More than 50% osds down, CPUs still busy; will the cluster recover without help?

On Sat, Mar 14, 2015 at 1:56 AM, Chris Murray <chrismurray84@xxxxxxxxx> wrote:
> Good evening all,
>
> Just had another quick look at this with some further logging on, and thought I'd post the results in case anyone can keep me moving in the right direction.
>
> Long story short, some OSDs just don't appear to come up after failing one after another. Dealing with one in isolation: after a load of IO, it never starts. The last entries in the log look like this:
>
> ...
> 2015-03-13 18:43:11.875392 7f29d1e98780 10 filestore > header.spos 0.0.0
> 2015-03-13 18:43:11.876568 7f29d1e98780 15 filestore(/var/lib/ceph/osd/ceph-1) _omap_rmkeyrange meta/39e3fb/pglog_4.57c/0//-1 [0000000000.00000000000000000000,4294967295.18446744073709551615]
> 2015-03-13 18:43:11.876598 7f29d1e98780 15 filestore(/var/lib/ceph/osd/ceph-1) get_omap_iterator meta/39e3fb/pglog_4.57c/0//-1
> 2015-03-13 18:43:11.952511 7f29d1e98780 15 filestore(/var/lib/ceph/osd/ceph-1) _omap_rmkeys meta/39e3fb/pglog_4.57c/0//-1
> 2015-03-13 18:43:11.952878 7f29d1e98780 10 filestore oid: 39e3fb/pglog_4.57c/0//-1 not skipping op, *spos 13288339.0.3
> 2015-03-13 18:43:11.952892 7f29d1e98780 10 filestore > header.spos 0.0.0
> 2015-03-13 18:43:11.961127 7f29d1e98780 15 filestore(/var/lib/ceph/osd/ceph-1) _omap_rmkeys meta/39e3fb/pglog_4.57c/0//-1
> 2015-03-13 18:43:11.961516 7f29d1e98780 10 filestore oid: 39e3fb/pglog_4.57c/0//-1 not skipping op, *spos 13288339.0.4
> 2015-03-13 18:43:11.961529 7f29d1e98780 10 filestore > header.spos 0.0.0
> 2015-03-13 18:43:11.965687 7f29d1e98780 15 filestore(/var/lib/ceph/osd/ceph-1) _omap_setkeys meta/39e3fb/pglog_4.57c/0//-1
> 2015-03-13 18:43:11.966082 7f29d1e98780 10 filestore oid: 39e3fb/pglog_4.57c/0//-1 not skipping op, *spos 13288339.0.5
> 2015-03-13 18:43:11.966095 7f29d1e98780 10 filestore > header.spos 0.0.0
> 2015-03-13 18:43:11.989820 7f29d1e98780 10 journal op_apply_finish 13288339 open_ops 1 -> 0, max_applied_seq 13288338 -> 13288339
> 2015-03-13 18:43:11.989861 7f29d1e98780 3 journal journal_replay: r = 0, op_seq now 13288339
> 2015-03-13 18:43:11.989896 7f29d1e98780 2 journal read_entry 3951706112 : seq 13288340 1755 bytes
> 2015-03-13 18:43:11.989900 7f29d1e98780 3 journal journal_replay: applying op seq 13288340
> 2015-03-13 18:43:11.989903 7f29d1e98780 10 journal op_apply_start 13288340 open_ops 0 -> 1
> 2015-03-13 18:43:11.989906 7f29d1e98780 10 filestore(/var/lib/ceph/osd/ceph-1) _do_transaction on 0x2750480
> 2015-03-13 18:43:11.989919 7f29d1e98780 15 filestore(/var/lib/ceph/osd/ceph-1) _omap_setkeys meta/16ef7597/infos/head//-1
> 2015-03-13 18:43:11.990251 7f29d1e98780 10 filestore oid: 16ef7597/infos/head//-1 not skipping op, *spos 13288340.0.1
> 2015-03-13 18:43:11.990263 7f29d1e98780 10 filestore > header.spos 0.0.0
> 2015-03-13 18:43:15.404558 7f29c4439700 20 filestore(/var/lib/ceph/osd/ceph-1) sync_entry woke after 5.000217
> 2015-03-13 18:43:15.404600 7f29c4439700 10 journal commit_start max_applied_seq 13288339, open_ops 1
> 2015-03-13 18:43:15.404603 7f29c4439700 10 journal commit_start waiting for 1 open ops to drain
>
> What might this 'open op' mean when it never seems to finish 'draining'? Could my suspicions be true that it's somehow a BTRFS funny?
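The 'open ops to drain' message reads like a simple counter handshake: op_apply_start bumps open_ops, op_apply_finish drops it back down, and the sync thread's commit_start blocks until the counter reaches zero. Roughly this pattern, sketched in Python purely for illustration (the real FileStore code is C++ and differs in detail; the method names below just mirror the log messages):

# Illustration of the apply/commit handshake the log lines suggest.
# A sketch, not Ceph's actual FileStore implementation.
import threading

class ApplyCommitGate:
    def __init__(self):
        self.cond = threading.Condition()
        self.open_ops = 0
        self.committing = False

    def op_apply_start(self, seq):
        with self.cond:
            while self.committing:        # new applies hold off while a commit starts
                self.cond.wait()
            self.open_ops += 1            # "open_ops 0 -> 1"

    def op_apply_finish(self, seq):
        with self.cond:
            self.open_ops -= 1            # "open_ops 1 -> 0"
            self.cond.notify_all()

    def commit_start(self):
        with self.cond:
            self.committing = True
            while self.open_ops > 0:      # "waiting for N open ops to drain"
                self.cond.wait()          # never returns if a finish never arrives
            self.committing = False
            self.cond.notify_all()

If the filesystem never completes the work behind seq 13288340, op_apply_finish never runs for it, and commit_start sits at 'waiting for 1 open ops to drain' indefinitely, which matches the hang.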
Well, this line:

> 2015-03-13 18:43:11.989820 7f29d1e98780 10 journal op_apply_finish 13288339 open_ops 1 -> 0, max_applied_seq 13288338 -> 13288339

is the system recognizing that an op has finished. Is that last one not followed by a similar line?

In any case, yes, btrfs misbehaving stupendously badly could cause things to hang, if, e.g., it's just not finishing a write that the OSD is trying to put in. Although I'd naively expect some dmesg output if that were the case.
-Greg
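One way to test the btrfs theory when dmesg stays quiet: while the OSD is stuck, dump the kernel stacks of its threads and see whether one is parked inside a btrfs_* call. A rough sketch (needs root; assumes /proc/<pid>/task/<tid>/stack is available on this kernel and that the daemon's comm name is "ceph-osd"):

# Sketch: print kernel stacks of ceph-osd threads that appear to be stuck
# inside btrfs. Needs root; assumes /proc/<pid>/task/<tid>/stack exists and
# that the daemon's comm name is "ceph-osd".
import glob

def ceph_osd_pids():
    pids = []
    for path in glob.glob("/proc/[0-9]*/comm"):
        try:
            with open(path) as f:
                if f.read().strip() == "ceph-osd":
                    pids.append(path.split("/")[2])
        except IOError:
            pass
    return pids

for pid in ceph_osd_pids():
    for stack_path in glob.glob("/proc/%s/task/*/stack" % pid):
        try:
            with open(stack_path) as f:
                stack = f.read()
        except IOError:
            continue
        if "btrfs" in stack:
            tid = stack_path.split("/")[4]
            print("pid %s thread %s looks blocked in btrfs:\n%s" % (pid, tid, stack))

If sysrq is enabled, echo w > /proc/sysrq-trigger should likewise dump any tasks stuck in uninterruptible sleep to dmesg.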