Re: More than 50% osds down, CPUs still busy; will the cluster recover without help?

Gregory Farnum <greg@xxxxxxxxxxx> · Wed, 18 Mar 2015 07:01:29 -0700

On Wed, Mar 18, 2015 at 3:28 AM, Chris Murray <chrismurray84@xxxxxxxxx> wrote:
> Hi again Greg :-)
>
> No, it doesn't seem to progress past that point. I started the OSD again a couple of nights ago:
>
> 2015-03-16 21:34:46.221307 7fe4a8aa7780 10 journal op_apply_finish 13288339 open_ops 1 -> 0, max_applied_seq 13288338 -> 13288339
> 2015-03-16 21:34:46.221445 7fe4a8aa7780  3 journal journal_replay: r = 0, op_seq now 13288339
> 2015-03-16 21:34:46.221513 7fe4a8aa7780  2 journal read_entry 3951706112 : seq 13288340 1755 bytes
> 2015-03-16 21:34:46.221547 7fe4a8aa7780  3 journal journal_replay: applying op seq 13288340
> 2015-03-16 21:34:46.221579 7fe4a8aa7780 10 journal op_apply_start 13288340 open_ops 0 -> 1
> 2015-03-16 21:34:46.221610 7fe4a8aa7780 10 filestore(/var/lib/ceph/osd/ceph-1) _do_transaction on 0x3142480
> 2015-03-16 21:34:46.221651 7fe4a8aa7780 15 filestore(/var/lib/ceph/osd/ceph-1) _omap_setkeys meta/16ef7597/infos/head//-1
> 2015-03-16 21:34:46.222017 7fe4a8aa7780 10 filestore oid: 16ef7597/infos/head//-1 not skipping op, *spos 13288340.0.1
> 2015-03-16 21:34:46.222053 7fe4a8aa7780 10 filestore  > header.spos 0.0.0
> 2015-03-16 21:34:48.096002 7fe49a5ac700 20 filestore(/var/lib/ceph/osd/ceph-1) sync_entry woke after 5.000178
> 2015-03-16 21:34:48.096037 7fe49a5ac700 10 journal commit_start max_applied_seq 13288339, open_ops 1
> 2015-03-16 21:34:48.096040 7fe49a5ac700 10 journal commit_start waiting for 1 open ops to drain
>
> There's the success line for 13288339, like you mentioned. But not one for 13288340.
>
> Intriguing. So, those same 1755 bytes seem problematic every time the journal is replayed? Interestingly, there is a lot (in time, not exactly data mass or IOPs, but still more than 1755 bytes!) of activity while the log is at this line:
>
> 2015-03-16 21:34:48.096040 7fe49a5ac700 10 journal commit_start waiting for 1 open ops to drain
>
> ... but then the IO ceases and the log still doesn't go any further. I wonder why 13288339 doesn't have that same  'waiting for ... open ops to drain' line. Or the 'woke after' one for that matter.
>
> While there is activity on sdb, it 'pulses' every 10 seconds or so, like this:
>
> sdb              20.00         0.00      3404.00          0       3404
> sdb              16.00         0.00      2100.00          0       2100
> sdb              10.00         0.00      1148.00          0       1148
> sdb               0.00         0.00         0.00          0          0
> sdb               0.00         0.00         0.00          0          0
> sdb               0.00         0.00         0.00          0          0
> sdb               0.00         0.00         0.00          0          0
> sdb               0.00         0.00         0.00          0          0
> sdb               0.00         0.00         0.00          0          0
> sdb               0.00         0.00         0.00          0          0
> sdb               1.00         0.00       496.00          0        496
> sdb              32.00         0.00      4940.00          0       4940
> sdb               8.00         0.00      1144.00          0       1144
> sdb               1.00         0.00         4.00          0          4
> sdb               0.00         0.00         0.00          0          0
> sdb               0.00         0.00         0.00          0          0
> sdb               0.00         0.00         0.00          0          0
> sdb               0.00         0.00         0.00          0          0
> sdb               0.00         0.00         0.00          0          0
> sdb               0.00         0.00         0.00          0          0
> sdb               0.00         0.00         0.00          0          0
> sdb              17.00         0.00      3340.00          0       3340
> sdb              23.00         0.00      3368.00          0       3368
> sdb               1.00         0.00         4.00          0          4
> sdb               0.00         0.00         0.00          0          0
> sdb               0.00         0.00         0.00          0          0
> sdb               0.00         0.00         0.00          0          0
> sdb               0.00         0.00         0.00          0          0
> sdb               0.00         0.00         0.00          0          0
> sdb               0.00         0.00         0.00          0          0
> sdb               0.00         0.00         0.00          0          0
> sdb              13.00         0.00      3332.00          0       3332
> sdb              18.00         0.00      2360.00          0       2360
> sdb              59.00         0.00      7464.00          0       7464
> sdb               0.00         0.00         0.00          0          0
>
> I was hoping Google may have held some clues, but it seems I'm the only one :-)
>
> https://www.google.co.uk/?gws_rd=ssl#q=%22journal+commit_start+waiting+for%22+%22open+ops+to+drain%22
>
> I tried removing "compress-force=lzo" from "osd mount options btrfs" in ceph.conf, in case it was the compression algorithm which was 'blocking' the writes, but the same seems to be observed. Last entry in log, activity, activity stops, log doesn't progress further.

Ooooohhhhhh.....I'm not terribly familiar with the code that's stuck,
but last time it came up btrfs compression was known to be broken
under the Ceph workload. :( I think it tended to crash rather than
hang like this so I'm a bit surprised, but if this op is touching a
"broken" file or something that could explain it.

> On another note, I'm not getting my own e-mails back to myself from the list anymore. In case I should be troubleshooting this side, is/isn't this happening to others?

Uh, no idea. I think that's a config option you can set on mailman?
But I'm using Gmail so this has always been hidden from me.
-Greg
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com