Ah, I was wondering myself if compression could be causing an issue, but I'm reconsidering now. My latest experiment should hopefully help troubleshoot. So, I remembered that ZLIB is slower, but is more 'safe for old kernels'. I try that: find /var/lib/ceph/osd/ceph-1/current -xdev \( -type f -o -type d \) -exec btrfs filesystem defragment -v -czlib -- {} + After much, much waiting, all files have been rewritten, but the OSD still gets stuck at the same point. I've now unset the compress attribute on all files and started the defragment process again, but I'm not too hopeful since the files must be readable/writeable if I didn't get some failure during the defrag process. find /var/lib/ceph/osd/ceph-1/current -xdev \( -type f -o -type d \) -exec chattr -c -- {} + find /var/lib/ceph/osd/ceph-1/current -xdev \( -type f -o -type d \) -exec btrfs filesystem defragment -v -- {} + (latter command still running) Any other ideas at all? In the absence of the problem being spelled out to me with an error of some sort, I'm not sure how to troubleshoot further. Is it safe to upgrade a problematic cluster, when the time comes, in case this ultimately is a CEPH bug which is fixed in something later than 0.80.9? -----Original Message----- From: Gregory Farnum [mailto:greg@xxxxxxxxxxx] Sent: 18 March 2015 14:01 To: Chris Murray Cc: ceph-users Subject: Re: More than 50% osds down, CPUs still busy; will the cluster recover without help? On Wed, Mar 18, 2015 at 3:28 AM, Chris Murray <chrismurray84@xxxxxxxxx> wrote: > Hi again Greg :-) > > No, it doesn't seem to progress past that point. I started the OSD again a couple of nights ago: > > 2015-03-16 21:34:46.221307 7fe4a8aa7780 10 journal op_apply_finish > 13288339 open_ops 1 -> 0, max_applied_seq 13288338 -> 13288339 > 2015-03-16 21:34:46.221445 7fe4a8aa7780 3 journal journal_replay: r = > 0, op_seq now 13288339 > 2015-03-16 21:34:46.221513 7fe4a8aa7780 2 journal read_entry > 3951706112 : seq 13288340 1755 bytes > 2015-03-16 21:34:46.221547 7fe4a8aa7780 3 journal journal_replay: > applying op seq 13288340 > 2015-03-16 21:34:46.221579 7fe4a8aa7780 10 journal op_apply_start > 13288340 open_ops 0 -> 1 > 2015-03-16 21:34:46.221610 7fe4a8aa7780 10 > filestore(/var/lib/ceph/osd/ceph-1) _do_transaction on 0x3142480 > 2015-03-16 21:34:46.221651 7fe4a8aa7780 15 > filestore(/var/lib/ceph/osd/ceph-1) _omap_setkeys > meta/16ef7597/infos/head//-1 > 2015-03-16 21:34:46.222017 7fe4a8aa7780 10 filestore oid: > 16ef7597/infos/head//-1 not skipping op, *spos 13288340.0.1 > 2015-03-16 21:34:46.222053 7fe4a8aa7780 10 filestore > header.spos > 0.0.0 > 2015-03-16 21:34:48.096002 7fe49a5ac700 20 > filestore(/var/lib/ceph/osd/ceph-1) sync_entry woke after 5.000178 > 2015-03-16 21:34:48.096037 7fe49a5ac700 10 journal commit_start > max_applied_seq 13288339, open_ops 1 > 2015-03-16 21:34:48.096040 7fe49a5ac700 10 journal commit_start > waiting for 1 open ops to drain > > There's the success line for 13288339, like you mentioned. But not one for 13288340. > > Intriguing. So, those same 1755 bytes seem problematic every time the journal is replayed? Interestingly, there is a lot (in time, not exactly data mass or IOPs, but still more than 1755 bytes!) of activity while the log is at this line: > > 2015-03-16 21:34:48.096040 7fe49a5ac700 10 journal commit_start > waiting for 1 open ops to drain > > ... but then the IO ceases and the log still doesn't go any further. I wonder why 13288339 doesn't have that same 'waiting for ... open ops to drain' line. Or the 'woke after' one for that matter. > > While there is activity on sdb, it 'pulses' every 10 seconds or so, like this: > > sdb 20.00 0.00 3404.00 0 3404 > sdb 16.00 0.00 2100.00 0 2100 > sdb 10.00 0.00 1148.00 0 1148 > sdb 0.00 0.00 0.00 0 0 > sdb 0.00 0.00 0.00 0 0 > sdb 0.00 0.00 0.00 0 0 > sdb 0.00 0.00 0.00 0 0 > sdb 0.00 0.00 0.00 0 0 > sdb 0.00 0.00 0.00 0 0 > sdb 0.00 0.00 0.00 0 0 > sdb 1.00 0.00 496.00 0 496 > sdb 32.00 0.00 4940.00 0 4940 > sdb 8.00 0.00 1144.00 0 1144 > sdb 1.00 0.00 4.00 0 4 > sdb 0.00 0.00 0.00 0 0 > sdb 0.00 0.00 0.00 0 0 > sdb 0.00 0.00 0.00 0 0 > sdb 0.00 0.00 0.00 0 0 > sdb 0.00 0.00 0.00 0 0 > sdb 0.00 0.00 0.00 0 0 > sdb 0.00 0.00 0.00 0 0 > sdb 17.00 0.00 3340.00 0 3340 > sdb 23.00 0.00 3368.00 0 3368 > sdb 1.00 0.00 4.00 0 4 > sdb 0.00 0.00 0.00 0 0 > sdb 0.00 0.00 0.00 0 0 > sdb 0.00 0.00 0.00 0 0 > sdb 0.00 0.00 0.00 0 0 > sdb 0.00 0.00 0.00 0 0 > sdb 0.00 0.00 0.00 0 0 > sdb 0.00 0.00 0.00 0 0 > sdb 13.00 0.00 3332.00 0 3332 > sdb 18.00 0.00 2360.00 0 2360 > sdb 59.00 0.00 7464.00 0 7464 > sdb 0.00 0.00 0.00 0 0 > > I was hoping Google may have held some clues, but it seems I'm the > only one :-) > > https://www.google.co.uk/?gws_rd=ssl#q=%22journal+commit_start+waiting > +for%22+%22open+ops+to+drain%22 > > I tried removing "compress-force=lzo" from "osd mount options btrfs" in ceph.conf, in case it was the compression algorithm which was 'blocking' the writes, but the same seems to be observed. Last entry in log, activity, activity stops, log doesn't progress further. Ooooohhhhhh.....I'm not terribly familiar with the code that's stuck, but last time it came up btrfs compression was known to be broken under the Ceph workload. :( I think it tended to crash rather than hang like this so I'm a bit surprised, but if this op is touching a "broken" file or something that could explain it. > On another note, I'm not getting my own e-mails back to myself from the list anymore. In case I should be troubleshooting this side, is/isn't this happening to others? Uh, no idea. I think that's a config option you can set on mailman? But I'm using Gmail so this has always been hidden from me. -Greg ----- No virus found in this message. Checked by AVG - www.avg.com Version: 2015.0.5751 / Virus Database: 4306/9314 - Release Date: 03/16/15 _______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com