Re: More than 50% osds down, CPUs still busy; will the cluster recover without help?

"Chris Murray" <chrismurray84@xxxxxxxxx> · Tue, 3 Mar 2015 09:44:32 -0000

Ah yes, that's a good point :-)

Thank you for your assistance Greg, I'm understanding a little more about how Ceph operates under the hood now.

We're probably at a reasonable point for me to say I'll just switch the machines off and forget about them for a while. It's no great loss; I just wanted to see if the cluster would come back to life despite any mis-treatment, and how far it can be pushed with the limited resources on the Microservers.

Getting to the admin socket fails:

root@ceph26:~# ceph --admin-daemon /var/run/ceph/ceph-osd.1.asok help
admin_socket: exception getting command descriptions: [Errno 111] Connection refused

And after activity ceased on /dev/sdb ...

(60 second intervals again, snipped many hours of these sorts of figures)
sdb               5.52         0.00       801.27          0      48076
sdb               4.68         0.00       731.80          0      43908
sdb               5.25         0.00       792.80          0      47568
sdb              18.83       483.07       569.53      28984      34172
sdb              28.28       894.60        35.40      53676       2124
sdb               0.00         0.00         0.00          0          0
sdb               0.00         0.00         0.00          0          0
sdb               0.00         0.00         0.00          0          0
sdb               0.00         0.00         0.00          0          0

... the log hadn't progressed beyond the below. Note the last entry was 13 hours prior to activity on sdb ending, so whatever finished writing (then momentarily reading) this morning, didn't add anything to the log.

...
2015-03-02 18:24:45.942970 7f27f03ef780 15 filestore(/var/lib/ceph/osd/ceph-1) get_omap_iterator meta/39e3fb/pglog_4.57c/0//-1
2015-03-02 18:24:45.977857 7f27f03ef780 15 filestore(/var/lib/ceph/osd/ceph-1) _omap_rmkeys meta/39e3fb/pglog_4.57c/0//-1
2015-03-02 18:24:45.978400 7f27f03ef780 10 filestore oid: 39e3fb/pglog_4.57c/0//-1 not skipping op, *spos 13288339.0.3
2015-03-02 18:24:45.978414 7f27f03ef780 10 filestore  > header.spos 0.0.0
2015-03-02 18:24:45.986763 7f27f03ef780 15 filestore(/var/lib/ceph/osd/ceph-1) _omap_rmkeys meta/39e3fb/pglog_4.57c/0//-1
2015-03-02 18:24:45.987350 7f27f03ef780 10 filestore oid: 39e3fb/pglog_4.57c/0//-1 not skipping op, *spos 13288339.0.4
2015-03-02 18:24:45.987363 7f27f03ef780 10 filestore  > header.spos 0.0.0
2015-03-02 18:24:45.991651 7f27f03ef780 15 filestore(/var/lib/ceph/osd/ceph-1) _omap_setkeys meta/39e3fb/pglog_4.57c/0//-1
2015-03-02 18:24:45.992119 7f27f03ef780 10 filestore oid: 39e3fb/pglog_4.57c/0//-1 not skipping op, *spos 13288339.0.5
2015-03-02 18:24:45.992128 7f27f03ef780 10 filestore  > header.spos 0.0.0
2015-03-02 18:24:46.016116 7f27f03ef780 10 filestore(/var/lib/ceph/osd/ceph-1) _do_transaction on 0x1a92540
2015-03-02 18:24:46.016133 7f27f03ef780 15 filestore(/var/lib/ceph/osd/ceph-1) _omap_setkeys meta/16ef7597/infos/head//-1
2015-03-02 18:24:46.016542 7f27f03ef780 10 filestore oid: 16ef7597/infos/head//-1 not skipping op, *spos 13288340.0.1
2015-03-02 18:24:46.016555 7f27f03ef780 10 filestore  > header.spos 0.0.0
2015-03-02 18:24:48.855098 7f27e2fe0700 20 filestore(/var/lib/ceph/osd/ceph-1) sync_entry woke after 5.000291

The complete file is attached, in case it's of interest to anyone.

I get the feeling it's BTRFS which is the 'cause' here. I'm running a scrub in case it highlights anything wrong with the filesystem. If it all springs back to life, I'll post back here with my findings!

Thanks again for the pointers,
Chris

-----Original Message-----
From: Gregory Farnum [mailto:greg@xxxxxxxxxxx] 
Sent: 02 March 2015 18:05
To: Chris Murray
Cc: ceph-users
Subject: Re: [ceph-users] More than 50% osds down, CPUs still busy; will the cluster recover without help?

You can turn the filestore up to 20 instead of 1. ;) You might also explore what information you can get out of the admin socket.

You are correct that those numbers are the OSD epochs, although note that when the system is running you'll get output both for the OSD as a whole and for individual PGs within it (which can be lagging behind). I'm still pretty convinced the OSDs are simply stuck trying to bring their PGs up to date and are thrashing the maps on disk, but we're well past what I can personally diagnose without log diving.
-Greg

On Sat, Feb 28, 2015 at 11:51 AM, Chris Murray <chrismurray84@xxxxxxxxx> wrote:
> After noticing that the number increases by 101 on each attempt to 
> start osd.11, I figured I was only 7 iterations away from the output 
> being within 101 of 63675. So, I killed the osd process, started it 
> again, lather, rinse, repeat. I then did the same for other OSDs. Some 
> created very small logs, and some created logs into the gigabytes. 
> Grepping the latter for "update_osd_stat" showed me where the maps 
> were up to, and therefore which OSDs needed some special attention. 
> Some of the epoch numbers appeared to increase by themselves to a 
> point and then plateaux, after which I'd kill then start the osd 
> again, and this number would start to increase again.
>
> After all either showed 63675, or nothing at all, I turned debugging 
> back off, deleted logs, and tried to bring the cluster back by 
> unsetting noup, nobackfill, norecovery etc. It hasn't got very far 
> before appearing stuck again, with nothing progressing in ceph status. 
> It appears that 11/15 OSDs are now properly up, but four still aren't. 
> A lot of placement groups are stale, so I guess I really need the 
> remaining four to come up.
>
> The OSDs in question are 1, 7, 10 & 12. All have a line similar to 
> this as the last in their log:
>
> 2015-02-28 10:35:04.240822 7f375ef40780  1 journal _open 
> /var/lib/ceph/osd/ceph-1/journal fd 21: 5367660544 bytes, block size
> 4096 bytes, directio = 1, aio = 1
>
> Even with the following in ceph.conf, I'm not seeing anything after 
> that last line in the log.
>
>          debug osd = 20
>          debug filestore = 1
>
> CPU is still being consumed by the ceph-osd process though, but not 
> much memory is being used compared to the other two OSDs which are up 
> on that node.
>
> Is there perhaps even further logging that I can use to see why the 
> logs aren't progressing past this point?
> Osd.1 is on /dev/sdb. iostat still shows some activity as the minutes 
> go on, but not much:
>
> (60 second intervals)
> Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
> sdb               5.45         0.00       807.33          0      48440
> sdb               5.75         0.00       807.33          0      48440
> sdb               5.43         0.00       807.20          0      48440
>
> Thanks,
> Chris
>
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf 
> Of Chris Murray
> Sent: 27 February 2015 10:32
> To: Gregory Farnum
> Cc: ceph-users
> Subject: Re: [ceph-users] More than 50% osds down, CPUs still 
> busy;will the cluster recover without help?
>
> A little further logging:
>
> 2015-02-27 10:27:15.745585 7fe8e3f2f700 20 osd.11 62839 
> update_osd_stat
> osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op 
> hist
> [])
> 2015-02-27 10:27:15.745619 7fe8e3f2f700  5 osd.11 62839 heartbeat:
> osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op 
> hist
> [])
> 2015-02-27 10:27:23.530913 7fe8e8536700  1 -- 
> 192.168.12.25:6800/673078
> --> 192.168.12.25:6789/0 -- 
> --> mon_subscribe({monmap=6+,osd_pg_creates=0})
> v2 -- ?+0 0xe5f26380 con 0xe1f0cc60
> 2015-02-27 10:27:30.645902 7fe8e3f2f700 20 osd.11 62839 
> update_osd_stat
> osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op 
> hist
> [])
> 2015-02-27 10:27:30.645938 7fe8e3f2f700  5 osd.11 62839 heartbeat:
> osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op 
> hist
> [])
> 2015-02-27 10:27:33.531142 7fe8e8536700  1 -- 
> 192.168.12.25:6800/673078
> --> 192.168.12.25:6789/0 -- 
> --> mon_subscribe({monmap=6+,osd_pg_creates=0})
> v2 -- ?+0 0xe5f26540 con 0xe1f0cc60
> 2015-02-27 10:27:43.531333 7fe8e8536700  1 -- 
> 192.168.12.25:6800/673078
> --> 192.168.12.25:6789/0 -- 
> --> mon_subscribe({monmap=6+,osd_pg_creates=0})
> v2 -- ?+0 0xe5f26700 con 0xe1f0cc60
> 2015-02-27 10:27:45.546275 7fe8e3f2f700 20 osd.11 62839 
> update_osd_stat
> osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op 
> hist
> [])
> 2015-02-27 10:27:45.546311 7fe8e3f2f700  5 osd.11 62839 heartbeat:
> osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op 
> hist
> [])
> 2015-02-27 10:27:53.531564 7fe8e8536700  1 -- 
> 192.168.12.25:6800/673078
> --> 192.168.12.25:6789/0 -- 
> --> mon_subscribe({monmap=6+,osd_pg_creates=0})
> v2 -- ?+0 0xe5f268c0 con 0xe1f0cc60
> 2015-02-27 10:27:56.846593 7fe8e3f2f700 20 osd.11 62839 
> update_osd_stat
> osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op 
> hist
> [])
> 2015-02-27 10:27:56.846627 7fe8e3f2f700  5 osd.11 62839 heartbeat:
> osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op 
> hist
> [])
> 2015-02-27 10:27:57.346965 7fe8e3f2f700 20 osd.11 62839 
> update_osd_stat
> osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op 
> hist
> [])
> 2015-02-27 10:27:57.347001 7fe8e3f2f700  5 osd.11 62839 heartbeat:
> osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op 
> hist
> [])
> 2015-02-27 10:28:03.531785 7fe8e8536700  1 -- 
> 192.168.12.25:6800/673078
> --> 192.168.12.25:6789/0 -- 
> --> mon_subscribe({monmap=6+,osd_pg_creates=0})
> v2 -- ?+0 0xe5f26a80 con 0xe1f0cc60
> 2015-02-27 10:28:13.532027 7fe8e8536700  1 -- 
> 192.168.12.25:6800/673078
> --> 192.168.12.25:6789/0 -- 
> --> mon_subscribe({monmap=6+,osd_pg_creates=0})
> v2 -- ?+0 0xe5f26c40 con 0xe1f0cc60
> 2015-02-27 10:28:23.047382 7fe8e3f2f700 20 osd.11 62839 
> update_osd_stat
> osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op 
> hist
> [])
> 2015-02-27 10:28:23.047419 7fe8e3f2f700  5 osd.11 62839 heartbeat:
> osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op 
> hist
> [])
> 2015-02-27 10:28:23.532271 7fe8e8536700  1 -- 
> 192.168.12.25:6800/673078
> --> 192.168.12.25:6789/0 -- 
> --> mon_subscribe({monmap=6+,osd_pg_creates=0})
> v2 -- ?+0 0xe5f26e00 con 0xe1f0cc60
> 2015-02-27 10:28:33.532496 7fe8e8536700  1 -- 
> 192.168.12.25:6800/673078
> --> 192.168.12.25:6789/0 -- 
> --> mon_subscribe({monmap=6+,osd_pg_creates=0})
> v2 -- ?+0 0xe5f26fc0 con 0xe1f0cc60
>
> 62839? But it was 62738 earlier, so it is actually advancing toward 
> the 63675? If what I've assumed about the osd map numbers is true.
>
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf 
> Of Chris Murray
> Sent: 27 February 2015 08:33
> To: Gregory Farnum
> Cc: ceph-users
> Subject: Re: [ceph-users] More than 50% osds down, CPUs still 
> busy;will the cluster recover without help?
>
> That's interesting, it seems to be alternating between two lines, but 
> only one thread this time? I'm guessing the 62738 is the osdmap, which 
> is much behind where it should be? Osd.0 and osd.3 are on 63675, if 
> I'm understanding that correctly.
>
> 2015-02-27 08:18:48.724645 7f2fbd1e8700 20 osd.11 62738 
> update_osd_stat
> osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op 
> hist
> [])
> 2015-02-27 08:18:48.724683 7f2fbd1e8700  5 osd.11 62738 heartbeat:
> osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op 
> hist
> [])
> 2015-02-27 08:19:00.025003 7f2fbd1e8700 20 osd.11 62738 
> update_osd_stat
> osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op 
> hist
> [])
> 2015-02-27 08:19:00.025040 7f2fbd1e8700  5 osd.11 62738 heartbeat:
> osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op 
> hist
> [])
> 2015-02-27 08:19:04.125395 7f2fbd1e8700 20 osd.11 62738 
> update_osd_stat
> osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op 
> hist
> [])
> 2015-02-27 08:19:04.125431 7f2fbd1e8700  5 osd.11 62738 heartbeat:
> osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op 
> hist
> [])
> 2015-02-27 08:19:26.225763 7f2fbd1e8700 20 osd.11 62738 
> update_osd_stat
> osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op 
> hist
> [])
> 2015-02-27 08:19:26.225797 7f2fbd1e8700  5 osd.11 62738 heartbeat:
> osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op 
> hist
> [])
> 2015-02-27 08:19:26.726140 7f2fbd1e8700 20 osd.11 62738 
> update_osd_stat
> osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op 
> hist
> [])
> 2015-02-27 08:19:26.726177 7f2fbd1e8700  5 osd.11 62738 heartbeat:
> osd_stat(1305 GB used, 1431 GB avail, 2789 GB total, peers []/[] op 
> hist
> [])
>
> Activity on /dev/sdb looks similar to how it did without debugging:
>
> sdb               5.95         0.00       701.20          0      42072
> sdb               5.10         0.00       625.60          0      37536
> sdb               4.97         0.00       611.33          0      36680
> sdb               5.77         0.00       701.20          0      42072
>
>
> Some Googling reveals references to log files which have very similar 
> entries, but I can't see anything that just repeats like mine does.
>
> -----Original Message-----
> From: Gregory Farnum [mailto:greg@xxxxxxxxxxx]
> Sent: 26 February 2015 22:37
> To: Chris Murray
> Cc: ceph-users
> Subject: Re: [ceph-users] More than 50% osds down, CPUs still busy; 
> will the cluster recover without help?
>
> If you turn up "debug osd = 20" or something it'll apply a good bit 
> more disk load but give you more debugging logs about what's going on.
> It could be that you're in enough of a mess now that it's stuck trying 
> to calculate past intervals for a bunch of PGs across so many maps 
> that it's swapping things in and out of memory and going slower (if 
> that's the case, then increasing the size of your map cache will help).
> -Greg
>
> On Thu, Feb 26, 2015 at 1:56 PM, Chris Murray 
> <chrismurray84@xxxxxxxxx>
> wrote:
>> Tackling this on a more piecemeal basis, I've stopped all OSDs, and 
>> started just the three which exist on the first host.
>>
>> osd.0 comes up without complaint: "osd.0 63675 done with init, 
>> starting boot process"
>> osd.3 comes up without complaint: "osd.3 63675 done with init, 
>> starting boot process
>> osd.11 is a problematic one.
>>
>> It does something like this ...
>>
>> 2015-02-26 10:44:50.260593 7f7e23551780  0 ceph version 0.80.8 
>> (69eaad7f8308f21573c604f121956e64679a52a7), process ceph-osd, pid
>> 305080
>> 2015-02-26 10:44:50.265525 7f7e23551780  0
>> filestore(/var/lib/ceph/osd/ceph-11) mount detected btrfs
>> 2015-02-26 10:44:51.155501 7f7e23551780  0
>> genericfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_features:
>> FIEMAP ioctl is supported and appears to work
>> 2015-02-26 10:44:51.155536 7f7e23551780  0
>> genericfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_features:
>> FIEMAP ioctl is disabled via 'filestore fiemap' config option
>> 2015-02-26 10:44:51.433239 7f7e23551780  0
>> genericfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_features:
>> syscall(SYS_syncfs, fd) fully supported
>> 2015-02-26 10:44:51.433467 7f7e23551780  0
>> btrfsfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_feature:
>> CLONE_RANGE ioctl is supported
>> 2015-02-26 10:44:51.644373 7f7e23551780  0
>> btrfsfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_feature:
>> SNAP_CREATE is supported
>> 2015-02-26 10:44:51.668424 7f7e23551780  0
>> btrfsfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_feature:
>> SNAP_DESTROY is supported
>> 2015-02-26 10:44:51.668741 7f7e23551780  0
>> btrfsfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_feature:
>> START_SYNC is supported (transid 43205)
>> 2015-02-26 10:44:51.766577 7f7e23551780  0
>> btrfsfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_feature:
>> WAIT_SYNC is supported
>> 2015-02-26 10:44:51.814761 7f7e23551780  0
>> btrfsfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_feature:
>> SNAP_CREATE_V2 is supported
>> 2015-02-26 10:44:52.181382 7f7e23551780  0
>> filestore(/var/lib/ceph/osd/ceph-11) mount: enabling WRITEAHEAD 
>> journal
>> mode: checkpoint is not enabled
>> 2015-02-26 10:44:52.221994 7f7e23551780  1 journal _open 
>> /var/lib/ceph/osd/ceph-11/journal fd 22: 5367660544 bytes, block size
>> 4096 bytes, directio = 1, aio = 1
>> 2015-02-26 10:44:53.073348 7f7e23551780  1 journal _open 
>> /var/lib/ceph/osd/ceph-11/journal fd 22: 5367660544 bytes, block size
>> 4096 bytes, directio = 1, aio = 1
>> 2015-02-26 10:44:53.300055 7f7e23551780  1 journal close 
>> /var/lib/ceph/osd/ceph-11/journal
>> 2015-02-26 10:44:53.388716 7f7e23551780  0
>> filestore(/var/lib/ceph/osd/ceph-11) mount detected btrfs
>> 2015-02-26 10:44:53.522010 7f7e23551780  0
>> genericfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_features:
>> FIEMAP ioctl is supported and appears to work
>> 2015-02-26 10:44:53.522043 7f7e23551780  0
>> genericfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_features:
>> FIEMAP ioctl is disabled via 'filestore fiemap' config option
>> 2015-02-26 10:44:53.898121 7f7e23551780  0
>> genericfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_features:
>> syscall(SYS_syncfs, fd) fully supported
>> 2015-02-26 10:44:53.898273 7f7e23551780  0
>> btrfsfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_feature:
>> CLONE_RANGE ioctl is supported
>> 2015-02-26 10:44:54.122056 7f7e23551780  0
>> btrfsfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_feature:
>> SNAP_CREATE is supported
>> 2015-02-26 10:44:54.142426 7f7e23551780  0
>> btrfsfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_feature:
>> SNAP_DESTROY is supported
>> 2015-02-26 10:44:54.142722 7f7e23551780  0
>> btrfsfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_feature:
>> START_SYNC is supported (transid 43212)
>> 2015-02-26 10:44:54.233152 7f7e23551780  0
>> btrfsfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_feature:
>> WAIT_SYNC is supported
>> 2015-02-26 10:44:54.276165 7f7e23551780  0
>> btrfsfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_feature:
>> SNAP_CREATE_V2 is supported
>> 2015-02-26 10:44:54.550810 7f7e23551780  0
>> filestore(/var/lib/ceph/osd/ceph-11) mount: WRITEAHEAD journal mode 
>> explicitly enabled in conf
>> 2015-02-26 10:44:54.585626 7f7e23551780  1 journal _open 
>> /var/lib/ceph/osd/ceph-11/journal fd 23: 5367660544 bytes, block size
>> 4096 bytes, directio = 1, aio = 1
>> 2015-02-26 10:44:54.589469 7f7e23551780  1 journal _open 
>> /var/lib/ceph/osd/ceph-11/journal fd 23: 5367660544 bytes, block size
>> 4096 bytes, directio = 1, aio = 1
>> 2015-02-26 10:44:54.590413 7f7e23551780  0 <cls>
>> cls/hello/cls_hello.cc:271: loading cls_hello
>> 2015-02-26 10:44:54.604303 7f7e23551780  0 osd.11 61425 crush map has 
>> features 1107558400, adjusting msgr requires for clients
>> 2015-02-26 10:44:54.604323 7f7e23551780  0 osd.11 61425 crush map has 
>> features 1107558400 was 8705, adjusting msgr requires for mons
>> 2015-02-26 10:44:54.604329 7f7e23551780  0 osd.11 61425 crush map has 
>> features 1107558400, adjusting msgr requires for osds
>> 2015-02-26 10:44:54.604343 7f7e23551780  0 osd.11 61425 load_pgs
>>
>> ... then pauses for a couple of minutes ...
>>
>> 2015-02-26 10:46:26.313449 7f7e23551780  0 osd.11 61425 load_pgs 
>> opened
>> 2426 pgs
>> 2015-02-26 10:46:26.316307 7f7e23551780 -1 osd.11 61425
>> set_disk_tp_priority(22) Invalid argument:
>> osd_disk_thread_ioprio_class is  but only the following values are
>> allowed: idle, be or rt
>> 2015-02-26 10:46:26.320354 7f7df1017700  0 osd.11 61425 ignoring 
>> osdmap until we have initialized
>> 2015-02-26 10:46:26.320431 7f7df1017700  0 osd.11 61425 ignoring 
>> osdmap until we have initialized
>> 2015-02-26 10:46:26.334867 7f7df544e700  0 -- 0.0.0.0:6801/305080 >>
>> 192.168.12.27:6801/3135 pipe(0x2c9e500 sd=39 :0 s=1 pgs=0 cs=0 l=0 
>> c=0xe248d1e0).fault with nothing to send, going to standby
>> 2015-02-26 10:46:26.334985 7f7df3378700  0 -- 0.0.0.0:6801/305080 >>
>> 192.168.12.28:6811/3824 pipe(0x2c9ec80 sd=41 :0 s=1 pgs=0 cs=0 l=0 
>> c=0xe248d4a0).fault with nothing to send, going to standby
>> 2015-02-26 10:46:26.335074 7f7df9a90700  0 -- 0.0.0.0:6801/305080 >>
>> 192.168.12.26:6806/3574 pipe(0x2c9f400 sd=43 :0 s=1 pgs=0 cs=0 l=0 
>> c=0xe248d340).fault with nothing to send, going to standby
>> 2015-02-26 10:46:26.335152 7f7df9b91700  0 -- 0.0.0.0:6801/305080 >>
>> 192.168.12.26:6801/3188 pipe(0x2c9f180 sd=44 :0 s=1 pgs=0 cs=0 l=0 
>> c=0xe248d080).fault with nothing to send, going to standby
>> 2015-02-26 10:46:26.341356 7f7df534d700  0 -- 0.0.0.0:6801/305080 >>
>> 192.168.12.25:6801/3318 pipe(0x2c9ef00 sd=40 :52612 s=1 pgs=0 cs=0 
>> l=0
>
>> c=0xe248d600).connect claims to be 0.0.0.0:6801/305080 not
>> 192.168.12.25:6801/3318 - wrong node!
>> 2015-02-26 10:46:26.341522 7f7df534d700  0 -- 0.0.0.0:6801/305080 >>
>> 192.168.12.25:6801/3318 pipe(0x2c9ef00 sd=40 :52612 s=1 pgs=0 cs=0 
>> l=0
>
>> c=0xe248d600).fault with nothing to send, going to standby
>> 2015-02-26 10:46:26.467208 7f7df2233700  0 -- 0.0.0.0:6801/305080 >>
>> 192.168.12.27:6806/3582 pipe(0xe3566000 sd=148 :0 s=1 pgs=0 cs=0 l=0 
>> c=0xe355ac60).fault with nothing to send, going to standby
>> 2015-02-26 10:46:26.467400 7f7df1ef4700  0 -- 0.0.0.0:6801/305080 >>
>> 192.168.12.29:6801/3169 pipe(0xe3566500 sd=149 :0 s=1 pgs=0 cs=0 l=0 
>> c=0xe355ab00).fault with nothing to send, going to standby
>>
>> At this point, it's writing to /dev/sdb at up to about 100 IOPs, 
>> which
>
>> would be sensible for the drive in question: a WD Red 3TB. Over 60 
>> second intervals, iostat looks a bit like this:
>>
>> Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
>> sdb               5.17         0.00       660.40          0      39624
>> sdb               5.40         0.00       688.27          0      41296
>> sdb               4.62         0.00       564.47          0      33868
>> sdb               4.95         0.00       655.13          0      39308
>> sdb               5.33         0.00       688.87          0      41332
>> sdb               5.05         0.00       663.47          0      39808
>>
>> Every now and then there'll be a few kb of reads, but the rest of the 
>> time it's just writing. /dev/sdc and /dev/sdd don't appear to be 
>> doing
>
>> that, despite also being formatted with BTRFS.
>>
>> Then, after whatever I've set "filestore op thread timeout" to, these 
>> messages start. I've tried 600, 3600, and even a whopping 32400 (9 
>> hours, to fit troubleshooting into what's left of my day here in the
>> UK)
>>
>> 2015-02-26 21:39:03.215672 7f724a58d700  1 heartbeat_map is_healthy 
>> 'FileStore::op_tp thread 0x7f723d298700' had timed out after 32400
>> 2015-02-26 21:39:03.215699 7f724a58d700  1 heartbeat_map is_healthy 
>> 'FileStore::op_tp thread 0x7f723dc99700' had timed out after 32400
>> 2015-02-26 21:39:08.215855 7f724a58d700  1 heartbeat_map is_healthy 
>> 'FileStore::op_tp thread 0x7f723d298700' had timed out after 32400
>> 2015-02-26 21:39:08.215883 7f724a58d700  1 heartbeat_map is_healthy 
>> 'FileStore::op_tp thread 0x7f723dc99700' had timed out after 32400
>> 2015-02-26 21:39:13.216038 7f724a58d700  1 heartbeat_map is_healthy 
>> 'FileStore::op_tp thread 0x7f723d298700' had timed out after 32400
>> 2015-02-26 21:39:13.216066 7f724a58d700  1 heartbeat_map is_healthy 
>> 'FileStore::op_tp thread 0x7f723dc99700' had timed out after 32400
>>
>> Note the subtle alternation between two thread numbers.
>>
>> Even after the start of these timeouts, /dev/sdb is still having 
>> bursts of write activity, which is interesting.
>>
>> sdb               5.27         0.00       683.60          0      41016
>> sdb               5.20         0.00       660.40          0      39624
>> sdb               5.17         0.00       639.67          0      38380
>> sdb               5.00         0.53       617.33         32      37040
>>
>> Can anyone comment on what might be causing this error for this osd?
>> Many years ago, when ZFS was in its infancy, I had a dedup disaster 
>> which I thought would never end, but that just needed to do its thing 
>> before the pool came back to life. Could this be a similar scenario 
>> perhaps? Is the activity leading up to something, and BTRFS is slowly 
>> doing what Ceph is asking of it, or is it just going round and round 
>> in circles and I just can't see? :-)
>>
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf 
>> Of Chris Murray
>> Sent: 25 February 2015 12:58
>> To: Gregory Farnum
>> Cc: ceph-users
>> Subject: Re: [ceph-users] More than 50% osds down, CPUs still 
>> busy;will the cluster recover without help?
>>
>> Thanks Greg
>>
>> After seeing some recommendations I found in another thread, my 
>> impatience got the better of me, and I've start the process again, 
>> but
>
>> there is some logic, I promise :-) I've copied the process from 
>> Michael Kidd, I believe, and it goes along the lines of:
>>
>> setting noup, noin, noscrub, nodeep-scrub, norecover, nobackfill 
>> stopping all OSDs setting all OSDs down & out setting various options 
>> in ceph.conf to limit backfill activity etc starting all OSDs wait 
>> until all CPU settles to 0%  <-- I am here unset the noup flag wait 
>> until all CPU settles to 0% unset the noin flag wait until all CPU 
>> settles to 0% unset the nobackfill flag wait until all CPU settles to 
>> 0% unset the norecover flag remove options from ceph.conf unset the 
>> noscrub flag unset the nodeep-scrub flag
>>
>>
>> Currently, host CPU usage is approx the following, so something's 
>> changed, and I'm tempted to leave things a little longer before my 
>> next step, just in case CPU does eventually stop spinning. I read 
>> reports of things taking "a while" even with modern Xeons, so I 
>> suppose it's not outside the realms of possibility that an AMD Neo 
>> might take days to work things out. We're up to 24.5 hours now:
>>
>> 192.168.12.25           20%
>> 192.168.12.26           1%
>> 192.168.12.27           15%
>> 192.168.12.28           1%
>> 192.168.12.29           12%
>>
>> Interesting, as 192.168.12.26 and .28 are the two which stopped 
>> spinning before I restarted this process too.
>>
>> The number of different states is slightly less confusing now, but 
>> not
>
>> by much: :-)
>>
>> 788386/2591752 objects degraded (30.419%)
>>                   90 stale+active+clean
>>                    2 stale+down+remapped+peering
>>                    2 stale+incomplete
>>                    1
>> stale+active+degraded+remapped+wait_backfill+backfill_toofull
>>                    1 stale+degraded
>>                 1255 stale+active+degraded
>>                   32 stale+remapped+peering
>>                  773 stale+active+remapped
>>                    4 stale+active+degraded+remapped+backfill_toofull
>>                 1254 stale+down+peering
>>                  278 stale+peering
>>                   33 stale+active+remapped+backfill_toofull
>>                  563 stale+active+degraded+remapped
>>
>>> Well, you below indicate that osd.14's log says it crashed on an
>> internal heartbeat timeout (usually, it got stuck waiting for disk IO 
>> or the kernel/btrfs hung), so that's why. The osd.12 process exists 
>> but isn't "up"; osd.14 doesn't even have a socket to connect to.
>>
>> Ah, that does make sense, thank you.
>>
>>> That's not what I'd expect to see (it appears to have timed out and
>> not be recognizing it?) but I don't look at these things too often so 
>> maybe that's the normal indication that heartbeats are failing.
>>
>> I'm not sure what this means either. A google for "heartbeat_map 
>> is_healthy  FileStore::op_tp thread had timed out after" doesn't 
>> return much, but I did see this quote from Sage on what looks like a 
>> similar
>> matter:
>>
>>> - the filestore op_queue is blocked on the throttler (too much io
>>> queued)
>>> - the commit thread is also waiting for ops to finish
>>> - i see no actual thread processing the op_queue Usually that's 
>>> because it hit a kernel bug and got killed.  Not sure what else 
>>> would
>
>>> make that thread disappear...
>>> sage
>>
>> Oh!
>>
>>> Also, you want to find out why they're dying. That's probably the 
>>> root
>>
>>> cause of your issues
>>
>> I have a sneaking suspicion it's BTRFS, but don't have the evidence 
>> or
>
>> perhaps the knowledge to prove it. If XFS did compression, I'd go 
>> with
>
>> that, but at the moment I need to rely on compression to solve the 
>> problem of reclaiming space *within* files which reside on ceph. As 
>> far as I remember, overwriting with zeros didn't re-do the thin 
>> provisioning on XFS, if that makes sense.
>>
>> Thanks again,
>> Chris
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-----
No virus found in this message.
Checked by AVG - www.avg.com
Version: 2015.0.5736 / Virus Database: 4293/9154 - Release Date: 02/21/15 Internal Virus Database is out of date.
Attachment:
ceph-osd.1.log.1

Description: ceph-osd.1.log.1
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com