Power Outage

Craig,

Thanks. It turns out one of my memory sticks went bad after that power outage. While trying to fix the OSDs I ran into many kernel crashes. After removing the bad memory, I was able to fix them. I did remove all OSDs on that machine and rebuilt them, as I didn't trust that data anymore. =P

I was hoping the MDS would come up after that, but it didn't. It shows this and kills itself. Is this related to the 0.82 MDS issue?
2014-08-12 14:35:11.250634 7ff794bd57c0  0 ceph version 0.80.5 (38b73c67d375a2552d8ed67843c8a65c2c0feba6), process ceph-mds, pid 10244
2014-08-12 14:35:11.251092 7ff794bd57c0  1 -- 192.168.1.20:0/0 learned my addr 192.168.1.20:0/0
2014-08-12 14:35:11.251118 7ff794bd57c0  1 accepter.accepter.bind my_inst.addr is 192.168.1.20:6800/10244 need_addr=0
2014-08-12 14:35:11.259207 7ff794bd57c0  1 -- 192.168.1.20:6800/10244 messenger.start
2014-08-12 14:35:11.259576 7ff794bd57c0 10 mds.-1.0 168 MDSCacheObject
2014-08-12 14:35:11.259625 7ff794bd57c0 10 mds.-1.0 2304        CInode
2014-08-12 14:35:11.259630 7ff794bd57c0 10 mds.-1.0 16  elist<>::item   *7=112
2014-08-12 14:35:11.259635 7ff794bd57c0 10 mds.-1.0 480  inode_t
2014-08-12 14:35:11.259639 7ff794bd57c0 10 mds.-1.0 56   nest_info_t
2014-08-12 14:35:11.259644 7ff794bd57c0 10 mds.-1.0 32   frag_info_t
2014-08-12 14:35:11.259648 7ff794bd57c0 10 mds.-1.0 40  SimpleLock   *5=200
2014-08-12 14:35:11.259652 7ff794bd57c0 10 mds.-1.0 48  ScatterLock  *3=144
2014-08-12 14:35:11.259656 7ff794bd57c0 10 mds.-1.0 488 CDentry
2014-08-12 14:35:11.259661 7ff794bd57c0 10 mds.-1.0 16  elist<>::item
2014-08-12 14:35:11.259669 7ff794bd57c0 10 mds.-1.0 40  SimpleLock
2014-08-12 14:35:11.259674 7ff794bd57c0 10 mds.-1.0 1016        CDir
2014-08-12 14:35:11.259678 7ff794bd57c0 10 mds.-1.0 16  elist<>::item   *2=32
2014-08-12 14:35:11.259682 7ff794bd57c0 10 mds.-1.0 192  fnode_t
2014-08-12 14:35:11.259687 7ff794bd57c0 10 mds.-1.0 56   nest_info_t *2
2014-08-12 14:35:11.259691 7ff794bd57c0 10 mds.-1.0 32   frag_info_t *2
2014-08-12 14:35:11.259695 7ff794bd57c0 10 mds.-1.0 176 Capability
2014-08-12 14:35:11.259699 7ff794bd57c0 10 mds.-1.0 32  xlist<>::item   *2=64
2014-08-12 14:35:11.259767 7ff794bd57c0  1 accepter.accepter.start
2014-08-12 14:35:11.260734 7ff794bd57c0  1 -- 192.168.1.20:6800/10244 --> 192.168.1.20:6789/0 -- auth(proto 0 31 bytes epoch 0) v1 -- ?+0 0x3684000 con 0x36ac580
2014-08-12 14:35:11.261346 7ff794bcd700 10 mds.-1.0 MDS::ms_get_authorizer type=mon
2014-08-12 14:35:11.261696 7ff78fe4f700  5 mds.-1.0 ms_handle_connect on 192.168.1.20:6789/0
2014-08-12 14:35:11.262409 7ff78fe4f700  1 -- 192.168.1.20:6800/10244 <== mon.0 192.168.1.20:6789/0 1 ==== mon_map v1 ==== 194+0+0 (4155369063 0 0) 0x36d4000 con 0x36ac580
2014-08-12 14:35:11.262572 7ff78fe4f700  1 -- 192.168.1.20:6800/10244 <== mon.0 192.168.1.20:6789/0 2 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 33+0+0 (2093056952 0 0) 0x3691400 con 0x36ac580
2014-08-12 14:35:11.262925 7ff78fe4f700  1 -- 192.168.1.20:6800/10244 --> 192.168.1.20:6789/0 -- auth(proto 2 32 bytes epoch 0) v1 -- ?+0 0x3684240 con 0x36ac580
2014-08-12 14:35:11.263643 7ff78fe4f700  1 -- 192.168.1.20:6800/10244 <== mon.0 192.168.1.20:6789/0 3 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 206+0+0 (1371651101 0 0) 0x3691800 con 0x36ac580
2014-08-12 14:35:11.263807 7ff78fe4f700  1 -- 192.168.1.20:6800/10244 --> 192.168.1.20:6789/0 -- auth(proto 2 165 bytes epoch 0) v1 -- ?+0 0x36846c0 con 0x36ac580
2014-08-12 14:35:11.264518 7ff78fe4f700  1 -- 192.168.1.20:6800/10244 <== mon.0 192.168.1.20:6789/0 4 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 580+0+0 (1904484134 0 0) 0x3691600 con 0x36ac580
2014-08-12 14:35:11.264662 7ff78fe4f700  1 -- 192.168.1.20:6800/10244 --> 192.168.1.20:6789/0 -- mon_subscribe({monmap=0+}) v2 -- ?+0 0x36b4380 con 0x36ac580
2014-08-12 14:35:11.264744 7ff78fe4f700  1 -- 192.168.1.20:6800/10244 --> 192.168.1.20:6789/0 -- auth(proto 2 2 bytes epoch 0) v1 -- ?+0 0x3684480 con 0x36ac580
2014-08-12 14:35:11.265027 7ff78fe4f700  1 -- 192.168.1.20:6800/10244 <== mon.0 192.168.1.20:6789/0 5 ==== mon_map v1 ==== 194+0+0 (4155369063 0 0) 0x36d43c0 con 0x36ac580
2014-08-12 14:35:11.265203 7ff78fe4f700  1 -- 192.168.1.20:6800/10244 <== mon.0 192.168.1.20:6789/0 6 ==== mon_subscribe_ack(300s) v1 ==== 20+0+0 (2253672535 0 0) 0x36b4540 con 0x36ac580
2014-08-12 14:35:11.265251 7ff78fe4f700  1 -- 192.168.1.20:6800/10244 <== mon.0 192.168.1.20:6789/0 7 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 194+0+0 (1999696020 0 0) 0x3691a00 con 0x36ac580
2014-08-12 14:35:11.265506 7ff794bd57c0  1 -- 192.168.1.20:6800/10244 --> 192.168.1.20:6789/0 -- mon_subscribe({monmap=2+,osdmap=0}) v2 -- ?+0 0x36b41c0 con 0x36ac580
2014-08-12 14:35:11.265580 7ff794bd57c0  1 -- 192.168.1.20:6800/10244 --> 192.168.1.20:6789/0 -- mon_subscribe({mdsmap=0+,monmap=2+,osdmap=0}) v2 -- ?+0 0x36b4a80 con 0x36ac580
2014-08-12 14:35:11.266159 7ff78fe4f700  1 -- 192.168.1.20:6800/10244 <== mon.0 192.168.1.20:6789/0 8 ==== osd_map(9687..9687 src has 9090..9687) v3 ==== 6983+0+0 (1578463925 0 0) 0x3684b40 con 0x36ac580
2014-08-12 14:35:11.266453 7ff78fe4f700  1 -- 192.168.1.20:6800/10244 <== mon.0 192.168.1.20:6789/0 9 ==== mon_subscribe_ack(300s) v1 ==== 20+0+0 (2253672535 0 0) 0x36b41c0 con 0x36ac580
2014-08-12 14:35:11.266491 7ff78fe4f700  1 -- 192.168.1.20:6800/10244 <== mon.0 192.168.1.20:6789/0 10 ==== mdsmap(e 7182) v1 ==== 653+0+0 (374906493 0 0) 0x3691800 con 0x36ac580
2014-08-12 14:35:11.266518 7ff794bd57c0 10 mds.-1.0 beacon_send up:boot seq 1 (currently up:boot)
2014-08-12 14:35:11.266585 7ff794bd57c0  1 -- 192.168.1.20:6800/10244 --> 192.168.1.20:6789/0 -- mdsbeacon(12799/MDS1.1 up:boot seq 1 v0) v2 -- ?+0 0x36bc2c0 con 0x36ac580
2014-08-12 14:35:11.266626 7ff794bd57c0 10 mds.-1.0 create_logger
2014-08-12 14:35:11.266677 7ff78fe4f700  5 mds.-1.0 handle_mds_map epoch 7182 from mon.0
2014-08-12 14:35:11.266779 7ff78fe4f700 10 mds.-1.0     my compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data}
2014-08-12 14:35:11.266793 7ff78fe4f700 10 mds.-1.0  mdsmap compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table}
2014-08-12 14:35:11.266803 7ff78fe4f700  0 mds.-1.0 handle_mds_map mdsmap compatset compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table} not writeable with daemon features compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data}, killing myself
2014-08-12 14:35:11.266821 7ff78fe4f700  1 mds.-1.0 suicide.  wanted down:dne, now up:boot
2014-08-12 14:35:11.267081 7ff78fe4f700  1 -- 192.168.1.20:6800/10244 mark_down 0x36ac580 -- 0x36c8500
2014-08-12 14:35:11.267204 7ff78fe4f700  1 -- 192.168.1.20:6800/10244 mark_down_all
2014-08-12 14:35:11.267612 7ff794bd57c0  1 -- 192.168.1.20:6800/10244 shutdown complete.
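
For reference, the mismatch is visible right in those two lines: the mdsmap carries incompat feature 8 ("no anchor table"), which this 0.80.5 daemon doesn't know about, while the daemon offers feature 7 ("mds uses inline data"), which the map lacks. Assuming the stock CLI, the map's compat set can be double-checked with:

# ceph mds dump | grep compat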

Regards,
Hong



On Tuesday, July 22, 2014 4:03 PM, Craig Lewis <clewis@centraldesktop.com> wrote:
 


The osd lost is useful, but not strictly required. It accelerates the recovery once things are stable. It tells Ceph to give up trying to recover data off those disks. Without it, Ceph will still check, then give up when it can't find it.
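
For example (the id is a placeholder, and this is one-way, so only run it once you're sure the disk's data is unrecoverable):

# ceph osd lost 2 --yes-i-really-mean-it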


I was having problems with the suicide timeout at one point. Basically, the OSDs fail and restart so many times that they can't apply all of the map changes before they hit the timeout. Sage gave me some suggestions. Give this a try: https://www.mail-archive.com/ceph-devel@vger.kernel.org/msg18862.html

That process solved the suicide timeouts, with one caveat. When I followed it, I filled up /var/log/ceph/ and the recovery failed. I had to manually run each OSD in debug mode until it completed the map update. Aside from that, I followed the procedure as written.
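
For the debug runs, this is roughly how to run a single OSD by hand with extra logging (standard ceph-osd flags; pick your own id and debug levels, and watch free space in /var/log/ceph/):

# ceph-osd -i 0 -f --debug-osd 10 --debug-ms 1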


I had to run that procedure on all OSDs. I did all of the OSDs on a node at the same time.







On Mon, Jul 21, 2014 at 11:45 PM, hjcho616 <hjcho616@yahoo.com> wrote:

Craig,
>
>
>osd.2 was down and out. lost wasn't working, so I skipped it. =P I formatted the drive XFS and got it mostly working, but I couldn't figure out how to get the journal to point at my SSD (a sketch for that is after the log below), and the init script wasn't able to find osd.2 for some reason. So I just used ceph-deploy. It created a new osd.6 on the disks that were used for osd.2. I removed norecover and nobackfill and let the system rebuild. It seemed like it was doing well until it hit that suicide timeout. What should I do in this case?
>
>
>   -20> 2014-07-22 01:01:26.087707 7f3a90012700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2014-07-22 01:00:56.087703)
>   -19> 2014-07-22 01:01:26.087743 7f3a90012700 10 monclient: renew subs? (now: 2014-07-22 01:01:26.087742; renew after: 2014-07-22 01:01:16.084357) -- yes
>   -18> 2014-07-22 01:01:26.087775 7f3a90012700 10 monclient: renew_subs
>   -17> 2014-07-22 01:01:26.087793 7f3a90012700 10 monclient: _send_mon_message to mon.MDS1 at 192.168.1.20:6789/0
>   -16> 2014-07-22 01:01:26.087822 7f3a90012700  1 -- 192.168.1.30:6800/6297 --> 192.168.1.20:6789/0 -- mon_subscribe({monmap=2+,osd_pg_creates=0}) v2 -- ?+0 0x1442c000 con 0xf73e2c0
>   -15> 2014-07-22 01:01:27.916972 7f3a8c80b700  5 osd.6 3252 heartbeat: osd_stat(66173 MB used, 1797 GB avail, 1862 GB total, peers [3,4,5]/[] op hist [])
>   -14> 2014-07-22 01:01:27.917061 7f3a8c80b700  1 -- 192.168.2.30:0/6297 --> 192.168.2.31:6803/13623 -- osd_ping(ping e3252 stamp 2014-07-22 01:01:27.917024) v2 -- ?+0 0x140201c0 con 0xfa68160
>   -13> 2014-07-22 01:01:27.917131 7f3a8c80b700  1 -- 192.168.2.30:0/6297 --> 192.168.1.31:6804/13623 -- osd_ping(ping e3252 stamp 2014-07-22 01:01:27.917024) v2 -- ?+0 0x17d0b500 con 0xfa68000
>   -12> 2014-07-22 01:01:27.917180 7f3a8c80b700  1 -- 192.168.2.30:0/6297 --> 192.168.2.31:6805/13991 -- osd_ping(ping e3252 stamp 2014-07-22 01:01:27.917024) v2 -- ?+0 0xfa8fdc0 con 0x19208c60
>   -11> 2014-07-22 01:01:27.917229 7f3a8c80b700  1 -- 192.168.2.30:0/6297 --> 192.168.1.31:6807/13991 -- osd_ping(ping e3252 stamp 2014-07-22 01:01:27.917024) v2 -- ?+0 0xffcee00 con 0x205c000
>   -10> 2014-07-22 01:01:27.917276 7f3a8c80b700  1 -- 192.168.2.30:0/6297 --> 192.168.2.31:6801/13249 -- osd_ping(ping e3252 stamp 2014-07-22 01:01:27.917024) v2 -- ?+0 0x224fdc0 con 0xf9f8dc0
>    -9> 2014-07-22 01:01:27.917325 7f3a8c80b700  1 -- 192.168.2.30:0/6297 --> 192.168.1.31:6801/13249 -- osd_ping(ping e3252 stamp 2014-07-22 01:01:27.917024) v2 -- ?+0 0xf8ce000 con 0x19208840
>    -8> 2014-07-22 01:01:27.918723 7f3a9581d700  1 -- 192.168.2.30:0/6297 <== osd.3 192.168.1.31:6804/13623 28 ==== osd_ping(ping_reply e3252 stamp 2014-07-22 01:01:27.917024) v2 ==== 47+0+0 (2181285829 0 0) 0xffcf500 con 0xfa68000
>    -7> 2014-07-22 01:01:27.918830 7f3a9581d700  1 -- 192.168.2.30:0/6297 <== osd.5 192.168.1.31:6801/13249 28 ==== osd_ping(ping_reply e3252 stamp 2014-07-22 01:01:27.917024) v2 ==== 47+0+0 (2181285829 0 0) 0x11a5e700 con 0x19208840
>    -6> 2014-07-22 01:01:27.919218 7f3a9581d700  1 -- 192.168.2.30:0/6297 <== osd.5 192.168.2.31:6801/13249 28 ==== osd_ping(ping_reply e3252 stamp 2014-07-22 01:01:27.917024) v2 ==== 47+0+0 (2181285829 0 0) 0xfa8fa40 con 0xf9f8dc0
>    -5> 2014-07-22 01:01:27.919396 7f3a9581d700  1 -- 192.168.2.30:0/6297 <== osd.3 192.168.2.31:6803/13623 28 ==== osd_ping(ping_reply e3252 stamp 2014-07-22 01:01:27.917024) v2 ==== 47+0+0 (2181285829 0 0) 0x1dd8bc00 con 0xfa68160
>    -4> 2014-07-22 01:01:27.919521 7f3a9581d700  1 -- 192.168.2.30:0/6297 <== osd.4 192.168.2.31:6805/13991 28 ==== osd_ping(ping_reply e3252 stamp 2014-07-22 01:01:27.917024) v2 ==== 47+0+0 (2181285829 0 0) 0x14021a40 con 0x19208c60
>    -3> 2014-07-22 01:01:27.919606 7f3a9581d700  1 -- 192.168.2.30:0/6297 <== osd.4 192.168.1.31:6807/13991 28 ==== osd_ping(ping_reply e3252 stamp 2014-07-22 01:01:27.917024) v2 ==== 47+0+0 (2181285829 0 0) 0x10fdd6c0 con 0x205c000
>    -2> 2014-07-22 01:01:29.976382 7f3aa5c22700  1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7f3a9d0b0700' had timed out after 60
>    -1> 2014-07-22 01:01:29.976416 7f3aa5c22700  1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7f3a9d0b0700' had suicide timed out after 180
>     0> 2014-07-22 01:01:29.985984 7f3aa5c22700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f3aa5c22700 time 2014-07-22 01:01:29.976450
>common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")
>
>
> ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
> 1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2eb) [0xad2cbb]
> 2: (ceph::HeartbeatMap::is_healthy()+0xb6) [0xad34c6]
> 3: (ceph::HeartbeatMap::check_touch_file()+0x28) [0xad3aa8]
> 4: (CephContextServiceThread::entry()+0x13f) [0xb9911f]
> 5: (()+0x8062) [0x7f3aa8f89062]
> 6: (clone()+0x6d) [0x7f3aa78c9a3d]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
>
>--- logging levels ---
>   0/ 5 none
>   0/ 1 lockdep
>   0/ 1 context
>   1/ 1 crush
>   1/ 5 mds
>   1/ 5 mds_balancer
>   1/ 5 mds_locker
>   1/ 5 mds_log
>   1/ 5 mds_log_expire
>   1/ 5 mds_migrator
>   0/ 1 buffer
>   0/ 1 timer
>   0/ 1 filer
>   0/ 1 striper
>   0/ 1 objecter
>   0/ 5 rados
>   0/ 5 rbd
>   0/ 5 journaler
>   0/ 5 objectcacher
>   0/ 5 client
>   0/ 5 osd
>   0/ 5 optracker
>   0/ 5 objclass
>   1/ 3 filestore
>   1/ 3 keyvaluestore
>   1/ 3 journal
>   0/ 5 ms
>   1/ 5 mon
>   0/10 monc
>   1/ 5 paxos
>   0/ 5 tp
>   1/ 5 auth
>   1/ 5 crypto
>   1/ 1 finisher
>   1/ 5 heartbeatmap
>   1/ 5 perfcounter
>   1/ 5 rgw
>   1/ 5 javaclient
>   1/ 5 asok
>   1/ 1 throttle
>  -2/-2 (syslog threshold)
>  -1/-1 (stderr threshold)
>  max_recent     10000
>  max_new         1000
>  log_file /var/log/ceph/ceph-osd.6.log
>--- end dump of recent events ---
>2014-07-22 01:01:30.352843 7f3aa5c22700 -1 *** Caught signal (Aborted) **
> in thread 7f3aa5c22700
>
>
> ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
> 1: /usr/bin/ceph-osd() [0xaac562]
> 2: (()+0xf880) [0x7f3aa8f90880]
> 3: (gsignal()+0x39) [0x7f3aa78193a9]
> 4: (abort()+0x148) [0x7f3aa781c4c8]
> 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f3aa81065e5]
> 6: (()+0x5e746) [0x7f3aa8104746]
> 7: (()+0x5e773) [0x7f3aa8104773]
> 8: (()+0x5e9b2) [0x7f3aa81049b2]
> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x40a) [0xb85b6a]
> 10: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2eb) [0xad2cbb]
> 11: (ceph::HeartbeatMap::is_healthy()+0xb6) [0xad34c6]
> 12: (ceph::HeartbeatMap::check_touch_file()+0x28) [0xad3aa8]
> 13: (CephContextServiceThread::entry()+0x13f) [0xb9911f]
> 14: (()+0x8062) [0x7f3aa8f89062]
> 15: (clone()+0x6d) [0x7f3aa78c9a3d]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
>
>--- begin dump of recent events ---
>     0> 2014-07-22 01:01:30.352843 7f3aa5c22700 -1 *** Caught signal (Aborted) **
> in thread 7f3aa5c22700
>
>
> ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
> 1: /usr/bin/ceph-osd() [0xaac562]
> 2: (()+0xf880) [0x7f3aa8f90880]
> 3: (gsignal()+0x39) [0x7f3aa78193a9]
> 4: (abort()+0x148) [0x7f3aa781c4c8]
> 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f3aa81065e5]
> 6: (()+0x5e746) [0x7f3aa8104746]
> 7: (()+0x5e773) [0x7f3aa8104773]
> 8: (()+0x5e9b2) [0x7f3aa81049b2]
> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x40a) [0xb85b6a]
> 10: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2eb) [0xad2cbb]
> 11: (ceph::HeartbeatMap::is_healthy()+0xb6) [0xad34c6]
> 12: (ceph::HeartbeatMap::check_touch_file()+0x28) [0xad3aa8]
> 13: (CephContextServiceThread::entry()+0x13f) [0xb9911f]
> 14: (()+0x8062) [0x7f3aa8f89062]
> 15: (clone()+0x6d) [0x7f3aa78c9a3d]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
>
>--- logging levels ---
>   0/ 5 none
>   0/ 1 lockdep
>   0/ 1 context
>   1/ 1 crush
>   1/ 5 mds
>   1/ 5 mds_balancer
>   1/ 5 mds_locker
>   1/ 5 mds_log
>   1/ 5 mds_log_expire
>   1/ 5 mds_migrator
>   0/ 1 buffer
>   0/ 1 timer
>   0/ 1 filer
>   0/ 1 striper
>   0/ 1 objecter
>   0/ 5 rados
>   0/ 5 rbd
>   0/ 5 journaler
>   0/ 5 objectcacher
>   0/ 5 client
>   0/ 5 osd
>   0/ 5 optracker
>   0/ 5 objclass
>   1/ 3 filestore
>   1/ 3 keyvaluestore
>   1/ 3 journal
>   0/ 5 ms
>   1/ 5 mon
>   0/10 monc
>   1/ 5 paxos
>   0/ 5 tp
>   1/ 5 auth
>   1/ 5 crypto
>   1/ 1 finisher
>   1/ 5 heartbeatmap
>   1/ 5 perfcounter
>   1/ 5 rgw
>   1/ 5 javaclient
>   1/ 5 asok
>   1/ 1 throttle
>  -2/-2 (syslog threshold)
>  -1/-1 (stderr threshold)
>  max_recent     10000
>  max_new         1000
>  log_file /var/log/ceph/ceph-osd.6.log
>--- end dump of recent events ---
>
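>
>On the journal-to-SSD question above, for the record, the usual Filestore steps would be roughly the following (partition path is a placeholder, and I haven't verified this here, so treat it as a sketch):
>
># /etc/init.d/ceph stop osd.6
># ceph-osd -i 6 --flush-journal
># ln -sf /dev/<ssd-partition> /var/lib/ceph/osd/ceph-6/journal
># ceph-osd -i 6 --mkjournal
># /etc/init.d/ceph start osd.6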
>
>Regards,
>Hong
>
>
>
>
>
>On Monday, July 21, 2014 9:35 PM, Craig Lewis <clewis@centraldesktop.com> wrote:
> 
>
>
>I'd like to get rid of those inconsistent PGs. I think fixing those will get your MDS working again, but I don't actually know anything about MDS. Still, it's best to work your way up from the bottom. If the OSDs aren't stable, there's no use building services on top of them.
>
>
>
>
>It's strange that osd.0 was up, but crashed during deep-scrubbing. You might try disabling deep-scrubs (ceph osd set nodeep-scrub), and see if osd.0 will stay up. If running without deep-scrubbing will get your cluster consistent, you can reformat the disk later.
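>
>The scrub flags are symmetric, so they are easy to undo later once things settle:
>
>ceph osd set nodeep-scrub
>ceph osd unset nodeep-scrub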
>
>
>You said osd.2 fails to start, with a corrupt journal error. There's not much you can do there. You should remove it again, mark it lost, reformat the disk, and re-add it to the cluster.
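>
>Roughly, that remove / mark-lost / reformat / re-add cycle looks like this (the device name is a placeholder; adapt to your init scripts):
>
>/etc/init.d/ceph stop osd.2
>ceph osd out 2
>ceph osd lost 2 --yes-i-really-mean-it
>ceph osd crush rm osd.2
>ceph auth del osd.2
>ceph osd rm 2
>mkfs.xfs -f /dev/sdX       # placeholder device
># then create and start a fresh osd on it, as in the steps further down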
>
>
>
>
>
>I'd rebuild osd.2 first, while leaving osd.0 and osd.1 down.
>
>
>Do you have enough disk space that osd.2 can take all of the data from osd.0 and osd.1? If so, you can mark osd.0 and osd.1 as DOWN and OUT. If not, make sure that osd.0 and osd.1 are marked DOWN and IN.
>
>
>Once osd.2 finishes rebuilding, I'd set noin, then bring osd.0 and osd.1 up. If they're OUT, that will allow Ceph to copy any unique data they might have, but it won't try to write anything to them. If they're IN, well, Ceph will try to write to them. Either way, I'm hoping that they stay up long enough for you to get 100% consistent.
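>
>For reference, those states are all one-liners (standard CLI):
>
>ceph osd down 0
>ceph osd out 0       # or skip this to leave it IN
>ceph osd set noin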
>
>
>
>
>
>
>
>
>
>
>
>On Sun, Jul 20, 2014 at 7:01 PM, hjcho616 <hjcho616@yahoo.com> wrote:
>
>Based on your suggestion here is what I did.
>>
>>
>># ceph osd set nobackfill
>>set nobackfill
>># ceph osd set norecovery
>>Invalid command: norecovery not in pause|noup|nodown|noout|noin|nobackfill|norecover|noscrub|nodeep-scrub|notieragent
>>osd set pause|noup|nodown|noout|noin|nobackfill|norecover|noscrub|nodeep-scrub|notieragent :  set <key>
>>Error EINVAL: invalid command
>># ceph osd set norecover
>>set norecover
>># ceph osd set noin
>>set noin
>># ceph create osd
>>no valid command found; 10 closest matches:
>>osd tier remove <poolname> <poolname>
>>osd tier cache-mode <poolname> none|writeback|forward|readonly
>>osd thrash <int[0-]>
>>osd tier add <poolname> <poolname> {--force-nonempty}
>>osd pool stats {<name>}
>>osd reweight-by-utilization {<int[100-]>}
>>osd pool set <poolname> size|min_size|crash_replay_interval|pg_num|pgp_num|crush_ruleset|hashpspool|hit_set_type|hit_set_period|hit_set_count|hit_set_fpp|debug_fake_ec_pool|target_max_bytes|target_max_objects|cache_target_dirty_ratio|cache_target_full_ratio|cache_min_flush_age|cache_min_evict_age|auid <val> {--yes-i-really-mean-it}
>>osd pool set-quota <poolname> max_objects|max_bytes <val>
>>osd pool rename <poolname> <poolname>
>>osd pool get <poolname> size|min_size|crash_replay_interval|pg_num|pgp_num|crush_ruleset|hit_set_type|hit_set_period|hit_set_count|hit_set_fpp|auid
>>Error EINVAL: invalid command
>># ceph osd create
>>0
>># ceph osd create
>>1
>># ceph osd create
>>2
>># start ceph-osd id=0
>>bash: start: command not found
>># /etc/init.d/ceph start osd.0
>>=== osd.0 ===
>>2014-07-18 21:21:37.207159 7ff2c64d7700  0 librados: osd.0 authentication error (1) Operation not permitted
>>Error connecting to cluster: PermissionError
>>failed: 'timeout 10 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.0 --keyring=/var/lib/ceph/osd/ceph-0/keyring osd crush create-or-move -- 0 1.82 host=OSD1 root=default'
>># ceph status
>>    cluster 9b2c9bca-112e-48b0-86fc-587ef9a52948
>>     health HEALTH_ERR 164 pgs degraded; 38 pgs inconsistent; 192 pgs stuck unclean; recovery 1484224/3513098 objects degraded (42.248%); 1374 scrub errors; mds cluster is degraded; mds MDS1 is laggy; noin,nobackfill,norecover flag(s) set
>>     monmap e1: 1 mons at {MDS1=192.168.1.20:6789/0}, election epoch 1, quorum 0 MDS1
>>     mdsmap e7182: 1/1/1 up {0=MDS1=up:replay(laggy or crashed)}
>>     osdmap e3133: 6 osds: 3 up, 3 in
>>            flags noin,nobackfill,norecover
>>      pgmap v309437: 192 pgs, 3 pools, 1571 GB data, 1715 kobjects
>>            1958 GB used, 3627 GB / 5586 GB avail
>>            1484224/3513098 objects degraded (42.248%)
>>                 131 active+degraded
>>                  23 active+remapped
>>                  33 active+degraded+inconsistent
>>                   5 active+remapped+inconsistent
>># ceph osd stat
>>     osdmap e3133: 6 osds: 3 up, 3 in
>>            flags noin,nobackfill,norecover
>># ceph auth get-or-create osd.0 mon 'allow rwx' osd 'allow *' -o /var/lib/ceph/osd/ceph-0/keyring
>>
>>
>># /etc/init.d/ceph start osd.0
>>=== osd.0 ===
>>create-or-move updating item name 'osd.0' weight 1.82 at location {host=OSD1,root=default} to crush map
>>Starting Ceph osd.0 on OSD1...
>>starting osd.0 at :/0 osd_data /var/lib/ceph/osd/ceph-0 /var/lib/ceph/osd/ceph-0/journal
>>root@OSD1:/home/genie# ceph auth get-or-create osd.1 mon 'allow rwx' osd 'allow *' -o /var/lib/ceph/osd/ceph-1/keyring
>>root@OSD1:/home/genie# ceph auth get-or-create osd.2 mon 'allow rwx' osd 'allow *' -o /var/lib/ceph/osd/ceph-2/keyring
>>root@OSD1:/home/genie# /etc/init.d/ceph start osd.1
>>=== osd.1 ===
>>failed: 'timeout 10 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.1 --keyring=/var/lib/ceph/osd/ceph-1/keyring osd crush create-or-move -- 1 1.82 host=OSD1 root=default'
>># /etc/init.d/ceph start osd.2
>>=== osd.2 ===
>>create-or-move updating item name 'osd.2' weight 1.82 at location {host=OSD1,root=default} to crush map
>>Starting Ceph osd.2 on OSD1...
>>starting osd.2 at :/0 osd_data /var/lib/ceph/osd/ceph-2 /var/lib/ceph/osd/ceph-2/journal
>># /etc/init.d/ceph start osd.1
>>=== osd.1 ===
>>failed: 'timeout 10 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.1 --keyring=/var/lib/ceph/osd/ceph-1/keyring osd crush create-or-move -- 1 1.82 host=OSD1 root=default'
>># ceph health
>>Segmentation fault
>># ceph health
>>Bus error
>># ceph health
>>HEALTH_ERR 164 pgs degraded; 38 pgs inconsistent; 192 pgs stuck unclean; recovery 1484224/3513098 objects degraded (42.248%); 1374 scrub errors; mds cluster is degraded; mds MDS1 is laggy; noin,nobackfill,norecover flag(s) set
>># /etc/init.d/ceph start osd.1
>>=== osd.1 ===
>>create-or-move updating item name 'osd.1' weight 1.82 at location {host=OSD1,root=default} to crush map
>>Starting Ceph osd.1 on OSD1...
>>starting osd.1 at :/0 osd_data /var/lib/ceph/osd/ceph-1 /var/lib/ceph/osd/ceph-1/journal
>># ceph -w
>>    cluster 9b2c9bca-112e-48b0-86fc-587ef9a52948
>>     health HEALTH_ERR 164 pgs degraded; 38 pgs inconsistent; 192 pgs stuck unclean; recovery 1484224/3513098 objects degraded (42.248%); 1374 scrub errors; mds cluster is degraded; mds MDS1 is laggy; noin,nobackfill,norecover flag(s) set
>>     monmap e1: 1 mons at {MDS1=192.168.1.20:6789/0}, election epoch 1, quorum 0 MDS1
>>     mdsmap e7182: 1/1/1 up {0=MDS1=up:replay(laggy or crashed)}
>>     osdmap e3137: 6 osds: 4 up, 3 in
>>            flags noin,nobackfill,norecover
>>      pgmap v309463: 192 pgs, 3 pools, 1571 GB data, 1715 kobjects
>>            1958 GB used, 3627 GB / 5586 GB avail
>>            1484224/3513098 objects degraded (42.248%)
>>                 131 active+degraded
>>                  23 active+remapped
>>                  33 active+degraded+inconsistent
>>                   5 active+remapped+inconsistent
>>
>>
>>2014-07-19 21:34:59.166709 mon.0 [INF] pgmap v309463: 192 pgs: 131 active+degraded, 23 active+remapped, 33 active+degraded+inconsistent, 5 active+remapped+inconsistent; 1571 GB data, 1958 GB used, 3627 GB / 5586 GB avail; 1484224/3513098 objects degraded (42.248%)
>>
>>
>>
>>
>>osd.2 doesn't come up. osd.1 uses little memory compared to osd.0, but it stays alive. I killed osd.1 and osd.2 for now. At this point osd.0's CPU was on and off for a while, but it didn't kill it. So I did ceph osd unset noin and restarted osd.0. It seemed to be doing something for a long time. I let it run overnight. Found it crashed today. Below is the log of it.
>>
>>
>>
>>
>>   -20> 2014-07-20 00:54:10.924602 7fb562528700  5 -- op tracker -- , seq: 4847, time: 2014-07-20 00:54:10.924244, event: header_read, op: osd_sub_op(unknown.0.0:0 1.29 0//0//-1 [scrub-reserve] v 0'0 snapset=0=[]:[] snapc=0=[])
>>   -19> 2014-07-20 00:54:10.924652 7fb562528700  5 -- op tracker -- , seq: 4847, time: 2014-07-20 00:54:10.924250, event: throttled, op: osd_sub_op(unknown.0.0:0 1.29 0//0//-1 [scrub-reserve] v 0'0 snapset=0=[]:[] snapc=0=[])
>>   -18> 2014-07-20 00:54:10.924698 7fb562528700  5 -- op tracker -- , seq: 4847, time: 2014-07-20 00:54:10.924458, event: all_read, op: osd_sub_op(unknown.0.0:0 1.29 0//0//-1 [scrub-reserve] v 0'0 snapset=0=[]:[] snapc=0=[])
>>   -17> 2014-07-20 00:54:10.924743 7fb562528700  5 -- op tracker -- , seq: 4847, time: 0.000000, event: dispatched, op: osd_sub_op(unknown.0.0:0 1.29 0//0//-1 [scrub-reserve] v 0'0 snapset=0=[]:[] snapc=0=[])
>>   -16> 2014-07-20 00:54:10.924880 7fb54d78f700  5 -- op tracker -- , seq: 4847, time: 2014-07-20 00:54:10.924861, event: reached_pg, op: osd_sub_op(unknown.0.0:0 1.29 0//0//-1 [scrub-reserve] v 0'0 snapset=0=[]:[] snapc=0=[])
>>   -15> 2014-07-20 00:54:10.924936 7fb54d78f700  5 -- op tracker -- , seq: 4847, time: 2014-07-20 00:54:10.924915, event: started, op: osd_sub_op(unknown.0.0:0 1.29 0//0//-1 [scrub-reserve] v 0'0 snapset=0=[]:[] snapc=0=[])
>>   -14> 2014-07-20 00:54:10.924974 7fb54d78f700  1 -- 192.168.2.30:6800/18511 --> 192.168.2.31:6804/13991 -- osd_sub_op_reply(unknown.0.0:0 1.29 0//0//-1 [scrub-reserve] ack, result = 0) v2 -- ?+1 0x10503680 con 0xfdca000
>>   -13> 2014-07-20 00:54:10.925053 7fb54d78f700  5 -- op tracker -- , seq: 4847, time: 2014-07-20 00:54:10.925034, event: done, op: osd_sub_op(unknown.0.0:0 1.29 0//0//-1 [scrub-reserve] v 0'0 snapset=0=[]:[] snapc=0=[])
>>   -12> 2014-07-20 00:54:10.926801 7fb562528700  1 -- 192.168.2.30:6800/18511 <== osd.4 192.168.2.31:6804/13991 1742 ==== osd_sub_op(unknown.0.0:0 1.29 0//0//-1 [scrub-unreserve] v 0'0 snapset=0=[]:[] snapc=0=[]) v10 ==== 1145+0+0 (2357365982 0 0) 0x1045a100 con 0xfdca000
>>   -11> 2014-07-20 00:54:10.926912 7fb562528700  5 -- op tracker -- , seq: 4848, time: 2014-07-20 00:54:10.926624, event: header_read, op: osd_sub_op(unknown.0.0:0 1.29 0//0//-1 [scrub-unreserve] v 0'0 snapset=0=[]:[] snapc=0=[])
>>   -10> 2014-07-20 00:54:10.926961 7fb562528700  5 -- op tracker -- , seq: 4848, time: 2014-07-20 00:54:10.926628, event: throttled, op: osd_sub_op(unknown.0.0:0 1.29 0//0//-1 [scrub-unreserve] v 0'0 snapset=0=[]:[] snapc=0=[])
>>    -9> 2014-07-20 00:54:10.927004 7fb562528700  5 -- op tracker -- , seq: 4848, time: 2014-07-20 00:54:10.926786, event: all_read, op: osd_sub_op(unknown.0.0:0 1.29 0//0//-1 [scrub-unreserve] v 0'0 snapset=0=[]:[] snapc=0=[])
>>    -8> 2014-07-20 00:54:10.927046 7fb562528700  5 -- op tracker -- , seq: 4848, time: 0.000000, event: dispatched, op: osd_sub_op(unknown.0.0:0 1.29 0//0//-1 [scrub-unreserve] v 0'0 snapset=0=[]:[] snapc=0=[])
>>    -7> 2014-07-20 00:54:10.927179 7fb54df90700  5 -- op tracker -- , seq: 4848, time: 2014-07-20 00:54:10.927160, event: reached_pg, op: osd_sub_op(unknown.0.0:0 1.29 0//0//-1 [scrub-unreserve] v 0'0 snapset=0=[]:[] snapc=0=[])
>>    -6> 2014-07-20 00:54:10.927237 7fb54df90700  5 -- op tracker -- , seq: 4848, time: 2014-07-20 00:54:10.927216, event: started, op: osd_sub_op(unknown.0.0:0 1.29 0//0//-1 [scrub-unreserve] v 0'0 snapset=0=[]:[] snapc=0=[])
>>    -5> 2014-07-20 00:54:10.927289 7fb54df90700  5 -- op tracker -- , seq: 4848, time: 2014-07-20 00:54:10.927269, event: done, op: osd_sub_op(unknown.0.0:0 1.29 0//0//-1 [scrub-unreserve] v 0'0 snapset=0=[]:[] snapc=0=[])
>>    -4> 2014-07-20 00:54:10.941372 7fb551f98700  1 -- 192.168.2.30:6801/18511 <== osd.3 192.168.1.31:0/13623 776 ==== osd_ping(ping e3144 stamp 2014-07-20 00:54:10.942416) v2 ==== 47+0+0 (216963345 0 0) 0x103c0e00 con 0x1001bce0
>>    -3> 2014-07-20 00:54:10.941451 7fb551f98700  1 -- 192.168.2.30:6801/18511 --> 192.168.1.31:0/13623 -- osd_ping(ping_reply e3144 stamp 2014-07-20 00:54:10.942416) v2 -- ?+0 0x100e2540 con 0x1001bce0
>>    -2> 2014-07-20 00:54:10.941742 7fb55379b700  1 -- 192.168.1.30:6801/18511 <== osd.3 192.168.1.31:0/13623 776 ==== osd_ping(ping e3144 stamp 2014-07-20 00:54:10.942416) v2 ==== 47+0+0 (216963345 0 0) 0x10547880 con 0xff8db80
>>    -1> 2014-07-20 00:54:10.941842 7fb55379b700  1 -- 192.168.1.30:6801/18511 --> 192.168.1.31:0/13623 -- osd_ping(ping_reply e3144 stamp 2014-07-20 00:54:10.942416) v2 -- ?+0 0x10254a80 con 0xff8db80
>>     0> 2014-07-20 00:54:11.646226 7fb54c78d700 -1 os/DBObjectMap.cc: In function 'virtual bool DBObjectMap::DBObjectMapIteratorImpl::valid()' thread 7fb54c78d700 time 2014-07-20 00:54:11.640719
>>os/DBObjectMap.cc: 399: FAILED assert(!valid || cur_iter->valid())
>>
>>
>> ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
>> 1: /usr/bin/ceph-osd() [0xa72172]
>> 2: (ReplicatedBackend::be_deep_scrub(hobject_t const&, ScrubMap::object&, ThreadPool::TPHandle&)+0x6c3) [0xa2df03]
>> 3: (PGBackend::be_scan_list(ScrubMap&, std::vector<hobject_t, std::allocator<hobject_t> > const&, bool, ThreadPool::TPHandle&)+0x503) [0x98c523]
>> 4: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool, ThreadPool::TPHandle&)+0x10b) [0x891d4b]
>> 5: (PG::replica_scrub(MOSDRepScrub*, ThreadPool::TPHandle&)+0x456) [0x8925d6]
>> 6: (OSD::RepScrubWQ::_process(MOSDRepScrub*, ThreadPool::TPHandle&)+0x10a) [0x7b00fa]
>> 7: (ThreadPool::worker(ThreadPool::WorkThread*)+0x68a) [0xb7792a]
>> 8: (ThreadPool::WorkThread::entry()+0x10) [0xb78b80]
>> 9: (()+0x8062) [0x7fb56b184062]
>> 10: (clone()+0x6d) [0x7fb569ac4a3d]
>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>
>>
>>--- logging levels ---
>>   0/ 5 none
>>   0/ 1 lockdep
>>   0/ 1 context
>>   1/ 1 crush
>>   1/ 5 mds
>>   1/ 5 mds_balancer
>>   1/ 5 mds_locker
>>   1/ 5 mds_log
>>   1/ 5 mds_log_expire
>>   1/ 5 mds_migrator
>>   0/ 1 buffer
>>   0/ 1 timer
>>   0/ 1 filer
>>   0/ 1 striper
>>   0/ 1 objecter
>>   0/ 5 rados
>>   0/ 5 rbd
>>   0/ 5 journaler
>>   0/ 5 objectcacher
>>   0/ 5 client
>>   0/ 5 osd
>>   0/ 5 optracker
>>   0/ 5 objclass
>>   1/ 3 filestore
>>   1/ 3 keyvaluestore
>>   1/ 3 journal
>>   0/ 5 ms
>>   1/ 5 mon
>>   0/10 monc
>>   1/ 5 paxos
>>   0/ 5 tp
>>   1/ 5 auth
>>   1/ 5 crypto
>>   1/ 1 finisher
>>   1/ 5 heartbeatmap
>>   1/ 5 perfcounter
>>   1/ 5 rgw
>>   1/ 5 javaclient
>>   1/ 5 asok
>>   1/ 1 throttle
>>  -2/-2 (syslog threshold)
>>  -1/-1 (stderr threshold)
>>  max_recent     10000
>>  max_new         1000
>>  log_file /var/log/ceph/ceph-osd.0.log
>>--- end dump of recent events ---
>>2014-07-20 00:54:11.998700 7fb54c78d700 -1 *** Caught signal (Aborted) **
>> in thread 7fb54c78d700
>>
>>
>> ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
>> 1: /usr/bin/ceph-osd() [0xaac562]
>> 2: (()+0xf880) [0x7fb56b18b880]
>> 3: (gsignal()+0x39) [0x7fb569a143a9]
>> 4: (abort()+0x148) [0x7fb569a174c8]
>> 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fb56a3015e5]
>> 6: (()+0x5e746) [0x7fb56a2ff746]
>> 7: (()+0x5e773) [0x7fb56a2ff773]
>> 8: (()+0x5e9b2) [0x7fb56a2ff9b2]
>> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x40a) [0xb85b6a]
>> 10: /usr/bin/ceph-osd() [0xa72172]
>>
>>
>> 11: (ReplicatedBackend::be_deep_scrub(hobject_t const&, ScrubMap::object&, ThreadPool::TPHandle&)+0x6c3) [0xa2df03]
>> 12: (PGBackend::be_scan_list(ScrubMap&, std::vector<hobject_t, std::allocator<hobject_t> > const&, bool, ThreadPool::TPHandle&)+0x503) [0x98c523]
>> 13: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool, ThreadPool::TPHandle&)+0x10b) [0x891d4b]
>> 14: (PG::replica_scrub(MOSDRepScrub*, ThreadPool::TPHandle&)+0x456) [0x8925d6]
>> 15: (OSD::RepScrubWQ::_process(MOSDRepScrub*, ThreadPool::TPHandle&)+0x10a) [0x7b00fa]
>> 16: (ThreadPool::worker(ThreadPool::WorkThread*)+0x68a) [0xb7792a]
>> 17: (ThreadPool::WorkThread::entry()+0x10) [0xb78b80]
>> 18: (()+0x8062) [0x7fb56b184062]
>> 19: (clone()+0x6d) [0x7fb569ac4a3d]
>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>
>>
>>--- begin dump of recent events ---
>>    -1> 2014-07-20 00:54:11.755763 7fb565618700  5 osd.0 3144 tick
>>     0> 2014-07-20 00:54:11.998700 7fb54c78d700 -1 *** Caught signal (Aborted) **
>> in thread 7fb54c78d700
>>
>>
>> ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
>> 1: /usr/bin/ceph-osd() [0xaac562]
>> 2: (()+0xf880) [0x7fb56b18b880]
>> 3: (gsignal()+0x39) [0x7fb569a143a9]
>> 4: (abort()+0x148) [0x7fb569a174c8]
>> 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fb56a3015e5]
>> 6: (()+0x5e746) [0x7fb56a2ff746]
>> 7: (()+0x5e773) [0x7fb56a2ff773]
>> 8: (()+0x5e9b2) [0x7fb56a2ff9b2]
>> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x40a) [0xb85b6a]
>> 10: /usr/bin/ceph-osd() [0xa72172]
>> 11: (ReplicatedBackend::be_deep_scrub(hobject_t const&, ScrubMap::object&, ThreadPool::TPHandle&)+0x6c3) [0xa2df03]
>> 12: (PGBackend::be_scan_list(ScrubMap&, std::vector<hobject_t, std::allocator<hobject_t> > const&, bool, ThreadPool::TPHandle&)+0x503) [0x98c523]
>> 13: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool, ThreadPool::TPHandle&)+0x10b) [0x891d4b]
>> 14: (PG::replica_scrub(MOSDRepScrub*, ThreadPool::TPHandle&)+0x456) [0x8925d6]
>> 15: (OSD::RepScrubWQ::_process(MOSDRepScrub*, ThreadPool::TPHandle&)+0x10a) [0x7b00fa]
>> 16: (ThreadPool::worker(ThreadPool::WorkThread*)+0x68a) [0xb7792a]
>> 17: (ThreadPool::WorkThread::entry()+0x10) [0xb78b80]
>> 18: (()+0x8062) [0x7fb56b184062]
>> 19: (clone()+0x6d) [0x7fb569ac4a3d]
>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>
>>
>>--- logging levels ---
>>   0/ 5 none
>>   0/ 1 lockdep
>>   0/ 1 context
>>   1/ 1 crush
>>   1/ 5 mds
>>   1/ 5 mds_balancer
>>   1/ 5 mds_locker
>>   1/ 5 mds_log
>>   1/ 5 mds_log_expire
>>   1/ 5 mds_migrator
>>   0/ 1 buffer
>>   0/ 1 timer
>>   0/ 1 filer
>>   0/ 1 striper
>>   0/ 1 objecter
>>   0/ 5 rados
>>   0/ 5 rbd
>>   0/ 5 journaler
>>   0/ 5 objectcacher
>>   0/ 5 client
>>   0/ 5 osd
>>   0/ 5 optracker
>>   0/ 5 objclass
>>   1/ 3 filestore
>>   1/ 3 keyvaluestore
>>   1/ 3 journal
>>   0/ 5 ms
>>   1/ 5 mon
>>   0/10 monc
>>   1/ 5 paxos
>>   0/ 5 tp
>>   1/ 5 auth
>>   1/ 5 crypto
>>   1/ 1 finisher
>>   1/ 5 heartbeatmap
>>   1/ 5 perfcounter
>>   1/ 5 rgw
>>   1/ 5 javaclient
>>   1/ 5 asok
>>   1/ 1 throttle
>>  -2/-2 (syslog threshold)
>>  -1/-1 (stderr threshold)
>>  max_recent     10000
>>  max_new         1000
>>  log_file /var/log/ceph/ceph-osd.0.log
>>--- end dump of recent events ---
>>
>>
>>
>>
>>What can I do about this one?
>>
>>
>>Regards,
>>Hong
>>
>>
>>
>>
>>
>>
>>
>>
>>On Friday, July 18, 2014 5:16 PM, Craig Lewis <clewis@centraldesktop.com> wrote:
>> 
>>
>>
>>That I can't help you with. I'm a pure RadosGW user. But OSD stability affects everybody. :-P
>>
>>
>>
>>On Fri, Jul 18, 2014 at 2:34 PM, hjcho616 <hjcho616@yahoo.com> wrote:
>>
>>Thanks Craig. I will try this soon. BTW, should I upgrade to 0.80.4 first? The MDS journal issue seems to be one of the issues I am running into.
>>>
>>>
>>>Regards,
>>>Hong
>>>
>>>
>>>
>>>On Friday, July 18, 2014 4:14 PM, Craig Lewis <clewis@centraldesktop.com> wrote:
>>> 
>>>
>>>
>>>If osd.3, osd.4, and osd.5 are stable, your cluster should be working again. What does ceph status say?
>>>
>>>
>>>
>>>
>>>I was able to re-add removed osd.
>>>Here's what I did on my dev cluster:
>>>stop ceph-osd id=0
>>>ceph osd down 0
>>>
>>>ceph osd out 0
>>>
>>>ceph osd rm 0
>>>ceph osd crush rm osd.0
>>>
>>>
>>>
>>>Now my osd tree and osd dump do not show osd.0. The cluster was degraded, but did not do any backfilling, because I require 3x replication on 3 different hosts and Ceph can't satisfy that with 2 osds.
>>>
>>>
>>>On the same host, I ran:
>>>ceph osd create        # Returned ID 0
>>>start ceph-osd id=0
>>>
>>>
>>>
>>>
>>>osd.0 started up and joined the cluster. Once peering completed, all of the PGs recovered quickly. I didn't have any writes on the cluster while I was doing this.
>>>
>>>
>>>So it looks like you can just re-create and start those deleted osds.
>>>
>>>
>>>
>>>
>>>
>>>
>>>In your situation, I would do the following. Before you start, go through this and make sure you understand all the steps. Worst case, you can always undo it by removing the osds again, and you'll be back to where you are now.
>>>
>>>
>>>ceph osd set nobackfill
>>>ceph osd set norecovery
>>>ceph osd set noin
>>>ceph create osd   # Should return 0.  Abort if it doesn't.
>>>ceph create osd   # Should return 1.  Abort if it doesn't.
>>>ceph create osd   # Should return 2.  Abort if it doesn't.
>>>start ceph-osd id=0
>>>
>>>
>>>
>>>Watch ceph -w and top. Hopefully ceph-osd id=0 will use some CPU, then go UP, and drop to 0% CPU. If so,
>>>ceph osd unset noin
>>>restart ceph-osd id=0
>>>
>>>
>>>Now osd.0 should go UP and IN, use some CPU for a while, then drop to 0% CPU. If osd.0 drops out now, set noout, and shut it down.
>>>
>>>
>>>Set noin again, and start osd.1. When it's stable, do it again for osd.2.
>>>
>>>
>>>Once as many as possible are up and stable:
>>>ceph osd unset nobackfill
>>>ceph osd unset norecovery
>>>
>>>
>>>Now it should start recovering. If your osds start dropping out now, set noout, and shut down the ones that are having problems.
>>>
>>>
>>>
>>>
>>>The goal is to get all the stable osds up, in, and recovered. Once that's done, we can figure out what to do with the unstable osds.
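>>>
>>>A couple of read-only checks that are handy while watching that (standard CLI, nothing destructive):
>>>
>>>ceph health detail
>>>ceph pg dump_stuck unclean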
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>On Thu, Jul 17, 2014 at 9:29 PM, hjcho616 <hjcho616@yahoo.com> wrote:
>>>
>>>Sorry Craig. I thought I sent both, but the second part didn't copy right. For some reason, overnight the MDS and MON decided to stop, so I started them when I was running those commands. Interestingly, the MDS didn't fail at the time like it used to, so I thought something was being fixed? Then I now realize the MDS probably couldn't get to the data because the OSDs were down. Now that I brought up the OSDs, the MDS crashed again. =P
>>>>
>>>>
>>>>$ ceph osd tree
>>>># id    weight  type name       up/down reweight
>>>>-1      5.46    root default
>>>>-2      0               host OSD1
>>>>-3      5.46            host OSD2
>>>>3       1.82                    osd.3   up      1
>>>>4       1.82                    osd.4   up      1
>>>>5       1.82                    osd.5   up      1
>>>>
>>>>
>>>>$ ceph osd dump
>>>>epoch 3125
>>>>fsid 9b2c9bca-112e-48b0-86fc-587ef9a52948
>>>>created 2014-02-08 01:57:34.086532
>>>>modified 2014-07-17 23:24:10.823596
>>>>flags
>>>>pool 0 'data' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 crash_replay_interval 45 stripe_width 0
>>>>pool 1 'metadata' replicated size 2 min_size 1 crush_ruleset 1 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 stripe_width 0
>>>>pool 2 'rbd' replicated size 2 min_size 1 crush_ruleset 2 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 stripe_width 0
>>>>max_osd 6
>>>>osd.3 up   in  weight 1 up_from 3120 up_thru 3122 down_at 3116 last_clean_interval [2858,3113) 192.168.1.31:6803/13623 192.168.2.31:6802/13623 192.168.2.31:6803/13623 192.168.1.31:6804/13623 exists,up 4f86a418-6c67-4cb4-83a1-6c123c890036
>>>>osd.4 up   in  weight 1 up_from 3121 up_thru 3122 down_at 3116 last_clean_interval [2859,3113) 192.168.1.31:6806/13991 192.168.2.31:6804/13991 192.168.2.31:6805/13991 192.168.1.31:6807/13991 exists,up 3d5e3843-7a47-44b0-b276-61c4b1d62900
>>>>osd.5 up   in  weight 1 up_from 3118 up_thru 3118 down_at 3116 last_clean_interval [2856,3113) 192.168.1.31:6800/13249 192.168.2.31:6800/13249 192.168.2.31:6801/13249 192.168.1.31:6801/13249 exists,up eec86483-2f35-48a4-a154-2eaf26be06b9
>>>>pg_temp 0.2 [4,3]
>>>>pg_temp 0.a [4,5]
>>>>pg_temp 0.c [3,4]
>>>>pg_temp 0.10 [3,4]
>>>>pg_temp 0.15 [3,5]
>>>>pg_temp 0.17 [3,5]
>>>>pg_temp 0.2f [4,5]
>>>>pg_temp 0.3b [4,3]
>>>>pg_temp 0.3c [3,5]
>>>>pg_temp 0.3d [4,5]
>>>>pg_temp 1.1 [4,3]
>>>>pg_temp 1.9 [4,5]
>>>>pg_temp 1.b [3,4]
>>>>pg_temp 1.14 [3,5]
>>>>pg_temp 1.16 [3,5]
>>>>pg_temp 1.2e [4,5]
>>>>pg_temp 1.3a [4,3]
>>>>pg_temp 1.3b [3,5]
>>>>pg_temp 1.3c [4,5]
>>>>pg_temp 2.0 [4,3]
>>>>pg_temp 2.8 [4,5]
>>>>pg_temp 2.a [3,4]
>>>>pg_temp 2.13 [3,5]
>>>>pg_temp 2.15 [3,5]
>>>>pg_temp 2.2d [4,5]
>>>>pg_temp 2.39 [4,3]
>>>>pg_temp 2.3a [3,5]
>>>>pg_temp 2.3b [4,5]
>>>>blacklist 192.168.1.20:6802/30894 expires 2014-07-17 23:48:10.823576
>>>>blacklist 192.168.1.20:6801/30651 expires 2014-07-17 23:47:55.562984
>>>>
>>>>
>>>>Regards,
>>>>Hong
>>>>
>>>>
>>>>
>>>>On Thursday, July 17, 2014 3:30 PM, Craig Lewis <clewis@centraldesktop.com> wrote:
>>>> 
>>>>
>>>>
>>>>You gave me 'ceph osd dump' twice.... can I see 'ceph osd tree' too?
>>>>
>>>>
>>>>Why are osd.3, osd.4, and osd.5 down?
>>>>
>>>>
>>>>
>>>>On Thu, Jul 17, 2014 at 11:45 AM, hjcho616 <hjcho616@yahoo.com> wrote:
>>>>
>>>>>Thank you for looking at this. Below are the outputs you requested.
>>>>>
>>>>>
>>>>># ceph osd dump
>>>>>epoch 3117
>>>>>fsid 9b2c9bca-112e-48b0-86fc-587ef9a52948
>>>>>created 2014-02-08 01:57:34.086532
>>>>>modified 2014-07-16 22:13:04.385914
>>>>>flags
>>>>>pool 0 'data' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 crash_replay_interval 45 stripe_width 0
>>>>>pool 1 'metadata' replicated size 2 min_size 1 crush_ruleset 1 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 stripe_width 0
>>>>>pool 2 'rbd' replicated size 2 min_size 1 crush_ruleset 2 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 stripe_width 0
>>>>>max_osd 6
>>>>>osd.3 down in  weight 1 up_from 2858 up_thru 3040 down_at 3116 last_clean_interval [2830,2851) 192.168.1.31:6803/5127 192.168.2.31:6805/5127 192.168.2.31:6806/5127 192.168.1.31:6805/5127 exists 4f86a418-6c67-4cb4-83a1-6c123c890036
>>>>>osd.4 down in  weight 1 up_from 2859 up_thru 3043 down_at 3116 last_clean_interval [2835,2849) 192.168.1.31:6807/5310 192.168.2.31:6807/5310 192.168.2.31:6808/5310 192.168.1.31:6808/5310 exists 3d5e3843-7a47-44b0-b276-61c4b1d62900
>>>>>osd.5 down in  weight 1 up_from 2856 up_thru 3042 down_at 3116 last_clean_interval [2837,2853) 192.168.1.31:6800/4969 192.168.2.31:6801/4969 192.168.2.31:6804/4969 192.168.1.31:6801/4969 exists eec86483-2f35-48a4-a154-2eaf26be06b9
>>>>>
>>>>>
>>>>># ceph osd dump
>>>>>epoch 3117
>>>>>fsid 9b2c9bca-112e-48b0-86fc-587ef9a52948
>>>>>created 2014-02-08 01:57:34.086532
>>>>>modified 2014-07-16 22:13:04.385914
>>>>>flags
>>>>>pool 0 'data' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 crash_replay_interval 45 stripe_width 0
>>>>>pool 1 'metadata' replicated size 2 min_size 1 crush_ruleset 1 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 stripe_width 0
>>>>>pool 2 'rbd' replicated size 2 min_size 1 crush_ruleset 2 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 stripe_width 0
>>>>>max_osd 6
>>>>>osd.3 down in  weight 1 up_from 2858 up_thru 3040 down_at 3116 last_clean_interval [2830,2851) 192.168.1.31:6803/5127 192.168.2.31:6805/5127 192.168.2.31:6806/5127 192.168.1.31:6805/5127 exists 4f86a418-6c67-4cb4-83a1-6c123c890036
>>>>>osd.4 down in  weight 1 up_from 2859 up_thru 3043 down_at 3116 last_clean_interval [2835,2849) 192.168.1.31:6807/5310 192.168.2.31:6807/5310 192.168.2.31:6808/5310 192.168.1.31:6808/5310 exists 3d5e3843-7a47-44b0-b276-61c4b1d62900
>>>>>osd.5 down in  weight 1 up_from 2856 up_thru 3042 down_at 3116 last_clean_interval [2837,2853) 192.168.1.31:6800/4969 192.168.2.31:6801/4969 192.168.2.31:6804/4969 192.168.1.31:6801/4969 exists eec86483-2f35-48a4-a154-2eaf26be06b9
>>>>>
>>>>>
>>>>>Regards,
>>>>>Hong
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>On Thursday, July 17, 2014 12:02 PM, Craig Lewis <clewis@centraldesktop.com> wrote:
>>>>> 
>>>>>
>>>>>
>>>>>I don't believe you can re-add an OSD after `ceph osd rm`, but it's worth a shot. Let me see what I can do on my dev cluster.
>>>>>
>>>>>
>>>>>What does `ceph osd dump` and `ceph osd tree` say? I want to make sure I'm starting from the same point you are.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>On Wed, Jul 16, 2014 at 7:39 PM, hjcho616 <hjcho616@yahoo.com> wrote:
>>>>>
>>>>>>I did a "ceph osd rm" for all three, but I didn't do anything else to them afterwards. Can they be added back?
>>>>>>
>>>>>>
>>>>>>Regards,
>>>>>>Hong
>>>>>>
>>>>>>
>>>>>>
>>>>>>On Wednesday, July 16, 2014 6:54 PM, Craig Lewis <clewis@centraldesktop.com> wrote:
>>>>>> 
>>>>>>
>>>>>>
>>>>>>For some reason you ended up in my spam folder. That might be why you didn't get any responses.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>Have you destroyed osd.0, osd.1, and osd.2? If not, try bringing them up one at a time. You might have just one bad disk, which is much better than 50% of your disks.
>>>>>>
>>>>>>
>>>>>>
>>>>>>How is the ceph-osd process behaving when it hits the suicide timeout? I had some problems a while back where the ceph-osd process would start up, consume ~200% CPU for a while, then get stuck using almost exactly 100% CPU. It would get kicked out of the cluster for being unresponsive, then suicide. Repeat. If that's happening here, I can suggest some things to try.
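>>>>>>
>>>>>>An easy way to watch for that CPU pattern, nothing Ceph-specific:
>>>>>>
>>>>>>top -p $(pgrep -d, ceph-osd)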
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>On Fri, Jul 11, 2014 at 9:12 PM, hjcho616 <hjcho616@yahoo.com> wrote:
>>>>>>
>>>>>>>I have 2 OSD machines with 3 OSDs running on each, and one MDS server with 3 daemons running. Ran cephfs mostly on 0.78. One night we lost power for a split second. MDS1 and OSD2 went down; OSD1 seemed OK. Well, it turns out OSD1 suffered most. Those two machines rebooted and seemed OK, except there were some inconsistencies. I waited for a while; it didn't fix itself. So I issued 'ceph pg repair pgnum'. It would try some, and some OSD would crash. Tried this for multiple days. Got some PGs fixed... but mostly it would crash an OSD and stop recovering. dmesg shows something like below.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>[ ?740.059498] traps: ceph-osd[5279] general protection ip:7f84e75ec75e sp:7fff00045bc0 error:0 in libtcmalloc.so.4.1.0[7f84e75b3000+4a000]
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>and ceph osd log shows something like this.
>>>>>>>
>>>>>>>
>>>>>>>    -2> 2014-07-09 20:51:01.163571 7fe0f4617700  1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7fe0e8e91700' had timed out after 60
>>>>>>>    -1> 2014-07-09 20:51:01.163609 7fe0f4617700  1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7fe0e8e91700' had suicide timed out after 180
>>>>>>>     0> 2014-07-09 20:51:01.169542 7fe0f4617700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7fe0f4617700 time 2014-07-09 20:51:01.163642
>>>>>>>common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")
>>>>>>>
>>>>>>>
>>>>>>> ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
>>>>>>> 1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2eb) [0xad2cbb]
>>>>>>> 2: (ceph::HeartbeatMap::is_healthy()+0xb6) [0xad34c6]
>>>>>>> 3: (ceph::HeartbeatMap::check_touch_file()+0x28) [0xad3aa8]
>>>>>>> 4: (CephContextServiceThread::entry()+0x13f) [0xb9911f]
>>>>>>> 5: (()+0x8062) [0x7fe0f797e062]
>>>>>>> 6: (clone()+0x6d) [0x7fe0f62bea3d]
>>>>>>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>>>>>>
>>>>>>>
>>>>>>>--- logging levels ---
>>>>>>>   0/ 5 none
>>>>>>>   0/ 1 lockdep
>>>>>>>   0/ 1 context
>>>>>>>   1/ 1 crush
>>>>>>>   1/ 5 mds
>>>>>>>   1/ 5 mds_balancer
>>>>>>>   1/ 5 mds_locker
>>>>>>>   1/ 5 mds_log
>>>>>>>   1/ 5 mds_log_expire
>>>>>>>   1/ 5 mds_migrator
>>>>>>>   0/ 1 buffer
>>>>>>>   0/ 1 timer
>>>>>>>   0/ 1 filer
>>>>>>>   0/ 1 striper
>>>>>>>   0/ 1 objecter
>>>>>>>   0/ 5 rados
>>>>>>>   0/ 5 rbd
>>>>>>>   0/ 5 journaler
>>>>>>>   0/ 5 objectcacher
>>>>>>>   0/ 5 client
>>>>>>>   0/ 5 osd
>>>>>>>   0/ 5 optracker
>>>>>>>   0/ 5 objclass
>>>>>>>   1/ 3 filestore
>>>>>>>   1/ 3 keyvaluestore
>>>>>>>   1/ 3 journal
>>>>>>>   0/ 5 ms
>>>>>>>   1/ 5 mon
>>>>>>>   0/10 monc
>>>>>>>   1/ 5 paxos
>>>>>>>   0/ 5 tp
>>>>>>>   1/ 5 auth
>>>>>>>   1/ 5 crypto
>>>>>>>   1/ 1 finisher
>>>>>>>   1/ 5 heartbeatmap
>>>>>>>   1/ 5 perfcounter
>>>>>>>   1/ 5 rgw
>>>>>>>   1/ 5 javaclient
>>>>>>>   1/ 5 asok
>>>>>>>   1/ 1 throttle
>>>>>>>  -2/-2 (syslog threshold)
>>>>>>>  -1/-1 (stderr threshold)
>>>>>>>  max_recent     10000
>>>>>>>  max_new         1000
>>>>>>>  log_file /var/log/ceph/ceph-osd.0.log
>>>>>>>--- end dump of recent events ---
>>>>>>>2014-07-09 20:51:01.534706 7fe0f4617700 -1 *** Caught signal (Aborted) **
>>>>>>> in thread 7fe0f4617700
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
>>>>>>> 1: /usr/bin/ceph-osd() [0xaac562]
>>>>>>> 2: (()+0xf880) [0x7fe0f7985880]
>>>>>>> 3: (gsignal()+0x39) [0x7fe0f620e3a9]
>>>>>>> 4: (abort()+0x148) [0x7fe0f62114c8]
>>>>>>> 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fe0f6afb5e5]
>>>>>>> 6: (()+0x5e746) [0x7fe0f6af9746]
>>>>>>> 7: (()+0x5e773) [0x7fe0f6af9773]
>>>>>>> 8: (()+0x5e9b2) [0x7fe0f6af99b2]
>>>>>>> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x40a) [0xb85b6a]
>>>>>>> 10: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2eb) [0xad2cbb]
>>>>>>> 11: (ceph::HeartbeatMap::is_healthy()+0xb6) [0xad34c6]
>>>>>>> 12: (ceph::HeartbeatMap::check_touch_file()+0x28) [0xad3aa8]
>>>>>>> 13: (CephContextServiceThread::entry()+0x13f) [0xb9911f]
>>>>>>> 14: (()+0x8062) [0x7fe0f797e062]
>>>>>>> 15: (clone()+0x6d) [0x7fe0f62bea3d]
>>>>>>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>>>>>>
>>>>>>>
>>>>>>>--- begin dump of recent events ---
>>>>>>>     0> 2014-07-09 20:51:01.534706 7fe0f4617700 -1 *** Caught signal (Aborted) **
>>>>>>> in thread 7fe0f4617700
>>>>>>>
>>>>>>>
>>>>>>> ceph version 0.82 (14085f42ddd0fef4e7e1dc99402d07a8df82c04e)
>>>>>>> 1: /usr/bin/ceph-osd() [0xaac562]
>>>>>>> 2: (()+0xf880) [0x7fe0f7985880]
>>>>>>> 3: (gsignal()+0x39) [0x7fe0f620e3a9]
>>>>>>> 4: (abort()+0x148) [0x7fe0f62114c8]
>>>>>>> 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fe0f6afb5e5]
>>>>>>> 6: (()+0x5e746) [0x7fe0f6af9746]
>>>>>>> 7: (()+0x5e773) [0x7fe0f6af9773]
>>>>>>> 8: (()+0x5e9b2) [0x7fe0f6af99b2]
>>>>>>> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x40a) [0xb85b6a]
>>>>>>> 10: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x2eb) [0xad2cbb]
>>>>>>> 11: (ceph::HeartbeatMap::is_healthy()+0xb6) [0xad34c6]
>>>>>>> 12: (ceph::HeartbeatMap::check_touch_file()+0x28) [0xad3aa8]...
>>
>>[Message clipped]
>
>
>

