The cluster is not aware that some OSDs have disappeared

Dear All:

My environment: two servers, with 12 hard disks on each server.
                 Version: Ceph 0.48, Kernel: 3.2.0-27

We created a Ceph cluster with 24 OSDs and 3 monitors:
osd.0 ~ osd.11 are on server1
osd.12 ~ osd.23 are on server2
mon.0 is on server1
mon.1 is on server2
mon.2 is on server3, which has no OSDs
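
For reference, this layout in ceph.conf form would be roughly as follows (only a few sections shown; hostnames and the exact entries are illustrative, not our actual config):

[mon.006]
        host = server1
        mon addr = 192.168.200.84:6789
[mon.008]
        host = server2
        mon addr = 192.168.200.86:6789
[mon.009]
        host = server3
        mon addr = 192.168.200.87:6789
[osd.0]
        host = server1
[osd.11]
        host = server1
[osd.12]
        host = server2
[osd.23]
        host = server2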

When I cut off the network of server1, we expected server2 to notice that the 12 OSDs on server1 had disappeared.
However, when I run ceph -s, it still shows all 24 OSDs as up.

From the logs, we can see heartbeat_check failures on the OSDs on server1 (e.g. osd.0), but not on the OSDs on server2 (e.g. osd.12).
What happened to server2? Can we restart the heartbeat service? Thanks!
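
While waiting for advice: as I understand it, the monitors only mark an OSD down after its peers report missing heartbeats (governed by settings such as osd heartbeat grace, if I have the option name right), so I plan to check the state by hand with something like the following (the OSD id in the last command is just an example):

root@wistor-002:~# ceph osd tree                         # which OSDs the cluster currently believes are up
root@wistor-002:~# ceph osd dump | grep 192.168.200.81   # addresses and up/down state of server1's OSDs
root@wistor-002:~# ceph osd down 0                       # manually mark osd.0 down if the cluster never notices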

root@wistor-002:~# ceph -s
   health HEALTH_WARN 1 mons down, quorum 1,2 008,009
   monmap e1: 3 mons at {006=192.168.200.84:6789/0,008=192.168.200.86:6789/0,009=192.168.200.87:6789/0}, election epoch 522, quorum 1,2 008,009
   osdmap e1388: 24 osds: 24 up, 24 in
    pgmap v288663: 4608 pgs: 4608 active+clean; 257 GB data, 988 GB used, 20214 GB / 22320 GB avail
   mdsmap e1: 0/0/1 up

Log of ceph -w (we turned off server1 around 15:20, which triggered the new monitor election):
2012-07-31 15:21:25.966572 mon.0 [INF] pgmap v288658: 4608 pgs: 4608 active+clean; 257 GB data, 988 GB used, 20214 GB / 22320 GB avail
2012-07-31 15:20:10.400566 mon.1 [INF] mon.008 calling new monitor election
2012-07-31 15:21:36.030473 mon.1 [INF] mon.008 calling new monitor election
2012-07-31 15:21:36.079772 mon.2 [INF] mon.009 calling new monitor election
2012-07-31 15:21:46.102587 mon.1 [INF] mon.008@1 won leader election with quorum 1,2
2012-07-31 15:21:46.273253 mon.1 [INF] pgmap v288659: 4608 pgs: 4608 active+clean; 257 GB data, 988 GB used, 20214 GB / 22320 GB avail
2012-07-31 15:21:46.273379 mon.1 [INF] mdsmap e1: 0/0/1 up
2012-07-31 15:21:46.273495 mon.1 [INF] osdmap e1388: 24 osds: 24 up, 24 in
2012-07-31 15:21:46.273814 mon.1 [INF] monmap e1: 3 mons at {006=192.168.200.84:6789/0,008=192.168.200.86:6789/0,009=192.168.200.87:6789/0}
2012-07-31 15:21:46.587679 mon.1 [INF] pgmap v288660: 4608 pgs: 4608 active+clean; 257 GB data, 988 GB used, 20214 GB / 22320 GB avail
2012-07-31 15:22:01.245813 mon.1 [INF] pgmap v288661: 4608 pgs: 4608 active+clean; 257 GB data, 988 GB used, 20214 GB / 22320 GB avail
2012-07-31 15:22:33.970838 mon.1 [INF] pgmap v288662: 4608 pgs: 4608 active+clean; 257 GB data, 988 GB used, 20214 GB / 22320 GB avail

Log of osd.0 (on server1):
2012-07-31 15:20:25.309264 7fdc06470700  0 -- 192.168.200.81:6825/12162 >> 192.168.200.82:6840/8772 pipe(0x4dbea00 sd=52 pgs=0 cs=0 l=0).accept connect_seq 0 vs existing 0 state 1
2012-07-31 15:20:25.310887 7fdc1c551700  0 -- 192.168.200.81:6825/12162 >> 192.168.200.82:6833/15570 pipe(0x4dbec80 sd=51 pgs=0 cs=0 l=0).accept connect_seq 0 vs existing 0 state 1
2012-07-31 15:21:46.861458 7fdc14e9d700 -1 osd.0 1388 heartbeat_check: no reply from osd.12 since 2012-07-31 15:21:26.770108 (cutoff 2012-07-31 15:21:26.861458)
2012-07-31 15:21:46.861496 7fdc14e9d700 -1 osd.0 1388 heartbeat_check: no reply from osd.13 since 2012-07-31 15:21:26.770108 (cutoff 2012-07-31 15:21:26.861458)
2012-07-31 15:21:46.861506 7fdc14e9d700 -1 osd.0 1388 heartbeat_check: no reply from osd.14 since 2012-07-31 15:21:26.770108 (cutoff 2012-07-31 15:21:26.861458)
2012-07-31 15:21:46.861514 7fdc14e9d700 -1 osd.0 1388 heartbeat_check: no reply from osd.15 since 2012-07-31 15:21:26.770108 (cutoff 2012-07-31 15:21:26.861458)
2012-07-31 15:21:46.861522 7fdc14e9d700 -1 osd.0 1388 heartbeat_check: no reply from osd.16 since 2012-07-31 15:21:26.770108 (cutoff 2012-07-31 15:21:26.861458)
2012-07-31 15:21:46.861530 7fdc14e9d700 -1 osd.0 1388 heartbeat_check: no reply from osd.17 since 2012-07-31 15:21:26.770108 (cutoff 2012-07-31 15:21:26.861458)
2012-07-31 15:21:46.861538 7fdc14e9d700 -1 osd.0 1388 heartbeat_check: no reply from osd.18 since 2012-07-31 15:21:26.770108 (cutoff 2012-07-31 15:21:26.861458)
2012-07-31 15:21:46.861546 7fdc14e9d700 -1 osd.0 1388 heartbeat_check: no reply from osd.19 since 2012-07-31 15:21:26.770108 (cutoff 2012-07-31 15:21:26.861458)
2012-07-31 15:21:46.861556 7fdc14e9d700 -1 osd.0 1388 heartbeat_check: no reply from osd.20 since 2012-07-31 15:21:26.770108 (cutoff 2012-07-31 15:21:26.861458)
2012-07-31 15:21:46.861576 7fdc14e9d700 -1 osd.0 1388 heartbeat_check: no reply from osd.21 since 2012-07-31 15:21:26.770108 (cutoff 2012-07-31 15:21:26.861458)
2012-07-31 15:21:46.861609 7fdc14e9d700 -1 osd.0 1388 heartbeat_check: no reply from osd.22 since 2012-07-31 15:21:26.770108 (cutoff 2012-07-31 15:21:26.861458)
2012-07-31 15:21:46.861618 7fdc14e9d700 -1 osd.0 1388 heartbeat_check: no reply from osd.23 since 2012-07-31 15:21:26.770108 (cutoff 2012-07-31 15:21:26.861458)
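
If I read these heartbeat_check lines correctly, the cutoff is just the current time minus the heartbeat grace period:

    15:21:46.861458 - 20 s = 15:21:26.861458

which matches the default osd heartbeat grace of 20 seconds (assuming we have not changed it), so osd.0 on the isolated server1 does notice that osd.12 ~ osd.23 stopped replying.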

Log of osd.12 (on server2):
2012-07-31 15:20:31.475815 7f9eac5ba700  0 osd.12 1387 pg[2.16f( v 1356'10485 (465'9480,1356'10485] n=42 ec=1 les/c 1387/1387 1383/1383/1383) [12,0] r=0 lpr=1383 mlcod 0'0 active+clean] watch: oi.user_version=45
2012-07-31 15:20:31.475817 7f9eabdb9700  0 osd.12 1387 pg[2.205( v 1282'26975 (1254'25973,1282'26975] n=86 ec=1 les/c 1387/1387 1383/1383/1383) [12,9] r=0 lpr=1383 lcod 0'0 mlcod 0'0 active+clean] watch: ctx->obc=0x5838dc0 cookie=9 oi.version=26975 ctx->at_version=1387'26976
2012-07-31 15:20:31.475837 7f9eabdb9700  0 osd.12 1387 pg[2.205( v 1282'26975 (1254'25973,1282'26975] n=86 ec=1 les/c 1387/1387 1383/1383/1383) [12,9] r=0 lpr=1383 lcod 0'0 mlcod 0'0 active+clean] watch: oi.user_version=1043
2012-07-31 15:35:31.512306 7f9ea6f8e700  0 -- 192.168.200.82:6840/8772 >> 192.168.200.81:6847/18544 pipe(0x4633780 sd=41 pgs=82 cs=1 l=0).fault with nothing to send, going to standby
2012-07-31 15:35:31.512342 7f9ea7897700  0 -- 192.168.200.82:6840/8772 >> 192.168.200.81:6853/19122 pipe(0x4a68280 sd=43 pgs=83 cs=1 l=0).fault with nothing to send, going to standby
2012-07-31 15:35:31.579095 7f9ea6c8b700  0 -- 192.168.200.82:6840/8772 >> 192.168.200.81:6809/17957 pipe(0x6309c80 sd=55 pgs=80 cs=1 l=0).fault with nothing to send, going to standby
2012-07-31 15:35:31.592368 7f9ea7a99700  0 -- 192.168.200.82:6840/8772 >> 192.168.200.81:6840/12656 pipe(0x4b44780 sd=44 pgs=104 cs=1 l=0).fault with nothing to send, going to standby
2012-07-31 15:35:31.596484 7f9ea94b3700  0 -- 192.168.200.82:6840/8772 >> 192.168.200.81:6836/18275 pipe(0x4cfb780 sd=48 pgs=76 cs=1 l=0).fault with nothing to send, going to standby
2012-07-31 15:35:31.720803 7f9ea5a79700  0 -- 192.168.200.82:6840/8772 >> 192.168.200.81:6838/12409 pipe(0xeb4000 sd=38 pgs=105 cs=1 l=0).fault with nothing to send, going to standby

