When I add a new host (with OSDs) to my existing cluster, one or two of the existing OSDs go down for about two minutes and then come back up.

[root@h1ct ~]# ceph osd tree
# id    weight  type name       up/down reweight
-1      3       root default
-3      3               rack unknownrack
-2      3                       host h1
0       1                               osd.0   up      1
1       1                               osd.1   up      1
2       1                               osd.2   up      1

For example, after adding host h2 (with 3 new OSDs) to the above cluster and running "ceph osd tree", I see this:

[root@h1 ~]# ceph osd tree
# id    weight  type name       up/down reweight
-1      6       root default
-3      6               rack unknownrack
-2      3                       host h1
0       1                               osd.0   up      1
1       1                               osd.1   down    1
2       1                               osd.2   up      1
-4      3                       host h2
3       1                               osd.3   up      1
4       1                               osd.4   up      1
5       1                               osd.5   up      1

The down OSDs always come back up within two minutes or less, and I see the following messages in the corresponding OSD log file:

2013-01-07 04:40:17.613028 7fec7f092760  1 journal _open /ceph_journal/journals/journal_2 fd 26: 1073741824 bytes, block size 4096 bytes, directio = 1, aio = 0
2013-01-07 04:40:17.613122 7fec7f092760  1 journal _open /ceph_journal/journals/journal_2 fd 26: 1073741824 bytes, block size 4096 bytes, directio = 1, aio = 0
2013-01-07 04:42:10.006533 7fec746f7710  0 -- 192.168.0.124:6808/19449 >> 192.168.1.123:6800/18287 pipe(0x7fec20000e10 sd=31 :6808 pgs=0 cs=0 l=0).accept connect_seq 0 vs existing 0 state connecting
2013-01-07 04:45:29.834341 7fec743f4710  0 -- 192.168.1.124:6808/19449 >> 192.168.1.122:6800/20072 pipe(0x7fec5402f320 sd=28 :45438 pgs=7 cs=1 l=0).fault, initiating reconnect
2013-01-07 04:45:29.835748 7fec743f4710  0 -- 192.168.1.124:6808/19449 >> 192.168.1.122:6800/20072 pipe(0x7fec5402f320 sd=28 :45439 pgs=15 cs=3 l=0).fault, initiating reconnect
2013-01-07 04:45:30.835219 7fec743f4710  0 -- 192.168.1.124:6808/19449 >> 192.168.1.122:6800/20072 pipe(0x7fec5402f320 sd=28 :45894 pgs=482 cs=903 l=0).fault, initiating reconnect
2013-01-07 04:45:30.837318 7fec743f4710  0 -- 192.168.1.124:6808/19449 >> 192.168.1.122:6800/20072 pipe(0x7fec5402f320 sd=28 :45895 pgs=483 cs=905 l=0).fault, initiating reconnect
2013-01-07 04:45:30.851984 7fec637fe710  0 log [ERR] : map e27 had wrong cluster addr (192.168.0.124:6808/19449 != my 192.168.1.124:6808/19449)

Also, this happens only when the cluster IP address and the public IP address are different, for example:

....
....
....
[osd.0]
        host = g8ct
        public address = 192.168.0.124
        cluster address = 192.168.1.124
        btrfs devs = /dev/sdb
....
....

It does not happen when they are the same. Any idea what the issue may be?

Isaac
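For context, a minimal sketch of how a split public/cluster setup is typically declared in ceph.conf, using the global "public network" and "cluster network" options alongside per-daemon addresses; the subnet ranges below are illustrative assumptions, not taken from the report above:

```ini
; Illustrative ceph.conf fragment (subnets are assumed, not from the report)
[global]
        ; OSDs pick their client-facing address from the public network
        public network = 192.168.0.0/24
        ; and their replication/heartbeat address from the cluster network
        cluster network = 192.168.1.0/24

[osd.0]
        host = g8ct
        ; explicit per-daemon addresses override network-based selection
        public address = 192.168.0.124
        cluster address = 192.168.1.124
```

The "wrong cluster addr" error above suggests the OSD registered its public-network address (192.168.0.124) in the map where its cluster address (192.168.1.124) was expected, so checking that every daemon's addresses fall on the intended networks may help narrow this down.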