Hi again,
A follow-up to my last email:
I restarted osd.6 and the system went back to HEALTH_OK.
I examined the logs of osd.6 and osd.12 around the time the problem
occurred, and saw the following:
** On osd.12:
2013-02-08 15:47:43.418226 7fa116ffd700 1 heartbeat_map is_healthy
'OSD::op_tp thread 0x7fa114ff9700' had timed out after 15
[repeated many times]
2013-02-08 15:48:01.282623 7fa114ff9700 1 heartbeat_map reset_timeout
'OSD::op_tp thread 0x7fa114ff9700' had timed out after 15
[repeated twice]
2013-02-08 15:48:09.898961 7fa11dffb700 0 log [WRN] : map e3309 wrongly
marked me down
2013-02-08 15:49:56.496155 7fa116ffd700 1 heartbeat_map is_healthy
'OSD::op_tp thread 0x7fa114ff9700' had timed out after 15
This pattern repeats itself once more.
Then I have messages like this:
2013-02-08 15:50:59.814871 7fa11c6f7700 0 -- 10.0.0.1:6807/29923 >>
10.0.0.2:6807/10808 pipe(0x7fa0fd11a5d0 sd=36 :41003 s=2 pgs=288 cs=3
l=0).reader got old message 1 <= 41 0x7fa124420da0 osd_map(3319..3322
src has 2819..3322) v3, discarding
2013-02-08 15:50:59.814899 7fa1107e9700 0 -- 10.0.0.1:6807/29923 >>
10.0.0.2:6819/11582 pipe(0x7fa10c7d07d0 sd=37 :37261 s=2 pgs=270 cs=3
l=0).reader got old message 1 <= 51 0x7fa0b0452970 osd_map(3319..3322
src has 2819..3322) v3, discarding
2013-02-08 15:50:59.814946 7fa11c6f7700 0 -- 10.0.0.1:6807/29923 >>
10.0.0.2:6807/10808 pipe(0x7fa0fd11a5d0 sd=36 :41003 s=2 pgs=288 cs=3
l=0).fault with nothing to send, going to standby
2013-02-08 15:50:59.815062 7fa0a32f2700 0 -- 10.0.0.1:6807/29923 >>
10.0.0.2:6801/14104 pipe(0x7fa10d4b82d0 sd=43 :50456 s=2 pgs=240 cs=3
l=0).reader got old message 1 <= 44 0x7fa12c00e2a0 osd_map(3319..3322
src has 2819..3322) v3, discarding
2013-02-08 15:50:59.815109 7fa1107e9700 0 -- 10.0.0.1:6807/29923 >>
10.0.0.2:6819/11582 pipe(0x7fa10c7d07d0 sd=37 :37261 s=2 pgs=270 cs=3
l=0).fault, initiating reconnect
** On osd.6:
2013-02-08 15:48:03.716412 7ffebe2ab700 -1 osd.6 3308 heartbeat_check:
no reply from osd.12 since 2013-02-08 15:47:42.725323 (cutoff 2013-02-08
15:47:43.716409)
[repeated 7 times]
2013-02-08 15:50:59.812548 7ffea3fff700 0 osd.6 3322 from dead osd.12,
dropping, sharing map
[repeated 7 times]
Then I have messages like this:
2013-02-08 15:51:00.126043 7ffe966ed700 0 -- 10.0.0.2:6807/10808 >>
10.0.0.1:6804/29923 pipe(0x7ffe6c002820 sd=30 :6807 s=0 pgs=0 cs=0
l=0).accept connect_seq 0 vs existing 65 state standby
2013-02-08 15:51:00.126054 7ffe966ed700 0 -- 10.0.0.2:6807/10808 >>
10.0.0.1:6804/29923 pipe(0x7ffe6c002820 sd=30 :6807 s=0 pgs=0 cs=0
l=0).accept peer reset, then tried to connect to us, replacing
2013-02-08 15:51:00.126929 7ffe966ed700 0 -- 10.0.0.2:6807/10808 >>
10.0.0.1:6804/29923 pipe(0x7ffe6c002820 sd=30 :6807 s=2 pgs=275 cs=1
l=0).reader got old message 1 <= 196030 0x7ffe60001230 pg_info(1 pgs
e3319:0.9d) v3, discarding
2013-02-08 15:51:00.127083 7ffe966ed700 0 -- 10.0.0.2:6807/10808 >>
10.0.0.1:6804/29923 pipe(0x7ffe6c002820 sd=30 :6807 s=2 pgs=275 cs=1
l=0).reader got old message 2 <= 196030 0x7ffe60001230 pg_info(1 pgs
e3319:4.99) v3, discarding
2013-02-08 15:51:00.127178 7ffe966ed700 0 -- 10.0.0.2:6807/10808 >>
10.0.0.1:6804/29923 pipe(0x7ffe6c002820 sd=30 :6807 s=2 pgs=275 cs=1
l=0).reader got old message 3 <= 196030 0x7ffe60001ab0 pg_info(1 pgs
e3319:0.85) v3, discarding
2013-02-08 15:51:00.127273 7ffe966ed700 0 -- 10.0.0.2:6807/10808 >>
10.0.0.1:6804/29923 pipe(0x7ffe6c002820 sd=30 :6807 s=2 pgs=275 cs=1
l=0).reader got old message 4 <= 196030 0x7ffe60001300 pg_info(1 pgs
e3319:4.81) v3, discarding
2013-02-08 15:51:33.840234 7ffe966ed700 0 -- 10.0.0.2:6807/10808 >>
10.0.0.1:6804/29923 pipe(0x7ffe6c002820 sd=30 :6807 s=2 pgs=275 cs=1
l=0).reader got old message 5 <= 196030 0x7ffe60001ca0
osd_map(3319..3323 src has 2822..3323) v3, discarding
2013-02-08 15:51:33.840487 7ffe966ed700 0 -- 10.0.0.2:6807/10808 >>
10.0.0.1:6804/29923 pipe(0x7ffe6c002820 sd=30 :6807 s=2 pgs=275 cs=1
l=0).reader got old message 6 <= 196030 0x7ffe600010d0
pg_notify(0.12(14),2.10(9),1.11(9),4.e(9) epoch 3323) v4, discarding
2013-02-08 15:51:33.841834 7ffe966ed700 0 -- 10.0.0.2:6807/10808 >>
10.0.0.1:6804/29923 pipe(0x7ffe6c002820 sd=30 :6807 s=2 pgs=275 cs=1
l=0).reader got old message 7 <= 196030 0x7ffe600010d0
pg_query(0.43,0.85,0.9d,1.42,1.84,1.9c,2.41,2.83,2.9b,4.3f,4.81,4.99
epoch 3323) v2, discarding
2013-02-08 15:51:34.165219 7ffe966ed700 0 -- 10.0.0.2:6807/10808 >>
10.0.0.1:6804/29923 pipe(0x7ffe6c002820 sd=30 :6807 s=2 pgs=275 cs=1
l=0).reader got old message 8 <= 196030 0x7ffe6001c630
pg_notify(0.12(14),1.11(9),2.10(9),4.e(9) epoch 3323) v4, discarding
2013-02-08 15:51:36.805662 7ffe966ed700 0 -- 10.0.0.2:6807/10808 >>
10.0.0.1:6804/29923 pipe(0x7ffe6c002820 sd=30 :6807 s=2 pgs=275 cs=1
l=0).reader got old message 9 <= 196030 0x7ffe60021280 pg_log(2.41 epoch
3324 query_epoch 3324) v3, discarding
2013-02-08 15:51:36.805764 7ffe966ed700 0 -- 10.0.0.2:6807/10808 >>
10.0.0.1:6804/29923 pipe(0x7ffe6c002820 sd=30 :6807 s=2 pgs=275 cs=1
l=0).reader got old message 10 <= 196030 0x7ffe60021280 pg_log(1.42
epoch 3324 query_epoch 3324) v3, discarding
2013-02-08 15:51:39.404585 7ffe966ed700 0 -- 10.0.0.2:6807/10808 >>
10.0.0.1:6804/29923 pipe(0x7ffe6c002820 sd=30 :6807 s=2 pgs=275 cs=1
l=0).reader got old message 11 <= 196030 0x7ffe60021280 pg_log(1.9c
epoch 3324 query_epoch 3324) v3, discarding
2013-02-08 15:51:39.404674 7ffe966ed700 0 -- 10.0.0.2:6807/10808 >>
10.0.0.1:6804/29923 pipe(0x7ffe6c002820 sd=30 :6807 s=2 pgs=275 cs=1
l=0).reader got old message 12 <= 196030 0x7ffe60021280 pg_log(2.9b
epoch 3324 query_epoch 3324) v3, discarding
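
The timings in the osd.6 heartbeat_check line above can be checked with a few lines of Python. This is only a minimal sketch: the function name heartbeat_gaps is hypothetical, and it assumes the timestamp format shown in the log excerpts. It computes how far the cutoff lies before the time the check was logged, and how much older than the cutoff the last reply from osd.12 was. A gap of roughly 20 seconds between check and cutoff would be consistent with a 20-second heartbeat grace period, but that is an inference from these timestamps and not something the log states.

```python
from datetime import datetime

# Timestamp layout as it appears in the OSD log lines quoted above.
FMT = "%Y-%m-%d %H:%M:%S.%f"

def heartbeat_gaps(logged_at, last_reply, cutoff):
    """Return (grace, overdue) in seconds for one heartbeat_check line.

    grace   = time between the cutoff and the moment the check was logged
    overdue = how much older than the cutoff the last reply was
    (hypothetical helper, not part of Ceph)
    """
    t_log = datetime.strptime(logged_at, FMT)
    t_reply = datetime.strptime(last_reply, FMT)
    t_cutoff = datetime.strptime(cutoff, FMT)
    grace = (t_log - t_cutoff).total_seconds()
    overdue = (t_cutoff - t_reply).total_seconds()
    return grace, overdue

# Values copied from the first heartbeat_check line logged by osd.6.
grace, overdue = heartbeat_gaps(
    "2013-02-08 15:48:03.716412",   # time the check was logged
    "2013-02-08 15:47:42.725323",   # "no reply from osd.12 since"
    "2013-02-08 15:47:43.716409",   # "cutoff"
)
print(f"cutoff lies {grace:.6f} s before the check")
print(f"last reply is {overdue:.6f} s older than the cutoff")
```

A positive second value means the last reply from osd.12 predates the cutoff, which is the condition under which osd.6 logs the "no reply" message.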
--
Jens Kristian Søgaard, Mermaid Consulting ApS,
jens@xxxxxxxxxxxxxxxxxxxx,
http://www.mermaidconsulting.com/