Hi,

On 02/08/2013 11:55 PM, Jens Kristian Søgaard wrote:
> Hi again,
>
> A followup to my last email: I restarted osd.6 and the system went back to HEALTH_OK.
FYI, I saw this with 0.56.2 as well, but didn't report it....
> I examined the logs of osd.6 and osd.12 around the time the problem occurred, and saw the following:
>
> ** On osd.12:
>
> 2013-02-08 15:47:43.418226 7fa116ffd700 1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7fa114ff9700' had timed out after 15
> [repeated many times]
> 2013-02-08 15:48:01.282623 7fa114ff9700 1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7fa114ff9700' had timed out after 15
> [repeated twice]
> 2013-02-08 15:48:09.898961 7fa11dffb700 0 log [WRN] : map e3309 wrongly marked me down
> 2013-02-08 15:49:56.496155 7fa116ffd700 1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7fa114ff9700' had timed out after 15
>
> This pattern repeats itself once more.
I didn't examine the logs at that point, since restarting the OSD fixed it, but I wanted to report that I saw the same thing.
Probably a coincidence, but I saw it on a 12-OSD system as well.

Wido
> Then I have messages like this:
>
> 2013-02-08 15:50:59.814871 7fa11c6f7700 0 -- 10.0.0.1:6807/29923 >> 10.0.0.2:6807/10808 pipe(0x7fa0fd11a5d0 sd=36 :41003 s=2 pgs=288 cs=3 l=0).reader got old message 1 <= 41 0x7fa124420da0 osd_map(3319..3322 src has 2819..3322) v3, discarding
> 2013-02-08 15:50:59.814899 7fa1107e9700 0 -- 10.0.0.1:6807/29923 >> 10.0.0.2:6819/11582 pipe(0x7fa10c7d07d0 sd=37 :37261 s=2 pgs=270 cs=3 l=0).reader got old message 1 <= 51 0x7fa0b0452970 osd_map(3319..3322 src has 2819..3322) v3, discarding
> 2013-02-08 15:50:59.814946 7fa11c6f7700 0 -- 10.0.0.1:6807/29923 >> 10.0.0.2:6807/10808 pipe(0x7fa0fd11a5d0 sd=36 :41003 s=2 pgs=288 cs=3 l=0).fault with nothing to send, going to standby
> 2013-02-08 15:50:59.815062 7fa0a32f2700 0 -- 10.0.0.1:6807/29923 >> 10.0.0.2:6801/14104 pipe(0x7fa10d4b82d0 sd=43 :50456 s=2 pgs=240 cs=3 l=0).reader got old message 1 <= 44 0x7fa12c00e2a0 osd_map(3319..3322 src has 2819..3322) v3, discarding
> 2013-02-08 15:50:59.815109 7fa1107e9700 0 -- 10.0.0.1:6807/29923 >> 10.0.0.2:6819/11582 pipe(0x7fa10c7d07d0 sd=37 :37261 s=2 pgs=270 cs=3 l=0).fault, initiating reconnect
>
> ** On osd.6:
>
> 2013-02-08 15:48:03.716412 7ffebe2ab700 -1 osd.6 3308 heartbeat_check: no reply from osd.12 since 2013-02-08 15:47:42.725323 (cutoff 2013-02-08 15:47:43.716409)
> [repeated 7 times]
> 2013-02-08 15:50:59.812548 7ffea3fff700 0 osd.6 3322 from dead osd.12, dropping, sharing map
> [repeated 7 times]
>
> Then I have messages like this:
>
> 2013-02-08 15:51:00.126043 7ffe966ed700 0 -- 10.0.0.2:6807/10808 >> 10.0.0.1:6804/29923 pipe(0x7ffe6c002820 sd=30 :6807 s=0 pgs=0 cs=0 l=0).accept connect_seq 0 vs existing 65 state standby
> 2013-02-08 15:51:00.126054 7ffe966ed700 0 -- 10.0.0.2:6807/10808 >> 10.0.0.1:6804/29923 pipe(0x7ffe6c002820 sd=30 :6807 s=0 pgs=0 cs=0 l=0).accept peer reset, then tried to connect to us, replacing
> 2013-02-08 15:51:00.126929 7ffe966ed700 0 -- 10.0.0.2:6807/10808 >> 10.0.0.1:6804/29923 pipe(0x7ffe6c002820 sd=30 :6807 s=2 pgs=275 cs=1 l=0).reader got old message 1 <= 196030 0x7ffe60001230 pg_info(1 pgs e3319:0.9d) v3, discarding
> 2013-02-08 15:51:00.127083 7ffe966ed700 0 -- 10.0.0.2:6807/10808 >> 10.0.0.1:6804/29923 pipe(0x7ffe6c002820 sd=30 :6807 s=2 pgs=275 cs=1 l=0).reader got old message 2 <= 196030 0x7ffe60001230 pg_info(1 pgs e3319:4.99) v3, discarding
> 2013-02-08 15:51:00.127178 7ffe966ed700 0 -- 10.0.0.2:6807/10808 >> 10.0.0.1:6804/29923 pipe(0x7ffe6c002820 sd=30 :6807 s=2 pgs=275 cs=1 l=0).reader got old message 3 <= 196030 0x7ffe60001ab0 pg_info(1 pgs e3319:0.85) v3, discarding
> 2013-02-08 15:51:00.127273 7ffe966ed700 0 -- 10.0.0.2:6807/10808 >> 10.0.0.1:6804/29923 pipe(0x7ffe6c002820 sd=30 :6807 s=2 pgs=275 cs=1 l=0).reader got old message 4 <= 196030 0x7ffe60001300 pg_info(1 pgs e3319:4.81) v3, discarding
> 2013-02-08 15:51:33.840234 7ffe966ed700 0 -- 10.0.0.2:6807/10808 >> 10.0.0.1:6804/29923 pipe(0x7ffe6c002820 sd=30 :6807 s=2 pgs=275 cs=1 l=0).reader got old message 5 <= 196030 0x7ffe60001ca0 osd_map(3319..3323 src has 2822..3323) v3, discarding
> 2013-02-08 15:51:33.840487 7ffe966ed700 0 -- 10.0.0.2:6807/10808 >> 10.0.0.1:6804/29923 pipe(0x7ffe6c002820 sd=30 :6807 s=2 pgs=275 cs=1 l=0).reader got old message 6 <= 196030 0x7ffe600010d0 pg_notify(0.12(14),2.10(9),1.11(9),4.e(9) epoch 3323) v4, discarding
> 2013-02-08 15:51:33.841834 7ffe966ed700 0 -- 10.0.0.2:6807/10808 >> 10.0.0.1:6804/29923 pipe(0x7ffe6c002820 sd=30 :6807 s=2 pgs=275 cs=1 l=0).reader got old message 7 <= 196030 0x7ffe600010d0 pg_query(0.43,0.85,0.9d,1.42,1.84,1.9c,2.41,2.83,2.9b,4.3f,4.81,4.99 epoch 3323) v2, discarding
> 2013-02-08 15:51:34.165219 7ffe966ed700 0 -- 10.0.0.2:6807/10808 >> 10.0.0.1:6804/29923 pipe(0x7ffe6c002820 sd=30 :6807 s=2 pgs=275 cs=1 l=0).reader got old message 8 <= 196030 0x7ffe6001c630 pg_notify(0.12(14),1.11(9),2.10(9),4.e(9) epoch 3323) v4, discarding
> 2013-02-08 15:51:36.805662 7ffe966ed700 0 -- 10.0.0.2:6807/10808 >> 10.0.0.1:6804/29923 pipe(0x7ffe6c002820 sd=30 :6807 s=2 pgs=275 cs=1 l=0).reader got old message 9 <= 196030 0x7ffe60021280 pg_log(2.41 epoch 3324 query_epoch 3324) v3, discarding
> 2013-02-08 15:51:36.805764 7ffe966ed700 0 -- 10.0.0.2:6807/10808 >> 10.0.0.1:6804/29923 pipe(0x7ffe6c002820 sd=30 :6807 s=2 pgs=275 cs=1 l=0).reader got old message 10 <= 196030 0x7ffe60021280 pg_log(1.42 epoch 3324 query_epoch 3324) v3, discarding
> 2013-02-08 15:51:39.404585 7ffe966ed700 0 -- 10.0.0.2:6807/10808 >> 10.0.0.1:6804/29923 pipe(0x7ffe6c002820 sd=30 :6807 s=2 pgs=275 cs=1 l=0).reader got old message 11 <= 196030 0x7ffe60021280 pg_log(1.9c epoch 3324 query_epoch 3324) v3, discarding
> 2013-02-08 15:51:39.404674 7ffe966ed700 0 -- 10.0.0.2:6807/10808 >> 10.0.0.1:6804/29923 pipe(0x7ffe6c002820 sd=30 :6807 s=2 pgs=275 cs=1 l=0).reader got old message 12 <= 196030 0x7ffe60021280 pg_log(2.9b epoch 3324 query_epoch 3324) v3, discarding
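
For anyone who wants to check whether their own OSD logs show the same pattern, here is a minimal sketch (not something that was run on the cluster in question) that counts the heartbeat_map timeout lines per thread. The regular expression is an assumption derived only from the lines quoted above, not from any official description of the Ceph log format, and `count_timeouts` is a hypothetical helper name.

```python
import re
from collections import Counter

# Pattern inferred from the log lines quoted in this thread (an assumption,
# not an official Ceph log grammar).
TIMEOUT_RE = re.compile(
    r"heartbeat_map (?:is_healthy|reset_timeout) "
    r"'(?P<pool>[^']+) thread (?P<thread>0x[0-9a-f]+)' "
    r"had timed out after (?P<secs>\d+)"
)

def count_timeouts(lines):
    """Count heartbeat timeout messages per (thread pool, thread id)."""
    counts = Counter()
    for line in lines:
        m = TIMEOUT_RE.search(line)
        if m:
            counts[(m.group("pool"), m.group("thread"))] += 1
    return counts

# Sample input copied from the osd.12 excerpt above.
sample = [
    "2013-02-08 15:47:43.418226 7fa116ffd700 1 heartbeat_map is_healthy "
    "'OSD::op_tp thread 0x7fa114ff9700' had timed out after 15",
    "2013-02-08 15:48:01.282623 7fa114ff9700 1 heartbeat_map reset_timeout "
    "'OSD::op_tp thread 0x7fa114ff9700' had timed out after 15",
    "2013-02-08 15:48:09.898961 7fa11dffb700 0 log [WRN] : map e3309 "
    "wrongly marked me down",
]

for (pool, thread), n in count_timeouts(sample).items():
    print(pool, thread, n)
```

To use it on a real log, pass an open file object instead of `sample`; the path of the OSD log depends on the installation.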
--
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on