Hello guys,
Since upgrading my cluster to Giant from the previous stable release I have been having massive problems with client IO. I did the upgrade 2 weeks ago, and since then the IO on my ceph cluster has become unavailable 3 times already.
Quick info on my storage cluster: 3 mons, 2 osd servers, 17 osds, 4 ssds for journals. Everything is running the Giant release from the ceph repos on Ubuntu 12.04 with the latest updates.
Every time I experience a very similar problem: two out of 17 osds are shown as down even though the corresponding ceph-osd processes are running, and somehow having those two osds marked down brings down client IO for the whole cluster. This did not happen before upgrading to Giant. Different osds seem to go down each time, but it is always on the same osd server. The latest incident happened tonight. ceph -s showed the following:
root@arh-ibstorage1-ib:~# ceph -s
cluster 51e9f641-372e-44ec-92a4-b9fe55cbf9fe
health HEALTH_WARN 1001 pgs degraded; 150 pgs down; 150 pgs peering; 1 pgs recovering; 41 pgs recovery_wait; 1001 pgs stuck degraded; 150 pgs stuck inactive; 1151 pgs stuck unclean; 967 pgs stuck undersized; 967 pgs undersized; 203 requests are blocked > 32 sec; recovery 661954/5994334 objects degraded (11.043%); 10/2997167 unfound (0.000%); 2/17 in osds are down; noout flag(s) set
monmap e11: 3 mons at {arh-cloud13-ib=192.168.168.13:6789/0,arh-ibstorage1-ib=192.168.168.200:6789/0,arh-ibstorage2-ib=192.168.168.201:6789/0}, election epoch 66832, quorum 0,1,2 arh-cloud13-ib,arh-ibstorage1-ib,arh-ibstorage2-ib
mdsmap e2709: 1/1/1 up {0=arh-ibstorage2-ib=up:active}, 1 up:standby
osdmap e64374: 17 osds: 15 up, 17 in
flags noout
pgmap v39540041: 4412 pgs, 17 pools, 10245 GB data, 2926 kobjects
20570 GB used, 26853 GB / 47423 GB avail
661954/5994334 objects degraded (11.043%); 10/2997167 unfound (0.000%)
1 active+recovering+undersized+degraded
3261 active+clean
34 active+recovery_wait+degraded
959 active+undersized+degraded
7 active+recovery_wait+undersized+degraded
150 down+peering
client io 0 B/s rd, 13805 B/s wr, 4 op/s
"ceph osd tree" shows:
root@arh-ibstorage1-ib:~# ceph osd tree
# id weight type name up/down reweight
-10 1.8 root uk-ssd
-9 1.8 datacenter ldex-ssd
-12 1.8 room ldex-dc3-ssd
-14 1.8 row row-a-ssd
-8 1.8 rack ldex-rack-a5-ssd
-6 0.9 host arh-ibstorage1-ib-ssd
-7 0.9 host arh-ibstorage2-ib-ssd
-1 46.41 root uk
-5 46.41 datacenter ldex
-11 46.41 room ldex-dc3
-13 46.41 row row-a
-4 46.41 rack ldex-rack-a5
-2 24.57 host arh-ibstorage1-ib
1 2.73 osd.1 up 1
3 2.73 osd.3 up 1
5 2.73 osd.5 up 1
6 2.73 osd.6 up 1
7 2.73 osd.7 up 1
8 2.73 osd.8 up 1
4 2.73 osd.4 up 1
0 2.73 osd.0 up 1
2 2.73 osd.2 up 1
-3 21.84 host arh-ibstorage2-ib
9 2.73 osd.9 up 1
10 2.73 osd.10 up 1
11 2.73 osd.11 up 1
12 2.73 osd.12 down 1
13 2.73 osd.13 up 1
14 2.73 osd.14 up 1
15 2.73 osd.15 up 1
16 2.73 osd.16 down 1
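Note that even though the tree shows osd.12 and osd.16 as down, the ceph-osd processes for both are still alive on arh-ibstorage2-ib. For reference, this is roughly how I check that (nothing fancy, just the process list plus the admin socket on the osd host):

ps aux | grep '[c]eph-osd'      # both daemons are still listed
ceph daemon osd.12 status       # admin socket still answers
ceph daemon osd.16 status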
From the ceph -s output it seems that some PGs were down and peering. There was no change in the ceph -s output for over an hour. Client IO activity ground to a halt and all vms became unresponsive and eventually crashed. The cluster was unusable until I manually stopped osd.12 and osd.16 and started them again, which resolved the problem.
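For what it's worth, the restart was just the usual upstart dance on the osd host (commands from memory, so treat them as approximate):

stop ceph-osd id=12 && start ceph-osd id=12
stop ceph-osd id=16 && start ceph-osd id=16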
This previously happened about four days ago with osds 11 and 15, and again about 10 days ago with different osds. Losing storage IO three times in the last two weeks is a big deal.
Looking at the ceph logs, I've discovered that since performing the upgrade there are far more messages reporting osds being down. Different osds are reported down, come back up after about 5-30 seconds, recovery runs, and the cluster is back to normal (apart from the three times which actually brought the cluster IO down).
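A rough way to see how often each osd gets marked down is to count the failure reports in the cluster log on a mon host; a sketch of what I mean (the awk field index just matches the log format shown below, so adjust if yours differs):

grep ' failed (' /var/log/ceph/ceph.log | awk '{print $9}' | sort | uniq -c | sort -rn
grep -c ' boot$' /var/log/ceph/ceph.log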
Furthermore, a few minutes before osd.12 and osd.16 went down, I saw 2 PGs being deep scrubbed while osd.7 was reported down by 3 peers:
2014-11-13 20:55:36.608502 mon.0 192.168.168.13:6789/0 1210074 : cluster [INF] pgmap v39509396: 4412 pgs: 4410 active+clean, 2 active+clean+scrubbing+deep; 10245 GB data, 20570 GB used, 26853 GB / 47423 GB avail; 10005 kB/s rd, 3486 kB/s wr, 310 op/s
2014-11-13 20:56:10.056933 mon.0 192.168.168.13:6789/0 1210209 : cluster [INF] pgmap v39509409: 4412 pgs: 3380 active+clean, 2 active+clean+scrubbing+deep, 1030 active+undersized+degraded; 10245 GB data, 20570 GB used, 26853 GB / 47423 GB avail; 112 kB/s rd, 122 kB/s wr, 37 op/s; 662204/5994308 objects degraded (11.047%)
2014-11-13 20:56:10.073317 mon.0 192.168.168.13:6789/0 1210210 : cluster [INF] osd.7 192.168.168.200:6838/26321 boot
2014-11-13 20:56:10.073461 mon.0 192.168.168.13:6789/0 1210211 : cluster [INF] osd.3 192.168.168.200:6818/26170 boot
2014-11-13 20:56:10.074415 mon.0 192.168.168.13:6789/0 1210212 : cluster [INF] osdmap e64099: 17 osds: 17 up, 17 in
2014-11-13 20:56:10.109650 mon.0 192.168.168.13:6789/0 1210213 : cluster [INF] pgmap v39509410: 4412 pgs: 3380 active+clean, 2 active+clean+scrubbing+deep, 1030 active+undersized+degraded; 10245 GB data, 20570 GB used, 26853 GB / 47423 GB avail; 97137 B/s rd, 220 kB/s wr, 36 op/s; 662204/5994308 objects degraded (11.047%)
2014-11-13 20:56:11.082817 mon.0 192.168.168.13:6789/0 1210214 : cluster [INF] osdmap e64100: 17 osds: 17 up, 17 in
2014-11-13 20:56:11.741135 mon.1 192.168.168.200:6789/0 520573 : cluster [INF] mon.arh-ibstorage1-ib calling new monitor election
2014-11-13 20:56:11.744641 mon.2 192.168.168.201:6789/0 521421 : cluster [INF] mon.arh-ibstorage2-ib calling new monitor election
2014-11-13 20:56:11.745988 mon.0 192.168.168.13:6789/0 1210248 : cluster [INF] mon.arh-cloud13-ib calling new monitor election
2014-11-13 20:56:11.754388 mon.0 192.168.168.13:6789/0 1210249 : cluster [INF] mon.arh-cloud13-ib@0 won leader election with quorum 0,1,2
2014-11-13 20:56:11.763975 mon.0 192.168.168.13:6789/0 1210250 : cluster [INF] monmap e11: 3 mons at {arh-cloud13-ib=192.168.168.13:6789/0,arh-ibstorage1-ib=192.168.168.200:6789/0,arh-ibstorage2-ib=192.168.168.201:6789/0}
2014-11-13 20:56:11.764069 mon.0 192.168.168.13:6789/0 1210251 : cluster [INF] pgmap v39509411: 4412 pgs: 3380 active+clean, 2 active+clean+scrubbing+deep, 1030 active+undersized+degraded; 10245 GB data, 20570 GB used, 26853 GB / 47423 GB avail; 662204/5994308 objects degraded (11.047%)
2014-11-13 20:56:11.764147 mon.0 192.168.168.13:6789/0 1210252 : cluster [INF] mdsmap e2705: 1/1/1 up {0=arh-ibstorage2-ib=up:active}, 1 up:standby
2014-11-13 20:56:11.764283 mon.0 192.168.168.13:6789/0 1210253 : cluster [INF] osdmap e64100: 17 osds: 17 up, 17 in
2014-11-13 20:56:11.764859 mon.0 192.168.168.13:6789/0 1210257 : cluster [INF] osd.7 192.168.168.200:6838/26321 failed (3 reports from 3 peers after 32.005619 >= grace 25.962628)
2014-11-13 20:56:12.721171 mon.0 192.168.168.13:6789/0 1210262 : cluster [INF] osdmap e64101: 17 osds: 16 up, 17 in
2014-11-13 20:56:13.294162 mon.0 192.168.168.13:6789/0 1210263 : cluster [INF] pgmap v39509413: 4412 pgs: 3380 active+clean, 2 active+clean+scrubbing+deep, 1030 active+undersized+degraded; 10245 GB data, 20570 GB used, 26853 GB / 47423 GB avail; 1150 kB/s rd, 561 kB/s wr, 193 op/s; 662204/5994308 objects degraded (11.047%)
2014-11-13 20:56:14.325109 mon.0 192.168.168.13:6789/0 1210264 : cluster [INF] osdmap e64102: 17 osds: 16 up, 17 in
2014-11-13 20:56:14.357819 mon.0 192.168.168.13:6789/0 1210265 : cluster [INF] pgmap v39509414: 4412 pgs: 3380 active+clean, 2 active+clean+scrubbing+deep, 1030 active+undersized+degraded; 10245 GB data, 20570 GB used, 26853 GB / 47423 GB avail; 662204/5994308 objects degraded (11.047%)
2014-11-13 20:56:15.442751 mon.0 192.168.168.13:6789/0 1210266 : cluster [INF] pgmap v39509415: 4412 pgs: 3685 active+clean, 2 active+clean+scrubbing+deep, 18 active+recovery_wait+degraded, 707 active+undersized+degraded; 10245 GB data, 20570 GB used, 26853 GB / 47423 GB avail; 28730 kB/s rd, 1087 kB/s wr, 590 op/s; 466784/5994308 objects degraded (7.787%); 499 MB/s, 3 keys/s, 237 objects/s recovering
2014-11-13 20:56:16.364658 mon.0 192.168.168.13:6789/0 1210268 : cluster [INF] osd.7 192.168.168.200:6838/26321 boot
2014-11-13 20:56:16.366310 mon.0 192.168.168.13:6789/0 1210269 : cluster [INF] osdmap e64103: 17 osds: 17 up, 17 in
2014-11-13 20:56:16.380120 mon.0 192.168.168.13:6789/0 1210270 : cluster [INF] pgmap v39509416: 4412 pgs: 3685 active+clean, 2 active+clean+scrubbing+deep, 18 active+recovery_wait+degraded, 707 active+undersized+degraded; 10245 GB data, 20570 GB used, 26853 GB / 47423 GB avail; 33165 kB/s rd, 1255 kB/s wr, 682 op/s; 466784/5994308 objects degraded (7.787%); 576 MB/s, 3 keys/s, 274 objects/s recovering
2014-11-13 20:56:17.400859 mon.0 192.168.168.13:6789/0 1210271 : cluster [INF] osdmap e64104: 17 osds: 17 up, 17 in
The scrubbing then continued for about the next 20 seconds, until ceph reported osd.12 and osd.16 as down:
2014-11-13 20:56:36.712346 mon.0 192.168.168.13:6789/0 1210327 : cluster [INF] osd.12 192.168.168.201:6810/10278 failed (11 reports from 9 peers after 24.080479 >= grace 23.026948)
2014-11-13 20:56:36.771124 mon.0 192.168.168.13:6789/0 1210330 : cluster [INF] osd.15 192.168.168.201:6801/4482 failed (10 reports from 8 peers after 24.118287 >= grace 23.285583)
2014-11-13 20:56:37.920306 mon.0 192.168.168.13:6789/0 1210345 : cluster [INF] pgmap v39509434: 4412 pgs: 122 peering, 4085 active+clean, 2 active+clean+scrubbing+deep, 2 active+degraded, 16 active+recovery_wait+degraded, 185 active+undersized+degraded; 10245 GB data, 20570 GB used, 26853 GB / 47423 GB avail; 13943 B/s rd, 506 kB/s wr, 86 op/s; 118552/5994302 objects degraded (1.978%); 666 kB/s, 0 objects/s recovering
2014-11-13 20:56:37.963713 mon.0 192.168.168.13:6789/0 1210346 : cluster [INF] osdmap e64106: 17 osds: 15 up, 17 in
2014-11-13 20:56:38.458852 mon.0 192.168.168.13:6789/0 1210347 : cluster [INF] pgmap v39509435: 4412 pgs: 377 stale+active+clean, 122 peering, 3708 active+clean, 119 stale+active+undersized+degraded, 2 active+clean+scrubbing+deep, 2 active+degraded, 16 active+recovery_wait+degraded, 66 active+undersized+degraded; 10245 GB data, 20570 GB used, 26853 GB / 47423 GB avail; 20510 B/s rd, 430 kB/s wr, 76 op/s; 118552/5994302 objects degraded (1.978%)
2014-11-13 20:56:39.195535 mon.0 192.168.168.13:6789/0 1210349 : cluster [INF] osdmap e64107: 17 osds: 15 up, 17 in
2014-11-13 20:56:39.231384 mon.0 192.168.168.13:6789/0 1210350 : cluster [INF] pgmap v39509436: 4412 pgs: 377 stale+active+clean, 122 peering, 3708 active+clean, 119 stale+active+undersized+degraded, 2 active+clean+scrubbing+deep, 2 active+degraded, 16 active+recovery_wait+degraded, 66 active+undersized+degraded; 10245 GB data, 20570 GB used, 26853 GB / 47423 GB avail; 118552/5994302 objects degraded (1.978%)
2014-11-13 20:56:41.883606 mon.0 192.168.168.13:6789/0 1210352 : cluster [INF] pgmap v39509437: 4412 pgs: 377 stale+active+clean, 122 peering, 3708 active+clean, 119 stale+active+undersized+degraded, 2 active+clean+scrubbing+deep, 1 active+degraded, 17 active+recovery_wait+degraded, 66 active+undersized+degraded; 10245 GB data, 20570 GB used, 26853 GB / 47423 GB avail; 14341 B/s rd, 144 kB/s wr, 5 op/s; 118552/5994302 objects degraded (1.978%)
2014-11-13 20:56:43.617030 mon.0 192.168.168.13:6789/0 1210359 : cluster [INF] pgmap v39509438: 4412 pgs: 377 stale+active+clean, 122 peering, 3708 active+clean, 119 stale+active+undersized+degraded, 2 active+clean+scrubbing+deep, 1 active+degraded, 17 active+recovery_wait+degraded, 66 active+undersized+degraded; 10245 GB data, 20570 GB used, 26853 GB / 47423 GB avail; 718 kB/s rd, 215 kB/s wr, 46 op/s; 118552/5994302 objects degraded (1.978%)
2014-11-13 20:56:45.408475 mon.0 192.168.168.13:6789/0 1210365 : cluster [INF] pgmap v39509439: 4412 pgs: 46 peering, 3218 active+clean, 1 active+clean+scrubbing+deep, 1 active+degraded, 12 active+recovery_wait+degraded, 1007 active+undersized+degraded, 1 active+recovery_wait+undersized+degraded, 126 down+peering; 10245 GB data, 20570 GB used, 26853 GB / 47423 GB avail; 1853 kB/s wr, 300 op/s; 696740/5994304 objects degraded (11.623%); 8/2997152 unfound (0.000%)
2014-11-13 20:56:46.517294 mon.0 192.168.168.13:6789/0 1210366 : cluster [INF] pgmap v39509440: 4412 pgs: 46 peering, 3222 active+clean, 1 active+clean+scrubbing+deep, 9 active+recovery_wait+degraded, 1007 active+undersized+degraded, 1 active+recovery_wait+undersized+degraded, 126 down+peering; 10245 GB data, 20570 GB used, 26853 GB / 47423 GB avail; 2242 kB/s wr, 358 op/s; 696737/5994304 objects degraded (11.623%); 8/2997152 unfound (0.000%)
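(For reference, the "grace" in those failure messages is the osd heartbeat grace, adjusted by the laggy estimates. If it turns out the heartbeats are simply not being answered in time under scrub/recovery load, I assume the relevant knob is the one below; the value is purely illustrative, I have not changed it from the default of 20s:)

[global]
    # illustrative only - not what I am running; default is 20 seconds
    osd heartbeat grace = 30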
From this point on, client IO dropped from an average of about 500 op/s to 10-20 op/s. All vms started hitting hung tasks shortly after that, and eventually most of them crashed. A few minutes later, the status had changed to:
2014-11-13 20:59:19.857378 mon.0 192.168.168.13:6789/0 1210590 : cluster [INF] pgmap v39509547: 4412 pgs: 1 active+recovering+undersized+degraded, 3268 active+clean, 19 active+degraded, 7 active+recovery_wait+degraded, 959 active+undersized+degraded, 7 active+recovery_wait+undersized+degraded, 150 down+peering, 1 active+recovering+degraded; 10245 GB data, 20570 GB used, 26853 GB / 47423 GB avail; 0 B/s rd, 546 kB/s wr, 103 op/s; 661943/5994302 objects degraded (11.043%); 10/2997151 unfound (0.000%); 53620 kB/s, 1 keys/s, 13 objects/s recovering
It stayed pretty much like that for over an hour without changing much, and IO was still unavailable.
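If it helps with debugging, next time I can collect per-PG details for the down+peering PGs; I assume the right commands are along these lines (the pg id is just a placeholder, not a real one from my cluster):

ceph health detail | grep down
ceph pg dump_stuck inactive
ceph pg 5.9fe query     # placeholder pgid - the recovery_state section should show what the pg is blocked on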
Looking at the logs of the osds that were marked down, I don't really see much apart from some slow requests and heartbeat_check failures just before the osds were reported down:
2014-11-13 20:56:10.315696 7fd9645be700 0 log_channel(default) log [WRN] : slow request 31.043640 seconds old, received at 2014-11-13 20:55:39.271942: osd_op(client.46407343.0:3839273 rbd_data.1f0a392ae8944a.0000000000000621 [set-alloc-hint object_size 4194304 write_size 4194304,write 3526656~4096] 5.9fe0f0f2 ack+ondisk+write e64096) currently waiting for subops from 1
2014-11-13 20:56:10.467380 7fd941a75700 -1 osd.12 64099 heartbeat_check: no reply from osd.4 since back 2014-11-13 20:55:35.958627 front 2014-11-13 20:55:35.958627 (cutoff 2014-11-13 20:55:50.467379)
2014-11-13 20:56:10.467389 7fd941a75700 -1 osd.12 64099 heartbeat_check: no reply from osd.5 since back 2014-11-13 20:55:38.259184 front 2014-11-13 20:55:35.958627 (cutoff 2014-11-13 20:55:50.467379)
2014-11-13 20:56:10.467409 7fd941a75700 -1 osd.12 64099 heartbeat_check: no reply from osd.6 since back 2014-11-13 20:55:35.958627 front 2014-11-13 20:55:35.958627 (cutoff 2014-11-13 20:55:50.467379)
2014-11-13 20:56:10.467427 7fd941a75700 -1 osd.12 64099 heartbeat_check: no reply from osd.8 since back 2014-11-13 20:55:35.958627 front 2014-11-13 20:55:35.958627 (cutoff 2014-11-13 20:55:50.467379)
2014-11-13 20:56:10.967963 7fd941a75700 -1 osd.12 64099 heartbeat_check: no reply from osd.4 since back 2014-11-13 20:55:35.958627 front 2014-11-13 20:55:35.958627 (cutoff 2014-11-13 20:55:50.967962)
2014-11-13 20:56:10.967971 7fd941a75700 -1 osd.12 64099 heartbeat_check: no reply from osd.5 since back 2014-11-13 20:55:38.259184 front 2014-11-13 20:55:35.958627 (cutoff 2014-11-13 20:55:50.967962)
2014-11-13 20:56:10.967978 7fd941a75700 -1 osd.12 64099 heartbeat_check: no reply from osd.6 since back 2014-11-13 20:55:35.958627 front 2014-11-13 20:55:35.958627 (cutoff 2014-11-13 20:55:50.967962)
2014-11-13 20:56:10.967984 7fd941a75700 -1 osd.12 64099 heartbeat_check: no reply from osd.8 since back 2014-11-13 20:55:35.958627 front 2014-11-13 20:55:35.958627 (cutoff 2014-11-13 20:55:50.967962)
2014-11-13 20:56:11.317046 7fd9645be700 -1 osd.12 64100 heartbeat_check: no reply from osd.4 since back 2014-11-13 20:55:35.958627 front 2014-11-13 20:55:35.958627 (cutoff 2014-11-13 20:55:51.317043)
2014-11-13 20:56:11.317057 7fd9645be700 -1 osd.12 64100 heartbeat_check: no reply from osd.5 since back 2014-11-13 20:55:38.259184 front 2014-11-13 20:55:35.958627 (cutoff 2014-11-13 20:55:51.317043)
2014-11-13 20:56:11.317064 7fd9645be700 -1 osd.12 64100 heartbeat_check: no reply from osd.6 since back 2014-11-13 20:55:35.958627 front 2014-11-13 20:56:10.967590 (cutoff 2014-11-13 20:55:51.317043)
2014-11-13 20:56:11.317070 7fd9645be700 -1 osd.12 64100 heartbeat_check: no reply from osd.8 since back 2014-11-13 20:55:38.259184 front 2014-11-13 20:55:35.958627 (cutoff 2014-11-13 20:55:51.317043)
2014-11-13 20:56:11.317167 7fd9645be700 0 log_channel(default) log [WRN] : 17 slow requests, 7 included below; oldest blocked for > 32.285501 secs
2014-11-13 20:56:11.317173 7fd9645be700 0 log_channel(default) log [WRN] : slow request 32.285501 seconds old, received at 2014-11-13 20:55:39.031582: osd_op(client.46379104.0:182402 rbd_data.2c26ec1333ab105.000000000000015a [stat,set-alloc-hint object_size 4194304 write_size 4194304,write 3956736~1024] 5.e9450c8d ack+ondisk+write e64096) currently waiting for subops from 4
2014-11-13 20:56:11.317179 7fd9645be700 0 log_channel(default) log [WRN] : slow request 32.047721 seconds old, received at 2014-11-13 20:55:39.269362: osd_op(client.46407343.0:3839258 rbd_data.1f0a392ae8944a.0000000000000621 [set-alloc-hint object_size 4194304 write_size 4194304,write 1568768~8192] 5.9fe0f0f2 ack+ondisk+write e64096) currently waiting for subops from 1
2014-11-13 20:56:11.317183 7fd9645be700 0 log_channel(default) log [WRN] : slow request 32.047396 seconds old, received at 2014-11-13 20:55:39.269687: osd_op(client.46407343.0:3839259 rbd_data.1f0a392ae8944a.0000000000000621 [set-alloc-hint object_size 4194304 write_size 4194304,write 1773568~4096] 5.9fe0f0f2 ack+ondisk+write e64096) currently waiting for subops from 1
2014-11-13 20:56:11.317187 7fd9645be700 0 log_channel(default) log [WRN] : slow request 32.047229 seconds old, received at 2014-11-13 20:55:39.269854: osd_op(client.46407343.0:3839260 rbd_data.1f0a392ae8944a.0000000000000621 [set-alloc-hint object_size 4194304 write_size 4194304,write 1921024~4096] 5.9fe0f0f2 ack+ondisk+write e64096) currently waiting for subops from 1
2014-11-13 20:56:11.317232 7fd9645be700 0 log_channel(default) log [WRN] : slow request 32.047069 seconds old, received at 2014-11-13 20:55:39.270014: osd_op(client.46407343.0:3839261 rbd_data.1f0a392ae8944a.0000000000000621 [set-alloc-hint object_size 4194304 write_size 4194304,write 1949696~4096] 5.9fe0f0f2 ack+ondisk+write e64096) currently waiting for subops from 1
2014-11-13 20:56:37.954923 7fd92c055700 0 -- 192.168.168.201:6812/10278 >> 192.168.168.200:6820/56026102 pipe(0x2b369b00 sd=381 :6812 s=2 pgs=205156 cs=1 l=0 c=0x1b29d420).fault with nothing to send, going to standby
2014-11-13 20:56:37.955547 7fd92bb50700 0 -- 192.168.168.201:6812/10278 >> 192.168.168.200:6810/72026170 pipe(0x2ad35580 sd=238 :6812 s=2 pgs=141844 cs=1 l=0 c=0xf33c420).fault with nothing to send, going to standby
2014-11-13 20:56:37.956392 7fd935deb700 0 -- 192.168.168.201:6812/10278 >> 192.168.168.200:6807/70026130 pipe(0x2e2268c0 sd=408 :6812 s=2 pgs=278536 cs=1 l=0 c=0x142f7420).fault with nothing to send, going to standby
2014-11-13 20:56:37.956695 7fd93e86d700 0 -- 192.168.168.201:6812/10278 >> 192.168.168.200:6849/1026289 pipe(0x29b4c840 sd=129 :41140 s=2 pgs=591 cs=1 l=0 c=0x29b58b00).fault with nothing to send, going to standby
2014-11-13 20:56:37.958245 7fd93e56a700 0 -- 192.168.168.201:6812/10278 >> 192.168.168.200:6803/1026306 pipe(0x29bb7600 sd=154 :43005 s=2 pgs=117 cs=1 l=0 c=0x29be4840).fault with nothing to send, going to standby
2014-11-13 20:56:37.969506 7fd93e368700 0 -- 192.168.168.201:6812/10278 >> 192.168.168.200:6824/26238 pipe(0x29c98600 sd=146 :60953 s=2 pgs=148 cs=1 l=0 c=0x29cd0840).fault with nothing to send, going to standby
2014-11-13 20:56:37.969620 7fd92a439700 0 -- 192.168.168.201:6812/10278 >> 192.168.168.201:6848/26784 pipe(0x2c5fcb00 sd=231 :42835 s=2 pgs=151539 cs=7 l=0 c=0x13c32580).fault with nothing to send, going to standby
2014-11-13 20:56:37.971391 7fd925df3700 0 -- 192.168.168.201:6812/10278 >> 192.168.168.200:6831/62026114 pipe(0x2d93edc0 sd=21 :6812 s=2 pgs=196755 cs=1 l=0 c=0x173c82c0).fault with nothing to send, going to standby
2014-11-13 20:56:37.980660 7fd93e267700 0 -- 192.168.168.201:6812/10278 >> 192.168.168.200:6821/18026333 pipe(0x2a13c2c0 sd=122 :41160 s=2 pgs=37077 cs=1 l=0 c=0x2a9ffdc0).fault with nothing to send, going to standby
This was the last log entry for osd.12 until it was restarted hours later. The logs of osd.16 show a very similar, pretty much identical, picture.
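Next time it happens I will try to grab some state from the hung daemon before restarting it, assuming its admin socket still responds; something along these lines:

ceph daemon osd.12 dump_ops_in_flight
ceph daemon osd.12 dump_historic_ops
ceph daemon osd.12 config show | grep heartbeat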
Does anyone have any idea what is causing these problems with the Giant release? How do I fix them? It is currently very unstable for something that is meant to be production ready.
Many thanks for your help
Andrei