Re: Giant osd problems - loss of IO

Any other suggestions as to why several osds keep going down on Giant and causing IO to stall? This was not happening on Firefly.

Thanks


On 11/14/2014 01:50 PM, Andrei Mikhailovsky wrote:
> Wido, I've not done any changes from the default settings. There are no firewalls between the ceph cluster members and I do not see a great deal of network related errors either. There is a tiny number of TX errors on the network interface, which accounts for 0.0001% of the total packets.
>
> Regarding debug_ms, do I set this on the ceph-mon side, on all osds, or both?
>

> On the OSDs, it might explain why the heartbeats are failing.

Wido,

I have set debug_ms to 10 on all osds; however, the disk usage quickly increased and consumed about 5 GB of space in a matter of 10 minutes. I am no longer able to keep that debug level, as I will quickly run out of space on the partition. From what I can see, 5 osds have generated log files of over 1 GB each, while the rest have only generated about 200-300 MB.
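
For reference, I set it and dropped it back at runtime, roughly along these lines (just a sketch - assuming the injectargs interface; the osd.* wildcard may need replacing with individual osd ids on older clients):

ceph tell osd.* injectargs '--debug_ms 10'    # raise messenger debugging on all osds (assumption: wildcard targeting works here)
ceph tell osd.* injectargs '--debug_ms 0/0'   # drop it back once enough has been captured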

Is there anything in particular I should be looking for to determine the cause of the heartbeat failures?
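
So far I have mostly just been grepping the osd logs for the obvious strings, roughly like this (a sketch only, assuming the default log location under /var/log/ceph):

grep -E 'heartbeat_check|wrongly marked me down|fault' /var/log/ceph/ceph-osd.*.log    # rough filter; adjust patterns and paths as needed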

Thanks





> From: "Wido den Hollander" <wido@xxxxxxxx>
> To: ceph-users@xxxxxxxxxxxxxx
> Sent: Friday, 14 November, 2014 12:30:02 PM
> Subject: Re: Giant osd problems - loss of IO
>
> On 14-11-14 13:26, Andrei Mikhailovsky wrote:
>> Hello guys,
>>
>> Since upgrading my cluster to Giant from the previous stable release I
>> started having massive problems with client IO. I did the upgrade 2
>> weeks ago and since then the IO on my ceph cluster has become
>> unavailable 3 times already.
>>
>> Quick info on my storage cluster - 3 mons, 2 osd servers, 17 osds, 4
>> ssds for journals. All running Giant release from ceph repos on Ubuntu
>> 12.04 with the latest updates.
>>
>> Every time I experience a very similar problem: two out of 17 osds are
>> shown as down even though the corresponding ceph-osd processes are
>> running. Somehow, the downing of two osds brings down the IO in the
>> cluster. This did not happen before upgrading to Giant. It seems that
>> every time different osds go down, but it always happens on the same
>> osd server. The latest occurrence was tonight. ceph -s showed the
>> following:
>>
>> root@arh-ibstorage1-ib:~# ceph -s
>> cluster 51e9f641-372e-44ec-92a4-b9fe55cbf9fe
>> health HEALTH_WARN 1001 pgs degraded; 150 pgs down; 150 pgs
>> peering; 1 pgs recovering; 41 pgs recovery_wait; 1001 pgs stuck
>> degraded; 150 pgs stuck inactive; 1151 pgs stuck unclean; 967 pgs stuck
>> undersized; 967 pgs undersized; 203 requests are blocked > 32 sec;
>> recovery 661954/5994334 objects degraded (11.043%); 10/2997167 unfound
>> (0.000%); 2/17 in osds are down; noout flag(s) set
>> monmap e11: 3 mons at
>> {arh-cloud13-ib=192.168.168.13:6789/0,arh-ibstorage1-ib=192.168.168.200:6789/0,arh-ibstorage2-ib=192.168.168.201:6789/0},
>> election epoch 66832, quorum 0,1,2
>> arh-cloud13-ib,arh-ibstorage1-ib,arh-ibstorage2-ib
>> mdsmap e2709: 1/1/1 up {0=arh-ibstorage2-ib=up:active}, 1 up:standby
>> osdmap e64374: 17 osds: 15 up, 17 in
>> flags noout
>> pgmap v39540041: 4412 pgs, 17 pools, 10245 GB data, 2926 kobjects
>> 20570 GB used, 26853 GB / 47423 GB avail
>> 661954/5994334 objects degraded (11.043%); 10/2997167
>> unfound (0.000%)
>> 1 active+recovering+undersized+degraded
>> 3261 active+clean
>> 34 active+recovery_wait+degraded
>> 959 active+undersized+degraded
>> 7 active+recovery_wait+undersized+degraded
>> 150 down+peering
>> client io 0 B/s rd, 13805 B/s wr, 4 op/s
>>
>>
>> "ceph osd tree" shows:
>>
>> root@arh-ibstorage1-ib:~# ceph osd tree
>> # id weight type name up/down reweight
>> -10 1.8 root uk-ssd
>> -9 1.8 datacenter ldex-ssd
>> -12 1.8 room ldex-dc3-ssd
>> -14 1.8 row row-a-ssd
>> -8 1.8 rack ldex-rack-a5-ssd
>> -6 0.9 host arh-ibstorage1-ib-ssd
>> -7 0.9 host arh-ibstorage2-ib-ssd
>> -1 46.41 root uk
>> -5 46.41 datacenter ldex
>> -11 46.41 room ldex-dc3
>> -13 46.41 row row-a
>> -4 46.41 rack ldex-rack-a5
>> -2 24.57 host arh-ibstorage1-ib
>> 1 2.73 osd.1 up 1
>> 3 2.73 osd.3 up 1
>> 5 2.73 osd.5 up 1
>> 6 2.73 osd.6 up 1
>> 7 2.73 osd.7 up 1
>> 8 2.73 osd.8 up 1
>> 4 2.73 osd.4 up 1
>> 0 2.73 osd.0 up 1
>> 2 2.73 osd.2 up 1
>> -3 21.84 host arh-ibstorage2-ib
>> 9 2.73 osd.9 up 1
>> 10 2.73 osd.10 up 1
>> 11 2.73 osd.11 up 1
>> 12 2.73 osd.12 down 1
>> 13 2.73 osd.13 up 1
>> 14 2.73 osd.14 up 1
>> 15 2.73 osd.15 up 1
>> 16 2.73 osd.16 down 1
>>
>>
>> From the ceph -s output it seems that some PGs were down and peering.
>> There was no change in the ceph -s output for over an hour. The client
>> IO activity ground to a halt and all vms became unresponsive and
>> eventually crashed. The cluster was unusable until I manually stopped
>> osd.12 and osd.16 and restarted them. This solved the problem.
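>>
>> (For reference, I stopped and started them roughly like this - just a
>> sketch, assuming the upstart jobs that the Ubuntu ceph packages install;
>> a sysvinit setup would use "service ceph restart osd.12" instead:
>>
>> stop ceph-osd id=12 && start ceph-osd id=12    # assumption: upstart-managed osds
>> stop ceph-osd id=16 && start ceph-osd id=16
>> )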
>>
>> This previously happened about four days ago with osds 11 and 15, and
>> it also happened about 10 days ago with different osds. So, having lost
>> the storage IO three times in the last two weeks is a big deal.
>>
>> Looking at the ceph logs I've discovered that since performing the
>> upgrade I have seen far more messages reporting osds being down.
>> Different osds are reported as being down and they come back up after
>> about 5-30 seconds, perform recovery and the cluster is back to normal
>> (apart from the three times which actually brought the cluster io down).
>>
>> Furthermore, a few minutes before osd.12 and osd.16 went down, I saw
>> 2 PGs being deep scrubbed while osd.7 was reported down by 3 peers:
>>
>>
>>
>> 2014-11-13 20:55:36.608502 mon.0 192.168.168.13:6789/0 1210074 : cluster
>> [INF] pgmap v39509396: 4412 pgs: 4410 active+clean, 2
>> active+clean+scrubbing+deep; 10245 GB data, 20570 GB used, 26853 GB /
>> 47423 GB avail; 10005 kB/s rd, 3486 kB/s wr, 310 op/s
>> 2014-11-13 20:56:10.056933 mon.0 192.168.168.13:6789/0 1210209 : cluster
>> [INF] pgmap v39509409: 4412 pgs: 3380 active+clean, 2
>> active+clean+scrubbing+deep, 1030 active+undersized+degraded; 10245 GB
>> data, 20570 GB used, 26853 GB / 47423 GB avail; 112 kB/s rd, 122 kB/s
>> wr, 37 op/s; 662204/5994308 objects degraded (11.047%)
>> 2014-11-13 20:56:10.073317 mon.0 192.168.168.13:6789/0 1210210 : cluster
>> [INF] osd.7 192.168.168.200:6838/26321 boot
>> 2014-11-13 20:56:10.073461 mon.0 192.168.168.13:6789/0 1210211 : cluster
>> [INF] osd.3 192.168.168.200:6818/26170 boot
>> 2014-11-13 20:56:10.074415 mon.0 192.168.168.13:6789/0 1210212 : cluster
>> [INF] osdmap e64099: 17 osds: 17 up, 17 in
>> 2014-11-13 20:56:10.109650 mon.0 192.168.168.13:6789/0 1210213 : cluster
>> [INF] pgmap v39509410: 4412 pgs: 3380 active+clean, 2
>> active+clean+scrubbing+deep, 1030 active+undersized+degraded; 10245 GB
>> data, 20570 GB used, 26853 GB / 47423 GB avail; 97137 B/s rd, 220 kB/s
>> wr, 36 op/s; 662204/5994308 objects degraded (11.047%)
>> 2014-11-13 20:56:11.082817 mon.0 192.168.168.13:6789/0 1210214 : cluster
>> [INF] osdmap e64100: 17 osds: 17 up, 17 in
>> 2014-11-13 20:56:11.741135 mon.1 192.168.168.200:6789/0 520573 : cluster
>> [INF] mon.arh-ibstorage1-ib calling new monitor election
>> 2014-11-13 20:56:11.744641 mon.2 192.168.168.201:6789/0 521421 : cluster
>> [INF] mon.arh-ibstorage2-ib calling new monitor election
>> 2014-11-13 20:56:11.745988 mon.0 192.168.168.13:6789/0 1210248 : cluster
>> [INF] mon.arh-cloud13-ib calling new monitor election
>> 2014-11-13 20:56:11.754388 mon.0 192.168.168.13:6789/0 1210249 : cluster
>> [INF] mon.arh-cloud13-ib@0 won leader election with quorum 0,1,2
>> 2014-11-13 20:56:11.763975 mon.0 192.168.168.13:6789/0 1210250 : cluster
>> [INF] monmap e11: 3 mons at
>> {arh-cloud13-ib=192.168.168.13:6789/0,arh-ibstorage1-ib=192.168.168.200:6789/0,arh-ibstorage2-ib=192.168.168.201:6789/0}
>> 2014-11-13 20:56:11.764069 mon.0 192.168.168.13:6789/0 1210251 : cluster
>> [INF] pgmap v39509411: 4412 pgs: 3380 active+clean, 2
>> active+clean+scrubbing+deep, 1030 active+undersized+degraded; 10245 GB
>> data, 20570 GB used, 26853 GB / 47423 GB avail; 662204/5994308 objects
>> degraded (11.047%)
>> 2014-11-13 20:56:11.764147 mon.0 192.168.168.13:6789/0 1210252 : cluster
>> [INF] mdsmap e2705: 1/1/1 up {0=arh-ibstorage2-ib=up:active}, 1 up:standby
>> 2014-11-13 20:56:11.764283 mon.0 192.168.168.13:6789/0 1210253 : cluster
>> [INF] osdmap e64100: 17 osds: 17 up, 17 in
>> 2014-11-13 20:56:11.764859 mon.0 192.168.168.13:6789/0 1210257 : cluster
>> [INF] osd.7 192.168.168.200:6838/26321 failed (3 reports from 3 peers
>> after 32.005619 >= grace 25.962628)
>> 2014-11-13 20:56:12.721171 mon.0 192.168.168.13:6789/0 1210262 : cluster
>> [INF] osdmap e64101: 17 osds: 16 up, 17 in
>> 2014-11-13 20:56:13.294162 mon.0 192.168.168.13:6789/0 1210263 : cluster
>> [INF] pgmap v39509413: 4412 pgs: 3380 active+clean, 2
>> active+clean+scrubbing+deep, 1030 active+undersized+degraded; 10245 GB
>> data, 20570 GB used, 26853 GB / 47423 GB avail; 1150 kB/s rd, 561 kB/s
>> wr, 193 op/s; 662204/5994308 objects degraded (11.047%)
>> 2014-11-13 20:56:14.325109 mon.0 192.168.168.13:6789/0 1210264 : cluster
>> [INF] osdmap e64102: 17 osds: 16 up, 17 in
>> 2014-11-13 20:56:14.357819 mon.0 192.168.168.13:6789/0 1210265 : cluster
>> [INF] pgmap v39509414: 4412 pgs: 3380 active+clean, 2
>> active+clean+scrubbing+deep, 1030 active+undersized+degraded; 10245 GB
>> data, 20570 GB used, 26853 GB / 47423 GB avail; 662204/5994308 objects
>> degraded (11.047%)
>> 2014-11-13 20:56:15.442751 mon.0 192.168.168.13:6789/0 1210266 : cluster
>> [INF] pgmap v39509415: 4412 pgs: 3685 active+clean, 2
>> active+clean+scrubbing+deep, 18 active+recovery_wait+degraded, 707
>> active+undersized+degraded; 10245 GB data, 20570 GB used, 26853 GB /
>> 47423 GB avail; 28730 kB/s rd, 1087 kB/s wr, 590 op/s; 466784/5994308
>> objects degraded (7.787%); 499 MB/s, 3 keys/s, 237 objects/s recovering
>> 2014-11-13 20:56:16.364658 mon.0 192.168.168.13:6789/0 1210268 : cluster
>> [INF] osd.7 192.168.168.200:6838/26321 boot
>> 2014-11-13 20:56:16.366310 mon.0 192.168.168.13:6789/0 1210269 : cluster
>> [INF] osdmap e64103: 17 osds: 17 up, 17 in
>> 2014-11-13 20:56:16.380120 mon.0 192.168.168.13:6789/0 1210270 : cluster
>> [INF] pgmap v39509416: 4412 pgs: 3685 active+clean, 2
>> active+clean+scrubbing+deep, 18 active+recovery_wait+degraded, 707
>> active+undersized+degraded; 10245 GB data, 20570 GB used, 26853 GB /
>> 47423 GB avail; 33165 kB/s rd, 1255 kB/s wr, 682 op/s; 466784/5994308
>> objects degraded (7.787%); 576 MB/s, 3 keys/s, 274 objects/s recovering
>> 2014-11-13 20:56:17.400859 mon.0 192.168.168.13:6789/0 1210271 : cluster
>> [INF] osdmap e64104: 17 osds: 17 up, 17 in
>>
>>
>>
>> So, for the next 20 seconds it was scrubbing, until ceph reported that
>> osd.12 and osd.16 were down:
>>
>>
>>
>> 2014-11-13 20:56:36.712346 mon.0 192.168.168.13:6789/0 1210327 : cluster
>> [INF] osd.12 192.168.168.201:6810/10278 failed (11 reports from 9 peers
>> after 24.080479 >= grace 23.026948)
>> 2014-11-13 20:56:36.771124 mon.0 192.168.168.13:6789/0 1210330 : cluster
>> [INF] osd.15 192.168.168.201:6801/4482 failed (10 reports from 8 peers
>> after 24.118287 >= grace 23.285583)
>> 2014-11-13 20:56:37.920306 mon.0 192.168.168.13:6789/0 1210345 : cluster
>> [INF] pgmap v39509434: 4412 pgs: 122 peering, 4085 active+clean, 2
>> active+clean+scrubbing+deep, 2 active+degraded, 16
>> active+recovery_wait+degraded, 185 active+undersized+degraded; 10245 GB
>> data, 20570 GB used, 26853 GB / 47423 GB avail; 13943 B/s rd, 506 kB/s
>> wr, 86 op/s; 118552/5994302 objects degraded (1.978%); 666 kB/s, 0
>> objects/s recovering
>> 2014-11-13 20:56:37.963713 mon.0 192.168.168.13:6789/0 1210346 : cluster
>> [INF] osdmap e64106: 17 osds: 15 up, 17 in
>> 2014-11-13 20:56:38.458852 mon.0 192.168.168.13:6789/0 1210347 : cluster
>> [INF] pgmap v39509435: 4412 pgs: 377 stale+active+clean, 122 peering,
>> 3708 active+clean, 119 stale+active+undersized+degraded, 2
>> active+clean+scrubbing+deep, 2 active+degraded, 16
>> active+recovery_wait+degraded, 66 active+undersized+degraded; 10245 GB
>> data, 20570 GB used, 26853 GB / 47423 GB avail; 20510 B/s rd, 430 kB/s
>> wr, 76 op/s; 118552/5994302 objects degraded (1.978%)
>> 2014-11-13 20:56:39.195535 mon.0 192.168.168.13:6789/0 1210349 : cluster
>> [INF] osdmap e64107: 17 osds: 15 up, 17 in
>> 2014-11-13 20:56:39.231384 mon.0 192.168.168.13:6789/0 1210350 : cluster
>> [INF] pgmap v39509436: 4412 pgs: 377 stale+active+clean, 122 peering,
>> 3708 active+clean, 119 stale+active+undersized+degraded, 2
>> active+clean+scrubbing+deep, 2 active+degraded, 16
>> active+recovery_wait+degraded, 66 active+undersized+degraded; 10245 GB
>> data, 20570 GB used, 26853 GB / 47423 GB avail; 118552/5994302 objects
>> degraded (1.978%)
>> 2014-11-13 20:56:41.883606 mon.0 192.168.168.13:6789/0 1210352 : cluster
>> [INF] pgmap v39509437: 4412 pgs: 377 stale+active+clean, 122 peering,
>> 3708 active+clean, 119 stale+active+undersized+degraded, 2
>> active+clean+scrubbing+deep, 1 active+degraded, 17
>> active+recovery_wait+degraded, 66 active+undersized+degraded; 10245 GB
>> data, 20570 GB used, 26853 GB / 47423 GB avail; 14341 B/s rd, 144 kB/s
>> wr, 5 op/s; 118552/5994302 objects degraded (1.978%)
>> 2014-11-13 20:56:43.617030 mon.0 192.168.168.13:6789/0 1210359 : cluster
>> [INF] pgmap v39509438: 4412 pgs: 377 stale+active+clean, 122 peering,
>> 3708 active+clean, 119 stale+active+undersized+degraded, 2
>> active+clean+scrubbing+deep, 1 active+degraded, 17
>> active+recovery_wait+degraded, 66 active+undersized+degraded; 10245 GB
>> data, 20570 GB used, 26853 GB / 47423 GB avail; 718 kB/s rd, 215 kB/s
>> wr, 46 op/s; 118552/5994302 objects degraded (1.978%)
>> 2014-11-13 20:56:45.408475 mon.0 192.168.168.13:6789/0 1210365 : cluster
>> [INF] pgmap v39509439: 4412 pgs: 46 peering, 3218 active+clean, 1
>> active+clean+scrubbing+deep, 1 active+degraded, 12
>> active+recovery_wait+degraded, 1007 active+undersized+degraded, 1
>> active+recovery_wait+undersized+degraded, 126 down+peering; 10245 GB
>> data, 20570 GB used, 26853 GB / 47423 GB avail; 1853 kB/s wr, 300 op/s;
>> 696740/5994304 objects degraded (11.623%); 8/2997152 unfound (0.000%)
>> 2014-11-13 20:56:46.517294 mon.0 192.168.168.13:6789/0 1210366 : cluster
>> [INF] pgmap v39509440: 4412 pgs: 46 peering, 3222 active+clean, 1
>> active+clean+scrubbing+deep, 9 active+recovery_wait+degraded, 1007
>> active+undersized+degraded, 1 active+recovery_wait+undersized+degraded,
>> 126 down+peering; 10245 GB data, 20570 GB used, 26853 GB / 47423 GB
>> avail; 2242 kB/s wr, 358 op/s; 696737/5994304 objects degraded
>> (11.623%); 8/2997152 unfound (0.000%)
>>
>>
>>
>> From this point on the client IO went down from an average of about
>> 500 op/s to 10-20 op/s. All vms started having hung tasks shortly after
>> that and eventually most of them crashed. A few more seconds passed and
>> the status changed to:
>>
>> 2014-11-13 20:59:19.857378 mon.0 192.168.168.13:6789/0 1210590 : cluster
>> [INF] pgmap v39509547: 4412 pgs: 1
>> active+recovering+undersized+degraded, 3268 active+clean, 19
>> active+degraded, 7 active+recovery_wait+degraded, 959
>> active+undersized+degraded, 7 active+recovery_wait+undersized+degraded,
>> 150 down+peering, 1 active+recovering+degraded; 10245 GB data, 20570 GB
>> used, 26853 GB / 47423 GB avail; 0 B/s rd, 546 kB/s wr, 103 op/s;
>> 661943/5994302 objects degraded (11.043%); 10/2997151 unfound (0.000%);
>> 53620 kB/s, 1 keys/s, 13 objects/s recovering
>>
>>
>> And pretty much stayed like that for over an hour without changing much.
>> The IO was still unavailable.
>>
>> Looking at the logs of the osds that were shown down, I don't really see
>> much apart from some slow requests prior to the osds being reported down:
>>
>>
>> 2014-11-13 20:56:10.315696 7fd9645be700 0 log_channel(default) log [WRN] : slow request 31.043640 seconds old, received at 2014-11-13 20:55:39.271942: osd_op(client.46407343.0:3839273 rbd_data.1f0a392ae8944a.0000000000000621 [set-alloc-hint object_size 4194304 write_size 4194304,write 3526656~4096] 5.9fe0f0f2 ack+ondisk+write e64096) currently waiting for subops from 1
>> 2014-11-13 20:56:10.467380 7fd941a75700 -1 osd.12 64099 heartbeat_check: no reply from osd.4 since back 2014-11-13 20:55:35.958627 front 2014-11-13 20:55:35.958627 (cutoff 2014-11-13 20:55:50.467379)
>> 2014-11-13 20:56:10.467389 7fd941a75700 -1 osd.12 64099 heartbeat_check: no reply from osd.5 since back 2014-11-13 20:55:38.259184 front 2014-11-13 20:55:35.958627 (cutoff 2014-11-13 20:55:50.467379)
>> 2014-11-13 20:56:10.467409 7fd941a75700 -1 osd.12 64099 heartbeat_check: no reply from osd.6 since back 2014-11-13 20:55:35.958627 front 2014-11-13 20:55:35.958627 (cutoff 2014-11-13 20:55:50.467379)
>> 2014-11-13 20:56:10.467427 7fd941a75700 -1 osd.12 64099 heartbeat_check: no reply from osd.8 since back 2014-11-13 20:55:35.958627 front 2014-11-13 20:55:35.958627 (cutoff 2014-11-13 20:55:50.467379)
>> 2014-11-13 20:56:10.967963 7fd941a75700 -1 osd.12 64099 heartbeat_check: no reply from osd.4 since back 2014-11-13 20:55:35.958627 front 2014-11-13 20:55:35.958627 (cutoff 2014-11-13 20:55:50.967962)
>> 2014-11-13 20:56:10.967971 7fd941a75700 -1 osd.12 64099 heartbeat_check: no reply from osd.5 since back 2014-11-13 20:55:38.259184 front 2014-11-13 20:55:35.958627 (cutoff 2014-11-13 20:55:50.967962)
>> 2014-11-13 20:56:10.967978 7fd941a75700 -1 osd.12 64099 heartbeat_check: no reply from osd.6 since back 2014-11-13 20:55:35.958627 front 2014-11-13 20:55:35.958627 (cutoff 2014-11-13 20:55:50.967962)
>> 2014-11-13 20:56:10.967984 7fd941a75700 -1 osd.12 64099 heartbeat_check: no reply from osd.8 since back 2014-11-13 20:55:35.958627 front 2014-11-13 20:55:35.958627 (cutoff 2014-11-13 20:55:50.967962)
>> 2014-11-13 20:56:11.317046 7fd9645be700 -1 osd.12 64100 heartbeat_check:
>> no reply from osd.4 since back 2014-11-13 20:55:35.958627 front
>> 2014-11-13 20:55:35.958627 (cutoff 2014-11-13 20:55:51.317043)
>> 2014-11-13 20:56:11.317057 7fd9645be700 -1 osd.12 64100 heartbeat_check:
>> no reply from osd.5 since back 2014-11-13 20:55:38.259184 front
>> 2014-11-13 20:55:35.958627 (cutoff 2014-11-13 20:55:51.317043)
>> 2014-11-13 20:56:11.317064 7fd9645be700 -1 osd.12 64100 heartbeat_check:
>> no reply from osd.6 since back 2014-11-13 20:55:35.958627 front
>> 2014-11-13 20:56:10.967590 (cutoff 2014-11-13 20:55:51.317043)
>> 2014-11-13 20:56:11.317070 7fd9645be700 -1 osd.12 64100 heartbeat_check:
>> no reply from osd.8 since back 2014-11-13 20:55:38.259184 front
>> 2014-11-13 20:55:35.958627 (cutoff 2014-11-13 20:55:51.317043)
>> 2014-11-13 20:56:11.317167 7fd9645be700 0 log_channel(default) log
>> [WRN] : 17 slow requests, 7 included below; oldest blocked for >
>> 32.285501 secs
>> 2014-11-13 20:56:11.317173 7fd9645be700 0 log_channel(default) log
>> [WRN] : slow request 32.285501 seconds old, received at 2014-11-13
>> 20:55:39.031582: osd_op(client.46379104.0:182402
>> rbd_data.2c26ec1333ab105.000000000000015a [stat,set-alloc-hint
>> object_size 4194304 write_size 4194304,write 3956736~1024] 5.e9450c8d
>> ack+ondisk+write e64096) currently waiting for subops from 4
>> 2014-11-13 20:56:11.317179 7fd9645be700 0 log_channel(default) log
>> [WRN] : slow request 32.047721 seconds old, received at 2014-11-13
>> 20:55:39.269362: osd_op(client.46407343.0:3839258
>> rbd_data.1f0a392ae8944a.0000000000000621 [set-alloc-hint object_size
>> 4194304 write_size 4194304,write 1568768~8192] 5.9fe0f0f2
>> ack+ondisk+write e64096) currently waiting for subops from 1
>> 2014-11-13 20:56:11.317183 7fd9645be700 0 log_channel(default) log
>> [WRN] : slow request 32.047396 seconds old, received at 2014-11-13
>> 20:55:39.269687: osd_op(client.46407343.0:3839259
>> rbd_data.1f0a392ae8944a.0000000000000621 [set-alloc-hint object_size
>> 4194304 write_size 4194304,write 1773568~4096] 5.9fe0f0f2
>> ack+ondisk+write e64096) currently waiting for subops from 1
>> 2014-11-13 20:56:11.317187 7fd9645be700 0 log_channel(default) log
>> [WRN] : slow request 32.047229 seconds old, received at 2014-11-13
>> 20:55:39.269854: osd_op(client.46407343.0:3839260
>> rbd_data.1f0a392ae8944a.0000000000000621 [set-alloc-hint object_size
>> 4194304 write_size 4194304,write 1921024~4096] 5.9fe0f0f2
>> ack+ondisk+write e64096) currently waiting for subops from 1
>> 2014-11-13 20:56:11.317232 7fd9645be700 0 log_channel(default) log
>> [WRN] : slow request 32.047069 seconds old, received at 2014-11-13
>> 20:55:39.270014: osd_op(client.46407343.0:3839261
>> rbd_data.1f0a392ae8944a.0000000000000621 [set-alloc-hint object_size
>> 4194304 write_size 4194304,write 1949696~4096] 5.9fe0f0f2
>> ack+ondisk+write e64096) currently waiting for subops from 1
>> 2014-11-13 20:56:37.954923 7fd92c055700 0 -- 192.168.168.201:6812/10278
>>>> 192.168.168.200:6820/56026102 pipe(0x2b369b00 sd=381 :6812 s=2
>> pgs=205156 cs=1 l=0 c=0x1b29d420).fault with nothing to send, going to
>> standby
>> 2014-11-13 20:56:37.955547 7fd92bb50700 0 -- 192.168.168.201:6812/10278
>>>> 192.168.168.200:6810/72026170 pipe(0x2ad35580 sd=238 :6812 s=2
>> pgs=141844 cs=1 l=0 c=0xf33c420).fault with nothing to send, going to
>> standby
>> 2014-11-13 20:56:37.956392 7fd935deb700 0 -- 192.168.168.201:6812/10278
>>>> 192.168.168.200:6807/70026130 pipe(0x2e2268c0 sd=408 :6812 s=2
>> pgs=278536 cs=1 l=0 c=0x142f7420).fault with nothing to send, going to
>> standby
>> 2014-11-13 20:56:37.956695 7fd93e86d700 0 -- 192.168.168.201:6812/10278
>>>> 192.168.168.200:6849/1026289 pipe(0x29b4c840 sd=129 :41140 s=2
>> pgs=591 cs=1 l=0 c=0x29b58b00).fault with nothing to send, going to standby
>> 2014-11-13 20:56:37.958245 7fd93e56a700 0 -- 192.168.168.201:6812/10278
>>>> 192.168.168.200:6803/1026306 pipe(0x29bb7600 sd=154 :43005 s=2
>> pgs=117 cs=1 l=0 c=0x29be4840).fault with nothing to send, going to standby
>> 2014-11-13 20:56:37.969506 7fd93e368700 0 -- 192.168.168.201:6812/10278
>>>> 192.168.168.200:6824/26238 pipe(0x29c98600 sd=146 :60953 s=2 pgs=148
>> cs=1 l=0 c=0x29cd0840).fault with nothing to send, going to standby
>> 2014-11-13 20:56:37.969620 7fd92a439700 0 -- 192.168.168.201:6812/10278
>>>> 192.168.168.201:6848/26784 pipe(0x2c5fcb00 sd=231 :42835 s=2
>> pgs=151539 cs=7 l=0 c=0x13c32580).fault with nothing to send, going to
>> standby
>> 2014-11-13 20:56:37.971391 7fd925df3700 0 -- 192.168.168.201:6812/10278
>>>> 192.168.168.200:6831/62026114 pipe(0x2d93edc0 sd=21 :6812 s=2
>> pgs=196755 cs=1 l=0 c=0x173c82c0).fault with nothing to send, going to
>> standby
>> 2014-11-13 20:56:37.980660 7fd93e267700 0 -- 192.168.168.201:6812/10278
>>>> 192.168.168.200:6821/18026333 pipe(0x2a13c2c0 sd=122 :41160 s=2
>> pgs=37077 cs=1 l=0 c=0x2a9ffdc0).fault with nothing to send, going to
>> standby
>>
>>
>>
>> This was the last log entry for osd.12 until it was restarted hours
>> later. A very similar picture is in the logs of osd.16 - pretty much
>> identical.
>>
>> Does anyone have any idea what is causing these problems with the
>> Giant release? How do I fix them, as it is currently very unstable for
>> a production-ready release?
>>
>
> The heartbeat cutoffs are odd. Are you sure there are no firewalls in
> place there, or any non-default TCP settings?
>
> Did you change any settings regarding the heartbeats inside Ceph or are
> they all at their defaults?
>
> Have you tried running with debug_ms set to 10, for example? That
> might show you more.
>
> Wido
>
>> Many thanks for your help
>>
>> Andrei
>>
>>
>>
>>
>>
>>
>
>


--
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
