Re: Giant upgrade - stability issues

Samuel Just <sam.just@xxxxxxxxxxx> · Wed, 19 Nov 2014 07:10:00 -0800



logs/andrei » grep failed ceph.log.7
2014-11-12 01:37:19.857143 mon.0 192.168.168.13:6789/0 969265 :
cluster [INF] osd.3 192.168.168.200:6818/26170 failed (3 reports from
3 peers after 22.000550 >= grace 20.995772)
2014-11-12 01:37:21.176073 mon.0 192.168.168.13:6789/0 969287 :
cluster [INF] osd.0 192.168.168.200:6801/26102 failed (3 reports from
3 peers after 22.997644 >= grace 22.986743)
2014-11-12 01:37:59.187365 mon.0 192.168.168.13:6789/0 969412 :
cluster [INF] osd.1 192.168.168.200:6806/26114 failed (8 reports from
7 peers after 24.000156 >= grace 23.981558)
2014-11-12 01:37:59.187641 mon.0 192.168.168.13:6789/0 969414 :
cluster [INF] osd.2 192.168.168.200:6813/26130 failed (8 reports from
7 peers after 24.000402 >= grace 23.981558)
2014-11-12 01:43:36.238865 mon.0 192.168.168.13:6789/0 969970 :
cluster [INF] osd.2 192.168.168.200:6813/26130 failed (7 reports from
7 peers after 25.218598 >= grace 23.980624)
2014-11-12 01:59:53.025847 mon.0 192.168.168.13:6789/0 971388 :
cluster [INF] osd.2 192.168.168.200:6813/26130 failed (8 reports from
7 peers after 1002.005567 >= grace 23.298168)
2014-11-12 02:00:48.753025 mon.0 192.168.168.13:6789/0 971529 :
cluster [INF] osd.3 192.168.168.200:6818/26170 failed (3 reports from
3 peers after 21.350209 >= grace 20.995897)
2014-11-12 02:00:51.432242 mon.0 192.168.168.13:6789/0 971589 :
cluster [INF] osd.0 192.168.168.200:6801/26102 failed (9 reports from
8 peers after 23.679883 >= grace 21.990901)
2014-11-12 02:00:54.919140 mon.0 192.168.168.13:6789/0 971613 :
cluster [INF] osd.7 192.168.168.200:6838/26321 failed (7 reports from
6 peers after 27.166208 >= grace 25.654808)
2014-11-12 02:00:55.008719 mon.0 192.168.168.13:6789/0 971622 :
cluster [INF] osd.1 192.168.168.200:6806/26114 failed (7 reports from
5 peers after 27.256118 >= grace 23.979063)
2014-11-12 02:58:14.787461 mon.0 192.168.168.13:6789/0 976957 :
cluster [INF] osd.7 192.168.168.200:6838/26321 failed (3 reports from
3 peers after 32.609818 >= grace 24.785794)
2014-11-12 03:42:51.831223 mon.0 192.168.168.13:6789/0 980969 :
cluster [INF] osd.1 192.168.168.200:6806/26114 failed (3 reports from
2 peers after 2711.653602 >= grace 21.779812)
2014-11-12 03:42:51.841152 mon.0 192.168.168.13:6789/0 981008 :
cluster [INF] osd.0 192.168.168.200:6801/26102 failed (3 reports from
1 peers after 30.654845 >= grace 21.988229)
2014-11-12 05:01:08.085550 mon.0 192.168.168.13:6789/0 988367 :
cluster [INF] osd.0 192.168.168.200:6801/26102 failed (9 reports from
7 peers after 22.826650 >= grace 21.991228)
2014-11-12 05:01:08.086156 mon.0 192.168.168.13:6789/0 988371 :
cluster [INF] osd.3 192.168.168.200:6818/26170 failed (9 reports from
7 peers after 22.826985 >= grace 21.991228)
2014-11-12 05:01:08.395862 mon.0 192.168.168.13:6789/0 988379 :
cluster [INF] osd.1 192.168.168.200:6806/26114 failed (9 reports from
7 peers after 23.136824 >= grace 22.986665)
2014-11-12 05:14:53.539485 mon.0 192.168.168.13:6789/0 989651 :
cluster [INF] osd.3 192.168.168.200:6818/26170 failed (10 reports from
5 peers after 24.467902 >= grace 21.990599)
2014-11-12 05:40:33.497748 mon.0 192.168.168.13:6789/0 992002 :
cluster [INF] osd.3 192.168.168.200:6818/26170 failed (11 reports from
5 peers after 1564.426103 >= grace 21.479835)
2014-11-12 20:29:22.250975 mon.0 192.168.168.13:6789/0 1075024 :
cluster [INF] osd.0 192.168.168.200:6801/26102 failed (6 reports from
5 peers after 23.000407 >= grace 21.991162)
2014-11-12 20:29:22.251216 mon.0 192.168.168.13:6789/0 1075026 :
cluster [INF] osd.1 192.168.168.200:6806/26114 failed (6 reports from
5 peers after 23.000410 >= grace 22.986743)
2014-11-12 20:29:22.251503 mon.0 192.168.168.13:6789/0 1075029 :
cluster [INF] osd.3 192.168.168.200:6818/26170 failed (6 reports from
5 peers after 23.000581 >= grace 22.986743)
2014-11-12 20:29:25.548353 mon.0 192.168.168.13:6789/0 1075044 :
cluster [INF] osd.2 192.168.168.200:6813/26130 failed (7 reports from
5 peers after 26.297423 >= grace 25.969696)
2014-11-12 20:29:25.548871 mon.0 192.168.168.13:6789/0 1075049 :
cluster [INF] osd.7 192.168.168.200:6838/26321 failed (7 reports from
5 peers after 26.297653 >= grace 23.877243)
2014-11-12 22:18:30.217802 mon.0 192.168.168.13:6789/0 1085318 :
cluster [INF] osd.1 192.168.168.200:6806/26114 failed (5 reports from
5 peers after 24.092179 >= grace 22.986115)
2014-11-12 22:18:30.217866 mon.0 192.168.168.13:6789/0 1085319 :
cluster [INF] osd.3 192.168.168.200:6818/26170 failed (5 reports from
5 peers after 24.092106 >= grace 22.986115)
2014-11-12 22:18:35.221766 mon.0 192.168.168.13:6789/0 1085350 :
cluster [INF] osd.0 192.168.168.200:6801/26102 failed (3 reports from
2 peers after 23.002405 >= grace 21.991161)
2014-11-12 22:31:55.369603 mon.0 192.168.168.13:6789/0 1086634 :
cluster [INF] osd.3 192.168.168.200:6818/26170 failed (6 reports from
5 peers after 23.836498 >= grace 22.986262)
2014-11-12 22:31:58.489937 mon.0 192.168.168.13:6789/0 1086645 :
cluster [INF] osd.0 192.168.168.200:6801/26102 failed (3 reports from
2 peers after 23.008983 >= grace 22.986738)
2014-11-12 22:31:58.508207 mon.0 192.168.168.13:6789/0 1086650 :
cluster [INF] osd.1 192.168.168.200:6806/26114 failed (3 reports from
2 peers after 24.000402 >= grace 22.986168)
2014-11-12 22:31:58.509126 mon.0 192.168.168.13:6789/0 1086655 :
cluster [INF] osd.7 192.168.168.200:6838/26321 failed (3 reports from
2 peers after 23.000217 >= grace 22.932868)
2014-11-12 22:32:00.370212 mon.0 192.168.168.13:6789/0 1086669 :
cluster [INF] osd.2 192.168.168.200:6813/26130 failed (3 reports from
2 peers after 24.862476 >= grace 23.980897)
2014-11-12 22:32:28.391352 mon.0 192.168.168.13:6789/0 1086758 :
cluster [INF] osd.1 192.168.168.200:6806/26114 failed (3 reports from
2 peers after 24.822477 >= grace 21.990464)
2014-11-12 23:12:03.126429 mon.0 192.168.168.13:6789/0 1090415 :
cluster [INF] osd.0 192.168.168.200:6801/26102 failed (4 reports from
4 peers after 23.108060 >= grace 21.991121)
2014-11-12 23:12:03.126653 mon.0 192.168.168.13:6789/0 1090417 :
cluster [INF] osd.1 192.168.168.200:6806/26114 failed (4 reports from
4 peers after 23.108240 >= grace 21.991121)
2014-11-13 00:48:57.277032 mon.0 192.168.168.13:6789/0 1099175 :
cluster [INF] osd.1 192.168.168.200:6806/26114 failed (3 reports from
3 peers after 25.893011 >= grace 21.990053)

You indicated that osd 12 and 16 were the ones marked down, but it
looks like only 0,1,2,3,7 were marked down in the ceph.log you sent.
The logs for 12 and 16 did indicate that they had been partitioned
from the other nodes.  I'd bet that you are having intermittent
network trouble since the heartbeats are intermittently failing.
-Sam
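
To check which OSDs were reported failed without reading the grep output by eye, the entries can be tallied with a short script. This is a minimal sketch, assuming the log lines look exactly like the excerpt above (other Ceph releases may format them differently). The names `parse_failures`, `summarize` and `SAMPLE`, and the 10x threshold for flagging long gaps, are illustrative choices and not part of Ceph.

```python
import re
from collections import Counter

# Regex for the monitor's "osd.N ... failed" entries, written against the
# excerpt quoted in this thread (an assumption; other releases may differ).
FAILED_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) \S+ \S+ \d+ : "
    r"cluster \[INF\] osd\.(?P<osd>\d+) (?P<addr>[\d.]+):\d+/\d+ failed "
    r"\((?P<reports>\d+) reports from (?P<peers>\d+) peers "
    r"after (?P<elapsed>[\d.]+) >= grace (?P<grace>[\d.]+)\)"
)


def parse_failures(text):
    """Return one dict per 'failed' entry found in text.

    Whitespace is collapsed first so that entries wrapped across several
    lines (as in an email) are matched the same way as single-line entries.
    """
    flat = " ".join(text.split())
    events = []
    for m in FAILED_RE.finditer(flat):
        events.append({
            "ts": m.group("ts"),
            "osd": int(m.group("osd")),
            "addr": m.group("addr"),
            "reports": int(m.group("reports")),
            "peers": int(m.group("peers")),
            "elapsed": float(m.group("elapsed")),
            "grace": float(m.group("grace")),
        })
    return events


def summarize(events, factor=10.0):
    """Count failures per OSD and per host address, and collect entries whose
    elapsed time exceeds factor * grace (an arbitrary, illustrative cutoff)."""
    per_osd = Counter(e["osd"] for e in events)
    per_host = Counter(e["addr"] for e in events)
    long_gaps = [e for e in events if e["elapsed"] > factor * e["grace"]]
    return per_osd, per_host, long_gaps


# Two entries copied from the excerpt above, the first left wrapped on purpose.
SAMPLE = """\
2014-11-12 01:37:19.857143 mon.0 192.168.168.13:6789/0 969265 :
cluster [INF] osd.3 192.168.168.200:6818/26170 failed (3 reports from
3 peers after 22.000550 >= grace 20.995772)
2014-11-12 01:59:53.025847 mon.0 192.168.168.13:6789/0 971388 : cluster [INF] osd.2 192.168.168.200:6813/26130 failed (8 reports from 7 peers after 1002.005567 >= grace 23.298168)
"""

per_osd, per_host, long_gaps = summarize(parse_failures(SAMPLE))
for osd, count in sorted(per_osd.items()):
    print(f"osd.{osd}: {count} failure report(s)")
for addr, count in sorted(per_host.items()):
    print(f"host {addr}: {count} failure report(s)")
for e in long_gaps:
    print(f"{e['ts']} osd.{e['osd']}: elapsed {e['elapsed']:.1f}s vs grace {e['grace']:.1f}s")
```

To run it against a real log, replace `SAMPLE` with the contents of the file, for example `open("ceph.log.7").read()`. Grouping by address makes it easy to see whether the failed OSDs share a host, which is a useful data point when network trouble is suspected.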

On Tue, Nov 18, 2014 at 3:34 PM, Andrei Mikhailovsky <andrei@xxxxxxxxxx> wrote:
> Sam,
>
> Pastebin or similar will not take tens of megabytes worth of logs. If we are
> talking about the debug_ms 10 setting, I've got about 7 GB worth of logs
> generated every half an hour or so. Not really sure what to do with that
> much data. Anything more constructive?
>
> Thanks
> ________________________________
> From: "Samuel Just" <sam.just@xxxxxxxxxxx>
> To: "Andrei Mikhailovsky" <andrei@xxxxxxxxxx>
> Cc: ceph-users@xxxxxxxxxxxxxx
> Sent: Tuesday, 18 November, 2014 8:53:47 PM
>
> Subject: Re:  Giant upgrade - stability issues
>
> pastebin or something, probably.
> -Sam
>
> On Tue, Nov 18, 2014 at 12:34 PM, Andrei Mikhailovsky <andrei@xxxxxxxxxx> wrote:
>> Sam, the logs are rather large. Where should I post them?
>>
>> Thanks
>> ________________________________
>> From: "Samuel Just" <sam.just@xxxxxxxxxxx>
>> To: "Andrei Mikhailovsky" <andrei@xxxxxxxxxx>
>> Cc: ceph-users@xxxxxxxxxxxxxx
>> Sent: Tuesday, 18 November, 2014 7:54:56 PM
>> Subject: Re:  Giant upgrade - stability issues
>>
>>
>> Ok, why is ceph marking osds down?  Post your ceph.log from one of the
>> problematic periods.
>> -Sam
>>
>> On Tue, Nov 18, 2014 at 1:35 AM, Andrei Mikhailovsky <andrei@xxxxxxxxxx> wrote:
>>> Hello cephers,
>>>
>>> I need your help and suggestions on what is going on with my cluster. A
>>> few weeks ago I upgraded from Firefly to Giant. I've previously written
>>> about having issues with Giant where, over a two-week period, the
>>> cluster's IO froze three times after ceph marked two osds down. I have
>>> just 17 osds in total across two osd servers, plus 3 mons. The cluster is
>>> running on Ubuntu 12.04 with the latest updates.
>>>
>>> I've got zabbix agents monitoring the osd servers and the cluster. I get
>>> alerts about any issues, such as problems with PGs. Since upgrading to
>>> Giant, I frequently receive emails alerting me that the cluster has
>>> degraded PGs, around 10-15 per day. The number of degraded PGs varies
>>> from a couple to over a thousand. After several minutes the cluster
>>> repairs itself. The total number of PGs in the cluster is 4412 across all
>>> the pools.
>>>
>>> I am also seeing more alerts from vms reporting high IO wait, as well as
>>> hung tasks. Some vms report over 50% IO wait.
>>>
>>> This did not happen on Firefly or previous releases of ceph. Not much
>>> has changed in the cluster since the upgrade to Giant. Networking and
>>> hardware are still the same, and it is still running the same version of
>>> Ubuntu. The cluster load hasn't changed either. Thus, I think the issues
>>> above are related to the upgrade of ceph to Giant.
>>>
>>> Here is the ceph.conf that I use:
>>>
>>> [global]
>>> fsid = 51e9f641-372e-44ec-92a4-b9fe55cbf9fe
>>> mon_initial_members = arh-ibstorage1-ib, arh-ibstorage2-ib, arh-cloud13-ib
>>> mon_host = 192.168.168.200,192.168.168.201,192.168.168.13
>>> auth_supported = cephx
>>> osd_journal_size = 10240
>>> filestore_xattr_use_omap = true
>>> public_network = 192.168.168.0/24
>>> rbd_default_format = 2
>>> osd_recovery_max_chunk = 8388608
>>> osd_recovery_op_priority = 1
>>> osd_max_backfills = 1
>>> osd_recovery_max_active = 1
>>> osd_recovery_threads = 1
>>> filestore_max_sync_interval = 15
>>> filestore_op_threads = 8
>>> filestore_merge_threshold = 40
>>> filestore_split_multiple = 8
>>> osd_disk_threads = 8
>>> osd_op_threads = 8
>>> osd_pool_default_pg_num = 1024
>>> osd_pool_default_pgp_num = 1024
>>> osd_crush_update_on_start = false
>>>
>>> [client]
>>> rbd_cache = true
>>> admin_socket = /var/run/ceph/$name.$pid.asok
>>>
>>>
>>> I would like to get to the bottom of these issues. I am not sure whether
>>> they could be fixed by changing some settings in ceph.conf or whether a
>>> full downgrade back to Firefly is needed. Is a downgrade even possible on
>>> a production cluster?
>>>
>>> Thanks for your help
>>>
>>> Andrei
>>>
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-users@xxxxxxxxxxxxxx
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
>