Re: Giant upgrade - stability issues

pastebin or something, probably.
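If even that's too big, you could cut it down to just the window around one
incident and compress it before uploading. A rough sketch (the timestamps are
placeholders; this assumes the default log location on one of the mons):

# pull out roughly the half hour around one incident and compress it
sed -n '/^2014-11-17 10:00/,/^2014-11-17 10:30/p' /var/log/ceph/ceph.log \
    | gzip > ceph-incident.log.gz
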
-Sam

On Tue, Nov 18, 2014 at 12:34 PM, Andrei Mikhailovsky <andrei@xxxxxxxxxx> wrote:
> Sam, the logs are rather large. Where should I post them?
>
> Thanks
> ________________________________
> From: "Samuel Just" <sam.just@xxxxxxxxxxx>
> To: "Andrei Mikhailovsky" <andrei@xxxxxxxxxx>
> Cc: ceph-users@xxxxxxxxxxxxxx
> Sent: Tuesday, 18 November, 2014 7:54:56 PM
> Subject: Re:  Giant upgrade - stability issues
>
>
> OK, why is Ceph marking OSDs down? Post your ceph.log from one of the
> problematic periods.
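>
> The exact messages vary, but grepping the cluster log for the osd failure
> reports should show which osds were reported down, when, and by whom.
> Something like this, assuming default log locations (osd.<id> is a
> placeholder for one of the osds that got marked down):
>
> # on a mon host: which osds were reported failed / marked down
> grep -E ' osd\.[0-9]+ ' /var/log/ceph/ceph.log | grep -iE 'failed|marked|boot'
> # on that osd's host, around the same time: heartbeat trouble / wrongly marked down
> grep -iE 'wrongly marked me down|heartbeat_check' /var/log/ceph/ceph-osd.<id>.log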
> -Sam
>
> On Tue, Nov 18, 2014 at 1:35 AM, Andrei Mikhailovsky <andrei@xxxxxxxxxx>
> wrote:
>> Hello cephers,
>>
>> I need your help and suggestions on what is going on with my cluster. A
>> few weeks ago I upgraded from Firefly to Giant. I've previously written
>> about having issues with Giant where, over a two-week period, the
>> cluster's IO froze three times after Ceph marked two OSDs down. I have
>> just 17 OSDs in total across two OSD servers, plus 3 mons. The cluster is
>> running on Ubuntu 12.04 with the latest updates.
>>
>> I've got Zabbix agents monitoring the OSD servers and the cluster, and I
>> get alerted about any issues, such as problems with PGs. Since upgrading
>> to Giant, I am frequently receiving emails, around 10-15 per day, stating
>> that the cluster has degraded PGs. The number of degraded PGs varies from
>> a couple to over a thousand. After several minutes the cluster repairs
>> itself. The total number of PGs in the cluster is 4412 across all pools.
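>>
>> Would it make sense to have the Zabbix action grab the cluster state at
>> the moment one of these alerts fires, so the degraded PGs can be tied to
>> specific OSDs? For example, something along these lines (the output path
>> is just an example):
>>
>> # snapshot health and osd up/down state when the degraded-PG alert fires
>> out=/tmp/ceph-state-$(date +%s).txt
>> ceph health detail > "$out"
>> ceph osd tree >> "$out"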
>>
>> I am also seeing more alerts from VMs reporting high IO wait and hung
>> tasks; some VMs report over 50% iowait.
>>
>> This did not happen on Firefly or previous releases of Ceph, and not much
>> has changed in the cluster since the upgrade to Giant: the networking and
>> hardware are the same, it is still running the same version of Ubuntu,
>> and the cluster load hasn't changed either. Thus, I think the issues
>> above are related to the upgrade to Giant.
>>
>> Here is the ceph.conf that I use:
>>
>> [global]
>> fsid = 51e9f641-372e-44ec-92a4-b9fe55cbf9fe
>> mon_initial_members = arh-ibstorage1-ib, arh-ibstorage2-ib, arh-cloud13-ib
>> mon_host = 192.168.168.200,192.168.168.201,192.168.168.13
>> auth_supported = cephx
>> osd_journal_size = 10240
>> filestore_xattr_use_omap = true
>> public_network = 192.168.168.0/24
>> rbd_default_format = 2
>> osd_recovery_max_chunk = 8388608
>> osd_recovery_op_priority = 1
>> osd_max_backfills = 1
>> osd_recovery_max_active = 1
>> osd_recovery_threads = 1
>> filestore_max_sync_interval = 15
>> filestore_op_threads = 8
>> filestore_merge_threshold = 40
>> filestore_split_multiple = 8
>> osd_disk_threads = 8
>> osd_op_threads = 8
>> osd_pool_default_pg_num = 1024
>> osd_pool_default_pgp_num = 1024
>> osd_crush_update_on_start = false
>>
>> [client]
>> rbd_cache = true
>> admin_socket = /var/run/ceph/$name.$pid.asok
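>>
>> If it helps, I can also read the effective values back from the running
>> daemons over the admin socket, to confirm the upgrade did not change any
>> of them. For example (osd.0 just as an example, assuming the default OSD
>> socket path):
>>
>> # show the effective recovery/backfill throttles on a running osd
>> ceph daemon osd.0 config show | grep -E 'recovery|backfill'
>> # or via the socket directly
>> ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep -E 'recovery|backfill'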
>>
>>
>> I would like to get to the bottom of these issues. I am not sure whether
>> they can be fixed by changing some settings in ceph.conf, or whether a
>> full downgrade back to Firefly is needed. Is a downgrade even possible on
>> a production cluster?
>>
>> Thanks for your help
>>
>> Andrei
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com