Sam,
Pastebin or similar will not take tens of megabytes of logs. If we are talking about the debug_ms 10 setting, I've got about 7 GB of logs generated every half an hour or so. I'm not really sure what to do with that much data. Anything more constructive?
Thanks
From: "Samuel Just" <sam.just@xxxxxxxxxxx>
To: "Andrei Mikhailovsky" <andrei@xxxxxxxxxx>
Cc: ceph-users@xxxxxxxxxxxxxx
Sent: Tuesday, 18 November, 2014 8:53:47 PM
Subject: Re: Giant upgrade - stability issues
pastebin or something, probably.
-Sam
On Tue, Nov 18, 2014 at 12:34 PM, Andrei Mikhailovsky <andrei@xxxxxxxxxx> wrote:
> Sam, the logs are rather large. Where should I post them?
>
> Thanks
> ________________________________
> From: "Samuel Just" <sam.just@xxxxxxxxxxx>
> To: "Andrei Mikhailovsky" <andrei@xxxxxxxxxx>
> Cc: ceph-users@xxxxxxxxxxxxxx
> Sent: Tuesday, 18 November, 2014 7:54:56 PM
> Subject: Re: Giant upgrade - stability issues
>
>
> Ok, why is ceph marking osds down? Post your ceph.log from one of the
> problematic periods.
> -Sam
>
> On Tue, Nov 18, 2014 at 1:35 AM, Andrei Mikhailovsky <andrei@xxxxxxxxxx> wrote:
>> Hello cephers,
>>
>> I need your help and suggestions on what is going on with my cluster. A few
>> weeks ago I upgraded from Firefly to Giant. I've previously written about
>> having issues with Giant where, in a two-week period, the cluster's IO froze
>> three times after ceph marked two osds down. I have just 17 osds in total
>> across two osd servers, plus 3 mons. The cluster is running on Ubuntu 12.04
>> with the latest updates.
>>
>> I've got zabbix agents monitoring the osd servers and the cluster. I get
>> alerts of any issues, such as problems with PGs, etc. Since upgrading to
>> Giant, I am now frequently seeing emails alerting of the cluster having
>> degraded PGs, around 10-15 such emails per day. The number of degraded PGs
>> varies between a couple of PGs and over a thousand. After several minutes
>> the cluster repairs itself. The total number of PGs in the cluster is 4412
>> across all the pools.
>>
>> I am also seeing more alerts from vms reporting high IO wait, as well as
>> hung tasks. Some vms are reporting over 50% IO wait.
>>
>> This did not happen on Firefly or the previous releases of ceph. Not much
>> has changed in the cluster since the upgrade to Giant. Networking and
>> hardware are still the same and it is still running the same version of
>> Ubuntu. The cluster load hasn't changed either. Thus, I think the issues
>> above are related to the upgrade of ceph to Giant.
>>
>> Here is the ceph.conf that I use:
>>
>> [global]
>> fsid = 51e9f641-372e-44ec-92a4-b9fe55cbf9fe
>> mon_initial_members = arh-ibstorage1-ib, arh-ibstorage2-ib, arh-cloud13-ib
>> mon_host = 192.168.168.200,192.168.168.201,192.168.168.13
>> auth_supported = cephx
>> osd_journal_size = 10240
>> filestore_xattr_use_omap = true
>> public_network = 192.168.168.0/24
>> rbd_default_format = 2
>> osd_recovery_max_chunk = 8388608
>> osd_recovery_op_priority = 1
>> osd_max_backfills = 1
>> osd_recovery_max_active = 1
>> osd_recovery_threads = 1
>> filestore_max_sync_interval = 15
>> filestore_op_threads = 8
>> filestore_merge_threshold = 40
>> filestore_split_multiple = 8
>> osd_disk_threads = 8
>> osd_op_threads = 8
>> osd_pool_default_pg_num = 1024
>> osd_pool_default_pgp_num = 1024
>> osd_crush_update_on_start = false
>>
>> [client]
>> rbd_cache = true
>> admin_socket = /var/run/ceph/$name.$pid.asok
>>
>>
>> I would like to get to the bottom of these issues. I am not sure whether they
>> could be fixed by changing some settings in ceph.conf or whether a full
>> downgrade back to Firefly is needed. Is a downgrade even possible on a
>> production cluster?
>>
>> Thanks for your help
>>
>> Andrei
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>