pastebin or something, probably.
-Sam

On Tue, Nov 18, 2014 at 12:34 PM, Andrei Mikhailovsky <andrei@xxxxxxxxxx> wrote:
> Sam, the logs are rather large in size. Where should I post them?
>
> Thanks
> ________________________________
> From: "Samuel Just" <sam.just@xxxxxxxxxxx>
> To: "Andrei Mikhailovsky" <andrei@xxxxxxxxxx>
> Cc: ceph-users@xxxxxxxxxxxxxx
> Sent: Tuesday, 18 November, 2014 7:54:56 PM
> Subject: Re: Giant upgrade - stability issues
>
> Ok, why is ceph marking osds down? Post your ceph.log from one of the
> problematic periods.
> -Sam
>
> On Tue, Nov 18, 2014 at 1:35 AM, Andrei Mikhailovsky <andrei@xxxxxxxxxx>
> wrote:
>> Hello cephers,
>>
>> I need your help and suggestions on what is going on with my cluster. A
>> few weeks ago I upgraded from Firefly to Giant. I've previously written
>> about having issues with Giant where, over a two-week period, the
>> cluster's IO froze three times after ceph marked two osds down. I have
>> just 17 osds in total between two osd servers, plus 3 mons. The cluster
>> is running on Ubuntu 12.04 with the latest updates.
>>
>> I've got zabbix agents monitoring the osd servers and the cluster, and I
>> get alerts about any issues, such as problems with PGs, etc. Since
>> upgrading to Giant, I am frequently seeing emails alerting me that the
>> cluster has degraded PGs, around 10-15 such emails per day. The number
>> of degraded PGs varies from a couple of PGs to over a thousand. After
>> several minutes the cluster repairs itself. The total number of PGs in
>> the cluster is 4412 across all the pools.
>>
>> I am also seeing more alerts from vms reporting high IO wait (some over
>> 50%) as well as hung tasks.
>>
>> This did not happen on Firefly or the previous releases of ceph. Not
>> much has changed in the cluster since the upgrade to Giant: the
>> networking and hardware are the same, and it is still running the same
>> version of Ubuntu. The cluster load hasn't changed either. Thus, I think
>> the issues above are related to the upgrade of ceph to Giant.
>>
>> Here is the ceph.conf that I use:
>>
>> [global]
>> fsid = 51e9f641-372e-44ec-92a4-b9fe55cbf9fe
>> mon_initial_members = arh-ibstorage1-ib, arh-ibstorage2-ib, arh-cloud13-ib
>> mon_host = 192.168.168.200,192.168.168.201,192.168.168.13
>> auth_supported = cephx
>> osd_journal_size = 10240
>> filestore_xattr_use_omap = true
>> public_network = 192.168.168.0/24
>> rbd_default_format = 2
>> osd_recovery_max_chunk = 8388608
>> osd_recovery_op_priority = 1
>> osd_max_backfills = 1
>> osd_recovery_max_active = 1
>> osd_recovery_threads = 1
>> filestore_max_sync_interval = 15
>> filestore_op_threads = 8
>> filestore_merge_threshold = 40
>> filestore_split_multiple = 8
>> osd_disk_threads = 8
>> osd_op_threads = 8
>> osd_pool_default_pg_num = 1024
>> osd_pool_default_pgp_num = 1024
>> osd_crush_update_on_start = false
>>
>> [client]
>> rbd_cache = true
>> admin_socket = /var/run/ceph/$name.$pid.asok
>>
>>
>> I would like to get to the bottom of these issues. I am not sure whether
>> they can be fixed by changing some settings in ceph.conf or whether they
>> require a full downgrade back to Firefly. Is a downgrade even possible
>> on a production cluster?
>>
>> Thanks for your help
>>
>> Andrei
>>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
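Since the full ceph.log is usually far too large to paste in one piece, below is a minimal Python sketch of one way to cut it down to just the stretches around the OSD-down events before putting it on pastebin. It is only a sketch written for this thread, not part of the Ceph tooling: the match strings ("marked down", "wrongly marked me down", " failed (") and the default /var/log/ceph/ceph.log path are assumptions about what a Giant-era cluster log contains, so adjust them to whatever your log actually prints.

#!/usr/bin/env python3
"""Trim a large ceph.log down to just the windows around OSD flap events.

Sketch only: the patterns and default path below are assumptions about
what a Giant-era cluster log contains; adjust them to the real log lines.
"""

import sys
from collections import deque

# Phrases assumed to mark the interesting events in ceph.log.
PATTERNS = ("marked down", "wrongly marked me down", " failed (")
CONTEXT = 200  # lines of context to keep before and after each match


def extract(path):
    before = deque(maxlen=CONTEXT)  # rolling buffer of preceding lines
    after_remaining = 0             # lines still to emit after a match
    with open(path, errors="replace") as log:
        for line in log:
            if any(p in line for p in PATTERNS):
                # Flush the buffered context, then the matching line itself.
                for buffered in before:
                    sys.stdout.write(buffered)
                before.clear()
                sys.stdout.write(line)
                after_remaining = CONTEXT
            elif after_remaining > 0:
                sys.stdout.write(line)
                after_remaining -= 1
            else:
                before.append(line)


if __name__ == "__main__":
    extract(sys.argv[1] if len(sys.argv) > 1 else "/var/log/ceph/ceph.log")

Running this against the cluster log and redirecting stdout to a file should leave something small enough to post while still covering the problematic periods Sam asked about.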