You can try to capture logging at debug osd = 20, debug ms = 20 and debug filestore = 20 while an osd is misbehaving.
-Sam
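A minimal sketch of how those levels could be raised on a single running OSD without a restart, assuming the cluster admin keyring is available on the node; osd.12 below is only an example id:

    # bump logging on one daemon while it is misbehaving
    ceph tell osd.12 injectargs '--debug-osd 20 --debug-ms 20 --debug-filestore 20'

    # persistent alternative: add the following to the [osd] section of
    # ceph.conf and restart the daemon
    #   debug osd = 20
    #   debug ms = 20
    #   debug filestore = 20

Remember to drop the levels back down once a misbehaving period has been captured; as noted further down the thread, verbose logging grows by gigabytes per half hour.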
On Thu, Nov 20, 2014 at 7:34 AM, Andrei Mikhailovsky <andrei@xxxxxxxxxx> wrote:
> Sam,
>
> Further to your email I have done the following:
>
> 1. Upgraded both osd servers with the latest updates and restarted each
> server in turn.
> 2. Fired up the nping utility to generate TCP connections (3-way handshake)
> from each of the osd servers as well as from the host servers. In total I've
> run 5 tests. The nping utility was establishing connections on port 22 (as
> all servers have this port open) with a delay of 1ms. The command used to
> generate the traffic was as follows:
>
> nping --tcp-connect -p 22 --delay 1ms <hostname> -v2 -c 36000000 | gzip >>/root/nping-hostname-output.gz
>
> The tests took just over 12 hours to complete. The results did not show any
> problems as far as I can see. Here is the tail of the output from one of the
> runs:
>
> SENT (37825.7303s) Starting TCP Handshake > arh-ibstorage1-ib:22 (192.168.168.200:22)
> RECV (37825.7303s) Handshake with arh-ibstorage1-ib:22 (192.168.168.200:22) completed
>
> Max rtt: 4.447ms | Min rtt: 0.008ms | Avg rtt: 0.008ms
> TCP connection attempts: 36000000 | Successful connections: 36000000 | Failed: 0 (0.00%)
> Tx time: 37825.72833s | Tx bytes/s: 76138.65 | Tx pkts/s: 951.73
> Rx time: 37825.72939s | Rx bytes/s: 38069.33 | Rx pkts/s: 951.73
> Nping done: 1 IP address pinged in 37844.55 seconds
>
> As you can see from the above, there are no failed connections at all out of
> the 36 million attempts. The average rtt is 0.008ms, and nping was sending
> almost 1000 packets per second on average. I've got the same results from
> the other servers.
>
> Unless you have other tests in mind, I think there are no issues with the
> network.
>
> I will fire up another test, for 24 hours this time, to see if it makes a
> difference.
>
> Thanks
>
> Andrei
>
> ________________________________
> From: "Samuel Just" <sam.just@xxxxxxxxxxx>
> To: "Andrei Mikhailovsky" <andrei@xxxxxxxxxx>
> Cc: ceph-users@xxxxxxxxxxxxxx
> Sent: Wednesday, 19 November, 2014 9:45:40 PM
> Subject: Re: Giant upgrade - stability issues
>
> Well, the heartbeats are failing due to networking errors preventing the
> heartbeats from arriving. That is causing osds to go down, and that is
> causing pgs to become degraded. You'll have to work out what is preventing
> the tcp connections from being stable.
> -Sam
>
> On Wed, Nov 19, 2014 at 1:39 PM, Andrei Mikhailovsky <andrei@xxxxxxxxxx> wrote:
>>
>>> You indicated that osd 12 and 16 were the ones marked down, but it
>>> looks like only 0,1,2,3,7 were marked down in the ceph.log you sent.
>>> The logs for 12 and 16 did indicate that they had been partitioned
>>> from the other nodes. I'd bet that you are having intermittent
>>> network trouble since the heartbeats are intermittently failing.
>>> -Sam
>>
>> AM: I will check the logs further for osds 12 and 16. Perhaps I've missed
>> something, but the ceph osd tree output was showing 12 and 16 as down.
>>
>> Regarding the failing heartbeats, Wido has suggested that I should
>> investigate the reason for their failure. The obvious thing to look at is
>> the network, and this is what I've initially done. However, I do not see
>> any signs of network issues. There are no errors on the physical
>> interface, and ifconfig is showing zero errors and only a very small
>> number of dropped TX packets (0.00006%):
>>
>> # ifconfig ib0
>> ib0       Link encap:UNSPEC  HWaddr 80-00-00-48-FE-80-00-00-00-00-00-00-00-00-00-00
>>           inet addr:192.168.168.200  Bcast:192.168.168.255  Mask:255.255.255.0
>>           inet6 addr: fe80::223:7dff:ff94:e2a5/64 Scope:Link
>>           UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
>>           RX packets:1812895801 errors:0 dropped:52 overruns:0 frame:0
>>           TX packets:1835002992 errors:0 dropped:1037 overruns:0 carrier:0
>>           collisions:0 txqueuelen:2048
>>           RX bytes:6252740293262 (6.2 TB)  TX bytes:11343307665152 (11.3 TB)
>>
>> How would I investigate what is happening with the heartbeats and the
>> reason for their failures? I have a suspicion that answering this will
>> solve the issues with the frequent reporting of degraded PGs on the
>> cluster and the intermittent high levels of IO wait on the vms.
>>
>> Also, as I've previously mentioned, the issues started to happen after the
>> upgrade to Giant. I did not have these problems with the Firefly, Emperor
>> or Dumpling releases on the same hardware and with the same cluster load.
>>
>> Thanks
>>
>> Andrei
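A sketch of where the failing heartbeats usually leave traces, assuming the default log locations on Ubuntu; the exact message wording differs between releases, so treat the patterns as starting points rather than exact strings:

    # on each osd server: heartbeat failures as seen by the osds themselves
    grep -i 'heartbeat_check: no reply' /var/log/ceph/ceph-osd.*.log | tail -n 50

    # on a mon server: when osds were reported failed or marked down
    grep -iE 'wrongly marked me down|failed|marked down' /var/log/ceph/ceph.log | tail -n 50

Correlating the timestamps of these entries with the degraded-PG alerts should show whether the flapping always involves the same osds, the same host, or random pairs, which is usually the quickest way to tell a daemon problem from a network problem.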
>> On Tue, Nov 18, 2014 at 3:34 PM, Andrei Mikhailovsky <andrei@xxxxxxxxxx> wrote:
>>> Sam,
>>>
>>> Pastebin or similar will not take tens of megabytes worth of logs. If we
>>> are talking about the debug_ms 10 setting, I've got about 7GB worth of
>>> logs generated every half an hour or so. Not really sure what to do with
>>> that much data. Anything more constructive?
>>>
>>> Thanks
>>> ________________________________
>>> From: "Samuel Just" <sam.just@xxxxxxxxxxx>
>>> To: "Andrei Mikhailovsky" <andrei@xxxxxxxxxx>
>>> Cc: ceph-users@xxxxxxxxxxxxxx
>>> Sent: Tuesday, 18 November, 2014 8:53:47 PM
>>> Subject: Re: Giant upgrade - stability issues
>>>
>>> pastebin or something, probably.
>>> -Sam
>>>
>>> On Tue, Nov 18, 2014 at 12:34 PM, Andrei Mikhailovsky <andrei@xxxxxxxxxx> wrote:
>>>> Sam, the logs are rather large in size. Where should I post them?
>>>>
>>>> Thanks
>>>> ________________________________
>>>> From: "Samuel Just" <sam.just@xxxxxxxxxxx>
>>>> To: "Andrei Mikhailovsky" <andrei@xxxxxxxxxx>
>>>> Cc: ceph-users@xxxxxxxxxxxxxx
>>>> Sent: Tuesday, 18 November, 2014 7:54:56 PM
>>>> Subject: Re: Giant upgrade - stability issues
>>>>
>>>> Ok, why is ceph marking osds down? Post your ceph.log from one of the
>>>> problematic periods.
>>>> -Sam
>>>>
>>>> On Tue, Nov 18, 2014 at 1:35 AM, Andrei Mikhailovsky <andrei@xxxxxxxxxx> wrote:
>>>>> Hello cephers,
>>>>>
>>>>> I need your help and suggestions on what is going on with my cluster.
>>>>> A few weeks ago I upgraded from Firefly to Giant. I've previously
>>>>> written about having issues with Giant where, in a two-week period,
>>>>> the cluster's IO froze three times after ceph marked two osds down. I
>>>>> have in total just 17 osds between two osd servers, plus 3 mons. The
>>>>> cluster is running on Ubuntu 12.04 with the latest updates.
>>>>>
>>>>> I've got zabbix agents monitoring the osd servers and the cluster, and
>>>>> I get alerts on any issues, such as problems with PGs. Since upgrading
>>>>> to Giant I am frequently seeing emails alerting me that the cluster
>>>>> has degraded PGs - around 10-15 such emails per day. The number of
>>>>> degraded PGs varies between a couple of PGs and over a thousand.
>>>>> After several minutes the cluster repairs itself. The total number of
>>>>> PGs in the cluster is 4412 across all the pools.
>>>>>
>>>>> I am also seeing more alerts from vms reporting high IO wait (some
>>>>> over 50%) as well as hung tasks.
>>>>>
>>>>> This has not happened on Firefly or the previous releases of ceph. Not
>>>>> much has changed in the cluster since the upgrade to Giant. The
>>>>> networking and hardware are still the same, and the servers still run
>>>>> the same version of Ubuntu. The cluster load hasn't changed either.
>>>>> Thus, I think the issues above are related to the upgrade of ceph to
>>>>> Giant.
>>>>>
>>>>> Here is the ceph.conf that I use:
>>>>>
>>>>> [global]
>>>>> fsid = 51e9f641-372e-44ec-92a4-b9fe55cbf9fe
>>>>> mon_initial_members = arh-ibstorage1-ib, arh-ibstorage2-ib, arh-cloud13-ib
>>>>> mon_host = 192.168.168.200,192.168.168.201,192.168.168.13
>>>>> auth_supported = cephx
>>>>> osd_journal_size = 10240
>>>>> filestore_xattr_use_omap = true
>>>>> public_network = 192.168.168.0/24
>>>>> rbd_default_format = 2
>>>>> osd_recovery_max_chunk = 8388608
>>>>> osd_recovery_op_priority = 1
>>>>> osd_max_backfills = 1
>>>>> osd_recovery_max_active = 1
>>>>> osd_recovery_threads = 1
>>>>> filestore_max_sync_interval = 15
>>>>> filestore_op_threads = 8
>>>>> filestore_merge_threshold = 40
>>>>> filestore_split_multiple = 8
>>>>> osd_disk_threads = 8
>>>>> osd_op_threads = 8
>>>>> osd_pool_default_pg_num = 1024
>>>>> osd_pool_default_pgp_num = 1024
>>>>> osd_crush_update_on_start = false
>>>>>
>>>>> [client]
>>>>> rbd_cache = true
>>>>> admin_socket = /var/run/ceph/$name.$pid.asok
>>>>>
>>>>> I would like to get to the bottom of these issues. I'm not sure
>>>>> whether they could be fixed by changing some settings in ceph.conf or
>>>>> whether a full downgrade back to Firefly is needed. Is a downgrade
>>>>> even possible on a production cluster?
>>>>>
>>>>> Thanks for your help
>>>>>
>>>>> Andrei

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
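Coming back to the ceph.conf posted at the bottom of the thread: one quick sanity check, assuming the admin sockets are in their default location, is to ask a running daemon which values it actually picked up. The option names in the grep are only examples:

    # run on the node hosting osd.0; substitute the id of a local daemon
    ceph daemon osd.0 config show | grep -E 'osd_heartbeat_grace|osd_heartbeat_interval|osd_op_threads|filestore_op_threads'

This confirms whether the tuned values from the file are in effect on the daemons that keep getting marked down, before any of them are changed further.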