Re: Giant upgrade - stability issues

Andrei Mikhailovsky <andrei@xxxxxxxxxx> · Thu, 20 Nov 2014 15:34:02 +0000 (GMT)

Sam,

further to your email I have done the following:

1. Upgraded both osd servers with the latest updates and restarted each server in turn
2. fired up nping utility to generate TCP connections (3 way handshake) from each of the servers as well as from the host servers.  In total i've ran 5 tests. The nping utility was establishing connects on port 22 (as all servers have this port open) with the delay of 1ms.  The command used to generate the traffic was as follows:

nping --tcp-connect -p 22 --delay 1ms <hostname> -v2 -c 36000000 | gzip >/root/nping-hostname-output.gz

The tests took just over 12 hours to complete. The results did not show any problems as far as I can see. Here is the tailed output of one of the findings:

SENT (37825.7303s) Starting TCP Handshake > arh-ibstorage1-ib:22 (192.168.168.200:22)
RECV (37825.7303s) Handshake with arh-ibstorage1-ib:22 (192.168.168.200:22) completed

Max rtt: 4.447ms | Min rtt: 0.008ms | Avg rtt: 0.008ms
TCP connection attempts: 36000000 | Successful connections: 36000000 | Failed: 0 (0.00%)
Tx time: 37825.72833s | Tx bytes/s: 76138.65 | Tx pkts/s: 951.73
Rx time: 37825.72939s | Rx bytes/s: 38069.33 | Rx pkts/s: 951.73
Nping done: 1 IP address pinged in 37844.55 seconds

As you can see from the above, there are no failed connects at all from the 36 million established connections. The average delay is 0.008ms and it was sending on average almost 1000 packets per second. I've got the same results from other servers.

Unless you have other tests in mind, I think there are no issues with the network.

I fire up another test for 24 hours this time to see if it makes a difference.

Thanks

Andrei

From: "Samuel Just" <sam.just@xxxxxxxxxxx>
To: "Andrei Mikhailovsky" <andrei@xxxxxxxxxx>
Cc: ceph-users@xxxxxxxxxxxxxx
Sent: Wednesday, 19 November, 2014 9:45:40 PM
Subject: Re:  Giant upgrade - stability issues

Well, the heartbeats are failing due to networking errors preventing
the heartbeats from arriving.  That is causing osds to go down, and
that is causing pgs to become degraded.  You'll have to work out what
is preventing the tcp connections from being stable.
-Sam

On Wed, Nov 19, 2014 at 1:39 PM, Andrei Mikhailovsky <andrei@xxxxxxxxxx> wrote:
>
>>You indicated that osd 12 and 16 were the ones marked down, but it
>>looks like only 0,1,2,3,7 were marked down in the ceph.log you sent.
>>The logs for 12 and 16 did indicate that they had been partitioned
>>from the other nodes.  I'd bet that you are having intermittent
>>network trouble since the heartbeats are intermittently failing.
>>-Sam
>
> AM: I will check the logs further for the osds 12 and 16. Perhaps I've
> missed something, but the ceph osd tree output was showing 12 and 16 as
> down.
>
> Regarding the failure of heartbeats, Wido has suggested that I should
> investigate the reason for it's failure. The obvious thing to look at is the
> network and this is what I've initially done. However, I do not see any
> signs of the network issues. There are no errors on the physical interface
> and ifconfig is showing a very small number of TX dropped packets (0.00006%)
> and 0 errors:
>
>
> # ifconfig ib0
> ib0       Link encap:UNSPEC  HWaddr
> 80-00-00-48-FE-80-00-00-00-00-00-00-00-00-00-00
>           inet addr:192.168.168.200  Bcast:192.168.168.255
> Mask:255.255.255.0
>           inet6 addr: fe80::223:7dff:ff94:e2a5/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
>           RX packets:1812895801 errors:0 dropped:52 overruns:0 frame:0
>           TX packets:1835002992 errors:0 dropped:1037 overruns:0 carrier:0
>           collisions:0 txqueuelen:2048
>           RX bytes:6252740293262 (6.2 TB)  TX bytes:11343307665152 (11.3 TB)
>
>
> How would I investigate what is happening with the hearbeats and the reason
> for their failures? I have a suspetion that this will solve the issues with
> frequent reporting of degraded PGs on the cluster and intermittent high
> levels of IO wait on vms.
>
> And also, as i've previously mentioned, the issues started to happen after
> the upgrade to Giant. I've not had these problems with Firefly, Emperor or
> Dumpling releases on the same hardware and same cluster loads.
>
> Thanks
>
> Andrei
>
>
>
>
> On Tue, Nov 18, 2014 at 3:34 PM, Andrei Mikhailovsky <andrei@xxxxxxxxxx>
> wrote:
>> Sam,
>>
>> Pastebin or similar will not take tens of megabytes worth of logs. If we
>> are
>> talking about debug_ms 10 setting, I've got about 7gb worth of logs
>> generated every half an hour or so. Not really sure what to do with that
>> much data. Anything more constructive?
>>
>> Thanks
>> ________________________________
>> From: "Samuel Just" <sam.just@xxxxxxxxxxx>
>> To: "Andrei Mikhailovsky" <andrei@xxxxxxxxxx>
>> Cc: ceph-users@xxxxxxxxxxxxxx
>> Sent: Tuesday, 18 November, 2014 8:53:47 PM
>>
>> Subject: Re:  Giant upgrade - stability issues
>>
>> pastebin or something, probably.
>> -Sam
>>
>> On Tue, Nov 18, 2014 at 12:34 PM, Andrei Mikhailovsky <andrei@xxxxxxxxxx>
>> wrote:
>>> Sam, the logs are rather large in size. Where should I post it to?
>>>
>>> Thanks
>>> ________________________________
>>> From: "Samuel Just" <sam.just@xxxxxxxxxxx>
>>> To: "Andrei Mikhailovsky" <andrei@xxxxxxxxxx>
>>> Cc: ceph-users@xxxxxxxxxxxxxx
>>> Sent: Tuesday, 18 November, 2014 7:54:56 PM
>>> Subject: Re: [ceph-users] Giant upgrade - stability issues
>>>
>>>
>>> Ok, why is ceph marking osds down?  Post your ceph.log from one of the
>>> problematic periods.
>>> -Sam
>>>
>>> On Tue, Nov 18, 2014 at 1:35 AM, Andrei Mikhailovsky <andrei@xxxxxxxxxx>
>>> wrote:
>>>> Hello cephers,
>>>>
>>>> I need your help and suggestion on what is going on with my cluster. A
>>>> few
>>>> weeks ago i've upgraded from Firefly to Giant. I've previously written
>>>> about
>>>> having issues with Giant where in two weeks period the cluster's IO
>>>> froze
>>>> three times after ceph down-ed two osds. I have in total just 17 osds
>>>> between two osd servers, 3 mons. The cluster is running on Ubuntu 12.04
>>>> with
>>>> latest updates.
>>>>
>>>> I've got zabbix agents monitoring the osd servers and the cluster. I get
>>>> alerts of any issues, such as problems with PGs, etc. Since upgrading to
>>>> Giant, I am now frequently seeing emails alerting of the cluster having
>>>> degraded PGs. I am getting around 10-15 such emails per day stating that
>>>> the
>>>> cluster has degraded PGs. The number of degraded PGs very between a
>>>> couple
>>>> of PGs to over a thousand. After several minutes the cluster repairs
>>>> itself.
>>>> The total number of PGs in the cluster is 4412 between all the pools.
>>>>
>>>> I am also seeing more alerts from vms stating that there is a high IO
>>>> wait
>>>> and also seeing hang tasks. Some vms reporting over 50% io wait.
>>>>
>>>> This has not happened on Firefly or the previous releases of ceph. Not
>>>> much
>>>> has changed in the cluster since the upgrade to Giant. Networking and
>>>> hardware is still the same and it is still running the same version of
>>>> Ubuntu OS. The cluster load hasn't changed as well. Thus, I think the
>>>> issues
>>>> above are related to the upgrade of ceph to Giant.
>>>>
>>>> Here is the ceph.conf that I use:
>>>>
>>>> [global]
>>>> fsid = 51e9f641-372e-44ec-92a4-b9fe55cbf9fe
>>>> mon_initial_members = arh-ibstorage1-ib, arh-ibstorage2-ib,
>>>> arh-cloud13-ib
>>>> mon_host = 192.168.168.200,192.168.168.201,192.168.168.13
>>>> auth_supported = cephx
>>>> osd_journal_size = 10240
>>>> filestore_xattr_use_omap = true
>>>> public_network = 192.168.168.0/24
>>>> rbd_default_format = 2
>>>> osd_recovery_max_chunk = 8388608
>>>> osd_recovery_op_priority = 1
>>>> osd_max_backfills = 1
>>>> osd_recovery_max_active = 1
>>>> osd_recovery_threads = 1
>>>> filestore_max_sync_interval = 15
>>>> filestore_op_threads = 8
>>>> filestore_merge_threshold = 40
>>>> filestore_split_multiple = 8
>>>> osd_disk_threads = 8
>>>> osd_op_threads = 8
>>>> osd_pool_default_pg_num = 1024
>>>> osd_pool_default_pgp_num = 1024
>>>> osd_crush_update_on_start = false
>>>>
>>>> [client]
>>>> rbd_cache = true
>>>> admin_socket = /var/run/ceph/$name.$pid.asok
>>>>
>>>>
>>>> I would like to get to the bottom of these issues. Not sure if the
>>>> issues
>>>> could be fixed with changing some settings in ceph.conf or a full
>>>> downgrade
>>>> back to the Firefly. Is the downgrade even possible on a production
>>>> cluster?
>>>>
>>>> Thanks for your help
>>>>
>>>> Andrei
>>>>
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-users@xxxxxxxxxxxxxx
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>
>>>
>>
>

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com