Re: [EXTERNAL] Upgrading 0.94.6 -> 0.94.9 saturating mon node networking

Just went through this while upgrading a ~400 OSD cluster, and I was in the EXACT spot you're in. The faster you can get all the OSDs onto the same version as the MONs, the better. We decided to power forward, and performance got better with every OSD node we patched.
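
If it helps, this is roughly how we kept track of which daemons were still down-rev. It's a sketch from memory, so double-check it on your build; the wildcard tell uses the same mechanism as your injectargs command, and the mon ID matching the short hostname is just an assumption about your setup:

    ceph tell osd.* version                   # any OSD still reporting 0.94.6 needs patching
    ceph daemon mon.$(hostname -s) version    # run on each mon host via the admin socket;
                                              # assumes mon IDs match the short hostname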

Additionally, I discovered that the mon LevelDB stores will start growing very quickly if you leave your cluster in that mixed state for too long.
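
If your mon stores have already bloated, compacting them is what reclaimed the space for us. Something along these lines (the mon ID here is an example, substitute your own):

    ceph tell mon.a compact      # trigger a manual LevelDB compaction now

or have each mon compact its store on restart via ceph.conf:

    [mon]
        mon compact on start = true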

Pretty sure the down-rev OSDs can't use the incremental maps (that's what those crc warnings are about), so they keep aggressively re-requesting full osdmaps from the MONs, and it turns into a kind of feedback loop.
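
You can sanity-check that theory by grabbing a full map and multiplying its size by the number of down-rev OSDs asking for it on every epoch change (the /tmp path is just an example):

    ceph osd getmap -o /tmp/osdmap   # fetch the current full osdmap
    ls -lh /tmp/osdmap               # full-map size x re-requesting OSDs per epoch
                                     # roughly approximates the mon egress you'd see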

> On Sep 21, 2016, at 4:21 PM, Stillwell, Bryan J <Bryan.Stillwell@xxxxxxxxxxx> wrote:
> 
> While attempting to upgrade a 1200+ OSD cluster from 0.94.6 to 0.94.9 I've
> run into serious performance issues every time I restart an OSD.
> 
> At first I thought the problem I was running into was caused by the osdmap
> encoding bug that Dan and Wido ran into when upgrading to 0.94.7, because
> I was seeing a ton (millions) of these messages in the logs:
> 
> 2016-09-21 20:48:32.831040 osd.504 24.161.248.128:6810/96488 24 : cluster
> [WRN] failed to encode map e727985 with expected crc
> 
> Here are the links to their descriptions of the problem:
> 
> http://www.spinics.net/lists/ceph-devel/msg30450.html
> https://www.mail-archive.com/ceph-users@xxxxxxxxxxxxxx/msg30783.html
> 
> I tried the solution of using the following command to stop those errors
> from occurring:
> 
> ceph tell osd.* injectargs '--clog_to_monitors false'
> 
> That did get the messages to stop spamming the log files; however, it
> didn't fix the performance issue for me.
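
A side note on that workaround: injectargs doesn't survive a daemon restart, so if you want it to stick through the rest of the upgrade, the ceph.conf equivalent (assuming I have the option spelling right) is:

    [osd]
        clog to monitors = false

Though as you saw, it only quiets the log spam; it doesn't touch the map traffic.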
> 
> Using dstat on the mon nodes I was able to determine that every time the
> osdmap is updated (by running 'ceph osd pool set data size 2' in this
> example) it causes the outgoing network on all mon nodes to be saturated
> for multiple seconds at a time:
> 
> ----system---- ----total-cpu-usage---- ------memory-usage----- -net/total- -dsk/total- --io/total-
>      time     |usr sys idl wai hiq siq| used  buff  cach  free| recv  send| read  writ| read  writ
> 21-09 21:06:53|  1   0  99   0   0   0|11.8G  273M 18.7G  221G|2326k 9015k|   0  1348k|   0  16.0
> 21-09 21:06:54|  1   1  98   0   0   0|11.9G  273M 18.7G  221G|  15M   10M|   0  1312k|   0  16.0
> 21-09 21:06:55|  2   2  94   0   0   1|12.3G  273M 18.7G  220G|  14M  311M|   0    48M|   0   309
> 21-09 21:06:56|  2   3  93   0   0   3|12.2G  273M 18.7G  220G|7745k 1190M|   0    16M|   0  93.0
> 21-09 21:06:57|  1   2  96   0   0   1|12.0G  273M 18.7G  220G|8269k 1189M|   0  1956k|   0  10.0
> 21-09 21:06:58|  3   1  95   0   0   1|11.8G  273M 18.7G  221G|4854k  752M|   0  4960k|   0  21.0
> 21-09 21:06:59|  3   0  97   0   0   0|11.8G  273M 18.7G  221G|3098k   25M|   0  5036k|   0  26.0
> 21-09 21:07:00|  1   0  98   0   0   0|11.8G  273M 18.7G  221G|2247k   25M|   0  9980k|   0  45.0
> 21-09 21:07:01|  2   1  97   0   0   0|11.8G  273M 18.7G  221G|4149k   17M|   0    76M|   0   427
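
For anyone wanting the same view: those column groups line up with dstat's time, cpu, memory, net, disk, and io stats. The exact flags below are my guess from the column headers, not from Bryan's mail:

    dstat -tcmnd --io 1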
> 
> That peak of 1190 MiB/s works out to 9.982 Gbps, i.e. a saturated 10GbE link.
> 
> Restarting every OSD on a node at once as part of the upgrade causes a
> couple minutes worth of network saturation on all three mon nodes.  This
> causes thousands of slow requests and many unhappy OpenStack users.
> 
> I'm now stuck about 15% into the upgrade and haven't been able to
> determine how to move forward (or even backward) without causing another
> outage.
> 
> I've attempted to run the same test on another cluster with 1300+ OSDs and
> the outgoing network on the mon nodes didn't exceed 15 MiB/s (0.126 Gbps).
> 
> Any suggestions on how I can proceed?
> 
> Thanks,
> Bryan
> 
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


