Re: OSDs going down during radosbench benchmark

Hi Tom, here are a few things you can look into. Some of these depend on
how many OSDs you're trying to run on a single chassis.

# up PIDs, otherwise you may run out of the ability to spawn new threads
kernel.pid_max=4194303

# up available mem for sudden bursts, like during benchmarking
vm.min_free_kbytes = <something reasonable, like 2GB; the value is in kB>
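
For what it's worth, here is a minimal Python sketch for comparing a host's current values against the two sysctls above. It assumes a Linux host that exposes sysctls under /proc/sys. The helper names and the 2 GiB target are illustrative only, not anything Ceph ships.

```python
from pathlib import Path


def gib_to_kbytes(gib):
    """vm.min_free_kbytes is expressed in kB, so convert GiB to kB."""
    return int(gib * 1024 * 1024)


def read_sysctl(name, proc_root="/proc/sys"):
    """Read an integer sysctl such as 'kernel.pid_max' from procfs."""
    path = Path(proc_root).joinpath(*name.split("."))
    return int(path.read_text().split()[0])


def check_sysctls(wanted, proc_root="/proc/sys"):
    """Return {name: (current, target)} for every sysctl below its target."""
    low = {}
    for name, target in wanted.items():
        current = read_sysctl(name, proc_root)
        if current < target:
            low[name] = (current, target)
    return low


# Targets taken from the suggestions above; 2 GiB is only an example figure.
WANTED = {
    "kernel.pid_max": 4194303,
    "vm.min_free_kbytes": gib_to_kbytes(2),
}
```

Calling check_sysctls(WANTED) on an OSD host returns only the settings that are still below the suggested values, so an empty dict means nothing needs raising.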

In ceph.conf:

max_open_files = <32K or more>

# make sure the bind port range is wide enough for the number of OSDs
ms bind port min = 6800
ms bind port max = 9000
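
As a rough sanity check on that range, a small Python sketch follows. The ports_per_osd figure of 4 is an assumption for illustration, so verify how many listening ports each daemon actually holds on your version. The OSD count of 21 is the cluster total from the original post, used here as if every OSD sat on one host.

```python
def ports_in_range(port_min, port_max):
    """Number of ports in the inclusive range [port_min, port_max]."""
    if port_max < port_min:
        raise ValueError("port_max must be >= port_min")
    return port_max - port_min + 1


def range_covers_osds(port_min, port_max, num_osds, ports_per_osd=4):
    """True if the bind range has room for num_osds daemons.

    ports_per_osd=4 is an ASSUMPTION, not a value taken from this thread.
    """
    return ports_in_range(port_min, port_max) >= num_osds * ports_per_osd


print(ports_in_range(6800, 9000))         # 2201
print(range_covers_osds(6800, 9000, 21))  # True
```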

You may need to up your network tuning as well, but it's less likely to
cause these sorts of problems. Watch your netstat -s for clues.
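
One way to skim netstat -s for clues is sketched below in Python. The keyword list is an assumption, since counter wording varies between versions, and the parser only handles lines of the form "<count> <description>", so lines printed as "Name: value" are skipped.

```python
import re

# Substrings that often mark trouble in `netstat -s` output. This list is an
# assumption for illustration; adjust it to the wording your system prints.
SUSPECT = ("retransmit", "overflow", "dropped", "pruned", "collapsed")


def suspicious_counters(netstat_text, keywords=SUSPECT):
    """Return (count, description) for nonzero counters matching a keyword."""
    hits = []
    for line in netstat_text.splitlines():
        m = re.match(r"\s*(\d+)\s+(.*\S)", line)
        if not m:
            continue  # section headers and "Name: value" lines are ignored
        count, desc = int(m.group(1)), m.group(2)
        if count > 0 and any(k in desc.lower() for k in keywords):
            hits.append((count, desc))
    return hits
```

Feed it the text captured from netstat -s before and after a benchmark run and compare the two results; counters that jump during the run are the ones worth chasing.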

Warren Wang



On 9/12/16, 12:44 PM, "ceph-users on behalf of Deneau, Tom"
<ceph-users-bounces@xxxxxxxxxxxxxx on behalf of tom.deneau@xxxxxxx> wrote:

>Trying to understand why some OSDs (6 out of 21) went down in my cluster
>while running a CBT radosbench benchmark.  From the logs below, is this a
>networking problem between systems, or is it some kind of FileStore
>problem?
>
>Looking at one crashed OSD log, I see the following crash error:
>
>2016-09-09 21:30:29.757792 7efc6f5f1700 -1 FileStore: sync_entry timed
>out after 600 seconds.
> ceph version 10.2.1-13.el7cp (f15ca93643fee5f7d32e62c3e8a7016c1fc1e6f4)
>
>just before that I see things like:
>
>2016-09-09 21:18:07.391760 7efc755fd700 -1 osd.12 165 heartbeat_check: no
>reply from osd.6 since back 2016-09-09 21:17:47.261601 front 2016-09-09
>21:17:47.261601 (cutoff 2016-09-09 21:17:47.391758)
>
>and also
>
>2016-09-09 19:03:45.788327 7efc53905700  0 -- 10.0.1.2:6826/58682 >>
>10.0.1.1:6832/19713 pipe(0x7efc8bfbc800 sd=65 :52000 s=1 pgs=12 cs=1 l=0\
> c=0x7efc8bef5b00).connect got RESETSESSION
>
>and many warnings for slow requests.
>
>
>All the other osds that died seem to have died with:
>
>2016-09-09 19:11:01.663262 7f2157e65700 -1 common/HeartbeatMap.cc: In
>function 'bool ceph::HeartbeatMap::_check(const
>ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f2157e65700 time
>2016-09-09 19:11:01.660671
>common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")
>
>
>-- Tom Deneau, AMD
>
>
>
>
>
>_______________________________________________
>ceph-users mailing list
>ceph-users@xxxxxxxxxxxxxx
>http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
