The cluster had been running well for a long time, but last week OSDs
started to fail.
We use the cluster as image storage for OpenNebula (small load) and as
object storage (high load).
Sometimes the disks of some OSDs are 100% utilized: iostat shows avgqu-sz
over 1000 while only a few kilobytes per second are being read or written.
The OSDs on these disks become unresponsive and the cluster marks them
down. We lowered the load on the object storage and the situation improved.
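(For reference, the queue depth comes from extended iostat output, roughly:
$ iostat -x 1
with the avgqu-sz column above 1000 while rkB/s and wkB/s show only a few
KB/s.)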
Yesterday the situation became worse:
With the RGWs disabled and no requests going to the object storage, the
cluster performs well, but as soon as we enable the RGWs and make a few
PUTs or GETs, all non-SSD OSDs on all storage nodes end up in the same
state described above.
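(The PUTs and GETs are ordinary S3 requests, for example via s3cmd; bucket
and object names below are just placeholders:
$ s3cmd put ./testfile s3://some-bucket/testfile
$ s3cmd get s3://some-bucket/testfile ./testfile.out
s3cmd is only an example client here.)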
iotop shows that xfsaild/<disk> is hammering the disks.
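(Observed with something like the following; -o limits the output to
threads actually doing I/O:
$ iotop -o -b -n 5
and the xfsaild/<disk> kernel threads stay at the top.)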
trace-cmd record -e xfs\* for 10 seconds shows ~10 million xfs events; as
I understand it, that means roughly 360,000 events per OSD over those 10
seconds:
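Roughly how the report (t.t) was produced, assuming the default trace.dat
output file:
$ trace-cmd record -e xfs\* sleep 10
$ trace-cmd report > t.t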
$ wc -l t.t
10256873 t.t
Fragmentation on one of these disks is about 3%.
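(For anyone who wants to check the same, the XFS fragmentation factor can
be read with xfs_db's frag command:
$ xfs_db -r -c frag /dev/sdX
where /dev/sdX is a placeholder for the OSD data device.)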
More information about the cluster:
https://yadi.sk/d/Y63mXQhl3HPvwt
Also debug logs for osd.33 while the problem occurs:
https://yadi.sk/d/kiqsMF9L3HPvte
debug_osd = 20/20
debug_filestore = 20/20
debug_tp = 20/20
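(These were the effective levels; if the same logs are needed from another
OSD, they can also be raised at runtime without a restart, e.g.:
$ ceph tell osd.33 injectargs '--debug_osd 20/20 --debug_filestore 20/20 --debug_tp 20/20'
)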
Ubuntu 14.04
$ uname -a
Linux storage01 4.2.0-42-generic #49~14.04.1-Ubuntu SMP Wed Jun 29
20:22:11 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
Ceph 10.2.7
7 storage nodes: Supermicro, 28 OSDs on 4 TB 7200 rpm disks (JBOD),
journals on a RAID10 of 4 Intel 3510 800 GB SSDs, plus 2 SSD OSDs on
Intel 3710 400 GB for RGW metadata and index.
One of these nodes differs only in the number of OSDs: it has 26 OSDs on
4 TB disks instead of 28.
Storage nodes are connected to each other via bonded 2x10 Gbit.
Clients connect to the storage nodes via bonded 2x1 Gbit.
5 storage nodes have 2x E5-2650v2 CPUs and 256 GB RAM.
2 storage nodes have 2x E5-2690v3 CPUs and 512 GB RAM.
7 mons
3 RGWs
Please help me rescue the cluster.
--
Dmitriev Anton