Hi,
We're having performance issues on a Jewel 10.2.2 cluster. It started
with I/Os taking several seconds to be acknowledged, so we ran some
benchmarks.
We could reproduce the problem with a rados bench on a new pool confined
to a single host (R730xd with 2 SSDs in JBOD and 12 x 4 TB NL-SAS drives
in RAID0 writeback) and no replication (min_size 1, size 1).
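For the record, the reproduction looks roughly like this (pool name, PG
count and rule id are just examples; we assume a CRUSH rule rooted at the
host bucket so that all PGs stay on that node):

  # CRUSH rule that keeps all data on the mom02h06 host bucket
  ceph osd crush rule create-simple bench-single-host mom02h06 osd
  ceph osd pool create bench 256 256
  ceph osd pool set bench crush_ruleset 1   # rule id from 'ceph osd crush rule dump'
  ceph osd pool set bench size 1
  ceph osd pool set bench min_size 1
  # sustained 4 MB object writes, objects left in place so the PG
  # directories keep filling up
  rados bench -p bench 600 write -t 16 --no-cleanup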
We suspect this is related to the XFS filestore directory split
operation, or possibly some other filestore operation.
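For reference, filestore splits a PG subdirectory once it holds more than
filestore_split_multiple * abs(filestore_merge_threshold) * 16 objects
(320 with the Jewel defaults). The values in effect on osd.12 can be
checked on its admin socket (default socket path assumed):

  ceph daemon osd.12 config show | grep -E 'filestore_(merge_threshold|split_multiple)'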
Could someone have a look at this video:
https://youtu.be/JQV3VfpAjbM?vq=hd1080
The video shows:
- admin node with commands and comments (top left)
- htop (middle left)
- rados bench (bottom left)
- iostat (top right)
- the growing number of directories in all PGs of that pool on osd.12
(/dev/sdd) and the growing number of objects in the pool (bottom right)
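If anyone wants to watch the same directory growth on their own OSDs,
something along these lines works (pool name, pool id lookup and mount
path are examples, adjust to your deployment):

  POOL_ID=$(ceph osd dump | awk '/pool.*bench/ {print $2}')   # id of the test pool
  watch -n 5 "find /var/lib/ceph/osd/ceph-12/current/${POOL_ID}.*_head -type d | wc -l"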
OSD debug log, perf report and OSD params:
ceph-osd.12.log (http://u2l.fr/ceph-osd-12-log-tgz), with full debug
logging enabled from 12:00:26 to 12:00:36. In the video, at 17'26" you
can see that osd.12 (/dev/sdd) is 100% busy at 12:00:26.
test_perf_report.txt (http://u2l.fr/test-perf-report-txt), based on
perf.data recorded from 12:02:50 to 12:03:44.
mom02h06_osd.12_config_show.txt (http://u2l.fr/osd-12-config-show)
mom02h06_osd.12_config_diff.txt (http://u2l.fr/osd-12-config-diff)
ceph-conf-osd-params.txt (http://u2l.fr/ceph-conf-osd-params)
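In case someone wants to capture the same data on their side, something
like this does it (the pgrep pattern, the debug levels restored at the
end and the output paths are examples, not necessarily the exact commands
we ran):

  # turn full OSD/filestore debugging on for ~10 seconds, then back to the
  # Jewel defaults
  ceph tell osd.12 injectargs '--debug_osd 20 --debug_filestore 20 --debug_ms 1'
  sleep 10
  ceph tell osd.12 injectargs '--debug_osd 0/5 --debug_filestore 1/3 --debug_ms 0/5'

  # ~60 s CPU profile of the osd.12 process while it is 100% busy
  OSD_PID=$(pgrep -f "ceph-osd .*--id 12 ")
  perf record -g -p "$OSD_PID" -- sleep 60
  perf report --stdio > test_perf_report.txt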
Regards,
--
Frédéric Nass
Sous-direction Infrastructures
Direction du Numérique
Université de Lorraine
Tel.: +33 3 72 74 11 35