Hi,
We're having performance issues on a Jewel 10.2.2 cluster. It started
with I/Os taking several seconds to be acknowledged, so we ran some
benchmarks.
We could reproduce the problem with a rados bench on a new pool confined
to a single host (R730xd with 2 SSDs in JBOD and 12 x 4 TB NL-SAS drives
in RAID0 writeback) and no replication (min_size 1, size 1).
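For the record, the reproduction looks roughly like this (pool name, PG
count and rule id are just examples; we assume a CRUSH rule rooted at the
host bucket so that all PGs stay on that node):

  # CRUSH rule that keeps all data on the mom02h06 host bucket
  ceph osd crush rule create-simple bench-single-host mom02h06 osd
  ceph osd pool create bench 256 256
  ceph osd pool set bench crush_ruleset 1   # rule id from 'ceph osd crush rule dump'
  ceph osd pool set bench size 1
  ceph osd pool set bench min_size 1
  # sustained 4 MB object writes, objects left in place so the PG
  # directories keep filling up
  rados bench -p bench 600 write -t 16 --no-cleanup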
We suspect this is related to the XFS filestore directory split
operation, or possibly some other filestore operation.
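For reference, filestore splits a PG subdirectory once it holds more than
filestore_split_multiple * abs(filestore_merge_threshold) * 16 objects
(320 with the Jewel defaults). The values in effect on osd.12 can be
checked on its admin socket (default socket path assumed):

  ceph daemon osd.12 config show | grep -E 'filestore_(merge_threshold|split_multiple)'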
Could someone have a look at this video:
https://youtu.be/JQV3VfpAjbM?vq=hd1080
The video shows:
- admin node with commands and comments (top left)
- htop (middle left)
- rados bench (bottom left)
- iostat (top right)
- the growing number of directories in all PGs of that pool on osd.12
(/dev/sdd) and the growing number of objects in the pool (bottom right)
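If anyone wants to watch the same directory growth on their own OSDs,
something along these lines works (pool name, pool id lookup and mount
path are examples, adjust to your deployment):

  POOL_ID=$(ceph osd dump | awk '/pool.*bench/ {print $2}')   # id of the test pool
  watch -n 5 "find /var/lib/ceph/osd/ceph-12/current/${POOL_ID}.*_head -type d | wc -l"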
OSD debug log, perf report and OSD params:
ceph-osd.12.log (http://u2l.fr/ceph-osd-12-log-tgz), with full debug
logging enabled from 12:00:26 to 12:00:36. In the video, at 17'26" you
can see that osd.12 (/dev/sdd) is 100% busy at 12:00:26.
test_perf_report.txt (http://u2l.fr/test-perf-report-txt), based on
perf.data recorded from 12:02:50 to 12:03:44.
mom02h06_osd.12_config_show.txt (http://u2l.fr/osd-12-config-show)
mom02h06_osd.12_config_diff.txt (http://u2l.fr/osd-12-config-diff)
ceph-conf-osd-params.txt (http://u2l.fr/ceph-conf-osd-params)
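In case someone wants to capture the same data on their side, something
like this does it (the pgrep pattern, the debug levels restored at the
end and the output paths are examples, not necessarily the exact commands
we ran):

  # turn full OSD/filestore debugging on for ~10 seconds, then back to the
  # Jewel defaults
  ceph tell osd.12 injectargs '--debug_osd 20 --debug_filestore 20 --debug_ms 1'
  sleep 10
  ceph tell osd.12 injectargs '--debug_osd 0/5 --debug_filestore 1/3 --debug_ms 0/5'

  # ~60 s CPU profile of the osd.12 process while it is 100% busy
  OSD_PID=$(pgrep -f "ceph-osd .*--id 12 ")
  perf record -g -p "$OSD_PID" -- sleep 60
  perf report --stdio > test_perf_report.txt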
Regards,
--
Frédéric Nass
Sous-direction Infrastructures
Direction du Numérique
Université de Lorraine
Tel.: +33 3 72 74 11 35