Hi Cephers,
I'm testing cluster throughput before moving to production. Ceph version 13.2.1 (I'll update to 13.2.2).
I run rados bench in parallel from 10 cluster nodes and 10 clients.
Just after the rados command starts, the HDDs behind three OSDs are 100% utilized while the others stay below 40%. After a short while only one OSD stays at 100% utilization. I stopped this OSD to rule out a hardware issue, but then another OSD on another node started hitting 100% disk utilization during the next rados bench write. The same OSD is fully utilized on every bench run.
Device:  rrqm/s  wrqm/s  r/s   w/s     rMB/s  wMB/s   avgrq-sz  avgqu-sz  await   r_await  w_await  svctm  %util
sdd      0,00    0,00    0,00  518,00  0,00   129,50  512,00    87,99     155,12  0,00     155,12   1,93   100,00
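For reference, this is roughly how I map the busy disk back to Ceph (osd.42 below is just a placeholder for whichever OSD sits on sdd):

ceph osd perf                # per-OSD commit/apply latency, the hot one stands out
ceph pg ls-by-osd osd.42     # which PGs are hosted on the busy OSD
ceph osd df tree             # PG count and utilization per OSD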
The test pool is replicated with size 3. (Deep-)scrubbing is temporarily disabled.
Networking, CPU and memory are underutilized during the test.
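For completeness, the pool settings I'm double-checking before blaming the hardware:

ceph osd pool get rbd_test size      # replication size, 3 as mentioned above
ceph osd pool get rbd_test pg_num    # PG count of the test pool
ceph osd pool get rbd_test pgp_num
ceph osd df                          # PGs per OSD -- a very uneven count would explain hot disks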
The exact rados command is:
rados bench --name client.rbd_test -p rbd_test 600 write --no-cleanup --run-name $(hostname)_bench
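Since --no-cleanup leaves the benchmark objects in the pool, I remove each client's objects afterwards with something like:

rados --name client.rbd_test -p rbd_test cleanup --run-name $(hostname)_bench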
The same thing happens with:
rados --name client.rbd_test -p rbd_test load-gen --min-object-size 4M --max-object-size 4M --min-op-len 4M --max-op-len 4M --max-ops 16 --read-percent 0 --target-throughput 1000 --run-length 600
Do you see the same behavior? It smells like it's related to a particular PG. Is it an effect of running a number of rados bench tasks in parallel?
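To check the PG theory, I'm planning to map a few bench objects to their OSDs (object_name is a placeholder taken from the ls output); if the same OSD keeps showing up as primary, that would explain the single hot disk:

rados --name client.rbd_test -p rbd_test ls | head -5    # pick a few real bench object names
ceph osd map rbd_test object_name                        # prints the pg id and up/acting OSD set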
Of course, I don't deny it may simply be the cluster's limit, but I'm not sure why only one, and always the same, OSD keeps hitting 100% utilization. Tomorrow I'm going to test the cluster using rbd.
What does your cluster's limit look like? A saturated LACP bond? 100% utilized HDDs?
Thanks,
Jakub