Hi Cephers,
I'm testing cluster throughput before moving to production. Ceph version 13.2.1 (I'll update to 13.2.2).
I run rados bench in parallel from 10 cluster nodes and 10 clients.
Just after the rados command starts, the HDDs behind three OSDs are 100% utilized while the others stay below 40%. After a short while only one OSD stays at 100% utilization. I stopped this OSD to rule out a hardware issue, but then another OSD on another node started hitting 100% disk utilization during the next rados bench write. The same OSD is fully utilized on every bench run.
Device:  rrqm/s  wrqm/s  r/s   w/s     rMB/s  wMB/s   avgrq-sz  avgqu-sz  await   r_await  w_await  svctm  %util
sdd      0,00    0,00    0,00  518,00  0,00   129,50  512,00    87,99     155,12  0,00     155,12   1,93   100,00
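For reference, this is roughly how I map the busy disk back to Ceph (osd.42 below is just a placeholder for whichever OSD sits on sdd):

ceph osd perf                # per-OSD commit/apply latency, the hot one stands out
ceph pg ls-by-osd osd.42     # which PGs are hosted on the busy OSD
ceph osd df tree             # PG count and utilization per OSD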
The test pool is replicated with size 3. (Deep-)scrubbing is temporarily disabled.
Networking, CPU and memory are underutilized during the test.
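For completeness, the pool settings I'm double-checking before blaming the hardware:

ceph osd pool get rbd_test size      # replication size, 3 as mentioned above
ceph osd pool get rbd_test pg_num    # PG count of the test pool
ceph osd pool get rbd_test pgp_num
ceph osd df                          # PGs per OSD -- a very uneven count would explain hot disks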
The exact rados command is:
rados bench --name client.rbd_test -p rbd_test 600 write --no-cleanup --run-name $(hostname)_bench
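Since --no-cleanup leaves the benchmark objects in the pool, I remove each client's objects afterwards with something like:

rados --name client.rbd_test -p rbd_test cleanup --run-name $(hostname)_bench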
The same thing happens with:
rados --name client.rbd_test -p rbd_test load-gen --min-object-size 4M --max-object-size 4M --min-op-len 4M --max-op-len 4M --max-ops 16 --read-percent 0 --target-throughput 1000 --run-length 600
Do you see the same behavior? It smells like it's related to a particular PG. Is it an effect of running a number of rados bench tasks in parallel?
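To check the PG theory, I'm planning to map a few bench objects to their OSDs (object_name is a placeholder taken from the ls output); if the same OSD keeps showing up as primary, that would explain the single hot disk:

rados --name client.rbd_test -p rbd_test ls | head -5    # pick a few real bench object names
ceph osd map rbd_test object_name                        # prints the pg id and up/acting OSD set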
Of course, I don't deny it may simply be the cluster's limit, but I'm not sure why only one, and always the same, OSD keeps hitting 100% utilization. Tomorrow I'm going to test the cluster using rbd.
What does your cluster's limit look like? A saturated LACP bond? 100% utilized HDDs?
Thanks,
Jakub