Re: slow fio random read benchmark, need help

Hi,
I use a small file size (1G) to be sure it can be handled in the buffer cache. (I don't see any read access on the disks with iostat during the test.)
But I think the problem is not the disk hardware IOs, but a bottleneck somewhere in the ceph protocol.
(None of the benchmarks I have seen on the ceph mailing list reach more than 20,000 iops, even with full-SSD ceph clusters and bigger CPUs.)
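For reference, the check during the bench is simply something like this on each osd node, watching that r/s stays at 0 on the osd data disks:

iostat -x 1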

>>If you can get 40K random read IOPS out of 18 spindles, I 
>>have to ask why you think most of those operations made 
>>it to disk. It sounds to me like they were being satisfied 
>>out of cache. 

(The 40k was on a ZFS SAN, served from the ZFS ARC memory cache, so there was no read access on disk there either.)


>>Are you sure that fio is doing what you think it is doing? 
Yes, I'm sure; I have already benchmarked a lot of SAN arrays with fio.
I have also done the same test with a sheepdog cluster (same hardware), and I can reach 20,000-30,000 io/s if the buffer cache is big enough.



I would like to know where the bottleneck is before building a ceph cluster with more powerful servers and full-SSD OSDs.


Does Inktank have any random read/write IO benchmarks?
(I see a lot of sequential benchmarks with high bandwidth results, but not many random io/s results.)


Regards,

Alexandre

----- Original Message ----- 

De: "Mark Kampe" <mark.kampe@xxxxxxxxxxx> 
À: "Alexandre DERUMIER" <aderumier@xxxxxxxxx> 
Cc: "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx> 
Envoyé: Mercredi 31 Octobre 2012 17:56:26 
Objet: Re: slow fio random read benchmark, need help 

I'm a little confused by the math here: 

15K RPM = 250 rotations/second 
3 hosts * 6 OSDs/host = 18 spindles 
18 spindles * 250 rotations/second = 4500 tracks/second (sans seeks) 

4K direct random reads against large files should have negligible 
cache hits. But large numbers of parallel operations (greater 
iodepth) may give us multiple (coincidental) reads per track, 
which could push us a little above one read per track, and 
enable some good head scheduling (keeping the seeks small) 
but even so the seeks are probably going to cut that number 
by half or worse. 

If you can get 40K random read IOPS out of 18 spindles, I 
have to ask why you think most of those operations made 
it to disk. It sounds to me like they were being satisfied 
out of cache. 
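
In rough numbers (the halving for seek overhead is only a guess):

4500 reads/second (zero seek) / 2 ≈ 2250 random reads/second actually hitting the disks
40,000 reported vs ~2,250 plausible => well over 90% of those reads would have to be cache hits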

Are you sure that fio is doing what you think it is doing? 


On Wed, Oct 31, 2012 at 9:29 AM, Alexandre DERUMIER < aderumier@xxxxxxxxx > wrote: 



>>Have you tried increasing the iodepth? 
Yes, I have tried with iodepth 100 and 200, with the same results. 

I have also tried directly from the host, with /dev/rbd1, and I get the same result. 
I have also tried with 3 different hosts, with different CPU models. 
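
Roughly, the host-side run was the same fio line as in my original mail below, just pointed at the rbd device:

fio --filename=/dev/rbd1 --rw=randread --bs=4K --size=1000M --iodepth=40 --group_reporting --name=file1 --ioengine=libaio --direct=1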

(note: I can reach around 40,000 iops with the same fio config on a ZFS iSCSI array) 

My test ceph cluster nodes have old CPUs (Xeon E5420), but they sit at around 10% usage during the bench, so I think that's ok. 


Do you have any idea what I could trace? 
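For example, would raising the debug levels on one osd node help? Something like this in ceph.conf:

debug osd = 20
debug ms = 1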

Thanks, 

Alexandre 

----- Original Message ----- 

De: "Sage Weil" < sage@xxxxxxxxxxx > 
À: "Alexandre DERUMIER" < aderumier@xxxxxxxxx > 
Cc: "ceph-devel" < ceph-devel@xxxxxxxxxxxxxxx > 
Envoyé: Mercredi 31 Octobre 2012 16:57:05 
Objet: Re: slow fio random read benchmark, need help 



On Wed, 31 Oct 2012, Alexandre DERUMIER wrote: 
> Hello, 
> 
> I'm doing some tests with fio from a qemu 1.2 guest (virtio disk, cache=none): randread with a 4K block size on a small 1G file (so it can be handled by the buffer cache on the ceph cluster). 
> 
> 
> fio --filename=/dev/vdb --rw=randread --bs=4K --size=1000M --iodepth=40 --group_reporting --name=file1 --ioengine=libaio --direct=1 
> 
> 
> I can't get more than 5000 iops. 

Have you tried increasing the iodepth? 

sage 

> 
> 
> RBD cluster is : 
> --------------- 
> 3 nodes,with each node : 
> -6 x osd 15k drives (xfs), journal on tmpfs, 1 mon 
> -cpu: 2x 4 cores intel xeon E5420@2.5GHZ 
> rbd 0.53 
> 
> ceph.conf 
> 
> journal dio = false 
> filestore fiemap = false 
> filestore flusher = false 
> osd op threads = 24 
> osd disk threads = 24 
> filestore op threads = 6 
> 
> kvm host is : 4 x 12 cores opteron 
> ------------ 
> 
> 
> During the bench: 
> 
> on ceph nodes: 
> - cpu is around 10% used 
> - iostat shows no disk activity on the osds (so I think the 1G file is handled in the linux buffer cache). 
> 
> 
> on kvm host: 
> 
> -cpu is around 20% used 
> 
> 
> I really don't see where the bottleneck is... 
> 
> Any Ideas, hints ? 
> 
> 
> Regards, 
> 
> Alexandre 





-- 
Mark.Kampe@xxxxxxxxxxx 
VP, Engineering 
Mobile: +1-213-400-8857 
Office: +1-323-375-3863 

