On 01.11.2012 at 06:11, Alexandre DERUMIER <aderumier@xxxxxxxxx> wrote:
>>> Come to think of it, that 15k iops I mentioned was on 10G ethernet with
>>> NFS. I have tried InfiniBand with IPoIB and TCP; it's similar to 10G
>>> ethernet.
>
> I have seen new Arista 10GbE switches with latency around 1 microsecond; that seems good enough for the job.

Pretty interesting. How can I measure switch / network latency?

Stefan

>
>>> You will need to get creative. What you're asking for really is to
>>> have local latencies with remote storage. Just off the top of my
>>> head, you may look into some way to do local caching on SSD for your
>>> RBD volume, like bcache or flashcache.
>
> I have already thought about it. (But I would like to use qemu-rbd if possible.)
>
>>> At any rate, 5000 iops is not as good as a new SSD, but far better
>>> than a normal disk. Is there some specific application requirement, or
>>> is it just that you feel you want the full performance from the VM?
>
> I have some customers with huge databases (too big to be handled by the buffer cache) that require a lot of IOs (around 10k).
>
> I have redone the tests with 4 guests in parallel and I get 4 x 5000 iops, so it does seem to scale! (And CPU usage is very low on the ceph cluster.)
>
> So I'll try some tricks, like RAID over multiple RBD devices; maybe that will help.
>
> Thanks again for the help, Marcus; I was not aware of these latency problems.
>
> Regards,
>
> Alexandre
>
>
> ----- Original message -----
>
> From: "Marcus Sorensen" <shadowsor@xxxxxxxxx>
> To: "Alexandre DERUMIER" <aderumier@xxxxxxxxx>
> Cc: "Sage Weil" <sage@xxxxxxxxxxx>, "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx>
> Sent: Wednesday, 31 October 2012 20:50:36
> Subject: Re: slow fio random read benchmark, need help
>
> Come to think of it, that 15k iops I mentioned was on 10G ethernet with
> NFS. I have tried InfiniBand with IPoIB and TCP; it's similar to 10G
> ethernet.
>
> You will need to get creative. What you're asking for really is to
> have local latencies with remote storage. Just off the top of my
> head, you may look into some way to do local caching on SSD for your
> RBD volume, like bcache or flashcache.
>
> Depending on your application, it may actually be a bonus that no
> single server (or handful of servers) can crush your storage's
> performance. If you only have one or two clients anyway, that may
> not be much consolation, but if you're going to have dozens or more,
> there's not much benefit to having one take all the performance at
> the expense of everyone else, except perhaps in bursts.
>
> At any rate, 5000 iops is not as good as a new SSD, but far better
> than a normal disk. Is there some specific application requirement, or
> is it just that you feel you want the full performance from the VM?
>
> On Wed, Oct 31, 2012 at 12:56 PM, Alexandre DERUMIER
> <aderumier@xxxxxxxxx> wrote:
>> Yes, I think you are right; the round trip with the mon must cut the performance in half.
>>
>> I have just done a test with 2 parallel fio benchmarks from 2 different hosts,
>> and I get 2 x 5000 iops,
>>
>> so it must be related to network latency.
>>
>> I have also done tests with --numjobs 1000; it doesn't help, same results.
>>
>>
>> Do you have an idea how I can get more IO from a single host?
>> Doing LACP with multiple links?
>>
>> I think that 10-gigabit latency is almost the same, so I'm not sure it will improve iops much.
>> Maybe InfiniBand can help?
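One way to measure the switch/network latency Stefan asks about, as a sketch only: the host name below is a placeholder, and qperf must be installed and running as a bare server process on the far end.

  # Average ICMP round-trip time between the KVM host and an OSD node;
  # 1 second divided by the RTT is a rough ceiling for synchronous iops.
  ping -c 100 -q osd-node1

  # TCP request/response latency, closer to what one request/reply costs
  # over the wire (start "qperf" with no arguments on osd-node1 first).
  qperf osd-node1 tcp_lat tcp_bw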
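And a minimal sketch of the "RAID over multiple RBD devices" idea mentioned above. Image names, sizes, chunk size, and the /dev/rbdN numbering are all assumptions for illustration, not anything tested in this thread.

  # Create and map a few images (default "rbd" pool; names/sizes made up).
  for i in 0 1 2 3; do
      rbd create stripe-$i --size 10240      # 10 GB each
      rbd map stripe-$i                      # appears as /dev/rbd<N>
  done

  # Stripe an md RAID-0 across them so more requests are in flight at once.
  mdadm --create /dev/md0 --level=0 --raid-devices=4 --chunk=64 \
        /dev/rbd0 /dev/rbd1 /dev/rbd2 /dev/rbd3

  # Then point the same fio job at /dev/md0 instead of a single RBD device.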
>> ----- Original message -----
>>
>> From: "Marcus Sorensen" <shadowsor@xxxxxxxxx>
>> To: "Alexandre DERUMIER" <aderumier@xxxxxxxxx>
>> Cc: "Sage Weil" <sage@xxxxxxxxxxx>, "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx>
>> Sent: Wednesday, 31 October 2012 18:38:46
>> Subject: Re: slow fio random read benchmark, need help
>>
>> Yes, I was going to say that the most I've ever seen out of gigabit is
>> about 15k iops, with parallel tests and NFS (or iSCSI). Multipathing
>> may not really parallelize the IO for you. It can send an IO down one
>> path, then move to the next path and send the next IO without
>> necessarily waiting for the previous one to respond, but that only
>> shaves a slight amount from your latency in some scenarios, as
>> opposed to sending down all paths simultaneously. I have seen it help
>> with high-latency links.
>>
>> I don't remember the Ceph design that well, but with distributed
>> storage systems you're going to pay a penalty. If you can do 10-15k
>> with one TCP round trip, you'll get half that with the round trip to
>> talk to the metadata server to find your blocks and then to fetch
>> them. Like I said, that might not be exactly what Ceph does, but
>> you're going to have more traffic than with a straight single-attached
>> NFS or iSCSI server.
>>
>> On Wed, Oct 31, 2012 at 11:27 AM, Alexandre DERUMIER
>> <aderumier@xxxxxxxxx> wrote:
>>> Thanks Marcus,
>>>
>>> indeed, gigabit ethernet.
>>>
>>> Note that my iSCSI result (40k) was with multipath, so multiple gigabit links.
>>>
>>> I have also done tests with a NetApp array, with NFS over a single link; I get around 13000 iops.
>>>
>>> I will do more tests with multiple VMs, from different hosts, and with --numjobs.
>>>
>>> I'll keep you posted.
>>>
>>> Thanks for the help,
>>>
>>> Regards,
>>>
>>> Alexandre
>>>
>>>
>>> ----- Original message -----
>>>
>>> From: "Marcus Sorensen" <shadowsor@xxxxxxxxx>
>>> To: "Alexandre DERUMIER" <aderumier@xxxxxxxxx>
>>> Cc: "Sage Weil" <sage@xxxxxxxxxxx>, "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx>
>>> Sent: Wednesday, 31 October 2012 18:08:11
>>> Subject: Re: slow fio random read benchmark, need help
>>>
>>> 5000 is actually really good, if you ask me, assuming everything is
>>> connected via gigabit. If you get 40k iops locally, you add the
>>> latency of TCP, as well as that of the ceph services and the VM layer,
>>> and that's what you get. On my network I get about a .1ms round trip on
>>> gigabit over the same switch, which by definition can only do 10,000
>>> iops. Then if you have storage on the other end capable of 40k iops,
>>> you add the latencies together (.1ms + .025ms) and you're at 8k iops.
>>> Then add the small latency of the application servicing the IO (NFS,
>>> Ceph, etc.) and the latency introduced by your VM layer, and 5k sounds
>>> about right.
>>>
>>> The good news is that you probably aren't taxing the storage; you can
>>> likely run many simultaneous tests from several VMs and get the same
>>> results.
>>>
>>> You can try adding --numjobs to your fio run to parallelize the specific
>>> test you're doing, or launch a second VM and do the same test at
>>> the same time. That would be a good indicator of whether it's latency.
>>>
>>> On Wed, Oct 31, 2012 at 10:29 AM, Alexandre DERUMIER
>>> <aderumier@xxxxxxxxx> wrote:
>>>>>> Have you tried increasing the iodepth?
>>>> Yes, I have tried with 100 and 200, same results.
>>>>
>>>> I have also tried directly from the host, with /dev/rbd1, and I get the same result.
>>>> I have also tried with 3 different hosts, with different CPU models.
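Marcus's figures follow directly from the latencies. As a quick back-of-the-envelope check (the snippet is only illustrative; the numbers are the ones from his mail):

  # At queue depth 1, iops is bounded by 1 / (total latency per request).
  awk 'BEGIN {
      net_rtt = 0.000100;    # ~0.1 ms gigabit round trip
      backend = 0.000025;    # ~0.025 ms backend service time (40k iops)
      printf "network only  : %d iops\n", 1 / net_rtt;              # 10000
      printf "net + backend : %d iops\n", 1 / (net_rtt + backend);  #  8000
  }'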
>>>>
>>>> (Note: I can reach around 40,000 iops with the same fio config on a ZFS iSCSI array.)
>>>>
>>>> My test ceph cluster nodes have old CPUs (Xeon E5420), but they are at around 10% usage, so I think that's OK.
>>>>
>>>> Do you have an idea of something I can trace?
>>>>
>>>> Thanks,
>>>>
>>>> Alexandre
>>>>
>>>> ----- Original message -----
>>>>
>>>> From: "Sage Weil" <sage@xxxxxxxxxxx>
>>>> To: "Alexandre DERUMIER" <aderumier@xxxxxxxxx>
>>>> Cc: "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx>
>>>> Sent: Wednesday, 31 October 2012 16:57:05
>>>> Subject: Re: slow fio random read benchmark, need help
>>>>
>>>> On Wed, 31 Oct 2012, Alexandre DERUMIER wrote:
>>>>> Hello,
>>>>>
>>>>> I'm doing some tests with fio from a qemu 1.2 guest (virtio disk, cache=none), randread, with a 4K block size on a small 1G file (so it can be handled by the buffer cache on the ceph cluster).
>>>>>
>>>>> fio --filename=/dev/vdb -rw=randread --bs=4K --size=1000M --iodepth=40 --group_reporting --name=file1 --ioengine=libaio --direct=1
>>>>>
>>>>> I can't get more than 5000 iops.
>>>>
>>>> Have you tried increasing the iodepth?
>>>>
>>>> sage
>>>>
>>>>> RBD cluster is:
>>>>> ---------------
>>>>> 3 nodes, with each node:
>>>>> - 6 x OSD 15k drives (xfs), journal on tmpfs, 1 mon
>>>>> - cpu: 2x 4-core Intel Xeon E5420 @ 2.5GHz
>>>>> rbd 0.53
>>>>>
>>>>> ceph.conf
>>>>>
>>>>> journal dio = false
>>>>> filestore fiemap = false
>>>>> filestore flusher = false
>>>>> osd op threads = 24
>>>>> osd disk threads = 24
>>>>> filestore op threads = 6
>>>>>
>>>>> kvm host is: 4 x 12-core Opteron
>>>>> ------------
>>>>>
>>>>> During the bench:
>>>>>
>>>>> on the ceph nodes:
>>>>> - cpu is around 10% used
>>>>> - iostat shows no disk activity on the OSDs (so I think the 1G file is handled by the Linux buffer cache)
>>>>>
>>>>> on the kvm host:
>>>>> - cpu is around 20% used
>>>>>
>>>>> I really don't see where the bottleneck is....
>>>>>
>>>>> Any ideas, hints?
>>>>>
>>>>> Regards,
>>>>>
>>>>> Alexandre
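On the question of what to trace: one hedged suggestion, not something tried in this thread, is to isolate the per-request latency with a queue-depth-1 run against the mapped RBD device and read fio's completion-latency (clat) output. /dev/rbd1 is the device mentioned earlier in the thread; adjust as needed.

  # If the average clat is close to the network round trip plus the OSD
  # service time, the ~5000 iops ceiling is latency-bound rather than a
  # throughput problem.
  fio --filename=/dev/rbd1 --rw=randread --bs=4K --size=1000M \
      --iodepth=1 --numjobs=1 --runtime=30 --time_based \
      --ioengine=libaio --direct=1 --group_reporting --name=latency-probe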