On 01.11.2012 at 06:11, Alexandre DERUMIER <aderumier@xxxxxxxxx> wrote:
>>> Come to think of it, that 15k iops I mentioned was on 10G ethernet with
>>> NFS. I have tried InfiniBand with IPoIB and TCP; it's similar to 10G
>>> ethernet.
>
> I have seen new Arista 10GbE switches with latency around 1 microsecond; that seems good enough for the job.

Pretty interesting. How can I measure switch / network latency?

Stefan

>
>>> You will need to get creative. What you're asking for really is to
>>> have local latencies with remote storage. Just off the top of my
>>> head, you may look into some way to do local caching on SSD for your
>>> RBD volume, like bcache or flashcache.
>
> I have already thought about it. (But I would like to use qemu-rbd if possible.)
>
>>> At any rate, 5000 iops is not as good as a new SSD, but far better
>>> than a normal disk. Is there some specific application requirement, or
>>> is it just that you feel you want the full performance from the VM?
>
> I have some customers with huge databases (too big to be handled by the buffer cache) that require a lot of IOs (around 10k).
>
> I have redone the tests with 4 guests in parallel and I get 4 x 5000 iops, so it does seem to scale! (And CPU usage is very low on the ceph cluster.)
>
> So I'll try some tricks, like RAID over multiple RBD devices; maybe that will help.
>
> Thanks again for the help, Marcus; I was not aware of these latency problems.
>
> Regards,
>
> Alexandre
>
>
> ----- Original message -----
>
> From: "Marcus Sorensen" <shadowsor@xxxxxxxxx>
> To: "Alexandre DERUMIER" <aderumier@xxxxxxxxx>
> Cc: "Sage Weil" <sage@xxxxxxxxxxx>, "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx>
> Sent: Wednesday, 31 October 2012 20:50:36
> Subject: Re: slow fio random read benchmark, need help
>
> Come to think of it, that 15k iops I mentioned was on 10G ethernet with
> NFS. I have tried InfiniBand with IPoIB and TCP; it's similar to 10G
> ethernet.
>
> You will need to get creative. What you're asking for really is to
> have local latencies with remote storage. Just off the top of my
> head, you may look into some way to do local caching on SSD for your
> RBD volume, like bcache or flashcache.
>
> Depending on your application, it may actually be a bonus that no
> single server (or handful of servers) can crush your storage's
> performance. If you only have one or two clients anyway, that may
> not be much consolation, but if you're going to have dozens or more,
> there's not much benefit to having one take all the performance at
> the expense of everyone else, except perhaps in bursts.
>
> At any rate, 5000 iops is not as good as a new SSD, but far better
> than a normal disk. Is there some specific application requirement, or
> is it just that you feel you want the full performance from the VM?
>
> On Wed, Oct 31, 2012 at 12:56 PM, Alexandre DERUMIER
> <aderumier@xxxxxxxxx> wrote:
>> Yes, I think you are right; the round trip with the mon must cut the performance in half.
>>
>> I have just done a test with 2 parallel fio benchmarks from 2 different hosts,
>> and I get 2 x 5000 iops,
>>
>> so it must be related to network latency.
>>
>> I have also done tests with --numjobs 1000; it doesn't help, same results.
>>
>>
>> Do you have an idea how I can get more IO from a single host?
>> Doing LACP with multiple links?
>>
>> I think that 10-gigabit latency is almost the same, so I'm not sure it will improve iops much.
>> Maybe InfiniBand can help?
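One way to measure the switch/network latency Stefan asks about, as a sketch only: the host name below is a placeholder, and qperf must be installed and running as a bare server process on the far end.

  # Average ICMP round-trip time between the KVM host and an OSD node;
  # 1 second divided by the RTT is a rough ceiling for synchronous iops.
  ping -c 100 -q osd-node1

  # TCP request/response latency, closer to what one request/reply costs
  # over the wire (start "qperf" with no arguments on osd-node1 first).
  qperf osd-node1 tcp_lat tcp_bw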
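And a minimal sketch of the "RAID over multiple RBD devices" idea mentioned above. Image names, sizes, chunk size, and the /dev/rbdN numbering are all assumptions for illustration, not anything tested in this thread.

  # Create and map a few images (default "rbd" pool; names/sizes made up).
  for i in 0 1 2 3; do
      rbd create stripe-$i --size 10240      # 10 GB each
      rbd map stripe-$i                      # appears as /dev/rbd<N>
  done

  # Stripe an md RAID-0 across them so more requests are in flight at once.
  mdadm --create /dev/md0 --level=0 --raid-devices=4 --chunk=64 \
        /dev/rbd0 /dev/rbd1 /dev/rbd2 /dev/rbd3

  # Then point the same fio job at /dev/md0 instead of a single RBD device.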
>> ----- Original message -----
>>
>> From: "Marcus Sorensen" <shadowsor@xxxxxxxxx>
>> To: "Alexandre DERUMIER" <aderumier@xxxxxxxxx>
>> Cc: "Sage Weil" <sage@xxxxxxxxxxx>, "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx>
>> Sent: Wednesday, 31 October 2012 18:38:46
>> Subject: Re: slow fio random read benchmark, need help
>>
>> Yes, I was going to say that the most I've ever seen out of gigabit is
>> about 15k iops, with parallel tests and NFS (or iSCSI). Multipathing
>> may not really parallelize the IO for you. It can send an IO down one
>> path, then move to the next path and send the next IO without
>> necessarily waiting for the previous one to respond, but that only
>> shaves a slight amount from your latency in some scenarios, as
>> opposed to sending down all paths simultaneously. I have seen it help
>> with high-latency links.
>>
>> I don't remember the Ceph design that well, but with distributed
>> storage systems you're going to pay a penalty. If you can do 10-15k
>> with one TCP round trip, you'll get half that with the round trip to
>> talk to the metadata server to find your blocks and then to fetch
>> them. Like I said, that might not be exactly what Ceph does, but
>> you're going to have more traffic than with a straight single-attached
>> NFS or iSCSI server.
>>
>> On Wed, Oct 31, 2012 at 11:27 AM, Alexandre DERUMIER
>> <aderumier@xxxxxxxxx> wrote:
>>> Thanks Marcus,
>>>
>>> indeed, gigabit ethernet.
>>>
>>> Note that my iSCSI result (40k) was with multipath, so multiple gigabit links.
>>>
>>> I have also done tests with a NetApp array, with NFS over a single link; I get around 13000 iops.
>>>
>>> I will do more tests with multiple VMs, from different hosts, and with --numjobs.
>>>
>>> I'll keep you posted.
>>>
>>> Thanks for the help,
>>>
>>> Regards,
>>>
>>> Alexandre
>>>
>>>
>>> ----- Original message -----
>>>
>>> From: "Marcus Sorensen" <shadowsor@xxxxxxxxx>
>>> To: "Alexandre DERUMIER" <aderumier@xxxxxxxxx>
>>> Cc: "Sage Weil" <sage@xxxxxxxxxxx>, "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx>
>>> Sent: Wednesday, 31 October 2012 18:08:11
>>> Subject: Re: slow fio random read benchmark, need help
>>>
>>> 5000 is actually really good, if you ask me, assuming everything is
>>> connected via gigabit. If you get 40k iops locally, you add the
>>> latency of TCP, as well as that of the ceph services and the VM layer,
>>> and that's what you get. On my network I get about a .1ms round trip on
>>> gigabit over the same switch, which by definition can only do 10,000
>>> iops. Then if you have storage on the other end capable of 40k iops,
>>> you add the latencies together (.1ms + .025ms) and you're at 8k iops.
>>> Then add the small latency of the application servicing the IO (NFS,
>>> Ceph, etc.) and the latency introduced by your VM layer, and 5k sounds
>>> about right.
>>>
>>> The good news is that you probably aren't taxing the storage; you can
>>> likely run many simultaneous tests from several VMs and get the same
>>> results.
>>>
>>> You can try adding --numjobs to your fio run to parallelize the specific
>>> test you're doing, or launch a second VM and do the same test at
>>> the same time. That would be a good indicator of whether it's latency.
>>>
>>> On Wed, Oct 31, 2012 at 10:29 AM, Alexandre DERUMIER
>>> <aderumier@xxxxxxxxx> wrote:
>>>>>> Have you tried increasing the iodepth?
>>>> Yes, I have tried with 100 and 200, same results.
>>>>
>>>> I have also tried directly from the host, with /dev/rbd1, and I get the same result.
>>>> I have also tried with 3 different hosts, with different CPU models.
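Marcus's figures follow directly from the latencies. As a quick back-of-the-envelope check (the snippet is only illustrative; the numbers are the ones from his mail):

  # At queue depth 1, iops is bounded by 1 / (total latency per request).
  awk 'BEGIN {
      net_rtt = 0.000100;    # ~0.1 ms gigabit round trip
      backend = 0.000025;    # ~0.025 ms backend service time (40k iops)
      printf "network only  : %d iops\n", 1 / net_rtt;              # 10000
      printf "net + backend : %d iops\n", 1 / (net_rtt + backend);  #  8000
  }'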
>>>>
>>>> (Note: I can reach around 40,000 iops with the same fio config on a ZFS iSCSI array.)
>>>>
>>>> My test ceph cluster nodes have old CPUs (Xeon E5420), but they are at around 10% usage, so I think that's OK.
>>>>
>>>> Do you have an idea of something I can trace?
>>>>
>>>> Thanks,
>>>>
>>>> Alexandre
>>>>
>>>> ----- Original message -----
>>>>
>>>> From: "Sage Weil" <sage@xxxxxxxxxxx>
>>>> To: "Alexandre DERUMIER" <aderumier@xxxxxxxxx>
>>>> Cc: "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx>
>>>> Sent: Wednesday, 31 October 2012 16:57:05
>>>> Subject: Re: slow fio random read benchmark, need help
>>>>
>>>> On Wed, 31 Oct 2012, Alexandre DERUMIER wrote:
>>>>> Hello,
>>>>>
>>>>> I'm doing some tests with fio from a qemu 1.2 guest (virtio disk, cache=none), randread, with a 4K block size on a small 1G file (so it can be handled by the buffer cache on the ceph cluster).
>>>>>
>>>>> fio --filename=/dev/vdb -rw=randread --bs=4K --size=1000M --iodepth=40 --group_reporting --name=file1 --ioengine=libaio --direct=1
>>>>>
>>>>> I can't get more than 5000 iops.
>>>>
>>>> Have you tried increasing the iodepth?
>>>>
>>>> sage
>>>>
>>>>> RBD cluster is:
>>>>> ---------------
>>>>> 3 nodes, with each node:
>>>>> - 6 x OSD 15k drives (xfs), journal on tmpfs, 1 mon
>>>>> - cpu: 2x 4-core Intel Xeon E5420 @ 2.5GHz
>>>>> rbd 0.53
>>>>>
>>>>> ceph.conf
>>>>>
>>>>> journal dio = false
>>>>> filestore fiemap = false
>>>>> filestore flusher = false
>>>>> osd op threads = 24
>>>>> osd disk threads = 24
>>>>> filestore op threads = 6
>>>>>
>>>>> kvm host is: 4 x 12-core Opteron
>>>>> ------------
>>>>>
>>>>> During the bench:
>>>>>
>>>>> on the ceph nodes:
>>>>> - cpu is around 10% used
>>>>> - iostat shows no disk activity on the OSDs (so I think the 1G file is handled by the Linux buffer cache)
>>>>>
>>>>> on the kvm host:
>>>>> - cpu is around 20% used
>>>>>
>>>>> I really don't see where the bottleneck is....
>>>>>
>>>>> Any ideas, hints?
>>>>>
>>>>> Regards,
>>>>>
>>>>> Alexandre
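On the question of what to trace: one hedged suggestion, not something tried in this thread, is to isolate the per-request latency with a queue-depth-1 run against the mapped RBD device and read fio's completion-latency (clat) output. /dev/rbd1 is the device mentioned earlier in the thread; adjust as needed.

  # If the average clat is close to the network round trip plus the OSD
  # service time, the ~5000 iops ceiling is latency-bound rather than a
  # throughput problem.
  fio --filename=/dev/rbd1 --rw=randread --bs=4K --size=1000M \
      --iodepth=1 --numjobs=1 --runtime=30 --time_based \
      --ioengine=libaio --direct=1 --group_reporting --name=latency-probe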