(CCing the list) So, the problem isn't the bandwidth. Before we respond to the client, we write the operation to the journal. In this case, that operation is taking >1s per operation on osd.1. Both rbd and rados bench will only allow a limited number of ops in flight at a time, so this latency is killing your throughput. For comparison, the latency for writing to the journal on osd.0 is < .3s. Can you measure direct io latency for writes to your osd.1 journal file? -Sam On Wed, Mar 21, 2012 at 1:56 PM, Andrey Korolyov <andrey@xxxxxxx> wrote: > Oh, you may confuse with Zabbix metrics - y-axis means Megabytes/s, > not Megabits. > > On Thu, Mar 22, 2012 at 12:53 AM, Andrey Korolyov <andrey@xxxxxxx> wrote: >> [global] >> log dir = /ceph/out >> log_file = "" >> logger dir = /ceph/log >> pid file = /ceph/out/$type$id.pid >> [mds] >> pid file = /ceph/out/$name.pid >> lockdep = 1 >> mds log max segments = 2 >> [osd] >> lockdep = 1 >> filestore_xattr_use_omap = 1 >> osd data = /ceph/dev/osd$id >> osd journal = /ceph/meta/journal >> osd journal size = 100 >> [mon] >> lockdep = 1 >> mon data = /ceph/dev/mon$id >> [mon.0] >> host = 172.20.1.32 >> mon addr = 172.20.1.32:6789 >> [mon.1] >> host = 172.20.1.33 >> mon addr = 172.20.1.33:6789 >> [mon.2] >> host = 172.20.1.35 >> mon addr = 172.20.1.35:6789 >> [osd.0] >> host = 172.20.1.32 >> [osd.1] >> host = 172.20.1.33 >> [mds.a] >> host = 172.20.1.32 >> >> /dev/sda1 on /ceph type ext4 (rw,barrier=0,user_xattr) >> /dev/mapper/system-cephmeta on /ceph/meta type ext4 (rw,barrier=0,user_xattr) >> Simple performance tests on those fs shows ~133Mb/s for /ceph and >> metadata/. Also both machines do not hold anything else which may >> impact osd. >> >> Also please note of following: >> >> http://i.imgur.com/ZgFdO.png >> >> First two peaks are related to running rados bench, then goes cluster >> recreation, automated debian install and final peaks are dd test. >> Surely I can have more precise graphs, but current one probably enough >> to state a situation - rbd utilizing about a quarter of possible >> bandwidth(if we can count rados bench as 100%). >> >> On Thu, Mar 22, 2012 at 12:39 AM, Samuel Just <sam.just@xxxxxxxxxxxxx> wrote: >>> Hmm, there seem to be writes taking as long as 1.5s to hit journal on >>> osd.1... Could you post your ceph.conf? Might there be a problem >>> with the osd.1 journal disk? >>> -Sam >>> >>> On Wed, Mar 21, 2012 at 1:25 PM, Andrey Korolyov <andrey@xxxxxxx> wrote: >>>> Oh, sorry - they probably inherited rights from log files, fixed. >>>> >>>> On Thu, Mar 22, 2012 at 12:17 AM, Samuel Just <sam.just@xxxxxxxxxxxxx> wrote: >>>>> I get 403 Forbidden when I try to download any of the files. >>>>> -Sam >>>>> >>>>> On Wed, Mar 21, 2012 at 11:51 AM, Andrey Korolyov <andrey@xxxxxxx> wrote: >>>>>> http://xdel.ru/downloads/ceph-logs/ >>>>>> >>>>>> 1/ contains logs related to bench initiated at the osd0 machine and 2/ >>>>>> - at osd1. >>>>>> >>>>>> On Wed, Mar 21, 2012 at 8:54 PM, Samuel Just <sam.just@xxxxxxxxxxxxx> wrote: >>>>>>> Hmm, I'm seeing some very high latency on ops sent to osd.1. Can you >>>>>>> post osd.1's logs? >>>>>>> -Sam >>>>>>> >>>>>>> On Wed, Mar 21, 2012 at 3:51 AM, Andrey Korolyov <andrey@xxxxxxx> wrote: >>>>>>>> Here, please: http://xdel.ru/downloads/ceph.log.gz >>>>>>>> >>>>>>>> Sometimes 'cur MB/s ' shows zero during rados bench, even if any debug >>>>>>>> output disabled and log_file set to the empty value, hope it`s okay. 
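For the direct-I/O latency check requested at the top of this message, a rough sketch for the osd.1 host - the test file name is made up, and it sits on the journal filesystem (/ceph/meta per the conf and mount lines quoted above) rather than being the live journal file, so it exercises the same device without touching the journal itself:

dd if=/dev/zero of=/ceph/meta/journal-lat-test bs=4k count=1000 oflag=direct,dsync
rm -f /ceph/meta/journal-lat-test

dd only prints totals, so divide the elapsed time by the write count for an average per-write latency; repeating with bs=4M roughly mirrors the 4 MB objects rados bench writes. If fio is available it reports latency percentiles directly for the same pattern:

fio --name=journal-lat --filename=/ceph/meta/journal-lat-test --size=100m --rw=write --bs=4k --direct=1 --sync=1 --ioengine=sync --runtime=30 --time_based

If osd.1 comes out an order of magnitude slower than osd.0 here, the problem is the disk/LVM/ext4 stack under /ceph/meta rather than ceph itself.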
>>>>>>>> >>>>>>>> On Wed, Mar 21, 2012 at 2:36 AM, Samuel Just <sam.just@xxxxxxxxxxxxx> wrote: >>>>>>>>> Can you set osd and filestore debugging to 20, restart the osds, run >>>>>>>>> rados bench as before, and post the logs? >>>>>>>>> -Sam Just >>>>>>>>> >>>>>>>>> On Tue, Mar 20, 2012 at 1:37 PM, Andrey Korolyov <andrey@xxxxxxx> wrote: >>>>>>>>>> rados bench 60 write -p data >>>>>>>>>> <skip> >>>>>>>>>> Total time run: 61.217676 >>>>>>>>>> Total writes made: 989 >>>>>>>>>> Write size: 4194304 >>>>>>>>>> Bandwidth (MB/sec): 64.622 >>>>>>>>>> >>>>>>>>>> Average Latency: 0.989608 >>>>>>>>>> Max latency: 2.21701 >>>>>>>>>> Min latency: 0.255315 >>>>>>>>>> >>>>>>>>>> Here a snip from osd log, seems write size is okay. >>>>>>>>>> >>>>>>>>>> 2012-03-21 00:00:39.397066 7fdda86a7700 osd.0 10 pg[0.58( v 10'83 >>>>>>>>>> (0'0,10'83] n=50 ec=1 les/c 9/9 8/8/6) [0,1] r=0 lpr=8 mlcod 10'82 >>>>>>>>>> active+clean] removing repgather(0x31b5360 applying 10'83 rep_tid=597 >>>>>>>>>> wfack= wfdisk= op=osd_op(client.4599.0:2533 rb.0.2.000000000040 [write >>>>>>>>>> 1220608~4096] 0.17eb9fd8) v4) >>>>>>>>>> 2012-03-21 00:00:39.397086 7fdda86a7700 osd.0 10 pg[0.58( v 10'83 >>>>>>>>>> (0'0,10'83] n=50 ec=1 les/c 9/9 8/8/6) [0,1] r=0 lpr=8 mlcod 10'82 >>>>>>>>>> active+clean] q front is repgather(0x31b5360 applying 10'83 >>>>>>>>>> rep_tid=597 wfack= wfdisk= op=osd_op(client.4599.0:2533 >>>>>>>>>> rb.0.2.000000000040 [write 1220608~4096] 0.17eb9fd8) v4) >>>>>>>>>> >>>>>>>>>> Sorry for my previous question about rbd chunks, it was really stupid :) >>>>>>>>>> >>>>>>>>>> On Mon, Mar 19, 2012 at 10:40 PM, Josh Durgin <josh.durgin@xxxxxxxxxxxxx> wrote: >>>>>>>>>>> On 03/19/2012 11:13 AM, Andrey Korolyov wrote: >>>>>>>>>>>> >>>>>>>>>>>> Nope, I`m using KVM for rbd guests. Surely I`ve been noticed that Sage >>>>>>>>>>>> mentioned too small value and I`ve changed it to 64M before posting >>>>>>>>>>>> previous message with no success - both 8M and this value cause a >>>>>>>>>>>> performance drop. When I tried to wrote small amount of data that can >>>>>>>>>>>> be compared to writeback cache size(both on raw device and ext3 with >>>>>>>>>>>> sync option), following results were made: >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> I just want to clarify that the writeback window isn't a full writeback >>>>>>>>>>> cache - it doesn't affect reads, and does not help with request merging etc. >>>>>>>>>>> It simply allows a bunch of writes to be in flight while acking the write to >>>>>>>>>>> the guest immediately. We're working on a full-fledged writeback cache that >>>>>>>>>>> to replace the writeback window. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>> dd if=/dev/zero of=/var/img.1 bs=10M count=10 oflag=direct (almost >>>>>>>>>>>> same without oflag there and in the following samples) >>>>>>>>>>>> 10+0 records in >>>>>>>>>>>> 10+0 records out >>>>>>>>>>>> 104857600 bytes (105 MB) copied, 0.864404 s, 121 MB/s >>>>>>>>>>>> dd if=/dev/zero of=/var/img.1 bs=10M count=20 oflag=direct >>>>>>>>>>>> 20+0 records in >>>>>>>>>>>> 20+0 records out >>>>>>>>>>>> 209715200 bytes (210 MB) copied, 6.67271 s, 31.4 MB/s >>>>>>>>>>>> dd if=/dev/zero of=/var/img.1 bs=10M count=30 oflag=direct >>>>>>>>>>>> 30+0 records in >>>>>>>>>>>> 30+0 records out >>>>>>>>>>>> 314572800 bytes (315 MB) copied, 12.4806 s, 25.2 MB/s >>>>>>>>>>>> >>>>>>>>>>>> and so on. Reference test with bs=1M and count=2000 has slightly worse >>>>>>>>>>>> results _with_ writeback cache than without, as I`ve mentioned before. 
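For reference, the osd/filestore debug bump Sam asked for earlier in the thread is just a couple of ceph.conf settings in the [osd] section, picked up once the OSDs are restarted; debug ms = 1 is the messenger logging Josh mentions below for watching individual osd_op writes. A sketch against the conf posted above:

[osd]
debug osd = 20
debug filestore = 20
debug ms = 1

Logs grow very quickly at these levels, so it is worth dropping them back down once the bench run has been captured.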
>>>>>>>>>>>> Here the bench results, they`re almost equal on both nodes: >>>>>>>>>>>> >>>>>>>>>>>> bench: wrote 1024 MB in blocks of 4096 KB in 9.037468 sec at 113 MB/sec >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> One thing to check is the size of the writes that are actually being sent by >>>>>>>>>>> rbd. The guest is probably splitting them into relatively small (128 or >>>>>>>>>>> 256k) writes. Ideally it would be sending 4k writes, and this should be a >>>>>>>>>>> lot faster. >>>>>>>>>>> >>>>>>>>>>> You can see the writes being sent by adding debug_ms=1 to the client or osd. >>>>>>>>>>> The format is osd_op(.*[write OFFSET~LENGTH]). >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>> Also, because I`ve not mentioned it before, network performance is >>>>>>>>>>>> enough to hold fair gigabit connectivity with MTU 1500. Seems that it >>>>>>>>>>>> is not interrupt problem or something like it - even if ceph-osd, >>>>>>>>>>>> ethernet card queues and kvm instance pinned to different sets of >>>>>>>>>>>> cores, nothing changes. >>>>>>>>>>>> >>>>>>>>>>>> On Mon, Mar 19, 2012 at 8:59 PM, Greg Farnum >>>>>>>>>>>> <gregory.farnum@xxxxxxxxxxxxx> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>> It sounds like maybe you're using Xen? The "rbd writeback window" option >>>>>>>>>>>>> only works for userspace rbd implementations (eg, KVM). >>>>>>>>>>>>> If you are using KVM, you probably want 81920000 (~80MB) rather than >>>>>>>>>>>>> 8192000 (~8MB). >>>>>>>>>>>>> >>>>>>>>>>>>> What options are you running dd with? If you run a rados bench from both >>>>>>>>>>>>> machines, what do the results look like? >>>>>>>>>>>>> Also, can you do the ceph osd bench on each of your OSDs, please? >>>>>>>>>>>>> (http://ceph.newdream.net/wiki/Troubleshooting#OSD_performance) >>>>>>>>>>>>> -Greg >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On Monday, March 19, 2012 at 6:46 AM, Andrey Korolyov wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> More strangely, writing speed drops down by fifteen percent when this >>>>>>>>>>>>>> option was set in vm` config(instead of result from >>>>>>>>>>>>>> http://www.mail-archive.com/ceph-devel@xxxxxxxxxxxxxxx/msg03685.html). >>>>>>>>>>>>>> As I mentioned, I`m using 0.43, but due to crashed osds, ceph has been >>>>>>>>>>>>>> recompiled with e43546dee9246773ffd6877b4f9495f1ec61cd55 and >>>>>>>>>>>>>> 1468d95101adfad44247016a1399aab6b86708d2 - both cases caused crashes >>>>>>>>>>>>>> under heavy load. >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Sun, Mar 18, 2012 at 10:22 PM, Sage Weil<sage@xxxxxxxxxxxx >>>>>>>>>>>>>> (mailto:sage@xxxxxxxxxxxx)> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Sat, 17 Mar 2012, Andrey Korolyov wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Hi, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I`ve did some performance tests at the following configuration: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> mon0, osd0 and mon1, osd1 - two twelve-core r410 with 32G ram, mon2 - >>>>>>>>>>>>>>>> dom0 with three dedicated cores and 1.5G, mostly idle. First three >>>>>>>>>>>>>>>> disks on each r410 arranged into raid0 and holds osd data when fourth >>>>>>>>>>>>>>>> holds os and osd` journal partition, all ceph-related stuff mounted on >>>>>>>>>>>>>>>> the ext4 without barriers. 
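Once debug ms = 1 (or the osd debugging above) is in place, the write sizes Josh describes can be summarised straight from the log; the path here is only a guess based on the "log dir = /ceph/out" setting in the posted conf:

grep -oE 'write [0-9]+~[0-9]+' /ceph/out/osd.1.log | awk -F'~' '{print $2}' | sort -n | uniq -c

The first column is the number of requests seen at each size, which makes it obvious whether the guest is really sending the small 128-256k writes Josh suspects, or something larger or smaller.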
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Firstly, I`ve noticed about a difference of benchmark performance and
>>>>>>>>>>>>>>>> write speed through rbd from small kvm instance running on one of
>>>>>>>>>>>>>>>> first two machines - when bench gave me about 110Mb/s, writing zeros
>>>>>>>>>>>>>>>> to raw block device inside vm with dd was at top speed about 45 mb/s,
>>>>>>>>>>>>>>>> for vm`fs (ext4 with default options) performance drops to ~23Mb/s.
>>>>>>>>>>>>>>>> Things get worse, when I`ve started second vm at second host and tried
>>>>>>>>>>>>>>>> to continue same dd tests simultaneously - performance fairly divided
>>>>>>>>>>>>>>>> by half for each instance :). Enabling jumbo frames, playing with cpu
>>>>>>>>>>>>>>>> affinity for ceph and vm instances and trying different TCP congestion
>>>>>>>>>>>>>>>> protocols gave no effect at all - with DCTCP I have slightly smoother
>>>>>>>>>>>>>>>> network load graph and that`s all.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Can ml please suggest anything to try to improve performance?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Can you try setting
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> rbd writeback window = 8192000
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> or similar, and see what kind of effect that has? I suspect it'll speed
>>>>>>>>>>>>>>> up dd; I'm less sure about ext3.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>> sage
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ceph-0.43, libvirt-0.9.8, qemu-1.0.0, kernel 3.2
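To tie Sage's and Greg's suggestions above to concrete settings: the writeback window is a client-side (librbd) option, so one plausible place for it is a [client] section of the ceph.conf the KVM host reads, using the ~80MB value Greg recommends for KVM - a sketch, not a verified config:

[client]
rbd writeback window = 81920000

The per-OSD benchmark from the troubleshooting wiki can be kicked off with something like

ceph osd tell 0 bench
ceph osd tell 1 bench

and the results land in the cluster log in the same "bench: wrote 1024 MB in blocks of 4096 KB ..." form Andrey already posted, which makes it easy to compare the two OSDs' disks directly.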