random-rw: (g=0): rw=write, bs=4K-4K/4K-4K, ioengine=sync, iodepth=2 Starting 1 process Jobs: 1 (f=1): [W] [100.0% done] [0K/35737K /s] [0/8725 iops] [eta 00m:00s] random-rw: (groupid=0, jobs=1): err= 0: pid=9647 write: io=163840KB, bw=37760KB/s, iops=9439, runt= 4339msec clat (usec): min=70, max=39801, avg=104.19, stdev=324.29 bw (KB/s) : min=30480, max=43312, per=98.83%, avg=37317.00, stdev=5770.28 cpu : usr=1.84%, sys=13.00%, ctx=40961, majf=0, minf=26 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued r/w: total=0/40960, short=0/0 lat (usec): 100=79.69%, 250=19.89%, 500=0.12%, 750=0.12%, 1000=0.11% lat (msec): 2=0.01%, 4=0.01%, 10=0.03%, 20=0.01%, 50=0.01% On Thu, Mar 22, 2012 at 9:26 PM, Samuel Just <sam.just@xxxxxxxxxxxxx> wrote: > Our journal writes are actually sequential. Could you send FIO > results for sequential 4k writes osd.0's journal and osd.1's journal? > -Sam > > On Thu, Mar 22, 2012 at 5:21 AM, Andrey Korolyov <andrey@xxxxxxx> wrote: >> FIO output for journal partition, directio enabled, seems good(same >> results for ext4 on other single sata disks). >> >> random-rw: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=sync, iodepth=2 >> Starting 1 process >> Jobs: 1 (f=1): [w] [100.0% done] [0K/3219K /s] [0/786 iops] [eta 00m:00s] >> random-rw: (groupid=0, jobs=1): err= 0: pid=21926 >> write: io=163840KB, bw=2327KB/s, iops=581, runt= 70403msec >> clat (usec): min=122, max=441551, avg=1714.52, stdev=7565.04 >> bw (KB/s) : min= 552, max= 3880, per=100.61%, avg=2341.23, stdev=480.05 >> cpu : usr=0.42%, sys=1.34%, ctx=40976, majf=0, minf=42 >> IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% >> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% >> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% >> issued r/w: total=0/40960, short=0/0 >> lat (usec): 250=31.70%, 500=0.68%, 750=0.10%, 1000=0.63% >> lat (msec): 2=41.31%, 4=20.91%, 10=4.40%, 20=0.17%, 50=0.07% >> lat (msec): 500=0.04% >> >> >> >> On Thu, Mar 22, 2012 at 1:20 AM, Samuel Just <sam.just@xxxxxxxxxxxxx> wrote: >>> (CCing the list) >>> >>> So, the problem isn't the bandwidth. Before we respond to the client, >>> we write the operation to the journal. In this case, that operation >>> is taking >1s per operation on osd.1. Both rbd and rados bench will >>> only allow a limited number of ops in flight at a time, so this >>> latency is killing your throughput. For comparison, the latency for >>> writing to the journal on osd.0 is < .3s. Can you measure direct io >>> latency for writes to your osd.1 journal file? >>> -Sam >>> >>> On Wed, Mar 21, 2012 at 1:56 PM, Andrey Korolyov <andrey@xxxxxxx> wrote: >>>> Oh, you may confuse with Zabbix metrics - y-axis means Megabytes/s, >>>> not Megabits. >>>> >>>> On Thu, Mar 22, 2012 at 12:53 AM, Andrey Korolyov <andrey@xxxxxxx> wrote: >>>>> [global] >>>>> log dir = /ceph/out >>>>> log_file = "" >>>>> logger dir = /ceph/log >>>>> pid file = /ceph/out/$type$id.pid >>>>> [mds] >>>>> pid file = /ceph/out/$name.pid >>>>> lockdep = 1 >>>>> mds log max segments = 2 >>>>> [osd] >>>>> lockdep = 1 >>>>> filestore_xattr_use_omap = 1 >>>>> osd data = /ceph/dev/osd$id >>>>> osd journal = /ceph/meta/journal >>>>> osd journal size = 100 >>>>> [mon] >>>>> lockdep = 1 >>>>> mon data = /ceph/dev/mon$id >>>>> [mon.0] >>>>> host = 172.20.1.32 >>>>> mon addr = 172.20.1.32:6789 >>>>> [mon.1] >>>>> host = 172.20.1.33 >>>>> mon addr = 172.20.1.33:6789 >>>>> [mon.2] >>>>> host = 172.20.1.35 >>>>> mon addr = 172.20.1.35:6789 >>>>> [osd.0] >>>>> host = 172.20.1.32 >>>>> [osd.1] >>>>> host = 172.20.1.33 >>>>> [mds.a] >>>>> host = 172.20.1.32 >>>>> >>>>> /dev/sda1 on /ceph type ext4 (rw,barrier=0,user_xattr) >>>>> /dev/mapper/system-cephmeta on /ceph/meta type ext4 (rw,barrier=0,user_xattr) >>>>> Simple performance tests on those fs shows ~133Mb/s for /ceph and >>>>> metadata/. Also both machines do not hold anything else which may >>>>> impact osd. >>>>> >>>>> Also please note of following: >>>>> >>>>> http://i.imgur.com/ZgFdO.png >>>>> >>>>> First two peaks are related to running rados bench, then goes cluster >>>>> recreation, automated debian install and final peaks are dd test. >>>>> Surely I can have more precise graphs, but current one probably enough >>>>> to state a situation - rbd utilizing about a quarter of possible >>>>> bandwidth(if we can count rados bench as 100%). >>>>> >>>>> On Thu, Mar 22, 2012 at 12:39 AM, Samuel Just <sam.just@xxxxxxxxxxxxx> wrote: >>>>>> Hmm, there seem to be writes taking as long as 1.5s to hit journal on >>>>>> osd.1... Could you post your ceph.conf? Might there be a problem >>>>>> with the osd.1 journal disk? >>>>>> -Sam >>>>>> >>>>>> On Wed, Mar 21, 2012 at 1:25 PM, Andrey Korolyov <andrey@xxxxxxx> wrote: >>>>>>> Oh, sorry - they probably inherited rights from log files, fixed. >>>>>>> >>>>>>> On Thu, Mar 22, 2012 at 12:17 AM, Samuel Just <sam.just@xxxxxxxxxxxxx> wrote: >>>>>>>> I get 403 Forbidden when I try to download any of the files. >>>>>>>> -Sam >>>>>>>> >>>>>>>> On Wed, Mar 21, 2012 at 11:51 AM, Andrey Korolyov <andrey@xxxxxxx> wrote: >>>>>>>>> http://xdel.ru/downloads/ceph-logs/ >>>>>>>>> >>>>>>>>> 1/ contains logs related to bench initiated at the osd0 machine and 2/ >>>>>>>>> - at osd1. >>>>>>>>> >>>>>>>>> On Wed, Mar 21, 2012 at 8:54 PM, Samuel Just <sam.just@xxxxxxxxxxxxx> wrote: >>>>>>>>>> Hmm, I'm seeing some very high latency on ops sent to osd.1. Can you >>>>>>>>>> post osd.1's logs? >>>>>>>>>> -Sam >>>>>>>>>> >>>>>>>>>> On Wed, Mar 21, 2012 at 3:51 AM, Andrey Korolyov <andrey@xxxxxxx> wrote: >>>>>>>>>>> Here, please: http://xdel.ru/downloads/ceph.log.gz >>>>>>>>>>> >>>>>>>>>>> Sometimes 'cur MB/s ' shows zero during rados bench, even if any debug >>>>>>>>>>> output disabled and log_file set to the empty value, hope it`s okay. >>>>>>>>>>> >>>>>>>>>>> On Wed, Mar 21, 2012 at 2:36 AM, Samuel Just <sam.just@xxxxxxxxxxxxx> wrote: >>>>>>>>>>>> Can you set osd and filestore debugging to 20, restart the osds, run >>>>>>>>>>>> rados bench as before, and post the logs? >>>>>>>>>>>> -Sam Just >>>>>>>>>>>> >>>>>>>>>>>> On Tue, Mar 20, 2012 at 1:37 PM, Andrey Korolyov <andrey@xxxxxxx> wrote: >>>>>>>>>>>>> rados bench 60 write -p data >>>>>>>>>>>>> <skip> >>>>>>>>>>>>> Total time run: 61.217676 >>>>>>>>>>>>> Total writes made: 989 >>>>>>>>>>>>> Write size: 4194304 >>>>>>>>>>>>> Bandwidth (MB/sec): 64.622 >>>>>>>>>>>>> >>>>>>>>>>>>> Average Latency: 0.989608 >>>>>>>>>>>>> Max latency: 2.21701 >>>>>>>>>>>>> Min latency: 0.255315 >>>>>>>>>>>>> >>>>>>>>>>>>> Here a snip from osd log, seems write size is okay. >>>>>>>>>>>>> >>>>>>>>>>>>> 2012-03-21 00:00:39.397066 7fdda86a7700 osd.0 10 pg[0.58( v 10'83 >>>>>>>>>>>>> (0'0,10'83] n=50 ec=1 les/c 9/9 8/8/6) [0,1] r=0 lpr=8 mlcod 10'82 >>>>>>>>>>>>> active+clean] removing repgather(0x31b5360 applying 10'83 rep_tid=597 >>>>>>>>>>>>> wfack= wfdisk= op=osd_op(client.4599.0:2533 rb.0.2.000000000040 [write >>>>>>>>>>>>> 1220608~4096] 0.17eb9fd8) v4) >>>>>>>>>>>>> 2012-03-21 00:00:39.397086 7fdda86a7700 osd.0 10 pg[0.58( v 10'83 >>>>>>>>>>>>> (0'0,10'83] n=50 ec=1 les/c 9/9 8/8/6) [0,1] r=0 lpr=8 mlcod 10'82 >>>>>>>>>>>>> active+clean] q front is repgather(0x31b5360 applying 10'83 >>>>>>>>>>>>> rep_tid=597 wfack= wfdisk= op=osd_op(client.4599.0:2533 >>>>>>>>>>>>> rb.0.2.000000000040 [write 1220608~4096] 0.17eb9fd8) v4) >>>>>>>>>>>>> >>>>>>>>>>>>> Sorry for my previous question about rbd chunks, it was really stupid :) >>>>>>>>>>>>> >>>>>>>>>>>>> On Mon, Mar 19, 2012 at 10:40 PM, Josh Durgin <josh.durgin@xxxxxxxxxxxxx> wrote: >>>>>>>>>>>>>> On 03/19/2012 11:13 AM, Andrey Korolyov wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Nope, I`m using KVM for rbd guests. Surely I`ve been noticed that Sage >>>>>>>>>>>>>>> mentioned too small value and I`ve changed it to 64M before posting >>>>>>>>>>>>>>> previous message with no success - both 8M and this value cause a >>>>>>>>>>>>>>> performance drop. When I tried to wrote small amount of data that can >>>>>>>>>>>>>>> be compared to writeback cache size(both on raw device and ext3 with >>>>>>>>>>>>>>> sync option), following results were made: >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> I just want to clarify that the writeback window isn't a full writeback >>>>>>>>>>>>>> cache - it doesn't affect reads, and does not help with request merging etc. >>>>>>>>>>>>>> It simply allows a bunch of writes to be in flight while acking the write to >>>>>>>>>>>>>> the guest immediately. We're working on a full-fledged writeback cache that >>>>>>>>>>>>>> to replace the writeback window. >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>> dd if=/dev/zero of=/var/img.1 bs=10M count=10 oflag=direct (almost >>>>>>>>>>>>>>> same without oflag there and in the following samples) >>>>>>>>>>>>>>> 10+0 records in >>>>>>>>>>>>>>> 10+0 records out >>>>>>>>>>>>>>> 104857600 bytes (105 MB) copied, 0.864404 s, 121 MB/s >>>>>>>>>>>>>>> dd if=/dev/zero of=/var/img.1 bs=10M count=20 oflag=direct >>>>>>>>>>>>>>> 20+0 records in >>>>>>>>>>>>>>> 20+0 records out >>>>>>>>>>>>>>> 209715200 bytes (210 MB) copied, 6.67271 s, 31.4 MB/s >>>>>>>>>>>>>>> dd if=/dev/zero of=/var/img.1 bs=10M count=30 oflag=direct >>>>>>>>>>>>>>> 30+0 records in >>>>>>>>>>>>>>> 30+0 records out >>>>>>>>>>>>>>> 314572800 bytes (315 MB) copied, 12.4806 s, 25.2 MB/s >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> and so on. Reference test with bs=1M and count=2000 has slightly worse >>>>>>>>>>>>>>> results _with_ writeback cache than without, as I`ve mentioned before. >>>>>>>>>>>>>>> Here the bench results, they`re almost equal on both nodes: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> bench: wrote 1024 MB in blocks of 4096 KB in 9.037468 sec at 113 MB/sec >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> One thing to check is the size of the writes that are actually being sent by >>>>>>>>>>>>>> rbd. The guest is probably splitting them into relatively small (128 or >>>>>>>>>>>>>> 256k) writes. Ideally it would be sending 4k writes, and this should be a >>>>>>>>>>>>>> lot faster. >>>>>>>>>>>>>> >>>>>>>>>>>>>> You can see the writes being sent by adding debug_ms=1 to the client or osd. >>>>>>>>>>>>>> The format is osd_op(.*[write OFFSET~LENGTH]). >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Also, because I`ve not mentioned it before, network performance is >>>>>>>>>>>>>>> enough to hold fair gigabit connectivity with MTU 1500. Seems that it >>>>>>>>>>>>>>> is not interrupt problem or something like it - even if ceph-osd, >>>>>>>>>>>>>>> ethernet card queues and kvm instance pinned to different sets of >>>>>>>>>>>>>>> cores, nothing changes. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Mon, Mar 19, 2012 at 8:59 PM, Greg Farnum >>>>>>>>>>>>>>> <gregory.farnum@xxxxxxxxxxxxx> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> It sounds like maybe you're using Xen? The "rbd writeback window" option >>>>>>>>>>>>>>>> only works for userspace rbd implementations (eg, KVM). >>>>>>>>>>>>>>>> If you are using KVM, you probably want 81920000 (~80MB) rather than >>>>>>>>>>>>>>>> 8192000 (~8MB). >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> What options are you running dd with? If you run a rados bench from both >>>>>>>>>>>>>>>> machines, what do the results look like? >>>>>>>>>>>>>>>> Also, can you do the ceph osd bench on each of your OSDs, please? >>>>>>>>>>>>>>>> (http://ceph.newdream.net/wiki/Troubleshooting#OSD_performance) >>>>>>>>>>>>>>>> -Greg >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Monday, March 19, 2012 at 6:46 AM, Andrey Korolyov wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> More strangely, writing speed drops down by fifteen percent when this >>>>>>>>>>>>>>>>> option was set in vm` config(instead of result from >>>>>>>>>>>>>>>>> http://www.mail-archive.com/ceph-devel@xxxxxxxxxxxxxxx/msg03685.html). >>>>>>>>>>>>>>>>> As I mentioned, I`m using 0.43, but due to crashed osds, ceph has been >>>>>>>>>>>>>>>>> recompiled with e43546dee9246773ffd6877b4f9495f1ec61cd55 and >>>>>>>>>>>>>>>>> 1468d95101adfad44247016a1399aab6b86708d2 - both cases caused crashes >>>>>>>>>>>>>>>>> under heavy load. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Sun, Mar 18, 2012 at 10:22 PM, Sage Weil<sage@xxxxxxxxxxxx >>>>>>>>>>>>>>>>> (mailto:sage@xxxxxxxxxxxx)> wrote: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> On Sat, 17 Mar 2012, Andrey Korolyov wrote: >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Hi, >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> I`ve did some performance tests at the following configuration: >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> mon0, osd0 and mon1, osd1 - two twelve-core r410 with 32G ram, mon2 - >>>>>>>>>>>>>>>>>>> dom0 with three dedicated cores and 1.5G, mostly idle. First three >>>>>>>>>>>>>>>>>>> disks on each r410 arranged into raid0 and holds osd data when fourth >>>>>>>>>>>>>>>>>>> holds os and osd` journal partition, all ceph-related stuff mounted on >>>>>>>>>>>>>>>>>>> the ext4 without barriers. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Firstly, I`ve noticed about a difference of benchmark performance and >>>>>>>>>>>>>>>>>>> write speed through rbd from small kvm instance running on one of >>>>>>>>>>>>>>>>>>> first two machines - when bench gave me about 110Mb/s, writing zeros >>>>>>>>>>>>>>>>>>> to raw block device inside vm with dd was at top speed about 45 mb/s, >>>>>>>>>>>>>>>>>>> for vm`fs (ext4 with default options) performance drops to ~23Mb/s. >>>>>>>>>>>>>>>>>>> Things get worse, when I`ve started second vm at second host and tried >>>>>>>>>>>>>>>>>>> to continue same dd tests simultaneously - performance fairly divided >>>>>>>>>>>>>>>>>>> by half for each instance :). Enabling jumbo frames, playing with cpu >>>>>>>>>>>>>>>>>>> affinity for ceph and vm instances and trying different TCP congestion >>>>>>>>>>>>>>>>>>> protocols gave no effect at all - with DCTCP I have slightly smoother >>>>>>>>>>>>>>>>>>> network load graph and that`s all. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Can ml please suggest anything to try to improve performance? >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Can you try setting >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> rbd writeback window = 8192000 >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> or similar, and see what kind of effect that has? I suspect it'll speed >>>>>>>>>>>>>>>>>> up dd; I'm less sure about ext3. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Thanks! >>>>>>>>>>>>>>>>>> sage >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> ceph-0.43, libvirt-0.9.8, qemu-1.0.0, kernel 3.2 >>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" >>>>>>>>>>>>>>>>>>> in >>>>>>>>>>>>>>>>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx >>>>>>>>>>>>>>>>>>> (mailto:majordomo@xxxxxxxxxxxxxxx) >>>>>>>>>>>>>>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in >>>>>>>>>>>>>>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx >>>>>>>>>>>>>>>>> (mailto:majordomo@xxxxxxxxxxxxxxx) >>>>>>>>>>>>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in >>>>>>>>>>>>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx >>>>>>>>>>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>> -- >>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in >>>>>>>>>>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx >>>>>>>>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>>>>>>>>>>> -- >>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in >>>>>>>>>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx >>>>>>>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html