Thanks to Greg, I have noticed a very strange thing - the data pool is
filled with a bunch of objects like rb.0.0.0000000004db with a typical size
of 4194304, when the original pool for the guest OS has a size of only 112
(created as 40g). It seems that something went wrong, because on 0.42 I had
more impressive performance on cheaper hardware. At first I blamed the
recent crash and recreated the cluster from scratch about an hour ago, but
those objects were created in a bare data/ pool with only one VM.

On Mon, Mar 19, 2012 at 10:40 PM, Josh Durgin <josh.durgin@xxxxxxxxxxxxx> wrote:
> On 03/19/2012 11:13 AM, Andrey Korolyov wrote:
>>
>> Nope, I'm using KVM for rbd guests. Of course, I noticed that Sage
>> mentioned too small a value, and I changed it to 64M before posting the
>> previous message, with no success - both 8M and this value cause a
>> performance drop. When I tried to write a small amount of data,
>> comparable to the writeback cache size (both on the raw device and on
>> ext3 with the sync option), I got the following results:
>
>
> I just want to clarify that the writeback window isn't a full writeback
> cache - it doesn't affect reads, and does not help with request merging etc.
> It simply allows a bunch of writes to be in flight while acking the write to
> the guest immediately. We're working on a full-fledged writeback cache
> to replace the writeback window.
>
>
>> dd if=/dev/zero of=/var/img.1 bs=10M count=10 oflag=direct (almost
>> the same without oflag there and in the following samples)
>> 10+0 records in
>> 10+0 records out
>> 104857600 bytes (105 MB) copied, 0.864404 s, 121 MB/s
>> dd if=/dev/zero of=/var/img.1 bs=10M count=20 oflag=direct
>> 20+0 records in
>> 20+0 records out
>> 209715200 bytes (210 MB) copied, 6.67271 s, 31.4 MB/s
>> dd if=/dev/zero of=/var/img.1 bs=10M count=30 oflag=direct
>> 30+0 records in
>> 30+0 records out
>> 314572800 bytes (315 MB) copied, 12.4806 s, 25.2 MB/s
>>
>> and so on.
>> The reference test with bs=1M and count=2000 has slightly worse
>> results _with_ the writeback cache than without, as I've mentioned before.
>> Here are the bench results; they're almost equal on both nodes:
>>
>> bench: wrote 1024 MB in blocks of 4096 KB in 9.037468 sec at 113 MB/sec
>
>
> One thing to check is the size of the writes that are actually being sent by
> rbd. The guest is probably splitting them into relatively small (128 or
> 256k) writes. Ideally it would be sending 4M writes, and this should be a
> lot faster.
>
> You can see the writes being sent by adding debug_ms=1 to the client or osd.
> The format is osd_op(.*[write OFFSET~LENGTH]).
>
>
>> Also, because I've not mentioned it before, network performance is
>> enough to hold fair gigabit connectivity with MTU 1500. It seems that it
>> is not an interrupt problem or something like it - even if ceph-osd, the
>> ethernet card queues and the kvm instance are pinned to different sets of
>> cores, nothing changes.
>>
>> On Mon, Mar 19, 2012 at 8:59 PM, Greg Farnum
>> <gregory.farnum@xxxxxxxxxxxxx> wrote:
>>>
>>> It sounds like maybe you're using Xen? The "rbd writeback window" option
>>> only works for userspace rbd implementations (eg, KVM).
>>> If you are using KVM, you probably want 81920000 (~80MB) rather than
>>> 8192000 (~8MB).
>>>
>>> What options are you running dd with? If you run a rados bench from both
>>> machines, what do the results look like?
>>> Also, can you do the ceph osd bench on each of your OSDs, please?
>>> (http://ceph.newdream.net/wiki/Troubleshooting#OSD_performance)
>>> -Greg
>>>
>>>
>>> On Monday, March 19, 2012 at 6:46 AM, Andrey Korolyov wrote:
>>>
>>>> More strangely, the write speed drops by fifteen percent when this
>>>> option is set in the VM's config (in contrast to the result from
>>>> http://www.mail-archive.com/ceph-devel@xxxxxxxxxxxxxxx/msg03685.html).
>>>> As I mentioned, I'm using 0.43, but due to crashed osds, ceph has been
>>>> recompiled with e43546dee9246773ffd6877b4f9495f1ec61cd55 and
>>>> 1468d95101adfad44247016a1399aab6b86708d2 - both cases caused crashes
>>>> under heavy load.
>>>>
>>>> On Sun, Mar 18, 2012 at 10:22 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>>>>>
>>>>> On Sat, 17 Mar 2012, Andrey Korolyov wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I've done some performance tests on the following configuration:
>>>>>>
>>>>>> mon0, osd0 and mon1, osd1 - two twelve-core r410 with 32G ram, mon2 -
>>>>>> dom0 with three dedicated cores and 1.5G, mostly idle. The first three
>>>>>> disks on each r410 are arranged into raid0 and hold the osd data, while
>>>>>> the fourth holds the OS and the osd journal partition; all ceph-related
>>>>>> stuff is mounted on ext4 without barriers.
>>>>>>
>>>>>> Firstly, I've noticed a difference between benchmark performance and
>>>>>> write speed through rbd from a small kvm instance running on one of the
>>>>>> first two machines - when bench gave me about 110 MB/s, writing zeros
>>>>>> to the raw block device inside the vm with dd topped out at about
>>>>>> 45 MB/s, and for the vm's fs (ext4 with default options) performance
>>>>>> drops to ~23 MB/s. Things get worse when I start a second vm on the
>>>>>> second host and try to continue the same dd tests simultaneously -
>>>>>> performance is fairly divided by half for each instance :). Enabling
>>>>>> jumbo frames, playing with cpu affinity for ceph and vm instances and
>>>>>> trying different TCP congestion protocols had no effect at all - with
>>>>>> DCTCP I have a slightly smoother network load graph and that's all.
>>>>>>
>>>>>> Can the ML please suggest anything to try to improve performance?
>>>>>
>>>>>
>>>>> Can you try setting
>>>>>
>>>>> rbd writeback window = 8192000
>>>>>
>>>>> or similar, and see what kind of effect that has? I suspect it'll speed
>>>>> up dd; I'm less sure about ext3.
>>>>>
>>>>> Thanks!
>>>>> sage
>>>>>
>>>>>
>>>>>>
>>>>>> ceph-0.43, libvirt-0.9.8, qemu-1.0.0, kernel 3.2
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
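
For anyone who wants to follow Josh's suggestion of checking the write sizes rbd actually sends, here is a rough sketch of how the debug_ms=1 output could be summarized. It assumes only the osd_op(.*[write OFFSET~LENGTH]) format quoted in the thread. The function name write_size_histogram and the sample log lines are made up for illustration, and real log lines may carry extra fields that need a looser pattern.

```python
import re
from collections import Counter

# Pattern built from the format quoted in the thread:
#   osd_op(.*[write OFFSET~LENGTH])
# Assumption: OFFSET and LENGTH are plain decimal byte counts.
WRITE_RE = re.compile(r"osd_op\(.*\[write (\d+)~(\d+)\]")


def write_size_histogram(lines):
    """Count how many writes of each length (in bytes) appear in the log."""
    sizes = Counter()
    for line in lines:
        match = WRITE_RE.search(line)
        if match:
            # group(1) is the offset, group(2) is the write length.
            sizes[int(match.group(2))] += 1
    return sizes


if __name__ == "__main__":
    # Hypothetical sample lines, not taken from a real cluster.
    sample = [
        "osd_op(client.1.0:1 rb.0.0.000000000001 [write 0~131072])",
        "osd_op(client.1.0:2 rb.0.0.000000000001 [write 131072~131072])",
        "osd_op(client.1.0:3 rb.0.0.000000000002 [write 0~4194304])",
    ]
    for length, count in sorted(write_size_histogram(sample).items()):
        print(f"{length:>10} bytes: {count} write(s)")
```

If most of the writes turn out to be 128k or 256k rather than close to the 4194304-byte object size, that would be consistent with the guest splitting requests as Josh describes.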