Re: Mysteriously poor write performance

On 03/19/2012 11:13 AM, Andrey Korolyov wrote:
Nope, I'm using KVM for rbd guests. I did notice that Sage said the
value was too small, and I changed it to 64M before posting my
previous message, with no success - both 8M and 64M cause a
performance drop. When I wrote a small amount of data, comparable to
the writeback cache size (both to the raw device and to ext3 mounted
with the sync option), I got the following results:

I just want to clarify that the writeback window isn't a full writeback cache - it doesn't affect reads, and it doesn't help with request merging, etc. It simply allows a bunch of writes to be in flight while acking the write to the guest immediately. We're working on a full-fledged writeback cache to replace the writeback window.

dd if=/dev/zero of=/var/img.1 bs=10M count=10 oflag=direct (results
are almost the same without oflag, here and in the following samples)
10+0 records in
10+0 records out
104857600 bytes (105 MB) copied, 0.864404 s, 121 MB/s
dd if=/dev/zero of=/var/img.1 bs=10M count=20 oflag=direct
20+0 records in
20+0 records out
209715200 bytes (210 MB) copied, 6.67271 s, 31.4 MB/s
dd if=/dev/zero of=/var/img.1 bs=10M count=30 oflag=direct
30+0 records in
30+0 records out
314572800 bytes (315 MB) copied, 12.4806 s, 25.2 MB/s

and so on. A reference test with bs=1M and count=2000 gives slightly
worse results _with_ the writeback cache than without, as I mentioned
before. Here are the bench results; they are almost equal on both nodes:

bench: wrote 1024 MB in blocks of 4096 KB in 9.037468 sec at 113 MB/sec

One thing to check is the size of the writes that are actually being sent by rbd. The guest is probably splitting them into relatively small (128 or 256k) writes. Ideally it would be sending 4M writes, and that should be a lot faster.

You can see the writes being sent by adding debug_ms=1 to the client or osd. The format is osd_op(.*[write OFFSET~LENGTH]).
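For example (a minimal sketch - the log path and client name are assumptions, adjust them to your setup), you could turn the debugging on in ceph.conf on the KVM host and then grep the client log for the actual write sizes:

    [client]
        debug ms = 1

    grep -o 'osd_op(.*\[write [0-9]*~[0-9]*\]' /var/log/ceph/client.admin.log | head

If most LENGTH values are 131072 or 262144, the guest is indeed splitting the 10M dd blocks into 128k/256k requests.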

Also, since I haven't mentioned it before: network performance is
good enough to sustain full gigabit connectivity with MTU 1500. It
does not seem to be an interrupt problem or anything like that - even
with ceph-osd, the ethernet card queues and the kvm instance pinned
to different sets of cores, nothing changes.
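For reference, raw TCP throughput between the nodes can be double-checked with iperf (the hostname below is a placeholder):

    iperf -s               # on node A
    iperf -c node-a -t 30  # on node B; ~940 Mbit/s is about the practical ceiling for gigabit at MTU 1500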

On Mon, Mar 19, 2012 at 8:59 PM, Greg Farnum
<gregory.farnum@xxxxxxxxxxxxx>  wrote:
It sounds like maybe you're using Xen? The "rbd writeback window" option only works for userspace rbd implementations (eg, KVM).
If you are using KVM, you probably want 81920000 (~80MB) rather than 8192000 (~8MB).

What options are you running dd with? If you run a rados bench from both machines, what do the results look like?
Also, can you do the ceph osd bench on each of your OSDs, please? (http://ceph.newdream.net/wiki/Troubleshooting#OSD_performance)
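For example (a sketch - the pool name and osd ids are assumptions, substitute your own):

    rados -p rbd bench 60 write   # run from both machines
    ceph osd tell 0 bench         # repeat for osd 1; results appear in 'ceph -w' or the osd log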
-Greg


On Monday, March 19, 2012 at 6:46 AM, Andrey Korolyov wrote:

More strangely, write speed drops by fifteen percent when this
option is set in the vm's config (instead of the result from
http://www.mail-archive.com/ceph-devel@xxxxxxxxxxxxxxx/msg03685.html).
As I mentioned, I'm using 0.43, but due to crashed osds, ceph has
been recompiled with e43546dee9246773ffd6877b4f9495f1ec61cd55 and
1468d95101adfad44247016a1399aab6b86708d2 - both of those issues
caused crashes under heavy load.

On Sun, Mar 18, 2012 at 10:22 PM, Sage Weil<sage@xxxxxxxxxxxx (mailto:sage@xxxxxxxxxxxx)>  wrote:
On Sat, 17 Mar 2012, Andrey Korolyov wrote:
Hi,

I've done some performance tests with the following configuration:

mon0/osd0 and mon1/osd1 are two twelve-core r410s with 32G of RAM;
mon2 is a dom0 with three dedicated cores and 1.5G, mostly idle. On
each r410 the first three disks are arranged in a raid0 holding the
osd data, while the fourth disk holds the OS and the osd's journal
partition; everything ceph-related is mounted on ext4 without
barriers.
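For example, the osd data filesystem is mounted along these lines (the device and mount point are placeholders):

    mount -t ext4 -o nobarrier /dev/md0 /srv/osd0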

First, I noticed a difference between benchmark performance and
write speed through rbd from a small kvm instance running on one of
the first two machines - while bench gave me about 110MB/s, writing
zeros to the raw block device inside the vm with dd topped out at
about 45MB/s, and for the vm's fs (ext4 with default options)
performance drops to ~23MB/s. Things get worse when I start a second
vm on the second host and run the same dd tests simultaneously -
performance is fairly divided in half between the instances :).
Enabling jumbo frames, playing with cpu affinity for the ceph and vm
instances, and trying different TCP congestion protocols had no
effect at all - with DCTCP I get a slightly smoother network load
graph, and that's all. (The pinning was along the lines of the
sketch below.)
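A pinning sketch - the core ranges and pids are placeholders:

    taskset -pc 0-5 <ceph-osd-pid>   # pin the osd to one set of cores
    taskset -pc 6-11 <kvm-pid>       # pin the kvm instance to another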

Can the list please suggest anything to try to improve performance?

Can you try setting

rbd writeback window = 8192000

or similar, and see what kind of effect that has? I suspect it'll speed
up dd; I'm less sure about ext3.
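For reference, a sketch of where this setting would live - the [client] section of ceph.conf on the host running KVM:

    [client]
        rbd writeback window = 8192000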

Thanks!
sage



ceph-0.43, libvirt-0.9.8, qemu-1.0.0, kernel 3.2




