Re: Mysteriously poor write performance

Andrey Korolyov <andrey@xxxxxxx> · Thu, 22 Mar 2012 21:40:26 +0300



random-rw: (g=0): rw=write, bs=4K-4K/4K-4K, ioengine=sync, iodepth=2
Starting 1 process
Jobs: 1 (f=1): [W] [100.0% done] [0K/35737K /s] [0/8725 iops] [eta 00m:00s]
random-rw: (groupid=0, jobs=1): err= 0: pid=9647
  write: io=163840KB, bw=37760KB/s, iops=9439, runt=  4339msec
    clat (usec): min=70, max=39801, avg=104.19, stdev=324.29
    bw (KB/s) : min=30480, max=43312, per=98.83%, avg=37317.00, stdev=5770.28
  cpu          : usr=1.84%, sys=13.00%, ctx=40961, majf=0, minf=26
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued r/w: total=0/40960, short=0/0
     lat (usec): 100=79.69%, 250=19.89%, 500=0.12%, 750=0.12%, 1000=0.11%
     lat (msec): 2=0.01%, 4=0.01%, 10=0.03%, 20=0.01%, 50=0.01%
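
A quick cross-check of the two fio runs in this thread: with ioengine=sync the job issues one write at a time (the "IO depths" line shows 1=100.0%), so IOPS should be close to the reciprocal of the average completion latency, and bandwidth close to IOPS times the 4K block size. The sketch below only redoes that arithmetic with the avg clat values fio printed; the function names are made up for illustration, and a small gap to fio's own iops figures is expected because the run time also includes time spent between I/Os.

```python
def iops_from_latency(avg_latency_usec, in_flight=1):
    """IOPS sustained when `in_flight` requests each take avg_latency_usec."""
    return in_flight * 1_000_000 / avg_latency_usec


def bandwidth_kb_s(iops, block_kb=4):
    """Bandwidth in KB/s for a given IOPS figure and block size in KB."""
    return iops * block_kb


# avg clat values (usec) copied from the fio output in this thread
RUNS = {
    "sequential (rw=write)": 104.19,
    "random (rw=randwrite)": 1714.52,
}

if __name__ == "__main__":
    for name, clat in RUNS.items():
        iops = iops_from_latency(clat)
        print(f"{name}: ~{iops:.0f} iops, ~{bandwidth_kb_s(iops):.0f} KB/s")
```

Both estimates land within a few percent of what fio reported (9439 and 581 iops), so the gap between the two runs is explained by per-write latency alone.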


On Thu, Mar 22, 2012 at 9:26 PM, Samuel Just <sam.just@xxxxxxxxxxxxx> wrote:
> Our journal writes are actually sequential.  Could you send FIO
> results for sequential 4k writes to osd.0's journal and osd.1's journal?
> -Sam
>
> On Thu, Mar 22, 2012 at 5:21 AM, Andrey Korolyov <andrey@xxxxxxx> wrote:
>> FIO output for the journal partition, with direct I/O enabled, looks fine (same
>> results for ext4 on other single SATA disks).
>>
>> random-rw: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=sync, iodepth=2
>> Starting 1 process
>> Jobs: 1 (f=1): [w] [100.0% done] [0K/3219K /s] [0/786 iops] [eta 00m:00s]
>> random-rw: (groupid=0, jobs=1): err= 0: pid=21926
>>  write: io=163840KB, bw=2327KB/s, iops=581, runt= 70403msec
>>    clat (usec): min=122, max=441551, avg=1714.52, stdev=7565.04
>>    bw (KB/s) : min=  552, max= 3880, per=100.61%, avg=2341.23, stdev=480.05
>>  cpu          : usr=0.42%, sys=1.34%, ctx=40976, majf=0, minf=42
>>  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>>     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>     issued r/w: total=0/40960, short=0/0
>>     lat (usec): 250=31.70%, 500=0.68%, 750=0.10%, 1000=0.63%
>>     lat (msec): 2=41.31%, 4=20.91%, 10=4.40%, 20=0.17%, 50=0.07%
>>     lat (msec): 500=0.04%
>>
>>
>>
>> On Thu, Mar 22, 2012 at 1:20 AM, Samuel Just <sam.just@xxxxxxxxxxxxx> wrote:
>>> (CCing the list)
>>>
>>> So, the problem isn't the bandwidth.  Before we respond to the client,
>>> we write the operation to the journal.  In this case, that operation
>>> is taking >1s per operation on osd.1.  Both rbd and rados bench will
>>> only allow a limited number of ops in flight at a time, so this
>>> latency is killing your throughput.  For comparison, the latency for
>>> writing to the journal on osd.0 is < .3s.  Can you measure direct io
>>> latency for writes to your osd.1 journal file?
>>> -Sam
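
To illustrate the point about ops in flight: with a fixed number of outstanding writes, throughput is roughly (ops in flight x op size) / latency. The sketch below applies that to the rados bench figures quoted further down in this thread (4194304-byte writes, 0.989608 s average latency). The 16 concurrent ops is an assumption (the rados bench default, which the thread does not state explicitly), and the function name is illustrative.

```python
MIB = 1024 * 1024


def latency_bound_mb_s(in_flight, op_bytes, avg_latency_s):
    """Throughput ceiling in MB/s (1 MB = 2**20 bytes) for a fixed queue depth."""
    return in_flight * op_bytes / avg_latency_s / MIB


if __name__ == "__main__":
    # ASSUMPTION: 16 ops in flight (rados bench default concurrency)
    observed = latency_bound_mb_s(16, 4194304, 0.989608)
    print(f"at 0.99 s per op: {observed:.1f} MB/s")
    # Hypothetical: the same load if end-to-end latency were 0.3 s per op
    print(f"at 0.30 s per op: {latency_bound_mb_s(16, 4194304, 0.3):.1f} MB/s")
```

The first figure comes out near 64.7 MB/s, close to the 64.622 MB/s that rados bench reported, which is consistent with latency rather than bandwidth being the limit.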
>>>
>>> On Wed, Mar 21, 2012 at 1:56 PM, Andrey Korolyov <andrey@xxxxxxx> wrote:
>>>> Oh, the Zabbix metrics may be confusing - the y-axis is in megabytes/s,
>>>> not megabits.
>>>>
>>>> On Thu, Mar 22, 2012 at 12:53 AM, Andrey Korolyov <andrey@xxxxxxx> wrote:
>>>>> [global]
>>>>>       log dir = /ceph/out
>>>>>       log_file = ""
>>>>>       logger dir = /ceph/log
>>>>>       pid file = /ceph/out/$type$id.pid
>>>>> [mds]
>>>>>       pid file = /ceph/out/$name.pid
>>>>>       lockdep = 1
>>>>>       mds log max segments = 2
>>>>> [osd]
>>>>>       lockdep = 1
>>>>>       filestore_xattr_use_omap = 1
>>>>>       osd data = /ceph/dev/osd$id
>>>>>       osd journal = /ceph/meta/journal
>>>>>       osd journal size = 100
>>>>> [mon]
>>>>>       lockdep = 1
>>>>>       mon data = /ceph/dev/mon$id
>>>>> [mon.0]
>>>>>       host = 172.20.1.32
>>>>>       mon addr = 172.20.1.32:6789
>>>>> [mon.1]
>>>>>       host = 172.20.1.33
>>>>>       mon addr = 172.20.1.33:6789
>>>>> [mon.2]
>>>>>       host = 172.20.1.35
>>>>>       mon addr = 172.20.1.35:6789
>>>>> [osd.0]
>>>>>       host = 172.20.1.32
>>>>> [osd.1]
>>>>>       host = 172.20.1.33
>>>>> [mds.a]
>>>>>       host = 172.20.1.32
>>>>>
>>>>> /dev/sda1 on /ceph type ext4 (rw,barrier=0,user_xattr)
>>>>> /dev/mapper/system-cephmeta on /ceph/meta type ext4 (rw,barrier=0,user_xattr)
>>>>> Simple performance tests on those filesystems show ~133 MB/s for both /ceph and
>>>>> /ceph/meta. Also, neither machine runs anything else that might
>>>>> affect the OSDs.
>>>>>
>>>>> Also, please note the following:
>>>>>
>>>>> http://i.imgur.com/ZgFdO.png
>>>>>
>>>>> The first two peaks are from running rados bench, followed by cluster
>>>>> recreation and an automated Debian install; the final peaks are the dd test.
>>>>> I can certainly produce more precise graphs, but the current one is probably
>>>>> enough to show the situation - rbd is using about a quarter of the available
>>>>> bandwidth (if we count rados bench as 100%).
>>>>>
>>>>> On Thu, Mar 22, 2012 at 12:39 AM, Samuel Just <sam.just@xxxxxxxxxxxxx> wrote:
>>>>>> Hmm, there seem to be writes taking as long as 1.5s to hit journal on
>>>>>> osd.1...  Could you post your ceph.conf?  Might there be a problem
>>>>>> with the osd.1 journal disk?
>>>>>> -Sam
>>>>>>
>>>>>> On Wed, Mar 21, 2012 at 1:25 PM, Andrey Korolyov <andrey@xxxxxxx> wrote:
>>>>>>> Oh, sorry - they probably inherited permissions from the log files; fixed.
>>>>>>>
>>>>>>> On Thu, Mar 22, 2012 at 12:17 AM, Samuel Just <sam.just@xxxxxxxxxxxxx> wrote:
>>>>>>>> I get 403 Forbidden when I try to download any of the files.
>>>>>>>> -Sam
>>>>>>>>
>>>>>>>> On Wed, Mar 21, 2012 at 11:51 AM, Andrey Korolyov <andrey@xxxxxxx> wrote:
>>>>>>>>> http://xdel.ru/downloads/ceph-logs/
>>>>>>>>>
>>>>>>>>> 1/ contains logs from the bench initiated on the osd0 machine, and 2/
>>>>>>>>> contains those from the bench initiated on osd1.
>>>>>>>>>
>>>>>>>>> On Wed, Mar 21, 2012 at 8:54 PM, Samuel Just <sam.just@xxxxxxxxxxxxx> wrote:
>>>>>>>>>> Hmm, I'm seeing some very high latency on ops sent to osd.1.  Can you
>>>>>>>>>> post osd.1's logs?
>>>>>>>>>> -Sam
>>>>>>>>>>
>>>>>>>>>> On Wed, Mar 21, 2012 at 3:51 AM, Andrey Korolyov <andrey@xxxxxxx> wrote:
>>>>>>>>>>> Here, please: http://xdel.ru/downloads/ceph.log.gz
>>>>>>>>>>>
>>>>>>>>>>> Sometimes 'cur MB/s' shows zero during rados bench, even with all debug
>>>>>>>>>>> output disabled and log_file set to an empty value; I hope that's okay.
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Mar 21, 2012 at 2:36 AM, Samuel Just <sam.just@xxxxxxxxxxxxx> wrote:
>>>>>>>>>>>> Can you set osd and filestore debugging to 20, restart the osds, run
>>>>>>>>>>>> rados bench as before, and post the logs?
>>>>>>>>>>>> -Sam Just
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Mar 20, 2012 at 1:37 PM, Andrey Korolyov <andrey@xxxxxxx> wrote:
>>>>>>>>>>>>> rados bench 60 write -p data
>>>>>>>>>>>>> <skip>
>>>>>>>>>>>>> Total time run:        61.217676
>>>>>>>>>>>>> Total writes made:     989
>>>>>>>>>>>>> Write size:            4194304
>>>>>>>>>>>>> Bandwidth (MB/sec):    64.622
>>>>>>>>>>>>>
>>>>>>>>>>>>> Average Latency:       0.989608
>>>>>>>>>>>>> Max latency:           2.21701
>>>>>>>>>>>>> Min latency:           0.255315
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here is a snippet from the osd log; the write size seems okay.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2012-03-21 00:00:39.397066 7fdda86a7700 osd.0 10 pg[0.58( v 10'83
>>>>>>>>>>>>> (0'0,10'83] n=50 ec=1 les/c 9/9 8/8/6) [0,1] r=0 lpr=8 mlcod 10'82
>>>>>>>>>>>>> active+clean]  removing repgather(0x31b5360 applying 10'83 rep_tid=597
>>>>>>>>>>>>> wfack= wfdisk= op=osd_op(client.4599.0:2533 rb.0.2.000000000040 [write
>>>>>>>>>>>>> 1220608~4096] 0.17eb9fd8) v4)
>>>>>>>>>>>>> 2012-03-21 00:00:39.397086 7fdda86a7700 osd.0 10 pg[0.58( v 10'83
>>>>>>>>>>>>> (0'0,10'83] n=50 ec=1 les/c 9/9 8/8/6) [0,1] r=0 lpr=8 mlcod 10'82
>>>>>>>>>>>>> active+clean]    q front is repgather(0x31b5360 applying 10'83
>>>>>>>>>>>>> rep_tid=597 wfack= wfdisk= op=osd_op(client.4599.0:2533
>>>>>>>>>>>>> rb.0.2.000000000040 [write 1220608~4096] 0.17eb9fd8) v4)
>>>>>>>>>>>>>
>>>>>>>>>>>>> Sorry for my previous question about rbd chunks, it was really stupid :)
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Mar 19, 2012 at 10:40 PM, Josh Durgin <josh.durgin@xxxxxxxxxxxxx> wrote:
>>>>>>>>>>>>>> On 03/19/2012 11:13 AM, Andrey Korolyov wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Nope, I'm using KVM for rbd guests. I did notice that the value Sage
>>>>>>>>>>>>>>> mentioned was too small, and I changed it to 64M before posting the
>>>>>>>>>>>>>>> previous message, with no success - both 8M and this value cause a
>>>>>>>>>>>>>>> performance drop. When I tried to write a small amount of data, comparable
>>>>>>>>>>>>>>> to the writeback cache size (both on the raw device and on ext3 with the
>>>>>>>>>>>>>>> sync option), I got the following results:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I just want to clarify that the writeback window isn't a full writeback
>>>>>>>>>>>>>> cache - it doesn't affect reads, and does not help with request merging etc.
>>>>>>>>>>>>>> It simply allows a bunch of writes to be in flight while acking the write to
>>>>>>>>>>>>>> the guest immediately. We're working on a full-fledged writeback cache
>>>>>>>>>>>>>> to replace the writeback window.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> dd if=/dev/zero of=/var/img.1 bs=10M count=10 oflag=direct (almost
>>>>>>>>>>>>>>> the same without oflag, here and in the following samples)
>>>>>>>>>>>>>>> 10+0 records in
>>>>>>>>>>>>>>> 10+0 records out
>>>>>>>>>>>>>>> 104857600 bytes (105 MB) copied, 0.864404 s, 121 MB/s
>>>>>>>>>>>>>>> dd if=/dev/zero of=/var/img.1 bs=10M count=20 oflag=direct
>>>>>>>>>>>>>>> 20+0 records in
>>>>>>>>>>>>>>> 20+0 records out
>>>>>>>>>>>>>>> 209715200 bytes (210 MB) copied, 6.67271 s, 31.4 MB/s
>>>>>>>>>>>>>>> dd if=/dev/zero of=/var/img.1 bs=10M count=30 oflag=direct
>>>>>>>>>>>>>>> 30+0 records in
>>>>>>>>>>>>>>> 30+0 records out
>>>>>>>>>>>>>>> 314572800 bytes (315 MB) copied, 12.4806 s, 25.2 MB/s
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> and so on. The reference test with bs=1M and count=2000 has slightly worse
>>>>>>>>>>>>>>> results _with_ the writeback cache than without, as I mentioned before.
>>>>>>>>>>>>>>> Here are the bench results; they're almost equal on both nodes:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> bench: wrote 1024 MB in blocks of 4096 KB in 9.037468 sec at 113 MB/sec
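
Reading the three dd runs above as cumulative points (an assumption, since they are separate runs), the rate for each additional 100 MB can be worked out from the differences. The sketch below just redoes that arithmetic with the byte counts and times dd printed; the function name is illustrative.

```python
# (bytes copied, seconds) as printed by the three dd runs
RUNS = [
    (104857600, 0.864404),
    (209715200, 6.67271),
    (314572800, 12.4806),
]


def marginal_rates_mb_s(runs):
    """MB/s (1 MB = 10**6 bytes, as dd reports) for each step between runs."""
    rates = []
    prev_bytes, prev_secs = 0, 0.0
    for nbytes, secs in runs:
        rates.append((nbytes - prev_bytes) / (secs - prev_secs) / 1e6)
        prev_bytes, prev_secs = nbytes, secs
    return rates


if __name__ == "__main__":
    for rate in marginal_rates_mb_s(RUNS):
        print(f"{rate:.1f} MB/s")
```

Under that reading, the first 100 MB goes at about 121 MB/s and each further 100 MB at roughly 18 MB/s, so the averages dd reports (31.4 and 25.2 MB/s) are a fast initial burst blended with a much slower sustained rate.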
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> One thing to check is the size of the writes that are actually being sent by
>>>>>>>>>>>>>> rbd. The guest is probably splitting them into relatively small (128 or
>>>>>>>>>>>>>> 256k) writes. Ideally it would be sending 4M writes, and this should be a
>>>>>>>>>>>>>> lot faster.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> You can see the writes being sent by adding debug_ms=1 to the client or osd.
>>>>>>>>>>>>>> The format is osd_op(.*[write OFFSET~LENGTH]).
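
A minimal sketch of pulling write sizes out of such log output, assuming entries contain the [write OFFSET~LENGTH] form described above. The regular expression and function names are illustrative, not part of Ceph; whitespace between "write" and the offset is allowed to span a line break because the log excerpt elsewhere in this thread is wrapped that way.

```python
import re
from collections import Counter

# Matches "[write OFFSET~LENGTH]"; \s+ also covers a wrapped line.
WRITE_RE = re.compile(r"\[write\s+(\d+)~(\d+)\]")


def write_extents(log_text):
    """Return (offset, length) tuples for every write op found in the text."""
    return [(int(off), int(length)) for off, length in WRITE_RE.findall(log_text)]


def size_histogram(log_text):
    """Count how many writes of each length appear in the text."""
    return Counter(length for _, length in write_extents(log_text))


# Fragment of the osd log excerpt posted in this thread (wrapped as in the mail)
SAMPLE = """wfack= wfdisk= op=osd_op(client.4599.0:2533 rb.0.2.000000000040 [write
1220608~4096] 0.17eb9fd8) v4)"""

if __name__ == "__main__":
    print(write_extents(SAMPLE))
    print(size_histogram(SAMPLE))
```

Run over a whole log, the histogram shows at a glance which request sizes the guest is actually sending.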
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Also, since I haven't mentioned it before: network performance is
>>>>>>>>>>>>>>> good enough to sustain proper gigabit connectivity with MTU 1500. It does
>>>>>>>>>>>>>>> not seem to be an interrupt problem or anything like it - even when ceph-osd,
>>>>>>>>>>>>>>> the ethernet card queues and the kvm instance are pinned to different sets
>>>>>>>>>>>>>>> of cores, nothing changes.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Mar 19, 2012 at 8:59 PM, Greg Farnum
>>>>>>>>>>>>>>> <gregory.farnum@xxxxxxxxxxxxx>  wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> It sounds like maybe you're using Xen? The "rbd writeback window" option
>>>>>>>>>>>>>>>> only works for userspace rbd implementations (eg, KVM).
>>>>>>>>>>>>>>>> If you are using KVM, you probably want 81920000 (~80MB) rather than
>>>>>>>>>>>>>>>> 8192000 (~8MB).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> What options are you running dd with? If you run a rados bench from both
>>>>>>>>>>>>>>>> machines, what do the results look like?
>>>>>>>>>>>>>>>> Also, can you do the ceph osd bench on each of your OSDs, please?
>>>>>>>>>>>>>>>> (http://ceph.newdream.net/wiki/Troubleshooting#OSD_performance)
>>>>>>>>>>>>>>>> -Greg
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Monday, March 19, 2012 at 6:46 AM, Andrey Korolyov wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Even more strangely, write speed drops by fifteen percent when this
>>>>>>>>>>>>>>>>> option is set in the vm's config (contrary to the result from
>>>>>>>>>>>>>>>>> http://www.mail-archive.com/ceph-devel@xxxxxxxxxxxxxxx/msg03685.html).
>>>>>>>>>>>>>>>>> As I mentioned, I'm using 0.43, but because of crashing osds, ceph has been
>>>>>>>>>>>>>>>>> recompiled with e43546dee9246773ffd6877b4f9495f1ec61cd55 and
>>>>>>>>>>>>>>>>> 1468d95101adfad44247016a1399aab6b86708d2 - both cases caused crashes
>>>>>>>>>>>>>>>>> under heavy load.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Sun, Mar 18, 2012 at 10:22 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Sat, 17 Mar 2012, Andrey Korolyov wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I've done some performance tests with the following configuration:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> mon0, osd0 and mon1, osd1 - two twelve-core r410s with 32G ram; mon2 -
>>>>>>>>>>>>>>>>>>> dom0 with three dedicated cores and 1.5G, mostly idle. The first three
>>>>>>>>>>>>>>>>>>> disks on each r410 are arranged into raid0 and hold the osd data, while the
>>>>>>>>>>>>>>>>>>> fourth holds the os and the osd journal partition; all ceph-related
>>>>>>>>>>>>>>>>>>> filesystems are ext4 mounted without barriers.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> First, I noticed a difference between benchmark performance and
>>>>>>>>>>>>>>>>>>> write speed through rbd from a small kvm instance running on one of the
>>>>>>>>>>>>>>>>>>> first two machines - while the bench gave me about 110 MB/s, writing zeros
>>>>>>>>>>>>>>>>>>> to the raw block device inside the vm with dd peaked at about 45 MB/s,
>>>>>>>>>>>>>>>>>>> and for the vm's fs (ext4 with default options) performance drops to ~23 MB/s.
>>>>>>>>>>>>>>>>>>> Things got worse when I started a second vm on the second host and tried
>>>>>>>>>>>>>>>>>>> to run the same dd tests simultaneously - performance was evenly divided
>>>>>>>>>>>>>>>>>>> in half for each instance :). Enabling jumbo frames, playing with cpu
>>>>>>>>>>>>>>>>>>> affinity for ceph and vm instances, and trying different TCP congestion
>>>>>>>>>>>>>>>>>>> control algorithms had no effect at all - with DCTCP I get a slightly
>>>>>>>>>>>>>>>>>>> smoother network load graph, and that's all.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Can the list please suggest anything to try to improve performance?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Can you try setting
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> rbd writeback window = 8192000
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> or similar, and see what kind of effect that has? I suspect it'll speed
>>>>>>>>>>>>>>>>>> up dd; I'm less sure about ext3.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>>>>> sage
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> ceph-0.43, libvirt-0.9.8, qemu-1.0.0, kernel 3.2
>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>>>>>>>>>>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>>>>>>>>>>>>>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>