Re: RBD single process read performance

Wido den Hollander <wido@xxxxxxxx> · Thu, 25 Apr 2013 17:04:27 +0200

On 04/25/2013 12:48 PM, Igor Laskovy wrote:
Sorry, this may sound odd, but what if you take LACP out of the picture
entirely on the client side? Have you tried these tests directly on one
of the OSD nodes? What about RAM utilization on the OSD nodes?

LACP isn't the issue here, since the reads are nowhere near the
limit of the network.

The nodes themselves perform well; simple benchmarks on the disks show
very high read and write speeds.

The RAM is 32GB per node and they are using about 4GB right now.

Wido

On Wed, Apr 24, 2013 at 10:21 PM, Wido den Hollander <wido@xxxxxxxx> wrote:

    On 04/24/2013 02:23 PM, Mark Nelson wrote:

        On 04/24/2013 06:17 AM, Wido den Hollander wrote:

            Hi,

            I've been working with a Ceph 0.56.4 setup and I've been seeing
            some RBD read performance issues with single processes / threads.

            The setup is:
            - 36 OSDs (2TB WD RE drives)
            - 9 hosts (4 OSDs per host)
            - 120GB Intel SSD as a journal per host
            - 32GB Ram per host
            - Quad Core Xeon CPU (E3-1220 V2 @ 3.10GHz)
            - 2Gbit LACP link

            The client (3.8.8 kernel) in this case is a single node connected
            with 20Gbit LACP to the same switches.

            To sum it up, with "rados bench" I'm seeing about 918MB/sec read
            (LACP doesn't balance well with one client) and 400MB/sec write.

            Note: 2 RADOS bench processes with 64 threads each.

            While running those RADOS benches, neither the disks nor the SSDs
            are really busy, so it seems this can be tuned a bit further.

            The problem is that when using either kernel RBD or librbd, reads
            are a lot slower than writes in a single process:

            dd if=/dev/zero of=/dev/rbd1 bs=4M count=1024: 290MB/sec
            dd if=/dev/rbd1 of=/dev/null bs=4M count=1024: 65MB/sec

            When running multiple writers I max out at somewhere around
            400MB/sec, the same as RADOS bench was telling me, but the reads
            go up to 300MB/sec when running multiple readers.

            Running multiple dd instances will still achieve about 60MB/sec
            per dd, but it sums up to somewhere around 300MB/sec. (5 readers)
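
As a rough sanity check on these figures: if a single dd effectively keeps
only one 4MB request in flight at a time (an assumption; readahead and
request splitting in the block layer make the real picture messier), then
65MB/sec works out to roughly 60 ms per request, and five independent
readers at about 60MB/sec each add up to the observed ~300MB/sec. A minimal
Python sketch of that arithmetic, with helper names made up for this
illustration:

```python
# Back-of-the-envelope model of a reader that keeps a fixed number of
# requests in flight. The function names are illustrative only.

def implied_latency_ms(throughput_mb_s, request_mb=4.0, in_flight=1):
    """Per-request latency (ms) implied by an observed throughput."""
    requests_per_sec = throughput_mb_s / request_mb
    return 1000.0 * in_flight / requests_per_sec

def aggregate_mb_s(per_reader_mb_s, readers):
    """Total throughput, assuming readers do not slow each other down."""
    return per_reader_mb_s * readers

if __name__ == "__main__":
    print(f"{implied_latency_ms(65):.1f} ms per 4MB read")
    print(f"{aggregate_mb_s(60, 5):.0f} MB/sec with 5 readers")
```

If that model is anywhere near right, the single-reader number is bound by
per-request latency rather than by disk or network bandwidth.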

            I changed the following settings:

            osd op threads = 8
            journal aio = true

            The AIO journal showed a huge increase in write performance as
            expected, but increasing the op threads didn't change that much.
            Going from 2 (default) to 4 gave me about 5MB/sec and going to 8
            added another 3MB/sec.

            Since I'm hitting the same RBD image over and over, I'd expected
            these blocks to be in the cache of those OSDs and the read speeds
            to come close to line rate.

            The big difference seems to be in the number of threads. I
            noticed the same with RADOS bench: with a smaller number of
            threads I wouldn't get to the 918MB/sec, and I had to spawn
            multiple processes to get there.

            However, 65MB/sec read per RBD device doesn't seem like a lot.

            I also tried with librbd, but that gives read performance similar
            to kernel RBD.

            The end-goal is to run with librbd (OpenStack), but for now I just
            want to crank up the read performance of a single process.

            I found multiple threads about read performance. One showed that
            AMD systems were a problem because of HyperTransport, but since
            these are Intel systems that isn't the case here.

            Any suggestions? I'm not trying to touch any kernel settings (yet)
            since the RADOS bench shows me a pretty high read performance.

        Hi Wido,

        I did some RBD testing with fio recently. This was 1 client node
        talking to 1 server with 24 OSDs over 2 round-robin bonded 10GbE
        interfaces. No replication. Multiple rados bench instances from the
        client node top out at ~1.8GB/s writes and ~1.4GB/s reads. I'm
        planning on doing a more complete write-up, but for now, here are
        some of the single volume fio results. The big thing here is that
        concurrency, even with a single IO process, is needed to get good
        performance. With more clients (even just VMs on the same node), we
        can get throughput within about 80% of the RADOS bench numbers.
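
To illustrate what concurrency within a single process can look like, here
is a minimal Python sketch that reads a file or block device in 4MB chunks
with a configurable number of requests in flight, using a thread pool. It
is only a sketch: the function name, chunk size and depths are
placeholders, it is not how fio, krbd or librbd issue I/O internally, and
it ignores client-side page cache effects.

```python
import os
import sys
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Tuple

CHUNK = 4 * 1024 * 1024  # 4MB, matching the request size used in this thread

def timed_read(path: str, total_bytes: int, in_flight: int = 1,
               chunk: int = CHUNK) -> Tuple[int, float]:
    """Read total_bytes from path in chunk-sized requests, keeping up to
    in_flight requests outstanding. Returns (bytes_read, seconds)."""
    offsets = range(0, total_bytes, chunk)
    fd = os.open(path, os.O_RDONLY)
    try:
        start = time.monotonic()
        with ThreadPoolExecutor(max_workers=in_flight) as pool:
            # os.pread takes an explicit offset, so threads can share one fd.
            sizes = pool.map(lambda off: len(os.pread(fd, chunk, off)), offsets)
            bytes_read = sum(sizes)
        elapsed = time.monotonic() - start
    finally:
        os.close(fd)
    return bytes_read, elapsed

if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python this_script.py <path>; reads 4GB, like dd bs=4M count=1024.
    for depth in (1, 16):
        n, secs = timed_read(sys.argv[1], 1024 * CHUNK, in_flight=depth)
        print(f"in_flight={depth}: {n / 2**20 / max(secs, 1e-9):.0f} MB/sec")
```

Comparing the two depths on the same device gives a quick feel for how much
of the gap is latency that concurrency can hide.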

    In my case I'm using 3x replication, so when writing I effectively have
    36/3 = 12 disks' worth of write performance available.

    The issue is that I don't see 100% util anywhere. But it's not about
    writing; the reading is just too slow.
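
A small sketch of that 36/3 arithmetic, extended to the 400MB/sec rados
bench write figure from earlier in the thread. It assumes writes are spread
evenly over all OSDs and hosts and that every replica passes through the
host's journal SSD once before reaching the data disk; the function name is
made up for this illustration.

```python
def backend_write_load(client_mb_s, replicas, osds, hosts):
    """Spread a replicated client write rate evenly over OSDs and hosts."""
    total = client_mb_s * replicas  # each byte is written `replicas` times
    return {
        "effective_osds": osds / replicas,  # the 36/3 from the text
        "per_osd_mb_s": total / osds,       # load on each data disk
        "per_host_mb_s": total / hosts,     # load on each host's journal SSD
    }

if __name__ == "__main__":
    for key, value in backend_write_load(400, 3, 36, 9).items():
        print(f"{key}: {value:.1f}")
```

Under those assumptions each data disk only sees around 33MB/sec at the
400MB/sec mark, which fits with the disks not looking busy.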

        4MB write performance using libaio:

        1 volume, 1 process, and iodepth = 1

        ceph 0.58, krbd:                              164MB/s
        ceph 0.58, qemu/kvm, no cache:                 84MB/s
        ceph 0.58, qemu/kvm, rbd cache:               240MB/s
        ceph wip-rbd-cache-aio, qemu/kvm, rbd cache:  244MB/s

    I tried with wip-bobtail-rbd-backports-req-order and with the recent
    patch for Qemu ( http://patchwork.ozlabs.org/patch/232489/ ) and get
    about 90MB/sec write, but again, it's about reads.

        1 volume, 1 process, and iodepth = 16

        ceph 0.58, krbd:                              711MB/s
        ceph 0.58, qemu/kvm, no cache:                899MB/s
        ceph 0.58, qemu/kvm, rbd cache:               227MB/s
        ceph wip-rbd-cache-aio, qemu/kvm, rbd cache:  680MB/s

        4MB read performance using libaio:

        1 volume, 1 process, and iodepth = 1

        ceph 0.58, krbd:                              108MB/s
        ceph 0.58, qemu/kvm, no cache:                 85MB/s
        ceph 0.58, qemu/kvm, rbd cache:                85MB/s
        ceph wip-rbd-cache-aio, qemu/kvm, rbd cache:   89MB/s

        1 volume, 1 process, and iodepth = 16

        ceph 0.58, krbd:                              516MB/s
        ceph 0.58, qemu/kvm, no cache:                839MB/s
        ceph 0.58, qemu/kvm, rbd cache:               823MB/s
        ceph wip-rbd-cache-aio, qemu/kvm, rbd cache:  830MB/s
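
Taking the 4MB read figures quoted above at face value, the gain from
iodepth 1 to iodepth 16 can be worked out directly. The short Python sketch
below only restates those numbers; the dictionary and function names are
made up for this illustration.

```python
# 4MB read results quoted above, in MB/s, as (iodepth=1, iodepth=16) pairs.
READ_MB_S = {
    "ceph 0.58, krbd": (108, 516),
    "ceph 0.58, qemu/kvm, no cache": (85, 839),
    "ceph 0.58, qemu/kvm, rbd cache": (85, 823),
    "ceph wip-rbd-cache-aio, qemu/kvm, rbd cache": (89, 830),
}

def speedups(results):
    """Ratio of iodepth=16 throughput to iodepth=1 throughput per setup."""
    return {name: deep / shallow for name, (shallow, deep) in results.items()}

if __name__ == "__main__":
    for name, factor in speedups(READ_MB_S).items():
        print(f"{name}: {factor:.1f}x")
```

That comes to a bit under 5x for krbd and between 9x and 10x for the
qemu/kvm cases, so well short of the 16x that perfect scaling would give.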

    With a 4M block size and an iodepth of 16 I'm maxing out at 90MB/sec
    inside a Qemu VM.

    Reads in general feel sluggish. For example "man fio" took about 4
    seconds to show up. Even running apt-get update is rather slow.

    The VM doesn't feel responsive at all, so I'm trying to figure out
    where that comes from.

        To get single request performance to scale further, you'll have to
        work out whether there are places where you can lower latency rather
        than hide it with concurrency. That's not an easy task in a
        distributed system like Ceph. There are probably opportunities for
        optimization, but I suspect it may take more than tweaking the
        ceph.conf file.

    I fully get that the distributed nature has its drawbacks in serial
    performance and that Ceph excels in parallel performance, however,
    just 60 ~ 80MB/sec seems rather slow. On a pretty idle cluster that
    should be better, especially when all the OSDs have everything in
    their page cache.

        Mark
        _________________________________________________
        ceph-users mailing list
        ceph-users@xxxxxxxxxxxxxx
        http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

    --
    Wido den Hollander
    42on B.V.

    Phone: +31 (0)20 700 9902
    Skype: contact42on

--
Igor Laskovy
facebook.com/igor.laskovy
studiogrizzly.com

--
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on