Re: Performance problems

Mark Nelson <mark.nelson@xxxxxxxxxxx> · Fri, 12 Apr 2013 10:04:58 -0500

On 04/11/2013 07:25 PM, Ziemowit Pierzycki wrote:
No, I'm not using RDMA in this configuration, since this will eventually
be deployed to production on 10G Ethernet (yes, RDMA is faster).  I
would prefer Ceph because it has a storage driver built into OpenNebula,
which my company is using, and because, as you mentioned, it works well
with individual drives.

I'm not sure what the problem is, but it appears to me that one of the
hosts may be holding up the rest.  With Ceph, if one of the hosts is
much slower than the others, could this potentially slow down the
cluster to this level?

Definitely!  Even one slow OSD can cause dramatic slowdowns.  This is
because we (by default) try to distribute data evenly across every OSD
in the cluster.  If even one OSD is really slow, it will accumulate more
and more outstanding operations while all of the other OSDs complete
their requests.  Eventually all of your outstanding operations end up
waiting on that slow OSD, and all of the other OSDs sit idle waiting
for new requests.
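
As a rough illustration of the effect described above, the sketch below
assumes a client that keeps a fixed number of requests in flight and
spreads them evenly, so that in steady state every OSD is paced by the
slowest one.  The function name and the per-OSD rates (in MB/s) are made
up for the example, and replication and journaling are ignored:

```shell
#!/bin/sh
# Simplified steady-state model (an assumption, not Ceph's actual scheduler):
# with an even spread of requests, aggregate throughput is roughly
# (number of OSDs) x (rate of the slowest OSD).
even_spread_throughput() {
    printf '%s\n' "$@" | awk '
        NR == 1 || $1 + 0 < min { min = $1 + 0 }
        { sum += $1 }
        END { printf "bounded=%g ideal=%g\n", NR * min, sum }'
}

# Three hypothetical OSDs: two healthy, one very slow.
even_spread_throughput 100 100 5
# prints: bounded=15 ideal=205
```

"bounded" is what the model predicts with the slow OSD in place, and
"ideal" is the sum of the individual rates.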

If you know that some OSDs are permanently slower than others, you can
re-weight them so that they receive fewer requests than the others,
which can mitigate this, but that isn't always an optimal solution.
Sometimes a slow OSD can be a sign of other hardware problems too.
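
To get a feel for what re-weighting does, the sketch below estimates
each OSD's share of requests under the assumption that the share is
simply proportional to its weight.  Real placement is pseudo-random, so
treat the numbers as approximate.  The helper name, the OSD id 2, and
the weight 0.5 are hypothetical; "ceph osd tree" and "ceph osd reweight"
are the commands normally used to inspect and change the weights:

```shell
#!/bin/sh
# Approximate share of requests per OSD, assuming share ~ weight / total.
expected_share() {
    printf '%s\n' "$@" | awk '
        { w[NR] = $1 + 0; total += $1 }
        END {
            for (i = 1; i <= NR; i++)
                printf "osd.%d %.0f%%\n", i - 1, 100 * w[i] / total
        }'
}

# Two OSDs at full weight and one (hypothetical osd.2) at half weight.
expected_share 1 1 0.5

# On a real cluster the change itself would look something like this
# (left commented out because it modifies data placement):
#   ceph osd tree            # show current weights
#   ceph osd reweight 2 0.5  # osd id 2 and weight 0.5 are examples only
```

With these example weights the half-weight OSD is expected to see about
20% of the requests instead of a third.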

Mark

On Thu, Apr 11, 2013 at 7:42 AM, Mark Nelson <mark.nelson@xxxxxxxxxxx> wrote:

    With GlusterFS are you using the native RDMA support?

    Ceph and Gluster tend to prefer pretty different disk setups too.
    Afaik RH still recommends RAID6 behind each brick, while we do
    better with individual disks behind each OSD.  You might want to
    watch the OSD admin socket and see if operations are backing up on
    any specific OSDs.
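
One possible way to do that on each OSD host is sketched below.  It
assumes the admin sockets live in the default /var/run/ceph directory
and that the running version exposes the dump_ops_in_flight command;
the function name is made up for the example:

```shell
#!/bin/sh
# Ask every OSD admin socket on this host for its in-flight operations.
# The socket directory can be overridden with the first argument.
dump_osd_ops() {
    dir=${1:-/var/run/ceph}
    for sock in "$dir"/ceph-osd.*.asok; do
        # Skip the unexpanded pattern (no sockets) and anything not a socket.
        [ -S "$sock" ] || continue
        echo "== $sock"
        ceph --admin-daemon "$sock" dump_ops_in_flight
    done
}

dump_osd_ops
```

An OSD whose list of in-flight operations keeps growing while the
others stay short is the one to look at first.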

    Mark

    On 04/09/2013 12:54 PM, Ziemowit Pierzycki wrote:

        Neither made a difference.  I also have a GlusterFS cluster with two
        nodes in replicated mode residing on 1TB drives:

        [root@triton speed]# dd conv=fdatasync if=/dev/zero of=/mnt/speed/test.out bs=512k count=10000
        10000+0 records in
        10000+0 records out
        5242880000 bytes (5.2 GB) copied, 43.573 s, 120 MB/s

        ... and Ceph:

        [root@triton temp]# dd conv=fdatasync if=/dev/zero of=/mnt/temp/test.out bs=512k count=10000
        10000+0 records in
        10000+0 records out
        5242880000 bytes (5.2 GB) copied, 366.911 s, 14.3 MB/s

        On Mon, Apr 8, 2013 at 4:29 PM, Mark Nelson <mark.nelson@xxxxxxxxxxx> wrote:

             On 04/08/2013 04:12 PM, Ziemowit Pierzycki wrote:

                 There is one SSD in each node.  IPoIB performance is about
                 7 Gbps between each host.  CephFS is mounted via the kernel
                 client.  The Ceph version is ceph-0.56.3-1.  I have a 1GB
                 journal on the same drive as the OSD, but on a separate
                 file system split off via LVM.

                 Here is output of another test with fdatasync:

                 [root@triton temp]# dd conv=fdatasync if=/dev/zero of=/mnt/temp/test.out bs=512k count=10000
                 10000+0 records in
                 10000+0 records out
                 5242880000 bytes (5.2 GB) copied, 359.307 s, 14.6 MB/s
                 [root@triton temp]# dd if=/mnt/temp/test.out of=/dev/null bs=512k count=10000
                 10000+0 records in
                 10000+0 records out
                 5242880000 bytes (5.2 GB) copied, 14.0521 s, 373 MB/s

             Definitely seems off!  How many SSDs are involved and how fast
             are they each?  The MTU idea might have merit, but I honestly
             don't know enough about how well IPoIB handles giant MTUs like
             that.  One thing I have noticed on other IPoIB setups is that
             TCP autotuning can cause a ton of problems.  You may want to
             try disabling it on all of the hosts involved:

             echo 0 | tee /proc/sys/net/ipv4/tcp_moderate_rcvbuf

             If that doesn't work, maybe try setting the MTU to 9000 or 1500,
             if possible.
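
Before and after changing the setting it can help to confirm what each
host is actually using.  The sketch below only reads the current
value.  The function name is made up, and the interface name ib0 in the
commented commands is an assumption about how the IPoIB interface is
named on these hosts:

```shell
#!/bin/sh
# Report whether TCP receive-buffer autotuning is on.  The path argument
# is optional and exists only so the function can be pointed at a test file.
autotune_state() {
    knob=${1:-/proc/sys/net/ipv4/tcp_moderate_rcvbuf}
    case "$(cat "$knob")" in
        0) echo "disabled" ;;
        1) echo "enabled" ;;
        *) echo "unknown" ;;
    esac
}

autotune_state

# Related commands (left commented out because they change host settings;
# "ib0" is a placeholder interface name):
#   sysctl -w net.ipv4.tcp_moderate_rcvbuf=0   # same effect as the echo above
#   ip link show ib0                           # shows the current MTU
#   ip link set dev ib0 mtu 9000               # try a smaller MTU
```

Note that the echo and sysctl forms do not persist across a reboot.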

             Mark

                 The network traffic appears to match the transfer speeds
                 shown here too.  Writing is very slow.

                 On Mon, Apr 8, 2013 at 3:04 PM, Mark Nelson <mark.nelson@xxxxxxxxxxx> wrote:

                      Hi,

                      How many drives?  Have you tested your IPoIB performance
                      with iperf?  Is this CephFS with the kernel client?  What
                      version of Ceph?  How are your journals configured? etc.
                      It's tough to make any recommendations without knowing
                      more about what you are doing.

                      Also, please use conv=fdatasync when doing buffered IO
                      writes with dd.
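
The reason is that without conv=fdatasync dd can report completion
once the data is in the page cache, so short writes look far faster
than the storage really is.  A minimal sketch of a synced write test,
assuming GNU dd; the function name and the small record count are
chosen only for the example:

```shell
#!/bin/sh
# Write COUNT records of 512 KiB to TARGET and make dd flush the file
# data to storage before it reports elapsed time and throughput.
synced_write() {
    target=$1
    count=$2
    dd if=/dev/zero of="$target" bs=512k count="$count" conv=fdatasync
}

# Small demonstration against a temporary file.
tmp=$(mktemp)
synced_write "$tmp" 4
rm -f "$tmp"
```

For a real measurement, point the target at a file on the mounted file
system under test and use a count large enough to exceed the host's RAM
cache, as in the 10000-record runs in this thread.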

                      Thanks,
                      Mark

                      On 04/08/2013 03:00 PM, Ziemowit Pierzycki wrote:

                          Hi,

                          The first test was writing a 500 MB file and was clocked
                          at 1.2 GB/s.  The second test was writing a 5000 MB file
                          at 17 MB/s.  The third test was reading the file at
                          ~400 MB/s.

                          On Mon, Apr 8, 2013 at 2:56 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:

                               More details, please. You ran the same test twice and
                               performance went up from 17.5MB/s to 394MB/s? How many
                               drives in each node, and of what kind?
                               -Greg
                               Software Engineer #42 @ http://inktank.com | http://ceph.com

                               On Mon, Apr 8, 2013 at 12:38 PM, Ziemowit Pierzycki <ziemowit@xxxxxxxxxxxxx> wrote:
                                > Hi,
                                >
                                > I have a 3 node SSD-backed cluster
                                > connected over InfiniBand (16K MTU), and here is the
                                > performance I am seeing:
                                >
                                > [root@triton temp]# !dd
                                > dd if=/dev/zero of=/mnt/temp/test.out bs=512k count=1000
                                > 1000+0 records in
                                > 1000+0 records out
                                > 524288000 bytes (524 MB) copied, 0.436249 s, 1.2 GB/s
                                > [root@triton temp]# dd if=/dev/zero of=/mnt/temp/test.out bs=512k count=10000
                                > 10000+0 records in
                                > 10000+0 records out
                                > 5242880000 bytes (5.2 GB) copied, 299.077 s, 17.5 MB/s
                                > [root@triton temp]# dd if=/mnt/temp/test.out of=/dev/null bs=512k count=10000
                                > 10000+0 records in
                                > 10000+0 records out
                                > 5242880000 bytes (5.2 GB) copied, 13.3015 s, 394 MB/s
                                >
                                > Does that look right?  How do I check that this is not a
                                > network problem?  I ask because I remember seeing a kernel
                                > issue related to large MTUs.
                                >
                                >
                                > _______________________________________________
                                > ceph-users mailing list
                                > ceph-users@xxxxxxxxxxxxxx
                                > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com