Re: disperse volume brick counts limits in RHES

Xavier Hernandez <xhernandez@xxxxxxxxxx> · Tue, 9 May 2017 08:53:19 +0200

Hi Alastair,

the numbers I'm giving correspond to an Intel Xeon E5-2630L 2 GHz CPU.

On 08/05/17 22:44, Alastair Neil wrote:
> so the bottleneck is that computations with 16x20 matrix require ~4
> times the cycles?

This is only part of the problem. A 16x16 matrix can be processed at a
rate of 400 MB/s, so a single fragment on a brick would be processed at
400/16 = 25 MB/s. That is not what is seen in practice: the real heal
speed is lower.

Note that the fragment on a brick is only part of a whole file, so 25 
MB/s on a brick means that the real file is being processed at 400 MB/s.
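
The relation between the two rates can be made explicit with a small calculation. This is only a sketch of the arithmetic above: the function names are illustrative, and the 8 TB brick size (taken from the heal test reported earlier in this thread) is assumed to mean 8 * 10^6 MB. The result is a lower bound from matrix processing alone on a single core, ignoring network and disk latencies.

```python
def fragment_rate_mb_s(file_rate_mb_s: float, data_bricks: int) -> float:
    """Rate at which the fragment on one brick advances, given the rate at
    which the whole file is processed and the number of data bricks."""
    return file_rate_mb_s / data_bricks


def min_heal_hours(brick_mb: float, rate_mb_s: float) -> float:
    """Lower bound on the time to rebuild one brick at the given rate."""
    return brick_mb / rate_mb_s / 3600.0


# Figures from this thread: 400 MB/s for a 16x16 matrix, 16 data bricks.
per_brick = fragment_rate_mb_s(400.0, 16)
print(per_brick)                                 # 25.0

# Assumption: an 8 TB brick holds 8 * 10**6 MB of fragment data.
print(round(min_heal_hours(8e6, per_brick), 1))  # 88.9
```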

> It seems then that there is ample room for
> improvement, as there are many linear algebra packages out there that
> scale better than O(nxm).

That's true for much bigger matrices where synchronization time between 
threads is negligible compared to the computation time. In this case the 
algorithm is highly optimized and any attempt to distribute the 
computation would be worse.

Note that the current algorithm can rebuild the original data at a rate 
of ~5 CPU cycles per byte with a 16x16 configuration without any SIMD 
extension. With SSE or AVX this goes down to near 1 cycle per byte.
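
As a sanity check, the cycles-per-byte figures can be converted into throughput for the 2 GHz CPU mentioned at the top of this message. This is a rough sketch, assuming a single core running at its nominal clock and 1 MB = 10^6 bytes; the function name is only illustrative.

```python
def throughput_mb_s(clock_hz: float, cycles_per_byte: float) -> float:
    """Bytes processed per second by one core, expressed in MB/s."""
    return clock_hz / cycles_per_byte / 1e6


print(throughput_mb_s(2e9, 5))  # 400.0 (no SIMD)
print(throughput_mb_s(2e9, 1))  # 2000.0 (near the SSE/AVX figure)
```

The 5 cycles per byte case reproduces the 400 MB/s rate given above for a 16x16 matrix.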

In this case the best we can do is to run more than one heal in parallel.
This will use more than one core to compute the matrices, giving
better overall performance.

> Is the healing time dominated by the EC
> compute time?  If Serkan saw a hard 2x scaling then it seems likely.

Partially. The computation speed is doubled on an 8+2 configuration, but
the number of I/O operations is also halved, and each one is twice the
size of a 16+4 operation. This means that 8+2 incurs only half of the
latencies and makes better use of the bandwidth.
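
The comparison can be sketched numerically. The model below is an assumption, not the actual implementation. It takes the cost of one matrix operation as proportional to k*k for k data bricks and the data handled per operation as proportional to k (the reasoning in the previous message quoted below). It also assumes each heal operation covers a fixed-size block of the file, split evenly across the k data bricks. The 128 KiB block and 1 GiB of fragment data are hypothetical values chosen only to make the numbers concrete.

```python
def compute_cost_per_byte(k: int) -> float:
    """Relative CPU cost per byte: matrix work ~ k*k, data per op ~ k."""
    return (k * k) / k


def brick_ops(brick_bytes: int, block_bytes: int, k: int) -> tuple:
    """Return (number of operations, bytes per operation) on one brick when
    every operation covers block_bytes of file data split over k bricks."""
    op_size = block_bytes // k
    return brick_bytes // op_size, op_size


BLOCK = 128 * 1024  # hypothetical amount of file data per heal operation
BRICK = 1024 ** 3   # hypothetical amount of fragment data to rebuild

print(compute_cost_per_byte(16) / compute_cost_per_byte(8))  # 2.0
print(brick_ops(BRICK, BLOCK, 8))    # (65536, 16384)
print(brick_ops(BRICK, BLOCK, 16))   # (131072, 8192)
```

Under these assumptions 8+2 needs half the CPU work per byte and half as many operations per brick, each twice as large.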

The theoretical speed of matrix processing is 25 MB/s per brick, but the 
real speed seen is considerably smaller, so network latencies and other 
factors also contribute to the heal time.

Xavi

> -Alastair

On 8 May 2017 at 03:02, Xavier Hernandez <xhernandez@xxxxxxxxxx> wrote:

    On 05/05/17 13:49, Pranith Kumar Karampuri wrote:

        On Fri, May 5, 2017 at 2:38 PM, Serkan Çoban
        <cobanserkan@xxxxxxxxx> wrote:

            It is the overall time; an 8TB data disk healed 2x faster in
            the 8+2 configuration.

        Wow, that is counterintuitive to me. I will need to look into this
        to find out why that could be. Thanks a lot for this feedback!

    Matrix multiplication for encoding/decoding of 8+2 is 4 times faster
    than for 16+4 (one 16x16 matrix is composed of 4 submatrices of 8x8).
    However, each matrix operation on a 16+4 configuration processes twice
    the amount of data of an 8+2 one, so the net effect is that 8+2 is
    twice as fast as 16+4.

    An 8+2 configuration also uses bigger blocks on each brick, processing
    the same amount of data in fewer I/O operations and bigger network
    packets.

    Probably these are the reasons why 16+4 is slower than 8+2.

    See my other email for a more detailed description.

    Xavi

            On Fri, May 5, 2017 at 10:00 AM, Pranith Kumar Karampuri
            <pkarampu@xxxxxxxxxx> wrote:
            >
            >
            > On Fri, May 5, 2017 at 11:42 AM, Serkan Çoban
            > <cobanserkan@xxxxxxxxx> wrote:
            >>
            >> Healing gets slower as you increase m in m+n configuration.
            >> We are using 16+4 configuration without any problems other than
            >> heal speed.
            >> I tested heal speed with 8+2 and 16+4 on 3.9.0 and see that heals
            >> on 8+2 are faster by 2x.
            >
            >
            > As you increase the number of nodes participating in an EC set, the
            > number of parallel heals increases. Is the heal speed you saw
            > improved per file, or is it the overall time it took to heal the data?
            >
            >>
            >>
            >>
            >> On Fri, May 5, 2017 at 9:04 AM, Ashish Pandey
            >> <aspandey@xxxxxxxxxx> wrote:
            >> >
            >> > 8+2 and 8+3 configurations are not a limitation, just suggestions.
            >> > You can create a 16+3 volume without any issue.
            >> >
            >> > Ashish
            >> >
            >> > ________________________________
            >> > From: "Alastair Neil" <ajneil.tech@xxxxxxxxx>
            >> > To: "gluster-users" <gluster-users@xxxxxxxxxxx>
            >> > Sent: Friday, May 5, 2017 2:23:32 AM
            >> > Subject: disperse volume brick counts limits in RHES
            >> >
            >> >
            >> > Hi
            >> >
            >> > we are deploying a large (24node/45brick) cluster and noted that the
            >> > RHES guidelines limit the number of data bricks in a disperse set
            >> > to 8. Is there any reason for this? I am aware that you want this
            >> > to be a power of 2, but as we have a large number of nodes we were
            >> > planning on going with 16+3. Dropping to 8+2 or 8+3 will be a real
            >> > waste for us.
            >> >
            >> > Thanks,
            >> >
            >> >
            >> > Alastair
            >> >
            >> >
            >> > _______________________________________________
            >> > Gluster-users mailing list
            >> > Gluster-users@xxxxxxxxxxx
            >> > http://lists.gluster.org/mailman/listinfo/gluster-users
            >
            >
            >
            >
            > --
            > Pranith

        --
        Pranith

