Re: performance issues Manoj found in EC testing

Hi Pranith,

On 28/06/16 08:08, Pranith Kumar Karampuri wrote:



On Tue, Jun 28, 2016 at 10:21 AM, Poornima Gurusiddaiah
<pgurusid@xxxxxxxxxx <mailto:pgurusid@xxxxxxxxxx>> wrote:

    Regards,
    Poornima

    ------------------------------------------------------------------------

        *From: *"Pranith Kumar Karampuri" <pkarampu@xxxxxxxxxx
        <mailto:pkarampu@xxxxxxxxxx>>
        *To: *"Xavier Hernandez" <xhernandez@xxxxxxxxxx
        <mailto:xhernandez@xxxxxxxxxx>>
        *Cc: *"Gluster Devel" <gluster-devel@xxxxxxxxxxx
        <mailto:gluster-devel@xxxxxxxxxxx>>
        *Sent: *Monday, June 27, 2016 5:48:24 PM
        *Subject: *Re:  performance issues Manoj found in
        EC testing



        On Mon, Jun 27, 2016 at 12:42 PM, Pranith Kumar Karampuri
        <pkarampu@xxxxxxxxxx <mailto:pkarampu@xxxxxxxxxx>> wrote:



            On Mon, Jun 27, 2016 at 11:52 AM, Xavier Hernandez
            <xhernandez@xxxxxxxxxx <mailto:xhernandez@xxxxxxxxxx>> wrote:

                Hi Manoj,

                I always enable the client-io-threads option for
                disperse volumes. It improves performance noticeably,
                most probably because of the problem you have
                detected.

                I don't see any other way to solve that problem.


            I agree. Updated the bug with the same info.


                I think it would be a lot better to have a true thread
                pool (and maybe an I/O thread pool shared by fuse,
                client and server xlators) in libglusterfs instead of
                the io-threads xlator. This would allow each xlator to
                decide when and what should be parallelized in a more
                intelligent way, since basing the decision solely on the
                fop type seems too simplistic to me.

                In the specific case of EC, there are a lot of
                operations to perform for a single high level fop, and
                not all of them require the same priority. Also some of
                them could be executed in parallel instead of sequentially.


            I think it is high time we actually scheduled this (for
            a specific release) to get it into gluster. Maybe you
            should send out a doc where we can work out the details?
            I will be happy to explore options to integrate
            io-threads and syncop/barrier with this infra based on
            the design.


        I was just thinking: why can't we reuse the synctask
        framework? It already scales up/down based on the tasks,
        and uses at most 16 threads. Whatever we want executed in
        parallel, we can wrap in a synctask and run it. Would that
        be good enough?

    Yes, the synctask framework would be preferable to io-threads;
    otherwise we would have 16 synctask threads + 16(?) io-threads
    per mount instance, which would blow up gfapi clients that have
    many mounts in the same process. Also, would using synctask
    mean code changes in EC?


Yes, it will need some changes, but I don't think they are big ones. I
think the functions to encode/decode already exist; we just need to wrap
encoding/decoding as tasks and run them as synctasks.

I was also thinking about sleeping fops. Currently, when they are resumed, they are processed in the same thread that was processing another fop. This can add latency to fops and cause unnecessary delays in lock management. If they could be scheduled for execution by another thread, these delays would be drastically reduced.

On the other hand, splitting the computation of EC encoding for a single request across multiple threads is bad, because the current implementation takes advantage of the CPU's internal memory cache, which is really fast. We should compute all fragments of a single request in the same thread; multiple independent computations can be executed by different threads.


Xavi,
      A long time back we chatted a bit about the synctask code, and you
wanted the scheduling to be done by the kernel or something along those
lines. Apart from that, do you see any other issues? At least when the
tasks are synchronous, i.e. nothing goes out on the wire, task
scheduling equals thread scheduling by the kernel, and it works exactly
like the thread pool you were referring to. It does multi-tasking only
when the tasks are asynchronous in nature.

How would this work? Would we have to create a new synctask for each background function we want to execute? I think this has significant overhead, since each synctask requires its own stack, creates a frame that we don't really need in most cases, and causes context switches.

We could have hundreds or thousands of requests per second, and in some cases each request could even need more than one background task. I'm not sure synctasks are the right choice here.

I think that a thread pool is more lightweight.

Xavi





                Xavi


                On 25/06/16 19:42, Manoj Pillai wrote:


                    ----- Original Message -----

                        From: "Pranith Kumar Karampuri"
                        <pkarampu@xxxxxxxxxx <mailto:pkarampu@xxxxxxxxxx>>
                        To: "Xavier Hernandez" <xhernandez@xxxxxxxxxx
                        <mailto:xhernandez@xxxxxxxxxx>>
                        Cc: "Manoj Pillai" <mpillai@xxxxxxxxxx
                        <mailto:mpillai@xxxxxxxxxx>>, "Gluster Devel"
                        <gluster-devel@xxxxxxxxxxx
                        <mailto:gluster-devel@xxxxxxxxxxx>>
                        Sent: Thursday, June 23, 2016 8:50:44 PM
                        Subject: performance issues Manoj found in EC
                        testing

                        Hi Xavi,
                                  Meet Manoj from the Red Hat
                        performance team. He has been testing EC
                        performance on his stretch clusters. He found
                        some interesting things we would like to share
                        with you.

                        1) When we perform multiple streams of big-file
                        writes (12 parallel dds, I think), he found one
                        thread to be always hot (99% CPU). He was asking
                        me whether the fuse_reader thread does any extra
                        processing in EC compared to replicate.
                        Initially I thought it would just take the lock
                        and the epoll threads would perform the
                        encoding, but later I realized that once we have
                        the lock and version details, subsequent writes
                        on the file are encoded in the same thread that
                        enters EC. write-behind could play a role and
                        make the writes come to EC in an epoll thread,
                        but we consistently saw just one hot thread, not
                        multiple. We will be able to confirm this in
                        tomorrow's testing.

                        2) This is one more thing Raghavendra G found:
                        our current implementation of epoll doesn't let
                        other epoll threads pick up messages from a
                        socket while one thread is processing a message
                        from that socket. In EC's case that can be the
                        encoding of a write or the decoding of a read.
                        This prevents replies to operations on different
                        files from being processed in parallel. He
                        thinks this can be fixed for 3.9.

                        Manoj will be raising a bug to gather all his
                        findings. I just wanted to
                        introduce him and let you know the interesting
                        things he is finding before
                        you see the bug :-).
                        --
                        Pranith


                    Thanks, Pranith :).

                    Here's the bug:
                    https://bugzilla.redhat.com/show_bug.cgi?id=1349953

                    Comparing EC and replica-2 runs, the hot thread is
                    seen in both cases, so I have not opened this as an
                    EC bug. But my initial impression is that the
                    performance impact for EC is particularly bad
                    (details in the bug).

                    -- Manoj




            --
            Pranith




        --
        Pranith

        _______________________________________________
        Gluster-devel mailing list
        Gluster-devel@xxxxxxxxxxx <mailto:Gluster-devel@xxxxxxxxxxx>
        http://www.gluster.org/mailman/listinfo/gluster-devel





--
Pranith


