Re: Slow write times to gluster disk


 




Hi Soumya,

For the latest test we set up a test gluster volume consisting of 2 bricks, both residing on an NFS disk (/home). The gluster volume is neither replicated nor striped. The tests were performed on the server hosting the disk, so no network was involved.

Additional details of the system are in http://lists.gluster.org/pipermail/gluster-users/2017-April/030529.html (note that the tests are now all being done on the /home disk)

Pat


On 05/31/2017 06:56 AM, Soumya Koduri wrote:


On 05/31/2017 07:24 AM, Pranith Kumar Karampuri wrote:
Thanks this is good information.

+Soumya

Soumya,
       We are trying to find out why kNFS is performing far better than
plain distribute glusterfs+fuse. What information do you think would
help us compare the operations of kNFS vs gluster+fuse? We already
have profile output from the fuse side.

This could be because all operations done by kNFS are local to the system. The operations the FUSE mount sends over the network could be more numerous and more time-consuming than the ones sent by the NFS client. We could compare and examine the traffic patterns from tcpdump captures taken over the fuse mount and the NFS mount. nfsstat [1] may also give some clues.
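To make that comparison concrete, something like the following could work (a sketch only: the interface name eth0 and the mount points /fuse-mnt and /nfs-mnt are placeholders, it needs root, and the gluster brick port range varies by version and configuration):

```shell
# Capture traffic during a dd run on the FUSE mount; gluster uses
# port 24007 for management plus per-brick ports (commonly 49152+,
# but the exact range depends on version/configuration).
tcpdump -i eth0 -w fuse.pcap 'port 24007 or portrange 49152-49251' &
TCPDUMP_PID=$!
dd if=/dev/zero of=/fuse-mnt/zeros.txt bs=1M count=1024 conv=sync
kill $TCPDUMP_PID

# Repeat on the kNFS mount, capturing NFS traffic (port 2049).
tcpdump -i eth0 -w nfs.pcap 'port 2049' &
TCPDUMP_PID=$!
dd if=/dev/zero of=/nfs-mnt/zeros.txt bs=1M count=1024 conv=sync
kill $TCPDUMP_PID

# nfsstat's per-operation counters, taken before and after a run,
# give a second view of what the NFS client actually sent.
nfsstat
```

Comparing the op counts and round trips in the two pcaps should show whether the FUSE path is simply doing more network operations per byte written.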

Sorry, I hadn't followed this mail from the beginning. But is this comparison between a single-brick volume and kNFS exporting that brick? Otherwise it's not a fair comparison if the volume is replicated or distributed.

Thanks,
Soumya

[1] https://linux.die.net/man/8/nfsstat


On Wed, May 31, 2017 at 7:10 AM, Pat Haley <phaley@xxxxxxx
<mailto:phaley@xxxxxxx>> wrote:


    Hi Pranith,

    The "dd" command was:

        dd if=/dev/zero count=4096 bs=1048576 of=zeros.txt conv=sync
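Note that conv=sync and oflag=sync do different things: conv=sync pads short input blocks to the block size with NULs, while oflag=sync opens the output with O_SYNC so every write must reach stable storage before the next one. A quick local illustration (writing small throwaway files to /tmp):

```shell
# conv=sync pads short input blocks to bs; it does not make the
# writes synchronous. With /dev/zero input the padding is a no-op.
dd if=/dev/zero of=/tmp/dd_conv.bin bs=1M count=4 conv=sync

# oflag=sync opens the output with O_SYNC, so each 1 MB write blocks
# until the data is on stable storage; this is the slow case discussed
# further down the thread.
dd if=/dev/zero of=/tmp/dd_sync.bin bs=1M count=4 oflag=sync
```

Both produce a 4 MiB file here, but only the second forces a disk round trip per write.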

    There were 2 instances where dd reported 22 seconds. The output from
    the dd tests is in

http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt
<http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt>

    Pat


    On 05/30/2017 09:27 PM, Pranith Kumar Karampuri wrote:
    Pat,
           What is the command you used? As per the following output,
    it seems like at least one write operation took 16 seconds, which
    is really bad:

     %-latency  Avg-latency  Min-Latency     Max-Latency  No. of calls  Fop
         96.39   1165.10 us     89.00 us  *16487014.00 us*      393212  WRITE


    On Tue, May 30, 2017 at 10:36 PM, Pat Haley <phaley@xxxxxxx
    <mailto:phaley@xxxxxxx>> wrote:


        Hi Pranith,

        I ran the same 'dd' test both in the gluster test volume and
        in the .glusterfs directory of each brick.  The median results
        (12 dd trials in each test) are similar to before

          * gluster test volume: 586.5 MB/s
          * bricks (in .glusterfs): 1.4 GB/s

        The profile for the gluster test-volume is in

http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/profile_testvol_gluster.txt
<http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/profile_testvol_gluster.txt>

        Thanks

        Pat




        On 05/30/2017 12:10 PM, Pranith Kumar Karampuri wrote:
        Let's start with the same 'dd' test we were using, to
        see what the numbers are. Please provide profile numbers
        for the same. From there we will start tuning the volume
        to see what we can do.

        On Tue, May 30, 2017 at 9:16 PM, Pat Haley <phaley@xxxxxxx
        <mailto:phaley@xxxxxxx>> wrote:


            Hi Pranith,

            Thanks for the tip.  We now have the gluster volume
            mounted under /home.  What tests do you recommend we run?

            Thanks

            Pat



            On 05/17/2017 05:01 AM, Pranith Kumar Karampuri wrote:


            On Tue, May 16, 2017 at 9:20 PM, Pat Haley
            <phaley@xxxxxxx <mailto:phaley@xxxxxxx>> wrote:


                Hi Pranith,

                Sorry for the delay.  I never received your
                reply (but I did receive Ben Turner's follow-up to
                it).  So we tried to create a gluster volume
                under /home using different variations of

                gluster volume create test-volume
                mseas-data2:/home/gbrick_test_1
                mseas-data2:/home/gbrick_test_2 transport tcp

                However we keep getting errors of the form

                Wrong brick type: transport, use
<HOSTNAME>:<export-dir-abs-path>

                Any thoughts on what we're doing wrong?


            I think "transport tcp" should come before the brick list.
            In any case, tcp is the default transport, so there is no
            need to specify it; just remove those two words from the CLI.
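For reference, the working form of the command would presumably look like this (a sketch; "transport tcp" can also be dropped entirely since tcp is the default):

```shell
# transport, when given, comes before the brick list
gluster volume create test-volume transport tcp \
    mseas-data2:/home/gbrick_test_1 \
    mseas-data2:/home/gbrick_test_2
```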


                Also, do you have a list of the tests we should
                run once we get this volume created? Given the
                time-zone difference, it might help if we can run
                a small battery of tests and post the results all
                at once, rather than iterating test, post, new
                test, post, and so on.


            This is the first time I am doing performance analysis
            on a user's setup, as far as I remember. In our team
            there are separate engineers who do these tests; Ben,
            who replied earlier, is one such engineer.

            Ben,
                Have any suggestions?



                Thanks

                Pat



On 05/11/2017 12:06 PM, Pranith Kumar Karampuri wrote:


                On Thu, May 11, 2017 at 9:32 PM, Pat Haley
                <phaley@xxxxxxx <mailto:phaley@xxxxxxx>> wrote:


                    Hi Pranith,

                    The /home partition is mounted as ext4
                    /home              ext4
                    defaults,usrquota,grpquota      1 2

                    The brick partitions are mounted as xfs
                    /mnt/brick1  xfs defaults        0 0
                    /mnt/brick2  xfs defaults        0 0

                    Will this cause a problem with creating a
                    volume under /home?


                I don't think the bottleneck is the disk. Could
                you run the same tests you did earlier on your
                new volume to confirm?



                    Pat



                    On 05/11/2017 11:32 AM, Pranith Kumar Karampuri
                    wrote:


                    On Thu, May 11, 2017 at 8:57 PM, Pat Haley
                    <phaley@xxxxxxx <mailto:phaley@xxxxxxx>> wrote:


                        Hi Pranith,

                        Unfortunately, we don't have similar
                        hardware for a small scale test.  All we
                        have is our production hardware.


                    You said something about the /home partition,
                    which has fewer disks; we can create a plain
                    distribute volume inside one of those
                    directories. After we are done, we can remove
                    the setup. What do you say?
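A sketch of that throwaway setup (the brick directory names match the ones used elsewhere in the thread; the mount point /mnt/test-volume is a placeholder, and everything is torn down afterwards):

```shell
# Create and start a plain distribute volume on directories under /home
mkdir -p /home/gbrick_test_1 /home/gbrick_test_2
gluster volume create test-volume \
    mseas-data2:/home/gbrick_test_1 mseas-data2:/home/gbrick_test_2
gluster volume start test-volume
mount -t glusterfs mseas-data2:/test-volume /mnt/test-volume

# ... run the dd tests against /mnt/test-volume ...

# Tear everything down once done
umount /mnt/test-volume
gluster volume stop test-volume
gluster volume delete test-volume
rm -rf /home/gbrick_test_1 /home/gbrick_test_2
```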



                        Pat




                        On 05/11/2017 07:05 AM, Pranith Kumar
                        Karampuri wrote:


                        On Thu, May 11, 2017 at 2:48 AM, Pat
                        Haley <phaley@xxxxxxx
<mailto:phaley@xxxxxxx>> wrote:


                            Hi Pranith,

                            Since we are mounting the partitions
                            as the bricks, I tried the dd test
                            writing to
<brick-path>/.glusterfs/<file-to-be-removed-after-test>.
                            The results without oflag=sync were
                            1.6 Gb/s (faster than gluster but not
                            as fast as I was expecting given the
                            1.2 Gb/s to the no-gluster area w/
                            fewer disks).


                        Okay, then 1.6Gb/s is what we need to
                        target, considering your volume is
                        just distribute. Is there any way you
                        can do tests on similar hardware but
                        at a small scale? Just so we can run
                        the workload and learn more about the
                        bottlenecks in the system? We can
                        probably try to get the speed to
                        1.2Gb/s on the /home partition you
                        were telling me about yesterday. Let
                        me know if that is something you are
                        okay to do.



                            Pat



                            On 05/10/2017 01:27 PM, Pranith Kumar
                            Karampuri wrote:


                            On Wed, May 10, 2017 at 10:15 PM,
                            Pat Haley <phaley@xxxxxxx
<mailto:phaley@xxxxxxx>> wrote:


                                Hi Pranith,

                                Not entirely sure (this isn't my
                                area of expertise).  I'll run
                                your answer by some other people
                                who are more familiar with this.

                                I am also uncertain about how to
                                interpret the results when we
                                also add the dd tests writing to
                                the /home area (no gluster,
                                still on the same machine)

                                  * dd test without oflag=sync
                                    (rough average of multiple tests)
                                      o gluster w/ fuse mount: 570 Mb/s
                                      o gluster w/ nfs mount: 390 Mb/s
                                      o nfs (no gluster): 1.2 Gb/s
                                  * dd test with oflag=sync
                                    (rough average of multiple tests)
                                      o gluster w/ fuse mount: 5 Mb/s
                                      o gluster w/ nfs mount: 200 Mb/s
                                      o nfs (no gluster): 20 Mb/s

                                Given that the non-gluster area
                                is a RAID-6 of 4 disks while
                                each brick of the gluster area
                                is a RAID-6 of 32 disks, I would
                                naively expect the writes to the
                                gluster area to be roughly 8x
                                faster than to the non-gluster.


                            I think a better test is to try to
                            write a file using nfs, without any
                            gluster, to a location that is not
                            inside the brick but some other
                            location on the same disk(s). If you
                            are mounting the partition as the
                            brick, then we can write to a file
                            inside the .glusterfs directory,
                            something like
<brick-path>/.glusterfs/<file-to-be-removed-after-test>.



                                I still think we have a speed
                                issue, I can't tell if fuse vs
                                nfs is part of the problem.


                            I got interested in the post because
                            I read that the fuse speed is lower
                            than the nfs speed, which is
                            counter-intuitive to my
                            understanding, so I wanted
                            clarification. Now that I have it,
                            with fuse outperforming nfs without
                            sync, we can resume testing as
                            described above and try to find what
                            the problem is. Based on your email
                            id I am guessing you are in Boston
                            and I am in Bangalore, so if you are
                            okay with this debugging taking
                            multiple days because of the
                            timezones, I will be happy to help.
                            Please be a bit patient with me; I
                            am under a release crunch, but I am
                            very curious about the problem you
                            posted.

                                  Was there anything useful in
                                the profiles?


                            Unfortunately the profiles didn't
                            help me much. I think we are
                            collecting the profiles from an
                            active volume, so they contain a lot
                            of information not pertaining to dd,
                            which makes it difficult to isolate
                            dd's contribution. So I went through
                            your post again and found something
                            I hadn't paid much attention to
                            earlier, i.e. oflag=sync, so I did
                            my own tests on my setup with FUSE
                            and sent that reply.
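One way around the active-volume noise, if I recall the CLI behavior correctly, is to use the interval statistics of the profile command, which reset each time "profile info" is read:

```shell
gluster volume profile test-volume start
# Read and discard once; this resets the interval counters
gluster volume profile test-volume info > /dev/null
# Run the workload to be measured
dd if=/dev/zero of=/mnt/test-volume/zeros.txt bs=1M count=4096 conv=sync
# The "Interval" section of this output now mostly reflects the dd run
gluster volume profile test-volume info > dd_profile.txt
```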



                                Pat



                                On 05/10/2017 12:15 PM, Pranith
                                Kumar Karampuri wrote:
                                Okay good. At least this
                                validates my doubts. Handling
                                O_SYNC in gluster NFS and fuse
                                is a bit different.
                                When an application opens a file
                                with O_SYNC on a fuse mount,
                                each write syscall has to be
                                written to disk as part of that
                                syscall, whereas in the case of
                                NFS there is no concept of open.
                                NFS performs the write through a
                                handle, saying it needs to be a
                                synchronous write, so the
                                write() syscall is performed
                                first and then it performs
                                fsync(). So a write on an fd
                                with O_SYNC becomes write+fsync.
                                My guess is that when multiple
                                threads do this write+fsync()
                                operation on the same file,
                                multiple writes are batched
                                together on their way to disk,
                                so the throughput on the disk
                                increases.

                                Does it answer your doubts?
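The two write patterns described above can be roughly mimicked with dd alone (a local sketch writing to /tmp; conv=fsync is not an exact model of the NFS per-request write+commit pattern, but it shows the effect of letting writes batch before syncing):

```shell
# Per-write sync: the file is opened with O_SYNC, so every 1 MB write
# blocks until the data is on stable storage (the fuse O_SYNC path).
dd if=/dev/zero of=/tmp/osync.bin bs=1M count=8 oflag=sync

# Buffered writes followed by a single fsync() before dd exits;
# individual writes can be batched on their way to disk.
dd if=/dev/zero of=/tmp/fsync.bin bs=1M count=8 conv=fsync
```

On a real disk the second form is usually much faster for the same amount of data.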

                                On Wed, May 10, 2017 at 9:35
                                PM, Pat Haley <phaley@xxxxxxx
<mailto:phaley@xxxxxxx>> wrote:


                                    Without the oflag=sync and
                                    only a single test of each,
                                    the FUSE is going faster
                                    than NFS:

                                    FUSE:
                                    mseas-data2(dri_nascar)% dd if=/dev/zero count=4096 bs=1048576 of=zeros.txt conv=sync
                                    4096+0 records in
                                    4096+0 records out
                                    4294967296 bytes (4.3 GB) copied, 7.46961 s, 575 MB/s


                                    NFS:
                                    mseas-data2(HYCOM)% dd if=/dev/zero count=4096 bs=1048576 of=zeros.txt conv=sync
                                    4096+0 records in
                                    4096+0 records out
                                    4294967296 bytes (4.3 GB) copied, 11.4264 s, 376 MB/s



                                    On 05/10/2017 11:53 AM,
Pranith Kumar Karampuri wrote:
Could you let me know the
                                    speed without oflag=sync
                                    on both the mounts? No
                                    need to collect profiles.

                                    On Wed, May 10, 2017 at
                                    9:17 PM, Pat Haley
<phaley@xxxxxxx
<mailto:phaley@xxxxxxx>>
                                    wrote:


                                        Here is what I see now:

                                        [root@mseas-data2 ~]# gluster volume info

                                        Volume Name: data-volume
                                        Type: Distribute
                                        Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
                                        Status: Started
                                        Number of Bricks: 2
                                        Transport-type: tcp
                                        Bricks:
                                        Brick1: mseas-data2:/mnt/brick1
                                        Brick2: mseas-data2:/mnt/brick2
                                        Options Reconfigured:
                                        diagnostics.count-fop-hits: on
                                        diagnostics.latency-measurement: on
                                        nfs.exports-auth-enable: on
                                        diagnostics.brick-sys-log-level: WARNING
                                        performance.readdir-ahead: on
                                        nfs.disable: on
                                        nfs.export-volumes: off



                                        On 05/10/2017 11:44 AM, Pranith Kumar Karampuri wrote:
                                        Is this the volume info you have?

                                        [root@mseas-data2 ~]# gluster volume info

                                        Volume Name: data-volume
                                        Type: Distribute
                                        Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
                                        Status: Started
                                        Number of Bricks: 2
                                        Transport-type: tcp
                                        Bricks:
                                        Brick1: mseas-data2:/mnt/brick1
                                        Brick2: mseas-data2:/mnt/brick2
                                        Options Reconfigured:
                                        performance.readdir-ahead: on
                                        nfs.disable: on
                                        nfs.export-volumes: off

                                        I copied this from an old thread from 2016.
                                        This is a distribute volume. Did you change
                                        any of the options in between?

                                        --

                                        -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
                                        Pat Haley                          Email:  phaley@xxxxxxx
                                        Center for Ocean Engineering       Phone:  (617) 253-6824
                                        Dept. of Mechanical Engineering    Fax:    (617) 253-8125
                                        MIT, Room 5-213                    http://web.mit.edu/phaley/www/
                                        77 Massachusetts Avenue
                                        Cambridge, MA  02139-4301

                                    --
                                    Pranith























_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://lists.gluster.org/mailman/listinfo/gluster-users



