Hi Ben,

Sorry this took so long, but we had a real-time forecasting exercise last week and I could only get to this now.

Backend Hardware/OS:
I have had time to run one of the dd tests you requested against the underlying XFS FS. The median rate was 170 MB/s. The dd results and iostat record are in

http://mseas.mit.edu/download/phaley/GlusterUsers/TestXFS/

I'll add tests for the other brick and for the NFS area later.

Thanks

Pat

On 06/12/2017 06:06 PM, Ben Turner
wrote:
Ok you are correct, you have a pure distributed volume, i.e. no replication overhead. So normally for pure dist I use:

throughput = (slower of disk or NIC) * .6-.7

In your case we have: 1200 * .6 = 720. So you are seeing a little less throughput than I would expect in your configuration. What I like to do here is:

-First tell me more about your back end storage, will it sustain 1200 MB / sec? What kind of HW? How many disks? What type and specs are the disks? What kind of RAID are you using?

-Second can you refresh me on your workload? Are you doing reads / writes or both? If both, what mix? Since we are using dd I assume you are working with large-file sequential I/O, is this correct?

-Run some dd tests on the back end XFS FS. I normally have /xfs-mount/gluster-brick; if you have something similar just mkdir on the XFS -> /xfs-mount/my-test-dir. Inside the test dir:

If you are focusing on a write workload run:

# dd if=/dev/zero of=/xfs-mount/file bs=1024k count=10000 conv=fdatasync

If you are focusing on a read workload run:

# echo 3 > /proc/sys/vm/drop_caches
# dd if=/gluster-mount/file of=/dev/null bs=1024k count=10000

** MAKE SURE TO DROP CACHE IN BETWEEN READS!! **

Run this in a loop similar to how you did in:

http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt

Run this on both servers one at a time, and if you are running on a SAN then run again on both at the same time. While this is running, gather iostat for me:

# iostat -c -m -x 1 > iostat-$(hostname).txt

Let's see how the back end performs on both servers while capturing iostat, then see how the same workload / data looks on gluster.

-Last thing, when you run your kernel NFS tests are you using the same filesystem / storage you are using for the gluster bricks? I want to be sure we have an apples-to-apples comparison here.
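Ben's write-test procedure can be sketched as a small loop. This is only a sketch: TESTDIR, the iteration count, and the shrunken bs/count are placeholders — for a real run, point TESTDIR at a directory on the XFS brick mount and use bs=1024k count=10000 as above.

```shell
#!/bin/sh
# Repeated backend write test, per the procedure above.
# TESTDIR is a placeholder -- point it at a dir on the XFS brick mount.
TESTDIR=${TESTDIR:-$(mktemp -d)}

# While the loop runs, capture iostat from another terminal:
#   iostat -c -m -x 1 > iostat-$(hostname).txt

for i in 1 2 3; do
    # conv=fdatasync flushes to disk before dd reports its rate,
    # so the numbers reflect disk throughput, not page-cache speed
    dd if=/dev/zero of="$TESTDIR/ddfile" bs=1024k count=8 conv=fdatasync 2>&1 | tail -n 1
    rm -f "$TESTDIR/ddfile"
done
```

For the read variant, drop caches (`echo 3 > /proc/sys/vm/drop_caches`, as root) before every dd, otherwise the second and later reads come from page cache.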
-b

----- Original Message -----
From: "Pat Haley" <phaley@xxxxxxx>
To: "Ben Turner" <bturner@xxxxxxxxxx>
Sent: Monday, June 12, 2017 5:18:07 PM
Subject: Re: Slow write times to gluster disk

Hi Ben,

Here is the output:

[root@mseas-data2 ~]# gluster volume info

Volume Name: data-volume
Type: Distribute
Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: mseas-data2:/mnt/brick1
Brick2: mseas-data2:/mnt/brick2
Options Reconfigured:
nfs.exports-auth-enable: on
diagnostics.brick-sys-log-level: WARNING
performance.readdir-ahead: on
nfs.disable: on
nfs.export-volumes: off

On 06/12/2017 05:01 PM, Ben Turner wrote:

What is the output of gluster v info? That will tell us more about your config.

-b

----- Original Message -----
From: "Pat Haley" <phaley@xxxxxxx>
To: "Ben Turner" <bturner@xxxxxxxxxx>
Sent: Monday, June 12, 2017 4:54:00 PM
Subject: Re: Slow write times to gluster disk

Hi Ben,

I guess I'm confused about what you mean by replication. If I look at the underlying bricks I only ever have a single copy of any file. It either resides on one brick or the other (directories exist on both bricks but not files). We are not using gluster for redundancy (or at least that wasn't our intent). Is that what you meant by replication or is it something else?

Thanks

Pat

On 06/12/2017 04:28 PM, Ben Turner wrote:

----- Original Message -----
From: "Pat Haley" <phaley@xxxxxxx>
To: "Ben Turner" <bturner@xxxxxxxxxx>, "Pranith Kumar Karampuri" <pkarampu@xxxxxxxxxx>
Cc: "Ravishankar N" <ravishankar@xxxxxxxxxx>, gluster-users@xxxxxxxxxxx, "Steve Postma" <SPostma@xxxxxxxxxxxx>
Sent: Monday, June 12, 2017 2:35:41 PM
Subject: Re: Slow write times to gluster disk

Hi Guys,

I was wondering what our next steps should be to solve the slow write times. Recently I was debugging a large code and writing a lot of output at every time step.
When I tried writing to our gluster disks, it was taking over a day to do a single time step, whereas if I had the same program (same hardware, network) write to our nfs disk the time per time step was about 45 minutes. What we are shooting for here would be to have similar times to either gluster or nfs.

I can see in your test:

http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt

You averaged ~600 MB / sec (expected for replica 2 with 10G, {~1200 MB / sec} / #replicas{2} = 600). Gluster does client-side replication, so with replica 2 you will only ever see 1/2 the speed of the slowest part of your stack (NW, disk, RAM, CPU). This is usually NW or disk, and 600 is normally a best case. Now in your output I do see the instances where you went down to 200 MB / sec. I can only explain this in three ways:

1. You are not using conv=fdatasync, so writes are actually going to page cache and then being flushed to disk. During the fsync the memory is not yet available and the disks are busy flushing dirty pages.
2. Your storage RAID group is shared across multiple LUNs (like in a SAN), and when write times are slow the RAID group is busy servicing other LUNs.
3. Gluster bug / config issue / some other unknown unknown.

So I see 2 issues here:

1. NFS does in 45 minutes what gluster does in 24 hours.
2. Sometimes your throughput drops dramatically.

WRT #1 - have a look at my estimates above. My formula for guesstimating gluster perf is:

throughput = NIC throughput or storage (whichever is slower) / # replicas * overhead (figure .7 or .8)

Also, the larger the record size the better for glusterfs mounts; I normally like to be at LEAST 64k, up to 1024k:

# dd if=/dev/zero of=/gluster-mount/file bs=1024k count=10000 conv=fdatasync

WRT #2 - Again, I question your testing and your storage config. Try using conv=fdatasync for your dd's, use a larger record size, and make sure that your back end storage is not causing your slowdowns.
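Ben's guesstimate formula can be written as a tiny helper. This is a sketch: the function name and example inputs are illustrative; the 0.6-0.8 overhead factors and the 1200 MB/s figure come from the thread.

```shell
# throughput = (slower of NIC or storage) / replicas * overhead, in MB/s
est_throughput() {
    nic=$1; storage=$2; replicas=$3; overhead=$4
    slower=$(( nic < storage ? nic : storage ))
    awk -v s="$slower" -v r="$replicas" -v o="$overhead" \
        'BEGIN { printf "%.0f\n", s / r * o }'
}

est_throughput 1200 1400 1 0.6   # pure distribute on 10GbE: prints 720
est_throughput 1200 1400 2 0.7   # replica 2: prints 420
```

This makes the two estimates in the thread explicit: 720 MB/s for the pure-distribute case and roughly half that once a second replica shares the client's bandwidth.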
Also remember that with replica 2 you will take a ~50% hit on writes, because the client uses 50% of its bandwidth to write to one replica and 50% to the other.

-b

Thanks

Pat

On 06/02/2017 01:07 AM, Ben Turner wrote:

Are you sure using conv=sync is what you want? I normally use conv=fdatasync; I'll look up the difference between the two and see if it affects your test.

-b

----- Original Message -----
From: "Pat Haley" <phaley@xxxxxxx>
To: "Pranith Kumar Karampuri" <pkarampu@xxxxxxxxxx>
Cc: "Ravishankar N" <ravishankar@xxxxxxxxxx>, gluster-users@xxxxxxxxxxx, "Steve Postma" <SPostma@xxxxxxxxxxxx>, "Ben Turner" <bturner@xxxxxxxxxx>
Sent: Tuesday, May 30, 2017 9:40:34 PM
Subject: Re: Slow write times to gluster disk

Hi Pranith,

The "dd" command was:

dd if=/dev/zero count=4096 bs=1048576 of=zeros.txt conv=sync

There were 2 instances where dd reported 22 seconds. The output from the dd tests is in

http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/dd_testvol_gluster.txt

Pat

On 05/30/2017 09:27 PM, Pranith Kumar Karampuri wrote:

Pat, what is the command you used? As per the following output, it seems like at least one write operation took 16 seconds, which is really bad:

96.39 1165.10 us 89.00 us *16487014.00 us* 393212 WRITE

On Tue, May 30, 2017 at 10:36 PM, Pat Haley <phaley@xxxxxxx> wrote:

Hi Pranith,

I ran the same 'dd' test both in the gluster test volume and in the .glusterfs directory of each brick. The median results (12 dd trials in each test) are similar to before:

* gluster test volume: 586.5 MB/s
* bricks (in .glusterfs): 1.4 GB/s

The profile for the gluster test volume is in

http://mseas.mit.edu/download/phaley/GlusterUsers/TestVol/profile_testvol_gluster.txt

Thanks

Pat

On 05/30/2017 12:10 PM, Pranith Kumar Karampuri wrote:

Let's start with the same 'dd' test we were testing with to see what the numbers are.
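Worth noting while Ben checks the difference: conv=sync is a data-format option, not a durability flag. It pads each short input block with NULs up to bs; it is oflag=sync (O_SYNC on every write) and conv=fdatasync (one flush before dd exits) that force data to disk. A quick sketch of the padding behaviour:

```shell
# conv=sync pads each (short) input block with NULs up to bs;
# it does NOT force the data to disk.
printf abc | dd of=/tmp/padtest bs=512 conv=sync 2>/dev/null
stat -c %s /tmp/padtest   # prints 512, not 3
rm -f /tmp/padtest
```

With `if=/dev/zero` and full 1 MB blocks the padding never kicks in, so a conv=sync run behaves like a plain buffered dd with no sync at all.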
Please provide profile numbers for the same. From there on we will start tuning the volume to see what we can do.

On Tue, May 30, 2017 at 9:16 PM, Pat Haley <phaley@xxxxxxx> wrote:

Hi Pranith,

Thanks for the tip. We now have the gluster volume mounted under /home. What tests do you recommend we run?

Thanks

Pat

On 05/17/2017 05:01 AM, Pranith Kumar Karampuri wrote:

On Tue, May 16, 2017 at 9:20 PM, Pat Haley <phaley@xxxxxxx> wrote:

Hi Pranith,

Sorry for the delay. I never received your reply (but I did receive Ben Turner's follow-up to it). So we tried to create a gluster volume under /home using different variations of

gluster volume create test-volume mseas-data2:/home/gbrick_test_1 mseas-data2:/home/gbrick_test_2 transport tcp

However, we keep getting errors of the form

Wrong brick type: transport, use <HOSTNAME>:<export-dir-abs-path>

Any thoughts on what we're doing wrong?

You should give transport tcp at the beginning, I think. Anyway, transport tcp is the default, so there is no need to specify it; remove those two words from the CLI.

Also, do you have a list of the tests we should be running once we get this volume created? Given the time-zone difference, it might help if we can run a small battery of tests and post the results, rather than test-post-new test-post... .

This is the first time I am doing performance analysis with users as far as I remember. In our team there are separate engineers who do these tests. Ben, who replied earlier, is one such engineer. Ben, have any suggestions?

Thanks

Pat

On 05/11/2017 12:06 PM, Pranith Kumar Karampuri wrote:

On Thu, May 11, 2017 at 9:32 PM, Pat Haley <phaley@xxxxxxx> wrote:

Hi Pranith,

The /home partition is mounted as ext4:

/home ext4 defaults,usrquota,grpquota 1 2

The brick partitions are mounted as xfs:

/mnt/brick1 xfs defaults 0 0
/mnt/brick2 xfs defaults 0 0

Will this cause a problem with creating a volume under /home?
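For reference, the "Wrong brick type: transport" error appears because the CLI expects options before the brick list, so a trailing `transport tcp` is read as two more bricks. A sketch of the corrected invocation, using the volume and brick names from the thread:

```shell
# The transport option must precede the brick list...
gluster volume create test-volume transport tcp \
    mseas-data2:/home/gbrick_test_1 \
    mseas-data2:/home/gbrick_test_2

# ...or be omitted entirely, since tcp is the default transport:
gluster volume create test-volume \
    mseas-data2:/home/gbrick_test_1 \
    mseas-data2:/home/gbrick_test_2
```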
I don't think the bottleneck is disk. You can do the same tests you did on your new volume to confirm?

Pat

On 05/11/2017 11:32 AM, Pranith Kumar Karampuri wrote:

On Thu, May 11, 2017 at 8:57 PM, Pat Haley <phaley@xxxxxxx> wrote:

Hi Pranith,

Unfortunately, we don't have similar hardware for a small-scale test. All we have is our production hardware.

You said something about the /home partition which has fewer disks; we can create a plain distribute volume inside one of those directories. After we are done, we can remove the setup. What do you say?

Pat

On 05/11/2017 07:05 AM, Pranith Kumar Karampuri wrote:

On Thu, May 11, 2017 at 2:48 AM, Pat Haley <phaley@xxxxxxx> wrote:

Hi Pranith,

Since we are mounting the partitions as the bricks, I tried the dd test writing to <brick-path>/.glusterfs/<file-to-be-removed-after-test>. The results without oflag=sync were 1.6 Gb/s (faster than gluster, but not as fast as I was expecting given the 1.2 Gb/s to the no-gluster area w/ fewer disks).

Okay, then 1.6 Gb/s is what we need to target for, considering your volume is just distribute. Is there any way you can do tests on similar hardware but at a small scale? Just so we can run the workload to learn more about the bottlenecks in the system? We can probably try to get the speed to 1.2 Gb/s on the /home partition you were telling me about yesterday. Let me know if that is something you are okay to do.

Pat

On 05/10/2017 01:27 PM, Pranith Kumar Karampuri wrote:

On Wed, May 10, 2017 at 10:15 PM, Pat Haley <phaley@xxxxxxx> wrote:

Hi Pranith,

Not entirely sure (this isn't my area of expertise). I'll run your answer by some other people who are more familiar with this.
I am also uncertain about how to interpret the results when we also add the dd tests writing to the /home area (no gluster, still on the same machine):

* dd test without oflag=sync (rough average of multiple tests)
  o gluster w/ fuse mount: 570 Mb/s
  o gluster w/ nfs mount: 390 Mb/s
  o nfs (no gluster): 1.2 Gb/s
* dd test with oflag=sync (rough average of multiple tests)
  o gluster w/ fuse mount: 5 Mb/s
  o gluster w/ nfs mount: 200 Mb/s
  o nfs (no gluster): 20 Mb/s

Given that the non-gluster area is a RAID-6 of 4 disks while each brick of the gluster area is a RAID-6 of 32 disks, I would naively expect the writes to the gluster area to be roughly 8x faster than to the non-gluster area.

I think a better test is to try to write to a file using nfs without any gluster, to a location that is not inside the brick but some other location on the same disk(s). If you are mounting the partition as the brick, then we can write to a file inside the .glusterfs directory, something like <brick-path>/.glusterfs/<file-to-be-removed-after-test>.

I still think we have a speed issue; I can't tell if fuse vs nfs is part of the problem.

I got interested in the post because I read that fuse speed is less than nfs speed, which is counter-intuitive to my understanding, so I wanted clarifications. Now that I have my clarifications, where fuse outperformed nfs without sync, we can resume testing as described above and try to find what it is. Based on your email id I am guessing you are from Boston and I am from Bangalore, so if you are okay with doing this debugging over multiple days because of the timezones, I will be happy to help. Please be a bit patient with me; I am under a release crunch, but I am very curious about the problem you posted.

Was there anything useful in the profiles?

Unfortunately, the profiles didn't help me much. I think we are collecting the profiles from an active volume, so they contain a lot of information not pertaining to dd, which makes it difficult to isolate the contributions of dd.
So I went through your post again and found something I didn't pay much attention to earlier, i.e. oflag=sync, so I did my own tests on my setup with FUSE and sent that reply.

Pat

On 05/10/2017 12:15 PM, Pranith Kumar Karampuri wrote:

Okay, good. At least this validates my doubts. Handling O_SYNC in gluster NFS and fuse is a bit different. When an application opens a file with O_SYNC on a fuse mount, each write syscall has to be written to disk as part of the syscall, whereas in the case of NFS there is no concept of open. NFS performs the write through a handle saying it needs to be a synchronous write, so the write() syscall is performed first and then it performs fsync(); a write on an fd with O_SYNC becomes write+fsync. My guess is that when multiple threads do this write+fsync() operation on the same file, multiple writes are batched together to be written to disk, so the throughput on the disk increases. Does that answer your doubts?

On Wed, May 10, 2017 at 9:35 PM, Pat Haley <phaley@xxxxxxx> wrote:

Without the oflag=sync and only a single test of each, FUSE is going faster than NFS:

FUSE:
mseas-data2(dri_nascar)% dd if=/dev/zero count=4096 bs=1048576 of=zeros.txt conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 7.46961 s, 575 MB/s

NFS:
mseas-data2(HYCOM)% dd if=/dev/zero count=4096 bs=1048576 of=zeros.txt conv=sync
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 11.4264 s, 376 MB/s

On 05/10/2017 11:53 AM, Pranith Kumar Karampuri wrote:

Could you let me know the speed without oflag=sync on both the mounts? No need to collect profiles.
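The write+fsync distinction Pranith describes maps directly onto dd's flags; a sketch on a throwaway file (sizes shrunk for illustration; scale up for a real measurement):

```shell
# O_SYNC semantics: every 1 MB write blocks until it is on stable storage
dd if=/dev/zero of=/tmp/osync-demo bs=1024k count=8 oflag=sync 2>&1 | tail -n 1

# write+fdatasync semantics: writes land in page cache, with one flush
# before dd exits, so the reported rate still includes the disk flush
dd if=/dev/zero of=/tmp/osync-demo bs=1024k count=8 conv=fdatasync 2>&1 | tail -n 1

rm -f /tmp/osync-demo
```

On most disks the oflag=sync run typically reports a much lower rate, mirroring the collapse Pat saw on the FUSE mount with oflag=sync.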
On Wed, May 10, 2017 at 9:17 PM, Pat Haley <phaley@xxxxxxx> wrote:

Here is what I see now:

[root@mseas-data2 ~]# gluster volume info

Volume Name: data-volume
Type: Distribute
Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: mseas-data2:/mnt/brick1
Brick2: mseas-data2:/mnt/brick2
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
nfs.exports-auth-enable: on
diagnostics.brick-sys-log-level: WARNING
performance.readdir-ahead: on
nfs.disable: on
nfs.export-volumes: off

On 05/10/2017 11:44 AM, Pranith Kumar Karampuri wrote:

Is this the volume info you have?

> [root@mseas-data2 ~]# gluster volume info
>
> Volume Name: data-volume
> Type: Distribute
> Volume ID: c162161e-2a2d-4dac-b015-f31fd89ceb18
> Status: Started
> Number of Bricks: 2
> Transport-type: tcp
> Bricks:
> Brick1: mseas-data2:/mnt/brick1
> Brick2: mseas-data2:/mnt/brick2
> Options Reconfigured:
> performance.readdir-ahead: on
> nfs.disable: on
> nfs.export-volumes: off

I copied this from an old thread from 2016. This is a distribute volume. Did you change any of the options in between?

--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Pat Haley                          Email: phaley@xxxxxxx
Center for Ocean Engineering       Phone: (617) 253-6824
Dept. of Mechanical Engineering    Fax:   (617) 253-8125
MIT, Room 5-213                    http://web.mit.edu/phaley/www/
77 Massachusetts Avenue
Cambridge, MA 02139-4301

--
Pranith
_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://lists.gluster.org/mailman/listinfo/gluster-users