I will forward the emails to Shyam and the devel list. David (Sent from mobile) =============================== David F. Robinson, Ph.D. President - Corvid Technologies 704.799.6944 x101 [office] 704.252.1310 [cell] 704.799.7974 [fax] David.Robinson@xxxxxxxxxxxxx http://www.corvidtechnologies.com > On Feb 11, 2015, at 8:21 AM, Pranith Kumar Karampuri <pkarampu@xxxxxxxxxx> wrote: > > >> On 02/11/2015 06:49 PM, Pranith Kumar Karampuri wrote: >> >>> On 02/11/2015 08:36 AM, Shyam wrote: >>> Did some analysis with David today on this; here is a gist for the list: >>> >>> 1) Volumes classified as slow (i.e. with a lot of pre-existing data) and fast (new volumes carved from the same backend file system that the slow bricks are on, with little or no data) >>> >>> 2) We ran an strace of tar and also collected io-stats outputs from these volumes; both show that create and mkdir are slower on the slow volume as compared to the fast volume. This seems to be the overall reason for the slowness. >> Did you happen to do an strace of the brick when this happened? If not, David, can we get that information as well? > It would be nice to compare the difference in syscalls of the bricks of the two volumes to see if there are any extra syscalls that are adding to the delay. > > Pranith >> >> Pranith >>> >>> 3) The tarball extraction is to a new directory on the gluster mount, so all lookups etc. happen within this new namespace on the volume >>> >>> 4) Checked memory footprints of the slow bricks and fast bricks etc.; nothing untoward noticed there >>> >>> 5) Restarted the slow volume, just as a test case to do things from scratch; no improvement in performance. >>> >>> Currently attempting to reproduce this on a local system to see if the same behavior is seen so that it becomes easier to debug etc. >>> >>> Others on the list can chime in as they see fit. >>> >>> Thanks, >>> Shyam >>> >>>> On 02/10/2015 09:58 AM, David F. Robinson wrote: >>>> Forwarding to the devel list as recommended by Justin... >>>> >>>> David >>>> >>>> >>>> ------ Forwarded Message ------ >>>> From: "David F. Robinson" <david.robinson@xxxxxxxxxxxxx> >>>> To: "Justin Clift" <justin@xxxxxxxxxxx> >>>> Sent: 2/10/2015 9:49:09 AM >>>> Subject: Re[2]: missing files >>>> >>>> Bad news... I don't think it is the old linkto files. Bad because if >>>> that was the issue, cleaning up all of the bad linkto files would have fixed >>>> the issue. It seems like the system just gets slower as you add data. >>>> >>>> First, I set up a new clean volume (test2brick) on the same system as the >>>> old one (homegfs_bkp). See 'gluster v info' below. I ran my simple tar >>>> extraction test on the new volume and it took 58-seconds to complete >>>> (which, BTW, is 10-seconds faster than my old non-gluster system, so >>>> kudos). The time on homegfs_bkp is 19-minutes. >>>> >>>> Next, I copied 10-terabytes of data over to test2brick and re-ran the >>>> test, which then took 7-minutes. I created a test3brick and ran the test >>>> and it took 53-seconds. >>>> >>>> To confirm all of this, I deleted all of the data from test2brick and >>>> re-ran the test. It took 51-seconds!!! >>>> >>>> BTW, I also checked the .glusterfs directory for stale linkto files (find . -type >>>> f -size 0 -perm 1000 -exec ls -al {} \;). There are many, many thousands >>>> of these types of files on the old volume and none on the new one, so I >>>> don't think this is related to the performance issue. >>>> >>>> Let me know how I should proceed. Send this to the devel list? Pranith? >>>> others? Thanks...
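A minimal sketch of how the brick-side comparison Pranith asks for above might be collected; the volume names come from this thread, the brick PIDs are placeholders to be read off 'gluster volume status', and the exact strace options are only a suggestion:

# Find the brick process PIDs for the slow and the fast volume
gluster volume status homegfs_bkp
gluster volume status test2brick

# Trace one brick of each volume while the tar test runs, then compare the
# per-syscall timings (-tt/-T) or the summary counts (-c)
strace -f -tt -T -o /tmp/homegfs_bkp-brick.strace -p <slow-brick-pid> &
strace -f -tt -T -o /tmp/test2brick-brick.strace -p <fast-brick-pid> &
# ... run the tar extraction on each mount, then stop the traces ...
strace -c -f -p <slow-brick-pid>     # alternative: per-syscall summary only

# io-stats numbers for the same runs, via the built-in profiler
gluster volume profile homegfs_bkp start
# ... run the test ...
gluster volume profile homegfs_bkp info > /tmp/homegfs_bkp.profile
gluster volume profile homegfs_bkp stop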
>>>> >>>> [root@gfs01bkp .glusterfs]# gluster volume info homegfs_bkp >>>> Volume Name: homegfs_bkp >>>> Type: Distribute >>>> Volume ID: 96de8872-d957-4205-bf5a-076e3f35b294 >>>> Status: Started >>>> Number of Bricks: 2 >>>> Transport-type: tcp >>>> Bricks: >>>> Brick1: gfsib01bkp.corvidtec.com:/data/brick01bkp/homegfs_bkp >>>> Brick2: gfsib01bkp.corvidtec.com:/data/brick02bkp/homegfs_bkp >>>> >>>> [root@gfs01bkp .glusterfs]# gluster volume info test2brick >>>> Volume Name: test2brick >>>> Type: Distribute >>>> Volume ID: 123259b2-3c61-4277-a7e8-27c7ec15e550 >>>> Status: Started >>>> Number of Bricks: 2 >>>> Transport-type: tcp >>>> Bricks: >>>> Brick1: gfsib01bkp.corvidtec.com:/data/brick01bkp/test2brick >>>> Brick2: gfsib01bkp.corvidtec.com:/data/brick02bkp/test2brick >>>> >>>> [root@gfs01bkp glusterfs]# gluster volume info test3brick >>>> Volume Name: test3brick >>>> Type: Distribute >>>> Volume ID: 9b1613fc-f7e5-4325-8f94-e3611a5c3701 >>>> Status: Started >>>> Number of Bricks: 2 >>>> Transport-type: tcp >>>> Bricks: >>>> Brick1: gfsib01bkp.corvidtec.com:/data/brick01bkp/test3brick >>>> Brick2: gfsib01bkp.corvidtec.com:/data/brick02bkp/test3brick >>>> >>>> >>>> From homegfs_bkp: >>>> # find . -type f -size 0 -perm 1000 -exec ls -al {} \; >>>> --------T 2 gmathur pme_ics 0 Jan 9 16:59 >>>> ./00/16/00169a69-1a7a-44c9-b2d8-991671ee87c4 >>>> ---------T 3 jcowan users 0 Jan 9 17:51 >>>> ./00/16/0016a0a0-fd22-4fb5-b6fb-5d7f9024ab74 >>>> ---------T 2 morourke sbir 0 Jan 9 18:17 >>>> ./00/16/0016b36f-32fc-4f2c-accd-e36be2f6c602 >>>> ---------T 2 carpentr irl 0 Jan 9 18:52 >>>> ./00/16/00163faf-741c-4e40-8081-784786b3cc71 >>>> ---------T 3 601 raven 0 Jan 9 22:49 >>>> ./00/16/00163385-a332-4050-8104-1b1af6cd8249 >>>> ---------T 3 bangell sbir 0 Jan 9 22:56 >>>> ./00/16/00167803-0244-46de-8246-d9c382dd3083 >>>> ---------T 2 morourke sbir 0 Jan 9 23:17 >>>> ./00/16/00167bc5-fc56-42ee-9e3f-1e238f3828f4 >>>> ---------T 3 morourke sbir 0 Jan 9 23:34 >>>> ./00/16/0016a71e-89cf-4a86-9575-49c7e9d216c6 >>>> ---------T 2 gmathur users 0 Jan 9 23:47 >>>> ./00/16/00168aa2-d069-4a77-8790-e36431324ca5 >>>> ---------T 2 bangell users 0 Jan 22 09:24 >>>> ./00/16/0016e720-a190-4e43-962f-aa3e4216e5f5 >>>> ---------T 2 root root 0 Jan 22 09:26 >>>> ./00/16/00169e95-64b7-455c-82dc-d9940ee7fe43 >>>> ---------T 2 dfrobins users 0 Jan 22 09:27 >>>> ./00/16/00161b04-1612-4fba-99a4-2a2b54062fdb >>>> ---------T 2 mdick users 0 Jan 22 09:27 >>>> ./00/16/0016ba60-310a-4bee-968a-36eb290e8c9e >>>> ---------T 2 dfrobins users 0 Jan 22 09:43 >>>> ./00/16/00160315-1533-4290-8c1a-72e2fbb1962a >>>> From test2brick: >>>> find . -type f -size 0 -perm 1000 -exec ls -al {} \; >>>> >>>> >>>> >>>> >>>> >>>> ------ Original Message ------ >>>> From: "Justin Clift" <justin@xxxxxxxxxxx> >>>> To: "David F. Robinson" <david.robinson@xxxxxxxxxxxxx> >>>> Sent: 2/9/2015 11:33:54 PM >>>> Subject: Re: missing files >>>> >>>>> Interesting. (I'm 1/2 asleep atm and really need sleep soon, so take this >>>>> with a grain of salt... ;>) >>>>> >>>>> As a curiosity question, does the homegfs_bkp volume have a bunch of >>>>> outdated metadata still in it? eg left over extended attributes or >>>>> something >>>>> >>>>> Remembering a question you asked earlier er... today/yesterday about old >>>>> extended attribute entries and if they hang around forever. 
I don't >>>>> know the >>>>> answer to that, but if the old volume still has 1000's (or more) of >>>>> entries >>>>> around, perhaps there's some lookup problem that's killing lookup >>>>> times for >>>>> file operations. >>>>> >>>>> On a side note, I can probably set up my test lab stuff here again >>>>> tomorrow >>>>> and try this stuff out myself to see if I can replicate the problem. >>>>> (if that >>>>> could potentially be useful?) >>>>> >>>>> + Justin >>>>> >>>>> >>>>> >>>>> On 9 Feb 2015, at 22:56, David F. Robinson >>>>> <david.robinson@xxxxxxxxxxxxx> wrote: >>>>>> Justin, >>>>>> >>>>>> Hoping you can help point this to the right people once again. Maybe >>>>>> all of these issues are related. >>>>>> >>>>>> You can look at the email traffic below, but the summary is that I >>>>>> was working with Ben to figure out why my GFS system was 20x slower >>>>>> than my old storage system. During my tracing of this issue, I >>>>>> determined that if I create a new volume on my storage system, this >>>>>> slowness goes away. So, either it is faster because it doesn't have >>>>>> any data on this new volume (I hope this isn't the case) or the older >>>>>> partitions somehow became corrupted during the upgrades or have some >>>>>> deprecated parameters set that slow them down. >>>>>> >>>>>> Very strange and hoping you can once again help... Thanks in advance... >>>>>> >>>>>> David >>>>>> >>>>>> >>>>>> ------ Forwarded Message ------ >>>>>> From: "David F. Robinson" <david.robinson@xxxxxxxxxxxxx> >>>>>> To: "Benjamin Turner" <bennyturns@xxxxxxxxx> >>>>>> Sent: 2/9/2015 5:52:00 PM >>>>>> Subject: Re[5]: missing files >>>>>> >>>>>> Ben, >>>>>> >>>>>> I cleared the logs and rebooted the machine. Same issue. homegfs_bkp >>>>>> takes 19-minutes and test2brick (the new volume) takes 1-minute. >>>>>> >>>>>> Is it possible that some old parameters are still set for >>>>>> homegfs_bkp that are no longer in use? I tried a gluster volume reset >>>>>> for homegfs_bkp, but it didn't have any effect. >>>>>> >>>>>> I have attached the full logs. >>>>>> >>>>>> David >>>>>> >>>>>> >>>>>> ------ Original Message ------ >>>>>> From: "David F. Robinson" <david.robinson@xxxxxxxxxxxxx> >>>>>> To: "Benjamin Turner" <bennyturns@xxxxxxxxx> >>>>>> Sent: 2/9/2015 5:39:18 PM >>>>>> Subject: Re[4]: missing files >>>>>> >>>>>>> Ben, >>>>>>> >>>>>>> I have traced this out to a point where I can rule out many issues. >>>>>>> I was hoping you could help me from here. >>>>>>> I went with the "tar -xPf boost.tar" as my test case, which on my >>>>>>> old storage system took about 1-minute to extract. On my backup >>>>>>> system and my primary storage (both gluster), it takes roughly >>>>>>> 19-minutes. >>>>>>> >>>>>>> First step was to create a new storage system (striped RAID, two >>>>>>> sets of 3-drives). All was good here with a gluster extraction time >>>>>>> of 1-minute. I then went to my backup system and created another >>>>>>> partition using only one of the two bricks on that system. Still >>>>>>> 1-minute. I went to a two-brick setup and it stayed at 1-minute. >>>>>>> >>>>>>> At this point, I have recreated using the same parameters on a >>>>>>> test2brick volume that should be identical to my homegfs_bkp volume. >>>>>>> Everything is the same, including how I mounted the volume. The only >>>>>>> difference is that homegfs_bkp has 30-TB of data and the >>>>>>> test2brick is blank. I didn't think that performance would be >>>>>>> affected by putting data on the volume.
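One way to chase the "old parameters still set" theory above is to compare what glusterd actually holds on disk for the two volumes. This is only a sketch, assuming the stock /var/lib/glusterd layout; the volfile names can differ slightly between gluster versions:

# Options as gluster reports them
gluster volume info homegfs_bkp
gluster volume info test2brick

# Stored volume definitions and generated client volfiles, with the volume
# names masked so that only real differences show up
diff /var/lib/glusterd/vols/homegfs_bkp/info /var/lib/glusterd/vols/test2brick/info
diff <(sed 's/homegfs_bkp/VOL/g' /var/lib/glusterd/vols/homegfs_bkp/trusted-homegfs_bkp.tcp-fuse.vol) \
     <(sed 's/test2brick/VOL/g' /var/lib/glusterd/vols/test2brick/trusted-test2brick.tcp-fuse.vol)

If the diffs show nothing beyond names and UUIDs, leftover options can be ruled out and the remaining variable really is the pre-existing data on the bricks.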
>>>>>>> >>>>>>> Can you help? Do you have any suggestions? Do you think upgrading >>>>>>> gluster from 3.5 to 3.6.1 to 3.6.2 somehow messed up homegfs_bkp? >>>>>>> My layout is shown below. These should give identical speeds. >>>>>>> >>>>>>> [root@gfs01bkp test2brick]# gluster volume info homegfs_bkp >>>>>>> Volume Name: homegfs_bkp >>>>>>> Type: Distribute >>>>>>> Volume ID: 96de8872-d957-4205-bf5a-076e3f35b294 >>>>>>> Status: Started >>>>>>> Number of Bricks: 2 >>>>>>> Transport-type: tcp >>>>>>> Bricks: >>>>>>> Brick1: gfsib01bkp.corvidtec.com:/data/brick01bkp/homegfs_bkp >>>>>>> Brick2: gfsib01bkp.corvidtec.com:/data/brick02bkp/homegfs_bkp >>>>>>> [root@gfs01bkp test2brick]# gluster volume info test2brick >>>>>>> >>>>>>> Volume Name: test2brick >>>>>>> Type: Distribute >>>>>>> Volume ID: 123259b2-3c61-4277-a7e8-27c7ec15e550 >>>>>>> Status: Started >>>>>>> Number of Bricks: 2 >>>>>>> Transport-type: tcp >>>>>>> Bricks: >>>>>>> Brick1: gfsib01bkp.corvidtec.com:/data/brick01bkp/test2brick >>>>>>> Brick2: gfsib01bkp.corvidtec.com:/data/brick02bkp/test2brick >>>>>>> >>>>>>> >>>>>>> [root@gfs01bkp brick02bkp]# mount | grep test2brick >>>>>>> gfsib01bkp.corvidtec.com:/test2brick.tcp on /test2brick type >>>>>>> fuse.glusterfs (rw,allow_other,max_read=131072) >>>>>>> [root@gfs01bkp brick02bkp]# mount | grep homegfs_bkp >>>>>>> gfsib01bkp.corvidtec.com:/homegfs_bkp.tcp on /backup/homegfs type >>>>>>> fuse.glusterfs (rw,allow_other,max_read=131072) >>>>>>> >>>>>>> [root@gfs01bkp brick02bkp]# df -h >>>>>>> Filesystem Size Used Avail Use% Mounted on >>>>>>> /dev/mapper/vg00-lv_root 20G 1.7G 18G 9% / >>>>>>> tmpfs 16G 0 16G 0% /dev/shm >>>>>>> /dev/md126p1 1008M 110M 848M 12% /boot >>>>>>> /dev/mapper/vg00-lv_opt 5.0G 220M 4.5G 5% /opt >>>>>>> /dev/mapper/vg00-lv_tmp 5.0G 139M 4.6G 3% /tmp >>>>>>> /dev/mapper/vg00-lv_usr 20G 2.7G 17G 15% /usr >>>>>>> /dev/mapper/vg00-lv_var 40G 4.4G 34G 12% /var >>>>>>> /dev/mapper/vg01-lvol1 88T 22T 67T 25% /data/brick01bkp >>>>>>> /dev/mapper/vg02-lvol1 88T 22T 67T 25% /data/brick02bkp >>>>>>> gfsib01bkp.corvidtec.com:/homegfs_bkp.tcp 175T 43T 133T 25% >>>>>>> /backup/homegfs >>>>>>> gfsib01bkp.corvidtec.com:/test2brick.tcp 175T 43T 133T 25% /test2brick >>>>>>> >>>>>>> >>>>>>> ------ Original Message ------ >>>>>>> From: "Benjamin Turner" <bennyturns@xxxxxxxxx> >>>>>>> To: "David F. Robinson" <david.robinson@xxxxxxxxxxxxx> >>>>>>> Sent: 2/6/2015 12:52:58 PM >>>>>>> Subject: Re: Re[2]: missing files >>>>>>> >>>>>>>> Hi David. Let's start with the basics and go from there. IIRC you >>>>>>>> are using LVM with thick provisioning; let's verify the following: >>>>>>>> >>>>>>>> 1. You have everything properly aligned for your RAID stripe size, >>>>>>>> etc. I have attached the script we package with RHS that I am in >>>>>>>> the process of updating. I want to double-check you created the PV >>>>>>>> / VG / LV with the proper variables. Have a look at the create_pv, >>>>>>>> create_vg, and create_lv(old) functions. You will need to know the >>>>>>>> stripe size of your RAID and the number of stripe elements (data >>>>>>>> disks, not hot spares). Also make sure you mkfs.xfs with: >>>>>>>> >>>>>>>> echo "mkfs -t xfs -f -K -i size=$inode_size -d >>>>>>>> sw=$stripe_elements,su=$stripesize -n size=$fs_block_size >>>>>>>> /dev/$vgname/$lvname" >>>>>>>> >>>>>>>> We use 512-byte inodes because some workloads use more than the default >>>>>>>> inode size and you don't want xattrs bleeding out of the inode.
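As a rough illustration of item 1 (this is not Ben's attached RHS script): for a thick-provisioned RAID6 brick with the 128K stripe unit and 10 data disks listed under item 3 below, the manual equivalent of create_pv/create_vg/create_lv plus the mkfs line could look roughly like this. The device name, VG/LV names, mount point and the 512/8192 size values are placeholders, and the inode64,noatime mount options are common XFS-on-brick choices rather than something specified in this thread:

stripesize=128k        # RAID controller stripe unit size
stripe_elements=10     # data disks only (no parity disks, no hot spares)
dataalign=1280k        # stripesize * stripe_elements
pvcreate --dataalignment $dataalign /dev/sdb
vgcreate vg_brick01 /dev/sdb
lvcreate -l 100%FREE -n lv_brick01 vg_brick01
mkfs -t xfs -f -K -i size=512 -d sw=$stripe_elements,su=$stripesize -n size=8192 /dev/vg_brick01/lv_brick01
mount -o inode64,noatime /dev/vg_brick01/lv_brick01 /data/brick01bkp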
>>>>>>>> >>>>>>>> 2. Are you running RHEL or CentOS? If so I would recommend >>>>>>>> tuned_profile=rhs-high-throughput. If you don't have that tuned >>>>>>>> profile I'll get you everything it sets. >>>>>>>> >>>>>>>> 3. For small files we recommend the following: >>>>>>>> >>>>>>>> # RAID related variables. >>>>>>>> # stripesize - RAID controller stripe unit size >>>>>>>> # stripe_elements - the number of data disks >>>>>>>> # The --dataalignment option is used while creating the physical >>>>>>>> volume to >>>>>>>> # align I/O at the LVM layer >>>>>>>> # dataalign - >>>>>>>> # RAID6 is recommended when the workload has predominantly larger >>>>>>>> # files, i.e. not in kilobytes. >>>>>>>> # For RAID6 with 12 disks and 128K stripe element size. >>>>>>>> stripesize=128k >>>>>>>> stripe_elements=10 >>>>>>>> dataalign=1280k >>>>>>>> >>>>>>>> # RAID10 is recommended when the workload has predominantly >>>>>>>> smaller files >>>>>>>> # i.e. in kilobytes. >>>>>>>> # For RAID10 with 12 disks and 256K stripe element size, uncomment >>>>>>>> the >>>>>>>> # lines below. >>>>>>>> # stripesize=256k >>>>>>>> # stripe_elements=6 >>>>>>>> # dataalign=1536k >>>>>>>> >>>>>>>> 4. Jumbo frames everywhere! Check out the effect of jumbo frames, >>>>>>>> make sure they are set up properly on your switch and add >>>>>>>> MTU=9000 to your ifcfg files (unless you have it already): >>>>>>>> >>>>>>>> >>>>>>>> https://rhsummit.files.wordpress.com/2013/07/england_th_0450_rhs_perf_practices-4_neependra.pdf >>>>>>>> (see the jumbo frames section here, the whole thing is a good read) >>>>>>>> >>>>>>>> https://rhsummit.files.wordpress.com/2014/04/bengland_h_1100_rhs_performance.pdf >>>>>>>> (this is updated for 2014) >>>>>>>> >>>>>>>> 5. There is a smallfile enhancement that just landed in master >>>>>>>> that is showing me a 60% improvement in writes. This is called >>>>>>>> multi-threaded epoll and it is looking VERY promising WRT smallfile >>>>>>>> performance. Here is a summary: >>>>>>>> >>>>>>>> Hi all. I see a lot of discussion on $subject and I wanted to take >>>>>>>> a minute to talk about it and what we can do to test / observe the >>>>>>>> effects of it. Let's start with a bit of background: >>>>>>>> >>>>>>>> **Background** >>>>>>>> >>>>>>>> -Currently epoll is single threaded on both clients and servers. >>>>>>>> *This leads to a "hot thread" which consumes 100% of a CPU core. >>>>>>>> *This can be observed by running BenE's smallfile benchmark to >>>>>>>> create files, running top (on both clients and servers), and >>>>>>>> pressing H to show threads. >>>>>>>> *You will be able to see a single glusterfs thread eating 100% >>>>>>>> of the CPU: >>>>>>>> >>>>>>>> 2871 root 20 0 746m 24m 3004 S 100.0 0.1 14:35.89 glusterfsd >>>>>>>> 4522 root 20 0 747m 24m 3004 S 5.3 0.1 0:02.25 glusterfsd >>>>>>>> 4507 root 20 0 747m 24m 3004 S 5.0 0.1 0:05.91 glusterfsd >>>>>>>> 21200 root 20 0 747m 24m 3004 S 4.6 0.1 0:21.16 glusterfsd >>>>>>>> >>>>>>>> -Single threaded epoll is a bottleneck for high-IOP / low-metadata >>>>>>>> workloads (think smallfile). With single threaded epoll we are CPU >>>>>>>> bound by the single thread pegging out a CPU. >>>>>>>> >>>>>>>> So the proposed solution to this problem is to make epoll >>>>>>>> multi-threaded on both servers and clients.
Here is a link to the >>>>>>>> upstream proposal: >>>>>>>> >>>>>>>> >>>>>>>> http://www.gluster.org/community/documentation/index.php/Features/Feature_Smallfile_Perf#multi-thread-epoll >>>>>>>> >>>>>>>> >>>>>>>> Status: [ http://review.gluster.org/#/c/3842/ based on Anand >>>>>>>> Avati's patch ] >>>>>>>> >>>>>>>> Why: remove single-thread-per-brick barrier to higher CPU >>>>>>>> utilization by servers >>>>>>>> >>>>>>>> Use case: multi-client and multi-thread applications >>>>>>>> >>>>>>>> Improvement: measured 40% with 2 epoll threads and 100% with 4 >>>>>>>> epoll threads for small file creates to an SSD >>>>>>>> >>>>>>>> Disadvantage: conflicts with support for SSL sockets, may require >>>>>>>> significant code change to support both. >>>>>>>> >>>>>>>> Note: this enhancement also helps high-IOPS applications such as >>>>>>>> databases and virtualization which are not metadata-intensive. This >>>>>>>> has been measured already using a Fusion I/O SSD performing random >>>>>>>> reads and writes -- it was necessary to define multiple bricks per >>>>>>>> SSD device to get Gluster to the same order of magnitude IOPS as a >>>>>>>> local filesystem. But this workaround is problematic for users, >>>>>>>> because storage space is not properly measured when there are >>>>>>>> multiple bricks on the same filesystem. >>>>>>>> >>>>>>>> Multi threaded epoll is part of a larger page that talks about >>>>>>>> smallfile performance enhancements, proposed and happening. >>>>>>>> >>>>>>>> Goal: if successful, throughput bottleneck should be either the >>>>>>>> network or the brick filesystem! >>>>>>>> What it doesn't do: multi-thread-epoll does not solve the >>>>>>>> excessive-round-trip protocol problems that Gluster has. >>>>>>>> What it should do: allow Gluster to exploit the mostly untapped >>>>>>>> CPU resources on the Gluster servers and clients. >>>>>>>> How it does it: allow multiple threads to read protocol messages >>>>>>>> and process them at the same time. >>>>>>>> How to observe: multi-thread-epoll should be configurable (how to >>>>>>>> configure? gluster command?), with thread count 1 it should be same >>>>>>>> as RHS 3.0, with thread count 2-4 it should show significantly more >>>>>>>> CPU utilization (threads visible with "top -H"), resulting in >>>>>>>> higher throughput. 
>>>>>>>> >>>>>>>> **How to observe** >>>>>>>> >>>>>>>> Here are the commands needed to set up an environment to test in on >>>>>>>> RHS 3.0.3: >>>>>>>> rpm -e glusterfs-api glusterfs glusterfs-libs glusterfs-fuse >>>>>>>> glusterfs-geo-replication glusterfs-rdma glusterfs-server >>>>>>>> glusterfs-cli gluster-nagios-common samba-glusterfs vdsm-gluster >>>>>>>> --nodeps >>>>>>>> rhn_register >>>>>>>> yum groupinstall "Development tools" >>>>>>>> git clone https://github.com/gluster/glusterfs.git >>>>>>>> git branch test >>>>>>>> git checkout test >>>>>>>> git fetch http://review.gluster.org/glusterfs >>>>>>>> refs/changes/42/3842/17 && git cherry-pick FETCH_HEAD >>>>>>>> git fetch http://review.gluster.org/glusterfs >>>>>>>> refs/changes/88/9488/2 && git cherry-pick FETCH_HEAD >>>>>>>> yum install openssl openssl-devel >>>>>>>> wget >>>>>>>> ftp://fr2.rpmfind.net/linux/epel/6/x86_64/cmockery2-1.3.8-2.el6.x86_64.rpm >>>>>>>> >>>>>>>> wget >>>>>>>> ftp://fr2.rpmfind.net/linux/epel/6/x86_64/cmockery2-devel-1.3.8-2.el6.x86_64.rpm >>>>>>>> >>>>>>>> yum install cmockery2-1.3.8-2.el6.x86_64.rpm >>>>>>>> cmockery2-devel-1.3.8-2.el6.x86_64.rpm libxml2-devel >>>>>>>> ./autogen.sh >>>>>>>> ./configure >>>>>>>> make >>>>>>>> make install >>>>>>>> >>>>>>>> Verify you are using the upstream build with: >>>>>>>> >>>>>>>> # gluster --version >>>>>>>> >>>>>>>> To enable multi-threaded epoll, run the following commands: >>>>>>>> >>>>>>>> From the patch: >>>>>>>> { .key = "client.event-threads", >>>>>>>> .voltype = "protocol/client", >>>>>>>> .op_version = GD_OP_VERSION_3_7_0, >>>>>>>> }, >>>>>>>> { .key = "server.event-threads", >>>>>>>> .voltype = "protocol/server", >>>>>>>> .op_version = GD_OP_VERSION_3_7_0, >>>>>>>> }, >>>>>>>> >>>>>>>> # gluster v set <volname> server.event-threads 4 >>>>>>>> # gluster v set <volname> client.event-threads 4 >>>>>>>> >>>>>>>> Also grab smallfile: >>>>>>>> >>>>>>>> https://github.com/bengland2/smallfile >>>>>>>> >>>>>>>> After git cloning smallfile run: >>>>>>>> >>>>>>>> python /small-files/smallfile/smallfile_cli.py --operation create >>>>>>>> --threads 8 --file-size 64 --files 10000 --top /gluster-mount >>>>>>>> --pause 1000 --host-set "client1 client2" >>>>>>>> >>>>>>>> Again we will be looking at top + show threads (press H). With 4 >>>>>>>> threads on both clients and servers you should see something >>>>>>>> similar to (this isn't exact, I copied and pasted): >>>>>>>> >>>>>>>> 2871 root 20 0 746m 24m 3004 S 35.0 0.1 14:35.89 glusterfsd >>>>>>>> 2872 root 20 0 746m 24m 3004 S 51.0 0.1 14:35.89 glusterfsd >>>>>>>> 2873 root 20 0 746m 24m 3004 S 43.0 0.1 14:35.89 glusterfsd >>>>>>>> 2874 root 20 0 746m 24m 3004 S 65.0 0.1 14:35.89 glusterfsd >>>>>>>> 4522 root 20 0 747m 24m 3004 S 5.3 0.1 0:02.25 glusterfsd >>>>>>>> 4507 root 20 0 747m 24m 3004 S 5.0 0.1 0:05.91 glusterfsd >>>>>>>> 21200 root 20 0 747m 24m 3004 S 4.6 0.1 0:21.16 glusterfsd >>>>>>>> >>>>>>>> If you have a test env I would be interested to see how multi-threaded >>>>>>>> epoll performs, but I am 100% sure it's not ready for >>>>>>>> production yet. RH will be supporting it with our 3.0.4 (the next >>>>>>>> one) release unless we find show-stopping bugs. My testing looks >>>>>>>> very promising though. >>>>>>>> >>>>>>>> Smallfile performance enhancements are one of the key focuses for >>>>>>>> our 3.1 release this summer; we are working very hard to improve >>>>>>>> this as this is the use case for the majority of people.
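Before benchmarking, it may also be worth a quick sanity check that the patched build is the one actually running and that the options took effect; the volume name below is just an example, and pidstat comes from the sysstat package:

# Confirm the installed build and that the options were accepted
glusterfs --version
gluster volume info test2brick | grep -i event-threads

# Watch per-thread CPU on a brick while smallfile runs (an alternative to top -H)
pidstat -t -p $(pgrep -f 'glusterfsd.*test2brick' | head -1) 5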
>>>>>>>> >>>>>>>> On Fri, Feb 6, 2015 at 11:59 AM, David F. Robinson >>>>>>>> <david.robinson@xxxxxxxxxxxxx> wrote: >>>>>>>> Ben, >>>>>>>> >>>>>>>> I was hoping you might be able to help with two performance >>>>>>>> questions. I was doing some testing of my rsync where I am backing >>>>>>>> up my primary gluster system (distributed + replicated) to my >>>>>>>> backup gluster system (distributed). I tried three tests where I >>>>>>>> rsynced from one of my primary systems (gfsib02b) to my backup >>>>>>>> machine. The test directory contains roughly 5500 files, most of >>>>>>>> which are small. The script I ran is shown below, which repeats the >>>>>>>> tests 3x for each section to check variability in timing. >>>>>>>> >>>>>>>> 1) Writing to the local disk is drastically faster than writing to >>>>>>>> gluster. So, my writes to the backup gluster system are what is >>>>>>>> slowing me down, which makes sense. >>>>>>>> 2) When I write to the backup gluster system (/backup/homegfs), >>>>>>>> the timing goes from 35 seconds to 1 minute 40 seconds. The question here >>>>>>>> is whether you could recommend any settings for this volume that >>>>>>>> would improve performance for small file writes? I have included >>>>>>>> the output of 'gluster volume info' below. >>>>>>>> 3) When I did the same tests on the Source_bkp volume, it is >>>>>>>> almost 3x as slow as the homegfs_bkp volume. However, these are >>>>>>>> just different volumes on the same storage system. The volume >>>>>>>> parameters are identical (see below). The performance of these two >>>>>>>> should be identical. Any idea why they wouldn't be? And any >>>>>>>> suggestions for how to fix this? The only thing that I see >>>>>>>> different between the two is the order of the "Options >>>>>>>> reconfigured" section. I assume the order of options doesn't matter. >>>>>>>> >>>>>>>> Backup to local hard disk (no gluster writes) >>>>>>>> time /usr/local/bin/rsync -av --numeric-ids --delete >>>>>>>> --block-size=131072 -e "ssh -T -c arcfour -o Compression=no -x" >>>>>>>> gfsib02b:/homegfs/test /temp1 >>>>>>>> time /usr/local/bin/rsync -av --numeric-ids --delete >>>>>>>> --block-size=131072 -e "ssh -T -c arcfour -o Compression=no -x" >>>>>>>> gfsib02b:/homegfs/test /temp2 >>>>>>>> time /usr/local/bin/rsync -av --numeric-ids --delete >>>>>>>> --block-size=131072 -e "ssh -T -c arcfour -o Compression=no -x" >>>>>>>> gfsib02b:/homegfs/test /temp3 >>>>>>>> >>>>>>>> real 0m35.579s >>>>>>>> user 0m31.290s >>>>>>>> sys 0m12.282s >>>>>>>> >>>>>>>> real 0m38.035s >>>>>>>> user 0m31.622s >>>>>>>> sys 0m10.907s >>>>>>>> real 0m38.313s >>>>>>>> user 0m31.458s >>>>>>>> sys 0m10.891s >>>>>>>> Backup to gluster backup system on volume homegfs_bkp >>>>>>>> time /usr/local/bin/rsync -av --numeric-ids --delete >>>>>>>> --block-size=131072 -e "ssh -T -c arcfour -o Compression=no -x" >>>>>>>> gfsib02b:/homegfs/test /backup/homegfs/temp1 >>>>>>>> time /usr/local/bin/rsync -av --numeric-ids --delete >>>>>>>> --block-size=131072 -e "ssh -T -c arcfour -o Compression=no -x" >>>>>>>> gfsib02b:/homegfs/test /backup/homegfs/temp2 >>>>>>>> time /usr/local/bin/rsync -av --numeric-ids --delete >>>>>>>> --block-size=131072 -e "ssh -T -c arcfour -o Compression=no -x" >>>>>>>> gfsib02b:/homegfs/test /backup/homegfs/temp3 >>>>>>>> >>>>>>>> real 1m42.026s >>>>>>>> user 0m32.604s >>>>>>>> sys 0m9.967s >>>>>>>> >>>>>>>> real 1m45.480s >>>>>>>> user 0m32.577s >>>>>>>> sys 0m11.994s >>>>>>>> >>>>>>>> real 1m40.436s >>>>>>>> user 0m32.521s >>>>>>>> sys 0m11.240s >>>>>>>> >>>>>>>> Backup to gluster backup system on volume Source_bkp >>>>>>>> time
/usr/local/bin/rsync -av --numeric-ids --delete >>>>>>>> --block-size=131072 -e "ssh -T -c arcfour -o Compression=no -x" >>>>>>>> gfsib02b:/homegfs/test /backup/Source/temp1 >>>>>>>> time /usr/local/bin/rsync -av --numeric-ids --delete >>>>>>>> --block-size=131072 -e "ssh -T -c arcfour -o Compression=no -x" >>>>>>>> gfsib02b:/homegfs/test /backup/Source/temp2 >>>>>>>> time /usr/local/bin/rsync -av --numeric-ids --delete >>>>>>>> --block-size=131072 -e "ssh -T -c arcfour -o Compression=no -x" >>>>>>>> gfsib02b:/homegfs/test /backup/Source/temp3 >>>>>>>> >>>>>>>> real 3m30.491s >>>>>>>> user 0m32.676s >>>>>>>> sys 0m10.776s >>>>>>>> >>>>>>>> real 3m26.076s >>>>>>>> user 0m32.588s >>>>>>>> sys 0m11.048s >>>>>>>> real 3m7.460s >>>>>>>> user 0m32.763s >>>>>>>> sys 0m11.687s >>>>>>>> >>>>>>>> >>>>>>>> Volume Name: Source_bkp >>>>>>>> Type: Distribute >>>>>>>> Volume ID: 1d4c210d-a731-4d39-a0c5-ea0546592c1d >>>>>>>> Status: Started >>>>>>>> Number of Bricks: 2 >>>>>>>> Transport-type: tcp >>>>>>>> Bricks: >>>>>>>> Brick1: gfsib01bkp.corvidtec.com:/data/brick01bkp/Source_bkp >>>>>>>> Brick2: gfsib01bkp.corvidtec.com:/data/brick02bkp/Source_bkp >>>>>>>> Options Reconfigured: >>>>>>>> performance.cache-size: 128MB >>>>>>>> performance.io-thread-count: 32 >>>>>>>> server.allow-insecure: on >>>>>>>> network.ping-timeout: 10 >>>>>>>> storage.owner-gid: 100 >>>>>>>> performance.write-behind-window-size: 128MB >>>>>>>> server.manage-gids: on >>>>>>>> changelog.rollover-time: 15 >>>>>>>> changelog.fsync-interval: 3 >>>>>>>> >>>>>>>> Volume Name: homegfs_bkp >>>>>>>> Type: Distribute >>>>>>>> Volume ID: 96de8872-d957-4205-bf5a-076e3f35b294 >>>>>>>> Status: Started >>>>>>>> Number of Bricks: 2 >>>>>>>> Transport-type: tcp >>>>>>>> Bricks: >>>>>>>> Brick1: gfsib01bkp.corvidtec.com:/data/brick01bkp/homegfs_bkp >>>>>>>> Brick2: gfsib01bkp.corvidtec.com:/data/brick02bkp/homegfs_bkp >>>>>>>> Options Reconfigured: >>>>>>>> storage.owner-gid: 100 >>>>>>>> performance.io-thread-count: 32 >>>>>>>> server.allow-insecure: on >>>>>>>> network.ping-timeout: 10 >>>>>>>> performance.cache-size: 128MB >>>>>>>> performance.write-behind-window-size: 128MB >>>>>>>> server.manage-gids: on >>>>>>>> changelog.rollover-time: 15 >>>>>>>> changelog.fsync-interval: 3 >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> ------ Original Message ------ >>>>>>>> From: "Benjamin Turner" <bennyturns@xxxxxxxxx> >>>>>>>> To: "David F. Robinson" <david.robinson@xxxxxxxxxxxxx> >>>>>>>> Cc: "Gluster Devel" <gluster-devel@xxxxxxxxxxx>; >>>>>>>> "gluster-users@xxxxxxxxxxx" <gluster-users@xxxxxxxxxxx> >>>>>>>> Sent: 2/3/2015 7:12:34 PM >>>>>>>> Subject: Re: missing files >>>>>>>> >>>>>>>>> It sounds to me like the files were only copied to one replica, >>>>>>>>> weren't there for the initial ls, which triggered a >>>>>>>>> self-heal, and were there for the last ls because they were >>>>>>>>> healed. Is there any chance that one of the replicas was down >>>>>>>>> during the rsync? It could be that you lost a brick during the copy or >>>>>>>>> something like that. To confirm I would look for disconnects in >>>>>>>>> the brick logs as well as checking glustershd.log to verify the >>>>>>>>> missing files were actually healed. >>>>>>>>> >>>>>>>>> -b
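A short sketch of the checks Ben suggests above, using the default log locations and assuming the primary volume is named homegfs:

# Look for brick/client disconnects around the time of the rsync
grep -i disconnect /var/log/glusterfs/bricks/*.log
grep -i disconnect /var/log/glusterfs/glustershd.log

# Check whether the self-heal daemon still has entries pending and whether
# anything is in split-brain
gluster volume heal homegfs info
gluster volume heal homegfs info split-brain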
>>>>>>>>> >>>>>>>>> On Tue, Feb 3, 2015 at 5:37 PM, David F. Robinson >>>>>>>>> <david.robinson@xxxxxxxxxxxxx> wrote: >>>>>>>>> I rsync'd 20-TB over to my gluster system and noticed that I had >>>>>>>>> some directories missing even though the rsync completed normally. >>>>>>>>> The rsync logs showed that the missing files were transferred. >>>>>>>>> >>>>>>>>> I went to the bricks and did an 'ls -al >>>>>>>>> /data/brick*/homegfs/dir/*' and the files were on the bricks. After I >>>>>>>>> did this 'ls', the files then showed up on the FUSE mounts. >>>>>>>>> >>>>>>>>> 1) Why are the files hidden on the FUSE mount? >>>>>>>>> 2) Why does the ls make them show up on the FUSE mount? >>>>>>>>> 3) How can I prevent this from happening again? >>>>>>>>> >>>>>>>>> Note, I also mounted the gluster volume using NFS and saw the >>>>>>>>> same behavior. The files/directories were not shown until I did >>>>>>>>> the "ls" on the bricks. >>>>>>>>> >>>>>>>>> David >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> =============================== >>>>>>>>> David F. Robinson, Ph.D. >>>>>>>>> President - Corvid Technologies >>>>>>>>> 704.799.6944 x101 [office] >>>>>>>>> 704.252.1310 [cell] >>>>>>>>> 704.799.7974 [fax] >>>>>>>>> David.Robinson@xxxxxxxxxxxxx >>>>>>>>> http://www.corvidtechnologies.com/ >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> _______________________________________________ >>>>>>>>> Gluster-devel mailing list >>>>>>>>> Gluster-devel@xxxxxxxxxxx >>>>>>>>> http://www.gluster.org/mailman/listinfo/gluster-devel >>>>>> <glusterfs.tgz> >>>>> >>>>> -- >>>>> GlusterFS - http://www.gluster.org >>>>> >>>>> An open source, distributed file system scaling to several >>>>> petabytes, and handling thousands of clients. >>>>> >>>>> My personal twitter: twitter.com/realjustinclift >>>> >>>> >>>> _______________________________________________ >>>> Gluster-devel mailing list >>>> Gluster-devel@xxxxxxxxxxx >>>> http://www.gluster.org/mailman/listinfo/gluster-devel >>> _______________________________________________ >>> Gluster-devel mailing list >>> Gluster-devel@xxxxxxxxxxx >>> http://www.gluster.org/mailman/listinfo/gluster-devel >> >> _______________________________________________ >> Gluster-devel mailing list >> Gluster-devel@xxxxxxxxxxx >> http://www.gluster.org/mailman/listinfo/gluster-devel > _______________________________________________ Gluster-devel mailing list Gluster-devel@xxxxxxxxxxx http://www.gluster.org/mailman/listinfo/gluster-devel