Hi David. Let's start with the basics and go from there. IIRC you
are using LVM with thick provisioning, so let's verify the following:
1. You have everything properly aligned for your RAID stripe size
and related settings. I have attached the script we package with RHS
that I am in the process of updating. I want to double check that you
created the PV / VG / LV with the proper variables. Have a look at the
create_pv, create_vg, and create_lv(old) functions. You will need to
know the stripe size of your RAID and the number of stripe elements
(data disks, not hot spares). Also make sure you run mkfs.xfs with:
echo "mkfs -t xfs -f -K -i size=$inode_size -d
sw=$stripe_elements,su=$stripesize -n size=$fs_block_size
/dev/$vgname/$lvname"
We use 512-byte inodes because some workloads use more xattr space
than the default inode size can hold, and you don't want xattrs
spilling out of the inodes.
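As a rough example, filling that in for the RAID6 layout from item 3
below (128K stripe element size, 10 data disks), a 512-byte inode
size, and an 8192-byte directory block size (that last value is an
assumption, use whatever $fs_block_size the script sets), with
rhs_vg/rhs_lv as placeholder VG/LV names:
mkfs -t xfs -f -K -i size=512 -d sw=10,su=128k -n size=8192 /dev/rhs_vg/rhs_lv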
2. Are you running RHEL or CentOS? If so, I would recommend
tuned_profile=rhs-high-throughput. If you don't have that tuned
profile I'll get you everything it sets.
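Assuming the profile is installed (it ships with RHS; stock
RHEL/CentOS won't have it, which is why I'd send you the settings),
applying and checking it is just:
tuned-adm profile rhs-high-throughput
tuned-adm active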
3. For small files we recommend the following:
# RAID related variables.
# stripesize - RAID controller stripe unit size
# stripe_elements - the number of data disks
# The --dataalignment option is used while creating the physical volume to
# align I/O at the LVM layer.
# dataalign - stripesize * stripe_elements
# RAID6 is recommended when the workload has predominantly larger
# files, i.e. not in kilobytes.
# For RAID6 with 12 disks and 128K stripe element size:
stripesize=128k
stripe_elements=10
dataalign=1280k
# RAID10 is recommended when the workload has predominantly smaller
# files, i.e. in kilobytes.
# For RAID10 with 12 disks and 256K stripe element size, uncomment the
# lines below.
# stripesize=256k
# stripe_elements=6
# dataalign=1536k
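For reference, here is roughly how those variables feed into the LVM
commands; the device, VG, and LV names below are placeholders and the
exact flags live in the attached script's create_pv / create_vg /
create_lv functions, so treat this as a sketch:
pvcreate --dataalignment $dataalign /dev/sdb
vgcreate rhs_vg /dev/sdb
lvcreate -l 100%FREE -n rhs_lv rhs_vg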
4. Jumbo frames everywhere! Check out the effect of jumbo frames in
the links below, make sure they are set up properly on your switch,
and add MTU=9000 to your ifcfg files (unless you have it already):
https://rhsummit.files.wordpress.com/2013/07/england_th_0450_rhs_perf_practices-4_neependra.pdf
(see the jumbo frames section here, the whole thing is a good read)
https://rhsummit.files.wordpress.com/2014/04/bengland_h_1100_rhs_performance.pdf
(this is updated for 2014)
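To make that concrete: add MTU=9000 to, e.g.,
/etc/sysconfig/network-scripts/ifcfg-em1 (em1 is just a placeholder
interface name), restart the interface, then verify end to end with
something like:
ip link show em1 | grep mtu
ping -M do -s 8972 <address of another gluster node>
The 8972-byte payload is 9000 minus the IP/ICMP headers, so the ping
only succeeds if jumbo frames work across the whole path instead of
fragmenting.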
5. There is a smallfile enhancement that just landed in master
that is showing me a 60% improvement in writes. This is called
multi-threaded epoll and it is looking VERY promising WRT smallfile
performance. Here is a summary:
Hi all. I see a lot of discussion on $subject and I wanted to take
a minute to talk about it and what we can do to test / observe the
effects of it. Let's start with a bit of background:
**Background**
-Currently epoll is single threaded on both clients and servers.
*This leads to a "hot thread" which consumes 100% of a CPU core.
*This can be observed by running BenE's smallfile benchmark to
create files, running top (on both clients and servers), and
pressing H to show threads.
*You will be able to see a single glusterfs thread eating 100%
of the CPU:
2871 root 20 0 746m 24m 3004 S 100.0 0.1 14:35.89 glusterfsd
4522 root 20 0 747m 24m 3004 S 5.3 0.1 0:02.25 glusterfsd
4507 root 20 0 747m 24m 3004 S 5.0 0.1 0:05.91 glusterfsd
21200 root 20 0 747m 24m 3004 S 4.6 0.1 0:21.16 glusterfsd
-Single threaded epoll is a bottleneck for high-IOP / low-metadata
workloads (think smallfile). With single threaded epoll we are CPU
bound by the single thread pegging out a CPU core.
So the proposed solution to this problem is to make epoll multi
threaded on both servers and clients. Here is a link to the
upstream proposal:
http://www.gluster.org/community/documentation/index.php/Features/Feature_Smallfile_Perf#multi-thread-epoll
Status: [ http://review.gluster.org/#/c/3842/ based on Anand
Avati's patch ]
Why: remove single-thread-per-brick barrier to higher CPU
utilization by servers
Use case: multi-client and multi-thread applications
Improvement: measured 40% with 2 epoll threads and 100% with 4
epoll threads for small file creates to an SSD
Disadvantage: conflicts with support for SSL sockets, may require
significant code change to support both.
Note: this enhancement also helps high-IOPS applications such as
databases and virtualization which are not metadata-intensive. This
has been measured already using a Fusion I/O SSD performing random
reads and writes -- it was necessary to define multiple bricks per
SSD device to get Gluster to the same order of magnitude IOPS as a
local filesystem. But this workaround is problematic for users,
because storage space is not properly measured when there are
multiple bricks on the same filesystem.
Multi-threaded epoll is part of a larger page that covers smallfile
performance enhancements, both proposed and in progress.
Goal: if successful, throughput bottleneck should be either the
network or the brick filesystem!
What it doesn't do: multi-thread-epoll does not solve the
excessive-round-trip protocol problems that Gluster has.
What it should do: allow Gluster to exploit the mostly untapped
CPU resources on the Gluster servers and clients.
How it does it: allow multiple threads to read protocol messages
and process them at the same time.
How to observe: multi-thread-epoll should be configurable (how to
configure? gluster command?); with a thread count of 1 it should
behave the same as RHS 3.0, and with a thread count of 2-4 it should
show significantly more CPU utilization (threads visible with
"top -H"), resulting in higher throughput.
**How to observe**
Here are the commands needed to set up a test environment on
RHS 3.0.3:
rpm -e glusterfs-api glusterfs glusterfs-libs glusterfs-fuse glusterfs-geo-replication glusterfs-rdma glusterfs-server glusterfs-cli gluster-nagios-common samba-glusterfs vdsm-gluster --nodeps
rhn_register
yum groupinstall "Development tools"
git clone https://github.com/gluster/glusterfs.git
cd glusterfs
git branch test
git checkout test
git fetch http://review.gluster.org/glusterfs refs/changes/42/3842/17 && git cherry-pick FETCH_HEAD
git fetch http://review.gluster.org/glusterfs refs/changes/88/9488/2 && git cherry-pick FETCH_HEAD
yum install openssl openssl-devel
wget ftp://fr2.rpmfind.net/linux/epel/6/x86_64/cmockery2-1.3.8-2.el6.x86_64.rpm
wget ftp://fr2.rpmfind.net/linux/epel/6/x86_64/cmockery2-devel-1.3.8-2.el6.x86_64.rpm
yum install cmockery2-1.3.8-2.el6.x86_64.rpm cmockery2-devel-1.3.8-2.el6.x86_64.rpm libxml2-devel
./autogen.sh
./configure
make
make install
Verify you are using the upstream build with:
# gluster --version
To enable multi-threaded epoll, run the commands below. From the
patch, the relevant volume options are:
{ .key = "client.event-threads",
  .voltype = "protocol/client",
  .op_version = GD_OP_VERSION_3_7_0,
},
{ .key = "server.event-threads",
  .voltype = "protocol/server",
  .op_version = GD_OP_VERSION_3_7_0,
},
# gluster v set <volname> server.event-threads 4
# gluster v set <volname> client.event-threads 4
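After setting those, a quick sanity check (plain gluster usage,
nothing patch-specific) is:
# gluster v info <volname>
Both event-threads options should show up under "Options Reconfigured".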
Also grab smallfile:
https://github.com/bengland2/smallfile
After git cloning smallfile, run:
python /small-files/smallfile/smallfile_cli.py --operation create --threads 8 --file-size 64 --files 10000 --top /gluster-mount --pause 1000 --host-set "client1 client2"
Again we will be looking at top + show threads (press H). With 4
threads on both clients and servers you should see something
similar to this (this isn't exact, I copied and pasted):
2871 root 20 0 746m 24m 3004 S 35.0 0.1 14:35.89 glusterfsd
2872 root 20 0 746m 24m 3004 S 51.0 0.1 14:35.89 glusterfsd
2873 root 20 0 746m 24m 3004 S 43.0 0.1 14:35.89 glusterfsd
2874 root 20 0 746m 24m 3004 S 65.0 0.1 14:35.89 glusterfsd
4522 root 20 0 747m 24m 3004 S 5.3 0.1 0:02.25 glusterfsd
4507 root 20 0 747m 24m 3004 S 5.0 0.1 0:05.91 glusterfsd
21200 root 20 0 747m 24m 3004 S 4.6 0.1 0:21.16 glusterfsd
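If you want to capture that view non-interactively, something along
these lines works (pgrep just scopes top to the brick processes):
top -b -H -n 1 -p $(pgrep -d, glusterfsd)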
If you have a test env I would be interested to see how
multi-threaded epoll performs, but I am 100% sure it's not ready for
production yet. RH will be supporting it with our 3.0.4 (the next
one) release unless we find show-stopping bugs. My testing looks
very promising though.
Smallfile performance enhancements are one of the key focuses for
our 3.1 release this summer; we are working very hard to improve
this, as it is the use case for the majority of people.
On Fri, Feb 6, 2015 at 11:59 AM, David F. Robinson
<david.robinson@xxxxxxxxxxxxx> wrote:
Ben,
I was hoping you might be able to help with two performance
questions. I was doing some testing of my rsync where I am backing
up my primary gluster system (distributed + replicated) to my
backup gluster system (distributed). I tried three tests where I
rsynced from one of my primary systems (gfsib02b) to my backup
machine. The test directory contains roughly 5500 files, most of
which are small. The script I ran is shown below which repeats the
tests 3x for each section to check variability in timing.
1) Writing to the local disk is drastically faster than writing to
gluster. So, my writes to the backup gluster system are what is
slowing me down, which makes sense.
2) When I write to the backup gluster system (/backup/homegfs),
the timing goes from 35 seconds to 1 minute 40 seconds. The question
here is whether you could recommend any settings for this volume that
would improve performance for small-file writes? I have included
the output of 'gluster volume info' below.
3) When I did the same tests on the Source_bkp volume, it is
almost 3x as slow as the homegfs_bkp volume. However, these are
just different volumes on the same storage system. The volume
parameters are identical (see below). The performance of these two
should be identical. Any idea why they wouldn't be? And any
suggestions for how to fix this? The only thing that I see
different between the two is the order of the "Options
reconfigured" section. I assume order of options doesn't matter.
Backup to local hard disk (no gluster writes)
time /usr/local/bin/rsync -av --numeric-ids --delete
--block-size=131072 -e "ssh -T -c arcfour -o Compression=no -x"
gfsib02b:/homegfs/test /temp1
time /usr/local/bin/rsync -av --numeric-ids --delete
--block-size=131072 -e "ssh -T -c arcfour -o Compression=no -x"
gfsib02b:/homegfs/test /temp2
time /usr/local/bin/rsync -av --numeric-ids --delete
--block-size=131072 -e "ssh -T -c arcfour -o Compression=no -x"
gfsib02b:/homegfs/test /temp3
real 0m35.579s
user 0m31.290s
sys 0m12.282s
real 0m38.035s
user 0m31.622s
sys 0m10.907s
real 0m38.313s
user 0m31.458s
sys 0m10.891s
Backup to gluster backup system on volume homegfs_bkp
time /usr/local/bin/rsync -av --numeric-ids --delete
--block-size=131072 -e "ssh -T -c arcfour -o Compression=no -x"
gfsib02b:/homegfs/test /backup/homegfs/temp1
time /usr/local/bin/rsync -av --numeric-ids --delete
--block-size=131072 -e "ssh -T -c arcfour -o Compression=no -x"
gfsib02b:/homegfs/test /backup/homegfs/temp2
time /usr/local/bin/rsync -av --numeric-ids --delete
--block-size=131072 -e "ssh -T -c arcfour -o Compression=no -x"
gfsib02b:/homegfs/test /backup/homegfs/temp3
real 1m42.026s
user 0m32.604s
sys 0m9.967s
real 1m45.480s
user 0m32.577s
sys 0m11.994s
real 1m40.436s
user 0m32.521s
sys 0m11.240s
Backup to gluster backup system on volume Source_bkp
time /usr/local/bin/rsync -av --numeric-ids --delete
--block-size=131072 -e "ssh -T -c arcfour -o Compression=no -x"
gfsib02b:/homegfs/test /backup/Source/temp1
time /usr/local/bin/rsync -av --numeric-ids --delete
--block-size=131072 -e "ssh -T -c arcfour -o Compression=no -x"
gfsib02b:/homegfs/test /backup/Source/temp2
time /usr/local/bin/rsync -av --numeric-ids --delete
--block-size=131072 -e "ssh -T -c arcfour -o Compression=no -x"
gfsib02b:/homegfs/test /backup/Source/temp3
real 3m30.491s
user 0m32.676s
sys 0m10.776s
real 3m26.076s
user 0m32.588s
sys 0m11.048s
real 3m7.460s
user 0m32.763s
sys 0m11.687s
Volume Name: Source_bkp
Type: Distribute
Volume ID: 1d4c210d-a731-4d39-a0c5-ea0546592c1d
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: gfsib01bkp.corvidtec.com:/data/brick01bkp/Source_bkp
Brick2: gfsib01bkp.corvidtec.com:/data/brick02bkp/Source_bkp
Options Reconfigured:
performance.cache-size: 128MB
performance.io-thread-count: 32
server.allow-insecure: on
network.ping-timeout: 10
storage.owner-gid: 100
performance.write-behind-window-size: 128MB
server.manage-gids: on
changelog.rollover-time: 15
changelog.fsync-interval: 3
Volume Name: homegfs_bkp
Type: Distribute
Volume ID: 96de8872-d957-4205-bf5a-076e3f35b294
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: gfsib01bkp.corvidtec.com:/data/brick01bkp/homegfs_bkp
Brick2: gfsib01bkp.corvidtec.com:/data/brick02bkp/homegfs_bkp
Options Reconfigured:
storage.owner-gid: 100
performance.io-thread-count: 32
server.allow-insecure: on
network.ping-timeout: 10
performance.cache-size: 128MB
performance.write-behind-window-size: 128MB
server.manage-gids: on
changelog.rollover-time: 15
changelog.fsync-interval: 3
------ Original Message ------
From: "Benjamin Turner" <bennyturns@xxxxxxxxx>
To: "David F. Robinson" <david.robinson@xxxxxxxxxxxxx>
Cc: "Gluster Devel" <gluster-devel@xxxxxxxxxxx>;
"gluster-users@xxxxxxxxxxx" <gluster-users@xxxxxxxxxxx>
Sent: 2/3/2015 7:12:34 PM
Subject: Re: missing files
It sounds to me like the files were only copied to one replica,
weren't there for the initial ls which triggered a self heal, and
were there for the last ls because they were healed. Is there any
chance that one of the replicas was down during the rsync? It could
be that you lost a brick during the copy or something like that. To
confirm, I would look for disconnects in the brick logs as well as
check glustershd.log to verify the missing files were actually
healed.
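Something like the following is what I would grep for (these are the
default log locations; adjust paths if yours differ):
grep -i disconnect /var/log/glusterfs/bricks/*.log
grep -i heal /var/log/glusterfs/glustershd.log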
-b
On Tue, Feb 3, 2015 at 5:37 PM, David F. Robinson
<david.robinson@xxxxxxxxxxxxx> wrote:
I rsync'd 20-TB over to my gluster system and noticed that I had
some directories missing even though the rsync completed normally.
The rsync logs showed that the missing files were transferred.
I went to the bricks and did an 'ls -al
/data/brick*/homegfs/dir/*' and the files were on the bricks. After
I did this 'ls', the files then showed up on the FUSE mounts.
1) Why are the files hidden on the fuse mount?
2) Why does the ls make them show up on the FUSE mount?
3) How can I prevent this from happening again?
Note, I also mounted the gluster volume using NFS and saw the
same behavior. The files/directories were not shown until I did
the "ls" on the bricks.
David
===============================
David F. Robinson, Ph.D.
President - Corvid Technologies
704.799.6944 x101 [office]
704.252.1310 [cell]
704.799.7974 [fax]
David.Robinson@xxxxxxxxxxxxx
http://www.corvidtechnologies.com/
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-devel