Hello,
I have a problem with very slow Windows Explorer browsing
when a folder contains a large number of directories/files.
In this case, the top-level folder has almost 6000
directories. That is admittedly large, but the listing was
almost instantaneous when a Windows Server share was being
used. After migrating to a Samba/GlusterFS share, there is
an almost 20 second delay while the Explorer window
populates the list. This leaves a bad impression of the
storage performance. The systems are otherwise idle.
To isolate the cause, I've eliminated everything else,
including networking and Windows, and have narrowed it down
to GlusterFS being the cause of most of the directory
listing lag.
I was optimistic about using the GlusterFS VFS (libgfapi)
module instead of FUSE with Samba, and it does help
performance dramatically in some cases, but for directory
listings it does not help (and sometimes hurts) compared to
the CIFS FUSE mount.
NFS seems to be better for directory listings and small
I/Os, but I cannot use NFS, as I need CIFS for Windows
clients, ACLs, Active Directory, etc.
Versions:
CentOS release 6.5 (Final)
# glusterd -V
glusterfs 3.4.2 built on Jan 6 2014 14:31:51
# smbd -V
Version 4.1.4
For testing, I've got a single GlusterFS volume, with a
single ext4 brick, being accessed locally:
# gluster volume info nas-cbs-0005
Volume Name: nas-cbs-0005
Type: Distribute
Volume ID: 5068e9a5-d60f-439c-b319-befbf9a73a50
Status: Started
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: 192.168.5.181:/exports/nas-segment-0004/nas-cbs-0005
Options Reconfigured:
server.allow-insecure: on
nfs.rpc-auth-allow: *
nfs.disable: off
nfs.addr-namelookup: off
The Samba share options are:
[nas-cbs-0005]
path = /samba/nas-cbs-0005/cifs_share
admin users = "localadmin"
valid users = "localadmin"
invalid users =
read list =
write list = "localadmin"
guest ok = yes
read only = no
hide unreadable = yes
hide dot files = yes
available = yes
[nas-cbs-0005-vfs]
path = /
vfs objects = glusterfs
glusterfs:volume = nas-cbs-0005
kernel share modes = No
use sendfile = false
admin users = "localadmin"
valid users = "localadmin"
invalid users =
read list =
write list = "localadmin"
guest ok = yes
read only = no
hide unreadable = yes
hide dot files = yes
available = yes
I've locally mounted the volume three ways: with NFS, with
Samba CIFS through a GlusterFS FUSE mount, and with Samba
CIFS through the VFS libgfapi module:
# mount
/dev/sdr on /exports/nas-segment-0004 type ext4 (rw,noatime,auto_da_alloc,barrier,nodelalloc,journal_checksum,acl,user_xattr)
/var/lib/glusterd/vols/nas-cbs-0005/nas-cbs-0005-fuse.vol on /samba/nas-cbs-0005 type fuse.glusterfs (rw,allow_other,max_read=131072)
//10.10.200.181/nas-cbs-0005 on /mnt/nas-cbs-0005-cifs type cifs (rw,username=localadmin,password=localadmin)
10.10.200.181:/nas-cbs-0005 on /mnt/nas-cbs-0005 type nfs (rw,addr=10.10.200.181)
//10.10.200.181/nas-cbs-0005-vfs on /mnt/nas-cbs-0005-cifs-vfs type cifs (rw,username=localadmin,password=localadmin)
Benchmark results for listing a directory of 6000 empty
directories:
Directory listing the ext4 mount directly is almost
instantaneous of course.
Directory listing the NFS mount is also very fast, less
than a second.
Directory listing the CIFS FUSE mount is very slow, about
16 seconds!
Directory listing the CIFS VFS libgfapi mount is about
twice as fast as FUSE, but still slow at about 8 seconds.
Unfortunately, due to:
Bug 1004327 - New files are not inheriting ACL from parent
directory unless "stat-prefetch" is off for the respective
gluster volume
https://bugzilla.redhat.com/show_bug.cgi?id=1004327
I need to have 'stat-prefetch' off, so I retested with this
setting.
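For reference, the 'performance.stat-prefetch: off' entry in the
second 'gluster volume info' output further down corresponds to a
'gluster volume set' call. The following is only a minimal sketch:
set_stat_prefetch and DRY_RUN are hypothetical names introduced
here, and by default the function just prints the command rather
than running it.

```shell
#!/usr/bin/env bash
# Sketch only: set_stat_prefetch and DRY_RUN are hypothetical names.
# The underlying CLI call is 'gluster volume set <vol> <option> <value>'.
set_stat_prefetch() {
    local volume=$1 state=$2
    local cmd=(gluster volume set "$volume" performance.stat-prefetch "$state")
    if [[ ${DRY_RUN:-1} == 1 ]]; then
        # Default: only print the command that would be run.
        printf '%s\n' "${cmd[*]}"
    else
        # DRY_RUN=0 applies the change (needs a running glusterd).
        "${cmd[@]}"
    fi
}
```

Usage: 'set_stat_prefetch nas-cbs-0005 off' prints the command;
'DRY_RUN=0 set_stat_prefetch nas-cbs-0005 off' would apply it.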
Benchmark results for listing a directory of 6000 empty
directories ('stat-prefetch' is off):
Accessing the ext4 mount directly is almost
instantaneous of course.
Accessing the NFS mount is still very fast, less than a
second.
Accessing the CIFS FUSE mount is slow, almost 14
seconds, but surprisingly slightly faster than when
'stat-prefetch' was on.
Accessing the CIFS VFS libgfapi mount is now about twice
as slow as FUSE, at about 26 seconds, I guess due
to 'stat-prefetch' being off!
To see whether the directory listing problem was due to file
system metadata handling or to small I/Os, I ran some simple
small-block file I/O benchmarks with the same configuration.
64KB Sequential Writes:
NFS small block writes seem slow at about 50 MB/sec.
CIFS FUSE small block writes are more than twice as fast
as NFS, at about 118 MB/sec.
CIFS VFS libgfapi small block writes are very fast, about
twice as fast as CIFS FUSE, at about 232 MB/sec.
64KB Sequential Reads:
NFS small block reads are very fast, at about 334 MB/sec.
CIFS FUSE small block reads are less than half as fast as
NFS, at about 124 MB/sec.
CIFS VFS libgfapi small block reads are about the same as
CIFS FUSE, at about 127 MB/sec.
4KB Sequential Writes:
NFS very small block writes are very slow at about 4
MB/sec.
CIFS FUSE very small block writes are faster, at about 11
MB/sec.
CIFS VFS libgfapi very small block writes are twice as
fast as CIFS FUSE, at about 22 MB/sec.
4KB Sequential Reads:
NFS very small block reads are very fast at about 346
MB/sec.
CIFS FUSE very small block reads are less than half as
fast as NFS, at about 143 MB/sec.
CIFS VFS libgfapi very small block reads are slightly
slower than CIFS FUSE, at about 137 MB/sec.
I'm not quite sure how to interpret these results. Write
caching is surely playing a part, but I would think it
should apply equally to both NFS and CIFS. With small-block
I/O, NFS is better at reading than CIFS, and CIFS VFS is
twice as good at writing as CIFS FUSE. Sadly, CIFS VFS is
about the same as CIFS FUSE at reading.
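The "about N MB/sec" figures quoted above are the averages of the
two sgp_dd runs per test in the raw data at the end of this
message. A minimal sketch of how those averages can be recomputed
from saved sgp_dd output follows; mean_throughput is a hypothetical
helper name, and it assumes the summary lines have the form "time
to transfer data was X secs, Y MB/sec" as shown in the raw data.

```shell
#!/usr/bin/env bash
# Sketch only: mean_throughput is a hypothetical helper name.
# Reads sgp_dd output on stdin and averages the reported MB/sec values.
mean_throughput() {
    awk '/time to transfer data was/ {
             # Throughput is the second-to-last field: "... secs, 49.25 MB/sec"
             sum += $(NF - 1); n++
         }
         END { if (n) printf "%.2f MB/sec over %d runs\n", sum / n, n }'
}
```

For example, piping the two NFS 64KB write result lines into
mean_throughput gives the figure behind "about 50 MB/sec".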
Regarding the directory listing lag problem, I've tried most
of the GlusterFS volume options that seemed like they
might help, but nothing really did.
Having 'stat-prefetch' on helps the VFS libgfapi case, but
it has to be off because of Bug 1004327.
BTW: I've repeated some tests with empty files instead of
directories, and the results were similar. The issue is
not specific to directories.
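For anyone who wants to reproduce the listing test, here is a
minimal sketch of how the test directory can be populated and
timed. make_entries and time_listing are hypothetical helper names
introduced here, not part of my original test procedure; the
entry names are arbitrary.

```shell
#!/usr/bin/env bash
# Sketch only: make_entries and time_listing are hypothetical helpers.

# Create COUNT empty directories (kind=dir) or empty files (kind=file)
# under TARGET.
make_entries() {
    local target=$1 count=$2 kind=${3:-dir} i
    mkdir -p "$target" || return 1
    for ((i = 1; i <= count; i++)); do
        if [[ $kind == dir ]]; then
            mkdir "$target/entry-$i" || return 1
        else
            : > "$target/entry-$i" || return 1
        fi
    done
}

# Run 'ls -l' on TARGET a few times; bash prints the wall-clock time
# of each run on stderr.
time_listing() {
    local target=$1 runs=${2:-3} n
    local TIMEFORMAT='real %3Rs'
    for ((n = 1; n <= runs; n++)); do
        time ls -l "$target" > /dev/null
    done
}
```

For example, 'make_entries /samba/nas-cbs-0005/cifs_share/manydirs
6000 dir' followed by 'time_listing /mnt/nas-cbs-0005-cifs/manydirs'
matches the CIFS FUSE case; pass 'file' instead of 'dir' for the
empty-file variant. Dropping caches first (as root, as in the raw
data below) gives the cold-cache figure on the first run.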
I know that small file reads and file-system metadata
handling are not GlusterFS's strong suit, but is there
*anything* that can be done to help it out? Any ideas?
Should I expect GlusterFS 3.5.x to improve this at all?
Raw data is below.
Any advice is appreciated. Thanks.
~ Jeff Byers ~
##########################
Directory listing of 6000 empty directories ('stat-prefetch' is on):
Directory listing the ext4 mount directly is almost instantaneous of course.
# sync;sync; echo '3' > /proc/sys/vm/drop_caches
# time ls -l /exports/nas-segment-0004/nas-cbs-0005/cifs_share/manydirs/ >/dev/null
real 0m41.235s (Throw away first time for ext4 FS cache population?)
# time ls -l /exports/nas-segment-0004/nas-cbs-0005/cifs_share/manydirs/ >/dev/null
real 0m0.110s
# time ls -l /exports/nas-segment-0004/nas-cbs-0005/cifs_share/manydirs/ >/dev/null
real 0m0.109s
Directory listing the NFS mount is also very fast.
# sync;sync; echo '3' > /proc/sys/vm/drop_caches
# time ls -l /mnt/nas-cbs-0005/cifs_share/manydirs/ >/dev/null
real 0m44.352s (Throw away first time for ext4 FS cache population?)
# time ls -l /mnt/nas-cbs-0005/cifs_share/manydirs/ >/dev/null
real 0m0.471s
# time ls -l /mnt/nas-cbs-0005/cifs_share/manydirs/ >/dev/null
real 0m0.114s
Directory listing the CIFS FUSE mount is very slow, about 16 seconds!
# sync;sync; echo '3' > /proc/sys/vm/drop_caches
# time ls -l /mnt/nas-cbs-0005-cifs/manydirs/ >/dev/null
real 0m56.573s (Throw away first time for ext4 FS cache population?)
# time ls -l /mnt/nas-cbs-0005-cifs/manydirs/ >/dev/null
real 0m16.101s
# time ls -l /mnt/nas-cbs-0005-cifs/manydirs/ >/dev/null
real 0m15.986s
Directory listing the CIFS VFS libgfapi mount is about twice as fast as FUSE, but still slow at about 8 seconds.
# sync;sync; echo '3' > /proc/sys/vm/drop_caches
# time ls -l /mnt/nas-cbs-0005-cifs-vfs/cifs_share/manydirs/ >/dev/null
real 0m48.839s (Throw away first time for ext4 FS cache population?)
# time ls -l /mnt/nas-cbs-0005-cifs-vfs/cifs_share/manydirs/ >/dev/null
real 0m8.197s
# time ls -l /mnt/nas-cbs-0005-cifs-vfs/cifs_share/manydirs/ >/dev/null
real 0m8.450s
####################
Retesting the directory listing with Gluster default settings,
except with 'stat-prefetch' off, due to:
Bug 1004327 - New files are not inheriting ACL from parent
directory unless "stat-prefetch" is off for the respective
gluster volume
https://bugzilla.redhat.com/show_bug.cgi?id=1004327
# gluster volume info nas-cbs-0005
Volume Name: nas-cbs-0005
Type: Distribute
Volume ID: 5068e9a5-d60f-439c-b319-befbf9a73a50
Status: Started
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: 192.168.5.181:/exports/nas-segment-0004/nas-cbs-0005
Options Reconfigured:
performance.stat-prefetch: off
server.allow-insecure: on
nfs.rpc-auth-allow: *
nfs.disable: off
nfs.addr-namelookup: off
Directory listing of 6000 empty directories ('stat-prefetch' is off):
Accessing the ext4 mount directly is almost instantaneous of course.
# sync;sync; echo '3' > /proc/sys/vm/drop_caches
# time ls -l /exports/nas-segment-0004/nas-cbs-0005/cifs_share/manydirs/ >/dev/null
real 0m39.483s (Throw away first time for ext4 FS cache population?)
# time ls -l /exports/nas-segment-0004/nas-cbs-0005/cifs_share/manydirs/ >/dev/null
real 0m0.136s
# time ls -l /exports/nas-segment-0004/nas-cbs-0005/cifs_share/manydirs/ >/dev/null
real 0m0.109s
Accessing the NFS mount is also very fast.
# sync;sync; echo '3' > /proc/sys/vm/drop_caches
# time ls -l /mnt/nas-cbs-0005/cifs_share/manydirs/ >/dev/null
real 0m43.819s (Throw away first time for ext4 FS cache population?)
# time ls -l /mnt/nas-cbs-0005/cifs_share/manydirs/ >/dev/null
real 0m0.342s
# time ls -l /mnt/nas-cbs-0005/cifs_share/manydirs/ >/dev/null
real 0m0.200s
Accessing the CIFS FUSE mount is slow, almost 14 seconds!
# sync;sync; echo '3' > /proc/sys/vm/drop_caches
# time ls -l /mnt/nas-cbs-0005-cifs/manydirs/ >/dev/null
real 0m55.759s (Throw away first time for ext4 FS cache population?)
# time ls -l /mnt/nas-cbs-0005-cifs/manydirs/ >/dev/null
real 0m13.458s
# time ls -l /mnt/nas-cbs-0005-cifs/manydirs/ >/dev/null
real 0m13.665s
Accessing the CIFS VFS libgfapi mount is now about twice as slow as FUSE, at about 26 seconds, due to 'stat-prefetch' being off!
# sync;sync; echo '3' > /proc/sys/vm/drop_caches
# time ls -l /mnt/nas-cbs-0005-cifs-vfs/cifs_share/manydirs/ >/dev/null
real 1m2.821s (Throw away first time for ext4 FS cache population?)
# time ls -l /mnt/nas-cbs-0005-cifs-vfs/cifs_share/manydirs/ >/dev/null
real 0m25.563s
# time ls -l /mnt/nas-cbs-0005-cifs-vfs/cifs_share/manydirs/ >/dev/null
real 0m26.949s
####################
64KB Writes:
NFS small block writes seem slow at about 50 MB/sec.
# sync;sync; echo '3' > /proc/sys/vm/drop_caches
# sgp_dd time=1 thr=4 bs=64k bpt=1 iflag=dsync oflag=dsync if=/dev/zero of=/mnt/nas-cbs-0005/cifs_share/testfile count=20k
time to transfer data was 27.249756 secs, 49.25 MB/sec
# sgp_dd time=1 thr=4 bs=64k bpt=1 iflag=dsync oflag=dsync if=/dev/zero of=/mnt/nas-cbs-0005/cifs_share/testfile count=20k
time to transfer data was 25.893526 secs, 51.83 MB/sec
CIFS FUSE small block writes are more than twice as fast as NFS, at about 118 MB/sec.
# sync;sync; echo '3' > /proc/sys/vm/drop_caches
# sgp_dd time=1 thr=4 bs=64k bpt=1 iflag=dsync oflag=dsync if=/dev/zero of=/mnt/nas-cbs-0005-cifs/testfile count=20k
time to transfer data was 11.509077 secs, 116.62 MB/sec
# sgp_dd time=1 thr=4 bs=64k bpt=1 iflag=dsync oflag=dsync if=/dev/zero of=/mnt/nas-cbs-0005-cifs/testfile count=20k
time to transfer data was 11.223902 secs, 119.58 MB/sec
CIFS VFS libgfapi small block writes are very fast, about twice as fast as CIFS FUSE, at about 232 MB/sec.
# sync;sync; echo '3' > /proc/sys/vm/drop_caches
# sgp_dd time=1 thr=4 bs=64k bpt=1 iflag=dsync oflag=dsync if=/dev/zero of=/mnt/nas-cbs-0005-cifs-vfs/cifs_share/testfile count=20k
time to transfer data was 5.704753 secs, 235.27 MB/sec
# sgp_dd time=1 thr=4 bs=64k bpt=1 iflag=dsync oflag=dsync if=/dev/zero of=/mnt/nas-cbs-0005-cifs-vfs/cifs_share/testfile count=20k
time to transfer data was 5.862486 secs, 228.94 MB/sec
64KB Reads:
NFS small block reads are very fast, at about 334 MB/sec.
# sync;sync; echo '3' > /proc/sys/vm/drop_caches
# sgp_dd time=1 thr=4 bs=64k bpt=1 iflag=dsync oflag=dsync of=/dev/null if=/mnt/nas-cbs-0005/cifs_share/testfile count=20k
time to transfer data was 3.972426 secs, 337.87 MB/sec
# sync;sync; echo '3' > /proc/sys/vm/drop_caches
# sgp_dd time=1 thr=4 bs=64k bpt=1 iflag=dsync oflag=dsync of=/dev/null if=/mnt/nas-cbs-0005/cifs_share/testfile count=20k
time to transfer data was 4.066978 secs, 330.02 MB/sec
CIFS FUSE small block reads are less than half as fast as NFS, at about 124 MB/sec.
# sync;sync; echo '3' > /proc/sys/vm/drop_caches
# sgp_dd time=1 thr=4 bs=64k bpt=1 iflag=dsync oflag=dsync of=/dev/null if=/mnt/nas-cbs-0005-cifs/testfile count=20k
time to transfer data was 10.837072 secs, 123.85 MB/sec
# sync;sync; echo '3' > /proc/sys/vm/drop_caches
# sgp_dd time=1 thr=4 bs=64k bpt=1 iflag=dsync oflag=dsync of=/dev/null if=/mnt/nas-cbs-0005-cifs/testfile count=20k
time to transfer data was 10.716980 secs, 125.24 MB/sec
CIFS VFS libgfapi small block reads are about the same as CIFS FUSE, at about 127 MB/sec.
# sync;sync; echo '3' > /proc/sys/vm/drop_caches
# sgp_dd time=1 thr=4 bs=64k bpt=1 iflag=dsync oflag=dsync of=/dev/null if=/mnt/nas-cbs-0005-cifs-vfs/cifs_share/testfile count=20k
time to transfer data was 10.397888 secs, 129.08 MB/sec
# sync;sync; echo '3' > /proc/sys/vm/drop_caches
# sgp_dd time=1 thr=4 bs=64k bpt=1 iflag=dsync oflag=dsync of=/dev/null if=/mnt/nas-cbs-0005-cifs-vfs/cifs_share/testfile count=20k
time to transfer data was 10.696802 secs, 125.47 MB/sec
4KB Writes:
NFS very small block writes are very slow at about 4 MB/sec.
# sync;sync; echo '3' > /proc/sys/vm/drop_caches
# sgp_dd time=1 thr=4 bs=4k bpt=1 iflag=dsync oflag=dsync if=/dev/zero of=/mnt/nas-cbs-0005/cifs_share/testfile count=20k
time to transfer data was 20.450521 secs, 4.10 MB/sec
# sgp_dd time=1 thr=4 bs=4k bpt=1 iflag=dsync oflag=dsync if=/dev/zero of=/mnt/nas-cbs-0005/cifs_share/testfile count=20k
time to transfer data was 19.669923 secs, 4.26 MB/sec
CIFS FUSE very small block writes are faster, at about 11 MB/sec.
# sync;sync; echo '3' > /proc/sys/vm/drop_caches
# sgp_dd time=1 thr=4 bs=4k bpt=1 iflag=dsync oflag=dsync if=/dev/zero of=/mnt/nas-cbs-0005-cifs/testfile count=20k
time to transfer data was 7.247578 secs, 11.57 MB/sec
# sgp_dd time=1 thr=4 bs=4k bpt=1 iflag=dsync oflag=dsync if=/dev/zero of=/mnt/nas-cbs-0005-cifs/testfile count=20k
time to transfer data was 7.422002 secs, 11.30 MB/sec
CIFS VFS libgfapi very small block writes are twice as fast as CIFS FUSE, at about 22 MB/sec.
# sync;sync; echo '3' > /proc/sys/vm/drop_caches
# sgp_dd time=1 thr=4 bs=4k bpt=1 iflag=dsync oflag=dsync if=/dev/zero of=/mnt/nas-cbs-0005-cifs-vfs/cifs_share/testfile count=20k
time to transfer data was 3.766179 secs, 22.27 MB/sec
# sgp_dd time=1 thr=4 bs=4k bpt=1 iflag=dsync oflag=dsync if=/dev/zero of=/mnt/nas-cbs-0005-cifs-vfs/cifs_share/testfile count=20k
time to transfer data was 3.761176 secs, 22.30 MB/sec
4KB Reads:
NFS very small block reads are very fast at about 346 MB/sec.
# sync;sync; echo '3' > /proc/sys/vm/drop_caches
# sgp_dd time=1 thr=4 bs=4k bpt=1 iflag=dsync oflag=dsync of=/dev/null if=/mnt/nas-cbs-0005/cifs_share/testfile count=20k
time to transfer data was 0.244960 secs, 342.45 MB/sec
# sync;sync; echo '3' > /proc/sys/vm/drop_caches
# sgp_dd time=1 thr=4 bs=4k bpt=1 iflag=dsync oflag=dsync of=/dev/null if=/mnt/nas-cbs-0005/cifs_share/testfile count=20k
time to transfer data was 0.240472 secs, 348.84 MB/sec
CIFS FUSE very small block reads are less than half as fast as NFS, at about 143 MB/sec.
# sync;sync; echo '3' > /proc/sys/vm/drop_caches
# sgp_dd time=1 thr=4 bs=4k bpt=1 iflag=dsync oflag=dsync of=/dev/null if=/mnt/nas-cbs-0005-cifs/testfile count=20k
time to transfer data was 0.606534 secs, 138.30 MB/sec
# sync;sync; echo '3' > /proc/sys/vm/drop_caches
# sgp_dd time=1 thr=4 bs=4k bpt=1 iflag=dsync oflag=dsync of=/dev/null if=/mnt/nas-cbs-0005-cifs/testfile count=20k
time to transfer data was 0.576185 secs, 145.59 MB/sec
CIFS VFS libgfapi very small block reads are slightly slower than CIFS FUSE, at about 137 MB/sec.
# sync;sync; echo '3' > /proc/sys/vm/drop_caches
# sgp_dd time=1 thr=4 bs=4k bpt=1 iflag=dsync oflag=dsync of=/dev/null if=/mnt/nas-cbs-0005-cifs-vfs/cifs_share/testfile count=20k
time to transfer data was 0.611328 secs, 137.22 MB/sec
# sync;sync; echo '3' > /proc/sys/vm/drop_caches
# sgp_dd time=1 thr=4 bs=4k bpt=1 iflag=dsync oflag=dsync of=/dev/null if=/mnt/nas-cbs-0005-cifs-vfs/cifs_share/testfile count=20k
time to transfer data was 0.615834 secs, 136.22 MB/sec
EOM