Hi John,

----- Original Message -----
> From: "John Sincock [FLCPTY]" <J.Sincock@xxxxxxxxx>
> To: gluster-devel@xxxxxxxxxxx
> Sent: Wednesday, October 21, 2015 5:53:23 AM
> Subject: Need advice re some major issues with glusterfind
>
> Hi Everybody,
>
> We have recently upgraded our 220 TB gluster to 3.7.4, and we've been
> trying to use the new glusterfind feature, but we have been having some
> serious problems with it. Overall, glusterfind looks very promising, so I
> don't want to offend anyone by raising these issues.
>
> If these issues can be resolved or worked around, glusterfind will be a
> great feature, so I would really appreciate any information or advice:
>
> 1) What can be done about the vast number of tiny changelogs? We often
> see 5+ small, 89-byte changelog files per minute on EACH brick (larger
> files when busier). We've been generating these changelogs for a few
> weeks and already have in excess of 10,000 or 12,000 on most bricks. This
> makes glusterfinds very, very slow, especially on a node with a lot of
> bricks, and it looks unsustainable in the long run. Why are these files
> so small, why are there so many of them, and how are they supposed to be
> managed in the long run? The sheer number of these files looks sure to
> impact performance.
>
> 2) The pgfid xattr is wreaking havoc with our backup scheme: when gluster
> adds this extended attribute to a file, it changes the ctime, which we
> were using to determine which files need to be archived. A warning should
> be added to the release notes & upgrade notes so people can make a plan
> to manage this if required.
>
> Also, we ran a rebalance immediately after the 3.7.4 upgrade, and the
> rebalance took 5 days or so to complete, which looks like a major speed
> improvement over the more serial rebalance algorithm, so that's good. But
> I was hoping the rebalance would also have had the side effect of
> labelling all files with the pgfid attribute by the time it completed,
> or, failing that, after creation of an mlocate database across our entire
> gluster (which would have accessed every file, unless it gets the info it
> needs only from directory inodes). But ctimes are still being modified,
> and I think this can only be caused by files still being labelled with
> pgfids.
>
> How can we force gluster to get this pgfid labelling over and done with
> for all files that are already on the volume? We can't have gluster
> continuing to add pgfids in bursts here and there, e.g. when files are
> read for the first time since the upgrade. We need to get it over and
> done with, and we have had to turn off pgfid creation on the volume until
> we can force gluster to do it in one go.

We are looking into the pgfid xattr issue. It's a long weekend here in
India, so kindly expect a delay in updates on this issue.
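In the meantime, two rough, untested sketches in case they help; please
treat both as suggestions to verify, not official procedures.

On (1): the changelog translator rotates the active changelog every
rollover-time seconds (15 by default) whether or not anything changed, so
even a quiet brick produces several files per minute, and a rollover with
no changes leaves a header-only file of a fixed small size, which would
explain the identical 89-byte files. You can gauge the backlog per brick
with something like the below (the /bricks/* glob is an assumption, adjust
it to your brick layout; rollover changelogs normally accumulate under
<brick>/.glusterfs/changelogs/):

  # Count rollover changelog files per brick (brick paths assumed)
  for b in /bricks/*; do
      n=$(find "$b/.glusterfs/changelogs" -maxdepth 1 \
              -name 'CHANGELOG.*' 2>/dev/null | wc -l)
      printf '%s\t%s\n' "$b" "$n"
  done

On (2): pgfid xattrs are built by the posix translator during named
lookups when storage.build-pgfid is enabled, so in principle one full
crawl through a FUSE mount should get the labelling, and the one-off ctime
churn, over with in a single pass. A sketch, using the volume name from
your session paths and an assumed mount point:

  # One-off crawl to force pgfid labelling via named lookups (sketch)
  gluster volume set vol00 storage.build-pgfid on
  mkdir -p /mnt/vol00
  mount -t glusterfs localhost:/vol00 /mnt/vol00
  # stat every non-directory so each file gets a named lookup
  find /mnt/vol00 ! -type d -exec stat {} + > /dev/null

I have not verified that a plain lookup heals the pgfid on every code
path, so please test this on a small subtree first; we will follow up with
the proper procedure once we have dug into the pgfid issue.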
> 3) Files modified just before a glusterfind pre are often not included in
> the changed-files list unless the pre command is run again a bit later. I
> think the changelogs are missing very recent changes and need to be
> flushed or something before the pre command uses them?
>
> 4) BUG: Glusterfind follows symlinks off bricks and onto NFS-mounted
> directories (and will cause those shares to be mounted if you have autofs
> enabled). Glusterfind should definitely not follow symlinks, but it does.
> For now, we are getting around this by turning off autofs when we run
> glusterfinds, but this should not be necessary. Glusterfind must be fixed
> so that it never follows symlinks and never leaves the brick it is
> currently searching.
>
> 5) One of our nodes has 16 bricks, and on this machine the glusterfind
> pre command seems to get stuck pegging all 8 cores at 100%. An strace of
> one of the offending processes gives an endless stream of these lseeks
> and reads and very little else. What is going on here? It doesn't look
> right:
>
> lseek(13, 17188864, SEEK_SET) = 17188864
> read(13, "\r\0\0\0\4\0J\0\3\25\2\"\0013\0J\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 1024
> lseek(13, 17189888, SEEK_SET) = 17189888
> read(13, "\r\0\0\0\4\0\"\0\3\31\0020\1#\0\"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 1024
> lseek(13, 17190912, SEEK_SET) = 17190912
> read(13, "\r\0\0\0\3\0\365\0\3\1\1\372\0\365\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 1024
> lseek(13, 17191936, SEEK_SET) = 17191936
> read(13, "\r\0\0\0\4\0F\0\3\17\2\"\0017\0F\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 1024
> lseek(13, 17192960, SEEK_SET) = 17192960
> read(13, "\r\0\0\0\4\0006\0\2\371\2\4\1\31\0006\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 1024
> lseek(13, 17193984, SEEK_SET) = 17193984
> read(13, "\r\0\0\0\4\0L\0\3\31\2\36\1/\0L\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 1024
>
> I saved one of these straces for 20 or 30 seconds or so, then did a quick
> analysis of it:
>
> cat ~/strace.glusterfind-lseeks2.txt | wc -l
> 2719285
>
> That's 2.7 million system calls, and grepping to exclude all the lseeks
> and reads leaves only 24 other syscalls:
>
> cat ~/strace.glusterfind-lseeks2.txt | grep -v lseek | grep -v read
> Process 28076 attached - interrupt to quit
> write(13, "\r\0\0\0\4\0\317\0\3N\2\241\1\322\0\317\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 1024
> write(13, "\r\0\0\0\4\0_\0\3\5\2\34\1I\0_\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 1024
> write(13, "\r\0\0\0\4\0\24\0\3\10\2\f\1\34\0\24\0\0\0\0\202\3\203\324?\f\0!\31UU?"..., 1024) = 1024
> close(15) = 0
> munmap(0x7f3570b01000, 4096) = 0
> lstat("/usr", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> lstat("/usr/var", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> lstat("/usr/var/lib", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> lstat("/usr/var/lib/misc", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> lstat("/usr/var/lib/misc/glusterfsd", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> lstat("/usr/var/lib/misc/glusterfsd/glusterfind", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> lstat("/usr/var/lib/misc/glusterfsd/glusterfind/TestSession1", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> lstat("/usr/var/lib/misc/glusterfsd/glusterfind/TestSession1/vol00", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> lstat("/usr/var/lib/misc/glusterfsd/glusterfind/TestSession1/vol00/f261f90fd73ecb69c6b31f646ce65be0fd129f5e", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> lstat("/usr/var/lib/misc/glusterfsd/glusterfind/TestSession1/vol00/f261f90fd73ecb69c6b31f646ce65be0fd129f5e/.history", {st_mode=S_IFDIR|0600, st_size=4096, ...}) = 0
> lstat("/usr/var/lib/misc/glusterfsd/glusterfind/TestSession1/vol00/f261f90fd73ecb69c6b31f646ce65be0fd129f5e/.history/.processing", {st_mode=S_IFDIR|0600, st_size=249856, ...}) = 0
> lstat("/usr/var/lib/misc/glusterfsd/glusterfind/TestSession1/vol00/f261f90fd73ecb69c6b31f646ce65be0fd129f5e/.history/.processing/CHANGELOG.1444388354", {st_mode=S_IFREG|0644, st_size=5793, ...}) = 0
> write(6, "[2015-10-16 02:59:53.437769] D ["..., 273) = 273
> rename("/usr/var/lib/misc/glusterfsd/glusterfind/TestSession1/vol00/f261f90fd73ecb69c6b31f646ce65be0fd129f5e/.history/.processing/CHANGELOG.1444388354", "/usr/var/lib/misc/glusterfsd/glusterfind/TestSession1/vol00/f261f90fd73ecb69c6b31f646ce65be0fd129f5e/.history/.processed/CHANGELOG.1444388354") = 0
> open("/usr/var/lib/misc/glusterfsd/glusterfind/TestSession1/vol00/f261f90fd73ecb69c6b31f646ce65be0fd129f5e/.history/.processing/CHANGELOG.1444388369", O_RDONLY) = 15
> fstat(15, {st_mode=S_IFREG|0644, st_size=4026, ...}) = 0
> fstat(15, {st_mode=S_IFREG|0644, st_size=4026, ...}) = 0
> mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f3570b01000
> write(13, "\r\0\0\0\4\0]\0\3\22\0027\1L\0]\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 1024
> Process 28076 detached
>
> That seems like an enormous number of system calls to process just one
> changelog - especially when most of these changelogs are only 89 bytes
> long, few are larger than about 5 KB, and the largest is about 20 KB. We
> only upgraded to 3.7.4 several weeks ago, and we already have 12,000 or
> so changelogs to process on each brick, all of which will have to be
> processed if I want to generate a listing which goes back to the time we
> did the upgrade - which I do... If each of the changelogs is being
> processed in this sort of apparently inefficient way, it must be making
> the process a lot slower than it needs to be.
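One observation on the strace, for what it's worth: that access pattern
looks like SQLite page I/O rather than changelog parsing. glusterfind
keeps its working data in a sqlite database under the session directory,
the reads on fd 13 are sequential 1024-byte chunks at 1 KB-aligned offsets
(1024 was sqlite's long-time default page size), and the leading
"\r\0\0\0" bytes match a sqlite table-leaf page header. You could confirm
what fd 13 actually is with the PID from your trace (the fd number and
interpretation are inferences from your output, not something I have
reproduced):

  # Resolve fd 13 of the busy glusterfind process (PID from the strace)
  ls -l /proc/28076/fd/13
  # If it points at a sqlite file, check its page size
  sqlite3 "$(readlink /proc/28076/fd/13)" 'PRAGMA page_size;'

If it is the session database being walked a page at a time, once per
changelog, that would go a long way toward explaining the syscall volume,
and it would be useful detail to attach to a bug report.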
rename("/usr/var/lib/misc/glusterfsd/glusterfind/TestSession1/vol00/f261f90fd73ecb69c6b31f646ce65be0fd129f5e/.history/.processing/CHANGELOG.1444388354", > "/usr/var/lib/misc/glusterfsd/glusterfind/TestSession1/vol00/f261f90fd73ecb69c6b31f646ce65be0fd129f5e/.history/.processed/CHANGELOG.1444388354") > = 0 > open("/usr/var/lib/misc/glusterfsd/glusterfind/TestSession1/vol00/f261f90fd73ecb69c6b31f646ce65be0fd129f5e/.history/.processing/CHANGELOG.1444388369", > O_RDONLY) = 15 > fstat(15, {st_mode=S_IFREG|0644, st_size=4026, ...}) = 0 > fstat(15, {st_mode=S_IFREG|0644, st_size=4026, ...}) = 0 > mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = > 0x7f3570b01000 > write(13, > "\r\0\0\0\4\0]\0\3\22\0027\1L\0]\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) > = 1024 > Process 28076 detached > > That seems like an enormous number of system calls to process just one > changelog - especially when most of these changelogs are only 89 bytes long > and few are larger than about 5 KB, and the largest is about 20KB. We only > upgraded to 3.7.4 several weeks ago, and we already have 12,000 or so > changelogs to process on each brick, which will all have to be processed if > I want to generate a listing which goes back to the time we did the upgrade > - which I do... If each of the changelogs are being processed in this sort > of apparently inefficient way, it must be making the process a lot slower > than it needs to be. > > This is a big problem and makes it almost impossible to use glusterfind for > what we need to use it for... > > Again, I'm not intending to be negative, just hoping these issues can be > addressed if possible, and seeking advice or info re managing these issues > and making glusterfind usable in the meantime. > > Many thanks for any advice. > > John > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel@xxxxxxxxxxx > http://www.gluster.org/mailman/listinfo/gluster-devel > _______________________________________________ Gluster-devel mailing list Gluster-devel@xxxxxxxxxxx http://www.gluster.org/mailman/listinfo/gluster-devel