Hi John,

----- Original Message -----
> From: "John Sincock [FLCPTY]" <J.Sincock@xxxxxxxxx>
> To: gluster-devel@xxxxxxxxxxx
> Sent: Wednesday, October 21, 2015 5:53:23 AM
> Subject: Need advice re some major issues with glusterfind
>
> Hi Everybody,
>
> We have recently upgraded our 220 TB gluster to 3.7.4, and we've been
> trying to use the new glusterfind feature, but we have been having some
> serious problems with it. Overall, glusterfind looks very promising, so I
> don't want to offend anyone by raising these issues.
>
> If these issues can be resolved or worked around, glusterfind will be a
> great feature, so I would really appreciate any information or advice:
>
> 1) What can be done about the vast number of tiny changelogs? We often
> see 5+ small, 89-byte changelog files per minute on EACH brick (larger
> files when busier). We've been generating these changelogs for a few
> weeks and already have in excess of 10,000 or 12,000 on most bricks. This
> makes glusterfinds very, very slow, especially on a node with a lot of
> bricks, and it looks unsustainable in the long run. Why are these files
> so small, why are there so many of them, and how are they supposed to be
> managed in the long run? The sheer number of these files looks sure to
> impact performance.
>
> 2) The pgfid xattr is wreaking havoc with our backup scheme: when gluster
> adds this extended attribute to a file, it changes the ctime, which we
> were using to determine which files need to be archived. A warning should
> be added to the release notes & upgrade notes so people can make a plan
> to manage this if required.
>
> Also, we ran a rebalance immediately after the 3.7.4 upgrade, and the
> rebalance took 5 days or so to complete, which looks like a major speed
> improvement over the more serial rebalance algorithm, so that's good. But
> I was hoping the rebalance would also have had the side effect of
> labelling all files with the pgfid attribute by the time it completed,
> or, failing that, after creation of an mlocate database across our entire
> gluster (which would have accessed every file, unless it gets the info it
> needs only from directory inodes). But ctimes are still being modified,
> and I think this can only be caused by files still being labelled with
> pgfids.
>
> How can we force gluster to get this pgfid labelling over and done with
> for all files that are already on the volume? We can't have gluster
> continuing to add pgfids in bursts here and there, e.g. when files are
> read for the first time since the upgrade. We need to get it over and
> done with, and we have had to turn off pgfid creation on the volume until
> we can force gluster to do it in one go.

We are looking into the pgfid xattr issue. It's a long weekend here in
India, so kindly expect a delay in updates on this issue.
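In the meantime, two rough, untested sketches in case they help; please
treat both as suggestions to verify, not official procedures.

On (1): the changelog translator rotates the active changelog every
rollover-time seconds (15 by default) whether or not anything changed, so
even a quiet brick produces several files per minute, and a rollover with
no changes leaves a header-only file of a fixed small size, which would
explain the identical 89-byte files. You can gauge the backlog per brick
with something like the below (the /bricks/* glob is an assumption, adjust
it to your brick layout; rollover changelogs normally accumulate under
<brick>/.glusterfs/changelogs/):

  # Count rollover changelog files per brick (brick paths assumed)
  for b in /bricks/*; do
      n=$(find "$b/.glusterfs/changelogs" -maxdepth 1 \
              -name 'CHANGELOG.*' 2>/dev/null | wc -l)
      printf '%s\t%s\n' "$b" "$n"
  done

On (2): pgfid xattrs are built by the posix translator during named
lookups when storage.build-pgfid is enabled, so in principle one full
crawl through a FUSE mount should get the labelling, and the one-off ctime
churn, over with in a single pass. A sketch, using the volume name from
your session paths and an assumed mount point:

  # One-off crawl to force pgfid labelling via named lookups (sketch)
  gluster volume set vol00 storage.build-pgfid on
  mkdir -p /mnt/vol00
  mount -t glusterfs localhost:/vol00 /mnt/vol00
  # stat every non-directory so each file gets a named lookup
  find /mnt/vol00 ! -type d -exec stat {} + > /dev/null

I have not verified that a plain lookup heals the pgfid on every code
path, so please test this on a small subtree first; we will follow up with
the proper procedure once we have dug into the pgfid issue.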
> 3) Files modified just before a glusterfind pre are often not included in
> the changed-files list unless the pre command is run again a bit later. I
> think the changelogs are missing very recent changes and need to be
> flushed or something before the pre command uses them?
>
> 4) BUG: Glusterfind follows symlinks off bricks and onto NFS-mounted
> directories (and will cause those shares to be mounted if you have autofs
> enabled). Glusterfind should definitely not follow symlinks, but it does.
> For now, we are getting around this by turning off autofs when we run
> glusterfinds, but this should not be necessary. Glusterfind must be fixed
> so that it never follows symlinks and never leaves the brick it is
> currently searching.
>
> 5) One of our nodes has 16 bricks, and on this machine the glusterfind
> pre command seems to get stuck pegging all 8 cores at 100%. An strace of
> one of the offending processes gives an endless stream of these lseeks
> and reads and very little else. What is going on here? It doesn't look
> right:
>
> lseek(13, 17188864, SEEK_SET) = 17188864
> read(13, "\r\0\0\0\4\0J\0\3\25\2\"\0013\0J\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 1024
> lseek(13, 17189888, SEEK_SET) = 17189888
> read(13, "\r\0\0\0\4\0\"\0\3\31\0020\1#\0\"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 1024
> lseek(13, 17190912, SEEK_SET) = 17190912
> read(13, "\r\0\0\0\3\0\365\0\3\1\1\372\0\365\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 1024
> lseek(13, 17191936, SEEK_SET) = 17191936
> read(13, "\r\0\0\0\4\0F\0\3\17\2\"\0017\0F\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 1024
> lseek(13, 17192960, SEEK_SET) = 17192960
> read(13, "\r\0\0\0\4\0006\0\2\371\2\4\1\31\0006\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 1024
> lseek(13, 17193984, SEEK_SET) = 17193984
> read(13, "\r\0\0\0\4\0L\0\3\31\2\36\1/\0L\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 1024
>
> I saved one of these straces for 20 or 30 seconds or so, then did a quick
> analysis of it:
>
> cat ~/strace.glusterfind-lseeks2.txt | wc -l
> 2719285
>
> That's 2.7 million system calls, and grepping to exclude all the lseeks
> and reads leaves only 24 other syscalls:
>
> cat ~/strace.glusterfind-lseeks2.txt | grep -v lseek | grep -v read
> Process 28076 attached - interrupt to quit
> write(13, "\r\0\0\0\4\0\317\0\3N\2\241\1\322\0\317\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 1024
> write(13, "\r\0\0\0\4\0_\0\3\5\2\34\1I\0_\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 1024
> write(13, "\r\0\0\0\4\0\24\0\3\10\2\f\1\34\0\24\0\0\0\0\202\3\203\324?\f\0!\31UU?"..., 1024) = 1024
> close(15) = 0
> munmap(0x7f3570b01000, 4096) = 0
> lstat("/usr", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> lstat("/usr/var", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> lstat("/usr/var/lib", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> lstat("/usr/var/lib/misc", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> lstat("/usr/var/lib/misc/glusterfsd", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> lstat("/usr/var/lib/misc/glusterfsd/glusterfind", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> lstat("/usr/var/lib/misc/glusterfsd/glusterfind/TestSession1", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> lstat("/usr/var/lib/misc/glusterfsd/glusterfind/TestSession1/vol00", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> lstat("/usr/var/lib/misc/glusterfsd/glusterfind/TestSession1/vol00/f261f90fd73ecb69c6b31f646ce65be0fd129f5e", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> lstat("/usr/var/lib/misc/glusterfsd/glusterfind/TestSession1/vol00/f261f90fd73ecb69c6b31f646ce65be0fd129f5e/.history", {st_mode=S_IFDIR|0600, st_size=4096, ...}) = 0
> lstat("/usr/var/lib/misc/glusterfsd/glusterfind/TestSession1/vol00/f261f90fd73ecb69c6b31f646ce65be0fd129f5e/.history/.processing", {st_mode=S_IFDIR|0600, st_size=249856, ...}) = 0
> lstat("/usr/var/lib/misc/glusterfsd/glusterfind/TestSession1/vol00/f261f90fd73ecb69c6b31f646ce65be0fd129f5e/.history/.processing/CHANGELOG.1444388354", {st_mode=S_IFREG|0644, st_size=5793, ...}) = 0
> write(6, "[2015-10-16 02:59:53.437769] D ["..., 273) = 273
> rename("/usr/var/lib/misc/glusterfsd/glusterfind/TestSession1/vol00/f261f90fd73ecb69c6b31f646ce65be0fd129f5e/.history/.processing/CHANGELOG.1444388354", "/usr/var/lib/misc/glusterfsd/glusterfind/TestSession1/vol00/f261f90fd73ecb69c6b31f646ce65be0fd129f5e/.history/.processed/CHANGELOG.1444388354") = 0
> open("/usr/var/lib/misc/glusterfsd/glusterfind/TestSession1/vol00/f261f90fd73ecb69c6b31f646ce65be0fd129f5e/.history/.processing/CHANGELOG.1444388369", O_RDONLY) = 15
> fstat(15, {st_mode=S_IFREG|0644, st_size=4026, ...}) = 0
> fstat(15, {st_mode=S_IFREG|0644, st_size=4026, ...}) = 0
> mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f3570b01000
> write(13, "\r\0\0\0\4\0]\0\3\22\0027\1L\0]\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 1024
> Process 28076 detached
>
> That seems like an enormous number of system calls to process just one
> changelog - especially when most of these changelogs are only 89 bytes
> long, few are larger than about 5 KB, and the largest is about 20 KB. We
> only upgraded to 3.7.4 several weeks ago, and we already have 12,000 or
> so changelogs to process on each brick, all of which will have to be
> processed if I want to generate a listing which goes back to the time we
> did the upgrade - which I do... If each of the changelogs is being
> processed in this sort of apparently inefficient way, it must be making
> the process a lot slower than it needs to be.
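One observation on the strace, for what it's worth: that access pattern
looks like SQLite page I/O rather than changelog parsing. glusterfind
keeps its working data in a sqlite database under the session directory,
the reads on fd 13 are sequential 1024-byte chunks at 1 KB-aligned offsets
(1024 was sqlite's long-time default page size), and the leading
"\r\0\0\0" bytes match a sqlite table-leaf page header. You could confirm
what fd 13 actually is with the PID from your trace (the fd number and
interpretation are inferences from your output, not something I have
reproduced):

  # Resolve fd 13 of the busy glusterfind process (PID from the strace)
  ls -l /proc/28076/fd/13
  # If it points at a sqlite file, check its page size
  sqlite3 "$(readlink /proc/28076/fd/13)" 'PRAGMA page_size;'

If it is the session database being walked a page at a time, once per
changelog, that would go a long way toward explaining the syscall volume,
and it would be useful detail to attach to a bug report.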
rename("/usr/var/lib/misc/glusterfsd/glusterfind/TestSession1/vol00/f261f90fd73ecb69c6b31f646ce65be0fd129f5e/.history/.processing/CHANGELOG.1444388354", > "/usr/var/lib/misc/glusterfsd/glusterfind/TestSession1/vol00/f261f90fd73ecb69c6b31f646ce65be0fd129f5e/.history/.processed/CHANGELOG.1444388354") > = 0 > open("/usr/var/lib/misc/glusterfsd/glusterfind/TestSession1/vol00/f261f90fd73ecb69c6b31f646ce65be0fd129f5e/.history/.processing/CHANGELOG.1444388369", > O_RDONLY) = 15 > fstat(15, {st_mode=S_IFREG|0644, st_size=4026, ...}) = 0 > fstat(15, {st_mode=S_IFREG|0644, st_size=4026, ...}) = 0 > mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = > 0x7f3570b01000 > write(13, > "\r\0\0\0\4\0]\0\3\22\0027\1L\0]\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) > = 1024 > Process 28076 detached > > That seems like an enormous number of system calls to process just one > changelog - especially when most of these changelogs are only 89 bytes long > and few are larger than about 5 KB, and the largest is about 20KB. We only > upgraded to 3.7.4 several weeks ago, and we already have 12,000 or so > changelogs to process on each brick, which will all have to be processed if > I want to generate a listing which goes back to the time we did the upgrade > - which I do... If each of the changelogs are being processed in this sort > of apparently inefficient way, it must be making the process a lot slower > than it needs to be. > > This is a big problem and makes it almost impossible to use glusterfind for > what we need to use it for... > > Again, I'm not intending to be negative, just hoping these issues can be > addressed if possible, and seeking advice or info re managing these issues > and making glusterfind usable in the meantime. > > Many thanks for any advice. > > John > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel@xxxxxxxxxxx > http://www.gluster.org/mailman/listinfo/gluster-devel > _______________________________________________ Gluster-devel mailing list Gluster-devel@xxxxxxxxxxx http://www.gluster.org/mailman/listinfo/gluster-devel