Re: Need advice re some major issues with glusterfind

Aravinda <avishwan@xxxxxxxxxx> · Fri, 23 Oct 2015 15:07:26 +0530

Hi John,

Thanks for trying out glusterfind and reporting issues. To allow using
glusterfind without a session, Milind is planning to introduce a new
option called "query".

`glusterfind query` will accept a timestamp as a parameter and return the
list of files changed after that time.
Patch: http://review.gluster.org/#/c/12362/

Glusterfind crawls the brick backend whenever a file path is not
recorded in the changelogs (only the GFID is recorded there). This may be
the reason for the high CPU utilization. We are working on a solution that
simplifies the conversion of GFID to path, so that the crawl can be avoided
when finding a path.

regards
Aravinda

On 10/23/2015 02:54 PM, Sincock, John [FLCPTY] wrote:
Aaah I seeee, thanks Kotresh :-)
This explains why there are so many files and why I sometimes didn't see some changed files during my testing where I was changing files and then immediately running a glusterfind.

When you say deleting the changelogs is not recommended because it will affect new glusterfind sessions - I assume it will be OK to delete changelogs that are further back into the past than the time period we're interested in? Please let me know if this is the case, or if you meant that removing old changelogs is likely to trigger bugs and cause all our glusterfinds to start failing outright...

We can leave the old changelogs there if we have to, but if we don’t increase the rollover time, the number will become astronomical as time goes on, so I hope we can delete or archive old changelogs for time periods we're no longer interested in.

For our purposes I think it should also be OK to try increasing the rollover time significantly. E.g. if we have it set to roll over every 10 minutes, then all we have to do is subtract 10 mins from the start time of each glusterfind/backup so it overlaps the end of the previous glusterfind period. That way, any files changed just before a glusterfind/backup runs might be missed by that backup, but they will be caught by the next backup that runs later on. And it won't matter if some changed files get backed up twice, as long as we get at least one backup of every file that does change.
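
The overlap arithmetic described above can be sketched in shell. This is only an illustration: `overlap_start` is a hypothetical helper, not part of glusterfind, and the epoch value is made up.

```shell
#!/bin/sh
# Hypothetical helper (not part of glusterfind): given the nominal start of a
# scan period (epoch seconds) and the changelog rollover time (seconds),
# print an earlier start time so the period overlaps the previous one.
overlap_start() {
    period_start=$1
    rollover=$2
    echo $((period_start - rollover))
}

# Illustrative values: period starting at epoch 1445590000, rollover of 600 s.
overlap_start 1445590000 600   # prints 1445589400
```

Files caught in the overlap may be reported twice, which is acceptable here as long as every changed file is reported at least once.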

I note that by default there is no easy way to make glusterfind report on changes further back in time than the time you run glusterfind create to start a session - but I've already had some success at getting glusterfind to give results back to earlier times before the session was created (as long as the changelogs exist). I did this by using a script to manually set the time we're interested in in the status file(s) - ie in the main status file on the node running the "pre" command, and in every one of the extra status files stored on every node for each of their bricks :-)

I think my only remaining concern is how cpu-intensive the process is. I've had glusterfinds return very quickly if only reporting on changes for the last hour, or the last 10 hours or so. But if I go back a bit further, the time taken to do the glusterfind seems to really blow out and it sits there pegging all our CPUs at 100% for hours.

But you and Vijay have definitely given me a few tweaks I can look into - I think I will bump-up the changelog rollover a bit, and will follow Vijay's tip to get all our files labelled with pgfid's, and then perhaps the glusterfinds will be less cpu-intensive.

Thanks for the tips (Kotresh & Vijay), and I'll let you know how it goes.

If the glusterfinds are still very cpu-intensive after all the pgfid labelling is done, I'll be happy to do some further testing if it can be of any help to you. Or if you're already trying to find time to work on increasing the efficiency of processing the changelogs, and you know where the improvements need to be made I'll just leave you to it and hope it all goes smoothly for you

Thanks again, and cheerios :-)
John

-----Original Message-----
From: Kotresh Hiremath Ravishankar [mailto:khiremat@xxxxxxxxxx]
Sent: Friday, 23 October 2015 5:24 PM
To: Sincock, John [FLCPTY]
Cc: Vijaikumar Mallikarjuna; gluster-devel@xxxxxxxxxxx
Subject: Re:  Need advice re some major issues with glusterfind

Hi John,

The changelog files are generated every 15 secs, recording the changes that happened to the filesystem within that span. So every 15 secs, once the new changelog file is generated, it is ready to be consumed by glusterfind or any other consumer. The 15 sec time period is tunable.
e.g.,
      gluster vol set <VOLNAME> changelog.rollover-time 300

The above will generate a new changelog file every 300 secs instead of every 15 secs, reducing the number of changelogs. But glusterfind will then come to know about changes in the filesystem only after 300 secs!
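
As a rough illustration of the trade-off, assuming at most one changelog file per brick per rollover interval (an assumption; the helper name is hypothetical), the upper bound on changelogs per brick per day works out as:

```shell
#!/bin/sh
# Hypothetical helper: upper bound on changelog files per brick per day,
# assuming at most one file is written per rollover interval.
max_changelogs_per_day() {
    rollover_secs=$1
    echo $((86400 / rollover_secs))
}

max_changelogs_per_day 15    # prints 5760
max_changelogs_per_day 300   # prints 288
```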

Deleting these changelogs at .glusterfs/changelog/... is not recommended. It will affect any new glusterfind session that is established afterwards.

Thanks and Regards,
Kotresh H R

----- Original Message -----
From: "John Sincock [FLCPTY]" <J.Sincock@xxxxxxxxx>
To: "Vijaikumar Mallikarjuna" <vmallika@xxxxxxxxxx>
Cc: gluster-devel@xxxxxxxxxxx
Sent: Friday, October 23, 2015 9:54:25 AM
Subject: Re:  Need advice re some major issues with glusterfind

Hi Vijay, pls see below again (I'm wondering if top-posting would be
easier, that's usually what I do, though I know some ppl don’t like
it)

On Wed, Oct 21, 2015 at 5:53 AM, Sincock, John [FLCPTY]
<J.Sincock@xxxxxxxxx>
wrote:
Hi Everybody,

We have recently upgraded our 220 TB gluster to 3.7.4, and we've been
trying to use the new glusterfind feature but have been having some
serious problems with it. Overall the glusterfind looks very
promising, so I don't want to offend anyone by raising these issues.

If these issues can be resolved or worked around, glusterfind will be
a great feature.  So I would really appreciate any information or advice:

1) What can be done about the vast number of tiny changelogs? We are
seeing often 5+ small 89 byte changelog files per minute on EACH
brick. Larger files if busier. We've been generating these changelogs
for a few weeks and have in excess of 10,000 or 12,000 on most bricks.
This makes glusterfinds very, very slow, especially on a node which
has a lot of bricks, and looks unsustainable in the long run. Why are
these files so small, and why are there so many of them, and how are
they supposed to be managed in the long run? The sheer number of these
files looks sure to impact performance in the long run.

2) Pgfid xattribute is wreaking havoc with our backup scheme - when
gluster adds this extended attribute to files it changes the ctime,
which we were using to determine which files need to be archived.
There should be a warning added to release notes & upgrade notes, so
people can make a plan to manage this if required.

Also, we ran a rebalance immediately after the 3.7.4 upgrade, and the
rebalance took 5 days or so to complete, which looks like a major
speed improvement over the more serial rebalance algorithm, so that's
good. But I was hoping that the rebalance would also have had the
side-effect of triggering all files to be labelled with the pgfid
attribute by the time the rebalance completed, or failing that, after
creation of an mlocate database across our entire gluster (which would
have accessed every file, unless it is getting the info it needs only
from directory inodes). Now it looks like ctimes are still being
modified, and I think this can only be caused by files still being labelled with pgfids.

How can we force gluster to get this pgfid labelling over and done
with, for all files that are already on the volume? We can't have
gluster continuing to add pgfids in bursts here and there, eg when
files are read for the first time since the upgrade. We need to get it
over and done with. We have just had to turn off pgfid creation on the
volume until we can force gluster to get it over and done with in one go.

Hi John,

Was quota turned on/off before/after performing re-balance? If the pgfid is
missing, this can be healed by performing 'find <mount_point> | xargs stat',
all the files will get looked-up once and the pgfid healing will happen.
Also could you please provide all the volume files under
'/var/lib/glusterd/vols/<volname>/*.vol'?
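
A minimal sketch of the same lookup pass that also copes with file names containing spaces or newlines, assuming a `find` and `xargs` that support `-print0` and `-0` (the GNU and BSD versions do). `stat_all` is a hypothetical wrapper name and the mount point is a placeholder.

```shell
#!/bin/sh
# Hypothetical wrapper: stat every entry under the given directory once.
# -print0 / -0 pass names NUL-delimited so unusual file names are safe.
# The stat output itself is not needed, so it is discarded.
stat_all() {
    find "$1" -print0 | xargs -0 stat > /dev/null
}

# Usage (placeholder path): stat_all /path/to/mount_point
```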

Thanks,
Vijay

Hi Vijay

Quota has never been turned on in our gluster, so it can’t be any
quota-related xattrs which are resetting our ctimes, so I’m pretty
sure it must be due to pgfids still being added.

Thanks for the tip re using stat, if that should trigger the pgfid
build on each file, then I will run that when I have a chance. We’ll
have to get our archiving of data back up to date, re-enable pgfid
build option, and then run the stat over a weekend or something, as it will take a while.

I’m still quite concerned about the number of changelogs being
generated. Do you know if there are any plans to change the way changelogs
are generated so there aren’t so many of them, and to process them
more efficiently? I think this will be vital to improving performance
of glusterfind in future, as there are currently an enormous number of
these small changelogs being generated on each of our gluster bricks.

Below is the volfile for one brick, the others are all equivalent. We
haven’t tweaked the volume options much, besides increasing the io
thread count to 32, and client/event threads to 6, since we have a lot
of small files on our gluster (30 million files, a lot of which are
small, and some of which are large to very large):

Hi John,

PGFID xattrs are updated only when update-link-count-parent is enabled
in the brick volume file. This option is enabled when quota is enabled on a volume.
The volume file you provided below has update-link-count-parent
disabled, so I am wondering why PGFID xattrs are being updated.

Thanks,
Vijay

Hi Vijay,
somewhere in the 3.7.5 upgrade instructions or the glusterfind
documentation, there was a mention that we should enable a server
option called storage.build-pgfid, which we did as it speeds up
glusterfinds. You cannot see this in the volfile but you can see it
when you do gluster volume info volname. So for our volume we currently have:

Options Reconfigured:
server.allow-insecure: on
nfs.disable: false
performance.io-thread-count: 32
features.quota: off
client.bind-insecure: on

storage.build-pgfid: off

changelog.changelog: on
changelog.capture-del-path: on
server.event-threads: 6
client.event-threads: 6

We've turned storage.build-pgfid OFF now, but we turned it on when we
did the upgrade to 3.7.4, and we had it on until a few days ago. So,
for us, with update-link-count-parent off - storage.build-pgfid
would've been the thing responsible for adding the pgfids to files on our volume.

I should've realised the best thing to do would’ve been to do a stat
on every file, in order to trigger the pgfid build, but at first I
thought the pgfids would be added to every file during the rebalance
which was a priority at the time (we had just added 40TB of new bricks
to a very full volume), and then we hit pgfid/backup issues etc.

I think we can get the pgfid issue resolved now you've confirmed that
a stat will do it (thanks :-) We'll just have to stop our clients
writing to the volume for a day or so while we stat every file on the
volume. Then, if we've stopped our clients writing during that time,
we can re-jig our backups to safely ignore any ctimes that
changed during the day or so we were statting the volume.

I'll let you know how things go with the pgfid's if we can get them
turned back on and added to every file sometime as soon as possible.

I'm definitely more concerned now about the changelog issue. As
mentioned we have an enormous number of these, eg as of now (about 25
days since upgrading to 3.7.4), we have 13000 or so changelogs on each of our bricks:

ls -la /mnt/glusterfs/bricks/1/.glusterfs/changelogs/ | wc -l
13096

And they are very small, about 5 KB on average, and ranging from (many at
just) 89 bytes, up to 20 KB or so for the larger ones:
du -hs /mnt/glusterfs/bricks/1/.glusterfs/changelogs/
68M     /mnt/glusterfs/bricks/1/.glusterfs/changelogs/

The size of the changelogs is not an issue (68M for almost a month
worth of changes is nothing), but the sheer number of files is, as is
the fact that it seems to be very cpu-intensive to process these files
(eg an strace showed glusterfind taking 2.7 million system calls to
process just one of these small changelogs).
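
A minimal sketch for counting the changelogs, assuming a POSIX `find`. Note that `ls -la | wc -l` also counts the `total` line and the `.` and `..` entries, so it reads slightly high. The helper names and the 100-byte threshold are illustrative assumptions; pass the changelog directory of a brick as the argument.

```shell
#!/bin/sh
# Hypothetical helpers for inspecting a changelog directory.
# Both assume file names do not contain newlines.

# Count regular files under a directory (recursively).
count_files() {
    find "$1" -type f | wc -l | tr -d ' '
}

# Count regular files smaller than 100 bytes (e.g. the 89-byte changelogs).
count_tiny_files() {
    find "$1" -type f -size -100c | wc -l | tr -d ' '
}

# Usage (placeholder path):
#   count_files /path/to/brick/changelog_dir
#   count_tiny_files /path/to/brick/changelog_dir
```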

Do you know if anyone is working on reducing the number of these
changelogs and/or processing them more efficiently?

Thanks again for any info!

_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-devel