Hi Vijay,

this is an update on the 8 tests I've run so far. In short, all is well.

I followed your advice and created statedumps every 3 hours. The first 4
tests ran with the default volume options, the last 4 with all the
performance optimizations I could find to improve small file performance.
Over each run the statedump file size grew from ~100KB right after mounting
to ~1GB, reflecting the memory footprint of the gluster client process.

Since every test ran without interruption, the memory leak seems to be
fixed in 3.12.14-1.el7.x86_64 on CentOS 7.

Thanks again for your help.
Cheers
Richard

On 15.10.18 10:48, Richard Neuboeck wrote:
> Hi Vijay,
>
> sorry it took so long. I've upgraded the gluster server and client to
> the latest packages 3.12.14-1.el7.x86_64 available in CentOS.
>
> Incredibly my first test after the update worked perfectly! I'll do
> another couple of rsyncs, maybe apply the performance improvements again
> and do statedumps all the way.
>
> I'll report back if there are any more problems or if they are resolved.
>
> Thanks for the help so far!
> Cheers
> Richard
>
>
> On 25.09.18 00:39, Vijay Bellur wrote:
>> Hello Richard,
>>
>> Thank you for the logs.
>>
>> I am wondering if this could be a different memory leak than the one
>> addressed in the bug. Would it be possible for you to obtain a
>> statedump of the client so that we can understand the memory allocation
>> pattern better? Details about gathering a statedump can be found at [1].
>> Please ensure that /var/run/gluster is present before triggering a
>> statedump.
>>
>> Regards,
>> Vijay
>>
>> [1] https://docs.gluster.org/en/v3/Troubleshooting/statedump/
>>
>>
>> On Fri, Sep 21, 2018 at 12:14 AM Richard Neuboeck
>> <hawk@xxxxxxxxxxxxxxxx> wrote:
>>
>> Hi again,
>>
>> in my limited - non full time programmer - understanding it's a memory
>> leak in the gluster fuse client.
>>
>> Should I reopen the mentioned bug report or open a new one? Or would the
>> community prefer an entirely different approach?
>>
>> Thanks
>> Richard
>>
>> On 13.09.18 10:07, Richard Neuboeck wrote:
>> > Hi,
>> >
>> > I've created excerpts from the brick and client logs +/- 1 minute
>> > around the kill event. Still the logs are ~400-500MB, so I'll put them
>> > somewhere to download since I have no idea what I should be looking
>> > for and skimming them didn't reveal obvious problems to me.
>> >
>> > http://www.tbi.univie.ac.at/~hawk/gluster/brick_3min_excerpt.log
>> > http://www.tbi.univie.ac.at/~hawk/gluster/mnt_3min_excerpt.log
>> >
>> > I was pointed in the direction of the following bug report
>> > https://bugzilla.redhat.com/show_bug.cgi?id=1613512
>> > It sounds right but seems to have been addressed already.
>> >
>> > If there is anything I can do to help solve this problem please let
>> > me know. Thanks for your help!
>> >
>> > Cheers
>> > Richard
>> >
>> >
>> > On 9/11/18 10:10 AM, Richard Neuboeck wrote:
>> >> Hi,
>> >>
>> >> since I feared that the logs would fill up the partition (again) I
>> >> checked the systems daily and finally found the reason. The glusterfs
>> >> process on the client runs out of memory and gets killed by the OOM
>> >> killer after about four days. Since rsync runs for a couple of days
>> >> longer till it ends I never checked the whole time frame in the
>> >> system logs and never stumbled upon the OOM message.
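
(For reference, a minimal sketch of how the 3-hourly client statedumps
mentioned at the top could be scripted, assuming they are triggered by
sending SIGUSR1 to the fuse client process and written to /var/run/gluster
as described in the statedump documentation Vijay links to above; the
'glusterfs.*home' process pattern is a placeholder, not taken from this
thread:)

  #!/bin/bash
  # Take a statedump of the glusterfs fuse client every 3 hours.
  # Assumption: 'glusterfs.*home' matches only the fuse mount of interest.
  mkdir -p /var/run/gluster               # dumps are written here
  while true; do
      pid=$(pgrep -of 'glusterfs.*home')  # oldest matching glusterfs process
      if [ -n "$pid" ]; then
          kill -USR1 "$pid"               # SIGUSR1 triggers the statedump
      fi
      sleep $((3 * 3600))                 # wait 3 hours
  done

Taking a dump this way is non-disruptive to the mount, so it can run
alongside the rsync tests.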
>> >>
>> >> Running out of memory on a 128GB RAM system even with a DB occupying
>> >> ~40% of that is kind of strange though. Might there be a leak?
>> >>
>> >> But this would explain the erratic behavior I've experienced over the
>> >> last 1.5 years while trying to work with our homes on glusterfs.
>> >>
>> >> Here is the kernel log message for the killed glusterfs process:
>> >> https://gist.github.com/bleuchien/3d2b87985ecb944c60347d5e8660e36a
>> >>
>> >> I'm checking the brick and client trace logs. But those are 1TB and
>> >> 2TB in size respectively, so searching in them takes a while. I'll be
>> >> creating gists for both logs around the time when the process died.
>> >>
>> >> As soon as I have more details I'll post them.
>> >>
>> >> Here you can see a graphical representation of the memory usage of
>> >> this system: https://imgur.com/a/4BINtfr
>> >>
>> >> Cheers
>> >> Richard
>> >>
>> >>
>> >> On 31.08.18 08:13, Raghavendra Gowdappa wrote:
>> >>>
>> >>> On Fri, Aug 31, 2018 at 11:11 AM, Richard Neuboeck
>> >>> <hawk@xxxxxxxxxxxxxxxx> wrote:
>> >>>
>> >>> On 08/31/2018 03:50 AM, Raghavendra Gowdappa wrote:
>> >>> > +Mohit. +Milind
>> >>> >
>> >>> > @Mohit/Milind,
>> >>> >
>> >>> > Can you check logs and see whether you can find anything relevant?
>> >>>
>> >>> From glances at the system logs nothing out of the ordinary occurred.
>> >>> However I'll start another rsync and take a closer look. It will take
>> >>> a few days.
>> >>>
>> >>> >
>> >>> > On Thu, Aug 30, 2018 at 7:04 PM, Richard Neuboeck
>> >>> > <hawk@xxxxxxxxxxxxxxxx> wrote:
>> >>> >
>> >>> > Hi,
>> >>> >
>> >>> > I'm attaching a shortened version since the whole client mount log
>> >>> > is about 5.8GB. It includes the initial mount messages and the
>> >>> > last two minutes of log entries.
>> >>> >
>> >>> > It ends very anticlimactically without an obvious error. Is there
>> >>> > anything specific I should be looking for?
>> >>> >
>> >>> >
>> >>> > Normally I look at logs around disconnect msgs to find out the
>> >>> > reason. But as you said, sometimes one can see just disconnect msgs
>> >>> > without any reason. That normally points to the reason for the
>> >>> > disconnect being in the network rather than a Glusterfs initiated
>> >>> > disconnect.
>> >>>
>> >>> The rsync source is serving our homes currently so there are NFS
>> >>> connections 24/7. There don't seem to be any network related
>> >>> interruptions
>> >>>
>> >>>
>> >>> Can you set diagnostics.client-log-level and diagnostics.brick-log-level
>> >>> to TRACE and check the logs on both ends of the connection - client and
>> >>> brick? To reduce the log size, I would suggest logrotating the existing
>> >>> logs and starting with fresh logs when you are about to begin, so that
>> >>> only relevant logs are captured. Also, can you take an strace of the
>> >>> client and brick processes using:
>> >>>
>> >>> strace -o <outputfile> -ff -v -p <pid>
>> >>>
>> >>> Attach both logs and straces. Let's trace through what the syscalls on
>> >>> the sockets return and then decide whether to inspect tcpdump or not.
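
(For concreteness, a sketch of the capture steps suggested above on the
'home' volume from this thread; <client-pid> and <brick-pid> are
placeholders, and the output paths are examples:)

  # On a gluster server: raise the log level on both ends to TRACE
  gluster volume set home diagnostics.client-log-level TRACE
  gluster volume set home diagnostics.brick-log-level TRACE

  # Attach strace to the fuse client (glusterfs, on the client machine)
  # and to the brick process (glusterfsd, on a server); -ff writes one
  # output file per traced thread
  strace -o /var/tmp/client.strace -ff -v -p <client-pid> &
  strace -o /var/tmp/brick.strace -ff -v -p <brick-pid> &

  # When the capture is done, return to the default log level
  gluster volume set home diagnostics.client-log-level INFO
  gluster volume set home diagnostics.brick-log-level INFO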
>> >>> If you don't want to repeat tests again, please capture tcpdump too
>> >>> (on both ends of the connection) and send them to us.
>> >>>
>> >>>
>> >>> - a co-worker would be here faster than I could check the logs if the
>> >>> connection to home were broken ;-)
>> >>> Due to this problem the three gluster machines are reduced to testing
>> >>> only, so there is nothing else running.
>> >>>
>> >>>
>> >>> >
>> >>> > Cheers
>> >>> > Richard
>> >>> >
>> >>> > On 08/30/2018 02:40 PM, Raghavendra Gowdappa wrote:
>> >>> > > Normally the client logs will give a clue on why the disconnections
>> >>> > > are happening (ping-timeout, wrong port etc). Can you look into the
>> >>> > > client logs to figure out what's happening? If you can't find
>> >>> > > anything, can you send across the client logs?
>> >>> > >
>> >>> > > On Wed, Aug 29, 2018 at 6:11 PM, Richard Neuboeck
>> >>> > > <hawk@xxxxxxxxxxxxxxxx> wrote:
>> >>> > >
>> >>> > > Hi Gluster Community,
>> >>> > >
>> >>> > > I have problems with a glusterfs 'Transport endpoint not connected'
>> >>> > > connection abort during file transfers that I can replicate (all
>> >>> > > the time now) but not pinpoint as to why this is happening.
>> >>> > >
>> >>> > > The volume is set up in replica 3 mode and accessed with the fuse
>> >>> > > gluster client. Both client and server are running CentOS and the
>> >>> > > supplied 3.12.11 version of gluster.
>> >>> > >
>> >>> > > The connection abort happens at different times during rsync but
>> >>> > > occurs every time I try to sync all our files (1.1TB) to the empty
>> >>> > > volume.
>> >>> > >
>> >>> > > On neither the client nor the server side do I find errors in the
>> >>> > > gluster log files. rsync logs the obvious transfer problem.
>> >>> > > The only log that shows anything related is the server brick log,
>> >>> > > which states that the connection is shutting down:
>> >>> > >
>> >>> > > [2018-08-18 22:40:35.502510] I [MSGID: 115036]
>> >>> > > [server.c:527:server_rpc_notify] 0-home-server: disconnecting
>> >>> > > connection from
>> >>> > > brax-110405-2018/08/16-08:36:28:575972-home-client-0-0-0
>> >>> > > [2018-08-18 22:40:35.502620] W
>> >>> > > [inodelk.c:499:pl_inodelk_log_cleanup] 0-home-server: releasing lock
>> >>> > > on eaeb0398-fefd-486d-84a7-f13744d1cf10 held by
>> >>> > > {client=0x7f83ec0b3ce0, pid=110423 lk-owner=d0fd5ffb427f0000}
>> >>> > > [2018-08-18 22:40:35.502692] W
>> >>> > > [entrylk.c:864:pl_entrylk_log_cleanup] 0-home-server: releasing lock
>> >>> > > on faa93f7b-6c46-4251-b2b2-abcd2f2613e1 held by
>> >>> > > {client=0x7f83ec0b3ce0, pid=110423 lk-owner=703dd4cc407f0000}
>> >>> > > [2018-08-18 22:40:35.502719] W
>> >>> > > [entrylk.c:864:pl_entrylk_log_cleanup] 0-home-server: releasing lock
>> >>> > > on faa93f7b-6c46-4251-b2b2-abcd2f2613e1 held by
>> >>> > > {client=0x7f83ec0b3ce0, pid=110423 lk-owner=703dd4cc407f0000}
>> >>> > > [2018-08-18 22:40:35.505950] I [MSGID: 101055]
>> >>> > > [client_t.c:443:gf_client_unref] 0-home-server: Shutting down
>> >>> > > connection brax-110405-2018/08/16-08:36:28:575972-home-client-0-0-0
>> >>> > >
>> >>> > > Since I've been running another replica 3 setup for oVirt for a long
>> >>> > > time now, which is completely stable, I thought at first that I had
>> >>> > > made a mistake by setting different options. However even when I
>> >>> > > reset those options I'm able to reproduce the connection problem.
>> >>> > >
>> >>> > > The unoptimized volume setup looks like this:
>> >>> > >
>> >>> > > Volume Name: home
>> >>> > > Type: Replicate
>> >>> > > Volume ID: c92fa4cc-4a26-41ff-8c70-1dd07f733ac8
>> >>> > > Status: Started
>> >>> > > Snapshot Count: 0
>> >>> > > Number of Bricks: 1 x 3 = 3
>> >>> > > Transport-type: tcp
>> >>> > > Bricks:
>> >>> > > Brick1: sphere-four:/srv/gluster_home/brick
>> >>> > > Brick2: sphere-five:/srv/gluster_home/brick
>> >>> > > Brick3: sphere-six:/srv/gluster_home/brick
>> >>> > > Options Reconfigured:
>> >>> > > nfs.disable: on
>> >>> > > transport.address-family: inet
>> >>> > > cluster.quorum-type: auto
>> >>> > > cluster.server-quorum-type: server
>> >>> > > cluster.server-quorum-ratio: 50%
>> >>> > >
>> >>> > > The following additional options were used before:
>> >>> > >
>> >>> > > performance.cache-size: 5GB
>> >>> > > client.event-threads: 4
>> >>> > > server.event-threads: 4
>> >>> > > cluster.lookup-optimize: on
>> >>> > > features.cache-invalidation: on
>> >>> > > performance.stat-prefetch: on
>> >>> > > performance.cache-invalidation: on
>> >>> > > network.inode-lru-limit: 50000
>> >>> > > features.cache-invalidation-timeout: 600
>> >>> > > performance.md-cache-timeout: 600
>> >>> > > performance.parallel-readdir: on
>> >>> > >
>> >>> > > In this case the gluster servers and also the client are using a
>> >>> > > bonded network device running in adaptive load balancing mode.
>> >>> > >
>> >>> > > I've tried using the debug option for the client mount. But except
>> >>> > > for a ~0.5TB log file I didn't get information that seems helpful
>> >>> > > to me.
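
(As a sketch, the option list above corresponds to 'gluster volume set'
commands like the following on the 'home' volume; the names and values are
the ones quoted, and 'gluster volume reset' reverts an individual option to
its default:)

  # Apply the small-file/metadata-cache tuning listed above
  gluster volume set home performance.cache-size 5GB
  gluster volume set home client.event-threads 4
  gluster volume set home server.event-threads 4
  gluster volume set home cluster.lookup-optimize on
  gluster volume set home features.cache-invalidation on
  gluster volume set home performance.stat-prefetch on
  gluster volume set home performance.cache-invalidation on
  gluster volume set home network.inode-lru-limit 50000
  gluster volume set home features.cache-invalidation-timeout 600
  gluster volume set home performance.md-cache-timeout 600
  gluster volume set home performance.parallel-readdir on

  # Going back to the unoptimized setup, one option at a time, e.g.:
  gluster volume reset home performance.parallel-readdir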
>> >>> > >
>> >>> > > Transferring just a couple of GB works without problems.
>> >>> > >
>> >>> > > It may very well be that I'm already blind to the obvious but after
>> >>> > > many long running tests I can't find the crux in the setup.
>> >>> > >
>> >>> > > Does anyone have an idea as to how to approach this problem in a
>> >>> > > way that sheds some useful information?
>> >>> > >
>> >>> > > Any help is highly appreciated!
>> >>> > > Cheers
>> >>> > > Richard
>> >>> > >
>> >>> > > --
>> >>> > > /dev/null
_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-users