Hi,

I've created excerpts from the brick and client logs +/- 1 minute
around the kill event. The excerpts are still ~400-500MB, so I've put
them up for download, since I have no idea what I should be looking
for and skimming them didn't reveal any obvious problems to me:

http://www.tbi.univie.ac.at/~hawk/gluster/brick_3min_excerpt.log
http://www.tbi.univie.ac.at/~hawk/gluster/mnt_3min_excerpt.log

I was pointed in the direction of the following bug report:
https://bugzilla.redhat.com/show_bug.cgi?id=1613512
It sounds right, but it seems to have been addressed already.

If there is anything I can do to help solve this problem, please let
me know. Thanks for your help!

Cheers
Richard

On 9/11/18 10:10 AM, Richard Neuboeck wrote:
> Hi,
>
> since I feared that the logs would fill up the partition (again), I
> checked the systems daily and finally found the reason: the
> glusterfs process on the client runs out of memory and gets killed
> by the OOM killer after about four days. Since rsync runs for a
> couple of days longer until it finishes, I never checked that whole
> time frame in the system logs and never stumbled upon the OOM
> message.
>
> Running out of memory on a 128GB RAM system, even with a DB
> occupying ~40% of that, is kind of strange though. Might there be a
> leak?
>
> But this would explain the erratic behavior I've experienced over
> the last 1.5 years while trying to work with our homes on
> glusterfs.
>
> Here is the kernel log message for the killed glusterfs process:
> https://gist.github.com/bleuchien/3d2b87985ecb944c60347d5e8660e36a
>
> I'm checking the brick and client trace logs. But those are 1TB and
> 2TB in size respectively, so searching in them takes a while. I'll
> be creating gists for both logs around the time the process died.
>
> As soon as I have more details I'll post them.
>
> Here you can see a graphical representation of the memory usage of
> this system: https://imgur.com/a/4BINtfr
>
> Cheers
> Richard
>
>
>
> On 31.08.18 08:13, Raghavendra Gowdappa wrote:
>>
>>
>> On Fri, Aug 31, 2018 at 11:11 AM, Richard Neuboeck
>> <hawk@xxxxxxxxxxxxxxxx> wrote:
>>
>> On 08/31/2018 03:50 AM, Raghavendra Gowdappa wrote:
>> > +Mohit. +Milind
>> >
>> > @Mohit/Milind,
>> >
>> > Can you check the logs and see whether you can find anything
>> > relevant?
>>
>> From glances at the system logs nothing out of the ordinary
>> occurred. However I'll start another rsync and take a closer look.
>> It will take a few days.
>>
>> >
>> > On Thu, Aug 30, 2018 at 7:04 PM, Richard Neuboeck
>> > <hawk@xxxxxxxxxxxxxxxx> wrote:
>> >
>> > Hi,
>> >
>> > I'm attaching a shortened version of the client mount log, since
>> > the whole log is about 5.8GB. It includes the initial mount
>> > messages and the last two minutes of log entries.
>> >
>> > It ends very anticlimactically, without an obvious error. Is
>> > there anything specific I should be looking for?
>> >
>> >
>> > Normally I look at the logs around disconnect messages to find
>> > out the reason. But as you said, sometimes one can see just
>> > disconnect messages without any reason. That normally points to
>> > the reason for the disconnect lying in the network rather than
>> > being a Glusterfs-initiated disconnect.
>>
>> The rsync source is serving our homes currently so there are NFS
>> connections 24/7. There don't seem to be any network related
>> interruptions
>>
>>
>> Can you set diagnostics.client-log-level and
>> diagnostics.brick-log-level to TRACE and check the logs at both
>> ends of the connection - client and brick? To reduce the log size,
>> I would suggest logrotating the existing logs and starting with
>> fresh logs just before you begin, so that only the relevant logs
>> are captured. Also, can you take an strace of the client and brick
>> processes using:
>>
>> strace -o <outputfile> -ff -v -p <pid>
>>
>> Attach both logs and the strace output. Let's trace through what
>> the syscalls on the socket return and then decide whether to
>> inspect a tcpdump or not. If you don't want to repeat the tests
>> again, please capture a tcpdump too (on both ends of the
>> connection) and send it to us.
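>>
>> A sketch of the commands involved (assuming the volume is named
>> "home" and that pgrep finds exactly one brick and one client
>> process - adjust to your setup):
>>
>> # raise both log levels to TRACE for the duration of the test
>> gluster volume set home diagnostics.client-log-level TRACE
>> gluster volume set home diagnostics.brick-log-level TRACE
>>
>> # attach strace to the brick process (on the server) and to the
>> # fuse client process (on the client)
>> strace -o /tmp/brick.strace -ff -v -p "$(pgrep -x glusterfsd)"
>> strace -o /tmp/client.strace -ff -v -p "$(pgrep -x glusterfs)"
>>
>> # optionally capture traffic on both ends; 24007 is the management
>> # port, bricks usually listen on 49152 and up
>> tcpdump -i any -w /tmp/gluster.pcap port 24007 or portrange 49152-49251
>>
>> Afterwards both log levels can be set back to INFO the same way.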
>>
>> - a co-worker would be here faster than I could check the logs if
>> the connection to home were broken ;-)
>> Due to this problem the three gluster machines are reduced to
>> testing only, so there is nothing else running on them.
>>
>>
>> >
>> > Cheers
>> > Richard
>> >
>> > On 08/30/2018 02:40 PM, Raghavendra Gowdappa wrote:
>> > > Normally the client logs will give a clue on why the
>> > > disconnections are happening (ping-timeout, wrong port etc.).
>> > > Can you look into the client logs to figure out what's
>> > > happening? If you can't find anything, can you send the client
>> > > logs across?
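>> > >
>> > > As a first pass, something like this might surface the reason
>> > > (a sketch; the log file name depends on the mount point, with
>> > > slashes replaced by dashes):
>> > >
>> > > grep -nE "disconnect|ping.?timer|Connection reset" \
>> > >     /var/log/glusterfs/mnt-home.log
>> > >
>> > > and the surrounding lines can then be read with e.g. grep -C 20.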
>> > >
>> > > On Wed, Aug 29, 2018 at 6:11 PM, Richard Neuboeck
>> > > <hawk@xxxxxxxxxxxxxxxx> wrote:
>> > >
>> > > Hi Gluster Community,
>> > >
>> > > I have problems with a glusterfs 'Transport endpoint not
>> > > connected' connection abort during file transfers that I can
>> > > now replicate every time, but I cannot pinpoint why it is
>> > > happening.
>> > >
>> > > The volume is set up in replica 3 mode and accessed with the
>> > > fuse gluster client. Both client and server are running CentOS
>> > > and its supplied gluster version 3.12.11.
>> > >
>> > > The connection abort happens at different times during the
>> > > rsync run, but it occurs every time I try to sync all our
>> > > files (1.1TB) to the empty volume.
>> > >
>> > > On both the client and the server side I find no errors in the
>> > > gluster log files. rsync logs the obvious transfer problem.
>> > > The only log that shows anything related is the server brick
>> > > log, which states that the connection is shutting down:
>> > >
>> > > [2018-08-18 22:40:35.502510] I [MSGID: 115036]
>> > > [server.c:527:server_rpc_notify] 0-home-server: disconnecting
>> > > connection from
>> > > brax-110405-2018/08/16-08:36:28:575972-home-client-0-0-0
>> > > [2018-08-18 22:40:35.502620] W
>> > > [inodelk.c:499:pl_inodelk_log_cleanup] 0-home-server:
>> > > releasing lock on eaeb0398-fefd-486d-84a7-f13744d1cf10 held by
>> > > {client=0x7f83ec0b3ce0, pid=110423 lk-owner=d0fd5ffb427f0000}
>> > > [2018-08-18 22:40:35.502692] W
>> > > [entrylk.c:864:pl_entrylk_log_cleanup] 0-home-server:
>> > > releasing lock on faa93f7b-6c46-4251-b2b2-abcd2f2613e1 held by
>> > > {client=0x7f83ec0b3ce0, pid=110423 lk-owner=703dd4cc407f0000}
>> > > [2018-08-18 22:40:35.502719] W
>> > > [entrylk.c:864:pl_entrylk_log_cleanup] 0-home-server:
>> > > releasing lock on faa93f7b-6c46-4251-b2b2-abcd2f2613e1 held by
>> > > {client=0x7f83ec0b3ce0, pid=110423 lk-owner=703dd4cc407f0000}
>> > > [2018-08-18 22:40:35.505950] I [MSGID: 101055]
>> > > [client_t.c:443:gf_client_unref] 0-home-server: Shutting down
>> > > connection
>> > > brax-110405-2018/08/16-08:36:28:575972-home-client-0-0-0
>> > >
>> > > Since I've been running another replica 3 setup for oVirt for
>> > > a long time now, which is completely stable, at first I
>> > > thought I had made a mistake by setting different options.
>> > > However, even after I reset those options I'm able to
>> > > reproduce the connection problem.
>> > >
>> > > The unoptimized volume setup looks like this:
>> > >
>> > > Volume Name: home
>> > > Type: Replicate
>> > > Volume ID: c92fa4cc-4a26-41ff-8c70-1dd07f733ac8
>> > > Status: Started
>> > > Snapshot Count: 0
>> > > Number of Bricks: 1 x 3 = 3
>> > > Transport-type: tcp
>> > > Bricks:
>> > > Brick1: sphere-four:/srv/gluster_home/brick
>> > > Brick2: sphere-five:/srv/gluster_home/brick
>> > > Brick3: sphere-six:/srv/gluster_home/brick
>> > > Options Reconfigured:
>> > > nfs.disable: on
>> > > transport.address-family: inet
>> > > cluster.quorum-type: auto
>> > > cluster.server-quorum-type: server
>> > > cluster.server-quorum-ratio: 50%
>> > >
>> > > The following additional options were used before:
>> > >
>> > > performance.cache-size: 5GB
>> > > client.event-threads: 4
>> > > server.event-threads: 4
>> > > cluster.lookup-optimize: on
>> > > features.cache-invalidation: on
>> > > performance.stat-prefetch: on
>> > > performance.cache-invalidation: on
>> > > network.inode-lru-limit: 50000
>> > > features.cache-invalidation-timeout: 600
>> > > performance.md-cache-timeout: 600
>> > > performance.parallel-readdir: on
>> > >
>> > > In this case the gluster servers and also the client are using
>> > > a bonded network device running in adaptive load balancing
>> > > mode.
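>> > >
>> > > In case it helps, the bond status and per-slave link failure
>> > > counters can be checked like this (a sketch - the interface
>> > > name bond0 is an assumption):
>> > >
>> > > grep -E "Slave Interface|MII Status|Link Failure" \
>> > >     /proc/net/bonding/bond0
>> > > ip -s link show bond0
>> > >
>> > > A link-failure count rising during the rsync run would point
>> > > at the network rather than at gluster.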
>> > >
>> > > I've tried using the debug option for the client mount, but
>> > > except for a ~0.5TB log file I didn't get any information that
>> > > seems helpful to me.
>> > >
>> > > Transferring just a couple of GB works without problems.
>> > >
>> > > It may very well be that I'm already blind to the obvious, but
>> > > after many long-running tests I can't find the crux in the
>> > > setup.
>> > >
>> > > Does anyone have an idea how to approach this problem in a way
>> > > that sheds some useful information?
>> > >
>> > > Any help is highly appreciated!
>> > >
>> > > Cheers
>> > > Richard
>> > >
>> > > --
>> > > /dev/null
>> > >
>> > > _______________________________________________
>> > > Gluster-users mailing list
>> > > Gluster-users@xxxxxxxxxxx
>> > > https://lists.gluster.org/mailman/listinfo/gluster-users
>> >
>> >
>> > --
>> > /dev/null
>>
>>
>> --
>> /dev/null
>
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users@xxxxxxxxxxx
> https://lists.gluster.org/mailman/listinfo/gluster-users

--
/dev/null
_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-users