Would the geo-rep process be the gsyncd.py processes?
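If so, I'm guessing Strahil's valgrind suggestion would look roughly like this against that worker - the gsyncd.py path, slave host/volume, and worker arguments below are placeholders rather than values from this system, so they'd need adjusting from the ps output:

    # locate the geo-rep worker(s) and note their full command lines
    ps aux | grep '[g]syncd'

    # stop geo-replication, then relaunch the worker under valgrind using the
    # arguments noted above (binary path and arguments are illustrative)
    gluster volume geo-replication storage <slave-host>::<slave-volume> stop
    valgrind --log-file=/tmp/gsyncd-valgrind.log --tool=memcheck --leak-check=full \
        /usr/bin/python /usr/libexec/glusterfs/python/syncdaemon/gsyncd.py <worker args from ps>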
It seems like it's the glusterfsd and auxiliary mounts that are holding all the memory right now...
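I've been eyeballing this with something like the following, sorted by resident set size (the glusterfsd entries are the bricks, the glusterfs entries include the auxiliary geo-rep mounts and the self-heal/quota daemons, and gsyncd.py is the geo-rep worker itself):

    ps -eo pid,rss,args --sort=-rss | grep -E '[g]luster|[g]syncd' | head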
Could this be related to the open-behind bug mentioned here: https://github.com/gluster/glusterfs/issues/1444 and here: https://github.com/gluster/glusterfs/issues/1440 ?
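If open-behind is in play, one low-impact test might be to turn it off on the volume and watch whether the memory growth stops - it isn't listed under our Options Reconfigured, so I assume it's still at the default:

    gluster volume get storage performance.open-behind
    gluster volume set storage performance.open-behind off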
Thanks,
-Matthew
Matthew Benstead
System Administrator
Pacific Climate Impacts Consortium
University of Victoria, UH1
PO Box 1800, STN CSC
Victoria, BC, V8W 2Y2
Phone: 1-250-721-8432
Email: matthewb@xxxxxxx
On 2020-08-14 10:35 p.m., Strahil Nikolov wrote:
Hey Matthew,

Can you check the memory leak with valgrind? It will be something like: find the geo-rep process via ps and note all parameters it was started with. Next, stop geo-rep. Then start it with valgrind:

valgrind --log-file="filename" --tool=memcheck --leak-check=full <georep process binary> <geo rep parameters>

It might help narrow down the problem.

Best Regards,
Strahil Nikolov

On 14 August 2020 20:22:16 GMT+03:00, Matthew Benstead <matthewb@xxxxxxx> wrote:

Hi,

We are building a new storage system, and after geo-replication has been running for a few hours the server runs out of memory and oom-killer starts killing bricks. It runs fine without geo-replication on, and the server has 64GB of RAM. I have stopped geo-replication for now. Any ideas what to tune?

[root@storage01 ~]# gluster --version | head -1
glusterfs 7.7

[root@storage01 ~]# cat /etc/centos-release; uname -r
CentOS Linux release 7.8.2003 (Core)
3.10.0-1127.10.1.el7.x86_64

[root@storage01 ~]# df -h /storage2/
Filesystem            Size  Used  Avail  Use%  Mounted on
10.0.231.91:/storage  328T  228T  100T   70%   /storage2

[root@storage01 ~]# cat /proc/meminfo | grep MemTotal
MemTotal:       65412064 kB

[root@storage01 ~]# free -g
              total        used        free      shared  buff/cache   available
Mem:             62          18           0           0          43          43
Swap:             3           0           3

[root@storage01 ~]# gluster volume info

Volume Name: storage
Type: Distributed-Replicate
Volume ID: cf94a8f2-324b-40b3-bf72-c3766100ea99
Status: Started
Snapshot Count: 0
Number of Bricks: 3 x (2 + 1) = 9
Transport-type: tcp
Bricks:
Brick1: 10.0.231.91:/data/storage_a/storage
Brick2: 10.0.231.92:/data/storage_b/storage
Brick3: 10.0.231.93:/data/storage_c/storage (arbiter)
Brick4: 10.0.231.92:/data/storage_a/storage
Brick5: 10.0.231.93:/data/storage_b/storage
Brick6: 10.0.231.91:/data/storage_c/storage (arbiter)
Brick7: 10.0.231.93:/data/storage_a/storage
Brick8: 10.0.231.91:/data/storage_b/storage
Brick9: 10.0.231.92:/data/storage_c/storage (arbiter)
Options Reconfigured:
changelog.changelog: on
geo-replication.ignore-pid-check: on
geo-replication.indexing: on
network.ping-timeout: 10
features.inode-quota: on
features.quota: on
nfs.disable: on
features.quota-deem-statfs: on
storage.fips-mode-rchecksum: on
performance.readdir-ahead: on
performance.parallel-readdir: on
cluster.lookup-optimize: on
client.event-threads: 4
server.event-threads: 4
performance.cache-size: 256MB

You can see the memory spike and reduce as bricks are killed - this happened twice in the graph below: [inline graph not included]

You can see two brick processes are down:

[root@storage01 ~]# gluster volume status
Status of volume: storage
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.0.231.91:/data/storage_a/storage   N/A       N/A        N       N/A
Brick 10.0.231.92:/data/storage_b/storage   49152     0          Y       1627
Brick 10.0.231.93:/data/storage_c/storage   49152     0          Y       259966
Brick 10.0.231.92:/data/storage_a/storage   49153     0          Y       1642
Brick 10.0.231.93:/data/storage_b/storage   49153     0          Y       259975
Brick 10.0.231.91:/data/storage_c/storage   49153     0          Y       20656
Brick 10.0.231.93:/data/storage_a/storage   49154     0          Y       259983
Brick 10.0.231.91:/data/storage_b/storage   N/A       N/A        N       N/A
Brick 10.0.231.92:/data/storage_c/storage   49154     0          Y       1655
Self-heal Daemon on localhost               N/A       N/A        Y       20690
Quota Daemon on localhost                   N/A       N/A        Y       172136
Self-heal Daemon on 10.0.231.93             N/A       N/A        Y       260010
Quota Daemon on 10.0.231.93                 N/A       N/A        Y       128115
Self-heal Daemon on 10.0.231.92             N/A       N/A        Y       1702
Quota Daemon on 10.0.231.92                 N/A       N/A        Y       128564

Task Status of Volume storage
------------------------------------------------------------------------------
There are no active volume tasks

Logs:

[2020-08-13 20:58:22.186540] I [MSGID: 106143] [glusterd-pmap.c:389:pmap_registry_remove] 0-pmap: removing brick (null) on port 49154
[2020-08-13 20:58:22.196110] I [MSGID: 106005] [glusterd-handler.c:5960:__glusterd_brick_rpc_notify] 0-management: Brick 10.0.231.91:/data/storage_b/storage has disconnected from glusterd.
[2020-08-13 20:58:22.196752] I [MSGID: 106143] [glusterd-pmap.c:389:pmap_registry_remove] 0-pmap: removing brick /data/storage_b/storage on port 49154
[2020-08-13 21:05:23.418966] I [MSGID: 106143] [glusterd-pmap.c:389:pmap_registry_remove] 0-pmap: removing brick (null) on port 49152
[2020-08-13 21:05:23.420881] I [MSGID: 106005] [glusterd-handler.c:5960:__glusterd_brick_rpc_notify] 0-management: Brick 10.0.231.91:/data/storage_a/storage has disconnected from glusterd.
[2020-08-13 21:05:23.421334] I [MSGID: 106143] [glusterd-pmap.c:389:pmap_registry_remove] 0-pmap: removing brick /data/storage_a/storage on port 49152

[Thu Aug 13 13:58:17 2020] Out of memory: Kill process 20664 (glusterfsd) score 422 or sacrifice child
[Thu Aug 13 13:58:17 2020] Killed process 20664 (glusterfsd), UID 0, total-vm:32884384kB, anon-rss:29625096kB, file-rss:0kB, shmem-rss:0kB
[Thu Aug 13 14:05:18 2020] Out of memory: Kill process 20647 (glusterfsd) score 467 or sacrifice child
[Thu Aug 13 14:05:18 2020] Killed process 20647 (glusterfsd), UID 0, total-vm:36265116kB, anon-rss:32767744kB, file-rss:520kB, shmem-rss:0kB

glustershd logs:

[2020-08-13 20:58:22.181368] W [socket.c:775:__socket_rwv] 0-storage-client-7: readv on 10.0.231.91:49154 failed (No data available)
[2020-08-13 20:58:22.185413] I [MSGID: 114018] [client.c:2347:client_rpc_notify] 0-storage-client-7: disconnected from storage-client-7. Client process will keep trying to connect to glusterd until brick's port is available
[2020-08-13 20:58:25.211872] E [MSGID: 114058] [client-handshake.c:1455:client_query_portmap_cbk] 0-storage-client-7: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.
[2020-08-13 20:58:25.211934] I [MSGID: 114018] [client.c:2347:client_rpc_notify] 0-storage-client-7: disconnected from storage-client-7. Client process will keep trying to connect to glusterd until brick's port is available
[2020-08-13 21:00:28.386633] I [socket.c:865:__socket_shutdown] 0-storage-client-7: intentional socket shutdown(8)
[2020-08-13 21:02:34.565373] I [socket.c:865:__socket_shutdown] 0-storage-client-7: intentional socket shutdown(8)
[2020-08-13 21:02:58.000263] W [MSGID: 114031] [client-rpc-fops_v2.c:920:client4_0_getxattr_cbk] 0-storage-client-7: remote operation failed. Path: / (00000000-0000-0000-0000-000000000001). Key: trusted.glusterfs.pathinfo [Transport endpoint is not connected]
[2020-08-13 21:02:58.000460] W [MSGID: 114029] [client-rpc-fops_v2.c:4469:client4_0_getxattr] 0-storage-client-7: failed to send the fop
[2020-08-13 21:04:40.733823] I [socket.c:865:__socket_shutdown] 0-storage-client-7: intentional socket shutdown(8)
[2020-08-13 21:05:23.418987] W [socket.c:775:__socket_rwv] 0-storage-client-0: readv on 10.0.231.91:49152 failed (No data available)
[2020-08-13 21:05:23.419365] I [MSGID: 114018] [client.c:2347:client_rpc_notify] 0-storage-client-0: disconnected from storage-client-0. Client process will keep trying to connect to glusterd until brick's port is available
[2020-08-13 21:05:26.423218] E [MSGID: 114058] [client-handshake.c:1455:client_query_portmap_cbk] 0-storage-client-0: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.
[2020-08-13 21:06:46.919942] I [socket.c:865:__socket_shutdown] 0-storage-client-7: intentional socket shutdown(8)
[2020-08-13 21:05:26.423274] I [MSGID: 114018] [client.c:2347:client_rpc_notify] 0-storage-client-0: disconnected from storage-client-0. Client process will keep trying to connect to glusterd until brick's port is available
[2020-08-13 21:07:29.667896] I [socket.c:865:__socket_shutdown] 0-storage-client-0: intentional socket shutdown(8)
[2020-08-13 21:08:05.660858] I [MSGID: 100041] [glusterfsd-mgmt.c:1111:glusterfs_handle_svc_attach] 0-glusterfs: received attach request for volfile-id=shd/storage
[2020-08-13 21:08:05.660948] I [MSGID: 100040] [glusterfsd-mgmt.c:106:mgmt_process_volfile] 0-glusterfs: No change in volfile, continuing
[2020-08-13 21:08:05.661326] I [rpc-clnt.c:1963:rpc_clnt_reconfig] 0-storage-client-7: changing port to 49154 (from 0)
[2020-08-13 21:08:05.664638] I [MSGID: 114057] [client-handshake.c:1375:select_server_supported_programs] 0-storage-client-7: Using Program GlusterFS 4.x v1, Num (1298437), Version (400)
[2020-08-13 21:08:05.665266] I [MSGID: 114046] [client-handshake.c:1105:client_setvolume_cbk] 0-storage-client-7: Connected to storage-client-7, attached to remote volume '/data/storage_b/storage'.
[2020-08-13 21:08:05.713533] I [rpc-clnt.c:1963:rpc_clnt_reconfig] 0-storage-client-0: changing port to 49152 (from 0)
[2020-08-13 21:08:05.716535] I [MSGID: 114057] [client-handshake.c:1375:select_server_supported_programs] 0-storage-client-0: Using Program GlusterFS 4.x v1, Num (1298437), Version (400)
[2020-08-13 21:08:05.717224] I [MSGID: 114046] [client-handshake.c:1105:client_setvolume_cbk] 0-storage-client-0: Connected to storage-client-0, attached to remote volume '/data/storage_a/storage'.

Thanks,
-Matthew