Hey Matthew,

Can you check the memory leak with valgrind? It will be something like this: find the geo-replication process via ps and note all the parameters it was started with. Next, stop geo-replication. Then start it under valgrind:

valgrind --log-file="filename" --tool=memcheck --leak-check=full <georep process binary> <geo rep parameters>

It might help narrow down the problem. A rough sketch of the steps is below.
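A minimal sketch of those steps, assuming the worker shows up in ps as gsyncd and using placeholder session names and a placeholder log path; adjust everything to your actual setup:

# 1. Find the geo-replication worker and note the full command line it was started with:
ps aux | grep gsyncd

# 2. Stop the geo-replication session (<MASTERVOL> and <SLAVEHOST>::<SLAVEVOL> are placeholders):
gluster volume geo-replication <MASTERVOL> <SLAVEHOST>::<SLAVEVOL> stop

# 3. Start the worker again under valgrind, using the command line noted in step 1:
valgrind --log-file=/var/log/glusterfs/georep-valgrind.log \
         --tool=memcheck --leak-check=full \
         <georep process binary> <geo rep parameters>

# 4. After it has been running for a while, check the log for leak records:
grep -A2 "definitely lost" /var/log/glusterfs/georep-valgrind.log

With --leak-check=full the log will contain "definitely lost" records together with the allocation stacks, which is usually the first place to look.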
Best Regards,
Strahil Nikolov

On 14 August 2020 20:22:16 GMT+03:00, Matthew Benstead <matthewb@xxxxxxx> wrote:
>Hi,
>
>We are building a new storage system, and after geo-replication has
>been running for a few hours the server runs out of memory and
>oom-killer starts killing bricks. It runs fine without geo-replication
>on, and the server has 64GB of RAM. I have stopped geo-replication for
>now.
>
>Any ideas what to tune?
>
>[root@storage01 ~]# gluster --version | head -1
>glusterfs 7.7
>
>[root@storage01 ~]# cat /etc/centos-release; uname -r
>CentOS Linux release 7.8.2003 (Core)
>3.10.0-1127.10.1.el7.x86_64
>
>[root@storage01 ~]# df -h /storage2/
>Filesystem            Size  Used  Avail  Use%  Mounted on
>10.0.231.91:/storage  328T  228T  100T   70%   /storage2
>
>[root@storage01 ~]# cat /proc/meminfo | grep MemTotal
>MemTotal: 65412064 kB
>
>[root@storage01 ~]# free -g
>       total  used  free  shared  buff/cache  available
>Mem:      62    18     0       0          43         43
>Swap:      3     0     3
>
>[root@storage01 ~]# gluster volume info
>
>Volume Name: storage
>Type: Distributed-Replicate
>Volume ID: cf94a8f2-324b-40b3-bf72-c3766100ea99
>Status: Started
>Snapshot Count: 0
>Number of Bricks: 3 x (2 + 1) = 9
>Transport-type: tcp
>Bricks:
>Brick1: 10.0.231.91:/data/storage_a/storage
>Brick2: 10.0.231.92:/data/storage_b/storage
>Brick3: 10.0.231.93:/data/storage_c/storage (arbiter)
>Brick4: 10.0.231.92:/data/storage_a/storage
>Brick5: 10.0.231.93:/data/storage_b/storage
>Brick6: 10.0.231.91:/data/storage_c/storage (arbiter)
>Brick7: 10.0.231.93:/data/storage_a/storage
>Brick8: 10.0.231.91:/data/storage_b/storage
>Brick9: 10.0.231.92:/data/storage_c/storage (arbiter)
>Options Reconfigured:
>changelog.changelog: on
>geo-replication.ignore-pid-check: on
>geo-replication.indexing: on
>network.ping-timeout: 10
>features.inode-quota: on
>features.quota: on
>nfs.disable: on
>features.quota-deem-statfs: on
>storage.fips-mode-rchecksum: on
>performance.readdir-ahead: on
>performance.parallel-readdir: on
>cluster.lookup-optimize: on
>client.event-threads: 4
>server.event-threads: 4
>performance.cache-size: 256MB
>
>You can see the memory spike and reduce as bricks are killed - this
>happened twice in the graph below:
>
>
>You can see two brick processes are down:
>
>[root@storage01 ~]# gluster volume status
>Status of volume: storage
>Gluster process                             TCP Port  RDMA Port  Online  Pid
>------------------------------------------------------------------------------
>Brick 10.0.231.91:/data/storage_a/storage   N/A       N/A        N       N/A
>Brick 10.0.231.92:/data/storage_b/storage   49152     0          Y       1627
>Brick 10.0.231.93:/data/storage_c/storage   49152     0          Y       259966
>Brick 10.0.231.92:/data/storage_a/storage   49153     0          Y       1642
>Brick 10.0.231.93:/data/storage_b/storage   49153     0          Y       259975
>Brick 10.0.231.91:/data/storage_c/storage   49153     0          Y       20656
>Brick 10.0.231.93:/data/storage_a/storage   49154     0          Y       259983
>Brick 10.0.231.91:/data/storage_b/storage   N/A       N/A        N       N/A
>Brick 10.0.231.92:/data/storage_c/storage   49154     0          Y       1655
>Self-heal Daemon on localhost               N/A       N/A        Y       20690
>Quota Daemon on localhost                   N/A       N/A        Y       172136
>Self-heal Daemon on 10.0.231.93             N/A       N/A        Y       260010
>Quota Daemon on 10.0.231.93                 N/A       N/A        Y       128115
>Self-heal Daemon on 10.0.231.92             N/A       N/A        Y       1702
>Quota Daemon on 10.0.231.92                 N/A       N/A        Y       128564
>
>Task Status of Volume storage
>------------------------------------------------------------------------------
>There are no active volume tasks
>
>Logs:
>
>[2020-08-13 20:58:22.186540] I [MSGID: 106143] [glusterd-pmap.c:389:pmap_registry_remove] 0-pmap: removing brick (null) on port 49154
>[2020-08-13 20:58:22.196110] I [MSGID: 106005] [glusterd-handler.c:5960:__glusterd_brick_rpc_notify] 0-management: Brick 10.0.231.91:/data/storage_b/storage has disconnected from glusterd.
>[2020-08-13 20:58:22.196752] I [MSGID: 106143] [glusterd-pmap.c:389:pmap_registry_remove] 0-pmap: removing brick /data/storage_b/storage on port 49154
>
>[2020-08-13 21:05:23.418966] I [MSGID: 106143] [glusterd-pmap.c:389:pmap_registry_remove] 0-pmap: removing brick (null) on port 49152
>[2020-08-13 21:05:23.420881] I [MSGID: 106005] [glusterd-handler.c:5960:__glusterd_brick_rpc_notify] 0-management: Brick 10.0.231.91:/data/storage_a/storage has disconnected from glusterd.
>[2020-08-13 21:05:23.421334] I [MSGID: 106143] [glusterd-pmap.c:389:pmap_registry_remove] 0-pmap: removing brick /data/storage_a/storage on port 49152
>
>[Thu Aug 13 13:58:17 2020] Out of memory: Kill process 20664 (glusterfsd) score 422 or sacrifice child
>[Thu Aug 13 13:58:17 2020] Killed process 20664 (glusterfsd), UID 0, total-vm:32884384kB, anon-rss:29625096kB, file-rss:0kB, shmem-rss:0kB
>
>[Thu Aug 13 14:05:18 2020] Out of memory: Kill process 20647 (glusterfsd) score 467 or sacrifice child
>[Thu Aug 13 14:05:18 2020] Killed process 20647 (glusterfsd), UID 0, total-vm:36265116kB, anon-rss:32767744kB, file-rss:520kB, shmem-rss:0kB
>
>glustershd logs:
>
>[2020-08-13 20:58:22.181368] W [socket.c:775:__socket_rwv] 0-storage-client-7: readv on 10.0.231.91:49154 failed (No data available)
>[2020-08-13 20:58:22.185413] I [MSGID: 114018] [client.c:2347:client_rpc_notify] 0-storage-client-7: disconnected from storage-client-7. Client process will keep trying to connect to glusterd until brick's port is available
>[2020-08-13 20:58:25.211872] E [MSGID: 114058] [client-handshake.c:1455:client_query_portmap_cbk] 0-storage-client-7: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.
>[2020-08-13 20:58:25.211934] I [MSGID: 114018] [client.c:2347:client_rpc_notify] 0-storage-client-7: disconnected from storage-client-7. Client process will keep trying to connect to glusterd until brick's port is available
>[2020-08-13 21:00:28.386633] I [socket.c:865:__socket_shutdown] 0-storage-client-7: intentional socket shutdown(8)
>[2020-08-13 21:02:34.565373] I [socket.c:865:__socket_shutdown] 0-storage-client-7: intentional socket shutdown(8)
>[2020-08-13 21:02:58.000263] W [MSGID: 114031] [client-rpc-fops_v2.c:920:client4_0_getxattr_cbk] 0-storage-client-7: remote operation failed. Path: / (00000000-0000-0000-0000-000000000001). Key: trusted.glusterfs.pathinfo [Transport endpoint is not connected]
>[2020-08-13 21:02:58.000460] W [MSGID: 114029] [client-rpc-fops_v2.c:4469:client4_0_getxattr] 0-storage-client-7: failed to send the fop
>[2020-08-13 21:04:40.733823] I [socket.c:865:__socket_shutdown] 0-storage-client-7: intentional socket shutdown(8)
>[2020-08-13 21:05:23.418987] W [socket.c:775:__socket_rwv] 0-storage-client-0: readv on 10.0.231.91:49152 failed (No data available)
>[2020-08-13 21:05:23.419365] I [MSGID: 114018] [client.c:2347:client_rpc_notify] 0-storage-client-0: disconnected from storage-client-0. Client process will keep trying to connect to glusterd until brick's port is available
>[2020-08-13 21:05:26.423218] E [MSGID: 114058] [client-handshake.c:1455:client_query_portmap_cbk] 0-storage-client-0: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.
>[2020-08-13 21:06:46.919942] I [socket.c:865:__socket_shutdown] 0-storage-client-7: intentional socket shutdown(8)
>[2020-08-13 21:05:26.423274] I [MSGID: 114018] [client.c:2347:client_rpc_notify] 0-storage-client-0: disconnected from storage-client-0. Client process will keep trying to connect to glusterd until brick's port is available
>[2020-08-13 21:07:29.667896] I [socket.c:865:__socket_shutdown] 0-storage-client-0: intentional socket shutdown(8)
>[2020-08-13 21:08:05.660858] I [MSGID: 100041] [glusterfsd-mgmt.c:1111:glusterfs_handle_svc_attach] 0-glusterfs: received attach request for volfile-id=shd/storage
>[2020-08-13 21:08:05.660948] I [MSGID: 100040] [glusterfsd-mgmt.c:106:mgmt_process_volfile] 0-glusterfs: No change in volfile, continuing
>[2020-08-13 21:08:05.661326] I [rpc-clnt.c:1963:rpc_clnt_reconfig] 0-storage-client-7: changing port to 49154 (from 0)
>[2020-08-13 21:08:05.664638] I [MSGID: 114057] [client-handshake.c:1375:select_server_supported_programs] 0-storage-client-7: Using Program GlusterFS 4.x v1, Num (1298437), Version (400)
>[2020-08-13 21:08:05.665266] I [MSGID: 114046] [client-handshake.c:1105:client_setvolume_cbk] 0-storage-client-7: Connected to storage-client-7, attached to remote volume '/data/storage_b/storage'.
>[2020-08-13 21:08:05.713533] I [rpc-clnt.c:1963:rpc_clnt_reconfig] 0-storage-client-0: changing port to 49152 (from 0)
>[2020-08-13 21:08:05.716535] I [MSGID: 114057] [client-handshake.c:1375:select_server_supported_programs] 0-storage-client-0: Using Program GlusterFS 4.x v1, Num (1298437), Version (400)
>[2020-08-13 21:08:05.717224] I [MSGID: 114046] [client-handshake.c:1105:client_setvolume_cbk] 0-storage-client-0: Connected to storage-client-0, attached to remote volume '/data/storage_a/storage'.
>
>Thanks,
> -Matthew

________

Community Meeting Calendar:
Schedule - Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://bluejeans.com/441850968

Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-users