Re: Geo-replication causes OOM

Thanks Strahil,

Would the geo rep process be the gsyncd.py processes?

It seems like it's the glusterfsd and auxiliary mounts that are holding all the memory right now...
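
For reference, something like this (just a plain ps sorted by resident size, shown only as an illustration) is how I'm looking at which gluster processes hold the most memory:

ps -eo pid,rss,vsz,comm --sort=-rss | grep -E 'glusterfs|gsyncd' | head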

Could this be related to the open-behind bug mentioned here: https://github.com/gluster/glusterfs/issues/1444 and here: https://github.com/gluster/glusterfs/issues/1440?
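
Would it make sense to test that by disabling open-behind on the volume and watching the memory again? Something like (standard volume set option):

gluster volume set storage performance.open-behind off
# and to revert afterwards:
gluster volume set storage performance.open-behind on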

Thanks,
 -Matthew

Matthew Benstead
System Administrator
Pacific Climate Impacts Consortium
University of Victoria, UH1
PO Box 1800, STN CSC
Victoria, BC, V8W 2Y2
Phone: 1-250-721-8432
Email: matthewb@xxxxxxx
On 2020-08-14 10:35 p.m., Strahil Nikolov wrote:
Hey Matthew,

Can you check the memory leak with valgrind?

It would be something like this:
Find the geo-rep process via ps and note all the parameters it was started with.
Next, stop geo-replication.

Then start it under valgrind:
valgrind --log-file="filename" --tool=memcheck --leak-check=full <georep process binary> <geo rep parameters>

It might help narrow down the problem.
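
For example - the gsyncd.py path below is the usual location on RPM-based installs, so treat it and the log path as placeholders, and reuse the exact parameters you see in ps:

ps aux | grep gsyncd     # note the interpreter, script path and all parameters
# stop geo-replication, then start the worker under valgrind:
valgrind --log-file=/var/tmp/georep-valgrind.log --tool=memcheck --leak-check=full \
    /usr/bin/python /usr/libexec/glusterfs/python/syncdaemon/gsyncd.py <parameters from ps>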

Best Regards,
Strahil Nikolov

On 14 August 2020 at 20:22:16 GMT+03:00, Matthew Benstead <matthewb@xxxxxxx> wrote:
Hi,

We are building a new storage system, and after geo-replication has been running for a few hours the server runs out of memory and oom-killer starts killing bricks. It runs fine with geo-replication turned off, and the server has 64GB of RAM. I have stopped geo-replication for now.

Any ideas what to tune?

[root@storage01 ~]# gluster --version | head -1
glusterfs 7.7

[root@storage01 ~]# cat /etc/centos-release; uname -r
CentOS Linux release 7.8.2003 (Core)
3.10.0-1127.10.1.el7.x86_64

[root@storage01 ~]# df -h /storage2/
Filesystem            Size  Used Avail Use% Mounted on
10.0.231.91:/storage  328T  228T  100T  70% /storage2

[root@storage01 ~]# cat /proc/meminfo  | grep MemTotal
MemTotal:       65412064 kB

[root@storage01 ~]# free -g
              total        used        free      shared  buff/cache   available
Mem:             62          18           0           0          43          43
Swap:             3           0           3


[root@storage01 ~]# gluster volume info

Volume Name: storage
Type: Distributed-Replicate
Volume ID: cf94a8f2-324b-40b3-bf72-c3766100ea99
Status: Started
Snapshot Count: 0
Number of Bricks: 3 x (2 + 1) = 9
Transport-type: tcp
Bricks:
Brick1: 10.0.231.91:/data/storage_a/storage
Brick2: 10.0.231.92:/data/storage_b/storage
Brick3: 10.0.231.93:/data/storage_c/storage (arbiter)
Brick4: 10.0.231.92:/data/storage_a/storage
Brick5: 10.0.231.93:/data/storage_b/storage
Brick6: 10.0.231.91:/data/storage_c/storage (arbiter)
Brick7: 10.0.231.93:/data/storage_a/storage
Brick8: 10.0.231.91:/data/storage_b/storage
Brick9: 10.0.231.92:/data/storage_c/storage (arbiter)
Options Reconfigured:
changelog.changelog: on
geo-replication.ignore-pid-check: on
geo-replication.indexing: on
network.ping-timeout: 10
features.inode-quota: on
features.quota: on
nfs.disable: on
features.quota-deem-statfs: on
storage.fips-mode-rchecksum: on
performance.readdir-ahead: on
performance.parallel-readdir: on
cluster.lookup-optimize: on
client.event-threads: 4
server.event-threads: 4
performance.cache-size: 256MB

You can see the memory spike and then drop back as bricks are killed - this happened twice in the graph below:

[memory usage graph]

You can see two brick processes are down:

[root@storage01 ~]# gluster volume status
Status of volume: storage
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.0.231.91:/data/storage_a/storage   N/A       N/A        N       N/A
Brick 10.0.231.92:/data/storage_b/storage   49152     0          Y       1627
Brick 10.0.231.93:/data/storage_c/storage   49152     0          Y       259966
Brick 10.0.231.92:/data/storage_a/storage   49153     0          Y       1642
Brick 10.0.231.93:/data/storage_b/storage   49153     0          Y       259975
Brick 10.0.231.91:/data/storage_c/storage   49153     0          Y       20656
Brick 10.0.231.93:/data/storage_a/storage   49154     0          Y       259983
Brick 10.0.231.91:/data/storage_b/storage   N/A       N/A        N       N/A
Brick 10.0.231.92:/data/storage_c/storage   49154     0          Y       1655
Self-heal Daemon on localhost               N/A       N/A        Y       20690
Quota Daemon on localhost                   N/A       N/A        Y       172136
Self-heal Daemon on 10.0.231.93             N/A       N/A        Y       260010
Quota Daemon on 10.0.231.93                 N/A       N/A        Y       128115
Self-heal Daemon on 10.0.231.92             N/A       N/A        Y       1702
Quota Daemon on 10.0.231.92                 N/A       N/A        Y       128564

Task Status of Volume storage
------------------------------------------------------------------------------
There are no active volume tasks

Logs:

[2020-08-13 20:58:22.186540] I [MSGID: 106143]
[glusterd-pmap.c:389:pmap_registry_remove] 0-pmap: removing brick
(null) on port 49154
[2020-08-13 20:58:22.196110] I [MSGID: 106005]
[glusterd-handler.c:5960:__glusterd_brick_rpc_notify] 0-management:
Brick 10.0.231.91:/data/storage_b/storage has disconnected from
glusterd.
[2020-08-13 20:58:22.196752] I [MSGID: 106143]
[glusterd-pmap.c:389:pmap_registry_remove] 0-pmap: removing brick
/data/storage_b/storage on port 49154

[2020-08-13 21:05:23.418966] I [MSGID: 106143]
[glusterd-pmap.c:389:pmap_registry_remove] 0-pmap: removing brick
(null) on port 49152
[2020-08-13 21:05:23.420881] I [MSGID: 106005]
[glusterd-handler.c:5960:__glusterd_brick_rpc_notify] 0-management:
Brick 10.0.231.91:/data/storage_a/storage has disconnected from
glusterd.
[2020-08-13 21:05:23.421334] I [MSGID: 106143]
[glusterd-pmap.c:389:pmap_registry_remove] 0-pmap: removing brick
/data/storage_a/storage on port 49152



[Thu Aug 13 13:58:17 2020] Out of memory: Kill process 20664
(glusterfsd) score 422 or sacrifice child
[Thu Aug 13 13:58:17 2020] Killed process 20664 (glusterfsd), UID 0,
total-vm:32884384kB, anon-rss:29625096kB, file-rss:0kB, shmem-rss:0kB

[Thu Aug 13 14:05:18 2020] Out of memory: Kill process 20647
(glusterfsd) score 467 or sacrifice child
[Thu Aug 13 14:05:18 2020] Killed process 20647 (glusterfsd), UID 0,
total-vm:36265116kB, anon-rss:32767744kB, file-rss:520kB, shmem-rss:0kB



glustershd logs:

[2020-08-13 20:58:22.181368] W [socket.c:775:__socket_rwv]
0-storage-client-7: readv on 10.0.231.91:49154 failed (No data
available)
[2020-08-13 20:58:22.185413] I [MSGID: 114018]
[client.c:2347:client_rpc_notify] 0-storage-client-7: disconnected from
storage-client-7. Client process will keep trying to connect to
glusterd until brick's port is available
[2020-08-13 20:58:25.211872] E [MSGID: 114058]
[client-handshake.c:1455:client_query_portmap_cbk] 0-storage-client-7:
failed to get the port number for remote subvolume. Please run 'gluster
volume status' on server to see if brick process is running.
[2020-08-13 20:58:25.211934] I [MSGID: 114018]
[client.c:2347:client_rpc_notify] 0-storage-client-7: disconnected from
storage-client-7. Client process will keep trying to connect to
glusterd until brick's port is available
[2020-08-13 21:00:28.386633] I [socket.c:865:__socket_shutdown]
0-storage-client-7: intentional socket shutdown(8)
[2020-08-13 21:02:34.565373] I [socket.c:865:__socket_shutdown]
0-storage-client-7: intentional socket shutdown(8)
[2020-08-13 21:02:58.000263] W [MSGID: 114031]
[client-rpc-fops_v2.c:920:client4_0_getxattr_cbk] 0-storage-client-7:
remote operation failed. Path: /
(00000000-0000-0000-0000-000000000001). Key: trusted.glusterfs.pathinfo
[Transport endpoint is not connected]
[2020-08-13 21:02:58.000460] W [MSGID: 114029]
[client-rpc-fops_v2.c:4469:client4_0_getxattr] 0-storage-client-7:
failed to send the fop
[2020-08-13 21:04:40.733823] I [socket.c:865:__socket_shutdown]
0-storage-client-7: intentional socket shutdown(8)
[2020-08-13 21:05:23.418987] W [socket.c:775:__socket_rwv]
0-storage-client-0: readv on 10.0.231.91:49152 failed (No data
available)
[2020-08-13 21:05:23.419365] I [MSGID: 114018]
[client.c:2347:client_rpc_notify] 0-storage-client-0: disconnected from
storage-client-0. Client process will keep trying to connect to
glusterd until brick's port is available
[2020-08-13 21:05:26.423218] E [MSGID: 114058]
[client-handshake.c:1455:client_query_portmap_cbk] 0-storage-client-0:
failed to get the port number for remote subvolume. Please run 'gluster
volume status' on server to see if brick process is running.
[2020-08-13 21:06:46.919942] I [socket.c:865:__socket_shutdown]
0-storage-client-7: intentional socket shutdown(8)
[2020-08-13 21:05:26.423274] I [MSGID: 114018]
[client.c:2347:client_rpc_notify] 0-storage-client-0: disconnected from
storage-client-0. Client process will keep trying to connect to
glusterd until brick's port is available
[2020-08-13 21:07:29.667896] I [socket.c:865:__socket_shutdown]
0-storage-client-0: intentional socket shutdown(8)
[2020-08-13 21:08:05.660858] I [MSGID: 100041]
[glusterfsd-mgmt.c:1111:glusterfs_handle_svc_attach] 0-glusterfs:
received attach request for volfile-id=shd/storage
[2020-08-13 21:08:05.660948] I [MSGID: 100040]
[glusterfsd-mgmt.c:106:mgmt_process_volfile] 0-glusterfs: No change in
volfile, continuing
[2020-08-13 21:08:05.661326] I [rpc-clnt.c:1963:rpc_clnt_reconfig]
0-storage-client-7: changing port to 49154 (from 0)
[2020-08-13 21:08:05.664638] I [MSGID: 114057]
[client-handshake.c:1375:select_server_supported_programs]
0-storage-client-7: Using Program GlusterFS 4.x v1, Num (1298437),
Version (400)
[2020-08-13 21:08:05.665266] I [MSGID: 114046]
[client-handshake.c:1105:client_setvolume_cbk] 0-storage-client-7:
Connected to storage-client-7, attached to remote volume
'/data/storage_b/storage'.
[2020-08-13 21:08:05.713533] I [rpc-clnt.c:1963:rpc_clnt_reconfig]
0-storage-client-0: changing port to 49152 (from 0)
[2020-08-13 21:08:05.716535] I [MSGID: 114057]
[client-handshake.c:1375:select_server_supported_programs]
0-storage-client-0: Using Program GlusterFS 4.x v1, Num (1298437),
Version (400)
[2020-08-13 21:08:05.717224] I [MSGID: 114046]
[client-handshake.c:1105:client_setvolume_cbk] 0-storage-client-0:
Connected to storage-client-0, attached to remote volume
'/data/storage_a/storage'.


Thanks,
 -Matthew

________



Community Meeting Calendar:

Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://bluejeans.com/441850968

Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-users
