On 12/28/2015 02:32 PM, Soumya Koduri wrote:
----- Original Message -----
From: "Pranith Kumar Karampuri" <pkarampu@xxxxxxxxxx>
To: "Oleksandr Natalenko" <oleksandr@xxxxxxxxxxxxxx>, "Soumya Koduri" <skoduri@xxxxxxxxxx>
Cc: gluster-users@xxxxxxxxxxx, gluster-devel@xxxxxxxxxxx
Sent: Monday, December 28, 2015 9:32:07 AM
Subject: Re: [Gluster-devel] Memory leak in GlusterFS FUSE client
On 12/26/2015 04:45 AM, Oleksandr Natalenko wrote:
Also, here is valgrind output from our custom tool, which traverses a
GlusterFS volume (collecting simple stats) just like the find tool.
NFS-Ganesha is not used in this case.
https://gist.github.com/e4602a50d3c98f7a2766
Hi Oleksandr,
I went through the code. Both NFS-Ganesha and the custom tool use
gfapi, and the leak stems from there. I am not very familiar with
this part of the code, but there seems to be one inode_unref() missing
in the failure path of resolution. I am not sure whether that accounts
for the leaks.
Soumya,
Could this be the issue? review.gluster.org seems to be down, so I
couldn't send the patch. Please ping me on IRC.
diff --git a/api/src/glfs-resolve.c b/api/src/glfs-resolve.c
index b5efcba..52b538b 100644
--- a/api/src/glfs-resolve.c
+++ b/api/src/glfs-resolve.c
@@ -467,9 +467,11 @@ priv_glfs_resolve_at (struct glfs *fs, xlator_t *subvol, inode_t *at,
                        }
                }
-               if (parent && next_component)
+               if (parent && next_component) {
+                       inode_unref (parent);
+                       parent = NULL;
                        /* resolution failed mid-way */
                        goto out;
+               }
                /* At this point, all components up to the last parent directory
                   have been resolved successfully (@parent). Resolution of
                   basename
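The effect of the missing unref can be illustrated with a toy reference-counting model. This is only a sketch in Python, not gfapi code: the Inode class, the resolve() function and the unref_on_failure flag are hypothetical stand-ins that mimic the shape of the "resolution failed mid-way" branch in the diff above, and it assumes that nothing after the early exit drops the parent reference.

```python
class Inode:
    """Toy stand-in for a reference-counted inode (NOT the gfapi inode_t)."""

    def __init__(self, name):
        self.name = name
        self.refs = 0

    def ref(self):
        self.refs += 1
        return self

    def unref(self):
        assert self.refs > 0, "unref without a matching ref"
        self.refs -= 1


def resolve(components, tree, root, unref_on_failure):
    """Walk `components` from `root`; return the final inode or None.

    `tree` maps (parent_name, child_name) to an Inode. On success the
    caller owns one reference on the returned inode.
    """
    parent = root.ref()
    for name in components:
        child = tree.get((parent.name, name))
        if child is None:
            # Resolution failed mid-way: without the extra unref the
            # reference taken on `parent` is never released.
            if unref_on_failure:
                parent.unref()
            return None
        child.ref()
        parent.unref()
        parent = child
    return parent


def leaked_refs(unref_on_failure):
    """Count references still held after a lookup that fails mid-way."""
    root, a = Inode("/"), Inode("a")
    tree = {("/", "a"): a}
    result = resolve(["a", "missing", "file"], tree, root, unref_on_failure)
    assert result is None
    return root.refs + a.refs


print("leaked without the unref:", leaked_refs(False))
print("leaked with the unref:   ", leaked_refs(True))
```

In this model every failed lookup leaves one reference behind on the last directory that did resolve, which is the kind of slow growth a long traversal would accumulate.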
Yes, this could be one of the reasons. There are a few leaks with respect to inode references in gfAPI. See below.
On the GlusterFS side, it looks like the majority of the leaks are related to inodes and their contexts. Possible reasons I can think of are:
1) When there is a graph switch, the old inode table and its entries are not purged (this is a known issue). There was an effort to fix this, but I think it had other side effects and hence was not applied. Maybe we should revive those changes.
2) Related to the above, old entries can be purged when a request comes in with a reference to an old inode (as part of 'glfs_resolve_inode'), provided their reference counts are properly decremented. But this is not happening in gfapi at the moment.
3) Applications should hold and release their references as needed. Certain fixes are needed in this area as well (including the fix provided by Pranith above).
From code inspection, I have made changes to fix a few leaks of cases (2) and (3) with respect to gfAPI.
http://review.gluster.org/#/c/13096 (yet to test the changes)
I haven't yet narrowed down any suspects pertaining only to NFS-Ganesha. Will re-check and update.
I tried similar tests but with a smaller set of files. I could see the
inode_ctx leak even without graph switches involved. I suspect that is
because valgrind checks for memory leaks when the program exits. We
call 'glfs_fini()' to clean up the memory used by gfapi during exit.
Those inode_ctx leaks are the result of some inodes being left behind
during inode_table cleanup. I have submitted the patch below to address
this issue.
http://review.gluster.org/13125
However, this will help only when volumes are un-exported or the
program exits. It still doesn't address the actual RAM consumed by the
application while it is active.
Thanks,
Soumya
Thanks,
Soumya
Pranith
One may see GlusterFS-related leaks here as well.
On Friday, 25 December 2015, 20:28:13 EET, Soumya Koduri wrote:
On 12/24/2015 09:17 PM, Oleksandr Natalenko wrote:
Another addition: it seems to be a memory leak in the GlusterFS API
library, because NFS-Ganesha also consumes a huge amount of memory while
doing an ordinary "find . -type f" via NFSv4.2 on a remote client. Here
is the memory usage:
===
root 5416 34.2 78.5 2047176 1480552 ? Ssl 12:02 117:54 /usr/bin/ganesha.nfsd -L /var/log/ganesha.log -f /etc/ganesha/ganesha.conf -N NIV_EVENT
===
1.4G is too much for simple stat() :(.
Ideas?
nfs-ganesha also has a cache layer which can scale to millions of
entries depending on the number of files/directories being looked up.
However, there are parameters to tune it. So either try stat with only
a few entries, or add the block below to the nfs-ganesha.conf file, set
low limits and check the difference. That may help us narrow down how
much memory is actually consumed by core nfs-ganesha and by gfAPI.
CACHEINODE {
    Cache_Size(uint32, range 1 to UINT32_MAX, default 32633); # cache size
    Entries_HWMark(uint32, range 1 to UINT32_MAX, default 100000); # max no. of entries in the cache
}
Thanks,
Soumya
24.12.2015 16:32, Oleksandr Natalenko wrote:
This is still an issue in 3.7.6. Any suggestions?
24.09.2015 10:14, Oleksandr Natalenko wrote:
In our GlusterFS deployment we've encountered what looks like a memory
leak in the GlusterFS FUSE client.
We use a replicated (×2) GlusterFS volume to store mail (exim+dovecot,
maildir format). Here are the inode stats for both bricks and the
mountpoint:
===
Brick 1 (Server 1):
Filesystem                           Inodes    IUsed     IFree IUse% Mounted on
/dev/mapper/vg_vd1_misc-lv08_mail 578768144 10954918 567813226    2% /bricks/r6sdLV08_vd1_mail
Brick 2 (Server 2):
Filesystem                           Inodes    IUsed     IFree IUse% Mounted on
/dev/mapper/vg_vd0_misc-lv07_mail 578767984 10954913 567813071    2% /bricks/r6sdLV07_vd0_mail
Mountpoint (Server 3):
Filesystem                           Inodes    IUsed     IFree IUse% Mounted on
glusterfs.xxx:mail                578767760 10954915 567812845    2% /var/spool/mail/virtual
===
The glusterfs.xxx domain has two A records, pointing to Server 1 and Server 2.
Here is volume info:
===
Volume Name: mail
Type: Replicate
Volume ID: f564e85c-7aa6-4170-9417-1f501aa98cd2
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: server1.xxx:/bricks/r6sdLV08_vd1_mail/mail
Brick2: server2.xxx:/bricks/r6sdLV07_vd0_mail/mail
Options Reconfigured:
nfs.rpc-auth-allow: 1.2.4.0/24,4.5.6.0/24
features.cache-invalidation-timeout: 10
performance.stat-prefetch: off
performance.quick-read: on
performance.read-ahead: off
performance.flush-behind: on
performance.write-behind: on
performance.io-thread-count: 4
performance.cache-max-file-size: 1048576
performance.cache-size: 67108864
performance.readdir-ahead: off
===
Soon after mounting and starting exim/dovecot, the glusterfs client
process begins to consume a huge amount of RAM:
===
user@server3 ~$ ps aux | grep glusterfs | grep mail
root 28895 14.4 15.0 15510324 14908868 ? Ssl Sep03 4310:05 /usr/sbin/glusterfs --fopen-keep-cache --direct-io-mode=disable --volfile-server=glusterfs.xxx --volfile-id=mail /var/spool/mail/virtual
===
That is, ~15 GiB of RAM.
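For reference, the VSZ and RSS columns of `ps aux` are reported in KiB, so the figures above can be converted with a few lines of Python. This is a minimal sketch; parse_ps_line() and kib_to_gib() are helper names made up for this illustration.

```python
def parse_ps_line(line):
    """Split one `ps aux` line into a dict; VSZ and RSS are in KiB."""
    # ps aux columns: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
    fields = line.split(None, 10)
    return {
        "user": fields[0],
        "pid": int(fields[1]),
        "cpu_pct": float(fields[2]),
        "mem_pct": float(fields[3]),
        "vsz_kib": int(fields[4]),
        "rss_kib": int(fields[5]),
        "command": fields[10],
    }


def kib_to_gib(kib):
    """Convert KiB to GiB (1 GiB = 1024 * 1024 KiB)."""
    return kib / (1024 * 1024)


line = (
    "root 28895 14.4 15.0 15510324 14908868 ? Ssl Sep03 4310:05 "
    "/usr/sbin/glusterfs --fopen-keep-cache --direct-io-mode=disable "
    "--volfile-server=glusterfs.xxx --volfile-id=mail /var/spool/mail/virtual"
)
proc = parse_ps_line(line)
print(f"pid {proc['pid']}: "
      f"RSS {kib_to_gib(proc['rss_kib']):.2f} GiB, "
      f"VSZ {kib_to_gib(proc['vsz_kib']):.2f} GiB")
```

For the line above this gives an RSS of roughly 14.2 GiB and a VSZ of roughly 14.8 GiB.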
We've also tried using the mountpoint within a separate KVM VM with 2
or 3 GiB of RAM, and soon after starting the mail daemons the OOM killer
was triggered for the glusterfs client process.
Mounting the same share via NFS works just fine. We also see much less
iowait and loadavg on the client side with NFS.
We've also tried changing the IO thread count and cache size in order
to limit memory usage, with no luck. As you can see, the total cache
size is 4×64==256 MiB (compared to 15 GiB).
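The estimate above can be reproduced from the volume options. The sketch below follows the same assumption as the text (cache size multiplied by the IO thread count); whether GlusterFS really allocates one cache per IO thread is not confirmed anywhere in this thread. The RSS value is taken from the ps output earlier in this message.

```python
MIB = 1024 * 1024
KIB = 1024

# Values copied from the "Options Reconfigured" list above.
volume_options = {
    "performance.io-thread-count": 4,
    "performance.cache-size": 67108864,  # bytes
}
client_rss_kib = 14908868  # RSS column of the glusterfs client process

cache_mib = volume_options["performance.cache-size"] / MIB
# Assumption made in the text: one cache of this size per IO thread.
estimated_cache_mib = volume_options["performance.io-thread-count"] * cache_mib
rss_mib = client_rss_kib * KIB / MIB
ratio = rss_mib / estimated_cache_mib

print(f"configured cache size: {cache_mib:.0f} MiB")
print(f"estimated total cache: {estimated_cache_mib:.0f} MiB")
print(f"client RSS:            {rss_mib:.0f} MiB ({ratio:.1f}x the estimate)")
```

Even under this generous estimate, the resident memory of the client is more than fifty times larger than the configured caches.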
Enabling or disabling stat-prefetch, read-ahead and readdir-ahead didn't
help either.
Here are volume memory stats:
===
Memory status for volume : mail
----------------------------------------------
Brick : server1.xxx:/bricks/r6sdLV08_vd1_mail/mail
Mallinfo
--------
Arena : 36859904
Ordblks : 10357
Smblks : 519
Hblks : 21
Hblkhd : 30515200
Usmblks : 0
Fsmblks : 53440
Uordblks : 18604144
Fordblks : 18255760
Keepcost : 114112
Mempool Stats
-------------
Name                                    HotCount ColdCount PaddedSizeof AllocCount MaxAlloc   Misses Max-StdAlloc
----                                    -------- --------- ------------ ---------- -------- -------- ------------
mail-server:fd_t                               0      1024          108   30773120      137        0            0
mail-server:dentry_t                       16110       274           84  235676148    16384  1106499         1152
mail-server:inode_t                        16363        21          156  237216876    16384  1876651         1169
mail-trash:fd_t                                0      1024          108          0        0        0            0
mail-trash:dentry_t                            0     32768           84          0        0        0            0
mail-trash:inode_t                             4     32764          156          4        4        0            0
mail-trash:trash_local_t                       0        64         8628          0        0        0            0
mail-changetimerecorder:gf_ctr_local_t         0        64        16540          0        0        0            0
mail-changelog:rpcsvc_request_t                0         8         2828          0        0        0            0
mail-changelog:changelog_local_t               0        64          116          0        0        0            0
mail-bitrot-stub:br_stub_local_t               0       512           84      79204        4        0            0
mail-locks:pl_local_t                          0        32          148    6812757        4        0            0
mail-upcall:upcall_local_t                     0       512          108          0        0        0            0
mail-marker:marker_local_t                     0       128          332      64980        3        0            0
mail-quota:quota_local_t                       0        64          476          0        0        0            0
mail-server:rpcsvc_request_t                   0       512         2828   45462533       34        0            0
glusterfs:struct saved_frame                   0         8          124          2        2        0            0
glusterfs:struct rpc_req                       0         8          588          2        2        0            0
glusterfs:rpcsvc_request_t                     1         7         2828          2        1        0            0
glusterfs:log_buf_t                            5       251          140       3452        6        0            0
glusterfs:data_t                             242     16141           52  480115498      664        0            0
glusterfs:data_pair_t                        230     16153           68  179483528      275        0            0
glusterfs:dict_t                              23      4073          140  303751675      627        0            0
glusterfs:call_stub_t                          0      1024         3764   45290655       34        0            0
glusterfs:call_stack_t                         1      1023         1708   43598469       34        0            0
glusterfs:call_frame_t                         1      4095          172  336219655      184        0            0
----------------------------------------------
Brick : server2.xxx:/bricks/r6sdLV07_vd0_mail/mail
Mallinfo
--------
Arena : 38174720
Ordblks : 9041
Smblks : 507
Hblks : 21
Hblkhd : 30515200
Usmblks : 0
Fsmblks : 51712
Uordblks : 19415008
Fordblks : 18759712
Keepcost : 114848
Mempool Stats
-------------
Name                                    HotCount ColdCount PaddedSizeof AllocCount MaxAlloc   Misses Max-StdAlloc
----                                    -------- --------- ------------ ---------- -------- -------- ------------
mail-server:fd_t                               0      1024          108    2373075      133        0            0
mail-server:dentry_t                       14114      2270           84    3513654    16384     2300          267
mail-server:inode_t                        16374        10          156    6766642    16384   194635         1279
mail-trash:fd_t                                0      1024          108          0        0        0            0
mail-trash:dentry_t                            0     32768           84          0        0        0            0
mail-trash:inode_t                             4     32764          156          4        4        0            0
mail-trash:trash_local_t                       0        64         8628          0        0        0            0
mail-changetimerecorder:gf_ctr_local_t         0        64        16540          0        0        0            0
mail-changelog:rpcsvc_request_t                0         8         2828          0        0        0            0
mail-changelog:changelog_local_t               0        64          116          0        0        0            0
mail-bitrot-stub:br_stub_local_t               0       512           84      71354        4        0            0
mail-locks:pl_local_t                          0        32          148    8135032        4        0            0
mail-upcall:upcall_local_t                     0       512          108          0        0        0            0
mail-marker:marker_local_t                     0       128          332      65005        3        0            0
mail-quota:quota_local_t                       0        64          476          0        0        0            0
mail-server:rpcsvc_request_t                   0       512         2828   12882393       30        0            0
glusterfs:struct saved_frame                   0         8          124          2        2        0            0
glusterfs:struct rpc_req                       0         8          588          2        2        0            0
glusterfs:rpcsvc_request_t                     1         7         2828          2        1        0            0
glusterfs:log_buf_t                            5       251          140       3443        6        0            0
glusterfs:data_t                             242     16141           52  138743429      290        0            0
glusterfs:data_pair_t                        230     16153           68  126649864      270        0            0
glusterfs:dict_t                              23      4073          140   20356289       63        0            0
glusterfs:call_stub_t                          0      1024         3764   13678560       31        0            0
glusterfs:call_stack_t                         1      1023         1708   11011561       30        0            0
glusterfs:call_frame_t                         1      4095          172  125764190      193        0            0
----------------------------------------------
===
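One rough way to read the mempool tables is to multiply HotCount by PaddedSizeof for each pool. The Python sketch below does that for a few Brick 1 rows, on the assumption that HotCount is the number of objects currently in use and PaddedSizeof is the per-object size in bytes; parse_row() and hot_bytes() are hypothetical helpers, not part of any GlusterFS tool.

```python
# A few rows from the Brick 1 "Mempool Stats" table, one pool per line.
BRICK1_ROWS = """\
mail-server:dentry_t 16110 274 84 235676148 16384 1106499 1152
mail-server:inode_t 16363 21 156 237216876 16384 1876651 1169
glusterfs:struct saved_frame 0 8 124 2 2 0 0
glusterfs:data_t 242 16141 52 480115498 664 0 0
"""

FIELDS = ("hot", "cold", "padded_sizeof", "alloc_count",
          "max_alloc", "misses", "max_std_alloc")


def parse_row(line):
    """Parse one mempool row into a dict of integers plus the pool name."""
    # Pool names may contain spaces ("glusterfs:struct saved_frame"),
    # so split the seven numeric columns off from the right.
    parts = line.rsplit(None, len(FIELDS))
    row = dict(zip(FIELDS, map(int, parts[1:])))
    row["name"] = parts[0]
    return row


def hot_bytes(row):
    """Bytes held by in-use objects, assuming HotCount * PaddedSizeof."""
    return row["hot"] * row["padded_sizeof"]


rows = [parse_row(line) for line in BRICK1_ROWS.splitlines() if line.strip()]
for row in sorted(rows, key=hot_bytes, reverse=True):
    print(f"{row['name']:<30} {hot_bytes(row):>10} bytes in use")
print("total:", sum(hot_bytes(r) for r in rows), "bytes")
```

For these Brick 1 pools the in-use memory adds up to a few MiB, which is small next to the client-side figure reported above.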
So, my questions are:
1) What should one do to limit GlusterFS FUSE client memory usage?
2) What should one do to prevent high client loadavg caused by high
iowait from multiple concurrent volume users?
The server/client OS is CentOS 7.1, the GlusterFS server version is
3.7.3, and the GlusterFS client version is 3.7.4.
Is any additional info needed?
_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-users
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-devel