Hi John...
Thank you for replying.
Here are the results of the tests you asked for, but I do not see anything
abnormal. Actually, your suggestions made me notice that:
1) ceph-fuse 9.2.0 shows the same behaviour but with lower memory
consumption; probably low enough that ceph-fuse does not break on our
machines with less memory.
2) I see a very large number of ceph-fuse threads launched
(around 160).
# ps -T -p 3230 -o command,ppid,pid,spid,vsize,rss,%mem,%cpu | wc -l
157
# ps -T -p 3230 -o command,ppid,pid,spid,vsize,rss,%mem,%cpu | head -n 10
COMMAND                      PPID   PID  SPID     VSZ    RSS %MEM %CPU
ceph-fuse --id mount_user -     1  3230  3230 9935240 339780  0.6  0.0
ceph-fuse --id mount_user -     1  3230  3231 9935240 339780  0.6  0.1
ceph-fuse --id mount_user -     1  3230  3232 9935240 339780  0.6  0.0
ceph-fuse --id mount_user -     1  3230  3233 9935240 339780  0.6  0.0
ceph-fuse --id mount_user -     1  3230  3234 9935240 339780  0.6  0.0
ceph-fuse --id mount_user -     1  3230  3235 9935240 339780  0.6  0.0
ceph-fuse --id mount_user -     1  3230  3236 9935240 339780  0.6  0.0
ceph-fuse --id mount_user -     1  3230  3237 9935240 339780  0.6  0.0
ceph-fuse --id mount_user -     1  3230  3238 9935240 339780  0.6  0.0
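The same count can also be read straight from the kernel's view of the
process (same pid as above):
# grep Threads /proc/3230/status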
I do not see a way to limit the number of ceph-fuse threads
launched, or to cap the maximum virtual memory size each thread
should take.
Do you know how those can be limited?
Cheers
Goncalo
1.> Try running ceph-fuse with valgrind --tool=memcheck to see
if it's leaking
I launched ceph-fuse under valgrind on the cluster where there is
sufficient memory available and where, therefore, there is no object
cacher segfault.
$ valgrind --log-file=/tmp/valgrind-ceph-fuse-10.2.2.txt --tool=memcheck ceph-fuse --id mount_user -k /etc/ceph/ceph.client.mount_user.keyring -m X.X.X.8:6789 -r /cephfs /coepp/cephfs
This is the output I get once I unmount the filesystem after the
user application has finished:
# cat valgrind-ceph-fuse-10.2.2.txt
==12123== Memcheck, a memory error detector
==12123== Copyright (C) 2002-2012, and GNU GPL'd, by Julian Seward et al.
==12123== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
==12123== Command: ceph-fuse --id mount_user -k /etc/ceph/ceph.client.mount_user.keyring -m 192.231.127.8:6789 -r /cephfs /coepp/cephfs
==12123== Parent PID: 11992
==12123==
==12123==
==12123== HEAP SUMMARY:
==12123==     in use at exit: 29,129 bytes in 397 blocks
==12123==   total heap usage: 14,824 allocs, 14,427 frees, 648,030 bytes allocated
==12123==
==12123== LEAK SUMMARY:
==12123==    definitely lost: 16 bytes in 1 blocks
==12123==    indirectly lost: 0 bytes in 0 blocks
==12123==      possibly lost: 11,705 bytes in 273 blocks
==12123==    still reachable: 17,408 bytes in 123 blocks
==12123==         suppressed: 0 bytes in 0 blocks
==12123== Rerun with --leak-check=full to see details of leaked memory
==12123==
==12123== For counts of detected and suppressed errors, rerun with: -v
==12123== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 8 from 6)
==12126==
==12126== HEAP SUMMARY:
==12126==     in use at exit: 9,641 bytes in 73 blocks
==12126==   total heap usage: 31,363,579 allocs, 31,363,506 frees, 41,389,143,617 bytes allocated
==12126==
==12126== LEAK SUMMARY:
==12126==    definitely lost: 28 bytes in 1 blocks
==12126==    indirectly lost: 0 bytes in 0 blocks
==12126==      possibly lost: 0 bytes in 0 blocks
==12126==    still reachable: 9,613 bytes in 72 blocks
==12126==         suppressed: 0 bytes in 0 blocks
==12126== Rerun with --leak-check=full to see details of leaked memory
==12126==
==12126== For counts of detected and suppressed errors, rerun with: -v
==12126== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 17 from 9)
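Since the summary suggests rerunning with --leak-check=full, I can redo the
run along these lines next time (same mount options as above; the %p in the
log file name just keeps the parent and forked child logs separate):
$ valgrind --tool=memcheck --leak-check=full --show-reachable=yes --log-file=/tmp/valgrind-ceph-fuse-10.2.2.%p.txt ceph-fuse --id mount_user -k /etc/ceph/ceph.client.mount_user.keyring -m X.X.X.8:6789 -r /cephfs /coepp/cephfs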
--- * ---
2.> Inspect inode count (ceph daemon <path to asok>
status) to see if it's obeying its limit
This is the output I get once ceph-fuse is mounted but no user
application is running
# ceph daemon /var/run/ceph/ceph-client.mount_user.asok status
{
    "metadata": {
        "ceph_sha1": "45107e21c568dd033c2f0a3107dec8f0b0e58374",
        "ceph_version": "ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)",
        "entity_id": "mount_user",
        "hostname": "<some host name>",
        "mount_point": "\/coepp\/cephfs",
        "root": "\/cephfs"
    },
    "dentry_count": 0,
    "dentry_pinned_count": 0,
    "inode_count": 2,
    "mds_epoch": 817,
    "osd_epoch": 1005,
    "osd_epoch_barrier": 0
}
This is after ceph-fuse has already reached 10 GB of virtual memory,
while user applications are hammering the filesystem:
# ceph daemon /var/run/ceph/ceph-client.mount_user.asok status
{
    "metadata": {
        "ceph_sha1": "45107e21c568dd033c2f0a3107dec8f0b0e58374",
        "ceph_version": "ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)",
        "entity_id": "mount_user",
        "hostname": "<some host name>",
        "mount_point": "\/coepp\/cephfs",
        "root": "\/cephfs"
    },
    "dentry_count": 13,
    "dentry_pinned_count": 2,
    "inode_count": 15,
    "mds_epoch": 817,
    "osd_epoch": 1005,
    "osd_epoch_barrier": 1005
}
Once I kill the applications, I get:
# ceph daemon /var/run/ceph/ceph-client.mount_user.asok status
{
    "metadata": {
        "ceph_sha1": "45107e21c568dd033c2f0a3107dec8f0b0e58374",
        "ceph_version": "ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)",
        "entity_id": "mount_user",
        "hostname": "<some host name>",
        "mount_point": "\/coepp\/cephfs",
        "root": "\/cephfs"
    },
    "dentry_count": 38,
    "dentry_pinned_count": 3,
    "inode_count": 40,
    "mds_epoch": 817,
    "osd_epoch": 1005,
    "osd_epoch_barrier": 1005
}
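If it helps, I can also sample these counters periodically while the jobs
run, with something like the loop below (it assumes jq is installed on the
client; the asok path is the same as above):
# while true; do ceph daemon /var/run/ceph/ceph-client.mount_user.asok status | jq '{inode: .inode_count, dentries: .dentry_count, pinned: .dentry_pinned_count}'; sleep 30; done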
--- * ---
3.> Enable objectcacher debug (debug objectcacher = 10) and
look at the output (from the "trim" lines) to see if it's obeying
its limit
I've mounted ceph-fuse with debug objectcacher = 10 and filled the
host with user applications. I killed the applications when I saw
ceph-fuse's virtual memory stabilize at around 10 GB.
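(For what it's worth, the same debug level can also be toggled on a live
mount through the admin socket instead of remounting, which is how I would
re-enable it quickly if you need more of this log:
# ceph daemon /var/run/ceph/ceph-client.mount_user.asok config set debug_objectcacher 10
# ceph daemon /var/run/ceph/ceph-client.mount_user.asok config set debug_objectcacher 0
)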
Grepping for the trim lines in the log, this is the structure I
found:
2016-07-11 01:55:46.314888 7f04c57fb700 10 objectcacher trim start: bytes: max 209715200 clean 0, objects: max 1000 current 1
2016-07-11 01:55:46.314891 7f04c57fb700 10 objectcacher trim finish: max 209715200 clean 0, objects: max 1000 current 1
2016-07-11 01:55:46.315009 7f04c75fe700 10 objectcacher trim start: bytes: max 209715200 clean 0, objects: max 1000 current 2
2016-07-11 01:55:46.315012 7f04c75fe700 10 objectcacher trim finish: max 209715200 clean 0, objects: max 1000 current 2
<... snip ... >
2016-07-11 01:56:09.444853 7f04c75fe700 10 objectcacher trim start: bytes: max 209715200 clean 204608008, objects: max 1000 current 55
2016-07-11 01:56:09.444855 7f04c75fe700 10 objectcacher trim finish: max 209715200 clean 204608008, objects: max 1000 current 55
2016-07-11 01:56:09.445010 7f04c57fb700 10 objectcacher trim start: bytes: max 209715200 clean 204608008, objects: max 1000 current 55
2016-07-11 01:56:09.445011 7f04c57fb700 10 objectcacher trim finish: max 209715200 clean 204608008, objects: max 1000 current 55
2016-07-11 01:56:09.798269 7f04c75fe700 10 objectcacher trim start: bytes: max 209715200 clean 210943832, objects: max 1000 current 55
2016-07-11 01:56:09.798272 7f04c75fe700 10 objectcacher trim trimming bh[ 0x7f04a8016100 96~59048 0x7f04a8014cd0 (59048) v 3 clean firstbyte=1] waiters = {}
2016-07-11 01:56:09.798284 7f04c75fe700 10 objectcacher trim trimming bh[ 0x7f04b4011550 96~59048 0x7f04b4010430 (59048) v 4 clean firstbyte=1] waiters = {}
2016-07-11 01:56:09.798294 7f04c75fe700 10 objectcacher trim trimming bh[ 0x7f04b001bea0 61760~4132544 0x7f04b4010430 (4132544) v 24 clean firstbyte=71] waiters = {}
2016-07-11 01:56:09.798395 7f04c75fe700 10 objectcacher trim finish: max 209715200 clean 206693192, objects: max 1000 current 55
2016-07-11 01:56:09.798687 7f04c57fb700 10 objectcacher trim start: bytes: max 209715200 clean 206693192, objects: max 1000 current 55
2016-07-11 01:56:09.798689 7f04c57fb700 10 objectcacher trim finish: max 209715200 clean 206693192, objects: max 1000 current 55
<... snip ...>
2016-07-11 01:56:10.494928 7f04c75fe700 10 objectcacher trim start: bytes: max 209715200 clean 210806408, objects: max 1000 current 55
2016-07-11 01:56:10.494931 7f04c75fe700 10 objectcacher trim trimming bh[ 0x7f04b401a760 61760~4132544 0x7f04a8014cd0 (4132544) v 32 clean firstbyte=71] waiters = {}
2016-07-11 01:56:10.495058 7f04c75fe700 10 objectcacher trim finish: max 209715200 clean 206673864, objects: max 1000 current 55
<... snip ...>
2016-07-11 01:57:08.333503 7f04c6bfd700 10 objectcacher trim start: bytes: max 209715200 clean 211528796, objects: max 1000 current 187
2016-07-11 01:57:08.333507 7f04c6bfd700 10 objectcacher trim trimming bh[ 0x7f04b0b370e0 0~4194304 0x7f04b09f2630 (4194304) v 404 clean firstbyte=84] waiters = {}
2016-07-11 01:57:08.333708 7f04c6bfd700 10 objectcacher trim finish: max 209715200 clean 207334492, objects: max 1000 current 187
2016-07-11 01:57:08.616143 7f04c61fc700 10 objectcacher trim start: bytes: max 209715200 clean 209949683, objects: max 1000 current 188
2016-07-11 01:57:08.616146 7f04c61fc700 10 objectcacher trim trimming bh[ 0x7f04a8bfdd60 0~4194304 0x7f04a8bfe660 (4194304) v 407 clean firstbyte=84] waiters = {}
2016-07-11 01:57:08.616303 7f04c61fc700 10 objectcacher trim finish: max 209715200 clean 205755379, objects: max 1000 current 188
2016-07-11 01:57:08.936060 7f04c57fb700 10 objectcacher trim start: bytes: max 209715200 clean 205760010, objects: max 1000 current 189
2016-07-11 01:57:08.936063 7f04c57fb700 10 objectcacher trim finish: max 209715200 clean 205760010, objects: max 1000 current 189
2016-07-11 01:58:02.918322 7f04f27f4e40 10 objectcacher release trimming object[100003dffd9.00000000/head oset 0x7f04d4045c98 wr 566/566]
2016-07-11 01:58:02.918335 7f04f27f4e40 10 objectcacher release trimming object[100003dffd5.00000000/head oset 0x7f04d403e378 wr 564/564]
<... snip...>
2016-07-11 01:58:02.924699 7f04f27f4e40 10 objectcacher release trimming object[100003dffc4.0000000f/head oset 0x7f04d402b308 wr 557/557]
2016-07-11 01:58:02.924717 7f04f27f4e40 10 objectcacher release trimming object[100003dffc5.00000000/head oset 0x7f04d40026b8 wr 541/541]
2016-07-11 01:58:02.924769 7f04f27f4e40 10 objectcacher release trimming object[100003dffc8.00000000/head oset 0x7f04d4027818 wr 547/547]
<... snip...>
2016-07-11 01:58:02.925879 7f04f27f4e40 10 objectcacher release_set on 0x7f04d401a568 dne
2016-07-11 01:58:02.925881 7f04f27f4e40 10 objectcacher release_set on 0x7f04d401b078 dne
2016-07-11 01:58:02.957626 7f04e57fb700 10 objectcacher flusher finish
So, if I am understanding this correctly, every time the amount of
cached data goes above the 200 MB client_oc_size limit it is
trimmed, and the value is kept close to its limit.
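For reference, a one-liner along these lines can pull the clean byte counts
out of the trim lines to confirm they hover around the 200 MB limit (the log
path here is just an example of where the client log might be written):
# grep 'objectcacher trim start' /var/log/ceph/ceph-client.mount_user.log | sed 's/.*clean \([0-9]*\),.*/\1/' | sort -n | tail -5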
--- * ---
4.> See if fuse_disable_pagecache setting makes a difference
It doesn't seem to make a difference. I've set the option in the ceph config:
# grep fuse /etc/ceph/ceph.conf
fuse_disable_pagecache = true
on this client (I guess I do not have to set it across the whole
cluster). Then I remounted cephfs via ceph-fuse and filled the
host with user applications.
Almost immediately, this is what I got:
  PID USER     PR NI  VIRT  RES  SHR S  %CPU %MEM    TIME+ COMMAND
28681 root     20  0 8543m 248m 5948 S   4.0  0.5  0:02.73 ceph-fuse
 5369 root     20  0 3131m 231m  12m S   0.0  0.5 26:22.90 dsm_om_connsvcd
 1429 goncalo  20  0 1595m  98m  32m R  99.5  0.2  1:04.34 python
 1098 goncalo  20  0 1596m  86m  20m R  99.9  0.2  1:04.29 python
  994 goncalo  20  0 1594m  86m  20m R  99.9  0.2  1:04.16 python
31928 goncalo  20  0 1595m  86m  19m R  99.9  0.2  1:04.76 python
16852 goncalo  20  0 1596m  86m  19m R  99.9  0.2  1:06.16 python
16846 goncalo  20  0 1594m  84m  19m R  99.9  0.2  1:06.05 python
29595 goncalo  20  0 1594m  83m  19m R 100.2  0.2  1:05.57 python
29312 goncalo  20  0 1594m  83m  19m R  99.9  0.2  1:05.01 python
31979 goncalo  20  0 1595m  82m  19m R 100.2  0.2  1:04.82 python
29333 goncalo  20  0 1594m  82m  19m R  99.5  0.2  1:04.94 python
29609 goncalo  20  0 1594m  82m  19m R  99.9  0.2  1:05.07 python
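To be sure the running client actually picked up the setting, it can be
queried through the admin socket (same asok path as before):
# ceph daemon /var/run/ceph/ceph-client.mount_user.asok config get fuse_disable_pagecache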
5.> Also, is the version of fuse the same on the nodes running
9.2.0 vs. the nodes running 10.2.2?
For 10.2.2 I compiled against fuse 2.9.7, while for 9.2.0 I compiled
against the default SL6 fuse libs, version 2.8.7. However, as I said
before, I am seeing the same issue with 9.2.0 (although with a bit
less total virtual memory used).
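If it is useful for the comparison, the fuse library each binary actually
uses can be checked on both sets of nodes with something like the following
(assuming ceph-fuse links libfuse dynamically; fusermount -V just reports
the packaged fuse version):
# ldd "$(which ceph-fuse)" | grep -i fuse
# fusermount -V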
On 07/08/2016 10:53 PM, John Spray wrote:
On Fri, Jul 8, 2016 at 8:01 AM, Goncalo Borges
<goncalo.borges@xxxxxxxxxxxxx> wrote:
Hi Brad, Patrick, All...
I think I've understood this second problem. In summary, it is memory
related.
This is how I found the source of the problem:
1./ I copied and adapted the user application to run in another cluster of
ours. The idea was for me to understand the application and run it myself to
collect logs and so on...
2./ Once I submitted it to this other cluster, everything went fine. I was
hammering cephfs from multiple nodes without problems. This pointed to
something different between the two clusters.
3./ I started to look more closely at the segmentation fault message and,
assuming that the names of the methods and functions mean something, the
log seems related to issues in the management of objects in the cache. This
pointed to a memory-related problem.
4./ On the cluster where the application run successfully, machines have
48GB of RAM and 96GB of SWAP (don't know why we have such a large SWAP size,
it is a legacy setup).
# top
top - 00:34:01 up 23 days, 22:21, 1 user, load average: 12.06, 12.12, 10.40
Tasks: 683 total, 13 running, 670 sleeping, 0 stopped, 0 zombie
Cpu(s): 49.7%us, 0.6%sy, 0.0%ni, 49.7%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 49409308k total, 29692548k used, 19716760k free, 433064k buffers
Swap: 98301948k total, 0k used, 98301948k free, 26742484k cached
5./ I have noticed that ceph-fuse (in 10.2.2) consumes about 1.5 GB of
virtual memory when no applications are using the filesystem.
7152 root 20 0 1108m 12m 5496 S 0.0 0.0 0:00.04 ceph-fuse
When I have only one instance of the user application running, ceph-fuse (in
10.2.2) slowly rises over time to about 10 GB of memory usage.
If I submit a large number of user applications simultaneously, ceph-fuse
goes to ~10 GB very fast.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
18563 root 20 0 10.0g 328m 5724 S 4.0 0.7 1:38.00 ceph-fuse
4343 root 20 0 3131m 237m 12m S 0.0 0.5 28:24.56 dsm_om_connsvcd
5536 goncalo 20 0 1599m 99m 32m R 99.9 0.2 31:35.46 python
31427 goncalo 20 0 1597m 89m 20m R 99.9 0.2 31:35.88 python
20504 goncalo 20 0 1599m 89m 20m R 100.2 0.2 31:34.29 python
20508 goncalo 20 0 1599m 89m 20m R 99.9 0.2 31:34.20 python
4973 goncalo 20 0 1599m 89m 20m R 99.9 0.2 31:35.70 python
1331 goncalo 20 0 1597m 88m 20m R 99.9 0.2 31:35.72 python
20505 goncalo 20 0 1597m 88m 20m R 99.9 0.2 31:34.46 python
20507 goncalo 20 0 1599m 87m 20m R 99.9 0.2 31:34.37 python
28375 goncalo 20 0 1597m 86m 20m R 99.9 0.2 31:35.52 python
20503 goncalo 20 0 1597m 85m 20m R 100.2 0.2 31:34.09 python
20506 goncalo 20 0 1597m 84m 20m R 99.5 0.2 31:34.42 python
20502 goncalo 20 0 1597m 83m 20m R 99.9 0.2 31:34.32 python
6./ On the machines where the user had the segfault, we have 16 GB of RAM
and 1GB of SWAP
Mem: 16334244k total, 3590100k used, 12744144k free, 221364k buffers
Swap: 1572860k total, 10512k used, 1562348k free, 2937276k cached
7./ I think what is happening is that once the user submits his set of
jobs, memory usage goes right up to the limit on this type of machine, and
the rise is so fast that ceph-fuse segfaults before the OOM killer can
kill it.
8./ We have run the user application on the same type of machines but with
64 GB of RAM and 1 GB of SWAP, and everything goes fine there as well.
So, in conclusion, our second problem (besides the locks, which were fixed by
Pat's patch) is the memory usage profile of ceph-fuse in 10.2.2, which seems
to be very different from what it was in ceph-fuse 9.2.0.
Are there any ideas on how we can limit the virtual memory usage of ceph-fuse
in 10.2.2?
The fuse client is designed to limit its cache sizes:
client_cache_size (default 16384) inodes of cached metadata
client_oc_size (default 200MB) bytes of cached data
We do run the fuse client with valgrind during testing, so if it is
showing memory leaks in normal usage on your system then that's news.
The top output you've posted seems to show that ceph-fuse only
actually has 328MB resident though?
If you can reproduce the memory growth, then it would be good to:
* Try running ceph-fuse with valgrind --tool=memcheck to see if it's leaking
* Inspect inode count (ceph daemon <path to asok> status) to see if
it's obeying its limit
* Enable objectcacher debug (debug objectcacher = 10) and look at the
output (from the "trim" lines) to see if it's obeying its limit
* See if fuse_disable_pagecache setting makes a difference
Also, is the version of fuse the same on the nodes running 9.2.0 vs.
the nodes running 10.2.2?
John
Cheers
Goncalo
On 07/08/2016 09:54 AM, Brad Hubbard wrote:
Hi Goncalo,
If possible it would be great if you could capture a core file for this with
full debugging symbols (preferably glibc debuginfo as well). How you do
that will depend on the ceph version and your OS, but we can offer help
if required, I'm sure.
Once you have the core do the following.
$ gdb /path/to/ceph-fuse core.XXXX
(gdb) set pag off
(gdb) set log on
(gdb) thread apply all bt
(gdb) thread apply all bt full
Then quit gdb and you should find a file called gdb.txt in your
working directory.
If you could attach that file to http://tracker.ceph.com/issues/16610
Cheers,
Brad
On Fri, Jul 8, 2016 at 12:06 AM, Patrick Donnelly <pdonnell@xxxxxxxxxx>
wrote:
On Thu, Jul 7, 2016 at 2:01 AM, Goncalo Borges
<goncalo.borges@xxxxxxxxxxxxx> wrote:
Unfortunately, the other user application breaks ceph-fuse again (it is a
completely different application than in my previous test).
We have tested it in 4 machines with 4 cores. The user is submitting 16
single core jobs which are all writing different output files (one per job)
to a common dir in cephfs. The first 4 jobs run happily and never break
ceph-fuse. But the remaining 12 jobs, running in the remaining 3 machines,
trigger a segmentation fault, which is completely different from the other
case.
ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
1: (()+0x297fe2) [0x7f54402b7fe2]
2: (()+0xf7e0) [0x7f543ecf77e0]
3: (ObjectCacher::bh_write_scattered(std::list<ObjectCacher::BufferHead*, std::allocator<ObjectCacher::BufferHead*> >&)+0x36) [0x7f5440268086]
4: (ObjectCacher::bh_write_adjacencies(ObjectCacher::BufferHead*, std::chrono::time_point<ceph::time_detail::real_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >, long*, int*)+0x22c) [0x7f5440268a3c]
5: (ObjectCacher::flush(long)+0x1ef) [0x7f5440268cef]
6: (ObjectCacher::flusher_entry()+0xac4) [0x7f5440269a34]
7: (ObjectCacher::FlusherThread::entry()+0xd) [0x7f5440275c6d]
8: (()+0x7aa1) [0x7f543ecefaa1]
9: (clone()+0x6d) [0x7f543df6893d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
This one looks like a very different problem. I've created an issue
here: http://tracker.ceph.com/issues/16610
Thanks for the report and debug log!
--
Patrick Donnelly
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
Goncalo Borges
Research Computing
ARC Centre of Excellence for Particle Physics at the Terascale
School of Physics A28 | University of Sydney, NSW 2006
T: +61 2 93511937