On Fri, Jul 8, 2016 at 8:01 AM, Goncalo Borges <goncalo.borges@xxxxxxxxxxxxx> wrote:
> Hi Brad, Patrick, All...
>
> I think I've understood this second problem. In summary, it is memory
> related.
>
> This is how I found the source of the problem:
>
> 1./ I copied and adapted the user application to run in another cluster of
> ours. The idea was for me to understand the application and run it myself to
> collect logs and so on...
>
> 2./ Once I submitted it to this other cluster, everything went fine. I was
> hammering cephfs from multiple nodes without problems. This pointed to
> something different between the two clusters.
>
> 3./ I started to look more closely at the segmentation fault message, and,
> assuming that the names of the methods and functions do mean something, the
> log seems related to issues in the management of objects in the cache. This
> pointed to a memory-related problem.
>
> 4./ On the cluster where the application ran successfully, machines have
> 48GB of RAM and 96GB of swap (I don't know why we have such a large swap
> size; it is a legacy setup).
>
> # top
> top - 00:34:01 up 23 days, 22:21,  1 user,  load average: 12.06, 12.12, 10.40
> Tasks: 683 total,  13 running, 670 sleeping,   0 stopped,   0 zombie
> Cpu(s): 49.7%us,  0.6%sy,  0.0%ni, 49.7%id,  0.1%wa,  0.0%hi,  0.0%si,  0.0%st
> Mem:  49409308k total, 29692548k used, 19716760k free,   433064k buffers
> Swap: 98301948k total,        0k used, 98301948k free, 26742484k cached
>
> 5./ I have noticed that ceph-fuse (in 10.2.2) consumes about 1.5 GB of
> virtual memory when there are no applications using the filesystem.
>
>  7152 root     20   0 1108m  12m 5496 S   0.0  0.0   0:00.04 ceph-fuse
>
> When I have only one instance of the user application running, ceph-fuse (in
> 10.2.2) slowly rises with time up to 10 GB of memory usage.
>
> If I submit a large number of user applications simultaneously, ceph-fuse
> goes very fast to ~10GB.
>
>   PID USER     PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
> 18563 root     20   0 10.0g 328m 5724 S   4.0  0.7   1:38.00 ceph-fuse
>  4343 root     20   0 3131m 237m  12m S   0.0  0.5  28:24.56 dsm_om_connsvcd
>  5536 goncalo  20   0 1599m  99m  32m R  99.9  0.2  31:35.46 python
> 31427 goncalo  20   0 1597m  89m  20m R  99.9  0.2  31:35.88 python
> 20504 goncalo  20   0 1599m  89m  20m R 100.2  0.2  31:34.29 python
> 20508 goncalo  20   0 1599m  89m  20m R  99.9  0.2  31:34.20 python
>  4973 goncalo  20   0 1599m  89m  20m R  99.9  0.2  31:35.70 python
>  1331 goncalo  20   0 1597m  88m  20m R  99.9  0.2  31:35.72 python
> 20505 goncalo  20   0 1597m  88m  20m R  99.9  0.2  31:34.46 python
> 20507 goncalo  20   0 1599m  87m  20m R  99.9  0.2  31:34.37 python
> 28375 goncalo  20   0 1597m  86m  20m R  99.9  0.2  31:35.52 python
> 20503 goncalo  20   0 1597m  85m  20m R 100.2  0.2  31:34.09 python
> 20506 goncalo  20   0 1597m  84m  20m R  99.5  0.2  31:34.42 python
> 20502 goncalo  20   0 1597m  83m  20m R  99.9  0.2  31:34.32 python
>
> 6./ On the machines where the user had the segfault, we have 16 GB of RAM
> and 1GB of swap:
>
> Mem:  16334244k total,  3590100k used, 12744144k free,   221364k buffers
> Swap:  1572860k total,    10512k used,  1562348k free,  2937276k cached
>
> 7./ I think what is happening is that once the user submits his set of
> jobs, the memory usage goes to the very limit on this type of machine, and
> the rise is so fast that ceph-fuse segfaults before the OOM killer can
> kill it.
>
> 8./ We have run the user application on the same type of machines but with
> 64 GB of RAM and 1GB of swap, and everything goes fine there as well.
>
> So, in conclusion, our second problem (besides the locks issue, which was
> fixed by Pat's patch) is the memory usage profile of ceph-fuse in 10.2.2,
> which seems to be very different from what it was in ceph-fuse 9.2.0.
>
> Are there any ideas on how we can limit the virtual memory usage of
> ceph-fuse in 10.2.2?

The fuse client is designed to limit its cache sizes:

  client_cache_size (default 16384): inodes of cached metadata
  client_oc_size (default 200MB): bytes of cached data

We do run the fuse client with valgrind during testing, so if it is showing
memory leaks in normal usage on your system then that's news. The top output
you've posted seems to show that ceph-fuse only actually has 328MB resident,
though?

If you can reproduce the memory growth, then it would be good to:

 * Try running ceph-fuse with valgrind --tool=memcheck to see if it's leaking
 * Inspect the inode count (ceph daemon <path to asok> status) to see if it's
   obeying its limit
 * Enable objectcacher debugging (debug objectcacher = 10) and look at the
   output (the "trim" lines) to see if it's obeying its limit
 * See if the fuse_disable_pagecache setting makes a difference

(Rough command-line examples of these checks are sketched below, after the
quoted trace.)

Also, is the version of fuse the same on the nodes running 9.2.0 vs. the
nodes running 10.2.2?

John

> Cheers
> Goncalo
>
>
> On 07/08/2016 09:54 AM, Brad Hubbard wrote:
>
> Hi Goncalo,
>
> If possible it would be great if you could capture a core file for this
> with full debugging symbols (preferably glibc debuginfo as well). How you
> do that will depend on the ceph version and your OS, but we can offer help
> if required, I'm sure.
>
> Once you have the core, do the following:
>
> $ gdb /path/to/ceph-fuse core.XXXX
> (gdb) set pag off
> (gdb) set log on
> (gdb) thread apply all bt
> (gdb) thread apply all bt full
>
> Then quit gdb and you should find a file called gdb.txt in your
> working directory.
> If you could attach that file to http://tracker.ceph.com/issues/16610
>
> Cheers,
> Brad
>
> On Fri, Jul 8, 2016 at 12:06 AM, Patrick Donnelly <pdonnell@xxxxxxxxxx>
> wrote:
>
> On Thu, Jul 7, 2016 at 2:01 AM, Goncalo Borges
> <goncalo.borges@xxxxxxxxxxxxx> wrote:
>
> Unfortunately, the other user application breaks ceph-fuse again (it is a
> completely different application than in my previous test).
>
> We have tested it on 4 machines with 4 cores. The user is submitting 16
> single-core jobs which are all writing different output files (one per
> job) to a common dir in cephfs. The first 4 jobs run happily and never
> break ceph-fuse. But the remaining 12 jobs, running on the remaining 3
> machines, trigger a segmentation fault, which is completely different from
> the other case.
>
> ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
>  1: (()+0x297fe2) [0x7f54402b7fe2]
>  2: (()+0xf7e0) [0x7f543ecf77e0]
>  3: (ObjectCacher::bh_write_scattered(std::list<ObjectCacher::BufferHead*,
>    std::allocator<ObjectCacher::BufferHead*> >&)+0x36) [0x7f5440268086]
>  4: (ObjectCacher::bh_write_adjacencies(ObjectCacher::BufferHead*,
>    std::chrono::time_point<ceph::time_detail::real_clock,
>    std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >,
>    long*, int*)+0x22c) [0x7f5440268a3c]
>  5: (ObjectCacher::flush(long)+0x1ef) [0x7f5440268cef]
>  6: (ObjectCacher::flusher_entry()+0xac4) [0x7f5440269a34]
>  7: (ObjectCacher::FlusherThread::entry()+0xd) [0x7f5440275c6d]
>  8: (()+0x7aa1) [0x7f543ecefaa1]
>  9: (clone()+0x6d) [0x7f543df6893d]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
> to interpret this.
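For anyone who wants to follow John's suggestions above, the checks might
look roughly like the following on one of the affected nodes. This is only a
sketch: the mount point and admin socket path are placeholders (the socket
normally lives under /var/run/ceph/ and its exact name depends on the client
name and PID), and the same settings can instead be put in the [client]
section of ceph.conf before mounting.

# Run ceph-fuse under valgrind to check for leaks; -f keeps it in the
# foreground so valgrind can follow it
valgrind --tool=memcheck --leak-check=full ceph-fuse -f /path/to/mountpoint

# Check the cached inode count against its configured limit
ceph daemon /path/to/asok status
ceph daemon /path/to/asok config get client_cache_size

# Turn up objectcacher logging, then watch the "trim" lines in the client log
ceph daemon /path/to/asok config set debug_objectcacher 10

# To test fuse_disable_pagecache, set it in the [client] section of
# ceph.conf and remount the filesystem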
>
> This one looks like a very different problem. I've created an issue
> here: http://tracker.ceph.com/issues/16610
>
> Thanks for the report and debug log!
>
> --
> Patrick Donnelly
>
> --
> Goncalo Borges
> Research Computing
> ARC Centre of Excellence for Particle Physics at the Terascale
> School of Physics A28 | University of Sydney, NSW 2006
> T: +61 2 93511937

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
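As a postscript to Goncalo's observations in 5./ to 7./ above: a simple,
Ceph-independent way to confirm that it is ceph-fuse whose footprint grows
while the jobs run is to sample its /proc entry. This is just a sketch; it
assumes a single ceph-fuse process per node, and the 30-second interval is
arbitrary.

# Log ceph-fuse's virtual and resident size every 30 seconds
while true; do
    date
    grep -E 'VmSize|VmRSS' /proc/$(pidof ceph-fuse)/status
    sleep 30
done

Comparing how quickly VmSize approaches RAM plus swap on the 16 GB nodes
versus the 48 GB and 64 GB ones would help support (or rule out) the theory
in point 7./ that the client grows faster than the OOM killer can react.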