Hi Brad, Patrick, All...

I think I've understood this second problem. In summary, it is memory-related. This is how I found the source of the problem:
So, in conclusion, our second problem (besides the locking issue, which was fixed by Pat's patch) is the memory usage profile of ceph-fuse in 10.2.2, which seems to be very different from what it was in ceph-fuse 9.2.0. Does anyone have ideas on how we can limit the virtual memory usage of ceph-fuse in 10.2.2?

Cheers
Goncalo
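To compare the memory profile of ceph-fuse 9.2.0 and 10.2.2 under the same workload, one option is to sample the kernel's per-process counters while the jobs run. This is a minimal sketch, assuming a Linux client where /proc/<pid>/status is readable; the function name memory_kb is hypothetical and not part of Ceph. VmSize is the virtual size the question above is about, VmPeak its high-water mark, and VmRSS the resident part.

```python
import os
import sys
from typing import Dict

# Fields of interest in /proc/<pid>/status; the kernel reports them in kB.
FIELDS = ("VmPeak", "VmSize", "VmRSS")


def memory_kb(pid: int) -> Dict[str, int]:
    """Return VmPeak, VmSize (virtual) and VmRSS (resident) of a Linux process, in kB."""
    result: Dict[str, int] = {}
    with open(f"/proc/{pid}/status") as status:
        for line in status:
            key, _, value = line.partition(":")
            if key in FIELDS:
                # Lines look like "VmSize:\t  123456 kB"; keep only the number.
                result[key] = int(value.split()[0])
    return result


if __name__ == "__main__":
    # Hypothetical usage: python3 vmstat.py $(pidof ceph-fuse)
    pid = int(sys.argv[1]) if len(sys.argv) > 1 else os.getpid()
    for key, kb in memory_kb(pid).items():
        print(f"{key}: {kb / 1024:.1f} MiB")
```

Sampling this periodically on a 9.2.0 client and a 10.2.2 client running the same jobs would show whether the difference is in virtual size, resident size, or both.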
On 07/08/2016 09:54 AM, Brad Hubbard wrote:
Hi Goncalo,

If possible, it would be great if you could capture a core file for this with full debugging symbols (preferably glibc debuginfo as well). How you do that will depend on the ceph version and your OS, but we can offer help if required, I'm sure.

Once you have the core, do the following:

$ gdb /path/to/ceph-fuse core.XXXX
(gdb) set pag off
(gdb) set log on
(gdb) thread apply all bt
(gdb) thread apply all bt full

Then quit gdb and you should find a file called gdb.txt in your working directory. Please attach that file to http://tracker.ceph.com/issues/16610

Cheers,
Brad

On Fri, Jul 8, 2016 at 12:06 AM, Patrick Donnelly <pdonnell@xxxxxxxxxx> wrote:

On Thu, Jul 7, 2016 at 2:01 AM, Goncalo Borges <goncalo.borges@xxxxxxxxxxxxx> wrote:

Unfortunately, the other user application breaks ceph-fuse again (it is a completely different application than in my previous test). We have tested it on 4 machines with 4 cores each. The user is submitting 16 single-core jobs, which all write different output files (one per job) to a common directory in CephFS. The first 4 jobs run happily and never break ceph-fuse. But the remaining 12 jobs, running on the remaining 3 machines, trigger a segmentation fault, which is completely different from the other case.
ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
 1: (()+0x297fe2) [0x7f54402b7fe2]
 2: (()+0xf7e0) [0x7f543ecf77e0]
 3: (ObjectCacher::bh_write_scattered(std::list<ObjectCacher::BufferHead*, std::allocator<ObjectCacher::BufferHead*> >&)+0x36) [0x7f5440268086]
 4: (ObjectCacher::bh_write_adjacencies(ObjectCacher::BufferHead*, std::chrono::time_point<ceph::time_detail::real_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >, long*, int*)+0x22c) [0x7f5440268a3c]
 5: (ObjectCacher::flush(long)+0x1ef) [0x7f5440268cef]
 6: (ObjectCacher::flusher_entry()+0xac4) [0x7f5440269a34]
 7: (ObjectCacher::FlusherThread::entry()+0xd) [0x7f5440275c6d]
 8: (()+0x7aa1) [0x7f543ecefaa1]
 9: (clone()+0x6d) [0x7f543df6893d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

This one looks like a very different problem. I've created an issue here: http://tracker.ceph.com/issues/16610

Thanks for the report and debug log!

--
Patrick Donnelly
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Goncalo Borges
Research Computing
ARC Centre of Excellence for Particle Physics at the Terascale
School of Physics A28 | University of Sydney, NSW 2006
T: +61 2 93511937
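Brad's interactive gdb steps can also be run non-interactively, which may help when several clients produce core files. This is a minimal sketch, assuming gdb is on PATH and that the core file and the matching ceph-fuse binary (with debug symbols) are available; the helper names gdb_backtrace_argv and capture_backtraces are hypothetical. It uses the unabbreviated forms of `set pag off` and `set log on`; recent gdb releases may warn that `set logging on` is deprecated in favour of `set logging enabled on`.

```python
import shlex
import subprocess
from typing import List

# Full forms of the commands in Brad's session ("set pag off", "set log on", ...).
GDB_COMMANDS = [
    "set pagination off",
    "set logging on",  # gdb writes its output to gdb.txt in the working directory
    "thread apply all bt",
    "thread apply all bt full",
]


def gdb_backtrace_argv(binary: str, core: str) -> List[str]:
    """Build a batch-mode gdb command line that runs GDB_COMMANDS against a core."""
    argv = ["gdb", "-batch"]
    for cmd in GDB_COMMANDS:
        argv += ["-ex", cmd]  # -ex commands run after the binary and core are loaded
    argv += [binary, core]
    return argv


def capture_backtraces(binary: str, core: str) -> int:
    """Run gdb once over the core and return its exit status."""
    return subprocess.run(gdb_backtrace_argv(binary, core)).returncode


if __name__ == "__main__":
    # Print the command line only; the paths are the placeholders from the email.
    print(shlex.join(gdb_backtrace_argv("/path/to/ceph-fuse", "core.XXXX")))
```

shlex.join requires Python 3.8 or later; calling capture_backtraces with real paths would produce the gdb.txt file to attach to the tracker issue.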