On Tue, Jul 19, 2016 at 1:03 PM, Goncalo Borges <goncalo.borges@xxxxxxxxxxxxx> wrote: > Hi All... > > We do have some good news. > > As promised, I've recompiled ceph 10.2.2 (on an Intel processor without > AVX2) with and without the patch provided by Zheng. It turns out that > Zheng's patch is the solution for the segfaults we saw in ObjectCacher when > ceph-fuse runs on AMD 62xx processors. > > To convince ourselves that the problem was really solved, we executed 40 > jobs (with the user application where the ObjectCacher segfault was seen for > the first time) in a dozen AMD 62xx VMs, and none failed. Before, > ceph-fuse was always segfaulting a couple of minutes after job startup. > > Thank you for all the help. With all the bits and pieces from everyone we > were able to nail this one. > > I am a bit surprised that no other complaints appeared in the mailing list > for either of the issues we saw: first the locking issue and then the > ObjectCacher issue. This makes me think that we are using ceph-fuse in a > different way than others (probably exposing it to real user applications > more heavily than other communities). If you actually need a beta tester > next time, I think it is also in our best interest to participate. > > I do have one last question. > > While searching / googling for tips, I saw an issue claiming that > 'fuse_disable_pagecache' should be set to true in ceph.conf. Can you briefly > explain if this is correct and what the downside of not using it is? (just for me > to understand it). For ceph-fuse there are two caches: one is in ceph-fuse itself, the other is the kernel pagecache. When multiple clients read/write a file at the same time, ceph-fuse needs to disable caching and let reads/writes go to the OSDs directly. ceph-fuse can disable its own cache, but there is no way to disable the kernel pagecache dynamically, so a client may read stale data from the kernel pagecache. > > Thank you in advance > > Cheers > > Goncalo > > > > On 07/15/2016 01:35 PM, Goncalo Borges wrote: > > Thanks Zheng... > > Now that we have identified the exact context in which the segfault appears > (only on AMD 62xx), I think it should be safe to work out in which situations > the crash appears. > > My current compilation is ongoing and I will then test it. > > If it fails, I will recompile including your patch. > > Will report here afterwards. > > Thanks for the feedback. > > Cheers > > Goncalo > > > On 07/15/2016 01:19 PM, Yan, Zheng wrote: > > On Fri, Jul 15, 2016 at 9:35 AM, Goncalo Borges > <goncalo.borges@xxxxxxxxxxxxx> wrote: > > Hi All... > > I've seen that Zheng, Brad, Pat and Greg have already updated or made some > comments on the bug issue. Zheng has also proposed a simple patch. However, I do > have a bit more information. We do think we have identified the source of > the problem and that we can correct it. Therefore, I would propose that you > hold off on any work on the issue until we test our hypothesis. I'll try to > summarize it: > > 1./ After being convinced that the ceph-fuse segfault we saw in specific VMs > was not memory related, I decided to run the user application in multiple > zones of the openstack cloud we use. We scale up our resources by using a > publicly funded openstack cloud which spawns machines (always using the same > image) in multiple availability zones. In the majority of cases we limit > our VMs to the same availability zone because it sits in the > same data center as our infrastructure.
This experiment showed that > ceph-fuse does not segfault in other availability zones with multiple VMs > of different sizes and types. So the problem was restricted to the > availability zone we normally use as our default one. > > 2./ I've then created new VMs of multiple sizes and types in our 'default' > availability zone and reran the user application. This new experiment, > running in newly created VMs, showed ceph-fuse segfaults independent of the > VM type but not in all VMs. For example, in this new test, ceph-fuse was > segfaulting in some 4 and 8 core VMs but not in all. > > 3./ I then decided to inspect the CPU types, and the breakthrough was > that I got a 100% correlation of ceph-fuse segfaults with AMD 62xx processor > VMs. This availability zone has only 2 types of hypervisors: an old one with > AMD 62xx processors, and a new one with Intel processors. If my jobs run in > a VM with Intel, everything is ok. If my jobs run on AMD 62xx, ceph-fuse > segfaults. Actually, the segfault is almost immediate in 4-core AMD 62xx VMs > but takes much more time in 8-core AMD 62xx VMs. > > 4./ I then crosschecked which processors were used in the successful jobs > executed in the other availability zones: several types of Intel, AMD 63xx, > but not AMD 62xx processors. > > 5./ Talking with my awesome colleague Sean, he remembered some discussions > about applications segfaulting on AMD processors when compiled on an Intel > processor with the AVX2 extension. Actually, I compiled ceph 10.2.2 on an Intel > processor with AVX2, but ceph 9.2.0 was compiled several months ago on an > Intel processor without AVX2. The reason for the change is simply that we > upgraded our infrastructure. > > 6./ Then, we compared the cpu flags between AMD 63xx and AMD 62xx. If you look > carefully, 63xx has 'fma f16c tbm bmi1' and 62xx has 'svm'. According to my > colleague, fma and f16c are both AMD extensions which make AMD more > compatible with Intel's AVX extension. > > 63xx > flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat > pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt pdpe1gb lm > rep_good extd_apicid unfair_spinlock pni pclmulqdq ssse3 fma cx16 sse4_1 > sse4_2 x2apic popcnt aes xsave avx f16c hypervisor lahf_lm cmp_legacy > cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw xop fma4 tbm bmi1 > > 62xx > flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat > pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt pdpe1gb lm > rep_good extd_apicid unfair_spinlock pni pclmulqdq ssse3 cx16 sse4_1 sse4_2 > x2apic popcnt aes xsave avx hypervisor lahf_lm cmp_legacy svm cr8_legacy abm > sse4a misalignsse 3dnowprefetch osvw xop fma4 > > > All of the previous arguments may explain why we can use 9.2.0 on AMD 62xx, > and why 10.2.2 works on AMD 63xx but not on AMD 62xx. > > So, we are hoping that compiling 10.2.2 on an Intel processor without the > AVX extensions will solve our problem. > > Does this make sense? > > I have a different theory. ObjectCacher::flush() checks > "bh->last_write <= cutoff" to decide if it should write a buffer head, > but ObjectCacher::bh_write_adjacencies() checks "bh->last_write < > cutoff" (cutoff is the clock time at which ObjectCacher::flush() starts > executing). If there is only one dirty buffer head and its last_write > is equal to cutoff, the segfault happens. Due to hardware > limitations, the AMD 62xx CPU may be unable to provide a high precision time > clock. This explains why the segfault only happens on AMD 62xx. The code > that causes the segfault was introduced in the jewel release, so ceph-fuse > 9.2.0 does not have this problem.
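> To make the failure mode concrete, here is a small self-contained toy model of that mismatch (a simplified illustration only, not the actual Ceph code; the names just echo the real ones):
>
>   #include <cstdio>
>   #include <list>
>
>   struct BufferHead { long last_write; };
>
>   // mirrors bh_write_adjacencies(): gathers candidates with a STRICT '<' check
>   std::list<BufferHead*> gather(BufferHead* bh, long cutoff) {
>     std::list<BufferHead*> blist;
>     if (bh->last_write < cutoff)
>       blist.push_back(bh);
>     return blist;
>   }
>
>   // mirrors bh_write_scattered(): assumes it was handed a non-empty list
>   void write_scattered(std::list<BufferHead*>& blist) {
>     BufferHead* front = blist.front();   // undefined behaviour if blist is empty
>     std::printf("writing bh with last_write=%ld\n", front->last_write);
>   }
>
>   int main() {
>     long cutoff = 1000;                  // "now", sampled when flush() starts
>     BufferHead bh{cutoff};               // the only dirty bh, written in the same clock tick
>     if (bh.last_write <= cutoff) {       // mirrors flush(): '<=' lets it through...
>       std::list<BufferHead*> blist = gather(&bh, cutoff);  // ...but '<' filters it out again
>       if (blist.empty()) {
>         std::printf("empty list: this is where bh_write_scattered() would crash\n");
>         return 1;
>       }
>       write_scattered(blist);
>     }
>     return 0;
>   }
>
> The coarser the clock, the more likely last_write lands exactly on cutoff, which would explain why only the AMD 62xx VMs trip over it.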
Regards > Yan, Zheng > > > > > The compilation takes a while but I will update the issue once I have > finished this last experiment (in the next few days) > > Cheers > Goncalo > > > > On 07/12/2016 09:45 PM, Goncalo Borges wrote: > > Hi All... > > Thank you for continuing to follow this already very long thread. > > Pat and Greg are correct in their assumption regarding the 10 GB virtual > memory footprint I see for the ceph-fuse process in our cluster with 12-core (24 > because of hyperthreading) machines and 96 GB of RAM. The source is glibc > > 1.10. I can reduce / tune the threads' virtual memory usage by setting > MALLOC_ARENA_MAX = 4 (the default is 8 on 64-bit machines) before mounting > the filesystem with ceph-fuse. So, there is no memory leak in ceph-fuse :-) > > The bad news is that, while reading the glibc malloc arena explanation, it > became obvious that the virtual memory footprint scales with the number of > cores. Therefore the 10 GB virtual memory I was seeing on the resources with > 12 cores (24 because of hyperthreading) could not / would not be the same in > the VMs where I get the segfault since they have only 4 cores. > > So, at this point, I know that: > a./ The segfault is always appearing in a set of VMs with 16 GB of RAM and 4 > cores. > b./ The segfault is not appearing in a set of VMs (in principle identical to > the 16 GB ones) but with 16 cores and 64 GB of RAM. > c./ The segfault is not appearing in a physical cluster with machines with > 96 GB of RAM and 12 cores (24 because of hyperthreading), > and I am not so sure anymore that this is memory related. > > For further debugging, I've updated > http://tracker.ceph.com/issues/16610 > with a summary of my findings plus some log files: > - The gdb.txt I get after running > $ gdb /path/to/ceph-fuse core.XXXX > (gdb) set pag off > (gdb) set log on > (gdb) thread apply all bt > (gdb) thread apply all bt full > as advised by Brad > - The debug.out (gzipped) I get after running ceph-fuse in debug mode with > 'debug client 20' and 'debug objectcacher = 20' > > Cheers > Goncalo > ________________________________________ > From: Gregory Farnum [gfarnum@xxxxxxxxxx] > Sent: 12 July 2016 03:07 > To: Goncalo Borges > Cc: John Spray; ceph-users > Subject: Re: ceph-fuse segfaults ( jewel 10.2.2) > > Oh, is this one of your custom-built packages? Are they using > tcmalloc? That difference between VSZ and RSS looks like a glibc > malloc problem. > -Greg > > On Mon, Jul 11, 2016 at 12:04 AM, Goncalo Borges > <goncalo.borges@xxxxxxxxxxxxx> wrote: > > Hi John... > > Thank you for replying. > > Here are the results of the tests you asked for, but I do not see anything abnormal. > Actually, your suggestions made me see that: > > 1) ceph-fuse 9.2.0 is presenting the same behaviour but with less memory > consumption; probably little enough that it doesn't break ceph-fuse on > our machines with less memory. > > 2) I see a tremendous number of ceph-fuse threads launched (around 160).
> > # ps -T -p 3230 -o command,ppid,pid,spid,vsize,rss,%mem,%cpu | wc -l > 157 > > # ps -T -p 3230 -o command,ppid,pid,spid,vsize,rss,%mem,%cpu | head -n 10 > COMMAND PPID PID SPID VSZ RSS %MEM %CPU > ceph-fuse --id mount_user - 1 3230 3230 9935240 339780 0.6 0.0 > ceph-fuse --id mount_user - 1 3230 3231 9935240 339780 0.6 0.1 > ceph-fuse --id mount_user - 1 3230 3232 9935240 339780 0.6 0.0 > ceph-fuse --id mount_user - 1 3230 3233 9935240 339780 0.6 0.0 > ceph-fuse --id mount_user - 1 3230 3234 9935240 339780 0.6 0.0 > ceph-fuse --id mount_user - 1 3230 3235 9935240 339780 0.6 0.0 > ceph-fuse --id mount_user - 1 3230 3236 9935240 339780 0.6 0.0 > ceph-fuse --id mount_user - 1 3230 3237 9935240 339780 0.6 0.0 > ceph-fuse --id mount_user - 1 3230 3238 9935240 339780 0.6 0.0 > > > I do not see a way to actually limit the number of ceph-fuse threads > launched or to limit the max vm size each thread should take. > > Do you know how to limit those options. > > Cheers > > Goncalo > > > > > 1.> Try running ceph-fuse with valgrind --tool=memcheck to see if it's > leaking > > I have launched ceph-fuse with valgrind in the cluster where there is > sufficient memory available, and therefore, there is no object cacher > segfault. > > $ valgrind --log-file=/tmp/valgrind-ceph-fuse-10.2.2.txt --tool=memcheck > ceph-fuse --id mount_user -k /etc/ceph/ceph.client.mount_user.keyring -m > X.X.X.8:6789 -r /cephfs /coepp/cephfs > > This is the output which I get once I unmount the file system after user > application execution > > # cat valgrind-ceph-fuse-10.2.2.txt > ==12123== Memcheck, a memory error detector > ==12123== Copyright (C) 2002-2012, and GNU GPL'd, by Julian Seward et al. > ==12123== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info > ==12123== Command: ceph-fuse --id mount_user -k > /etc/ceph/ceph.client.mount_user.keyring -m 192.231.127.8:6789 -r /cephfs > /coepp/cephfs > ==12123== Parent PID: 11992 > ==12123== > ==12123== > ==12123== HEAP SUMMARY: > ==12123== in use at exit: 29,129 bytes in 397 blocks > ==12123== total heap usage: 14,824 allocs, 14,427 frees, 648,030 bytes > allocated > ==12123== > ==12123== LEAK SUMMARY: > ==12123== definitely lost: 16 bytes in 1 blocks > ==12123== indirectly lost: 0 bytes in 0 blocks > ==12123== possibly lost: 11,705 bytes in 273 blocks > ==12123== still reachable: 17,408 bytes in 123 blocks > ==12123== suppressed: 0 bytes in 0 blocks > ==12123== Rerun with --leak-check=full to see details of leaked memory > ==12123== > ==12123== For counts of detected and suppressed errors, rerun with: -v > ==12123== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 8 from 6) > ==12126== > ==12126== HEAP SUMMARY: > ==12126== in use at exit: 9,641 bytes in 73 blocks > ==12126== total heap usage: 31,363,579 allocs, 31,363,506 frees, > 41,389,143,617 bytes allocated > ==12126== > ==12126== LEAK SUMMARY: > ==12126== definitely lost: 28 bytes in 1 blocks > ==12126== indirectly lost: 0 bytes in 0 blocks > ==12126== possibly lost: 0 bytes in 0 blocks > ==12126== still reachable: 9,613 bytes in 72 blocks > ==12126== suppressed: 0 bytes in 0 blocks > ==12126== Rerun with --leak-check=full to see details of leaked memory > ==12126== > ==12126== For counts of detected and suppressed errors, rerun with: -v > ==12126== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 17 from 9) > > --- * --- > > 2.> Inspect inode count (ceph daemon <path to asok> status) to see if it's > obeying its limit > > This is the output I get once ceph-fuse is mounted but no user 
application > is running > > # ceph daemon /var/run/ceph/ceph-client.mount_user.asok status > { > "metadata": { > "ceph_sha1": "45107e21c568dd033c2f0a3107dec8f0b0e58374", > "ceph_version": "ceph version 10.2.2 > (45107e21c568dd033c2f0a3107dec8f0b0e58374)", > "entity_id": "mount_user", > "hostname": "<some host name>", > "mount_point": "\/coepp\/cephfs", > "root": "\/cephfs" > }, > "dentry_count": 0, > "dentry_pinned_count": 0, > "inode_count": 2, > "mds_epoch": 817, > "osd_epoch": 1005, > "osd_epoch_barrier": 0 > } > > > This is already when ceph-fuse reached 10g of virtual memory, and user > applications are hammering the filesystem. > > # ceph daemon /var/run/ceph/ceph-client.mount_user.asok status > { > "metadata": { > "ceph_sha1": "45107e21c568dd033c2f0a3107dec8f0b0e58374", > "ceph_version": "ceph version 10.2.2 > (45107e21c568dd033c2f0a3107dec8f0b0e58374)", > "entity_id": "mount_user", > "hostname": "<some host name>", > "mount_point": "\/coepp\/cephfs", > "root": "\/cephfs" > }, > "dentry_count": 13, > "dentry_pinned_count": 2, > "inode_count": 15, > "mds_epoch": 817, > "osd_epoch": 1005, > "osd_epoch_barrier": 1005 > } > > Once I kill the applications I get > > # ceph daemon /var/run/ceph/ceph-client.mount_user.asok status > { > "metadata": { > "ceph_sha1": "45107e21c568dd033c2f0a3107dec8f0b0e58374", > "ceph_version": "ceph version 10.2.2 > (45107e21c568dd033c2f0a3107dec8f0b0e58374)", > "entity_id": "mount_user", > "hostname": "<some host name>", > "mount_point": "\/coepp\/cephfs", > "root": "\/cephfs" > }, > "dentry_count": 38, > "dentry_pinned_count": 3, > "inode_count": 40, > "mds_epoch": 817, > "osd_epoch": 1005, > "osd_epoch_barrier": 1005 > } > > --- * --- > > 3.> Enable objectcacher debug (debug objectcacher = 10) and look at the > output (from the "trim" lines) to see if it's obeying its limit > > I've mounted ceph-fuse with debug objectcacher = 10, and filled the host > with user applications. I killed the applications when I saw ceph-fuse > virtual > memory stabilize at around 10g. > > Greping for the trim lines in the log, this is the structure I've found: > > 2016-07-11 01:55:46.314888 7f04c57fb700 10 objectcacher trim start: > bytes: max 209715200 clean 0, objects: max 1000 current 1 > 2016-07-11 01:55:46.314891 7f04c57fb700 10 objectcacher trim finish: > max 209715200 clean 0, objects: max 1000 current 1 > 2016-07-11 01:55:46.315009 7f04c75fe700 10 objectcacher trim start: > bytes: max 209715200 clean 0, objects: max 1000 current 2 > 2016-07-11 01:55:46.315012 7f04c75fe700 10 objectcacher trim finish: > max 209715200 clean 0, objects: max 1000 current 2 > <... snip ... 
> > 2016-07-11 01:56:09.444853 7f04c75fe700 10 objectcacher trim start: > bytes: max 209715200 clean 204608008, objects: max 1000 current 55 > 2016-07-11 01:56:09.444855 7f04c75fe700 10 objectcacher trim finish: > max 209715200 clean 204608008, objects: max 1000 current 55 > 2016-07-11 01:56:09.445010 7f04c57fb700 10 objectcacher trim start: > bytes: max 209715200 clean 204608008, objects: max 1000 current 55 > 2016-07-11 01:56:09.445011 7f04c57fb700 10 objectcacher trim finish: > max 209715200 clean 204608008, objects: max 1000 current 55 > 2016-07-11 01:56:09.798269 7f04c75fe700 10 objectcacher trim start: > bytes: max 209715200 clean 210943832, objects: max 1000 current 55 > 2016-07-11 01:56:09.798272 7f04c75fe700 10 objectcacher trim trimming > bh[ 0x7f04a8016100 96~59048 0x7f04a8014cd0 (59048) v 3 clean firstbyte=1] > waiters = {} > 2016-07-11 01:56:09.798284 7f04c75fe700 10 objectcacher trim trimming > bh[ 0x7f04b4011550 96~59048 0x7f04b4010430 (59048) v 4 clean firstbyte=1] > waiters = {} > 2016-07-11 01:56:09.798294 7f04c75fe700 10 objectcacher trim trimming > bh[ 0x7f04b001bea0 61760~4132544 0x7f04b4010430 (4132544) v 24 clean > firstbyte=71] waiters = {} > 2016-07-11 01:56:09.798395 7f04c75fe700 10 objectcacher trim finish: > max 209715200 clean 206693192, objects: max 1000 current 55 > 2016-07-11 01:56:09.798687 7f04c57fb700 10 objectcacher trim start: > bytes: max 209715200 clean 206693192, objects: max 1000 current 55 > 2016-07-11 01:56:09.798689 7f04c57fb700 10 objectcacher trim finish: > max 209715200 clean 206693192, objects: max 1000 current 55 > <... snip ...> > 2016-07-11 01:56:10.494928 7f04c75fe700 10 objectcacher trim start: > bytes: max 209715200 clean 210806408, objects: max 1000 current 55 > 2016-07-11 01:56:10.494931 7f04c75fe700 10 objectcacher trim trimming > bh[ 0x7f04b401a760 61760~4132544 0x7f04a8014cd0 (4132544) v 32 clean > firstbyte=71] waiters = {} > 2016-07-11 01:56:10.495058 7f04c75fe700 10 objectcacher trim finish: > max 209715200 clean 206673864, objects: max 1000 current 55 > <... snip ...> > 2016-07-11 01:57:08.333503 7f04c6bfd700 10 objectcacher trim start: > bytes: max 209715200 clean 211528796, objects: max 1000 current 187 > 2016-07-11 01:57:08.333507 7f04c6bfd700 10 objectcacher trim trimming > bh[ 0x7f04b0b370e0 0~4194304 0x7f04b09f2630 (4194304) v 404 clean > firstbyte=84] waiters = {} > 2016-07-11 01:57:08.333708 7f04c6bfd700 10 objectcacher trim finish: > max 209715200 clean 207334492, objects: max 1000 current 187 > 2016-07-11 01:57:08.616143 7f04c61fc700 10 objectcacher trim start: > bytes: max 209715200 clean 209949683, objects: max 1000 current 188 > 2016-07-11 01:57:08.616146 7f04c61fc700 10 objectcacher trim trimming > bh[ 0x7f04a8bfdd60 0~4194304 0x7f04a8bfe660 (4194304) v 407 clean > firstbyte=84] waiters = {} > 2016-07-11 01:57:08.616303 7f04c61fc700 10 objectcacher trim finish: > max 209715200 clean 205755379, objects: max 1000 current 188 > 2016-07-11 01:57:08.936060 7f04c57fb700 10 objectcacher trim start: > bytes: max 209715200 clean 205760010, objects: max 1000 current 189 > 2016-07-11 01:57:08.936063 7f04c57fb700 10 objectcacher trim finish: > max 209715200 clean 205760010, objects: max 1000 current 189 > 2016-07-11 01:58:02.918322 7f04f27f4e40 10 objectcacher release trimming > object[100003dffd9.00000000/head oset 0x7f04d4045c98 wr 566/566] > 2016-07-11 01:58:02.918335 7f04f27f4e40 10 objectcacher release trimming > object[100003dffd5.00000000/head oset 0x7f04d403e378 wr 564/564] > <... 
snip...> > 2016-07-11 01:58:02.924699 7f04f27f4e40 10 objectcacher release trimming > object[100003dffc4.0000000f/head oset 0x7f04d402b308 wr 557/557] > 2016-07-11 01:58:02.924717 7f04f27f4e40 10 objectcacher release trimming > object[100003dffc5.00000000/head oset 0x7f04d40026b8 wr 541/541] > 2016-07-11 01:58:02.924769 7f04f27f4e40 10 objectcacher release trimming > object[100003dffc8.00000000/head oset 0x7f04d4027818 wr 547/547] > <... snip...> > 2016-07-11 01:58:02.925879 7f04f27f4e40 10 objectcacher release_set on > 0x7f04d401a568 dne > 2016-07-11 01:58:02.925881 7f04f27f4e40 10 objectcacher release_set on > 0x7f04d401b078 dne > 2016-07-11 01:58:02.957626 7f04e57fb700 10 objectcacher flusher finish > > So, if I am understanding this correctly, every time the client_oc_size > bytes of cached data is above 200M bytes, it is trimmed and the values is > well kepted near its limit. > > > --- * --- > > 4.> See if fuse_disable_pagecache setting makes a difference > > It doesn't seem to make a difference. I've set in ceph config > > # grep fuse /etc/ceph/ceph.conf > fuse_disable_pagecache = true > > on this client (I guess I do not have to do it in the overall cluster). > Then, I've remounted cephfs via ceph-fuse and filled the host with user > applications. > > Almost immediatly this is what I got: > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > 28681 root 20 0 8543m 248m 5948 S 4.0 0.5 0:02.73 ceph-fuse > 5369 root 20 0 3131m 231m 12m S 0.0 0.5 26:22.90 > dsm_om_connsvcd > 1429 goncalo 20 0 1595m 98m 32m R 99.5 0.2 1:04.34 python > 1098 goncalo 20 0 1596m 86m 20m R 99.9 0.2 1:04.29 python > 994 goncalo 20 0 1594m 86m 20m R 99.9 0.2 1:04.16 python > 31928 goncalo 20 0 1595m 86m 19m R 99.9 0.2 1:04.76 python > 16852 goncalo 20 0 1596m 86m 19m R 99.9 0.2 1:06.16 python > 16846 goncalo 20 0 1594m 84m 19m R 99.9 0.2 1:06.05 python > 29595 goncalo 20 0 1594m 83m 19m R 100.2 0.2 1:05.57 python > 29312 goncalo 20 0 1594m 83m 19m R 99.9 0.2 1:05.01 python > 31979 goncalo 20 0 1595m 82m 19m R 100.2 0.2 1:04.82 python > 29333 goncalo 20 0 1594m 82m 19m R 99.5 0.2 1:04.94 python > 29609 goncalo 20 0 1594m 82m 19m R 99.9 0.2 1:05.07 python > > > 5.> Also, is the version of fuse the same on the nodes running 9.2.0 vs. the > nodes running 10.2.2? > > In 10.2.2 I've compiled with fuse 2.9.7 while in 9.2.0 I've compiled against > the default sl6 fuse libs version 2.8.7. However, as I said before, I am > seeing the same issue with 9.2.0 (although with a bit less of used virtual > memory in total). > > > > > On 07/08/2016 10:53 PM, John Spray wrote: > > On Fri, Jul 8, 2016 at 8:01 AM, Goncalo Borges > <goncalo.borges@xxxxxxxxxxxxx> wrote: > > Hi Brad, Patrick, All... > > I think I've understood this second problem. In summary, it is memory > related. > > This is how I found the source of the problem: > > 1./ I copied and adapted the user application to run in another cluster of > ours. The idea was for me to understand the application and run it myself to > collect logs and so on... > > 2./ Once I submit it to this other cluster, every thing went fine. I was > hammering cephfs from multiple nodes without problems. This pointed to > something different between the two clusters. > > 3./ I've started to look better to the segmentation fault message, and > assuming that the names of the methods and functions do mean something, the > log seems related to issues on the management of objects in cache. This > pointed to a memory related problem. 
> > 4./ On the cluster where the application run successfully, machines have > 48GB of RAM and 96GB of SWAP (don't know why we have such a large SWAP size, > it is a legacy setup). > > # top > top - 00:34:01 up 23 days, 22:21, 1 user, load average: 12.06, 12.12, > 10.40 > Tasks: 683 total, 13 running, 670 sleeping, 0 stopped, 0 zombie > Cpu(s): 49.7%us, 0.6%sy, 0.0%ni, 49.7%id, 0.1%wa, 0.0%hi, 0.0%si, > 0.0%st > Mem: 49409308k total, 29692548k used, 19716760k free, 433064k buffers > Swap: 98301948k total, 0k used, 98301948k free, 26742484k cached > > 5./ I have noticed that ceph-fuse (in 10.2.2) consumes about 1.5 GB of > virtual memory when there is no applications using the filesystem. > > 7152 root 20 0 1108m 12m 5496 S 0.0 0.0 0:00.04 ceph-fuse > > When I only have one instance of the user application running, ceph-fuse (in > 10.2.2) slowly rises with time up to 10 GB of memory usage. > > if I submit a large number of user applications simultaneously, ceph-fuse > goes very fast to ~10GB. > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > 18563 root 20 0 10.0g 328m 5724 S 4.0 0.7 1:38.00 ceph-fuse > 4343 root 20 0 3131m 237m 12m S 0.0 0.5 28:24.56 dsm_om_connsvcd > 5536 goncalo 20 0 1599m 99m 32m R 99.9 0.2 31:35.46 python > 31427 goncalo 20 0 1597m 89m 20m R 99.9 0.2 31:35.88 python > 20504 goncalo 20 0 1599m 89m 20m R 100.2 0.2 31:34.29 python > 20508 goncalo 20 0 1599m 89m 20m R 99.9 0.2 31:34.20 python > 4973 goncalo 20 0 1599m 89m 20m R 99.9 0.2 31:35.70 python > 1331 goncalo 20 0 1597m 88m 20m R 99.9 0.2 31:35.72 python > 20505 goncalo 20 0 1597m 88m 20m R 99.9 0.2 31:34.46 python > 20507 goncalo 20 0 1599m 87m 20m R 99.9 0.2 31:34.37 python > 28375 goncalo 20 0 1597m 86m 20m R 99.9 0.2 31:35.52 python > 20503 goncalo 20 0 1597m 85m 20m R 100.2 0.2 31:34.09 python > 20506 goncalo 20 0 1597m 84m 20m R 99.5 0.2 31:34.42 python > 20502 goncalo 20 0 1597m 83m 20m R 99.9 0.2 31:34.32 python > > 6./ On the machines where the user had the segfault, we have 16 GB of RAM > and 1GB of SWAP > > Mem: 16334244k total, 3590100k used, 12744144k free, 221364k buffers > Swap: 1572860k total, 10512k used, 1562348k free, 2937276k cached > > 7./ I think what is happening is that once the user submits his sets of > jobs, the memory usage goes to the very limit on this type machine, and the > raise is actually to fast that ceph-fuse segfaults before OOM Killer can > kill it. > > 8./ We have run the user application in the same type of machines but with > 64 GB of RAM and 1GB of SWAP, and everything goes fine also here. > > > So, in conclusion, our second problem (besides the locks which was fixed by > Pat patch) is the memory usage profile of ceph-fuse in 10.2.2 which seems to > be very different than what it was in ceph-fuse 9.2.0. > > Are there any ideas how can we limit the virtual memory usage of ceph-fuse > in 10.2.2? > > The fuse client is designed to limit its cache sizes: > client_cache_size (default 16384) inodes of cached metadata > client_oc_size (default 200MB) bytes of cached data > > We do run the fuse client with valgrind during testing, so it it is > showing memory leaks in normal usage on your system then that's news. > > The top output you've posted seems to show that ceph-fuse only > actually has 328MB resident though? 
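> (For reference, a minimal ceph.conf sketch of those client-side cache limits; client_cache_size and client_oc_size are the options named above with their stated defaults, and client_oc_max_objects is an assumed name for the related object-count cap, so double-check it against your build:)
>
> [client]
>     # inodes of cached metadata (default 16384)
>     client_cache_size = 16384
>     # bytes of cached file data (default 200 MB)
>     client_oc_size = 209715200
>     # maximum number of cached objects (assumed option name; default believed to be 1000)
>     client_oc_max_objects = 1000
>
> Lowering these on a single client and re-running the jobs would show whether the virtual memory growth tracks the caches at all.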
> > If you can reproduce the memory growth, then it would be good to: > * Try running ceph-fuse with valgrind --tool=memcheck to see if it's > leaking > * Inspect inode count (ceph daemon <path to asok> status) to see if > it's obeying its limit > * Enable objectcacher debug (debug objectcacher = 10) and look at the > output (from the "trim" lines) to see if it's obeying its limit > * See if fuse_disable_pagecache setting makes a difference > > Also, is the version of fuse the same on the nodes running 9.2.0 vs. > the nodes running 10.2.2? > > John > > Cheers > Goncalo > > > > On 07/08/2016 09:54 AM, Brad Hubbard wrote: > > Hi Goncalo, > > If possible it would be great if you could capture a core file for this with > full debugging symbols (preferably glibc debuginfo as well). How you do > that will depend on the ceph version and your OS but we can offfer help > if required I'm sure. > > Once you have the core do the following. > > $ gdb /path/to/ceph-fuse core.XXXX > (gdb) set pag off > (gdb) set log on > (gdb) thread apply all bt > (gdb) thread apply all bt full > > Then quit gdb and you should find a file called gdb.txt in your > working directory. > If you could attach that file to http://tracker.ceph.com/issues/16610 > > Cheers, > Brad > > On Fri, Jul 8, 2016 at 12:06 AM, Patrick Donnelly <pdonnell@xxxxxxxxxx> > wrote: > > On Thu, Jul 7, 2016 at 2:01 AM, Goncalo Borges > <goncalo.borges@xxxxxxxxxxxxx> wrote: > > Unfortunately, the other user application breaks ceph-fuse again (It is a > completely different application then in my previous test). > > We have tested it in 4 machines with 4 cores. The user is submitting 16 > single core jobs which are all writing different output files (one per job) > to a common dir in cephfs. The first 4 jobs run happily and never break > ceph-fuse. But the remaining 12 jobs, running in the remaining 3 machines, > trigger a segmentation fault, which is completely different from the other > case. > > ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374) > 1: (()+0x297fe2) [0x7f54402b7fe2] > 2: (()+0xf7e0) [0x7f543ecf77e0] > 3: (ObjectCacher::bh_write_scattered(std::list<ObjectCacher::BufferHead*, > std::allocator<ObjectCacher::BufferHead*> >&)+0x36) [0x7f5440268086] > 4: (ObjectCacher::bh_write_adjacencies(ObjectCacher::BufferHead*, > std::chrono::time_point<ceph::time_detail::real_clock, > std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >, long*, > int*)+0x22c) [0x7f5440268a3c] > 5: (ObjectCacher::flush(long)+0x1ef) [0x7f5440268cef] > 6: (ObjectCacher::flusher_entry()+0xac4) [0x7f5440269a34] > 7: (ObjectCacher::FlusherThread::entry()+0xd) [0x7f5440275c6d] > 8: (()+0x7aa1) [0x7f543ecefaa1] > 9: (clone()+0x6d) [0x7f543df6893d] > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to > interpret this. > > This one looks like a very different problem. I've created an issue > here: http://tracker.ceph.com/issues/16610 > > Thanks for the report and debug log! 
> > -- > Patrick Donnelly > _______________________________________________ > ceph-users mailing list > ceph-users@xxxxxxxxxxxxxx > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > -- > Goncalo Borges > Research Computing > ARC Centre of Excellence for Particle Physics at the Terascale > School of Physics A28 | University of Sydney, NSW 2006 > T: +61 2 93511937 _______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com