Re: ceph-fuse segfaults ( jewel 10.2.2)

On Fri, Jul 15, 2016 at 11:35:10AM +1000, Goncalo Borges wrote:
> Hi All...
> 
> I've seen that Zheng, Brad, Pat and Greg have already updated or commented
> on the bug issue, and Zheng has also proposed a simple patch. However, I do
> have a bit more information. We think we have identified the source of
> the problem and that we can correct it. Therefore, I would propose that you
> hold any work on the issue until we test our hypothesis. I'll try to
> summarize it:
> 
> 1./ After being convinced that the ceph-fuse segfault we saw in specific VMs
> was not memory related, I decided to run the user application in multiple
> zones of the openstack cloud we use. We scale up our resources by using a
> publicly funded openstack cloud which spawns machines (always using the same
> image) in multiple availability zones. In the majority of cases we limit
> our VMs to the same availability zone because it sits in the same data
> center as our infrastructure. This experiment showed that ceph-fuse does
> not segfault in other availability zones, with multiple VMs of different
> sizes and types. So the problem was restricted to the availability zone we
> normally use as our default one.
> 
> 2./ I've then created new VMs of multiple sizes and types in our 'default'
> availability zone and reran the user application. This new experiment,
> running in newly created VMs, showed ceph-fuse segfaulting independently of
> the VM type, but not in all VMs. For example, in this new test, ceph-fuse
> was segfaulting in some 4- and 8-core VMs but not in all of them.
> 
> 3./ I've then decided to inspect the CPU types, and the breakthrough was
> that I got a 100% correlation of ceph-fuse segfaults with AMD 62xx processor
> VMs. This availability zone has only 2 types of hypervisors: an old one with
> AMD 62xx processors, and a new one with Intel processors. If my jobs run in
> a VM with Intel, everything is OK. If my jobs run in an AMD 62xx VM,
> ceph-fuse segfaults. Actually, the segfault is almost immediate in 4-core
> AMD 62xx VMs but takes much longer in 8-core AMD 62xx VMs.
> 
> 4./ I've then cross-checked which processors were used for the successful
> jobs executed in the other availability zones: several types of Intel and
> AMD 63xx processors, but no AMD 62xx.
> 
> 5./ Talking with my awesome colleague Sean, he remembered some discussions
> about applications segfaulting on AMD processors when compiled on an Intel
> processor with the AVX2 extension. Indeed, I compiled ceph 10.2.2 on an
> Intel processor with AVX2, but ceph 9.2.0 was compiled several months ago
> on an Intel processor without AVX2. The reason for the change is simply
> that we upgraded our infrastructure.
> 
> 6./ Then we compared the CPU flags between AMD 63xx and AMD 62xx. If you
> look carefully, 63xx has 'fma f16c tbm bmi1' and 62xx has 'svm'. According
> to my colleague, fma and f16c are extensions which make the AMD 63xx more
> compatible with Intel's AVX extension.
> 
>    *63xx*
>    flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
>    pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt
>    pdpe1gb lm rep_good extd_apicid unfair_spinlock pni pclmulqdq ssse3
>    fma cx16 sse4_1 sse4_2 x2apic popcnt aes xsave avx f16c hypervisor
>    lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch
>    osvw xop fma4 tbm bmi1
> 
>    *62xx*
>    flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
>    pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt
>    pdpe1gb lm rep_good extd_apicid unfair_spinlock pni pclmulqdq ssse3
>    cx16 sse4_1 sse4_2 x2apic popcnt aes xsave avx hypervisor lahf_lm
>    cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw
>    xop fma4
> 
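> For completeness, a quick way to see which of these SIMD-related flags a
> given guest actually exposes (this just reads /proc/cpuinfo, nothing
> ceph-specific) is something like:
> 
>    # print the relevant flags of the first CPU, one per line
>    grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' | \
>        grep -E '^(avx2?|fma4?|f16c|xop|tbm|bmi1|svm)$'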
> 
> All of the previous arguments may explain why we can use 9.2.0 on AMD 62xx,
> and why 10.2.2 works on AMD 63xx but not on AMD 62xx.
> 
> So, we are hoping that compiling 10.2.2 on an Intel processor without the
> AVX extensions will solve our problem.
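> 
> If rebuilding on older hardware turns out to be impractical, a fallback we
> may try is to rebuild on the same box while masking the AVX code paths via
> compiler flags. This is only a sketch of the idea and we have not verified
> it against the jewel build system:
> 
>    ./autogen.sh
>    ./configure CFLAGS="-O2 -mno-avx -mno-avx2" CXXFLAGS="-O2 -mno-avx -mno-avx2"
>    make -j4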
> 
> Does this make sense?
> 
> The compilation takes a while, but I will update the issue once I have
> finished this last experiment (in the next few days).

Wow, great analysis, well done.

So is this a CPU bug then? Is it documented anywhere you know of?

I can't see where we use the AVX2 extensions directly so I assume this has to
be at a lower level than the ceph code?
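
For what it's worth, one way to check whether the shipped binary actually
contains AVX/AVX2 instructions would be to disassemble it and count ymm
register references, something like

    objdump -d /usr/bin/ceph-fuse | grep -c ymm

(the path is a guess for your packages; adjust as needed).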

-- 
Cheers,
Brad

> 
> Cheers
> Goncalo
> 
> 
> 
> On 07/12/2016 09:45 PM, Goncalo Borges wrote:
> > Hi All...
> > 
> > Thank you for continuing to follow this already very long thread.
> > 
> > Pat and Greg are correct in their assumption regarding the 10 GB virtual memory footprint I see for the ceph-fuse process in our cluster with 12-core (24 with hyperthreading) machines and 96 GB of RAM. The source is the per-thread malloc arenas in glibc > 2.10. I can reduce / tune the virtual memory used for threads by setting MALLOC_ARENA_MAX = 4 (the default is 8 arenas per core on 64-bit machines) before mounting the filesystem with ceph-fuse. So, there is no memory leak in ceph-fuse :-)
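> > 
> > In practice this just means something like the following before mounting (the mount command itself is the same one we already use):
> > 
> >     export MALLOC_ARENA_MAX=4
> >     ceph-fuse --id mount_user -k /etc/ceph/ceph.client.mount_user.keyring -m X.X.X.8:6789 -r /cephfs /coepp/cephfs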
> > 
> > The bad news is that, while reading the glibc malloc arena explanation, it became obvious that the virtual memory footprint scales with the number of cores. Therefore the 10 GB of virtual memory I was seeing on the resources with 12 cores (24 because of hyperthreading) could not / would not be the same in the VMs where I get the segfault, since they have only 4 cores.
> > 
> > So, at this point, I know that:
> > a./ The segfault always appears in a set of VMs with 16 GB of RAM and 4 cores.
> > b./ The segfault does not appear in a set of VMs (in principle identical to the 16 GB ones) but with 16 cores and 64 GB of RAM.
> > c./ The segfault does not appear in a physical cluster with machines with 96 GB of RAM and 12 cores (24 because of hyperthreading),
> > and I am no longer so sure that this is memory related.
> > 
> > For further debugging, I've updated
> >     http://tracker.ceph.com/issues/16610
> > with a summary of my findings plus some log files:
> >    - The gdb.txt I get after running
> >    $ gdb /path/to/ceph-fuse core.XXXX
> >    (gdb) set pag off
> >    (gdb) set log on
> >    (gdb) thread apply all bt
> >    (gdb) thread apply all bt full
> >    as advised by Brad
> > - The debug.out (gzipped) I get after running ceph-fuse in debug mode with 'debug client = 20' and 'debug objectcacher = 20'
> > 
> > Cheers
> > Goncalo
> > ________________________________________
> > From: Gregory Farnum [gfarnum@xxxxxxxxxx]
> > Sent: 12 July 2016 03:07
> > To: Goncalo Borges
> > Cc: John Spray; ceph-users
> > Subject: Re:  ceph-fuse segfaults ( jewel 10.2.2)
> > 
> > Oh, is this one of your custom-built packages? Are they using
> > tcmalloc? That difference between VSZ and RSS looks like a glibc
> > malloc problem.
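> > 
> > (A quick check on the client, something like
> > 
> >     ldd $(which ceph-fuse) | grep -i -E 'tcmalloc|jemalloc'
> > 
> > would show which allocator the binary is actually linked against.)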
> > -Greg
> > 
> > On Mon, Jul 11, 2016 at 12:04 AM, Goncalo Borges
> > <goncalo.borges@xxxxxxxxxxxxx> wrote:
> > > Hi John...
> > > 
> > > Thank you for replying.
> > > 
> > > Here are the results of the tests you asked for, but I do not see anything
> > > abnormal. Actually, your suggestions made me see that:
> > > 
> > > 1) ceph-fuse 9.2.0 presents the same behaviour but with less memory
> > > consumption; probably low enough that it doesn't break ceph-fuse on
> > > our machines with less memory.
> > > 
> > > 2) I see a tremendous number of  ceph-fuse threads launched (around 160).
> > > 
> > > # ps -T -p 3230 -o command,ppid,pid,spid,vsize,rss,%mem,%cpu | wc -l
> > > 157
> > > 
> > > # ps -T -p 3230 -o command,ppid,pid,spid,vsize,rss,%mem,%cpu | head -n 10
> > > COMMAND                      PPID   PID  SPID    VSZ   RSS %MEM %CPU
> > > ceph-fuse --id mount_user -     1  3230  3230 9935240 339780  0.6 0.0
> > > ceph-fuse --id mount_user -     1  3230  3231 9935240 339780  0.6 0.1
> > > ceph-fuse --id mount_user -     1  3230  3232 9935240 339780  0.6 0.0
> > > ceph-fuse --id mount_user -     1  3230  3233 9935240 339780  0.6 0.0
> > > ceph-fuse --id mount_user -     1  3230  3234 9935240 339780  0.6 0.0
> > > ceph-fuse --id mount_user -     1  3230  3235 9935240 339780  0.6 0.0
> > > ceph-fuse --id mount_user -     1  3230  3236 9935240 339780  0.6 0.0
> > > ceph-fuse --id mount_user -     1  3230  3237 9935240 339780  0.6 0.0
> > > ceph-fuse --id mount_user -     1  3230  3238 9935240 339780  0.6 0.0
> > > 
> > > 
> > > I do not see a way to actually limit the number of ceph-fuse threads
> > > launched, or to limit the maximum virtual memory each thread may take.
> > > 
> > > Do you know how to limit those?
> > > 
> > > Cheers
> > > 
> > > Goncalo
> > > 
> > > 
> > > 
> > > 
> > > 1.> Try running ceph-fuse with valgrind --tool=memcheck to see if it's
> > > leaking
> > > 
> > > I have launched ceph-fuse with valgrind on the cluster where there is
> > > sufficient memory available, and where there is therefore no ObjectCacher
> > > segfault.
> > > 
> > >      $ valgrind --log-file=/tmp/valgrind-ceph-fuse-10.2.2.txt --tool=memcheck
> > > ceph-fuse --id mount_user -k /etc/ceph/ceph.client.mount_user.keyring -m
> > > X.X.X.8:6789 -r /cephfs /coepp/cephfs
> > > 
> > > This is the output I get once I unmount the filesystem after the user
> > > application has finished:
> > > 
> > > # cat valgrind-ceph-fuse-10.2.2.txt
> > > ==12123== Memcheck, a memory error detector
> > > ==12123== Copyright (C) 2002-2012, and GNU GPL'd, by Julian Seward et al.
> > > ==12123== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
> > > ==12123== Command: ceph-fuse --id mount_user -k
> > > /etc/ceph/ceph.client.mount_user.keyring -m 192.231.127.8:6789 -r /cephfs
> > > /coepp/cephfs
> > > ==12123== Parent PID: 11992
> > > ==12123==
> > > ==12123==
> > > ==12123== HEAP SUMMARY:
> > > ==12123==     in use at exit: 29,129 bytes in 397 blocks
> > > ==12123==   total heap usage: 14,824 allocs, 14,427 frees, 648,030 bytes
> > > allocated
> > > ==12123==
> > > ==12123== LEAK SUMMARY:
> > > ==12123==    definitely lost: 16 bytes in 1 blocks
> > > ==12123==    indirectly lost: 0 bytes in 0 blocks
> > > ==12123==      possibly lost: 11,705 bytes in 273 blocks
> > > ==12123==    still reachable: 17,408 bytes in 123 blocks
> > > ==12123==         suppressed: 0 bytes in 0 blocks
> > > ==12123== Rerun with --leak-check=full to see details of leaked memory
> > > ==12123==
> > > ==12123== For counts of detected and suppressed errors, rerun with: -v
> > > ==12123== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 8 from 6)
> > > ==12126==
> > > ==12126== HEAP SUMMARY:
> > > ==12126==     in use at exit: 9,641 bytes in 73 blocks
> > > ==12126==   total heap usage: 31,363,579 allocs, 31,363,506 frees,
> > > 41,389,143,617 bytes allocated
> > > ==12126==
> > > ==12126== LEAK SUMMARY:
> > > ==12126==    definitely lost: 28 bytes in 1 blocks
> > > ==12126==    indirectly lost: 0 bytes in 0 blocks
> > > ==12126==      possibly lost: 0 bytes in 0 blocks
> > > ==12126==    still reachable: 9,613 bytes in 72 blocks
> > > ==12126==         suppressed: 0 bytes in 0 blocks
> > > ==12126== Rerun with --leak-check=full to see details of leaked memory
> > > ==12126==
> > > ==12126== For counts of detected and suppressed errors, rerun with: -v
> > > ==12126== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 17 from 9)
> > > 
> > > --- * ---
> > > 
> > > 2.>  Inspect inode count (ceph daemon <path to asok> status) to see if it's
> > > obeying its limit
> > > 
> > > This is the output I get once ceph-fuse is mounted but no user application
> > > is running
> > > 
> > >      # ceph daemon /var/run/ceph/ceph-client.mount_user.asok status
> > >      {
> > >          "metadata": {
> > >          "ceph_sha1": "45107e21c568dd033c2f0a3107dec8f0b0e58374",
> > >          "ceph_version": "ceph version 10.2.2
> > > (45107e21c568dd033c2f0a3107dec8f0b0e58374)",
> > >          "entity_id": "mount_user",
> > >          "hostname": "<some host name>",
> > >          "mount_point": "\/coepp\/cephfs",
> > >          "root": "\/cephfs"
> > >          },
> > >          "dentry_count": 0,
> > >          "dentry_pinned_count": 0,
> > >          "inode_count": 2,
> > >          "mds_epoch": 817,
> > >          "osd_epoch": 1005,
> > >          "osd_epoch_barrier": 0
> > >      }
> > > 
> > > 
> > > This is after ceph-fuse has already reached 10 GB of virtual memory, with
> > > user applications hammering the filesystem:
> > > 
> > >      # ceph daemon /var/run/ceph/ceph-client.mount_user.asok status
> > >      {
> > >          "metadata": {
> > >          "ceph_sha1": "45107e21c568dd033c2f0a3107dec8f0b0e58374",
> > >          "ceph_version": "ceph version 10.2.2
> > > (45107e21c568dd033c2f0a3107dec8f0b0e58374)",
> > >          "entity_id": "mount_user",
> > >          "hostname": "<some host name>",
> > >          "mount_point": "\/coepp\/cephfs",
> > >          "root": "\/cephfs"
> > >          },
> > >          "dentry_count": 13,
> > >          "dentry_pinned_count": 2,
> > >          "inode_count": 15,
> > >          "mds_epoch": 817,
> > >          "osd_epoch": 1005,
> > >          "osd_epoch_barrier": 1005
> > >      }
> > > 
> > > Once I kill the applications I get
> > > 
> > >      # ceph daemon /var/run/ceph/ceph-client.mount_user.asok status
> > >      {
> > >          "metadata": {
> > >          "ceph_sha1": "45107e21c568dd033c2f0a3107dec8f0b0e58374",
> > >          "ceph_version": "ceph version 10.2.2
> > > (45107e21c568dd033c2f0a3107dec8f0b0e58374)",
> > >          "entity_id": "mount_user",
> > >          "hostname": "<some host name>",
> > >          "mount_point": "\/coepp\/cephfs",
> > >          "root": "\/cephfs"
> > >          },
> > >          "dentry_count": 38,
> > >          "dentry_pinned_count": 3,
> > >          "inode_count": 40,
> > >          "mds_epoch": 817,
> > >          "osd_epoch": 1005,
> > >          "osd_epoch_barrier": 1005
> > >      }
> > > 
> > > --- * ---
> > > 
> > > 3.>  Enable objectcacher debug (debug objectcacher = 10) and look at the
> > > output (from the "trim" lines) to see if it's obeying its limit
> > > 
> > > I've mounted ceph-fuse with debug objectcacher = 10 and filled the host
> > > with user applications. I killed the applications when I saw ceph-fuse's
> > > virtual memory stabilize at around 10 GB.
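> > > 
> > > (For what it's worth, the debug level can also be changed on a live mount
> > > through the admin socket, without remounting; something along the lines of
> > > 
> > >      # ceph daemon /var/run/ceph/ceph-client.mount_user.asok config set debug_objectcacher 10
> > > 
> > > should do it, using the same asok path as in the status output above.)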
> > > 
> > > Grepping for the trim lines in the log, this is the structure I've found:
> > > 
> > >      2016-07-11 01:55:46.314888 7f04c57fb700 10 objectcacher trim  start:
> > > bytes: max 209715200  clean 0, objects: max 1000 current 1
> > >      2016-07-11 01:55:46.314891 7f04c57fb700 10 objectcacher trim finish:
> > > max 209715200  clean 0, objects: max 1000 current 1
> > >      2016-07-11 01:55:46.315009 7f04c75fe700 10 objectcacher trim  start:
> > > bytes: max 209715200  clean 0, objects: max 1000 current 2
> > >      2016-07-11 01:55:46.315012 7f04c75fe700 10 objectcacher trim finish:
> > > max 209715200  clean 0, objects: max 1000 current 2
> > >      <... snip ... >
> > >      2016-07-11 01:56:09.444853 7f04c75fe700 10 objectcacher trim  start:
> > > bytes: max 209715200  clean 204608008, objects: max 1000 current 55
> > >      2016-07-11 01:56:09.444855 7f04c75fe700 10 objectcacher trim finish:
> > > max 209715200  clean 204608008, objects: max 1000 current 55
> > >      2016-07-11 01:56:09.445010 7f04c57fb700 10 objectcacher trim  start:
> > > bytes: max 209715200  clean 204608008, objects: max 1000 current 55
> > >      2016-07-11 01:56:09.445011 7f04c57fb700 10 objectcacher trim finish:
> > > max 209715200  clean 204608008, objects: max 1000 current 55
> > >      2016-07-11 01:56:09.798269 7f04c75fe700 10 objectcacher trim  start:
> > > bytes: max 209715200  clean 210943832, objects: max 1000 current 55
> > >      2016-07-11 01:56:09.798272 7f04c75fe700 10 objectcacher trim trimming
> > > bh[ 0x7f04a8016100 96~59048 0x7f04a8014cd0 (59048) v 3 clean firstbyte=1]
> > > waiters = {}
> > >      2016-07-11 01:56:09.798284 7f04c75fe700 10 objectcacher trim trimming
> > > bh[ 0x7f04b4011550 96~59048 0x7f04b4010430 (59048) v 4 clean firstbyte=1]
> > > waiters = {}
> > >      2016-07-11 01:56:09.798294 7f04c75fe700 10 objectcacher trim trimming
> > > bh[ 0x7f04b001bea0 61760~4132544 0x7f04b4010430 (4132544) v 24 clean
> > > firstbyte=71] waiters = {}
> > >      2016-07-11 01:56:09.798395 7f04c75fe700 10 objectcacher trim finish:
> > > max 209715200  clean 206693192, objects: max 1000 current 55
> > >      2016-07-11 01:56:09.798687 7f04c57fb700 10 objectcacher trim  start:
> > > bytes: max 209715200  clean 206693192, objects: max 1000 current 55
> > >      2016-07-11 01:56:09.798689 7f04c57fb700 10 objectcacher trim finish:
> > > max 209715200  clean 206693192, objects: max 1000 current 55
> > >      <... snip ...>
> > >      2016-07-11 01:56:10.494928 7f04c75fe700 10 objectcacher trim  start:
> > > bytes: max 209715200  clean 210806408, objects: max 1000 current 55
> > >      2016-07-11 01:56:10.494931 7f04c75fe700 10 objectcacher trim trimming
> > > bh[ 0x7f04b401a760 61760~4132544 0x7f04a8014cd0 (4132544) v 32 clean
> > > firstbyte=71] waiters = {}
> > >      2016-07-11 01:56:10.495058 7f04c75fe700 10 objectcacher trim finish:
> > > max 209715200  clean 206673864, objects: max 1000 current 55
> > >      <... snip ...>
> > >      2016-07-11 01:57:08.333503 7f04c6bfd700 10 objectcacher trim  start:
> > > bytes: max 209715200  clean 211528796, objects: max 1000 current 187
> > >      2016-07-11 01:57:08.333507 7f04c6bfd700 10 objectcacher trim trimming
> > > bh[ 0x7f04b0b370e0 0~4194304 0x7f04b09f2630 (4194304) v 404 clean
> > > firstbyte=84] waiters = {}
> > >      2016-07-11 01:57:08.333708 7f04c6bfd700 10 objectcacher trim finish:
> > > max 209715200  clean 207334492, objects: max 1000 current 187
> > >      2016-07-11 01:57:08.616143 7f04c61fc700 10 objectcacher trim  start:
> > > bytes: max 209715200  clean 209949683, objects: max 1000 current 188
> > >      2016-07-11 01:57:08.616146 7f04c61fc700 10 objectcacher trim trimming
> > > bh[ 0x7f04a8bfdd60 0~4194304 0x7f04a8bfe660 (4194304) v 407 clean
> > > firstbyte=84] waiters = {}
> > >      2016-07-11 01:57:08.616303 7f04c61fc700 10 objectcacher trim finish:
> > > max 209715200  clean 205755379, objects: max 1000 current 188
> > >      2016-07-11 01:57:08.936060 7f04c57fb700 10 objectcacher trim  start:
> > > bytes: max 209715200  clean 205760010, objects: max 1000 current 189
> > >      2016-07-11 01:57:08.936063 7f04c57fb700 10 objectcacher trim finish:
> > > max 209715200  clean 205760010, objects: max 1000 current 189
> > >      2016-07-11 01:58:02.918322 7f04f27f4e40 10 objectcacher release trimming
> > > object[100003dffd9.00000000/head oset 0x7f04d4045c98 wr 566/566]
> > >      2016-07-11 01:58:02.918335 7f04f27f4e40 10 objectcacher release trimming
> > > object[100003dffd5.00000000/head oset 0x7f04d403e378 wr 564/564]
> > >      <... snip...>
> > >      2016-07-11 01:58:02.924699 7f04f27f4e40 10 objectcacher release trimming
> > > object[100003dffc4.0000000f/head oset 0x7f04d402b308 wr 557/557]
> > >      2016-07-11 01:58:02.924717 7f04f27f4e40 10 objectcacher release trimming
> > > object[100003dffc5.00000000/head oset 0x7f04d40026b8 wr 541/541]
> > >      2016-07-11 01:58:02.924769 7f04f27f4e40 10 objectcacher release trimming
> > > object[100003dffc8.00000000/head oset 0x7f04d4027818 wr 547/547]
> > >      <... snip...>
> > >      2016-07-11 01:58:02.925879 7f04f27f4e40 10 objectcacher release_set on
> > > 0x7f04d401a568 dne
> > >      2016-07-11 01:58:02.925881 7f04f27f4e40 10 objectcacher release_set on
> > > 0x7f04d401b078 dne
> > >      2016-07-11 01:58:02.957626 7f04e57fb700 10 objectcacher flusher finish
> > > 
> > > So, if I am understanding this correctly, every time the amount of cached
> > > data goes above the 200 MB client_oc_size limit it is trimmed, and the
> > > value is kept close to that limit.
> > > 
> > > 
> > > --- * ---
> > > 
> > > 4.> See if fuse_disable_pagecache setting makes a difference
> > > 
> > > It doesn't seem to make a difference. I've set this in the ceph config:
> > > 
> > >      # grep fuse /etc/ceph/ceph.conf
> > >      fuse_disable_pagecache = true
> > > 
> > > on this client only (I guess I do not have to do it across the whole
> > > cluster). Then I've remounted cephfs via ceph-fuse and filled the host
> > > with user applications.
> > > 
> > > Almost immediately, this is what I got:
> > > 
> > >        PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> > >      28681 root      20   0 8543m 248m 5948 S  4.0  0.5   0:02.73 ceph-fuse
> > >       5369 root      20   0 3131m 231m  12m S  0.0  0.5  26:22.90
> > > dsm_om_connsvcd
> > >       1429 goncalo   20   0 1595m  98m  32m R 99.5  0.2   1:04.34 python
> > >       1098 goncalo   20   0 1596m  86m  20m R 99.9  0.2   1:04.29 python
> > >        994 goncalo   20   0 1594m  86m  20m R 99.9  0.2   1:04.16 python
> > >      31928 goncalo   20   0 1595m  86m  19m R 99.9  0.2   1:04.76 python
> > >      16852 goncalo   20   0 1596m  86m  19m R 99.9  0.2   1:06.16 python
> > >      16846 goncalo   20   0 1594m  84m  19m R 99.9  0.2   1:06.05 python
> > >      29595 goncalo   20   0 1594m  83m  19m R 100.2  0.2   1:05.57 python
> > >      29312 goncalo   20   0 1594m  83m  19m R 99.9  0.2   1:05.01 python
> > >      31979 goncalo   20   0 1595m  82m  19m R 100.2  0.2   1:04.82 python
> > >      29333 goncalo   20   0 1594m  82m  19m R 99.5  0.2   1:04.94 python
> > >      29609 goncalo   20   0 1594m  82m  19m R 99.9  0.2   1:05.07 python
> > > 
> > > 
> > > 5.> Also, is the version of fuse the same on the nodes running 9.2.0 vs. the
> > > nodes running 10.2.2?
> > > 
> > > For 10.2.2 I compiled against fuse 2.9.7, while for 9.2.0 I compiled
> > > against the default SL6 fuse libs, version 2.8.7. However, as I said
> > > before, I am seeing the same issue with 9.2.0 (although with a bit less
> > > virtual memory used in total).
> > > 
> > > 
> > > 
> > > 
> > > On 07/08/2016 10:53 PM, John Spray wrote:
> > > 
> > > On Fri, Jul 8, 2016 at 8:01 AM, Goncalo Borges
> > > <goncalo.borges@xxxxxxxxxxxxx> wrote:
> > > 
> > > Hi Brad, Patrick, All...
> > > 
> > > I think I've understood this second problem. In summary, it is memory
> > > related.
> > > 
> > > This is how I found the source of the problem:
> > > 
> > > 1./ I copied and adapted the user application to run in another cluster of
> > > ours. The idea was for me to understand the application and run it myself to
> > > collect logs and so on...
> > > 
> > > 2./ Once I submitted it to this other cluster, everything went fine. I was
> > > hammering cephfs from multiple nodes without problems. This pointed to
> > > something different between the two clusters.
> > > 
> > > 3./ I've started to look more closely at the segmentation fault message,
> > > and assuming that the names of the methods and functions mean something,
> > > the log seems related to issues in the management of cached objects. This
> > > pointed to a memory-related problem.
> > > 
> > > 4./ On the cluster where the application ran successfully, machines have
> > > 48 GB of RAM and 96 GB of swap (I don't know why we have such a large swap
> > > size; it is a legacy setup).
> > > 
> > > # top
> > > top - 00:34:01 up 23 days, 22:21,  1 user,  load average: 12.06, 12.12,
> > > 10.40
> > > Tasks: 683 total,  13 running, 670 sleeping,   0 stopped,   0 zombie
> > > Cpu(s): 49.7%us,  0.6%sy,  0.0%ni, 49.7%id,  0.1%wa,  0.0%hi,  0.0%si,
> > > 0.0%st
> > > Mem:  49409308k total, 29692548k used, 19716760k free,   433064k buffers
> > > Swap: 98301948k total,        0k used, 98301948k free, 26742484k cached
> > > 
> > > 5./ I have noticed that ceph-fuse (in 10.2.2) consumes about 1.5 GB of
> > > virtual memory when there are no applications using the filesystem.
> > > 
> > >   7152 root      20   0 1108m  12m 5496 S  0.0  0.0   0:00.04 ceph-fuse
> > > 
> > > When I have only one instance of the user application running, ceph-fuse
> > > (in 10.2.2) slowly rises with time up to 10 GB of memory usage.
> > > 
> > > If I submit a large number of user applications simultaneously, ceph-fuse
> > > goes very quickly to ~10 GB.
> > > 
> > >    PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> > > 18563 root      20   0 10.0g 328m 5724 S  4.0  0.7   1:38.00 ceph-fuse
> > >   4343 root      20   0 3131m 237m  12m S  0.0  0.5  28:24.56 dsm_om_connsvcd
> > >   5536 goncalo   20   0 1599m  99m  32m R 99.9  0.2  31:35.46 python
> > > 31427 goncalo   20   0 1597m  89m  20m R 99.9  0.2  31:35.88 python
> > > 20504 goncalo   20   0 1599m  89m  20m R 100.2  0.2  31:34.29 python
> > > 20508 goncalo   20   0 1599m  89m  20m R 99.9  0.2  31:34.20 python
> > >   4973 goncalo   20   0 1599m  89m  20m R 99.9  0.2  31:35.70 python
> > >   1331 goncalo   20   0 1597m  88m  20m R 99.9  0.2  31:35.72 python
> > > 20505 goncalo   20   0 1597m  88m  20m R 99.9  0.2  31:34.46 python
> > > 20507 goncalo   20   0 1599m  87m  20m R 99.9  0.2  31:34.37 python
> > > 28375 goncalo   20   0 1597m  86m  20m R 99.9  0.2  31:35.52 python
> > > 20503 goncalo   20   0 1597m  85m  20m R 100.2  0.2  31:34.09 python
> > > 20506 goncalo   20   0 1597m  84m  20m R 99.5  0.2  31:34.42 python
> > > 20502 goncalo   20   0 1597m  83m  20m R 99.9  0.2  31:34.32 python
> > > 
> > > 6./ On the machines where the user had the segfault, we have 16 GB of RAM
> > > and 1 GB of swap:
> > > 
> > > Mem:  16334244k total,  3590100k used, 12744144k free,   221364k buffers
> > > Swap:  1572860k total,    10512k used,  1562348k free,  2937276k cached
> > > 
> > > 7./ I think what is happening is that once the user submits his sets of
> > > jobs, the memory usage goes right up to the limit on this type of machine,
> > > and the rise is so fast that ceph-fuse segfaults before the OOM killer can
> > > kill it.
> > > 
> > > 8./ We have run the user application on the same type of machines but with
> > > 64 GB of RAM and 1 GB of swap, and everything also goes fine there.
> > > 
> > > 
> > > So, in conclusion, our second problem (besides the locks issue, which was
> > > fixed by Pat's patch) is the memory usage profile of ceph-fuse in 10.2.2,
> > > which seems to be very different from what it was in ceph-fuse 9.2.0.
> > > 
> > > Are there any ideas on how we can limit the virtual memory usage of
> > > ceph-fuse in 10.2.2?
> > > 
> > > The fuse client is designed to limit its cache sizes:
> > > client_cache_size (default 16384) inodes of cached metadata
> > > client_oc_size (default 200MB) bytes of cached data
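> > > 
> > > If you wanted to try shrinking those on the client, something like this in
> > > the client's ceph.conf should do it (the numbers below are only examples,
> > > and these are cache limits rather than a hard cap on the process's virtual
> > > size):
> > > 
> > >      [client]
> > >          # ~100 MB of cached data, half the default inode cache
> > >          client oc size = 104857600
> > >          client cache size = 8192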
> > > 
> > > We do run the fuse client with valgrind during testing, so if it is
> > > showing memory leaks in normal usage on your system then that's news.
> > > 
> > > The top output you've posted seems to show that ceph-fuse only
> > > actually has 328MB resident though?
> > > 
> > > If you can reproduce the memory growth, then it would be good to:
> > >   * Try running ceph-fuse with valgrind --tool=memcheck to see if it's
> > > leaking
> > >   * Inspect inode count (ceph daemon <path to asok> status) to see if
> > > it's obeying its limit
> > >   * Enable objectcacher debug (debug objectcacher = 10) and look at the
> > > output (from the "trim" lines) to see if it's obeying its limit
> > >   * See if fuse_disable_pagecache setting makes a difference
> > > 
> > > Also, is the version of fuse the same on the nodes running 9.2.0 vs.
> > > the nodes running 10.2.2?
> > > 
> > > John
> > > 
> > > Cheers
> > > Goncalo
> > > 
> > > 
> > > 
> > > On 07/08/2016 09:54 AM, Brad Hubbard wrote:
> > > 
> > > Hi Goncalo,
> > > 
> > > If possible it would be great if you could capture a core file for this with
> > > full debugging symbols (preferably glibc debuginfo as well). How you do
> > > that will depend on the ceph version and your OS but we can offer help
> > > if required I'm sure.
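> > > 
> > > On an EL6/EL7-style system that is roughly a matter of raising the core
> > > limit and pulling in the debuginfo packages before reproducing the crash
> > > (exact package names may differ for your custom build), e.g.:
> > > 
> > >      ulimit -c unlimited
> > >      debuginfo-install ceph-fuse glibc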
> > > 
> > > Once you have the core do the following.
> > > 
> > > $ gdb /path/to/ceph-fuse core.XXXX
> > > (gdb) set pag off
> > > (gdb) set log on
> > > (gdb) thread apply all bt
> > > (gdb) thread apply all bt full
> > > 
> > > Then quit gdb and you should find a file called gdb.txt in your
> > > working directory.
> > > If you could attach that file to http://tracker.ceph.com/issues/16610
> > > 
> > > Cheers,
> > > Brad
> > > 
> > > On Fri, Jul 8, 2016 at 12:06 AM, Patrick Donnelly <pdonnell@xxxxxxxxxx>
> > > wrote:
> > > 
> > > On Thu, Jul 7, 2016 at 2:01 AM, Goncalo Borges
> > > <goncalo.borges@xxxxxxxxxxxxx> wrote:
> > > 
> > > Unfortunately, the other user application breaks ceph-fuse again (it is a
> > > completely different application than in my previous test).
> > > 
> > > We have tested it on 4 machines with 4 cores each. The user is submitting 16
> > > single-core jobs which are all writing different output files (one per job)
> > > to a common dir in cephfs. The first 4 jobs run happily and never break
> > > ceph-fuse. But the remaining 12 jobs, running on the remaining 3 machines,
> > > trigger a segmentation fault, which is completely different from the other
> > > case.
> > > 
> > > ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
> > > 1: (()+0x297fe2) [0x7f54402b7fe2]
> > > 2: (()+0xf7e0) [0x7f543ecf77e0]
> > > 3: (ObjectCacher::bh_write_scattered(std::list<ObjectCacher::BufferHead*,
> > > std::allocator<ObjectCacher::BufferHead*> >&)+0x36) [0x7f5440268086]
> > > 4: (ObjectCacher::bh_write_adjacencies(ObjectCacher::BufferHead*,
> > > std::chrono::time_point<ceph::time_detail::real_clock,
> > > std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >, long*,
> > > int*)+0x22c) [0x7f5440268a3c]
> > > 5: (ObjectCacher::flush(long)+0x1ef) [0x7f5440268cef]
> > > 6: (ObjectCacher::flusher_entry()+0xac4) [0x7f5440269a34]
> > > 7: (ObjectCacher::FlusherThread::entry()+0xd) [0x7f5440275c6d]
> > > 8: (()+0x7aa1) [0x7f543ecefaa1]
> > >   9: (clone()+0x6d) [0x7f543df6893d]
> > > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
> > > interpret this.
> > > 
> > > This one looks like a very different problem. I've created an issue
> > > here: http://tracker.ceph.com/issues/16610
> > > 
> > > Thanks for the report and debug log!
> > > 
> > > --
> > > Patrick Donnelly
> > > _______________________________________________
> > > ceph-users mailing list
> > > ceph-users@xxxxxxxxxxxxxx
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > 
> > > 
> > > 
> > > --
> > > Goncalo Borges
> > > Research Computing
> > > ARC Centre of Excellence for Particle Physics at the Terascale
> > > School of Physics A28 | University of Sydney, NSW  2006
> > > T: +61 2 93511937
> > > 
> > > 
> > > _______________________________________________
> > > ceph-users mailing list
> > > ceph-users@xxxxxxxxxxxxxx
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > 
> > > 
> > > --
> > > Goncalo Borges
> > > Research Computing
> > > ARC Centre of Excellence for Particle Physics at the Terascale
> > > School of Physics A28 | University of Sydney, NSW  2006
> > > T: +61 2 93511937
> > > 
> > > 
> > > _______________________________________________
> > > ceph-users mailing list
> > > ceph-users@xxxxxxxxxxxxxx
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > 
> 
> -- 
> Goncalo Borges
> Research Computing
> ARC Centre of Excellence for Particle Physics at the Terascale
> School of Physics A28 | University of Sydney, NSW  2006
> T: +61 2 93511937
> 
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


