On Tue, Jul 19, 2016 at 1:03 PM, Goncalo Borges <goncalo.borges@xxxxxxxxxxxxx> wrote: > Hi All... > > We do have some good news. > > As promised, I've recompiled ceph 10.2.2 (on an Intel processor without > AVX2) with and without the patch provided by Zheng. It turns out that > Zheng's patch is the solution for the segfaults we saw in ObjectCacher when > ceph-fuse runs on AMD 62xx processors. > > To convince ourselves that the problem was really solved, we executed 40 > jobs (with the user application where the ObjectCacher segfault was seen for > the first time) in a dozen AMD 62xx VMs, and none failed. Before, > ceph-fuse was always segfaulting a couple of minutes after job startup. > > Thank you for all the help. With all the bits and pieces from everyone we > were able to nail this one. > > I am a bit surprised that no other complaints appeared in the mailing list > for either of the issues we saw: first the locking issue and then the > ObjectCacher issue. This makes me think that we are using ceph-fuse in a > different way than others (probably exposing it to real user applications > more heavily than other communities). If you actually need a beta tester > next time, I think it is also in our best interest to participate. > > I do have one last question. > > While searching / googling for tips, I saw an issue claiming that > 'fuse_disable_pagecache' should be set to true in ceph.conf. Can you briefly > explain if this is correct and what the downside of not using it is? (just for me > to understand it). For ceph-fuse there are two caches: one is in ceph-fuse itself, the other is the kernel pagecache. When multiple clients read/write a file at the same time, ceph-fuse needs to disable caching and let reads/writes go to the OSDs directly. ceph-fuse can disable its own cache, but there is no way to disable the kernel pagecache dynamically, so a client may read stale data from the kernel pagecache. > > Thank you in advance > > Cheers > > Goncalo > > > > On 07/15/2016 01:35 PM, Goncalo Borges wrote: > > Thanks Zheng... > > Now that we have identified the exact context in which the segfault appears > (only on AMD 62xx), I think it should be safe to work out in which situations > the crash appears. > > My current compilation is ongoing and I will then test it. > > If it fails, I will recompile including your patch. > > Will report here afterwards. > > Thanks for the feedback. > > Cheers > > Goncalo > > > On 07/15/2016 01:19 PM, Yan, Zheng wrote: > > On Fri, Jul 15, 2016 at 9:35 AM, Goncalo Borges > <goncalo.borges@xxxxxxxxxxxxx> wrote: > > Hi All... > > I've seen that Zheng, Brad, Pat and Greg have already updated or made some > comments on the bug issue. Zheng has also proposed a simple patch. However, I do > have a bit more information. We do think we have identified the source of > the problem and that we can correct it. Therefore, I would propose that you > hold off on any work on the issue until we test our hypothesis. I'll try to > summarize it: > > 1./ After being convinced that the ceph-fuse segfault we saw in specific VMs > was not memory related, I decided to run the user application in multiple > zones of the openstack cloud we use. We scale up our resources by using a > publicly funded openstack cloud which spawns machines (always using the same > image) in multiple availability zones. In the majority of cases we limit > our VMs to the same availability zone because it sits in the > same data center as our infrastructure.
This experiment showed that > ceph-fuse does not segfault in other availability zones with multiple VMs > of different sizes and types. So the problem was restricted to the > availability zone we normally use as our default one. > > 2./ I've then created new VMs of multiple sizes and types in our 'default' > availability zone and reran the user application. This new experiment, > running in newly created VMs, showed ceph-fuse segfaults independent of the > VM type but not in all VMs. For example, in this new test, ceph-fuse was > segfaulting in some 4 and 8 core VMs but not in all. > > 3./ I then decided to inspect the CPU types, and the breakthrough was > that I got a 100% correlation of ceph-fuse segfaults with AMD 62xx processor > VMs. This availability zone has only 2 types of hypervisors: an old one with > AMD 62xx processors, and a new one with Intel processors. If my jobs run in > a VM with Intel, everything is ok. If my jobs run on AMD 62xx, ceph-fuse > segfaults. Actually, the segfault is almost immediate in 4-core AMD 62xx VMs > but takes much more time in 8-core AMD 62xx VMs. > > 4./ I then crosschecked which processors were used in the successful jobs > executed in the other availability zones: several types of Intel, AMD 63xx, > but not AMD 62xx processors. > > 5./ Talking with my awesome colleague Sean, he remembered some discussions > about applications segfaulting on AMD processors when compiled on an Intel > processor with the AVX2 extension. Actually, I compiled ceph 10.2.2 on an Intel > processor with AVX2, but ceph 9.2.0 was compiled several months ago on an > Intel processor without AVX2. The reason for the change is simply that we > upgraded our infrastructure. > > 6./ Then, we compared the cpu flags between AMD 63xx and AMD 62xx. If you look > carefully, 63xx has 'fma f16c tbm bmi1' and 62xx has 'svm'. According to my > colleague, fma and f16c are both AMD extensions which make AMD more > compatible with Intel's AVX extension. > > 63xx > flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat > pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt pdpe1gb lm > rep_good extd_apicid unfair_spinlock pni pclmulqdq ssse3 fma cx16 sse4_1 > sse4_2 x2apic popcnt aes xsave avx f16c hypervisor lahf_lm cmp_legacy > cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw xop fma4 tbm bmi1 > > 62xx > flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat > pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt pdpe1gb lm > rep_good extd_apicid unfair_spinlock pni pclmulqdq ssse3 cx16 sse4_1 sse4_2 > x2apic popcnt aes xsave avx hypervisor lahf_lm cmp_legacy svm cr8_legacy abm > sse4a misalignsse 3dnowprefetch osvw xop fma4 > > > All of the previous arguments may explain why we can use 9.2.0 on AMD 62xx, > and why 10.2.2 works on AMD 63xx but not on AMD 62xx. > > So, we are hoping that compiling 10.2.2 on an Intel processor without the > AVX extensions will solve our problem. > > Does this make sense? > > I have a different theory. ObjectCacher::flush() checks > "bh->last_write <= cutoff" to decide if it should write a buffer head, > but ObjectCacher::bh_write_adjacencies() checks "bh->last_write < > cutoff" (cutoff is the clock time at which ObjectCacher::flush() starts > executing). If there is only one dirty buffer head and its last_write > is equal to cutoff, the segfault happens. Due to hardware > limitations, the AMD 62xx CPU may be unable to provide a high precision time > clock. This explains why the segfault only happens on AMD 62xx. The code > that causes the segfault was introduced in the jewel release, so ceph-fuse > 9.2.0 does not have this problem.
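> To make the failure mode concrete, here is a small self-contained toy model of that mismatch (a simplified illustration only, not the actual Ceph code; the names just echo the real ones):
>
>   #include <cstdio>
>   #include <list>
>
>   struct BufferHead { long last_write; };
>
>   // mirrors bh_write_adjacencies(): gathers candidates with a STRICT '<' check
>   std::list<BufferHead*> gather(BufferHead* bh, long cutoff) {
>     std::list<BufferHead*> blist;
>     if (bh->last_write < cutoff)
>       blist.push_back(bh);
>     return blist;
>   }
>
>   // mirrors bh_write_scattered(): assumes it was handed a non-empty list
>   void write_scattered(std::list<BufferHead*>& blist) {
>     BufferHead* front = blist.front();   // undefined behaviour if blist is empty
>     std::printf("writing bh with last_write=%ld\n", front->last_write);
>   }
>
>   int main() {
>     long cutoff = 1000;                  // "now", sampled when flush() starts
>     BufferHead bh{cutoff};               // the only dirty bh, written in the same clock tick
>     if (bh.last_write <= cutoff) {       // mirrors flush(): '<=' lets it through...
>       std::list<BufferHead*> blist = gather(&bh, cutoff);  // ...but '<' filters it out again
>       if (blist.empty()) {
>         std::printf("empty list: this is where bh_write_scattered() would crash\n");
>         return 1;
>       }
>       write_scattered(blist);
>     }
>     return 0;
>   }
>
> The coarser the clock, the more likely last_write lands exactly on cutoff, which would explain why only the AMD 62xx VMs trip over it.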
Regards > Yan, Zheng > > > > > The compilation takes a while but I will update the issue once I have > finished this last experiment (in the next few days) > > Cheers > Goncalo > > > > On 07/12/2016 09:45 PM, Goncalo Borges wrote: > > Hi All... > > Thank you for continuing to follow this already very long thread. > > Pat and Greg are correct in their assumption regarding the 10 GB virtual > memory footprint I see for the ceph-fuse process in our cluster with 12-core (24 > because of hyperthreading) machines and 96 GB of RAM. The source is glibc > > 1.10. I can reduce / tune the threads' virtual memory usage by setting > MALLOC_ARENA_MAX = 4 (the default is 8 on 64-bit machines) before mounting > the filesystem with ceph-fuse. So, there is no memory leak in ceph-fuse :-) > > The bad news is that, while reading the glibc malloc arena explanation, it > became obvious that the virtual memory footprint scales with the number of > cores. Therefore the 10 GB virtual memory I was seeing on the resources with > 12 cores (24 because of hyperthreading) could not / would not be the same in > the VMs where I get the segfault since they have only 4 cores. > > So, at this point, I know that: > a./ The segfault is always appearing in a set of VMs with 16 GB of RAM and 4 > cores. > b./ The segfault is not appearing in a set of VMs (in principle identical to > the 16 GB ones) but with 16 cores and 64 GB of RAM. > c./ The segfault is not appearing in a physical cluster with machines with > 96 GB of RAM and 12 cores (24 because of hyperthreading), > and I am not so sure anymore that this is memory related. > > For further debugging, I've updated > http://tracker.ceph.com/issues/16610 > with a summary of my findings plus some log files: > - The gdb.txt I get after running > $ gdb /path/to/ceph-fuse core.XXXX > (gdb) set pag off > (gdb) set log on > (gdb) thread apply all bt > (gdb) thread apply all bt full > as advised by Brad > - The debug.out (gzipped) I get after running ceph-fuse in debug mode with > 'debug client 20' and 'debug objectcacher = 20' > > Cheers > Goncalo > ________________________________________ > From: Gregory Farnum [gfarnum@xxxxxxxxxx] > Sent: 12 July 2016 03:07 > To: Goncalo Borges > Cc: John Spray; ceph-users > Subject: Re: ceph-fuse segfaults ( jewel 10.2.2) > > Oh, is this one of your custom-built packages? Are they using > tcmalloc? That difference between VSZ and RSS looks like a glibc > malloc problem. > -Greg > > On Mon, Jul 11, 2016 at 12:04 AM, Goncalo Borges > <goncalo.borges@xxxxxxxxxxxxx> wrote: > > Hi John... > > Thank you for replying. > > Here are the results of the tests you asked for, but I do not see anything abnormal. > Actually, your suggestions made me see that: > > 1) ceph-fuse 9.2.0 is presenting the same behaviour but with less memory > consumption; probably little enough that it doesn't break ceph-fuse on > our machines with less memory. > > 2) I see a tremendous number of ceph-fuse threads launched (around 160).
> > # ps -T -p 3230 -o command,ppid,pid,spid,vsize,rss,%mem,%cpu | wc -l > 157 > > # ps -T -p 3230 -o command,ppid,pid,spid,vsize,rss,%mem,%cpu | head -n 10 > COMMAND PPID PID SPID VSZ RSS %MEM %CPU > ceph-fuse --id mount_user - 1 3230 3230 9935240 339780 0.6 0.0 > ceph-fuse --id mount_user - 1 3230 3231 9935240 339780 0.6 0.1 > ceph-fuse --id mount_user - 1 3230 3232 9935240 339780 0.6 0.0 > ceph-fuse --id mount_user - 1 3230 3233 9935240 339780 0.6 0.0 > ceph-fuse --id mount_user - 1 3230 3234 9935240 339780 0.6 0.0 > ceph-fuse --id mount_user - 1 3230 3235 9935240 339780 0.6 0.0 > ceph-fuse --id mount_user - 1 3230 3236 9935240 339780 0.6 0.0 > ceph-fuse --id mount_user - 1 3230 3237 9935240 339780 0.6 0.0 > ceph-fuse --id mount_user - 1 3230 3238 9935240 339780 0.6 0.0 > > > I do not see a way to actually limit the number of ceph-fuse threads > launched or to limit the max vm size each thread should take. > > Do you know how to limit those options. > > Cheers > > Goncalo > > > > > 1.> Try running ceph-fuse with valgrind --tool=memcheck to see if it's > leaking > > I have launched ceph-fuse with valgrind in the cluster where there is > sufficient memory available, and therefore, there is no object cacher > segfault. > > $ valgrind --log-file=/tmp/valgrind-ceph-fuse-10.2.2.txt --tool=memcheck > ceph-fuse --id mount_user -k /etc/ceph/ceph.client.mount_user.keyring -m > X.X.X.8:6789 -r /cephfs /coepp/cephfs > > This is the output which I get once I unmount the file system after user > application execution > > # cat valgrind-ceph-fuse-10.2.2.txt > ==12123== Memcheck, a memory error detector > ==12123== Copyright (C) 2002-2012, and GNU GPL'd, by Julian Seward et al. > ==12123== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info > ==12123== Command: ceph-fuse --id mount_user -k > /etc/ceph/ceph.client.mount_user.keyring -m 192.231.127.8:6789 -r /cephfs > /coepp/cephfs > ==12123== Parent PID: 11992 > ==12123== > ==12123== > ==12123== HEAP SUMMARY: > ==12123== in use at exit: 29,129 bytes in 397 blocks > ==12123== total heap usage: 14,824 allocs, 14,427 frees, 648,030 bytes > allocated > ==12123== > ==12123== LEAK SUMMARY: > ==12123== definitely lost: 16 bytes in 1 blocks > ==12123== indirectly lost: 0 bytes in 0 blocks > ==12123== possibly lost: 11,705 bytes in 273 blocks > ==12123== still reachable: 17,408 bytes in 123 blocks > ==12123== suppressed: 0 bytes in 0 blocks > ==12123== Rerun with --leak-check=full to see details of leaked memory > ==12123== > ==12123== For counts of detected and suppressed errors, rerun with: -v > ==12123== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 8 from 6) > ==12126== > ==12126== HEAP SUMMARY: > ==12126== in use at exit: 9,641 bytes in 73 blocks > ==12126== total heap usage: 31,363,579 allocs, 31,363,506 frees, > 41,389,143,617 bytes allocated > ==12126== > ==12126== LEAK SUMMARY: > ==12126== definitely lost: 28 bytes in 1 blocks > ==12126== indirectly lost: 0 bytes in 0 blocks > ==12126== possibly lost: 0 bytes in 0 blocks > ==12126== still reachable: 9,613 bytes in 72 blocks > ==12126== suppressed: 0 bytes in 0 blocks > ==12126== Rerun with --leak-check=full to see details of leaked memory > ==12126== > ==12126== For counts of detected and suppressed errors, rerun with: -v > ==12126== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 17 from 9) > > --- * --- > > 2.> Inspect inode count (ceph daemon <path to asok> status) to see if it's > obeying its limit > > This is the output I get once ceph-fuse is mounted but no user 
application > is running > > # ceph daemon /var/run/ceph/ceph-client.mount_user.asok status > { > "metadata": { > "ceph_sha1": "45107e21c568dd033c2f0a3107dec8f0b0e58374", > "ceph_version": "ceph version 10.2.2 > (45107e21c568dd033c2f0a3107dec8f0b0e58374)", > "entity_id": "mount_user", > "hostname": "<some host name>", > "mount_point": "\/coepp\/cephfs", > "root": "\/cephfs" > }, > "dentry_count": 0, > "dentry_pinned_count": 0, > "inode_count": 2, > "mds_epoch": 817, > "osd_epoch": 1005, > "osd_epoch_barrier": 0 > } > > > This is already when ceph-fuse reached 10g of virtual memory, and user > applications are hammering the filesystem. > > # ceph daemon /var/run/ceph/ceph-client.mount_user.asok status > { > "metadata": { > "ceph_sha1": "45107e21c568dd033c2f0a3107dec8f0b0e58374", > "ceph_version": "ceph version 10.2.2 > (45107e21c568dd033c2f0a3107dec8f0b0e58374)", > "entity_id": "mount_user", > "hostname": "<some host name>", > "mount_point": "\/coepp\/cephfs", > "root": "\/cephfs" > }, > "dentry_count": 13, > "dentry_pinned_count": 2, > "inode_count": 15, > "mds_epoch": 817, > "osd_epoch": 1005, > "osd_epoch_barrier": 1005 > } > > Once I kill the applications I get > > # ceph daemon /var/run/ceph/ceph-client.mount_user.asok status > { > "metadata": { > "ceph_sha1": "45107e21c568dd033c2f0a3107dec8f0b0e58374", > "ceph_version": "ceph version 10.2.2 > (45107e21c568dd033c2f0a3107dec8f0b0e58374)", > "entity_id": "mount_user", > "hostname": "<some host name>", > "mount_point": "\/coepp\/cephfs", > "root": "\/cephfs" > }, > "dentry_count": 38, > "dentry_pinned_count": 3, > "inode_count": 40, > "mds_epoch": 817, > "osd_epoch": 1005, > "osd_epoch_barrier": 1005 > } > > --- * --- > > 3.> Enable objectcacher debug (debug objectcacher = 10) and look at the > output (from the "trim" lines) to see if it's obeying its limit > > I've mounted ceph-fuse with debug objectcacher = 10, and filled the host > with user applications. I killed the applications when I saw ceph-fuse > virtual > memory stabilize at around 10g. > > Greping for the trim lines in the log, this is the structure I've found: > > 2016-07-11 01:55:46.314888 7f04c57fb700 10 objectcacher trim start: > bytes: max 209715200 clean 0, objects: max 1000 current 1 > 2016-07-11 01:55:46.314891 7f04c57fb700 10 objectcacher trim finish: > max 209715200 clean 0, objects: max 1000 current 1 > 2016-07-11 01:55:46.315009 7f04c75fe700 10 objectcacher trim start: > bytes: max 209715200 clean 0, objects: max 1000 current 2 > 2016-07-11 01:55:46.315012 7f04c75fe700 10 objectcacher trim finish: > max 209715200 clean 0, objects: max 1000 current 2 > <... snip ... 
> > 2016-07-11 01:56:09.444853 7f04c75fe700 10 objectcacher trim start: > bytes: max 209715200 clean 204608008, objects: max 1000 current 55 > 2016-07-11 01:56:09.444855 7f04c75fe700 10 objectcacher trim finish: > max 209715200 clean 204608008, objects: max 1000 current 55 > 2016-07-11 01:56:09.445010 7f04c57fb700 10 objectcacher trim start: > bytes: max 209715200 clean 204608008, objects: max 1000 current 55 > 2016-07-11 01:56:09.445011 7f04c57fb700 10 objectcacher trim finish: > max 209715200 clean 204608008, objects: max 1000 current 55 > 2016-07-11 01:56:09.798269 7f04c75fe700 10 objectcacher trim start: > bytes: max 209715200 clean 210943832, objects: max 1000 current 55 > 2016-07-11 01:56:09.798272 7f04c75fe700 10 objectcacher trim trimming > bh[ 0x7f04a8016100 96~59048 0x7f04a8014cd0 (59048) v 3 clean firstbyte=1] > waiters = {} > 2016-07-11 01:56:09.798284 7f04c75fe700 10 objectcacher trim trimming > bh[ 0x7f04b4011550 96~59048 0x7f04b4010430 (59048) v 4 clean firstbyte=1] > waiters = {} > 2016-07-11 01:56:09.798294 7f04c75fe700 10 objectcacher trim trimming > bh[ 0x7f04b001bea0 61760~4132544 0x7f04b4010430 (4132544) v 24 clean > firstbyte=71] waiters = {} > 2016-07-11 01:56:09.798395 7f04c75fe700 10 objectcacher trim finish: > max 209715200 clean 206693192, objects: max 1000 current 55 > 2016-07-11 01:56:09.798687 7f04c57fb700 10 objectcacher trim start: > bytes: max 209715200 clean 206693192, objects: max 1000 current 55 > 2016-07-11 01:56:09.798689 7f04c57fb700 10 objectcacher trim finish: > max 209715200 clean 206693192, objects: max 1000 current 55 > <... snip ...> > 2016-07-11 01:56:10.494928 7f04c75fe700 10 objectcacher trim start: > bytes: max 209715200 clean 210806408, objects: max 1000 current 55 > 2016-07-11 01:56:10.494931 7f04c75fe700 10 objectcacher trim trimming > bh[ 0x7f04b401a760 61760~4132544 0x7f04a8014cd0 (4132544) v 32 clean > firstbyte=71] waiters = {} > 2016-07-11 01:56:10.495058 7f04c75fe700 10 objectcacher trim finish: > max 209715200 clean 206673864, objects: max 1000 current 55 > <... snip ...> > 2016-07-11 01:57:08.333503 7f04c6bfd700 10 objectcacher trim start: > bytes: max 209715200 clean 211528796, objects: max 1000 current 187 > 2016-07-11 01:57:08.333507 7f04c6bfd700 10 objectcacher trim trimming > bh[ 0x7f04b0b370e0 0~4194304 0x7f04b09f2630 (4194304) v 404 clean > firstbyte=84] waiters = {} > 2016-07-11 01:57:08.333708 7f04c6bfd700 10 objectcacher trim finish: > max 209715200 clean 207334492, objects: max 1000 current 187 > 2016-07-11 01:57:08.616143 7f04c61fc700 10 objectcacher trim start: > bytes: max 209715200 clean 209949683, objects: max 1000 current 188 > 2016-07-11 01:57:08.616146 7f04c61fc700 10 objectcacher trim trimming > bh[ 0x7f04a8bfdd60 0~4194304 0x7f04a8bfe660 (4194304) v 407 clean > firstbyte=84] waiters = {} > 2016-07-11 01:57:08.616303 7f04c61fc700 10 objectcacher trim finish: > max 209715200 clean 205755379, objects: max 1000 current 188 > 2016-07-11 01:57:08.936060 7f04c57fb700 10 objectcacher trim start: > bytes: max 209715200 clean 205760010, objects: max 1000 current 189 > 2016-07-11 01:57:08.936063 7f04c57fb700 10 objectcacher trim finish: > max 209715200 clean 205760010, objects: max 1000 current 189 > 2016-07-11 01:58:02.918322 7f04f27f4e40 10 objectcacher release trimming > object[100003dffd9.00000000/head oset 0x7f04d4045c98 wr 566/566] > 2016-07-11 01:58:02.918335 7f04f27f4e40 10 objectcacher release trimming > object[100003dffd5.00000000/head oset 0x7f04d403e378 wr 564/564] > <... 
snip...> > 2016-07-11 01:58:02.924699 7f04f27f4e40 10 objectcacher release trimming > object[100003dffc4.0000000f/head oset 0x7f04d402b308 wr 557/557] > 2016-07-11 01:58:02.924717 7f04f27f4e40 10 objectcacher release trimming > object[100003dffc5.00000000/head oset 0x7f04d40026b8 wr 541/541] > 2016-07-11 01:58:02.924769 7f04f27f4e40 10 objectcacher release trimming > object[100003dffc8.00000000/head oset 0x7f04d4027818 wr 547/547] > <... snip...> > 2016-07-11 01:58:02.925879 7f04f27f4e40 10 objectcacher release_set on > 0x7f04d401a568 dne > 2016-07-11 01:58:02.925881 7f04f27f4e40 10 objectcacher release_set on > 0x7f04d401b078 dne > 2016-07-11 01:58:02.957626 7f04e57fb700 10 objectcacher flusher finish > > So, if I am understanding this correctly, every time the client_oc_size > bytes of cached data is above 200M bytes, it is trimmed and the values is > well kepted near its limit. > > > --- * --- > > 4.> See if fuse_disable_pagecache setting makes a difference > > It doesn't seem to make a difference. I've set in ceph config > > # grep fuse /etc/ceph/ceph.conf > fuse_disable_pagecache = true > > on this client (I guess I do not have to do it in the overall cluster). > Then, I've remounted cephfs via ceph-fuse and filled the host with user > applications. > > Almost immediatly this is what I got: > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > 28681 root 20 0 8543m 248m 5948 S 4.0 0.5 0:02.73 ceph-fuse > 5369 root 20 0 3131m 231m 12m S 0.0 0.5 26:22.90 > dsm_om_connsvcd > 1429 goncalo 20 0 1595m 98m 32m R 99.5 0.2 1:04.34 python > 1098 goncalo 20 0 1596m 86m 20m R 99.9 0.2 1:04.29 python > 994 goncalo 20 0 1594m 86m 20m R 99.9 0.2 1:04.16 python > 31928 goncalo 20 0 1595m 86m 19m R 99.9 0.2 1:04.76 python > 16852 goncalo 20 0 1596m 86m 19m R 99.9 0.2 1:06.16 python > 16846 goncalo 20 0 1594m 84m 19m R 99.9 0.2 1:06.05 python > 29595 goncalo 20 0 1594m 83m 19m R 100.2 0.2 1:05.57 python > 29312 goncalo 20 0 1594m 83m 19m R 99.9 0.2 1:05.01 python > 31979 goncalo 20 0 1595m 82m 19m R 100.2 0.2 1:04.82 python > 29333 goncalo 20 0 1594m 82m 19m R 99.5 0.2 1:04.94 python > 29609 goncalo 20 0 1594m 82m 19m R 99.9 0.2 1:05.07 python > > > 5.> Also, is the version of fuse the same on the nodes running 9.2.0 vs. the > nodes running 10.2.2? > > In 10.2.2 I've compiled with fuse 2.9.7 while in 9.2.0 I've compiled against > the default sl6 fuse libs version 2.8.7. However, as I said before, I am > seeing the same issue with 9.2.0 (although with a bit less of used virtual > memory in total). > > > > > On 07/08/2016 10:53 PM, John Spray wrote: > > On Fri, Jul 8, 2016 at 8:01 AM, Goncalo Borges > <goncalo.borges@xxxxxxxxxxxxx> wrote: > > Hi Brad, Patrick, All... > > I think I've understood this second problem. In summary, it is memory > related. > > This is how I found the source of the problem: > > 1./ I copied and adapted the user application to run in another cluster of > ours. The idea was for me to understand the application and run it myself to > collect logs and so on... > > 2./ Once I submit it to this other cluster, every thing went fine. I was > hammering cephfs from multiple nodes without problems. This pointed to > something different between the two clusters. > > 3./ I've started to look better to the segmentation fault message, and > assuming that the names of the methods and functions do mean something, the > log seems related to issues on the management of objects in cache. This > pointed to a memory related problem. 
> > 4./ On the cluster where the application run successfully, machines have > 48GB of RAM and 96GB of SWAP (don't know why we have such a large SWAP size, > it is a legacy setup). > > # top > top - 00:34:01 up 23 days, 22:21, 1 user, load average: 12.06, 12.12, > 10.40 > Tasks: 683 total, 13 running, 670 sleeping, 0 stopped, 0 zombie > Cpu(s): 49.7%us, 0.6%sy, 0.0%ni, 49.7%id, 0.1%wa, 0.0%hi, 0.0%si, > 0.0%st > Mem: 49409308k total, 29692548k used, 19716760k free, 433064k buffers > Swap: 98301948k total, 0k used, 98301948k free, 26742484k cached > > 5./ I have noticed that ceph-fuse (in 10.2.2) consumes about 1.5 GB of > virtual memory when there is no applications using the filesystem. > > 7152 root 20 0 1108m 12m 5496 S 0.0 0.0 0:00.04 ceph-fuse > > When I only have one instance of the user application running, ceph-fuse (in > 10.2.2) slowly rises with time up to 10 GB of memory usage. > > if I submit a large number of user applications simultaneously, ceph-fuse > goes very fast to ~10GB. > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > 18563 root 20 0 10.0g 328m 5724 S 4.0 0.7 1:38.00 ceph-fuse > 4343 root 20 0 3131m 237m 12m S 0.0 0.5 28:24.56 dsm_om_connsvcd > 5536 goncalo 20 0 1599m 99m 32m R 99.9 0.2 31:35.46 python > 31427 goncalo 20 0 1597m 89m 20m R 99.9 0.2 31:35.88 python > 20504 goncalo 20 0 1599m 89m 20m R 100.2 0.2 31:34.29 python > 20508 goncalo 20 0 1599m 89m 20m R 99.9 0.2 31:34.20 python > 4973 goncalo 20 0 1599m 89m 20m R 99.9 0.2 31:35.70 python > 1331 goncalo 20 0 1597m 88m 20m R 99.9 0.2 31:35.72 python > 20505 goncalo 20 0 1597m 88m 20m R 99.9 0.2 31:34.46 python > 20507 goncalo 20 0 1599m 87m 20m R 99.9 0.2 31:34.37 python > 28375 goncalo 20 0 1597m 86m 20m R 99.9 0.2 31:35.52 python > 20503 goncalo 20 0 1597m 85m 20m R 100.2 0.2 31:34.09 python > 20506 goncalo 20 0 1597m 84m 20m R 99.5 0.2 31:34.42 python > 20502 goncalo 20 0 1597m 83m 20m R 99.9 0.2 31:34.32 python > > 6./ On the machines where the user had the segfault, we have 16 GB of RAM > and 1GB of SWAP > > Mem: 16334244k total, 3590100k used, 12744144k free, 221364k buffers > Swap: 1572860k total, 10512k used, 1562348k free, 2937276k cached > > 7./ I think what is happening is that once the user submits his sets of > jobs, the memory usage goes to the very limit on this type machine, and the > raise is actually to fast that ceph-fuse segfaults before OOM Killer can > kill it. > > 8./ We have run the user application in the same type of machines but with > 64 GB of RAM and 1GB of SWAP, and everything goes fine also here. > > > So, in conclusion, our second problem (besides the locks which was fixed by > Pat patch) is the memory usage profile of ceph-fuse in 10.2.2 which seems to > be very different than what it was in ceph-fuse 9.2.0. > > Are there any ideas how can we limit the virtual memory usage of ceph-fuse > in 10.2.2? > > The fuse client is designed to limit its cache sizes: > client_cache_size (default 16384) inodes of cached metadata > client_oc_size (default 200MB) bytes of cached data > > We do run the fuse client with valgrind during testing, so it it is > showing memory leaks in normal usage on your system then that's news. > > The top output you've posted seems to show that ceph-fuse only > actually has 328MB resident though? 
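> (For reference, a minimal ceph.conf sketch of those client-side cache limits; client_cache_size and client_oc_size are the options named above with their stated defaults, and client_oc_max_objects is an assumed name for the related object-count cap, so double-check it against your build:)
>
> [client]
>     # inodes of cached metadata (default 16384)
>     client_cache_size = 16384
>     # bytes of cached file data (default 200 MB)
>     client_oc_size = 209715200
>     # maximum number of cached objects (assumed option name; default believed to be 1000)
>     client_oc_max_objects = 1000
>
> Lowering these on a single client and re-running the jobs would show whether the virtual memory growth tracks the caches at all.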
> > If you can reproduce the memory growth, then it would be good to: > * Try running ceph-fuse with valgrind --tool=memcheck to see if it's > leaking > * Inspect inode count (ceph daemon <path to asok> status) to see if > it's obeying its limit > * Enable objectcacher debug (debug objectcacher = 10) and look at the > output (from the "trim" lines) to see if it's obeying its limit > * See if fuse_disable_pagecache setting makes a difference > > Also, is the version of fuse the same on the nodes running 9.2.0 vs. > the nodes running 10.2.2? > > John > > Cheers > Goncalo > > > > On 07/08/2016 09:54 AM, Brad Hubbard wrote: > > Hi Goncalo, > > If possible it would be great if you could capture a core file for this with > full debugging symbols (preferably glibc debuginfo as well). How you do > that will depend on the ceph version and your OS but we can offfer help > if required I'm sure. > > Once you have the core do the following. > > $ gdb /path/to/ceph-fuse core.XXXX > (gdb) set pag off > (gdb) set log on > (gdb) thread apply all bt > (gdb) thread apply all bt full > > Then quit gdb and you should find a file called gdb.txt in your > working directory. > If you could attach that file to http://tracker.ceph.com/issues/16610 > > Cheers, > Brad > > On Fri, Jul 8, 2016 at 12:06 AM, Patrick Donnelly <pdonnell@xxxxxxxxxx> > wrote: > > On Thu, Jul 7, 2016 at 2:01 AM, Goncalo Borges > <goncalo.borges@xxxxxxxxxxxxx> wrote: > > Unfortunately, the other user application breaks ceph-fuse again (It is a > completely different application then in my previous test). > > We have tested it in 4 machines with 4 cores. The user is submitting 16 > single core jobs which are all writing different output files (one per job) > to a common dir in cephfs. The first 4 jobs run happily and never break > ceph-fuse. But the remaining 12 jobs, running in the remaining 3 machines, > trigger a segmentation fault, which is completely different from the other > case. > > ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374) > 1: (()+0x297fe2) [0x7f54402b7fe2] > 2: (()+0xf7e0) [0x7f543ecf77e0] > 3: (ObjectCacher::bh_write_scattered(std::list<ObjectCacher::BufferHead*, > std::allocator<ObjectCacher::BufferHead*> >&)+0x36) [0x7f5440268086] > 4: (ObjectCacher::bh_write_adjacencies(ObjectCacher::BufferHead*, > std::chrono::time_point<ceph::time_detail::real_clock, > std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >, long*, > int*)+0x22c) [0x7f5440268a3c] > 5: (ObjectCacher::flush(long)+0x1ef) [0x7f5440268cef] > 6: (ObjectCacher::flusher_entry()+0xac4) [0x7f5440269a34] > 7: (ObjectCacher::FlusherThread::entry()+0xd) [0x7f5440275c6d] > 8: (()+0x7aa1) [0x7f543ecefaa1] > 9: (clone()+0x6d) [0x7f543df6893d] > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to > interpret this. > > This one looks like a very different problem. I've created an issue > here: http://tracker.ceph.com/issues/16610 > > Thanks for the report and debug log! 
> > -- > Patrick Donnelly > _______________________________________________ > ceph-users mailing list > ceph-users@xxxxxxxxxxxxxx > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > -- > Goncalo Borges > Research Computing > ARC Centre of Excellence for Particle Physics at the Terascale > School of Physics A28 | University of Sydney, NSW 2006 > T: +61 2 93511937 _______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com