RE: [Gluster-devel] "du" a large count files in a directory casue mounted glusterfs filesystem coredump

"Zhou, Cynthia (Nokia - CN/Hangzhou)" <cynthia.zhou@xxxxxxxxx> · Mon, 12 Dec 2016 05:29:14 +0000

Hi glusterfs experts,
	The inode documentation at https://staged-gluster-docs.readthedocs.io/en/release3.7.0beta1/Developer-guide/datastructure-inode/ contains the following description:

when the the lru limit of the inode table has been exceeded, A inode is removed from the inode table and eventually destroyed

    The glusterfs source code of the function inode_table_new contains the following lines, so it seems lru_limit is not infinite:

        /* In case FUSE is initing the inode table. */
        if (lru_limit == 0)
                lru_limit = DEFAULT_INODE_MEMPOOL_ENTRIES; // 32 * 1024
    Is it possible that glusterfs removes an inode from the inode table because the lru limit has been reached?
    From the backtraces pasted by George, it seems the inode table address is invalid, which caused the coredump.

Best regards,
Cynthia （周琳）
MBB SM HETRAN SW3 MATRIX  
Storage         
Mobile: +86 (0)18657188311

-----Original Message-----
From: Raghavendra Gowdappa [mailto:rgowdapp@xxxxxxxxxx] 
Sent: Monday, December 12, 2016 12:34 PM
To: Lian, George (Nokia - CN/Hangzhou) <george.lian@xxxxxxxxx>
Cc: Gluster-devel@xxxxxxxxxxx; Chinea, Carlos (Nokia - FI/Espoo) <carlos.chinea@xxxxxxxxx>; Hautio, Kari (Nokia - FI/Espoo) <kari.hautio@xxxxxxxxx>; linux-fsdevel@xxxxxxxxxxxxxxx; Zhang, Bingxuan (Nokia - CN/Hangzhou) <bingxuan.zhang@xxxxxxxxx>; Zhou, Cynthia (Nokia - CN/Hangzhou) <cynthia.zhou@xxxxxxxxx>; Li, Deqian (Nokia - CN/Hangzhou) <deqian.li@xxxxxxxxx>; Zizka, Jan (Nokia - CZ/Prague) <jan.zizka@xxxxxxxxx>; Bao, Xiaohui (Nokia - CN/Hangzhou) <xiaohui.bao@xxxxxxxxx>
Subject: Re: [Gluster-devel] "du" a large count files in a directory casue mounted glusterfs filesystem coredump

----- Original Message -----
> From: "George Lian (Nokia - CN/Hangzhou)" <george.lian@xxxxxxxxx>
> To: Gluster-devel@xxxxxxxxxxx, "Carlos Chinea (Nokia - FI/Espoo)" <carlos.chinea@xxxxxxxxx>, "Kari Hautio (Nokia -
> FI/Espoo)" <kari.hautio@xxxxxxxxx>, linux-fsdevel@xxxxxxxxxxxxxxx
> Cc: "Bingxuan Zhang (Nokia - CN/Hangzhou)" <bingxuan.zhang@xxxxxxxxx>, "Cynthia Zhou (Nokia - CN/Hangzhou)"
> <cynthia.zhou@xxxxxxxxx>, "Deqian Li (Nokia - CN/Hangzhou)" <deqian.li@xxxxxxxxx>, "Jan Zizka (Nokia - CZ/Prague)"
> <jan.zizka@xxxxxxxxx>, "Xiaohui Bao (Nokia - CN/Hangzhou)" <xiaohui.bao@xxxxxxxxx>
> Sent: Friday, December 9, 2016 2:50:44 PM
> Subject: Re: [Gluster-devel] "du" a large count files in a directory casue mounted glusterfs filesystem coredump
> 
> For the life cycle of an inode in glusterfs, as shown in
> https://staged-gluster-docs.readthedocs.io/en/release3.7.0beta1/Developer-guide/datastructure-inode/
> It shows that “ A inode is removed from the inode table and eventually
> destroyed when unlink or rmdir operation is performed on a file/directory,
> or the the lru limit of the inode table has been exceeded . ”
> Now the default value of the inode lru limit is 32k for glusterfs.
> When we run “du” or “ls -R” on a directory that holds a large number of
> files, more than 32K, it could easily hit the lru limit.

The Glusterfs mount process has an infinite lru limit. The reason is that Glusterfs passes the address of the inode object as the "nodeid" (aka identifier) representing the inode. For all future references to the inode, the kernel just sends back this nodeid. So Glusterfs cannot free the inode as long as the kernel remembers it. In other words, the inode table size in the mount process depends on the dentry cache or inode table size in the fuse kernel module. So, for an inode to be freed in the mount process:
1. There should not be any on-going ops referring to the inode.
2. The kernel should have sent as many forgets as the number of lookups it has done.
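
The two conditions above can be illustrated with a minimal sketch in C. It is not the glusterfs implementation: the demo_inode structure and all demo_* functions are hypothetical names made up for this illustration. It only shows the idea that the nodeid is the object's address, that every lookup bumps a counter, and that the object may be freed only once forgets have balanced the lookups and no op is in flight.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for an inode object; not the real inode_t. */
struct demo_inode {
    uint64_t nlookup;    /* lookups the kernel still remembers */
    int      active_ops; /* on-going ops referring to the inode */
};

/* The nodeid handed to the kernel is simply the object's address. */
uint64_t demo_inode_to_nodeid(struct demo_inode *inode)
{
    return (uint64_t)(uintptr_t)inode;
}

/* The kernel sends the nodeid back; it is cast straight to a pointer,
 * so the object must still be alive at that address. */
struct demo_inode *demo_nodeid_to_inode(uint64_t nodeid)
{
    return (struct demo_inode *)(uintptr_t)nodeid;
}

/* Allocate a zeroed inode (no lookups, no active ops). */
struct demo_inode *demo_inode_new(void)
{
    return calloc(1, sizeof(struct demo_inode));
}

/* Each lookup answered to the kernel increments the counter. */
uint64_t demo_lookup(struct demo_inode *inode)
{
    inode->nlookup++;
    return demo_inode_to_nodeid(inode);
}

/* Apply a forget. Returns 1 if the inode was freed, 0 if the kernel
 * still remembers it or an op is still in flight. */
int demo_forget(uint64_t nodeid, uint64_t nlookup)
{
    struct demo_inode *inode = demo_nodeid_to_inode(nodeid);

    /* The kernel must never forget more lookups than it has done. */
    assert(inode->nlookup >= nlookup);
    inode->nlookup -= nlookup;

    if (inode->nlookup == 0 && inode->active_ops == 0) {
        free(inode);
        return 1;
    }
    return 0;
}
```

In this sketch, an inode that was looked up twice survives the first demo_forget(id, 1) and is freed only by the second one. If anything else freed the object earlier, the next forget would cast a stale nodeid to a dangling pointer.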

> @gluster-expert, when glusterfs destroys an inode due to the LRU limit, does
> glusterfs notify the kernel? (from my study so far, it seems not)

No, it does not. As explained above, the mount process never destroys the inode as long as the kernel remembers it.

> @linux-fsdevel-expert, could you please clarify the inode recycling
> mechanism, or the fuse-forget FOP for an inode, for us?
> Is it possible that the kernel frees the inode (which will trigger fuse-forget to
> glusterfs) later than glusterfs destroys it due to the lru limit?

In the mount process an inode is never destroyed through the lru mechanism, as the limit is infinite.

> If it is possible, then the nodeid (which is converted from the memory address
> in glusterfs) may be stale, and when it is passed to glusterfs userspace,
> glusterfs just converts the u64 nodeid to a memory address and tries to access
> that address, which will lead to an invalid access and finally a coredump!

That's precisely the reason why we keep an infinite lru limit for the glusterfs client process. Please note, though, that we do have a finite lru limit for the brick process, nfsv3 server, etc.
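
The difference between the two kinds of process can be sketched as follows. This is an illustration only: demo_table and demo_table_prune are hypothetical names, and representing "infinite" as a limit of 0 is an assumption of this sketch. A table without a limit never evicts, while a table with a finite limit evicts until its lru list fits the limit again.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical inode table; only the counters matter here. */
struct demo_table {
    size_t lru_limit; /* assumption: 0 means "no limit" in this sketch */
    size_t lru_size;  /* inodes currently on the lru list */
};

/* Evict entries while a finite limit is exceeded.
 * Returns the number of entries that were evicted. */
size_t demo_table_prune(struct demo_table *table)
{
    size_t purged = 0;

    assert(table != NULL);
    while (table->lru_limit != 0 && table->lru_size > table->lru_limit) {
        table->lru_size--; /* a real table would destroy the inode here */
        purged++;
    }
    return purged;
}
```

With 150000 entries on the lru list, a table whose limit is 0 evicts nothing, whereas a table with a (hypothetical) limit of 16384 evicts 133616 entries. Eviction is only safe in the second kind of process because its inodes are not addressed by a pointer value that the kernel holds on to.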

regards,
Raghavendra

> Thanks & Best Regards,
> George
> _____________________________________________
> From: Lian, George (Nokia - CN/Hangzhou)
> Sent: Friday, December 09, 2016 9:49 AM
> To: 'Gluster-devel@xxxxxxxxxxx' <Gluster-devel@xxxxxxxxxxx>
> Cc: Zhou, Cynthia (Nokia - CN/Hangzhou) <cynthia.zhou@xxxxxxxxx>; Bao,
> Xiaohui (Nokia - CN/Hangzhou) <xiaohui.bao@xxxxxxxxx>; Zhang, Bingxuan
> (Nokia - CN/Hangzhou) <bingxuan.zhang@xxxxxxxxx>; Li, Deqian (Nokia -
> CN/Hangzhou) <deqian.li@xxxxxxxxx>
> Subject: "du" a large count files in a directory casue mounted glusterfs
> filesystem coredump
> Hi, GlusterFS Expert,
> We have an issue when running the “du” command on a directory that holds a
> large number of files/directories; in our environment there are more than
> 150k files in the directory.
> # df -i .
> Filesystem Inodes IUsed IFree IUse% Mounted on
> 169.254.0.23:/home 261888 154146 107742 59% /home
> When we run the “du” command in this directory, it very easily causes the
> glusterfs process to coredump. The coredump backtraces show it is always
> caused by the do_forget API, though the last call sometimes differs. Please
> see the detailed backtraces at the end of this mail.
> From my investigation, the issue may be caused by the unsafe call of the API
> “fuse_ino_to_inode”.
> I am JUST GUESSING that, in some unexpected case, “fuse_ino_to_inode” is
> called with a nodeid that came from a forget FOP and gets the address by
> simply mapping the uint64 to a memory address,
> while the inode at that address may already have been destroyed because
> “ the lru limit of the inode table has been exceeded” in our large-file case.
> So this operation may not be safe, and the coredump backtraces also show
> that the core occurred in several different places.
> Could you please share your comments on my investigation?
> And BTW I have some questions,
> 
> 
>     1. How does the inode number shown by the “stat” command map to the
>     inode in glusterfs?
> stat log
> File: ‘log’
> Size: 4096 Blocks: 8 IO Block: 4096 directory
> Device: fd10h/64784d Inode: 14861593 Links: 3
> 
> 
>     2. When does the system call the forget FOP, and where does the nodeid
>     parameter come from in the system?
> 
> 
>     3. When an inode is eventually destroyed due to the lru limit and the
>     same file is accessed by a FOP the next time, does the inode get the same
>     address in the next lookup? If not, can there be a case where the forget
>     FOP passes an older nodeid than the one glusterfs has?
> Thanks & Best Regards,
> George
> 
> 
>     1. Coredump backtrace 1
> #0 0x00007fcd610a69e7 in __list_splice (list=0x26c350c, head=0x7fcd56c23db0)
> at list.h:121
> #1 0x00007fcd610a6a51 in list_splice_init (list=0x26c350c,
> head=0x7fcd56c23db0) at list.h:146
> #2 0x00007fcd610a95c8 in inode_table_prune (table=0x26c347c) at inode.c:1330
> #3 0x00007fcd610a8a02 in inode_forget (inode=0x7fcd5001147c, nlookup=1) at
> inode.c:977
> #4 0x00007fcd5f151e24 in do_forget (this=0xc43590, unique=437787,
> nodeid=140519787271292, nlookup=1) at fuse-bridge.c:637
> #5 0x00007fcd5f151fd3 in fuse_batch_forget (this=0xc43590,
> finh=0x7fcd50c266c0, msg=0x7fcd50c266e8) at fuse-bridge.c:676
> #6 0x00007fcd5f168aff in fuse_thread_proc (data=0xc43590) at
> fuse-bridge.c:4909
> #7 0x00007fcd6080b414 in start_thread (arg=0x7fcd56c24700) at
> pthread_create.c:333
> #8 0x00007fcd600f7b9f in clone () at
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:105
> (gdb) print head
> $6 = (struct list_head *) 0x7fcd56c23db0
> (gdb) print *head
> $7 = {next = 0x7fcd56c23db0, prev = 0x7fcd56c23db0}
> (gdb) print head->next
> $8 = (struct list_head *) 0x7fcd56c23db0
> (gdb) print list->prev
> $9 = (struct list_head *) 0x5100000000
> (gdb) print (list->prev)->next
> Cannot access memory at address 0x5100000000
> 
> 
>     1. Coredump backtrace 2
> #0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:58
> #1 0x00007f612ab4a43a in __GI_abort () at abort.c:89
> #2 0x00007f612ab41ccd in __assert_fail_base (fmt=0x7f612ac76618 "%s%s%s:%u:
> %s%sAssertion `%s' failed.\n%n",
> assertion=assertion@entry=0x7f612bc01ec1 "inode->nlookup >= nlookup",
> file=file@entry=0x7f612bc01d9b "inode.c", line=line@entry=607,
> function=function@entry=0x7f612bc02339 <__PRETTY_FUNCTION__.10128>
> "__inode_forget") at assert.c:92
> #3 0x00007f612ab41d82 in __GI___assert_fail (assertion=0x7f612bc01ec1
> "inode->nlookup >= nlookup", file=0x7f612bc01d9b "inode.c", line=607,
> function=0x7f612bc02339 <__PRETTY_FUNCTION__.10128> "__inode_forget") at
> assert.c:101
> #4 0x00007f612bbade56 in __inode_forget (inode=0x7f611801d68c, nlookup=4) at
> inode.c:607
> #5 0x00007f612bbae9ea in inode_forget (inode=0x7f611801d68c, nlookup=4) at
> inode.c:973
> #6 0x00007f6129defdd5 in do_forget (this=0x1a895c0, unique=436589,
> nodeid=140054991328908, nlookup=4) at fuse-bridge.c:633
> #7 0x00007f6129defe94 in fuse_forget (this=0x1a895c0, finh=0x7f6118c28be0,
> msg=0x7f6118c28c08) at fuse-bridge.c:652
> #8 0x00007f6129e06ab0 in fuse_thread_proc (data=0x1a895c0) at
> fuse-bridge.c:4905
> #9 0x00007f612b311414 in start_thread (arg=0x7f61220d0700) at
> pthread_create.c:333
> #10 0x00007f612abfdb9f in clone () at
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:105
> (gdb) f 5
> #5 0x00007f612bbae9ea in inode_forget (inode=0x7f611801d68c, nlookup=4) at
> inode.c:973
> 973 inode.c: No such file or directory.
> (gdb) f 4
> #4 0x00007f612bbade56 in __inode_forget (inode=0x7f611801d68c, nlookup=4) at
> inode.c:607
> 607 in inode.c
> (gdb) print inode->nlookup
> 
> 
>     1. Coredump backtrace 3
> #0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:58
> #1 0x00007f86b0b0f43a in __GI_abort () at abort.c:89
> #2 0x00007f86b0b06ccd in __assert_fail_base (fmt=0x7f86b0c3b618 "%s%s%s:%u:
> %s%sAssertion `%s' failed.\n%n",
> assertion=assertion@entry=0x7f86b12e1f38 "INTERNAL_SYSCALL_ERRNO (e, __err)
> != ESRCH || !robust",
> file=file@entry=0x7f86b12e1e7c "../nptl/pthread_mutex_lock.c",
> line=line@entry=352,
> function=function@entry=0x7f86b12e1fe0 <__PRETTY_FUNCTION__.8666>
> "__pthread_mutex_lock_full") at assert.c:92
> #3 0x00007f86b0b06d82 in __GI___assert_fail
> (assertion=assertion@entry=0x7f86b12e1f38 "INTERNAL_SYSCALL_ERRNO (e, __err)
> != ESRCH || !robust",
> file=file@entry=0x7f86b12e1e7c "../nptl/pthread_mutex_lock.c",
> line=line@entry=352,
> function=function@entry=0x7f86b12e1fe0 <__PRETTY_FUNCTION__.8666>
> "__pthread_mutex_lock_full") at assert.c:101
> #4 0x00007f86b12d89da in __pthread_mutex_lock_full (mutex=0x7f86a12ffcac) at
> ../nptl/pthread_mutex_lock.c:352
> #5 0x00007f86b1b729f1 in inode_ref (inode=0x7f86a03cefec) at inode.c:476
> #6 0x00007f86afdafb04 in fuse_ino_to_inode (ino=140216190693356,
> fuse=0x1a541f0) at fuse-helpers.c:390
> #7 0x00007f86afdb4d6b in do_forget (this=0x1a541f0, unique=96369,
> nodeid=140216190693356, nlookup=1) at fuse-bridge.c:629
> #8 0x00007f86afdb4f84 in fuse_batch_forget (this=0x1a541f0,
> finh=0x7f86a03b4f90, msg=0x7f86a03b4fb8) at fuse-bridge.c:674
> #9 0x00007f86afdcbab0 in fuse_thread_proc (data=0x1a541f0) at
> fuse-bridge.c:4905
> #10 0x00007f86b12d6414 in start_thread (arg=0x7f86a7ac8700) at
> pthread_create.c:333
> #11 0x00007f86b0bc2b9f in clone () at
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:105
> 
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel@xxxxxxxxxxx
> http://www.gluster.org/mailman/listinfo/gluster-devel