Hi Gowdappa,

Thanks for your comments, I am clear about it now!

Best regards,
Cynthia (周琳)
MBB SM HETRAN SW3 MATRIX
Storage
Mobile: +86 (0)18657188311

-----Original Message-----
From: Raghavendra Gowdappa [mailto:rgowdapp@xxxxxxxxxx]
Sent: Monday, December 12, 2016 2:29 PM
To: Zhou, Cynthia (Nokia - CN/Hangzhou) <cynthia.zhou@xxxxxxxxx>
Cc: Lian, George (Nokia - CN/Hangzhou) <george.lian@xxxxxxxxx>; Gluster-devel@xxxxxxxxxxx; Chinea, Carlos (Nokia - FI/Espoo) <carlos.chinea@xxxxxxxxx>; Hautio, Kari (Nokia - FI/Espoo) <kari.hautio@xxxxxxxxx>; linux-fsdevel@xxxxxxxxxxxxxxx; Zhang, Bingxuan (Nokia - CN/Hangzhou) <bingxuan.zhang@xxxxxxxxx>; Li, Deqian (Nokia - CN/Hangzhou) <deqian.li@xxxxxxxxx>; Zizka, Jan (Nokia - CZ/Prague) <jan.zizka@xxxxxxxxx>; Bao, Xiaohui (Nokia - CN/Hangzhou) <xiaohui.bao@xxxxxxxxx>
Subject: Re: "du" a large count of files in a directory causes mounted glusterfs filesystem coredump

----- Original Message -----
> From: "Cynthia Zhou (Nokia - CN/Hangzhou)" <cynthia.zhou@xxxxxxxxx>
> To: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>, "George Lian (Nokia - CN/Hangzhou)" <george.lian@xxxxxxxxx>
> Cc: Gluster-devel@xxxxxxxxxxx, "Carlos Chinea (Nokia - FI/Espoo)" <carlos.chinea@xxxxxxxxx>, "Kari Hautio (Nokia - FI/Espoo)" <kari.hautio@xxxxxxxxx>, linux-fsdevel@xxxxxxxxxxxxxxx, "Bingxuan Zhang (Nokia - CN/Hangzhou)" <bingxuan.zhang@xxxxxxxxx>, "Deqian Li (Nokia - CN/Hangzhou)" <deqian.li@xxxxxxxxx>, "Jan Zizka (Nokia - CZ/Prague)" <jan.zizka@xxxxxxxxx>, "Xiaohui Bao (Nokia - CN/Hangzhou)" <xiaohui.bao@xxxxxxxxx>
> Sent: Monday, December 12, 2016 10:59:14 AM
> Subject: RE: "du" a large count of files in a directory causes mounted glusterfs filesystem coredump
>
> Hi glusterfs experts,
> The documentation at
> https://staged-gluster-docs.readthedocs.io/en/release3.7.0beta1/Developer-guide/datastructure-inode/
> contains the following description:
>
>     "when the lru limit of the inode table has been exceeded, an inode is
>     removed from the inode table and eventually destroyed"
>
> And the glusterfs source code, in the function inode_table_new, contains the
> following lines, so lru_limit does not look infinite:
>
>     /* In case FUSE is initing the inode table. */
>     if (lru_limit == 0)
>         lru_limit = DEFAULT_INODE_MEMPOOL_ENTRIES; // 32 * 1024

That's just a reuse of the variable lru_limit. Note that the value passed by the caller has already been stored in the new itable at:
https://github.com/gluster/glusterfs/blob/master/libglusterfs/src/inode.c#L1582

Also, as can be seen here:
https://github.com/gluster/glusterfs/blob/master/xlators/mount/fuse/src/fuse-bridge.c#L5205
fuse passes 0 as the lru-limit, which is treated as infinite. So I don't see an issue here.
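For illustration, here is a rough sketch of the pattern being described (simplified and paraphrased, not the verbatim inode.c code; the sketch_* names are illustrative only): the caller's lru-limit is stored in the table unchanged, and the local variable is reused afterwards only to pick a mempool size, so a stored limit of 0 still means "never prune".

    /* Simplified sketch, assuming the behaviour described above;
     * not verbatim glusterfs code. */
    #include <stdint.h>
    #include <stdlib.h>

    #define DEFAULT_INODE_MEMPOOL_ENTRIES (32 * 1024)

    typedef struct sketch_inode_table {
            uint32_t lru_limit;   /* 0 means "never prune" (infinite) */
            uint32_t lru_size;    /* inodes currently on the lru list */
            /* ... mempool, hash table, lists ... */
    } sketch_inode_table_t;

    static sketch_inode_table_t *
    sketch_inode_table_new(uint32_t lru_limit)
    {
            sketch_inode_table_t *table = calloc(1, sizeof(*table));
            if (!table)
                    return NULL;

            table->lru_limit = lru_limit;   /* caller's value (0 from fuse) kept as-is */

            /* The local variable is reused below only to size the mempool;
             * it does not change the limit stored in the table. */
            if (lru_limit == 0)
                    lru_limit = DEFAULT_INODE_MEMPOOL_ENTRIES;
            /* allocate_inode_mempool(table, lru_limit); */

            return table;
    }

    /* Pruning honours the stored limit, so a fuse mount (lru_limit == 0)
     * never evicts inodes behind the kernel's back. */
    static int
    sketch_table_needs_prune(sketch_inode_table_t *table)
    {
            return table->lru_limit > 0 && table->lru_size > table->lru_limit;
    }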
> Is it possible that glusterfs removes the inode from the inode table because
> the lru limit was reached?
> From the backtrace pasted by George, it seems the inode table address is
> invalid, which caused the coredump.
>
> Best regards,
> Cynthia (周琳)
> MBB SM HETRAN SW3 MATRIX
> Storage
> Mobile: +86 (0)18657188311
>
> -----Original Message-----
> From: Raghavendra Gowdappa [mailto:rgowdapp@xxxxxxxxxx]
> Sent: Monday, December 12, 2016 12:34 PM
> To: Lian, George (Nokia - CN/Hangzhou) <george.lian@xxxxxxxxx>
> Cc: Gluster-devel@xxxxxxxxxxx; Chinea, Carlos (Nokia - FI/Espoo) <carlos.chinea@xxxxxxxxx>; Hautio, Kari (Nokia - FI/Espoo) <kari.hautio@xxxxxxxxx>; linux-fsdevel@xxxxxxxxxxxxxxx; Zhang, Bingxuan (Nokia - CN/Hangzhou) <bingxuan.zhang@xxxxxxxxx>; Zhou, Cynthia (Nokia - CN/Hangzhou) <cynthia.zhou@xxxxxxxxx>; Li, Deqian (Nokia - CN/Hangzhou) <deqian.li@xxxxxxxxx>; Zizka, Jan (Nokia - CZ/Prague) <jan.zizka@xxxxxxxxx>; Bao, Xiaohui (Nokia - CN/Hangzhou) <xiaohui.bao@xxxxxxxxx>
> Subject: Re: "du" a large count of files in a directory causes mounted glusterfs filesystem coredump
>
> ----- Original Message -----
> > From: "George Lian (Nokia - CN/Hangzhou)" <george.lian@xxxxxxxxx>
> > To: Gluster-devel@xxxxxxxxxxx, "Carlos Chinea (Nokia - FI/Espoo)" <carlos.chinea@xxxxxxxxx>, "Kari Hautio (Nokia - FI/Espoo)" <kari.hautio@xxxxxxxxx>, linux-fsdevel@xxxxxxxxxxxxxxx
> > Cc: "Bingxuan Zhang (Nokia - CN/Hangzhou)" <bingxuan.zhang@xxxxxxxxx>, "Cynthia Zhou (Nokia - CN/Hangzhou)" <cynthia.zhou@xxxxxxxxx>, "Deqian Li (Nokia - CN/Hangzhou)" <deqian.li@xxxxxxxxx>, "Jan Zizka (Nokia - CZ/Prague)" <jan.zizka@xxxxxxxxx>, "Xiaohui Bao (Nokia - CN/Hangzhou)" <xiaohui.bao@xxxxxxxxx>
> > Sent: Friday, December 9, 2016 2:50:44 PM
> > Subject: Re: "du" a large count of files in a directory causes mounted glusterfs filesystem coredump
> >
> > Regarding the life cycle of an inode in glusterfs, as described in
> > https://staged-gluster-docs.readthedocs.io/en/release3.7.0beta1/Developer-guide/datastructure-inode/
> > "An inode is removed from the inode table and eventually destroyed when an
> > unlink or rmdir operation is performed on a file/directory, or the lru
> > limit of the inode table has been exceeded."
> > The default value for the inode lru limit in glusterfs is 32k.
> > When we "du" or "ls -R" a directory holding more than 32k files, it can
> > easily hit the lru limit.
>
> The glusterfs mount process has an infinite lru limit. The reason is that
> glusterfs passes the address of the inode object as the "nodeid" (aka
> identifier) representing the inode. For all future references to the inode,
> the kernel just sends back this nodeid. So glusterfs cannot free up the
> inode as long as the kernel remembers it. In other words, the inode table
> size in the mount process depends on the dentry-cache and inode table size
> in the fuse kernel module. So, for an inode to be freed up in the mount
> process:
> 1. There should not be any ongoing ops referring to the inode.
> 2. The kernel should send as many forgets as the number of lookups it has
>    done.
>
> > @gluster-expert, when glusterfs destroys an inode due to the LRU limit,
> > does glusterfs notify the kernel? (from my study so far, it seems not)
>
> No, it does not. As explained above, the mount process never destroys the
> inode as long as the kernel remembers it.
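For illustration, a minimal sketch of the nodeid/forget scheme described above, under the stated assumption that the inode's address is handed to the kernel as the FUSE nodeid; the sketch_* names are illustrative, not the actual fuse-bridge.c symbols.

    /* Minimal sketch (simplified, not the actual glusterfs code): every LOOKUP
     * reply exposes the inode pointer as the nodeid and bumps a lookup count
     * that only the kernel's FORGETs can drain. */
    #include <stdint.h>

    typedef struct sketch_inode {
            uint64_t nlookup;   /* lookups sent to the kernel, not yet forgotten */
            /* ... gfid, table pointer, ref count, lists ... */
    } sketch_inode_t;

    /* LOOKUP reply: the inode's address becomes the FUSE nodeid. */
    static uint64_t
    sketch_inode_to_fuse_nodeid(sketch_inode_t *inode)
    {
            inode->nlookup++;                   /* kernel now holds a reference */
            return (uint64_t)(uintptr_t)inode;
    }

    /* FORGET request: the kernel returns the same nodeid with a count. The
     * inode may only be released once nlookup drops to zero, i.e. after the
     * kernel has sent as many forgets as it received lookups. */
    static void
    sketch_fuse_forget(uint64_t nodeid, uint64_t nlookup)
    {
            sketch_inode_t *inode = (sketch_inode_t *)(uintptr_t)nodeid;

            inode->nlookup -= nlookup;
            if (inode->nlookup == 0) {
                    /* safe to unref/destroy: the kernel no longer knows this nodeid */
            }
    }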
> > @linux-fsdevel-expert, could you please clarify the inode recycle
> > mechanism or the fuse-forget FOP for an inode for us?
> > Is it possible that the kernel frees the inode (which will trigger a
> > fuse-forget to glusterfs) later than the destroy in glusterfs due to the
> > lru limit?
>
> On the mount process the inode is never destroyed through the lru mechanism,
> as the limit is infinite.
>
> > If it is possible, then the nodeid (which is converted from the memory
> > address in glusterfs) may be stale, and when it is passed to glusterfs
> > userspace, glusterfs just converts the u64 nodeid to a memory address and
> > tries to access that address, which will lead to an invalid access and
> > finally a coredump!
>
> That's precisely the reason why we keep an infinite lru limit for the
> glusterfs client process. Though please note that we do have a finite lru
> limit for the brick process, nfsv3 server, etc.
>
> regards,
> Raghavendra
>
> > Thanks & Best Regards,
> > George
> >
> > _____________________________________________
> > From: Lian, George (Nokia - CN/Hangzhou)
> > Sent: Friday, December 09, 2016 9:49 AM
> > To: 'Gluster-devel@xxxxxxxxxxx' <Gluster-devel@xxxxxxxxxxx>
> > Cc: Zhou, Cynthia (Nokia - CN/Hangzhou) <cynthia.zhou@xxxxxxxxx>; Bao, Xiaohui (Nokia - CN/Hangzhou) <xiaohui.bao@xxxxxxxxx>; Zhang, Bingxuan (Nokia - CN/Hangzhou) <bingxuan.zhang@xxxxxxxxx>; Li, Deqian (Nokia - CN/Hangzhou) <deqian.li@xxxxxxxxx>
> > Subject: "du" a large count of files in a directory causes mounted glusterfs filesystem coredump
> >
> > Hi, GlusterFS Expert,
> > We have an issue when running the "du" command over a large number of
> > files/directories in a directory; in our environment there are more than
> > 150k files in the directory.
> >
> > # df -i .
> > Filesystem          Inodes  IUsed  IFree IUse% Mounted on
> > 169.254.0.23:/home  261888 154146 107742   59% /home
> >
> > When we run the "du" command in this directory, it very easily makes the
> > glusterfs process coredump, and the coredump backtraces show that it is
> > always caused by the do_forget API, though the last call differs between
> > cores. Please see the detailed backtraces at the end of this mail.
> > From my investigation, the issue may be caused by an unsafe call of the
> > API "fuse_ino_to_inode":
> > I JUST GUESS that in some unexpected case, when "fuse_ino_to_inode" is
> > called with a nodeid that came from a forget FOP, it obtains the inode
> > address by simply mapping the uint64 to a memory address,
> > but that inode may already have been destroyed because "the lru limit of
> > the inode table has been exceeded" in our large-file case, so the
> > operation may not be safe,
> > and the coredump backtraces also show several different cases when the
> > core occurred.
> > Could you please share your comments on my investigation?
> > And by the way, I have some questions:
> >
> > 1. How does the inode number shown by the "stat" command map to the inode
> >    in glusterfs?
> >
> >    # stat log
> >    File: 'log'
> >    Size: 4096        Blocks: 8        IO Block: 4096   directory
> >    Device: fd10h/64784d    Inode: 14861593    Links: 3
> >
> > 2. When does the system call the forget FOP, and where does the nodeid
> >    parameter come from?
> >
> > 3. When an inode is eventually destroyed due to the lru limit and the
> >    same file is FOPed again, is the address of this inode the same in the
> >    next lookup? If not, is there a case where the forget FOP hands out an
> >    older nodeid than the one glusterfs currently has?
> >
> > Thanks & Best Regards,
> > George
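To illustrate the guess in the mail above, here is a hypothetical sketch of the kind of u64-to-pointer conversion being described (the sketch_* names are illustrative, not the actual glusterfs functions). Note that coredump backtrace 3 below does crash inside inode_ref called from fuse_ino_to_inode on the forget path; as Raghavendra explains above, the client's infinite lru limit is what is meant to keep the address behind the nodeid alive so this conversion stays safe.

    /* Hypothetical sketch, not glusterfs code: the forget path reinterprets
     * the u64 nodeid as an inode pointer. Nothing validates the address, so
     * the inode it points at must still be alive. */
    #include <stdint.h>

    typedef struct sketch_inode {
            int ref;              /* simplified reference count */
            /* ... gfid, table, lists ... */
    } sketch_inode_t;

    static sketch_inode_t *
    sketch_fuse_ino_to_inode(uint64_t ino)
    {
            sketch_inode_t *inode = (sketch_inode_t *)(uintptr_t)ino;

            inode->ref++;         /* would touch freed memory if the inode were gone */
            return inode;
    }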
> >
> > 1. Coredump backtrace 1
> >
> > #0  0x00007fcd610a69e7 in __list_splice (list=0x26c350c, head=0x7fcd56c23db0) at list.h:121
> > #1  0x00007fcd610a6a51 in list_splice_init (list=0x26c350c, head=0x7fcd56c23db0) at list.h:146
> > #2  0x00007fcd610a95c8 in inode_table_prune (table=0x26c347c) at inode.c:1330
> > #3  0x00007fcd610a8a02 in inode_forget (inode=0x7fcd5001147c, nlookup=1) at inode.c:977
> > #4  0x00007fcd5f151e24 in do_forget (this=0xc43590, unique=437787, nodeid=140519787271292, nlookup=1) at fuse-bridge.c:637
> > #5  0x00007fcd5f151fd3 in fuse_batch_forget (this=0xc43590, finh=0x7fcd50c266c0, msg=0x7fcd50c266e8) at fuse-bridge.c:676
> > #6  0x00007fcd5f168aff in fuse_thread_proc (data=0xc43590) at fuse-bridge.c:4909
> > #7  0x00007fcd6080b414 in start_thread (arg=0x7fcd56c24700) at pthread_create.c:333
> > #8  0x00007fcd600f7b9f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:105
> >
> > (gdb) print head
> > $6 = (struct list_head *) 0x7fcd56c23db0
> > (gdb) print *head
> > $7 = {next = 0x7fcd56c23db0, prev = 0x7fcd56c23db0}
> > (gdb) print head->next
> > $8 = (struct list_head *) 0x7fcd56c23db0
> > (gdb) print list->prev
> > $9 = (struct list_head *) 0x5100000000
> > (gdb) print (list->prev)->next
> > Cannot access memory at address 0x5100000000
> >
> > 2. Coredump backtrace 2
> >
> > #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:58
> > #1  0x00007f612ab4a43a in __GI_abort () at abort.c:89
> > #2  0x00007f612ab41ccd in __assert_fail_base (fmt=0x7f612ac76618 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
> >     assertion=assertion@entry=0x7f612bc01ec1 "inode->nlookup >= nlookup", file=file@entry=0x7f612bc01d9b "inode.c", line=line@entry=607,
> >     function=function@entry=0x7f612bc02339 <__PRETTY_FUNCTION__.10128> "__inode_forget") at assert.c:92
> > #3  0x00007f612ab41d82 in __GI___assert_fail (assertion=0x7f612bc01ec1 "inode->nlookup >= nlookup", file=0x7f612bc01d9b "inode.c", line=607,
> >     function=0x7f612bc02339 <__PRETTY_FUNCTION__.10128> "__inode_forget") at assert.c:101
> > #4  0x00007f612bbade56 in __inode_forget (inode=0x7f611801d68c, nlookup=4) at inode.c:607
> > #5  0x00007f612bbae9ea in inode_forget (inode=0x7f611801d68c, nlookup=4) at inode.c:973
> > #6  0x00007f6129defdd5 in do_forget (this=0x1a895c0, unique=436589, nodeid=140054991328908, nlookup=4) at fuse-bridge.c:633
> > #7  0x00007f6129defe94 in fuse_forget (this=0x1a895c0, finh=0x7f6118c28be0, msg=0x7f6118c28c08) at fuse-bridge.c:652
> > #8  0x00007f6129e06ab0 in fuse_thread_proc (data=0x1a895c0) at fuse-bridge.c:4905
> > #9  0x00007f612b311414 in start_thread (arg=0x7f61220d0700) at pthread_create.c:333
> > #10 0x00007f612abfdb9f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:105
> >
> > (gdb) f 5
> > #5  0x00007f612bbae9ea in inode_forget (inode=0x7f611801d68c, nlookup=4) at inode.c:973
> > 973     inode.c: No such file or directory.
> > (gdb) f 4
> > #4  0x00007f612bbade56 in __inode_forget (inode=0x7f611801d68c, nlookup=4) at inode.c:607
> > 607     in inode.c
> > (gdb) print inode->nlookup
> >
> > 3. Coredump backtrace 3
> >
> > #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:58
> > #1  0x00007f86b0b0f43a in __GI_abort () at abort.c:89
> > #2  0x00007f86b0b06ccd in __assert_fail_base (fmt=0x7f86b0c3b618 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
> >     assertion=assertion@entry=0x7f86b12e1f38 "INTERNAL_SYSCALL_ERRNO (e, __err) != ESRCH || !robust",
> >     file=file@entry=0x7f86b12e1e7c "../nptl/pthread_mutex_lock.c", line=line@entry=352,
> >     function=function@entry=0x7f86b12e1fe0 <__PRETTY_FUNCTION__.8666> "__pthread_mutex_lock_full") at assert.c:92
> > #3  0x00007f86b0b06d82 in __GI___assert_fail (assertion=assertion@entry=0x7f86b12e1f38 "INTERNAL_SYSCALL_ERRNO (e, __err) != ESRCH || !robust",
> >     file=file@entry=0x7f86b12e1e7c "../nptl/pthread_mutex_lock.c", line=line@entry=352,
> >     function=function@entry=0x7f86b12e1fe0 <__PRETTY_FUNCTION__.8666> "__pthread_mutex_lock_full") at assert.c:101
> > #4  0x00007f86b12d89da in __pthread_mutex_lock_full (mutex=0x7f86a12ffcac) at ../nptl/pthread_mutex_lock.c:352
> > #5  0x00007f86b1b729f1 in inode_ref (inode=0x7f86a03cefec) at inode.c:476
> > #6  0x00007f86afdafb04 in fuse_ino_to_inode (ino=140216190693356, fuse=0x1a541f0) at fuse-helpers.c:390
> > #7  0x00007f86afdb4d6b in do_forget (this=0x1a541f0, unique=96369, nodeid=140216190693356, nlookup=1) at fuse-bridge.c:629
> > #8  0x00007f86afdb4f84 in fuse_batch_forget (this=0x1a541f0, finh=0x7f86a03b4f90, msg=0x7f86a03b4fb8) at fuse-bridge.c:674
> > #9  0x00007f86afdcbab0 in fuse_thread_proc (data=0x1a541f0) at fuse-bridge.c:4905
> > #10 0x00007f86b12d6414 in start_thread (arg=0x7f86a7ac8700) at pthread_create.c:333
> > #11 0x00007f86b0bc2b9f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:105
> >
> > _______________________________________________
> > Gluster-devel mailing list
> > Gluster-devel@xxxxxxxxxxx
> > http://www.gluster.org/mailman/listinfo/gluster-devel
>
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-devel