Re: "du" a large count files in a directory casue mounted glusterfs filesystem coredump

The life cycle of an inode in glusterfs is described at https://staged-gluster-docs.readthedocs.io/en/release3.7.0beta1/Developer-guide/datastructure-inode/
It states that “an inode is removed from the inode table and eventually destroyed when an unlink or rmdir operation is performed on a file/directory, or when the lru limit of the inode table has been exceeded.”
The default inode lru limit in glusterfs is 32k, so when we run “du” or “ls -R” over a directory holding more than 32k files, it can easily exceed the lru limit.
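For reference, the pruning path looks roughly like this (a simplified sketch of inode_table_prune() from libglusterfs/src/inode.c as I read it; field names approximate, error handling omitted):

/* Simplified sketch of lru-limit pruning, modeled on inode_table_prune()
 * in libglusterfs/src/inode.c; details are approximate. */
static void
sketch_inode_table_prune (inode_table_t *table)
{
        inode_t *inode = NULL;

        pthread_mutex_lock (&table->lock);
        {
                /* While the lru list holds more unreferenced inodes than
                 * lru_limit allows, move the oldest onto the purge list. */
                while (table->lru_limit && table->lru_size > table->lru_limit) {
                        inode = list_entry (table->lru.next, inode_t, list);
                        list_del (&inode->list);
                        list_add_tail (&inode->list, &table->purge);
                        table->lru_size--;
                        table->purge_size++;
                }
        }
        pthread_mutex_unlock (&table->lock);

        /* The purge list is then retired and the inodes destroyed -- and as
         * far as I can see, the kernel is never told these nodeids are gone. */
}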
 
@gluster-expert, when glusterfs destroys an inode because the lru limit has been exceeded, does glusterfs notify the kernel? (From my reading so far, it seems not.)
 
@linux-fsdevel-expert, could you please clarify the kernel's inode recycling mechanism, or the fuse-forget FOP for inodes, for us?
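For context, my current understanding of the kernel side is that FORGET and BATCH_FORGET carry only a nodeid and a lookup count, are sent asynchronously by the kernel, and expect no reply, so the filesystem daemon cannot refuse or delay them. The request bodies, copied here for reference from the FUSE kernel ABI header (include/uapi/linux/fuse.h):

#include <stdint.h>

/* FUSE_FORGET: the nodeid travels in the request header; only the
 * lookup count to drop is carried in the body. */
struct fuse_forget_in {
        uint64_t nlookup;
};

/* FUSE_BATCH_FORGET: the header below is followed by 'count'
 * fuse_forget_one entries. */
struct fuse_forget_one {
        uint64_t nodeid;   /* the nodeid the daemon returned at lookup time */
        uint64_t nlookup;
};

struct fuse_batch_forget_in {
        uint32_t count;
        uint32_t dummy;
};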
 
Is it possible that the kernel frees the inode (which triggers a fuse-forget to glusterfs) later than glusterfs destroys it due to the lru limit?
 
If that is possible, then the nodeid (which is converted from the inode's memory address in glusterfs) may be stale. When it is passed back to glusterfs in userspace, glusterfs simply converts the u64 nodeid to a memory address and tries to access that address, which leads to an invalid access and finally a coredump!
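To make that concrete: as far as I can tell, the mapping in both directions is a plain cast (a sketch paraphrasing inode_to_fuse_nodeid() and fuse_ino_to_inode() from the fuse xlator, fuse-bridge.h / fuse-helpers.c; simplified, error paths omitted):

/* Paraphrase of the nodeid <-> inode mapping in the fuse xlator.
 * The nodeid handed to the kernel is literally the inode's address
 * inside the glusterfs process. */
static uint64_t
sketch_inode_to_fuse_nodeid (inode_t *inode)
{
        return (unsigned long) inode;
}

static inode_t *
sketch_fuse_ino_to_inode (uint64_t ino, xlator_t *fuse)
{
        inode_t *inode = NULL;

        if (ino == 1) {
                /* nodeid 1 is always the mount root */
                inode = fuse_active_subvol (fuse)->itable->root;
        } else {
                /* no table lookup, no validity check: just a cast */
                inode = (inode_t *) (unsigned long) ino;
        }

        inode_ref (inode);  /* invalid access here if the inode was destroyed */
        return inode;
}

If the inode behind the nodeid was already destroyed by lru pruning, the inode_ref() above dereferences freed memory, which matches coredump backtrace 3 below.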
 
Thanks & Best Regards,
George
 
_____________________________________________
From: Lian, George (Nokia - CN/Hangzhou)
Sent: Friday, December 09, 2016 9:49 AM
To: 'Gluster-devel@xxxxxxxxxxx' <Gluster-devel@xxxxxxxxxxx>
Cc: Zhou, Cynthia (Nokia - CN/Hangzhou) <cynthia.zhou@xxxxxxxxx>; Bao, Xiaohui (Nokia - CN/Hangzhou) <xiaohui.bao@xxxxxxxxx>; Zhang, Bingxuan (Nokia - CN/Hangzhou) <bingxuan.zhang@xxxxxxxxx>; Li, Deqian (Nokia - CN/Hangzhou) <deqian.li@xxxxxxxxx>
Subject: "du" a large count files in a directory casue mounted glusterfs filesystem coredump
 
 
Hi, GlusterFS Expert,
 
We have hit an issue when running the “du” command against a directory containing a large count of files/directories; in our environment there are more than 150k files in the directory.
# df -i .
Filesystem         Inodes  IUsed  IFree IUse% Mounted on
169.254.0.23:/home 261888 154146 107742   59% /home
 
When we run the “du” command in this directory, it very easily causes a glusterfs process coredump, and the backtraces show the crash is always caused by the do_forget API, though the last call differs between cases. Please see the detailed backtraces at the end of this mail.
From my investigation, the issue may be caused by an unsafe call to the API “fuse_ino_to_inode”.
My guess is that in some unexpected case, when “fuse_ino_to_inode” is called with a nodeid that came from the forget FOP, it obtains the inode address by simply mapping the uint64 nodeid to a memory address,
but in our large-file case that inode may just have been destroyed because “the lru limit of the inode table has been exceeded”. So this operation may not be safe, and the coredump backtraces also show several different cases when the core occurred.
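The call path matches this reading (a rough paraphrase of do_forget() in fuse-bridge.c, corresponding to the backtraces below; simplified):

/* Rough paraphrase of do_forget() in fuse-bridge.c. Every step uses the
 * pointer recovered from the kernel-supplied nodeid, so a stale nodeid
 * can fault inside fuse_ino_to_inode (backtrace 3), trip the
 * "inode->nlookup >= nlookup" assert (backtrace 2), or corrupt the
 * lru/purge lists during pruning (backtrace 1). */
static void
sketch_do_forget (xlator_t *this, uint64_t nodeid, uint64_t nlookup)
{
        inode_t *fuse_inode = fuse_ino_to_inode (nodeid, this);

        inode_forget (fuse_inode, nlookup);  /* drops nlookup lookups, may prune */
        inode_unref (fuse_inode);
}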
 
Could you please share your comments on my investigation?
And BTW, I have some questions:
  1. How does the inode number shown by the “stat” command map to the inode in glusterfs?
# stat log
  File: ‘log’
  Size: 4096            Blocks: 8          IO Block: 4096   directory
Device: fd10h/64784d    Inode: 14861593    Links: 3
 
  2. When does the system call the forget FOP, and where does the nodeid parameter come from?
 
  3. When an inode is eventually destroyed due to the lru limit and the same file is hit by a FOP again, is the inode at the same address on the next lookup? If not, can there be a case where the forget FOP delivers an older nodeid than the one glusterfs currently holds?
 
Thanks & Best Regards,
George
 
  1. Coredump backtrace 1
#0  0x00007fcd610a69e7 in __list_splice (list=0x26c350c, head=0x7fcd56c23db0) at list.h:121
#1  0x00007fcd610a6a51 in list_splice_init (list=0x26c350c, head=0x7fcd56c23db0) at list.h:146
#2  0x00007fcd610a95c8 in inode_table_prune (table=0x26c347c) at inode.c:1330
#3  0x00007fcd610a8a02 in inode_forget (inode=0x7fcd5001147c, nlookup=1) at inode.c:977
#4  0x00007fcd5f151e24 in do_forget (this=0xc43590, unique=437787, nodeid=140519787271292, nlookup=1) at fuse-bridge.c:637
#5  0x00007fcd5f151fd3 in fuse_batch_forget (this=0xc43590, finh=0x7fcd50c266c0, msg=0x7fcd50c266e8) at fuse-bridge.c:676
#6  0x00007fcd5f168aff in fuse_thread_proc (data=...) at fuse-bridge.c:4909
#7  0x00007fcd6080b414 in start_thread (arg=0x7fcd56c24700) at pthread_create.c:333
#8  0x00007fcd600f7b9f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:105
(gdb) print head
$6 = (struct list_head *) 0x7fcd56c23db0
(gdb) print *head
$7 = {next = 0x7fcd56c23db0, prev = 0x7fcd56c23db0}
(gdb) print head->next
$8 = (struct list_head *) 0x7fcd56c23db0
(gdb) print list->prev
$9 = (struct list_head *) 0x5100000000
(gdb) print (list->prev)->next
Cannot access memory at address 0x5100000000
  2. Coredump backtrace 2
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:58
#1  0x00007f612ab4a43a in __GI_abort () at abort.c:89
#2  0x00007f612ab41ccd in __assert_fail_base (fmt=0x7f612ac76618 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
    assertion=assertion@entry=0x7f612bc01ec1 "inode->nlookup >= nlookup", file=file@entry=0x7f612bc01d9b "inode.c", line=line@entry=607,
    function=function@entry=0x7f612bc02339 <__PRETTY_FUNCTION__.10128> "__inode_forget") at assert.c:92
#3  0x00007f612ab41d82 in __GI___assert_fail (assertion=0x7f612bc01ec1 "inode->nlookup >= nlookup", file=0x7f612bc01d9b "inode.c", line=607,
    function=0x7f612bc02339 <__PRETTY_FUNCTION__.10128> "__inode_forget") at assert.c:101
#4  0x00007f612bbade56 in __inode_forget (inode=0x7f611801d68c, nlookup=4) at inode.c:607
#5  0x00007f612bbae9ea in inode_forget (inode=0x7f611801d68c, nlookup=4) at inode.c:973
#6  0x00007f6129defdd5 in do_forget (this=0x1a895c0, unique=436589, nodeid=140054991328908, nlookup=4) at fuse-bridge.c:633
#7  0x00007f6129defe94 in fuse_forget (this=0x1a895c0, finh=0x7f6118c28be0, msg=0x7f6118c28c08) at fuse-bridge.c:652
#8  0x00007f6129e06ab0 in fuse_thread_proc (data=...) at fuse-bridge.c:4905
#9  0x00007f612b311414 in start_thread (arg=0x7f61220d0700) at pthread_create.c:333
#10 0x00007f612abfdb9f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:105
(gdb) f 5
#5  0x00007f612bbae9ea in inode_forget (inode=0x7f611801d68c, nlookup=4) at inode.c:973
973     inode.c: No such file or directory.
(gdb) f 4
#4  0x00007f612bbade56 in __inode_forget (inode=0x7f611801d68c, nlookup=4) at inode.c:607
607     in inode.c
(gdb) print inode->nlookup
  3. Coredump backtrace 3
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:58
#1  0x00007f86b0b0f43a in __GI_abort () at abort.c:89
#2  0x00007f86b0b06ccd in __assert_fail_base (fmt=0x7f86b0c3b618 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
    assertion=assertion@entry=0x7f86b12e1f38 "INTERNAL_SYSCALL_ERRNO (e, __err) != ESRCH || !robust",
    file=file@entry=0x7f86b12e1e7c "../nptl/pthread_mutex_lock.c", line=line@entry=352,
    function=function@entry=0x7f86b12e1fe0 <__PRETTY_FUNCTION__.8666> "__pthread_mutex_lock_full") at assert.c:92
#3  0x00007f86b0b06d82 in __GI___assert_fail (assertion=assertion@entry=0x7f86b12e1f38 "INTERNAL_SYSCALL_ERRNO (e, __err) != ESRCH || !robust",
    file=file@entry=0x7f86b12e1e7c "../nptl/pthread_mutex_lock.c", line=line@entry=352,
    function=function@entry=0x7f86b12e1fe0 <__PRETTY_FUNCTION__.8666> "__pthread_mutex_lock_full") at assert.c:101
#4  0x00007f86b12d89da in __pthread_mutex_lock_full (mutex=0x7f86a12ffcac) at ../nptl/pthread_mutex_lock.c:352
#5  0x00007f86b1b729f1 in inode_ref (inode=0x7f86a03cefec) at inode.c:476
#6  0x00007f86afdafb04 in fuse_ino_to_inode (ino=140216190693356, fuse=0x1a541f0) at fuse-helpers.c:390
#7  0x00007f86afdb4d6b in do_forget (this=0x1a541f0, unique=96369, nodeid=140216190693356, nlookup=1) at fuse-bridge.c:629
#8  0x00007f86afdb4f84 in fuse_batch_forget (this=0x1a541f0, finh=0x7f86a03b4f90, msg=0x7f86a03b4fb8) at fuse-bridge.c:674
#9  0x00007f86afdcbab0 in fuse_thread_proc (data=...) at fuse-bridge.c:4905
#10 0x00007f86b12d6414 in start_thread (arg=0x7f86a7ac8700) at pthread_create.c:333
#11 0x00007f86b0bc2b9f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:105
 
 
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-devel
