On 4/15/13 9:21 AM, 符永涛 wrote:
> Hi Eric,
> I'm sorry for spamming, but I've got some more info and hope you're
> interested.

We are interested; TBH, Brian and I are spending more time on this one
because we have a mutual interest in fixing it for someone who helps
pay our salaries.

We really appreciate your willingness to test & debug, since we've been
unable to reproduce this locally so far; as long as you're willing to
try new things, we're willing to keep suggesting them.  :)

I'm going to take some time to try to digest the new information, and
Brian or I will let you know if we have more things to try.  I've also
appended a couple of rough sketches after your notes below, in case
they're useful in the meantime.

Thanks,
-Eric

> In glusterfs 3.3, glusterfsd/src/glusterfsd.c has an unlink operation
> at line 1332:
>
>         if (ctx->cmd_args.pid_file) {
>                 unlink (ctx->cmd_args.pid_file);
>                 ctx->cmd_args.pid_file = NULL;
>         }
>
> Glusterfs tries to unlink the rebalance pid file after the rebalance
> completes, and maybe this is where the issue happens.  See the logs
> below:
>
> 1.
> /var/log/secure indicates I started the rebalance at Apr 15 11:58:11:
>
> Apr 15 11:58:11 10 sudo: root : TTY=pts/2 ; PWD=/root ; USER=root ;
>     COMMAND=/usr/sbin/gluster volume rebalance testbug start
>
> 2.
> After the xfs shutdown I got the following trace:
>
> --- xfs_iunlink_remove --
>     module("xfs").function("xfs_iunlink_remove@fs/xfs/xfs_inode.c:1680").return
>     -- return=0x16
> vars: tp=0xffff881c81797c70 ip=0xffff881003c13c00 next_ino=? mp=?
>     agi=? dip=? agibp=0xffff880109b47e20 ibp=? agno=? agino=?
>     next_agino=? last_ibp=? last_dip=0xffff882000000000
>     bucket_index=? offset=? last_offset=0xffffffffffff8810 error=?
>     __func__=[...]
> ip: i_ino = 0x113, i_flags = 0x0
>
> The inode that led to the xfs shutdown is 0x113.
>
> 3.
> I repaired the filesystem, and in lost+found I found that inode
> (0x113 is 275 decimal):
>
> [root@10.23.72.93 lost+found]# pwd
> /mnt/xfsd/lost+found
> [root@10.23.72.93 lost+found]# ls -l 275
> ---------T 1 root root 0 Apr 15 11:58 275
> [root@10.23.72.93 lost+found]# stat 275
>   File: `275'
>   Size: 0          Blocks: 0          IO Block: 4096   regular empty file
> Device: 810h/2064d Inode: 275         Links: 1
> Access: (1000/---------T)  Uid: (    0/    root)   Gid: (    0/    root)
> Access: 2013-04-15 11:58:25.833443445 +0800
> Modify: 2013-04-15 11:58:25.912461256 +0800
> Change: 2013-04-15 11:58:25.915442091 +0800
>
> This file was created around 2013-04-15 11:58.  The other files in
> lost+found have extended attributes, but this one doesn't, which means
> it is not one of the glusterfs backend files.  It should be the
> rebalance pid file.
>
> So maybe unlinking the rebalance pid file is what leads to the xfs
> shutdown.
>
> Thank you.
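A few notes on the above.  First, return=0x16 from the
xfs_iunlink_remove probe is decimal 22, i.e. EINVAL.  Second, since the
theory is that unlinking the rebalance pid file is what trips the
shutdown, something like the sketch below might let us hammer on that
path in isolation.  To be clear, this is only my guess at the sequence
glusterfsd goes through; the path and iteration count are made up, not
gluster's actual values.  It unlinks while the fd is still open, so the
inode sits on the AGI unlinked list (the list xfs_iunlink_remove()
walks) until the close:

    /*
     * Hedged reproducer sketch: churn a pid-file-like file on the
     * affected xfs mount.  Path and iteration count are invented.
     */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <fcntl.h>

    int main(int argc, char **argv)
    {
            const char *path = argc > 1 ? argv[1]
                                        : "/mnt/xfsd/fake-rebalance.pid";
            char buf[32];
            int i, fd;

            for (i = 0; i < 100000; i++) {
                    fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0644);
                    if (fd < 0) {
                            perror("open");
                            return 1;
                    }
                    snprintf(buf, sizeof(buf), "%d\n", getpid());
                    if (write(fd, buf, strlen(buf)) < 0)
                            perror("write");
                    /* unlink before close: the inode stays on the AGI
                     * unlinked list until the fd is released */
                    if (unlink(path) < 0) {
                            perror("unlink");
                            close(fd);
                            return 1;
                    }
                    close(fd);
            }
            return 0;
    }

Given that we haven't been able to reproduce this locally, I suspect
it's not quite that simple, but it's cheap to try on the affected brick.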
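Also, a way to double-check the "no xattrs, so not a gluster backend
file" reasoning: the sketch below just dumps xattr names.  My
recollection is that gluster stamps its backend files with trusted.*
attributes (e.g. trusted.gfid), but treat that as an assumption and
verify against a known-good brick.  Run it as root, since trusted.*
names are hidden from unprivileged users:

    /*
     * Sketch: print the xattr names on each file given.  A gluster
     * backend file should show trusted.* names (trusted.gfid is my
     * recollection -- please verify); a stray pid file should show
     * none.  Run as root so the trusted namespace is visible.
     */
    #include <stdio.h>
    #include <string.h>
    #include <sys/xattr.h>

    int main(int argc, char **argv)
    {
            char list[65536];
            ssize_t len, off;
            int i;

            for (i = 1; i < argc; i++) {
                    len = llistxattr(argv[i], list, sizeof(list));
                    if (len < 0) {
                            perror(argv[i]);
                            continue;
                    }
                    if (len == 0) {
                            printf("%s: (no xattrs)\n", argv[i]);
                            continue;
                    }
                    /* buffer holds a series of NUL-terminated names */
                    for (off = 0; off < len;
                         off += strlen(list + off) + 1)
                            printf("%s: %s\n", argv[i], list + off);
            }
            return 0;
    }

(getfattr -d -m - <file>, from the attr package, will tell you the same
thing; the sketch just avoids the dependency.)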
> 2013/4/15 Eric Sandeen <sandeen@xxxxxxxxxxx>:
>
> > On 4/15/13 8:45 AM, 符永涛 wrote:
> > > And at the same time we got the following error log of glusterfs:
> > > [2013-04-15 20:43:03.851163] I [dht-rebalance.c:1611:gf_defrag_status_get] 0-glusterfs: Rebalance is completed
> > > [2013-04-15 20:43:03.851248] I [dht-rebalance.c:1614:gf_defrag_status_get] 0-glusterfs: Files migrated: 1629, size: 1582329065954, lookups: 11036, failures: 561
> > > [2013-04-15 20:43:03.887634] W [glusterfsd.c:831:cleanup_and_exit] (-->/lib64/libc.so.6(clone+0x6d) [0x3bd16e767d] (-->/lib64/libpthread.so.0() [0x3bd1a07851] (-->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xdd) [0x405c9d]))) 0-: received signum (15), shutting down
> > > [2013-04-15 20:43:03.887878] E [rpcsvc.c:1155:rpcsvc_program_unregister_portmap] 0-rpc-service: Could not unregister with portmap
> >
> > We'll take a look, thanks.
> >
> > Going forward, could I ask that you take a few minutes to batch up
> > the information, rather than sending several emails in a row?  It
> > makes it much harder to collect the information when it's spread
> > across so many emails.
> >
> > Thanks,
> > -Eric
>
> --
> 符永涛

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs