On 04/15/2013 08:54 AM, 符永涛 wrote:
> Dear Brian and xfs experts,
> Brian, your script works and I am able to reproduce the problem with
> glusterfs rebalance on our test cluster. The xfs filesystems on 2 of
> our servers shut down during the glusterfs rebalance, and the userspace
> stack traces at shutdown are both related to pthread. See the logs
> below. What's your opinion? Thank you very much!
> logs:

Thanks for the data. Can you also create a metadump for the
filesystem(s) associated with this output? (A sketch of the commands I
have in mind is at the bottom of this mail.)

Brian

> [root@10.23.72.93 ~]# cat xfs.log
>
> --- xfs_imap --
> module("xfs").function("xfs_imap@fs/xfs/xfs_ialloc.c:1257").return
> -- return=0x16
> vars: mp=0xffff882017a50800 tp=0xffff881c81797c70 ino=0xffffffff
> imap=0xffff88100e2f7c08 flags=0x0 agbno=? agino=? agno=? blks_per_cluster=?
> chunk_agbno=? cluster_agbno=? error=? offset=? offset_agbno=? __func__=[...]
> mp: m_agno_log = 0x5, m_agino_log = 0x20
> mp->m_sb: sb_agcount = 0x1c, sb_agblocks = 0xffffff0, sb_inopblog = 0x4,
> sb_agblklog = 0x1c, sb_dblocks = 0x1b4900000
> imap: im_blkno = 0x0, im_len = 0xa078, im_boffset = 0x86ea
> kernel backtrace:
> Returning from: 0xffffffffa02b3ab0 : xfs_imap+0x0/0x280 [xfs]
> Returning to : 0xffffffffa02b9599 : xfs_inotobp+0x49/0xc0 [xfs]
> 0xffffffffa02b96f1 : xfs_iunlink_remove+0xe1/0x320 [xfs]
> 0xffffffff81501a69
> 0x0 (inexact)
> user backtrace:
> 0x3bd1a0e5ad [/lib64/libpthread-2.12.so+0xe5ad/0x219000]
>
> --- xfs_iunlink_remove --
> module("xfs").function("xfs_iunlink_remove@fs/xfs/xfs_inode.c:1680").return
> -- return=0x16
> vars: tp=0xffff881c81797c70 ip=0xffff881003c13c00 next_ino=? mp=? agi=?
> dip=? agibp=0xffff880109b47e20 ibp=? agno=? agino=? next_agino=? last_ibp=?
> last_dip=0xffff882000000000 bucket_index=? offset=?
> last_offset=0xffffffffffff8810 error=? __func__=[...]
> ip: i_ino = 0x113, i_flags = 0x0
> ip->i_d: di_nlink = 0x0, di_gen = 0x0
> [root@10.23.72.93 ~]#
> [root@10.23.72.94 ~]# cat xfs.log
>
> --- xfs_imap --
> module("xfs").function("xfs_imap@fs/xfs/xfs_ialloc.c:1257").return
> -- return=0x16
> vars: mp=0xffff881017c6c800 tp=0xffff8801037acea0 ino=0xffffffff
> imap=0xffff882017101c08 flags=0x0 agbno=? agino=? agno=? blks_per_cluster=?
> chunk_agbno=? cluster_agbno=? error=? offset=? offset_agbno=? __func__=[...]
> mp: m_agno_log = 0x5, m_agino_log = 0x20
> mp->m_sb: sb_agcount = 0x1c, sb_agblocks = 0xffffff0, sb_inopblog = 0x4,
> sb_agblklog = 0x1c, sb_dblocks = 0x1b4900000
> imap: im_blkno = 0x0, im_len = 0xd98, im_boffset = 0x547
> kernel backtrace:
> Returning from: 0xffffffffa02b3ab0 : xfs_imap+0x0/0x280 [xfs]
> Returning to : 0xffffffffa02b9599 : xfs_inotobp+0x49/0xc0 [xfs]
> 0xffffffffa02b96f1 : xfs_iunlink_remove+0xe1/0x320 [xfs]
> 0xffffffff81501a69
> 0x0 (inexact)
> user backtrace:
> 0x30cd40e5ad [/lib64/libpthread-2.12.so+0xe5ad/0x219000]
>
> --- xfs_iunlink_remove --
> module("xfs").function("xfs_iunlink_remove@fs/xfs/xfs_inode.c:1680").return
> -- return=0x16
> vars: tp=0xffff8801037acea0 ip=0xffff880e697c8800 next_ino=? mp=? agi=?
> dip=? agibp=0xffff880d846c2d60 ibp=? agno=? agino=? next_agino=? last_ibp=?
> last_dip=0xffff881017c6c800 bucket_index=? offset=?
> last_offset=0xffffffffffff880e error=? __func__=[...]
> ip: i_ino = 0x142, i_flags = 0x0
> ip->i_d: di_nlink = 0x0, di_gen = 0x3565732e
>
>
>
> 2013/4/15 符永涛 <yongtaofu@xxxxxxxxx>
>
>> Also, glusterfs uses a lot of hard links for self-heal:
>> ---------T 2 root root 0 Apr 15 11:58 /mnt/xfsd/testbug/998416323
>> ---------T 2 root root 0 Apr 15 11:58 /mnt/xfsd/testbug/999296624
>> ---------T 2 root root 0 Apr 15 12:24 /mnt/xfsd/testbug/999568484
>> ---------T 2 root root 0 Apr 15 11:58 /mnt/xfsd/testbug/999956875
>> ---------T 2 root root 0 Apr 15 11:58
>> /mnt/xfsd/testbug/.glusterfs/05/2f/052f4e3e-c379-4a3c-b995-a10fdaca33d0
>> ---------T 2 root root 0 Apr 15 11:58
>> /mnt/xfsd/testbug/.glusterfs/05/95/0595272e-ce2b-45d5-8693-d02c00b94d9d
>> ---------T 2 root root 0 Apr 15 11:58
>> /mnt/xfsd/testbug/.glusterfs/05/ca/05ca00a0-92a7-44cf-b6e3-380496aafaa4
>> ---------T 2 root root 0 Apr 15 12:24
>> /mnt/xfsd/testbug/.glusterfs/0a/23/0a238ca7-3cef-4540-9c98-6bf631551b21
>> ---------T 2 root root 0 Apr 15 11:58
>> /mnt/xfsd/testbug/.glusterfs/0a/4b/0a4b640b-f675-4708-bb59-e2369ffbbb9d
>> Is that related?
>>
>>
>> 2013/4/15 符永涛 <yongtaofu@xxxxxxxxx>
>>
>>> Dear xfs experts,
>>> Now I'm deploying Brian's systemtap script in our cluster. But from
>>> last night till now, the xfs filesystem on 5 of our 24 servers shut
>>> down with the same error. I ran the xfs_repair command and found that
>>> all the lost inodes are glusterfs dht link files. This explains why
>>> the xfs shutdowns tend to happen during glusterfs rebalance: during
>>> the rebalance procedure a lot of dht link files may be unlinked. For
>>> example, the following inodes were found in lost+found on one of the
>>> servers:
>>> [root@* lost+found]# pwd
>>> /mnt/xfsd/lost+found
>>> [root@* lost+found]# ls -l
>>> total 740
>>> ---------T 1 root root 0 Apr 8 21:06 100119
>>> ---------T 1 root root 0 Apr 8 21:11 101123
>>> ---------T 1 root root 0 Apr 8 21:19 102659
>>> ---------T 1 root root 0 Apr 12 14:46 1040919
>>> ---------T 1 root root 0 Apr 12 14:58 1041943
>>> ---------T 1 root root 0 Apr 8 21:32 105219
>>> ---------T 1 root root 0 Apr 8 21:37 105731
>>> ---------T 1 root root 0 Apr 12 17:48 1068055
>>> ---------T 1 root root 0 Apr 12 18:38 1073943
>>> ---------T 1 root root 0 Apr 8 21:54 108035
>>> ---------T 1 root root 0 Apr 12 21:49 1091095
>>> ---------T 1 root root 0 Apr 13 00:17 1111063
>>> ---------T 1 root root 0 Apr 13 03:51 1121815
>>> ---------T 1 root root 0 Apr 8 22:25 112387
>>> ---------T 1 root root 0 Apr 13 06:39 1136151
>>> ...
>>> [root@* lost+found]# getfattr -m . -d -e hex *
>>>
>>> # file: 96007
>>> trusted.afr.mams-cq-mt-video-client-3=0x000000000000000000000000
>>> trusted.afr.mams-cq-mt-video-client-4=0x000000000000000000000000
>>> trusted.afr.mams-cq-mt-video-client-5=0x000000000000000000000000
>>> trusted.gfid=0xa0370d8a9f104dafbebbd0e6dd7ce1f7
>>> trusted.glusterfs.dht.linkto=0x6d616d732d63712d6d742d766964656f2d7265706c69636174652d3600
>>> trusted.glusterfs.quota.ca34e1ce-f046-4ed4-bbd1-261b21bfe0b8.contri=0x0000000049dff000
>>>
>>> # file: 97027
>>> trusted.afr.mams-cq-mt-video-client-3=0x000000000000000000000000
>>> trusted.afr.mams-cq-mt-video-client-4=0x000000000000000000000000
>>> trusted.afr.mams-cq-mt-video-client-5=0x000000000000000000000000
>>> trusted.gfid=0xc1c1fe2ec7034442a623385f43b04c25
>>> trusted.glusterfs.dht.linkto=0x6d616d732d63712d6d742d766964656f2d7265706c69636174652d3600
>>> trusted.glusterfs.quota.ca34e1ce-f046-4ed4-bbd1-261b21bfe0b8.contri=0x000000006ac78000
>>>
>>> # file: 97559
>>> trusted.afr.mams-cq-mt-video-client-3=0x000000000000000000000000
>>> trusted.afr.mams-cq-mt-video-client-4=0x000000000000000000000000
>>> trusted.afr.mams-cq-mt-video-client-5=0x000000000000000000000000
>>> trusted.gfid=0xcf7c17013c914511bda4d1c743fae118
>>> trusted.glusterfs.dht.linkto=0x6d616d732d63712d6d742d766964656f2d7265706c69636174652d3500
>>> trusted.glusterfs.quota.ca34e1ce-f046-4ed4-bbd1-261b21bfe0b8.contri=0x00000000519fb000
>>>
>>> # file: 98055
>>> trusted.afr.mams-cq-mt-video-client-3=0x000000000000000000000000
>>> trusted.afr.mams-cq-mt-video-client-4=0x000000000000000000000000
>>> trusted.afr.mams-cq-mt-video-client-5=0x000000000000000000000000
>>> trusted.gfid=0xe86abc6e2c4b44c28d415fbbe34f2102
>>> trusted.glusterfs.dht.linkto=0x6d616d732d63712d6d742d766964656f2d7265706c69636174652d3600
>>> trusted.glusterfs.quota.ca34e1ce-f046-4ed4-bbd1-261b21bfe0b8.contri=0x000000004c098000
>>>
>>> # file: 98567
>>> trusted.afr.mams-cq-mt-video-client-3=0x000000000000000000000000
>>> trusted.afr.mams-cq-mt-video-client-4=0x000000000000000000000000
>>> trusted.afr.mams-cq-mt-video-client-5=0x000000000000000000000000
>>> trusted.gfid=0x12543a2efbdf4b9fa61c6d89ca396f80
>>> trusted.glusterfs.dht.linkto=0x6d616d732d63712d6d742d766964656f2d7265706c69636174652d3500
>>> trusted.glusterfs.quota.ca34e1ce-f046-4ed4-bbd1-261b21bfe0b8.contri=0x000000006bc98000
>>>
>>> # file: 98583
>>> trusted.afr.mams-cq-mt-video-client-3=0x000000000000000000000000
>>> trusted.afr.mams-cq-mt-video-client-4=0x000000000000000000000000
>>> trusted.afr.mams-cq-mt-video-client-5=0x000000000000000000000000
>>> trusted.gfid=0x760d16d3b7974cfb9c0a665a0982c470
>>> trusted.glusterfs.dht.linkto=0x6d616d732d63712d6d742d766964656f2d7265706c69636174652d3500
>>> trusted.glusterfs.quota.ca34e1ce-f046-4ed4-bbd1-261b21bfe0b8.contri=0x000000006cde9000
>>>
>>> # file: 99607
>>> trusted.afr.mams-cq-mt-video-client-3=0x000000000000000000000000
>>> trusted.afr.mams-cq-mt-video-client-4=0x000000000000000000000000
>>> trusted.afr.mams-cq-mt-video-client-5=0x000000000000000000000000
>>> trusted.gfid=0x0849a732ea204bc3b8bae830b46881da
>>> trusted.glusterfs.dht.linkto=0x6d616d732d63712d6d742d766964656f2d7265706c69636174652d3500
>>> trusted.glusterfs.quota.ca34e1ce-f046-4ed4-bbd1-261b21bfe0b8.contri=0x00000000513f1000
>>> ...
>>>
>>> What do you think about it? Thank you very much.
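
(A side note on reading those xattrs: the trusted.glusterfs.dht.linkto
values are just hex-encoded, NUL-terminated strings naming the subvolume
each link file points at. A quick sketch for decoding one, assuming xxd
is available, using the value from file 96007 above:

  $ echo 6d616d732d63712d6d742d766964656f2d7265706c69636174652d3600 | xxd -r -p
  mams-cq-mt-video-replicate-6

Nothing to do with the shutdown mechanism itself, but it confirms these
really are plain dht link files touched by the rebalance.)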
>>>
>>>
>>> 2013/4/12 符永涛 <yongtaofu@xxxxxxxxx>
>>>
>>>> Hi Brian,
>>>>
>>>> Your script works for me now, after I installed all the rpms built
>>>> from the kernel srpm. I'll try it. Thank you.
>>>>
>>>>
>>>> 2013/4/12 Brian Foster <bfoster@xxxxxxxxxx>
>>>>
>>>>> On 04/12/2013 04:32 AM, 符永涛 wrote:
>>>>>> Dear xfs experts,
>>>>>> Can I just call xfs_stack_trace(); in the second line of
>>>>>> xfs_do_force_shutdown() to print the stack, and rebuild the kernel
>>>>>> to check what the error is?
>>>>>>
>>>>>
>>>>> I suppose that's a start. If you're willing/able to create and run a
>>>>> modified kernel for the purpose of collecting more debug info, perhaps
>>>>> we can get a bit more creative in collecting more data on the problem
>>>>> (but a stack trace there is a good start).
>>>>>
>>>>> BTW- you might want to place the call after the XFS_FORCED_SHUTDOWN(mp)
>>>>> check almost halfway into the function to avoid duplicate messages.
>>>>>
>>>>> Brian
>>>>>
>>>>>>
>>>>>> 2013/4/12 符永涛 <yongtaofu@xxxxxxxxx <mailto:yongtaofu@xxxxxxxxx>>
>>>>>>
>>>>>> Hi Brian,
>>>>>> What else am I missing? Thank you.
>>>>>> stap -e 'probe module("xfs").function("xfs_iunlink"){}'
>>>>>>
>>>>>> WARNING: cannot find module xfs debuginfo: No DWARF information found
>>>>>> semantic error: no match while resolving probe point
>>>>>> module("xfs").function("xfs_iunlink")
>>>>>> Pass 2: analysis failed. Try again with another '--vp 01' option.
>>>>>>
>>>>>>
>>>>>> 2013/4/12 符永涛 <yongtaofu@xxxxxxxxx <mailto:yongtaofu@xxxxxxxxx>>
>>>>>>
>>>>>> ls -l /usr/lib/debug/lib/modules/2.6.32-279.el6.x86_64/kernel/fs/xfs/xfs.ko.debug
>>>>>> -r--r--r-- 1 root root 21393024 Apr 12 12:08
>>>>>> /usr/lib/debug/lib/modules/2.6.32-279.el6.x86_64/kernel/fs/xfs/xfs.ko.debug
>>>>>>
>>>>>> rpm -qa|grep kernel
>>>>>> kernel-headers-2.6.32-279.el6.x86_64
>>>>>> kernel-devel-2.6.32-279.el6.x86_64
>>>>>> kernel-2.6.32-358.el6.x86_64
>>>>>> kernel-debuginfo-common-x86_64-2.6.32-279.el6.x86_64
>>>>>> abrt-addon-kerneloops-2.0.8-6.el6.x86_64
>>>>>> kernel-firmware-2.6.32-358.el6.noarch
>>>>>> kernel-debug-2.6.32-358.el6.x86_64
>>>>>> kernel-debuginfo-2.6.32-279.el6.x86_64
>>>>>> dracut-kernel-004-283.el6.noarch
>>>>>> libreport-plugin-kerneloops-2.0.9-5.el6.x86_64
>>>>>> kernel-devel-2.6.32-358.el6.x86_64
>>>>>> kernel-2.6.32-279.el6.x86_64
>>>>>>
>>>>>> rpm -q kernel-debuginfo
>>>>>> kernel-debuginfo-2.6.32-279.el6.x86_64
>>>>>>
>>>>>> rpm -q kernel
>>>>>> kernel-2.6.32-279.el6.x86_64
>>>>>> kernel-2.6.32-358.el6.x86_64
>>>>>>
>>>>>> do I need to re-probe it?
>>>>>>
>>>>>>
>>>>>> 2013/4/12 Eric Sandeen <sandeen@xxxxxxxxxxx <mailto:sandeen@xxxxxxxxxxx>>
>>>>>>
>>>>>> On 4/11/13 11:32 PM, 符永涛 wrote:
>>>>>> > Hi Brian,
>>>>>> > Sorry, but when I execute the script it says:
>>>>>> > WARNING: cannot find module xfs debuginfo: No DWARF information found
>>>>>> > semantic error: no match while resolving probe point
>>>>>> > module("xfs").function("xfs_iunlink")
>>>>>> >
>>>>>> > uname -a
>>>>>> > 2.6.32-279.el6.x86_64
>>>>>> > kernel debuginfo has been installed.
>>>>>> >
>>>>>> > Where can I find the correct xfs debuginfo?
>>>>>>
>>>>>> it should be in the kernel-debuginfo rpm (of the same
>>>>>> version/release as the kernel rpm you're running)
>>>>>>
>>>>>> You should have:
>>>>>>
>>>>>> /usr/lib/debug/lib/modules/2.6.32-279.el6.x86_64/kernel/fs/xfs/xfs.ko.debug
>>>>>>
>>>>>> If not, can you show:
>>>>>>
>>>>>> # uname -a
>>>>>> # rpm -q kernel
>>>>>> # rpm -q kernel-debuginfo
>>>>>>
>>>>>> -Eric
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> 符永涛
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> 符永涛
>>>>>>
>>>>>>
>>>>>> --
>>>>>> 符永涛
>>>>>>
>>>>>> _______________________________________________
>>>>>> xfs mailing list
>>>>>> xfs@xxxxxxxxxxx
>>>>>> http://oss.sgi.com/mailman/listinfo/xfs
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> 符永涛
>>>
>>>
>>>
>>> --
>>> 符永涛
>>
>>
>>
>> --
>> 符永涛
>
>
> _______________________________________________
> xfs mailing list
> xfs@xxxxxxxxxxx
> http://oss.sgi.com/mailman/listinfo/xfs
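
As mentioned at the top, here is roughly what I have in mind for the
metadump. This is just a sketch; the device and output paths below are
placeholders for whatever actually backs the affected brick:

  # run against the unmounted (or frozen/read-only) device;
  # xfs_metadump copies only metadata and obfuscates most names by
  # default, and -g just prints progress
  umount /mnt/xfsd
  xfs_metadump -g /dev/sdX /tmp/xfsd.metadump
  # metadumps compress well, so please compress before uploading
  bzip2 /tmp/xfsd.metadump

On our end, xfs_mdrestore can turn that image back into a filesystem,
which gives us the complete metadata picture without any of your file
contents.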