On 04/20/2013 07:52 AM, 符永涛 wrote:
> Hi Brian,
> Here is the file:
> find . -inum 539274
> ./crashtest/.glusterfs/indices/xattrop
> [root@10.23.72.94 xfsd]# ls -lih ./crashtest/.glusterfs/indices/xattrop
> total 0
> 539275 ---------- 2 root root 0 Apr 20 17:17 132ef294-71d1-4435-8daa-aa002e67cb6e
> 539275 ---------- 2 root root 0 Apr 20 17:17 xattrop-f3ad589a-b8dc-4416-ab84-fc9ad4033540
> find . -inum 539275
> ./crashtest/.glusterfs/indices/xattrop/xattrop-f3ad589a-b8dc-4416-ab84-fc9ad4033540
> ./crashtest/.glusterfs/indices/xattrop/132ef294-71d1-4435-8daa-aa002e67cb6e
> I'm not sure if this is normal or if glusterfs has fallen into an infinite loop. Is there a chance that the kernel falls into a dead loop?
> I'll study it.
>

Very interesting, thank you. I don't have the full context here yet, but the short of it is that this particular indices/xattrop directory is managed by a backend translator and driven by replication. It appears to be doing some kind of transaction-level tracking based on links. E.g., some quick behavioral observations:

- This directory is created and an xattrop-#### file created (no size).
- On starting a large sequential write from a client, I start observing a continuous sequence of xfs_link/xfs_remove operations via tracing.
- The backend appears to create a link to the xattrop-#### file for every replication transaction (handwave), with the name of the link referring to the gfid name of the file under write (but note again the link is not to the inode under write, but to this special xattrop-#### file). It then apparently removes this link on transaction completion and the process repeats.

I suspect this has to do with identifying which files were under modification in the event of a crash before transaction completion, but the underlying pattern/workload is what's more important to us here for the time being...

So we're seeing a heavy link-related workload on a directory inode (0x83a8a) where the entries are all links to the same inode (0x83a8b). These inodes happen to be in close proximity, which may or may not be a factor. The translator that generates these link/unlink ops sits right above a generic thread pool translator, so this is multi-threaded.

What isn't clear to me yet is where the xfs_iunlink() for this heavily linked inode is induced. The primary dentry remains after my file copy test, but taking another look after a few minutes I see it removed. The same thing occurs if I gracefully restart the volume. I'm going to have to dig into that some more and also see if we can use this to narrow in on a reproducer. I'm thinking something along the lines of the following (rough sketch below):

- Create a directory/file. Ideally the directory and file inodes are in the same cluster.
- Start a highly-threaded link/unlink workload against that file in the same directory.
- Somewhere in the background, unlink the main file.
- Check for multiple xfs_iunlink() ops, repeat.

... the assumption being that the xfs_iunlink() race could have led to unlinked list corruption on the associated list, such that a later inactivation/xfs_iunlink_remove of some other inode in that bucket could fail.
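Purely as an illustration of that test idea (not a confirmed reproducer): a minimal sketch in C, where the file name, thread count, loop counts, and sleep are arbitrary placeholders, and the check for multiple xfs_iunlink()/xfs_iunlink_remove() ops would be done externally by watching the xfs tracepoints (as in the trace excerpt quoted further down) rather than by the program itself. Build with -pthread and run it from a directory on the XFS filesystem in question.

/* link-race.c: hammer one inode with link/unlink from many threads,
 * then drop its primary name while the storm is in flight. */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define NTHREADS 8
#define NLOOPS   100000

static const char *target = "testfile";        /* placeholder name */

static void *link_worker(void *arg)
{
        long id = (long)arg;
        char name[64];
        int i;

        /* repeatedly link/unlink the same target inode in this directory */
        for (i = 0; i < NLOOPS; i++) {
                snprintf(name, sizeof(name), "link-%ld-%d", id, i);
                if (link(target, name) == 0)
                        unlink(name);
        }
        return NULL;
}

int main(void)
{
        pthread_t tids[NTHREADS];
        long i;
        int fd;

        /* create the shared target file; keep it open so the inode stays
         * around (unlinked-but-open) and is only inactivated later */
        fd = open(target, O_CREAT | O_RDWR, 0644);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        for (i = 0; i < NTHREADS; i++)
                pthread_create(&tids[i], NULL, link_worker, (void *)i);

        /* somewhere in the background, unlink the primary name */
        usleep(100000);
        unlink(target);

        for (i = 0; i < NTHREADS; i++)
                pthread_join(tids[i], NULL);

        close(fd);        /* last reference drops -> inactivation */
        return 0;
}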
Brian

>
> 2013/4/20 Brian Foster <bfoster@xxxxxxxxxx>
>
> > On 04/20/2013 06:10 AM, 符永涛 wrote:
> > > Dear Eric,
> > > I have applied your latest patch and collected the following log:
> > >
> > > /var/log/message
> > > Apr 20 17:28:23 10 kernel: XFS (sdb): xfs_iunlink_remove: xfs_inotobp() returned error 22 for inode 0x1b20b ag 0 agino 1b20b
> > > Apr 20 17:28:23 10 kernel:
> > > Apr 20 17:28:23 10 kernel: XFS (sdb): xfs_inactive: xfs_ifree returned error 22
> > > Apr 20 17:28:23 10 kernel: XFS (sdb): xfs_do_force_shutdown(0x1) called from line 1184 of file fs/xfs/xfs_vnodeops.c. Return address = 0xffffffffa02d4d0a
> > > Apr 20 17:28:23 10 kernel: XFS (sdb): I/O Error Detected. Shutting down filesystem
> > > Apr 20 17:28:23 10 kernel: XFS (sdb): Please umount the filesystem and rectify the problem(s)
> > > Apr 20 17:28:37 10 kernel: XFS (sdb): xfs_log_force: error 5 returned.
> > > Apr 20 17:29:07 10 kernel: XFS (sdb): xfs_log_force: error 5 returned.
> > > Apr 20 17:29:37 10 kernel: XFS (sdb): xfs_log_force: error 5 returned.
> > > Apr 20 17:30:07 10 kernel: XFS (sdb): xfs_log_force: error 5 returned.
> > >
> > > debugfs trace:
> > > https://docs.google.com/file/d/0B7n2C4T5tfNCTlZGUVpnZENrZ3M/edit?usp=sharing
> >
> > FWIW...
> >
> > <...>-6908 [001] 8739.967623: xfs_iunlink: dev 8:16 ino 0x83a8b mode 0100000, flags 0x0
> > <...>-6909 [001] 8739.970252: xfs_iunlink: dev 8:16 ino 0x83a8b mode 0100000, flags 0x0
> >
> > 0x83a8b and 0x1b20b both hash to unlinked list bucket 11.
> >
> > As to the rest of the trace, there appears to be a significant amount of link activity on (directory) inode 0x83a8a (the immediately prior inode to the inode involved in the race). The name data in the trace suggests activity somewhere under .glusterfs. A couple questions:
> >
> > 1.) Any idea what entries point to this inode right now (e.g., how many links on this inode) and where it resides in the fs (path)?
> >
> > 2.) Can you associate this kind of heavy remove/link pattern on a single inode to a higher-level activity? For example, if you were to watch the trace data live, is this a normal pattern you observe? Does it only occur when a rebalance is in progress? Or when a rebalance finishes? Any detailed observations you can make in that regard could be helpful.
> >
> > Brian
> >
> > > Thank you.
> > >
> > > 2013/4/20 符永涛 <yongtaofu@xxxxxxxxx>
> > >
> > > > Hi Eric,
> > > > The xfs module is loaded from the system kernel; it happens on our production server too (I did not touch that till now), and if the xfs module were messed up, systemtap would also not work, but it works now. As you have mentioned, the strange thing is that the xfs shutdown always happens when glusterfs rebalance completes.
> > > >
> > > > 2013/4/20 Eric Sandeen <sandeen@xxxxxxxxxxx>
> > > >
> > > > > On 4/19/13 9:03 PM, 符永涛 wrote:
> > > > > > Hi Eric,
> > > > > > I will enable them and run the test again. I can only reproduce it with glusterfs rebalance. Glusterfs uses a mechanism it calls syncop to unlink files. For rebalance it uses syncop_unlink (glusterfs/libglusterfs/src/syncop.c). In the glusterfs sync_task framework (glusterfs/libglusterfs/src/syncop.c) it uses "makecontext/swapcontext" <http://www.opengroup.org/onlinepubs/009695399/functions/makecontext.html>. Does it lead to racing unlinks from different CPU cores?
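As an aside on the "bucket 11" observation quoted above: XFS picks the AGI unlinked list bucket as agino % XFS_AGI_UNLINKED_BUCKETS (64 buckets per AG), so the two inode numbers from the logs can be checked with a trivial standalone computation (the constant is copied here rather than pulled from the kernel headers):

/* bucket-check.c: confirm both aginos map to AGI unlinked bucket 11 */
#include <stdio.h>

#define XFS_AGI_UNLINKED_BUCKETS 64        /* mirrors the kernel definition */

int main(void)
{
        unsigned long aginos[] = { 0x83a8b, 0x1b20b };        /* both in AG 0 */
        int i;

        for (i = 0; i < 2; i++)
                printf("agino 0x%lx -> unlinked bucket %lu\n",
                       aginos[i], aginos[i] % XFS_AGI_UNLINKED_BUCKETS);

        return 0;        /* prints bucket 11 for both */
}

That shared bucket is what ties the two inodes together: if racing xfs_iunlink() calls on 0x83a8b corrupted the bucket-11 list, a later xfs_iunlink_remove() of 0x1b20b on the same list could then fail with the error 22 seen in the log.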
> > > > > Yep, I understand that it's rebalance. It dies when rebalance finishes because an open but unlinked file trips over the corrupted list from earlier, it seems.
> > > > >
> > > > > I don't know why makecontext would matter...
> > > > >
> > > > > Just to be sure, you are definitely loading the xfs module from the kernel you built, right, and you don't have a "priority" module getting loaded from elsewhere? Seems unlikely, but just to be sure.
> > > > >
> > > > > > Thank you.
> > > > >
> > > > > You could also add this patch to the xfs tracepoints to print more information about the inodes - the mode & flags.
> > > > >
> > > > > -Eric
> > > > >
> > > > >
> > > > > diff --git a/fs/xfs/linux-2.6/xfs_trace.h b/fs/xfs/linux-2.6/xfs_trace.h
> > > > > index e8ce644..c314b87 100644
> > > > > --- a/fs/xfs/linux-2.6/xfs_trace.h
> > > > > +++ b/fs/xfs/linux-2.6/xfs_trace.h
> > > > > @@ -544,14 +544,18 @@ DECLARE_EVENT_CLASS(xfs_inode_class,
> > > > >          TP_STRUCT__entry(
> > > > >                  __field(dev_t, dev)
> > > > >                  __field(xfs_ino_t, ino)
> > > > > +                __field(__u16, mode)
> > > > > +                __field(unsigned long, flags)
> > > > >          ),
> > > > >          TP_fast_assign(
> > > > >                  __entry->dev = VFS_I(ip)->i_sb->s_dev;
> > > > >                  __entry->ino = ip->i_ino;
> > > > > +                __entry->mode = VFS_I(ip)->i_mode;
> > > > > +                __entry->flags = ip->i_flags;
> > > > >          ),
> > > > > -        TP_printk("dev %d:%d ino 0x%llx",
> > > > > +        TP_printk("dev %d:%d ino 0x%llx mode 0%o, flags 0x%lx",
> > > > >                    MAJOR(__entry->dev), MINOR(__entry->dev),
> > > > > -                  __entry->ino)
> > > > > +                  __entry->ino, __entry->mode, __entry->flags)
> > > > >  )
> > > > >
> > > > >  #define DEFINE_INODE_EVENT(name) \
> > > >
> > > > --
> > > > 符永涛
> > >
> > > --
> > > 符永涛
>
> --
> 符永涛

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs