Re: Data can't be wrote to XFS RIP [<ffffffffa041a99a>] xfs_dir2_sf_get_parent_ino+0xa/0x20

On Mon, Jul 20, 2015 at 10:30:31PM +0800, Kuo Hugo wrote:
> Hi Brian,
> 
> > I don’t know much about the Swift bug. A BUG() or crash in the kernel is
> > generally always a kernel bug, regardless of what userspace is doing. It
> > certainly could be that whatever userspace is doing to trigger the kernel
> > bug is a bug in the userspace application, but either way it shouldn’t
> > cause the kernel to crash. By the same token, if Swift is updated to fix
> > the aforementioned bug and the kernel crash no longer reproduces, that
> > doesn’t necessarily mean the kernel bug is fixed (just potentially hidden).
> 
> Understand.
> 
> [Previous Message]
> 
> The valid inode has an inode number of 13668207561.
> - The fsname for this inode is "sdb."
> - The inode does appear to have a non-NULL if_data:
> 
>     if_u1 = {
>       if_extents = 0xffff88084feaf5c0,
>       if_ext_irec = 0xffff88084feaf5c0,
>       if_data = 0xffff88084feaf5c0 "\004"
>     },
> 
>         find <mntpath> -inum 13668207561
> 
> Q1: Were you able to track down the directory inode mentioned in the
> previous message?
> 
> Ans: Yes, it’s the directory/file shown below. /srv/node/d224 is the mount
> point of /dev/sdb. This is the original location of the path. This folder
> now contains the file 1436266052.71893.ts. The .ts file is 0 bytes in size.
> 
> 
> [root@r2obj01 ~]# find /srv/node/d224 -inum 13668207561
> /srv/node/d224/objects/45382/b32/b146865bf8034bfc42570b747c341b32
> 
> [root@r2obj01 ~]# ls -lrt
> /srv/node/d224/objects/45382/b32/b146865bf8034bfc42570b747c341b32
> -rw------- 1 swift swift 0 Jul 7 22:37 1436266052.71893.ts
> 
> Q2: Is it some kind of internal directory used by the application (e.g.,
> perhaps related to the quarantine mechanism mentioned in the bug)?
> 
> Ans: Yes, it’s a directory that is accessed by the application.
> 

Ok, so I take it that we have a directory per object based on some kind
of hash. The directory presumably contains the object along with
whatever metadata is tracked.
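
To make sure I'm picturing the layout right, here is a rough sketch of what
I think is going on. The md5 hashing and the helper name are just my guesses
from the paths in this thread, not necessarily what Swift actually does:

import hashlib
import os

# Guess at the on-disk layout being described (hashing scheme and names are
# assumptions, not Swift's actual code): each object hashes to a directory
# under the device mount point, e.g.
#   /srv/node/d224/objects/45382/b32/b146865bf8034bfc42570b747c341b32/
# and a DELETE leaves a zero-length "<timestamp>.ts" tombstone inside it.
def object_dir(mount, partition, object_name):
    obj_hash = hashlib.md5(object_name.encode("utf-8")).hexdigest()
    suffix = obj_hash[-3:]          # e.g. "b32"
    return os.path.join(mount, "objects", str(partition), suffix, obj_hash)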

> 
>  37 ffff8810718343c0 ffff88105b9d32c0 ffff8808745aa5e8 REG  [eventpoll]
>  38 ffff8808713da780 ffff880010c9a900 ffff88096368a188 REG
> /srv/node/d224/quarantined/objects/b146865bf8034bfc42570b747c341b32/1436266042.57775.ts
>  39 ffff880871cb03c0 ffff880495a8b380 ffff8808a5e6c988 REG
> /srv/node/d224/tmp/tmpSpnrHg
> 
>  40 ffff8808715b4540 ffff8804819c58c0 ffff8802381f8d88 DIR
> /srv/node/d224/quarantined/objects/b146865bf8034bfc42570b747c341b32
> 
> The above operation in the swift-object-server was a Python function call
> renaming the file
> /srv/node/d224/objects/45382/b32/b146865bf8034bfc42570b747c341b32/1436266042.57775.ts
> to
> /srv/node/d224/quarantined/objects/b146865bf8034bfc42570b747c341b32/1436266042.57775.ts
> 
> os.rename(old, new)
> 
> And it crashed at this point. In Q1, we found the inum points to the
> directory /srv/node/d224/objects/45382/b32/b146865bf8034bfc42570b747c341b32.
> 

The original stacktrace shows the crash in a readdir request. I'm sure
there are multiple things going on here (and there are a couple rename
traces in the vmcore sitting on locks), of course, but where does the
information about the rename come from?

> We found multiple (over 10) DELETE requests from the application against the
> target file at almost the same moment. A DELETE removes the original file in
> the directory and creates a new empty .ts file in this directory. I suspect
> that multiple os.rename calls on the same file in that directory cause the
> kernel panic.
> 
> And the file
> /srv/node/d224/quarantined/objects/b146865bf8034bfc42570b747c341b32/1436266042.57775.ts
> was not created.
> 

I'm not quite following here because I don't have enough context about
what the application server is doing. So far, it sounds like we somehow
have multiple threads competing to rename the same file..? Is there
anything else in this directory at the time this sequence executes
(e.g., a file with object data that also gets quarantined)?

Ideally, we'd ultimately like to translate this into a sequence of
operations as seen by the fs that hopefully trigger the problem. We
might have to start by reproducing through the application server.
Looking back at that bug report, it sounds like a 'DELETE' is a
high-level server operation that can consist of multiple sub-operations
at the filesystem level (e.g., list, conditional rename if *.ts file
exists, etc.). Do you have enough information through any of the above
to try and run something against Swift that might explicitly reproduce
the problem? For example, have one thread that creates and recreates the
same object repeatedly and many more competing threads that try to
remove (or whatever results in the quarantine) it? Note that I'm just
grasping at straws here, you might be able to design a more accurate
reproducer based on what it looks like is happening within Swift.
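
Purely for illustration, a minimal sketch of the kind of thing I mean. The
paths and the tombstone name are placeholders lifted from this thread; it
skips Swift entirely and just hammers the suspected create/list/rename race
at the filesystem level, so point it at a scratch XFS mount:

import os
import threading
import time

# Hypothetical paths modeled on the ones in this thread; adjust before running.
OBJ_DIR = "/srv/node/d224/objects/45382/b32/b146865bf8034bfc42570b747c341b32"
QUAR_DIR = "/srv/node/d224/quarantined/objects/b146865bf8034bfc42570b747c341b32"
TOMBSTONE = "1436266042.57775.ts"

stop = threading.Event()

def creator():
    # One thread repeatedly recreates the zero-length tombstone in the
    # per-object directory (roughly the "create new empty .ts file" step).
    while not stop.is_set():
        try:
            open(os.path.join(OBJ_DIR, TOMBSTONE), "w").close()
        except OSError:
            pass

def quarantiner():
    # Many threads race to list the directory and rename whatever they find
    # into the quarantine directory, mimicking concurrent DELETEs that all
    # decide to quarantine the same object.
    while not stop.is_set():
        try:
            for name in os.listdir(OBJ_DIR):
                os.rename(os.path.join(OBJ_DIR, name),
                          os.path.join(QUAR_DIR, name))
        except OSError:
            pass  # losing the race is expected; keep hammering

for d in (OBJ_DIR, QUAR_DIR):
    if not os.path.isdir(d):
        os.makedirs(d)

threads = [threading.Thread(target=creator)]
threads += [threading.Thread(target=quarantiner) for _ in range(10)]
for t in threads:
    t.start()
time.sleep(600)          # hammer for ten minutes, then wind down
stop.set()
for t in threads:
    t.join()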

Brian

> Regards // Hugo
>

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs