Re: Data can't be wrote to XFS RIP [<ffffffffa041a99a>] xfs_dir2_sf_get_parent_ino+0xa/0x20

Kuo Hugo <tonytkdk@xxxxxxxxx> · Thu, 18 Jun 2015 22:29:09 +0800

Hi all, 

>- Is this (and how often) reproducible?

This is the third occurrence, on three different servers, in the past 5 days.

>- Have you identified which directory in your fs that the object server is attempting to enumerate when this occurs?

There are multiple object server workers reading and writing across more than 30 XFS disks in each server. I don't have a clue yet which object server request causes the kernel panic. I'm still investigating.

>- Do you have any other, related output in /var/log/messages prior to this event? E.g., corruption messages or anything of that nature?

There seems to be no useful information in /var/log/syslog:

```
Jun 18 06:07:00 r1obj03 ovpn-454f2951-b955-11e4-8034-0cc47a1f36ee[4069]: Data Channel Decrypt: Using 160 bit message hash 'SHA1' for HMAC authentication
Jun 18 06:07:00 r1obj03 ovpn-454f2951-b955-11e4-8034-0cc47a1f36ee[4069]: Control Channel: TLSv1, cipher TLSv1/SSLv3 DHE-RSA-AES256-SHA, 2048 bit RSA
Jun 18 06:10:01 r1obj03 CRON[13595]: (swift) CMD ((date; test -f /etc/swift/object-server.conf && /opt/ss/bin/swift-recon-cron /etc/swift/object-server.conf || /opt/ss/bin/swift-recon-cron /etc/swift/object-server/1.conf) >> /var/log/swift-recon-cron.log 2>&1)
Jun 18 06:10:14 r1obj03 kernel: [7631629.083099] BUG: unable to handle kernel NULL pointer dereference at 0000000000000001
```

>- Have you tried an 'xfs_repair -n' of the affected filesystem? Note that -n will report problems only and prevent any modification by repair.

We might run xfs_repair once we can identify which disk causes the issue.

Thanks // Hugo Kuo

2015-06-18 21:31 GMT+08:00 Brian Foster <bfoster@xxxxxxxxxx>:
On Thu, Jun 18, 2015 at 07:56:24PM +0800, Kuo Hugo wrote:

> Hi folks,
>
> Recently we found the following XFS kernel message. I don’t really know
> how to read it correctly to figure out the problem in the system.
> Is there any known bug for
> Linux-3.13.0-32-generic-x86_64-with-Ubuntu-14.04-trusty? Or is the problem
> in swift-object-se rather than XFS itself?
>

Nothing that I know of, but others might have seen something like this.

> swift-object-se means swift-object-server, a daemon that handles data
> from HTTP to XFS. I can’t tell whether the problem comes from XFS or from
> the swift-object-server daemon.
> Any ideas would be appreciated.
>

> Jun 15 09:49:30 r1obj02 kernel: [607696.798803] BUG: unable to handle kernel NULL pointer dereference at 0000000000000001
> Jun 15 09:49:30 r1obj02 kernel: [607696.800582] IP: [<ffffffffa041a99a>] xfs_dir2_sf_get_parent_ino+0xa/0x20 [xfs]

So that looks like a NULL header down in xfs_dir2_sf_get_ino(), as
hdr->i8count is at a 1 byte offset in the structure.
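
To make the offset argument concrete, here is a small user-space sketch. It assumes the short-form header begins with two one-byte fields, `count` followed by `i8count`, which is my reading of the header layout; `sf_hdr_model` and `i8count_address` are stand-in names invented for illustration, not kernel identifiers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the short-form directory header (assumed layout).
 * Only the two leading one-byte fields matter for this argument. */
struct sf_hdr_model {
	uint8_t count;      /* number of entries */
	uint8_t i8count;    /* number of entries using 8-byte inode numbers */
	uint8_t parent[8];  /* parent inode number bytes */
};

/* i8count sits one byte into the header in this model. */
static_assert(offsetof(struct sf_hdr_model, i8count) == 1,
	      "i8count expected at offset 1");

/* Address that a read of hdr->i8count would touch for a header
 * located at hdr_addr. */
uintptr_t i8count_address(uintptr_t hdr_addr)
{
	return hdr_addr + offsetof(struct sf_hdr_model, i8count);
}
```

With a NULL header, `i8count_address(0)` evaluates to 1, which lines up with RDI being 0 and CR2 being 0000000000000001 in the register dump.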

> Jun 15 09:49:30 r1obj02 kernel: [607696.802230] PGD 1046c6c067 PUD 1044eba067 PMD 0
> Jun 15 09:49:30 r1obj02 kernel: [607696.803308] Oops: 0000 [#1] SMP
> Jun 15 09:49:30 r1obj02 kernel: [607696.804058] Modules linked in: xt_conntrack xfs xt_REDIRECT iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat xt_tcpudp iptable_filter ip_tables x_tables x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ip_vs aesni_intel aes_x86_64 gpio_ich lrw nf_conntrack gf128mul libcrc32c mei_me glue_helper sb_edac ablk_helper cryptd edac_core joydev mei lpc_ich ioatdma lp ipmi_si shpchp wmi mac_hid parport ses enclosure hid_generic igb usbhid ixgbe mpt2sas ahci hid i2c_algo_bit libahci dca raid_class ptp mdio scsi_transport_sas pps_core
> Jun 15 09:49:30 r1obj02 kernel: [607696.817125] CPU: 13 PID: 32401 Comm: swift-object-se Not tainted 3.13.0-32-generic #57-Ubuntu
> Jun 15 09:49:30 r1obj02 kernel: [607696.819020] Hardware name: Silicon Mechanics Storform iServ R518.v4/X9DRH-7TF/7F/iTF/iF, BIOS 3.0b 04/28/2014
> Jun 15 09:49:30 r1obj02 kernel: [607696.821235] task: ffff880017d68000 ti: ffff8808e87e4000 task.ti: ffff8808e87e4000
> Jun 15 09:49:30 r1obj02 kernel: [607696.822889] RIP: 0010:[<ffffffffa041a99a>] [<ffffffffa041a99a>] xfs_dir2_sf_get_parent_ino+0xa/0x20 [xfs]
> Jun 15 09:49:30 r1obj02 kernel: [607696.825117] RSP: 0018:ffff8808e87e5e38 EFLAGS: 00010202
> Jun 15 09:49:30 r1obj02 kernel: [607696.826296] RAX: ffffffffa0458360 RBX: 0000000000000004 RCX: 0000000000000000
> Jun 15 09:49:30 r1obj02 kernel: [607696.905158] RDX: 0000000000000002 RSI: 0000000000000002 RDI: 0000000000000000
> Jun 15 09:49:30 r1obj02 kernel: [607696.987107] RBP: ffff8808e87e5e88 R08: 000000020079e3b9 R09: 0000000000000004
> Jun 15 09:49:30 r1obj02 kernel: [607697.069214] R10: 00000000000003e0 R11: 00000000000005b0 R12: ffff88104d0c0800
> Jun 15 09:49:30 r1obj02 kernel: [607697.151676] R13: ffff8808e87e5f20 R14: ffff88004988f000 R15: 0000000000000000
> Jun 15 09:49:30 r1obj02 kernel: [607697.234244] FS: 00007fe74c9fb740(0000) GS:ffff88085fce0000(0000) knlGS:0000000000000000
> Jun 15 09:49:30 r1obj02 kernel: [607697.318842] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Jun 15 09:49:30 r1obj02 kernel: [607697.361609] CR2: 0000000000000001 CR3: 0000000bcb9b1000 CR4: 00000000001407e0
> Jun 15 09:49:30 r1obj02 kernel: [607697.445360] Stack:
> Jun 15 09:49:30 r1obj02 kernel: [607697.485796] ffff8808e87e5e88 ffffffffa03e2a33 ffff8808e87e5e58 ffffffff817205f9
> Jun 15 09:49:30 r1obj02 kernel: [607697.567306] ffff8808e87e5eb8 ffff88084e1e6700 ffff88004988f000 ffff8808e87e5f20
> Jun 15 09:49:30 r1obj02 kernel: [607697.648568] 0000000000000082 00007fe7487aa7a6 ffff8808e87e5ec0 ffffffffa03e2e0b
> Jun 15 09:49:30 r1obj02 kernel: [607697.729785] Call Trace:
> Jun 15 09:49:30 r1obj02 kernel: [607697.769297] [<ffffffffa03e2a33>] ? xfs_dir2_sf_getdents+0x263/0x2a0 [xfs]

We're called from here attempting to list a directory, which appears to
be the following block of code:

        ...
        sfp = (xfs_dir2_sf_hdr_t *)dp->i_df.if_u1.if_data;
        ...
        if (ctx->pos <= dotdot_offset) {
                ino = dp->d_ops->sf_get_parent_ino(sfp);
                ctx->pos = dotdot_offset & 0x7fffffff;
                if (!dir_emit(ctx, "..", 2, ino, DT_DIR))
                        return 0;
        }

It wants to emit the ".." directory entry and apparently the in-core
data fork is NULL. There's an assertion against that earlier in the
function so I take it the expectation is that this has been read/set
beforehand. In fact, if this is a short form directory I also take it
this should be set to if_inline_data, which appears to be part of the
fork allocation itself.
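
That expectation can be modelled in user space. The following is only a sketch of my reading of the fork setup, not the kernel code: `fork_model`, its helper functions and the 32-byte `MODEL_INLINE_DATA` are invented for illustration, with `data` standing in for `if_u1.if_data` and `inline_data` for `if_inline_data`. Leaving the pointer NULL for a zero-length fork is likewise an assumption of the model that should be checked against the source.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MODEL_INLINE_DATA 32  /* assumed inline buffer size, illustration only */

/* Hypothetical, cut-down model of an in-core fork holding local data. */
struct fork_model {
	size_t bytes;                           /* bytes of valid data */
	char  *data;                            /* models if_u1.if_data */
	char   inline_data[MODEL_INLINE_DATA];  /* models if_inline_data */
};

/* Point 'data' at storage for 'size' bytes copied from 'src'.
 * Returns 0 on success, -1 if a heap allocation fails. */
int fork_model_init_local(struct fork_model *f, const void *src, size_t size)
{
	assert(f != NULL);
	f->bytes = size;
	if (size == 0) {
		f->data = NULL;            /* model assumption: nothing to point at */
		return 0;
	}
	if (size <= sizeof(f->inline_data)) {
		f->data = f->inline_data;  /* storage lives inside the fork itself */
	} else {
		f->data = malloc(size);    /* larger data gets its own buffer */
		if (f->data == NULL)
			return -1;
	}
	memcpy(f->data, src, size);
	return 0;
}

/* True when the data pointer refers to the fork's own inline buffer. */
int fork_model_is_inline(const struct fork_model *f)
{
	return f->data == f->inline_data;
}

/* Release heap storage, if any, and reset the fork to empty. */
void fork_model_destroy(struct fork_model *f)
{
	if (f->data != NULL && f->data != f->inline_data)
		free(f->data);
	f->data = NULL;
	f->bytes = 0;
}
```

In this model a small, populated fork always has a non-NULL `data` pointer, because it points into the fork structure. A NULL pointer would then mean the fork was empty, never initialised, or torn down while a reader was still using it.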

It's not immediately clear to me how this could happen. First off, it
would probably be good to determine whether this is a runtime issue or
due to some kind of on-disk problem. Some questions:

- Is this (and how often) reproducible?
- Have you identified which directory in your fs that the object server
  is attempting to enumerate when this occurs?
- Do you have any other, related output in /var/log/messages prior to
  this event? E.g., corruption messages or anything of that nature?
- Have you tried an 'xfs_repair -n' of the affected filesystem? Note
  that -n will report problems only and prevent any modification by
  repair.

Brian

> Jun 15 09:49:30 r1obj02 kernel: [607697.809560] [<ffffffff817205f9>] ? schedule_preempt_disabled+0x29/0x70
> Jun 15 09:49:30 r1obj02 kernel: [607697.849087] [<ffffffffa03e2e0b>] xfs_readdir+0xeb/0x110 [xfs]
> Jun 15 09:49:30 r1obj02 kernel: [607697.887918] [<ffffffffa03e4a3b>] xfs_file_readdir+0x2b/0x40 [xfs]
> Jun 15 09:49:30 r1obj02 kernel: [607697.926061] [<ffffffff811d0035>] iterate_dir+0xa5/0xe0
> Jun 15 09:49:30 r1obj02 kernel: [607697.963349] [<ffffffff8109ddf4>] ? vtime_account_user+0x54/0x60
> Jun 15 09:49:30 r1obj02 kernel: [607698.000413] [<ffffffff811d0492>] SyS_getdents+0x92/0x120
> Jun 15 09:49:30 r1obj02 kernel: [607698.037112] [<ffffffff811d0150>] ? fillonedir+0xe0/0xe0
> Jun 15 09:49:30 r1obj02 kernel: [607698.072867] [<ffffffff8172c81c>] ? tracesys+0x7e/0xe6
> Jun 15 09:49:30 r1obj02 kernel: [607698.107679] [<ffffffff8172c87f>] tracesys+0xe1/0xe6
> Jun 15 09:49:30 r1obj02 kernel: [607698.141543] Code: 00 48 8b 06 48 ba ff ff ff ff ff ff ff 00 5d 48 0f c8 48 21 d0 c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 8d 77 02 <0f> b6 7f 01 48 89 e5 e8 aa ff ff ff 5d c3 0f 1f 84 00 00 00 00
> Jun 15 09:49:30 r1obj02 kernel: [607698.244881] RIP [<ffffffffa041a99a>] xfs_dir2_sf_get_parent_ino+0xa/0x20 [xfs]
> Jun 15 09:49:30 r1obj02 kernel: [607698.310872] RSP <ffff8808e87e5e38>
> Jun 15 09:49:30 r1obj02 kernel: [607698.343092] CR2: 0000000000000001
> Jun 15 09:49:30 r1obj02 kernel: [607698.420933] ---[ end trace ba3fdf319346b7e6 ]---

>
> Thanks // Hugo Kuo
>
> _______________________________________________
> xfs mailing list
> xfs@xxxxxxxxxxx
> http://oss.sgi.com/mailman/listinfo/xfs
