Re: extremely slow file creation/deletion after xfs ran full

On Mon, Jan 12, 2015 at 09:36:49AM +0100, Carsten Aulbert wrote:
> Hi
> 
> a couple of days ago, one of our XFS pools ran full, which we did not
> realize due to some broken monitoring.
> 
> After deleting files, the FS is now filled to about 80-85% and file
> creation as well as deletion is extremely slow:
> 
> $ cd /srv
> $ time touch test
> touch test  0.00s user 0.17s system 5% cpu 2.993 total
> $ time touch test
> touch test  0.00s user 0.00s system 0% cpu 0.002 total
> $ time rm test
> rm test  0.00s user 0.00s system 0% cpu 4.954 total
> 

Based on the size and consumption of the fs, the first thing that comes to
mind is perhaps fragmented use of inode records. E.g., inode records are
spread all over the storage with handfuls of inodes free here and
there, which means an individual inode allocation can take a hit searching
for a record with free inodes. I don't think that explains rm
performance though.
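
If you want a rough picture of how the free inodes are spread across the
AGs, something like the following might help (just a sketch; the AGI
count/freecount fields should show allocated vs. free inodes per AG, and
the range matches your agcount=15):

	# for ag in $(seq 0 14); do
	      echo "AG $ag:"
	      xfs_db -r -c "agi $ag" -c "p count freecount" /dev/sdb1
	  done

Lots of AGs with a small nonzero freecount would line up with the
fragmented inode record theory above.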

It might be interesting to grab a couple of perf traces of the touch and rm
commands and see what cpu usage looks like, e.g. 'perf record -g touch
<file>', 'perf report -g', etc.
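
For instance (the file name is just a scratch placeholder, adjust as
needed):

	$ perf record -g -- touch /srv/testfile
	$ perf report -g
	$ perf record -g -- rm /srv/testfile
	$ perf report -g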

> As we had seen some XFS error messages like XFS: Internal error
> XFS_WANT_CORRUPTED_GOTO[1] shortly before the file system was completely
> filled, we unmounted it and successfully ran xfs_repair on it[2].
> 

Interesting, that is:

	XFS_WANT_CORRUPTED_GOTO(rlen <= flen, error0);

... which seems strange given that rlen should have been previously
trimmed. Taking a closer look at xfs_alloc_fix_len() reminded me of the
following commit, which actually documents and fixes instances of this
error, so that might be the most likely culprit:

	30265117 xfs: Fix rounding in xfs_alloc_fix_len()

That originally went into 3.16 and I don't see it in the 3.14 stable
branch. Did xfs_repair actually report anything wrong?
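
If you have a kernel git tree handy (with the stable tags fetched), you
can double-check whether that fix ever made it into the kernel you're
running, e.g.:

	$ git describe --contains 30265117
	$ git log --oneline v3.14.24 --grep "xfs_alloc_fix_len"

The first shows the release the fix originally landed in; the second
should show whether that commit (or a backport of it) is anywhere in the
v3.14.24 history.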

> We started to dig into fragmentation levels, but to be honest, I'm not
> too sure how to interpret the numbers. All seem pretty innocent to me
> except the very high directory fragmentation:
> 
> # mount
> [...]
> /dev/sdb1 on /srv type xfs (rw,relatime,attr2,inode64,noquota)
> 
> # uname -a
> Linux einstein4 3.14.24-atlas #1 SMP Wed Nov 19 08:29:23 UTC 2014 x86_64 GNU/Linux
> 
> (after the reboot to add more RAM, before 3.14.11)
> # df -h /dev/sdb1
> Filesystem      Size  Used Avail Use% Mounted on
> /dev/sdb1        15T   12T  2.7T  82% /srv
> 
> # df -i /dev/sdb1
> Filesystem         Inodes     IUsed      IFree IUse% Mounted on
> /dev/sdb1      3105468352 554441361 2551026991   18% /srv
> 
> # xfs_info /dev/sdb1
> meta-data=/dev/sdb1              isize=256    agcount=15, agsize=268435455 blks
>          =                       sectsz=512   attr=2
> data     =                       bsize=4096   blocks=3881835478, imaxpct=5
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0
> log      =internal               bsize=4096   blocks=521728, version=2
>          =                       sectsz=512   sunit=0 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> 
> (underneath is an Areca-backed 16-disk RAID 10 and yes, possibly we
> should have gone with more AGs, hindsight is great ;))
> 
> # xfs_db -r -c "frag" /dev/sdb1
> actual 562488447, ideal 553668062, fragmentation factor 1.57%
> # xfs_db -r -c "freesp" /dev/sdb1
>    from      to extents  blocks    pct
>       1       1 9669664 9669664   1.36
>       2       3 5634086 12956862   1.83
>       4       7 1720988 8320694   1.17
>       8      15  314377 3088702   0.44
>      16      31  231279 4941737   0.70
>      32      63 10501367 372068430  52.45
>      64     127 2604359 204737946  28.86
>     128     255  377595 63623350   8.97
>     256     511   58744 20198673   2.85
>     512    1023    8843 5759074   0.81
>    1024    2047    1044 1418464   0.20
>    2048    4095     255  685540   0.10
>    4096    8191      81  453953   0.06
>    8192   16383      14  126863   0.02
>   16384   32767      18  373876   0.05
>   32768   65535      11  435041   0.06
>   65536  131071       6  458767   0.06
> 
> # xfs_db -r -c "frag -d" /dev/sdb1
> actual 7987218, ideal 67532, fragmentation factor 99.15%
> # xfs_db -r -c "frag -f" /dev/sdb1
> actual 555079796, ideal 554171759, fragmentation factor 0.16%
> 
> Finally, we still see XFS error messages in the kernel logs[3].
> 
> Questions:
>   * Would xfs_fsr be able to cure this problem?
>   * Would moving to xfsprogs 3.2 and a kernel > 3.15 with on-disk format 5
> plus crc help protect against this "wear"?
> 

It seems like you have a sufficient amount of free space available. That
said, it's fairly common for filesystems to naturally drop in performance
as free space becomes more limited. E.g., I think it's common practice
to avoid regular usage beyond 80% utilization if performance is a major
concern. Also, I doubt xfs_fsr will do much to affect inode allocation
performance, but I could be wrong.
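
It also might be worth checking whether any individual AG is much tighter
on free space than the filesystem-wide numbers suggest, e.g.:

	# xfs_db -r -c "freesp -s -a 0" /dev/sdb1

... repeated for each AG (0-14 in your case).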

With regard to v5, I don't think there is anything that will affect
"wear" as such. Perhaps there are speculative preallocation fixes that
might affect long term allocation patterns, depending on your workload.
CRC support provides extra metadata validation. v5 also includes free
inode btree support, which can improve inode allocation performance in
the inode record fragmentation scenario described above.
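
Note that the v5 format can only be set at mkfs time, so that means a full
backup, reformat and restore. If you do go that route at some point, a
rough sketch, assuming an xfsprogs new enough to have the finobt option
(the agcount value is purely illustrative, given your earlier comment
about wanting more AGs):

	# mkfs.xfs -m crc=1,finobt=1 -d agcount=32 /dev/sdb1

finobt requires a v5 (crc=1) filesystem and a 3.16 or newer kernel.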

Brian

> Cheers
> 
> Carsten
> 
> PS: Please Cc me, as I'm not subscribed ATM.
> 
> 
> [1]Jan  7 20:58:47 einstein4 kernel: [14029681.124548] XFS: Internal
> error XFS_WANT_CORRUPTED_GOTO at line 1419 of file fs/xfs/xfs_alloc.c.
> Caller 0xffffffffa023594d
> Jan  7 20:58:47 einstein4 kernel: [14029681.124657] CPU: 5 PID: 13437
> Comm: kworker/5:1 Not tainted 3.14.11-atlas #1
> Jan  7 20:58:47 einstein4 kernel: [14029681.124745] Hardware name:
> Supermicro X7DVL/X7DVL, BIOS 6.00 11/02/2007
> Jan  7 20:58:47 einstein4 kernel: [14029681.124860] Workqueue: xfsalloc
> xfs_bmapi_allocate_worker [xfs]
> Jan  7 20:58:47 einstein4 kernel: [14029681.124977]  0000000000000000
> ffff88013b7ebc80 ffffffff814e9555 ffff88017022bd30
> Jan  7 20:58:47 einstein4 kernel: [14029681.125247]  ffffffffa0234aed
> ffff88041877c380 0000000000000000 000000000000000c
> Jan  7 20:58:47 einstein4 kernel: [14029681.125514]  0000000000000000
> 0000000100000000 0ff0b98800000000 0000000100000002
> Jan  7 20:58:47 einstein4 kernel: [14029681.125783] Call Trace:
> Jan  7 20:58:47 einstein4 kernel: [14029681.125856]
> [<ffffffff814e9555>] ? dump_stack+0x41/0x51
> Jan  7 20:58:47 einstein4 kernel: [14029681.125937]
> [<ffffffffa0234aed>] ? xfs_alloc_ag_vextent_size+0x33d/0x650 [xfs]
> Jan  7 20:58:47 einstein4 kernel: [14029681.126035]
> [<ffffffffa023594d>] ? xfs_alloc_ag_vextent+0x9d/0x100 [xfs]
> Jan  7 20:58:47 einstein4 kernel: [14029681.126118]
> [<ffffffffa02366c5>] ? xfs_alloc_vextent+0x485/0x610 [xfs]
> Jan  7 20:58:47 einstein4 kernel: [14029681.126202]
> [<ffffffffa0245ff4>] ? xfs_bmap_btalloc+0x2d4/0x790 [xfs]
> Jan  7 20:58:47 einstein4 kernel: [14029681.126286]
> [<ffffffffa0246ab1>] ? __xfs_bmapi_allocate+0xd1/0x2e0 [xfs]
> Jan  7 20:58:47 einstein4 kernel: [14029681.126366]
> [<ffffffffa0216375>] ? xfs_bmapi_allocate_worker+0x35/0x70 [xfs]
> Jan  7 20:58:47 einstein4 kernel: [14029681.126456]
> [<ffffffff8107d5db>] ? process_one_work+0x16b/0x400
> Jan  7 20:58:47 einstein4 kernel: [14029681.126529]
> [<ffffffff8107e9c6>] ? worker_thread+0x116/0x380
> Jan  7 20:58:47 einstein4 kernel: [14029681.126601]
> [<ffffffff8107e8b0>] ? manage_workers.isra.21+0x290/0x290
> Jan  7 20:58:47 einstein4 kernel: [14029681.126675]
> [<ffffffff81084bcc>] ? kthread+0xbc/0xe0
> Jan  7 20:58:47 einstein4 kernel: [14029681.126745]
> [<ffffffff81084b10>] ? flush_kthread_worker+0x80/0x80
> Jan  7 20:58:47 einstein4 kernel: [14029681.126819]
> [<ffffffff814f724c>] ? ret_from_fork+0x7c/0xb0
> Jan  7 20:58:47 einstein4 kernel: [14029681.126890]
> [<ffffffff81084b10>] ? flush_kthread_worker+0x80/0x80
> Jan  7 20:58:47 einstein4 kernel: [14029681.126973] XFS (sdb1): page
> discard on page ffffea000b973650, inode 0x8e56233bd, offset 172167168.
> 
> [2] multiple times until we had enough RAM installed into the machine
> (24GB+swap space finally made the difference)
> 
> [3] e.g.
> [170369.029932] ------------[ cut here ]------------
> [170369.029940] WARNING: CPU: 4 PID: 65 at fs/xfs/xfs_aops.c:1184
> xfs_vm_releasepage+0xb5/0xc0 [xfs]()
> [170369.029942] Modules linked in: dm_mod nfsv3 tun xt_nat autofs4
> ipt_REJECT ipt_MASQUERADE xt_state iptable_nat nf_nat_ipv4 nf_nat
> nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack iptable_filter ip_tables
> xt_tcpudp x_tables nfsd auth_rpcgss oid_registry nfs_acl nfs lockd
> fscache sunrpc ext4 crc16 mbcache jbd2 ipmi_poweroff ipmi_watchdog
> ipmi_devintf ipmi_si ipmi_msghandler iTCO_wdt iTCO_vendor_support ttm
> drm_kms_helper drm coretemp i5000_edac edac_core lpc_ich i5k_amb
> mfd_core ioatdma sr_mod cdrom i2c_algo_bit i2c_i801 rng_core i2c_core
> joydev dca psmouse ehci_pci snd_pcm snd_timer snd soundcore pcspkr
> serio_raw shpchp processor parport_pc parport evdev kvm thermal_sys
> button xfs crc32c libcrc32c md_mod usb_storage hid_generic usbhid hid
> sd_mod sg crc_t10dif crct10dif_common floppy ahci libahci libata
> uhci_hcd e1000 ehci_hcd arcmsr usbcore usb_common scsi_mod e1000e ptp
> pps_core
> [170369.029986] CPU: 4 PID: 65 Comm: kswapd0 Tainted: G        W
> 3.14.24-atlas #1
> [170369.029986] Hardware name: Supermicro X7DVL/X7DVL, BIOS 6.00 11/02/2007
> [170369.029987]  0000000000000009 ffff88060d461a18 ffffffff81529af8
> 0000000000000000
> [170369.029989]  0000000000000000 ffff88060d461a58 ffffffff8106a37c
> ffff880610389d40
> [170369.029991]  ffff88060d461b28 ffffea000f39f8a0 ffff8805f722dd98
> ffffea000f39f8c0
> [170369.029994] Call Trace:
> [170369.029994]  [<ffffffff81529af8>] dump_stack+0x46/0x58
> [170369.029996]  [<ffffffff8106a37c>] warn_slowpath_common+0x8c/0xc0
> [170369.029998]  [<ffffffff8106a3ca>] warn_slowpath_null+0x1a/0x20
> [170369.030000]  [<ffffffffa0212225>] xfs_vm_releasepage+0xb5/0xc0 [xfs]
> [170369.030009]  [<ffffffff81182a10>] ? anon_vma_prepare+0x140/0x140
> [170369.030011]  [<ffffffff81149902>] try_to_release_page+0x32/0x50
> [170369.030014]  [<ffffffff8115ed7a>] shrink_active_list+0x3aa/0x3b0
> [170369.030016]  [<ffffffff8115f154>] shrink_lruvec+0x3d4/0x6a0
> [170369.030017]  [<ffffffff811aa001>] ?
> __mem_cgroup_largest_soft_limit_node+0x21/0xd0
> [170369.030019]  [<ffffffff811aa001>] ?
> __mem_cgroup_largest_soft_limit_node+0x21/0xd0
> [170369.030021]  [<ffffffff8115f486>] shrink_zone+0x66/0x1a0
> [170369.030023]  [<ffffffff8116065c>] balance_pgdat+0x3ac/0x600
> [170369.030025]  [<ffffffff81160a34>] kswapd+0x184/0x470
> [170369.030027]  [<ffffffff810ae460>] ? finish_wait+0x80/0x80
> [170369.030028]  [<ffffffff811608b0>] ? balance_pgdat+0x600/0x600
> [170369.030030]  [<ffffffff8108da79>] kthread+0xc9/0xe0
> [170369.030032]  [<ffffffff8108d9b0>] ? flush_kthread_worker+0x80/0x80
> [170369.030034]  [<ffffffff81538cfc>] ret_from_fork+0x7c/0xb0
> [170369.030035]  [<ffffffff8108d9b0>] ? flush_kthread_worker+0x80/0x80
> [170369.030038] ---[ end trace 9f4d705dbe4e039c ]---
> 
> -- 
> Dr. Carsten Aulbert - Max Planck Institute for Gravitational Physics
> Callinstrasse 38, 30167 Hannover, Germany
> phone/fax: +49 511 762-17185 / -17193
> https://wiki.atlas.aei.uni-hannover.de/foswiki/bin/view/ATLAS/WebHome
> 
> _______________________________________________
> xfs mailing list
> xfs@xxxxxxxxxxx
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs


