On Sat, Jul 02, 2016 at 10:30:56PM +0100, Nick Fisk wrote:
> On 2 July 2016 at 22:00, Nick Fisk <friskyfisk10@xxxxxxxxxxxxxx> wrote:
> > On 2 July 2016 at 21:12, Darrick J. Wong <darrick.wong@xxxxxxxxxx> wrote:
> >> On Sat, Jul 02, 2016 at 09:52:40AM +0100, Nick Fisk wrote:
> >>> Hi, I hope someone can help me here.
> >>>
> >>> I'm exporting some XFS filesystems to ESX via NFS with the sync
> >>> option enabled. I'm seeing really heavy fragmentation when multiple
> >>> VMs are copied onto the share at the same time. I'm also seeing
> >>> kmem_alloc failures, which is probably the biggest problem, as this
> >>> effectively takes everything down.
> >>
> >> (Probably a result of loading the millions of bmbt extents into memory?)
> >
> > Yes, I thought that was the case.
> >
> >>
> >>> The underlying storage is a Ceph RBD; the server the FS is running
> >>> on is running kernel 4.5.7. Mount options are currently the
> >>> defaults. Running xfs_db, I'm seeing millions of extents where the
> >>> ideal is listed as a couple of thousand, yet there are only a
> >>> couple of hundred files on the FS. The extent sizes roughly match
> >>> the IO size with which the VMs were written to XFS, so it looks
> >>> like each parallel IO thread is being allocated next to the others
> >>> rather than at spaced-out regions of the disk.
> >>>
> >>> From what I understand, this is because each NFS write opens and
> >>> closes the file, which defeats any chance that XFS's allocation
> >>> features will stop parallel write streams from interleaving with
> >>> each other.
> >>>
> >>> Is there anything I can tune to give each write to each file a
> >>> little bit of space, so that readahead at least has a chance of
> >>> hitting a few MB of sequential data when reading?
> >>
> >> /me wonders if setting an extent size hint on the rootdir before
> >> copying the files over would help here...
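[Editor's note: an extent size hint like the one suggested above can be set on a directory with xfs_io; files created under that directory afterwards inherit the hint. This is a sketch only — the mount point below is hypothetical.]

```shell
# Set a 16 MiB extent size hint on the export's root directory;
# new files created beneath it inherit the hint.
xfs_io -c "extsize 16m" /srv/nfs/vmstore

# Read back the current hint (printed in bytes, in brackets).
xfs_io -c "extsize" /srv/nfs/vmstore
```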
> >
> > I've set a 16M hint and will copy a new VM over; interested to see
> > what happens. Thanks for the suggestion.
>
> Well, I set the 16M hint at the root of the FS and proceeded to copy
> two VMs in parallel, and got this after ~30-60s. After rebooting,
> though, it looks like the extents were being allocated in larger
> blocks, so I guess you can call it progress. Any ideas?

Yikes. :(

Is that with or without filestreams?

--D

>
> Jul 2 22:11:56 Proxy3 kernel: [48777.591415] XFS (rbd8): Access to
> block zero in inode 7054483473 start_block: 0 start_off: 0 blkcnt: 0
> extent-state: 0 lastx: 30d18
> Jul 2 22:11:56 Proxy3 kernel: [48777.602725] XFS (rbd8): Internal
> error XFS_WANT_CORRUPTED_GOTO at line 1947 of file
> /home/kernel/COD/linux/fs/xfs/libxfs/xfs_bmap.c. Caller
> xfs_bmapi_write+0x749/0xa00 [xfs]
> Jul 2 22:11:56 Proxy3 kernel: [48777.608381] CPU: 3 PID: 1463 Comm:
> nfsd Tainted: G OE 4.5.7-040507-generic #201606100436
> Jul 2 22:11:56 Proxy3 kernel: [48777.608385] Hardware name: VMware,
> Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS
> 6.00 09/17/2015
> Jul 2 22:11:56 Proxy3 kernel: [48777.608389] 0000000000000286
> 000000008a69edda ffff8800b76fb548 ffffffff813e1173
> Jul 2 22:11:56 Proxy3 kernel: [48777.608395] 00000000000000cc
> ffff8800b76fb6e0 ffff8800b76fb560 ffffffffc07ad60c
> Jul 2 22:11:56 Proxy3 kernel: [48777.608399] ffffffffc077c959
> ffff8800b76fb658 ffffffffc0777a13 ffff8802116c9000
> Jul 2 22:11:56 Proxy3 kernel: [48777.608403] Call Trace:
> Jul 2 22:11:56 Proxy3 kernel: [48777.608471] [<ffffffff813e1173>]
> dump_stack+0x63/0x90
> Jul 2 22:11:56 Proxy3 kernel: [48777.608566] [<ffffffffc07ad60c>]
> xfs_error_report+0x3c/0x40 [xfs]
> Jul 2 22:11:56 Proxy3 kernel: [48777.608620] [<ffffffffc077c959>] ?
> xfs_bmapi_write+0x749/0xa00 [xfs]
> Jul 2 22:11:56 Proxy3 kernel: [48777.608669] [<ffffffffc0777a13>]
> xfs_bmap_add_extent_delay_real+0x883/0x1ce0 [xfs]
> Jul 2 22:11:56 Proxy3 kernel: [48777.608721] [<ffffffffc077c959>]
> xfs_bmapi_write+0x749/0xa00 [xfs]
> Jul 2 22:11:56 Proxy3 kernel: [48777.608777] [<ffffffffc07b8a8d>]
> xfs_iomap_write_allocate+0x16d/0x380 [xfs]
> Jul 2 22:11:56 Proxy3 kernel: [48777.608822] [<ffffffffc07a25d3>]
> xfs_map_blocks+0x173/0x240 [xfs]
> Jul 2 22:11:56 Proxy3 kernel: [48777.608871] [<ffffffffc07a33a8>]
> xfs_vm_writepage+0x198/0x660 [xfs]
> Jul 2 22:11:56 Proxy3 kernel: [48777.608902] [<ffffffff811993a3>]
> __writepage+0x13/0x30
> Jul 2 22:11:56 Proxy3 kernel: [48777.608908] [<ffffffff81199efe>]
> write_cache_pages+0x1fe/0x530
> Jul 2 22:11:56 Proxy3 kernel: [48777.608912] [<ffffffff81199390>] ?
> wb_position_ratio+0x1f0/0x1f0
> Jul 2 22:11:56 Proxy3 kernel: [48777.608961] [<ffffffffc07bbbaa>] ?
> xfs_iunlock+0xea/0x120 [xfs]
> Jul 2 22:11:56 Proxy3 kernel: [48777.608970] [<ffffffff8119a281>]
> generic_writepages+0x51/0x80
> Jul 2 22:11:56 Proxy3 kernel: [48777.609019] [<ffffffffc07a31c3>]
> xfs_vm_writepages+0x53/0xa0 [xfs]
> Jul 2 22:11:56 Proxy3 kernel: [48777.609028] [<ffffffff8119c73e>]
> do_writepages+0x1e/0x30
> Jul 2 22:11:56 Proxy3 kernel: [48777.609048] [<ffffffff8118f516>]
> __filemap_fdatawrite_range+0xc6/0x100
> Jul 2 22:11:56 Proxy3 kernel: [48777.609053] [<ffffffff8118f691>]
> filemap_write_and_wait_range+0x41/0x90
> Jul 2 22:11:56 Proxy3 kernel: [48777.609103] [<ffffffffc07af7d3>]
> xfs_file_fsync+0x63/0x210 [xfs]
> Jul 2 22:11:56 Proxy3 kernel: [48777.609130] [<ffffffff81249e9b>]
> vfs_fsync_range+0x4b/0xb0
> Jul 2 22:11:56 Proxy3 kernel: [48777.609156] [<ffffffffc04b856d>]
> nfsd_vfs_write+0x14d/0x380 [nfsd]
> Jul 2 22:11:56 Proxy3 kernel: [48777.609176] [<ffffffffc04babf0>]
> nfsd_write+0x120/0x2f0 [nfsd]
> Jul 2 22:11:56 Proxy3 kernel: [48777.609190] [<ffffffffc04c105c>]
> nfsd3_proc_write+0xbc/0x150 [nfsd]
> Jul 2 22:11:56 Proxy3 kernel: [48777.609207] [<ffffffffc04b3348>]
> nfsd_dispatch+0xb8/0x200 [nfsd]
> Jul 2 22:11:56 Proxy3 kernel: [48777.609243] [<ffffffffc03cd21c>]
> svc_process_common+0x40c/0x650 [sunrpc]
> Jul 2 22:11:56 Proxy3 kernel: [48777.609267] [<ffffffffc03ce5c3>]
> svc_process+0x103/0x1b0 [sunrpc]
> Jul 2 22:11:56 Proxy3 kernel: [48777.609286] [<ffffffffc04b2d8f>]
> nfsd+0xef/0x160 [nfsd]
> Jul 2 22:11:56 Proxy3 kernel: [48777.609297] [<ffffffffc04b2ca0>] ?
> nfsd_destroy+0x60/0x60 [nfsd]
> Jul 2 22:11:56 Proxy3 kernel: [48777.609322] [<ffffffff810a06a8>]
> kthread+0xd8/0xf0
> Jul 2 22:11:56 Proxy3 kernel: [48777.609329] [<ffffffff810a05d0>] ?
> kthread_create_on_node+0x1a0/0x1a0
> Jul 2 22:11:56 Proxy3 kernel: [48777.609369] [<ffffffff81825bcf>]
> ret_from_fork+0x3f/0x70
> Jul 2 22:11:56 Proxy3 kernel: [48777.609374] [<ffffffff810a05d0>] ?
> kthread_create_on_node+0x1a0/0x1a0
> Jul 2 22:11:56 Proxy3 kernel: [48777.609444] XFS (rbd8): Internal
> error xfs_trans_cancel at line 990 of file
> /home/kernel/COD/linux/fs/xfs/xfs_trans.c. Caller
> xfs_iomap_write_allocate+0x270/0x380 [xfs]
>
>
> >>
> >> --D
> >>
> >>>
> >>> I have read that inode32 allocates more randomly compared to
> >>> inode64, so I'm not sure whether it's worth trying, as there will
> >>> likely be fewer than 1000 files per FS.
> >>>
> >>> Or am I best just to run fsr after everything has been copied on?
> >>>
> >>> Thanks for any advice
> >>> Nick
> >>
> >>> _______________________________________________
> >>> xfs mailing list
> >>> xfs@xxxxxxxxxxx
> >>> http://oss.sgi.com/mailman/listinfo/xfs
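[Editor's note: for anyone following this thread, the fragmentation being discussed, the filestreams option Darrick asks about, and the fsr pass Nick mentions can all be exercised with the stock XFS tools. The device name and paths below are hypothetical, and filestreams must be enabled at mount time.]

```shell
# Overall fragmentation report: actual vs. ideal extent counts.
xfs_db -r -c frag /dev/rbd8

# Per-file extent map of one suspect VM image.
xfs_bmap -v /srv/nfs/vmstore/vm1-flat.vmdk

# Mount with the filestreams allocator, which tries to keep
# concurrent writers in separate allocation groups.
mount -o filestreams /dev/rbd8 /srv/nfs/vmstore

# Defragment in place once the copies have finished.
xfs_fsr -v /srv/nfs/vmstore
```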