[ hmmm, there's some whacky utf-8 whitespace characters in the
  copy-n-pasted text... ]

On Thu, Oct 18, 2018 at 10:55:18AM +0300, Avi Kivity wrote:
> 
> On 18/10/2018 04.37, Dave Chinner wrote:
> >On Wed, Oct 17, 2018 at 10:52:48AM +0300, Avi Kivity wrote:
> >>I have a user running a 1.7TB filesystem with ~10% usage (as shown
> >>by df), getting sporadic ENOSPC errors. The disk is mounted with
> >>inode64 and has a relatively small number of large files. The disk
> >>is a single-member RAID0 array, with 1MB chunk size. There are 32

Ok, now I need to know what "single member RAID0 array" means, because
this is clearly related to allocation alignment and I need to know why
the FS was configured the way it was.

Is it one disk? Or is it a hardware RAID0 array that presents as a
single lun with a stripe width of 1MB? If so, how many disks are in
it? Is the chunk size the stripe unit (per-disk chunk size) or the
stripe width (all disks get hit by a 1MB IO)? Or something else?

> >>AGs. Running Linux 4.9.17.
> >ENOSPC on what operation? write? open(O_CREAT)? something else?
> 
> Unknown.
> 
> >What's the filesystem config (xfs_info output)?
> 
> (restored from metadata dump)
> 
> meta-data=/dev/loop2       isize=512    agcount=32, agsize=14494720 blks
>          =                 sectsz=512   attr=2, projid32bit=1
>          =                 crc=1        finobt=0 spinodes=0 rmapbt=0
>          =                 reflink=0
> data     =                 bsize=4096   blocks=463831040, imaxpct=5
>          =                 sunit=256    swidth=256 blks

sunit=swidth is unusual for a RAID0 array, unless it's hardware RAID
and the array only reports one number to mkfs. Was this chosen by
mkfs, or specifically configured by the user? If specifically
configured, why?

What is important is that it means aligned allocations will be used
for any allocation that is over sunit (1MB), and that's where all the
problems seem to come from.

> naming   =version 2        bsize=4096   ascii-ci=0 ftype=1
> log      =internal         bsize=4096   blocks=226480, version=2
>          =                 sectsz=512   sunit=8 blks, lazy-count=1
> realtime =none             extsz=4096   blocks=0, rtextents=0
> 
> >Has xfs_fsr been run on this filesystem regularly?
> 
> xfs_fsr has never been run, until we saw the problem (and then did
> not fix it). IIUC the workload should be self-defragmenting: it
> consists of writing large files, then erasing them. I estimate that
> around 100 files are written concurrently (from 14 threads), and
> they are written with large extent hints. With every large file,
> another smaller (but still large) file is written, and a few
> smallish metadata files.

Do those smaller files get removed when the big files are removed?

> I understood from xfs_fsr that it attempts to defragment files, not
> free space, although that may come as a side effect. In any case I
> ran xfs_db after xfs_fsr and did not see an improvement.

xfs_fsr takes fragmented files and contiguous free space and turns
them into contiguous files and fragmented free space. You have
fragmented free space, so I needed to know if xfs_fsr was responsible
for that....
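FWIW, that distinction can be seen directly with read-only xfs_db
commands along these lines - the device name below is just a
placeholder, and on a mounted filesystem the numbers are only
approximate:

    # file fragmentation factor (what xfs_fsr improves)
    xfs_db -r -c "frag" /dev/sdX

    # free space histogram, whole filesystem and for a single AG
    # (what xfs_fsr can make worse)
    xfs_db -r -c "freesp -s" /dev/sdX
    xfs_db -r -c "freesp -s -a 0" /dev/sdX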
> >If the ENOSPC errors are only from files with 32MB extent size
> >hints on them, then it may be that there isn't sufficient contiguous
> >free space to allocate an entire 32MB extent. I'm not sure what the
> >allocator behaviour here is (the code is a maze of twisty passages),
> >so I'll have to look more into this.
> 
> There are other files with 32MB hints that do not show the error
> (but on the other hand, the error has been observed few enough times
> for that to be a fluke).

*nod*

> >In the mean time, can you post the output of the freespace command
> >(both global and per-ag) so we can see just how much free space
> >there is and how badly fragmented it has become? I might be able to
> >reproduce the behaviour if I know the conditions under which it is
> >occurring.
> 
> xfs_db> freesp
>    from      to  extents      blocks    pct
>       1       1     5916        5916   0.00
>       2       3    10235       22678   0.01
>       4       7    12251       66829   0.02
>       8      15     5521       59556   0.01
>      16      31     5703      132031   0.03
>      32      63     9754      463825   0.11
>      64     127    16742     1590339   0.37
>     128     255   550511   390108625  89.87
>     256     511    71516    29178504   6.72
>     512    1023       19       15355   0.00
>    1024    2047      287      461824   0.11
>    2048    4095      528     1611413   0.37
>    4096    8191     1537    10352304   2.38
>    8192   16383        2       19015   0.00
> 
> Just 2 extents >= 32MB (and they may have been freed after the error).

Yes, and the vast majority of free space is in lengths between 512kB
and 1020kB. This is what I'd expect if you have large, stripe aligned
allocations interleaved with smaller, sub-stripe-unit allocations.

As an example of behaviour that can lead to this sort of free space
fragmentation, start with 10 stripe units of contiguous free space:

  0    1    2    3    4    5    6    7    8    9    10
  +----+----+----+----+----+----+----+----+----+----+----+

Now allocate a > stripe unit extent (say 2 units):

  0    1    2    3    4    5    6    7    8    9    10
  LLLLLLLLLL+----+----+----+----+----+----+----+----+----+

Now allocate a small file A:

  0    1    2    3    4    5    6    7    8    9    10
  LLLLLLLLLLAA---+----+----+----+----+----+----+----+----+

Now allocate another large extent:

  0    1    2    3    4    5    6    7    8    9    10
  LLLLLLLLLLAA---LLLLLLLLLL+----+----+----+----+----+----+

After a while, a significant part of your filesystem looks like this
repeating pattern:

  0    1    2    3    4    5    6    7    8    9    10
  LLLLLLLLLLAA---LLLLLLLLLLBB---LLLLLLLLLLCC---LLLLLLLLLLDD---+

i.e. there are lots of small, isolated sub-stripe-unit free spaces.
If you now start removing large extents but leaving the small files
behind, you end up with this:

  0    1    2    3    4    5    6    7    8    9    10
  LLLLLLLLLLAA---+---------BB---LLLLLLLLLLCC---+----+----DD---+

And when we now go to allocate a new large+small file pair (M+n),
they'll get laid out like this:

  0    1    2    3    4    5    6    7    8    9    10
  LLLLLLLLLLAA---MMMMMMMMMMBB---LLLLLLLLLLCC---nn---+----DD---+

See how we lost a large aligned 2MB freespace @ 9 when the small file
"nn" was laid down? Repeat this fill and free pattern over and over
again, and eventually it fragments the free space until there's no
large contiguous free space left and large aligned extents can no
longer be allocated.

For this to trigger you need the small files to be larger than 1
stripe unit, but still much smaller than the extent size hint, and
the small files need to hang around as the large files come and go.
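If I get time to try to reproduce this, a crude, sequential sketch of
that fill-and-free pattern would look something like the script
below. The mount point, device, file sizes and loop counts are all
made up for illustration, and it ignores the concurrency (~100 files
from 14 threads) of the real workload, so treat it as a sketch of the
pattern rather than a faithful reproducer:

    #!/bin/bash
    # Sketch of the fill-and-free pattern described above. All names
    # and numbers here are illustrative, not taken from the reported
    # workload.
    FS=/mnt/scratch     # assumed: scratch XFS mount, sunit=swidth=1MB
    DEV=/dev/sdX        # assumed: its underlying block device

    for cycle in $(seq 1 50); do
        for i in $(seq 1 20); do
            # Large file: 32MB extent size hint, allocations well over
            # sunit, so it gets stripe aligned extents.
            xfs_io -f -c "extsize 32m" -c "falloc 0 256m" \
                    "$FS/large.$cycle.$i"
            # Small file: bigger than 1 stripe unit but much smaller
            # than the hint. These hang around.
            xfs_io -f -c "falloc 0 2m" "$FS/small.$cycle.$i"
        done
        # Remove only the large files, leaving the small files
        # interleaved through the free space.
        rm -f "$FS"/large.$cycle.*
    done

    # Watch the free space histogram collapse into the 128-255 block
    # (sub-stripe-unit) buckets as the cycles accumulate.
    xfs_db -r -c "freesp -s" "$DEV"

(falloc is used so the allocations happen immediately and interleave
in a deterministic order; delayed allocation and writeback in the
real workload won't be nearly this tidy.)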
> >>Is this a known issue? The effect and symptom is

It's a generic large aligned extent vs small unaligned extent issue,
but I've never seen it manifest in a user workload outside of a very
constrained multistream realtime video ingest/playout workload (i.e.
the workload the filestreams allocator was written for). And before
you ask, no, the filestreams allocator does not solve this problem.

The most common manifestation of this problem has been inode
allocation on filesystems full of small files - inodes are allocated
in large aligned extents compared to small files, and so eventually
the filesystem runs out of large contiguous free space and inodes
can't be allocated. The sparse inodes mkfs option fixed this by
allowing inodes to be allocated as sparse chunks so they could
interleave into any free space available....

> >>Would upgrading the kernel help?
> >Not that I know of. If it's an extszhint vs free space fragmentation
> >issue, then a kernel upgrade is unlikely to fix it.

Upgrading the kernel won't fix it, because it is an extszhint vs free
space fragmentation issue.

Filesystems that get into this state are generally considered
unrecoverable. Well, you can recover them by deleting everything from
them to reform contiguous free space, but you may as well just mkfs
and restore from backup because that's much, much faster than waiting
for rm -rf....

And, really, I expect that a different filesystem geometry and/or
mount options are going to be needed to avoid getting into this state
again. However, I don't yet know enough about what in the workload
and allocator is interacting to trigger the issue, so I can't say
what those changes would be.

Can I get access to the metadump to dig around in the filesystem
directly so I can see how everything has ended up laid out? That will
help me work out what is actually occurring and determine whether
mkfs/mount options can address the problem or whether deeper
allocator algorithm changes may be necessary....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx