On Thu, May 19, 2022 at 10:50:14AM +1000, Dave Chinner wrote:
On Thu, May 19, 2022 at 08:36:06AM +1000, Chris Dunlop wrote:
On Wed, May 18, 2022 at 08:59:00AM -0700, Darrick J. Wong wrote:
On Wed, May 18, 2022 at 05:07:13PM +1000, Chris Dunlop wrote:
Oh, sorry... on linux v5.15.34
On Wed, May 18, 2022 at 04:59:49PM +1000, Chris Dunlop wrote:
I have an fstrim that's been running for over 48 hours on a 256T thin
provisioned XFS fs containing around 55T of actual data on a slow
subsystem (ceph 8,3 erasure-encoded rbd). I don't think there would be
an an enourmous amount of data to trim, maybe a few T, but I've no idea
how long how long it might be expected to take. In an attempt to see
what the what the fstrim was doing, I ran an strace on it. The strace
has been sitting there without output and unkillable since then, now 5+
hours ago. Since the strace, on that same filesystem I now have 123 df
processes and 615 rm processes -- and growing -- that are blocked in
xfs_inodegc_flush, e.g.:
...
I suspect that it's just that your storage device is really slow at
small trims. If you didn't set a minimum trim size, XFS will issue
discards on every free space in it's trees. If you have fragmented
free space (quite possible if you're using reflink and removing
files that have been reflinked and modified) then you could have
millions of tiny free spaces that XFS is now asking the storage to
free.
Dumping the free space histogram of the filesystem will tell us just
how much work you asked the storage to do. e.g:
# xfs_spaceman -c freesp /vol
from to extents blocks pct
1 1 2368 2368 0.00
2 3 4493 11211 0.00
4 7 6827 38214 0.00
8 15 12656 144036 0.00
16 31 35988 878969 0.00
32 63 163747 8091729 0.01
64 127 248625 22572336 0.04
128 255 367889 71796010 0.11
256 511 135012 48176856 0.08
512 1023 92534 74126716 0.12
1024 2047 13464 18608536 0.03
2048 4095 3873 10930189 0.02
4096 8191 1374 7886168 0.01
8192 16383 598 6875977 0.01
16384 32767 340 7729264 0.01
32768 65535 146 6745043 0.01
65536 131071 48 4419901 0.01
131072 262143 12 2380800 0.00
262144 524287 5 1887092 0.00
524288 1048575 2 1105184 0.00
1048576 2097151 4 5316211 0.01
2097152 4194303 3 8747030 0.01
4194304 8388607 65 522142416 0.83
8388608 16777215 2 21411858 0.03
67108864 134217727 4 379247828 0.60
134217728 268434432 236 62042143854 98.05
I guess from/to are in units of filesystem blocks, 4kB in this case?
Not that it makes much difference here, but the sake of accuracy... does
the default fstrim without range/size args issue discard requests for
*all* the extents, or, if I'm reading this right:
fs/xfs/xfs_discard.c
--
xfs_ioc_trim(
struct xfs_mount *mp,
struct fstrim_range __user *urange)
{
struct request_queue *q = bdev_get_queue(mp->m_ddev_targp->bt_bdev);
unsigned int granularity = q->limits.discard_granularity;
...
if (copy_from_user(&range, urange, sizeof(range)))
return -EFAULT;
range.minlen = max_t(u64, granularity, range.minlen);
...
}
...does it take into account /sys/block/xxx/queue/discard_granularity, in
this case 64kB, or 16 blocks @ 4kB, so issuing discards only for extents
= 16 blocks?
The open question is what caused the stuck processes?
Oh, that's easy the easy bit to explain: discard runs with the AGF
locked because it is iterating the free space tree directly. Hence
operations on that AG are blocked until all the free space in that
AG have been discarded. Could be smarter, never needed to be
smarter.
Now inodegc comes along, and tries to free an inode in that AG, and
blocks getting the AGF lock during the inode free operation (likely
inode chunk freeing of finobt block allocation). Everythign then
backs up on inodegc flushes, which is backed up on discard
operations....
I'm not sure that explains how the first stuck process only appeared >48
hours after initiating the fstrim. Unless that's because it may have
finally got to the AG(s) with a lot of free extents?
#
# AGs w/ at least 400 free extents: only 31 out of 256 AGs
#
d5# xfs_spaceman -c "freesp -gs" /chroot | awk '$2>=400 {print}'
AG extents blocks
43 69435 29436263
47 14984 5623982
48 42482 166285283
49 56284 218913822
50 10354 240969820
{ sequential range...
54 60416 11292928
55 72742 15344807
56 88455 17204239
57 81594 15218624
58 126731 27719558
59 64525 10047593
60 37324 8591083
61 57267 113745589
62 36501 18360912
63 3998 255040699
64 2684 258737072
65 2047 263904736
66 1503 265467595
67 920 263457328
68 1393 264277701
}
70 1150 266485834
72 406 267609078
77 429 267479272
79 911 267625473
80 1182 267354313
{ sequential range...
105 39345 151535709
106 69512 57950504
107 46464 10236142
108 40795 8666966
109 14431 208027543
110 15800 258264227
}
total free extents 1090313
total free blocks 63273250933
average free extent size 58032.2
The number of free space extents per AG seems oddly "lumpy", e.g. the
sequential cluster ranges beginning with AGs 54 and 105 with a large
number of extents. Is that simply due to the pattern of frees in this
specific case or is there some underlying design to that?
(fs-size - fs-used) / discard-size
256T - 26T / 64k
3,858,759,680 requests
Won't be anywhere near that number - free space in a 256TB
filesystem with only 29TB used will have lots of really large
contiguous free spaces. Those will get broken down into max discard
length chunks, not minimum. Of course, if the bdev is setting a
really small max discard size, then that's going to be just a big a
problem for you....
Is this the bdev's max discard size?
# cat /sys/block/dm-0/queue/discard_max_bytes
4194304
And does that mean, for instance, these 236 extents will be split into
somewhere between 131072 and 262143 individual discard requests (i.e.
size of extent in bytes divided by discard_max_bytes) being sent to the
underlying "device" (ceph rbd)?
# xfs_spaceman -c freesp /vol
from to extents blocks pct
...
134217728 268434432 236 62042143854 98.05
Cheers,
Chris