On Thu, May 04, 2017 at 11:07:45AM +0300, Alex Lyakas wrote:
> Hello Brian, Christoph,
>
> Thank you for your responses.
>
> > The search overhead could be high due to either fragmented free space or
> > perhaps waiting on busy extents (since you have enabled online discard).
> > Do you have any threads freeing space and waiting on discard operations
> > when this occurs? Also, what does 'xfs_db -c "freesp -s" <dev>' show for
> > this filesystem?
> I disabled the discard, but the problem still happens. Output of the freesp
> command is at [1]. To my understanding this means that 60% of the free space
> is in contiguous extents of 16-31 blocks, i.e., 64KB-124KB. Does this count
> as fragmented free space?
>

Ok. Note that the discard implementation in kernels that old is known to
produce similar stalls, so you may want to consider using fstrim going
forward, independent of this problem.

That aside, free space does appear to be fairly fragmented to me. You do
have some larger extents available, but doing locality-based allocations
certainly seems like it could drop into a range where a time-consuming
search is required.

> I debugged the issue further, profiling the xfs_alloc_ag_vextent_near()
> call and what it does. Some results:
>
> # it appears not to be triggering any READs of xfs_buf, i.e., no calls to
> xfs_buf_ioapply_map() with rw==READ or rw==READA in the same thread
> # most of the time (about 95%) is spent in xfs_buf_lock(), waiting in the
> "down(&bp->b_sema)" call
> # the average time to lock an xfs_buf is about 10-12 ms
>
> For example, in one test it took 45778 ms to complete the
> xfs_alloc_ag_vextent_near() execution. During this time, 6240 xfs_bufs
> were locked, totalling 42810 ms spent locking the buffers, which is about
> 93%. On average, 7 ms to lock a buffer.
>

This is probably dropping into the fallback allocation algorithm in
xfs_alloc_ag_vextent_near(), explained by the following comment:

        /*
         * Second algorithm.
         * Search in the by-bno tree to the left and to the right
         * simultaneously, until in each case we find a space big enough,
         * or run into the edge of the tree.  When we run into the edge,
         * we deallocate that cursor.
         * If both searches succeed, we compare the two spaces and pick
         * the better one.
         * With alignment, it's possible for both to fail; the upper
         * level algorithm that picks allocation groups for allocations
         * is not supposed to do this.
         */

So what is happening here is that the algorithm starts at a point in an AG
based on a starting block number (e.g., the inode block) and searches left
and right from there for a suitable range of free blocks. Depending on the
size of the tree, fragmentation of free space, size of the allocation
request, etc., it can certainly take a while to seek/read all of the btree
blocks required to satisfy the allocation.

I suspect this is ultimately caused by the sync mount option, presumably
converting smallish chunks of delalloc blocks to real blocks repeatedly and
in parallel with other allocations, and fragmenting free space over time. I
don't think there is any easy way out of the current state of the fs short
of reformatting and migrating the data to a new fs that does not use the
sync mount option.

Brian

> # it is still not clear who is holding the lock
>
> Christoph, I understand that kernel 3.18 is EOL at the moment, but it used
> to be a long-term kernel, so there is an expectation of stability, though
> perhaps not of community support at this point.
>
> Thanks,
> Alex.
>
> [1]
>    from      to   extents      blocks    pct
>       1       1    155759      155759   0.00
>       2       3      1319        3328   0.00
>       4       7     13153       56265   0.00
>       8      15    152663     1752813   0.03
>      16      31 143626908  4019133338  60.17
>      32      63   1484214    72838775   1.09
>      64     127   9799130   876068428  13.12
>     128     255   1929928   310722786   4.65
>     256     511    150035    49779357   0.75
>     512    1023     26496    19658529   0.29
>    1024    2047     27800    41418636   0.62
>    2048    4095     26369    77587481   1.16
>    4096    8191     13872    80270202   1.20
>    8192   16383      6653    77527746   1.16
>   16384   32767      4384   100576452   1.51
>   32768   65535      3967   200958816   3.01
>   65536  131071      1346   127613203   1.91
>  131072  262143       753   141530959   2.12
>  262144  524287       473   168900109   2.53
>  524288 1048575       202   147607986   2.21
> 1048576 2097151        65    95394305   1.43
> 2097152 4194303        16    42998164   0.64
> 4194304 8388607         5    26710209   0.40
> total free extents 157425510
> total free blocks 6679263646
> average free extent size 42.4281
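
To make the left/right search Brian describes above concrete, here is a
small toy model in C. It is only a sketch: a flat array sorted by starting
block stands in for the by-bno btree, and the struct and function names
(free_extent, near_alloc) are made up for illustration, not taken from the
kernel. The point it illustrates is that advancing the search in either
direction means visiting more of the free-space index, and in the real
allocator that translates into reading and locking further btree buffers,
which is where the xfs_buf_lock() time reported above goes.

    /*
     * Toy model of the "second algorithm": starting from a target block
     * number, walk an index of free extents sorted by starting block to
     * the left and to the right until each side finds an extent that is
     * large enough, then keep whichever candidate sits closer to the
     * target.  A flat array stands in for the by-bno btree.
     */
    #include <stdint.h>
    #include <stdio.h>

    struct free_extent {
        uint64_t start;         /* first free block of the extent */
        uint64_t len;           /* extent length in blocks */
    };

    /* Return the index of the chosen extent, or -1 if none is big enough. */
    static long near_alloc(const struct free_extent *ext, long n,
                           uint64_t target, uint64_t wanted)
    {
        long left = -1, right = -1, i;

        /* Position the "cursor" at the first extent at or past the target. */
        for (i = 0; i < n && ext[i].start < target; i++)
            ;

        /* Walk right (increasing block numbers) until something fits. */
        for (long r = i; r < n; r++)
            if (ext[r].len >= wanted) { right = r; break; }

        /* Walk left (decreasing block numbers) until something fits. */
        for (long l = i - 1; l >= 0; l--)
            if (ext[l].len >= wanted) { left = l; break; }

        if (left < 0)
            return right;
        if (right < 0)
            return left;

        /* Both directions succeeded: prefer the candidate closer to target. */
        return (target - ext[left].start <= ext[right].start - target)
            ? left : right;
    }

    int main(void)
    {
        /* Mostly small free extents, with one large one far from the target. */
        const struct free_extent fs[] = {
            { 100, 16 }, { 200, 24 }, { 300, 16 }, { 9000, 2048 },
        };
        long idx = near_alloc(fs, 4, 250, 64);

        if (idx >= 0)
            printf("allocated near block %llu (len %llu)\n",
                   (unsigned long long)fs[idx].start,
                   (unsigned long long)fs[idx].len);
        return 0;
    }

With roughly 143 million of the ~157 million free extents in [1] falling in
the 16-31 block range, a request for anything larger than 31 blocks forces
both walks past long runs of too-small records before one side succeeds,
which lines up with the thousands of buffer locks reported above.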
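
On the fstrim suggestion: fstrim(8) batches the discard work into a single
FITRIM ioctl issued against the mounted filesystem, rather than discarding
extents inline as transactions commit the way "-o discard" does. Below is a
minimal sketch of that call; the "/mnt/xfs" mount point and the 64MiB
minimum extent length are placeholder values for illustration, not
recommendations from this thread.

    /* Minimal sketch of what fstrim(8) does: one FITRIM ioctl per run. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>               /* FITRIM, struct fstrim_range */

    int main(void)
    {
        struct fstrim_range range = {
            .start  = 0,
            .len    = (__u64)-1,        /* trim the whole filesystem */
            .minlen = 64 << 20,         /* skip free extents smaller than 64MiB */
        };
        /* "/mnt/xfs" is a placeholder mount point. */
        int fd = open("/mnt/xfs", O_RDONLY | O_DIRECTORY);

        if (fd < 0) {
            perror("open");
            return 1;
        }
        if (ioctl(fd, FITRIM, &range) < 0) {
            perror("FITRIM");
            return 1;
        }
        /* On success the kernel updates range.len to the bytes trimmed. */
        printf("trimmed %llu bytes\n", (unsigned long long)range.len);
        return 0;
    }

Run periodically (for example from cron), this keeps discard work off the
allocation and transaction-commit path entirely, which is the practical
difference behind the "use fstrim going forward" advice above.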