On Tue, Mar 12, 2013 at 12:01:00PM +0100, Jan Kara wrote:
> On Fri 08-03-13 12:35:25, Dave Chinner wrote:
> > On Thu, Mar 07, 2013 at 11:24:06AM +0100, Jan Kara wrote:
> > > But really I was wondering about the usefulness of the
> > > XFS_ALLOCTYPE_NEAR_BNO heuristic. Sure, the seek time depends on the
> > > distance, so if we are speaking about allocating a single extent then
> > > XFS_ALLOCTYPE_NEAR_BNO is useful, but once that strategy has allocated
> > > two or three consecutive extents you've lost all the benefit, and you
> > > would be better off starting allocation from the start of the free
> > > space.
<snip>
> > Hence repeated backwards allocation is, in many cases, desirable
> > behaviour, and in others it doesn't make any difference. And with it
> > being easy to work around for the few cases where it is tripped over
> > on aging filesystems, there hasn't been a great deal of need to
> > change the behaviour.
> Thanks for the explanation!

No problem - there's a lot of stuff in my head that simply isn't
documented anywhere, and it would take months of work to document it,
so I've never written it down...

> > I suspect we don't even need to detect the backwards allocation
> > pattern at all - if we are falling back from an
> > XFS_ALLOCTYPE_THIS_BNO allocation, we know that the target cannot be
> > allocated and so we *might* end up doing a backwards allocation.
> > Hence I don't think it will hurt at all to simply say "select the
> > start of the next candidate free extent" for file data allocations
> > in this case. That would completely avoid needing to detect repeated
> > backwards allocations and ensure that subsequent allocations after
> > the backwards jump to the start of the large free extent then go
> > forwards as intended...
> Yes, I was wondering whether this strategy wouldn't be good enough as
> well.
> But there's one case I'm afraid of: Suppose you have a small file, so
> that it gets written out in one request and fits in a single extent of
> blocks. Then it is somewhat advantageous to allocate from the end of
> the free extent, because that is closer to the inode and thus reduces
> the inode-to-data seek time. That's why I was suggesting maybe we
> could use this logic only for larger files or something like that.
> Hmm, maybe if we didn't use this strategy for the initial file extent,
> it would be enough. That's already handled specially in the allocator,
> so it shouldn't be a problem to detect. What do you think?

It took me a couple of readings to work out what you meant here. You
are wondering whether selecting the start of a free extent for the
initial data extent of a small file is the right thing to do, as it may
not be the "closest" free space, right?

The thing is, we don't know if the inode is going to be written to
again, and so there is no context available to determine whether we
will need to do a THIS_BNO exact allocation to extend the file later
on. Because XFS is designed for large files and high-bandwidth IO, it
is assumed that contiguous allocation is the most important factor in
determining where to allocate a block. Hence even for a small
allocation, the assumption is that there will be another contiguous
allocation request, and another, and another.... Hence XFS always tries
to place new files at the start of a large freespace extent, not at the
end, because it is assumed that files will always grow in the short
term.

The thing is, this is all *very* complicated, and there's a lot of
context that needs to be understood that, as I said above, isn't really
documented anywhere. So, more context:

We are searching left/right because the extent size we want to allocate
resulted in a by-size btree lookup returning a btree block that is not
at the end of the tree. i.e.
we've hit a btree block that contains free space extents of roughly the
same size as the extent we want to allocate, rather than free space
extents much larger than the size we need.

If we hit the last block, we have large contiguous free space extents,
and so we do a linear search to select the freespace extent that is
*closest* to the target block, and allocate from the front of it. Hence
subsequent THIS_BNO allocations will work just fine, as there's plenty
of contiguous free space left at the end of the allocation.

However, if we don't hit the last block, it means there is some amount
of free space fragmentation present in the filesystem. This means we
can't use a by-size lookup to determine the closest free extent that
would fit the allocation. If freespace fragmentation is bad, there
might be tens of thousands of free extents of that size in the tree,
and selecting the closest one would require a linear search through all
those blocks. That is prohibitively expensive from both a CPU and an IO
point of view, so it isn't done.

[ Freespace fragmentation of this style will occur from having an
application that does concurrent, identically sized allocations that
are small compared to the eventual file size, e.g. using direct IO in
16MB chunks to write files that are tens of GB in length. Over time,
extents get interleaved, and then when you remove some files you end up
with smaller freespace extents, which then get interleaved into more
files, and when some of those get removed you get even smaller
freespace extents. ]

It is when we see this sort of freespace fragmentation that the
NEAR_BNO allocation falls back to the second algorithm and searches
left and right from the target block in the by-block freespace tree.
This *may* find a large extent that is in the last block of the by-size
tree, and this is the trigger for backwards allocation to occur.
If we had found the last by-size btree block in the initial by-size
lookup, this is the very extent that would have been selected by that
algorithm, and we would have allocated from the front of the freespace
extent, not the end of it. So for data allocation, it is quite clear
that there is a significant difference between the way the free space
extent is treated when we carve a chunk out of it, depending on which
of the two algorithms found it.

Hence if we simply change the way we select the best part of the extent
to use for data allocation, to match the method used when we select a
large free space from the by-size tree, we'll stop this repeated
backwards allocation problem from occurring and have similar extent
selection behaviour for data allocation no matter which algorithm finds
the closest large freespace extent....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs