On Tue, Jun 15, 2010 at 02:28:22PM +0400, Evgeniy Polyakov wrote:
> On Tue, Jun 15, 2010 at 04:36:43PM +1000, Dave Chinner (david@xxxxxxxxxxxxx) wrote:
> > > Nope. Large-number-of-small-files is a pretty common case. If the fs
> > > doesn't handle that well (ie: by placing them nearby on disk), it's
> > > borked.
> >
> > Filesystems already handle this case just fine as we see it from
> > writeback all the time. Untarring a kernel is a good example of
> > this...
> >
> > I suggested sorting all the IO to be issued into per-mapping page
> > groups because:
> > 	a) makes IO issued from reclaim look almost exactly the same
> > 	   to the filesystem as if writeback is pushing out the IO.
> > 	b) it looks to be a trivial addition to the new code.
> >
> > To me that's a no-brainer.
>
> That doesn't cover the large-number-of-small-files pattern, since
> untarring subsequently means creating something new, which FS can
> optimize. Much more interesting case is when we have dirtied large
> number of small files in kind-of random order and submitted them
> down to disk.
>
> Per-mapping sorting will not do anything good in this case, even if
> files were previously created in a good fashion being placed closely and
> so on, and only block layer will find a correlation between adjacent
> blocks in different files. But with existing queue management it has
> quite a small opportunity, and that's what I think Andrew is arguing
> about.

The solution is not to sort pages on their way to be submitted either,
really.

What I do in fsblock is to maintain a block-nr sorted tree of dirty
blocks. This works nicely because fsblock dirty state is properly
synchronized with page dirty state. So writeout can just walk this in
order, and it provides a pretty optimal submission pattern for any
interleaving of data and metadata. No need for buffer boundary or
hacks like that. (It needs some intelligence for delalloc, though.)

But even with all that, it's not the complete story.
It doesn't know about direct IO, sync IO, or fsyncs, and it would be
very hard and ugly to try to synchronise and sort all that from the
pagecache level. It is also only a heuristic in terms of optimal block
scheduling behaviour. With smarter devices and drivers there might be
better ways to go.

So what is needed is to get as much info into the block layer as
possible. As Andrew says, there shouldn't be such a big difference
between pages being under writeback or dirty in pagecache.
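
To make the block-nr sorted idea concrete, here is a rough toy model in
Python. It is only a sketch under stated assumptions, not fsblock code:
the class and function names are hypothetical, a sorted list with bisect
stands in for a real tree, and the "seek" count is a crude proxy for how
the device would see the submission order. It compares per-mapping
grouping with block-number order for small files that sit next to each
other on disk but were dirtied in a random order.

```python
import bisect


class DirtyBlockIndex:
    """Toy stand-in for a block-number-sorted tree of dirty blocks.

    Hypothetical names; a sorted list plus bisect replaces a real tree.
    """

    def __init__(self):
        self._blocks = []   # dirty block numbers, kept sorted
        self._owner = {}    # block number -> (mapping, page index)

    def mark_dirty(self, block_nr, mapping, page):
        # Insert each block number once, keeping the list sorted.
        if block_nr not in self._owner:
            bisect.insort(self._blocks, block_nr)
        self._owner[block_nr] = (mapping, page)

    def mark_clean(self, block_nr):
        # Drop a block once it has been written out.
        if block_nr in self._owner:
            del self._owner[block_nr]
            del self._blocks[bisect.bisect_left(self._blocks, block_nr)]

    def writeout_order(self):
        # Walk dirty blocks in ascending block-number order.
        return [(b, *self._owner[b]) for b in self._blocks]


def per_mapping_order(dirtied):
    """Group pages by mapping (in first-dirtied order), sorted by page index."""
    groups = {}
    for mapping, page, block_nr in dirtied:
        groups.setdefault(mapping, []).append((page, block_nr))
    return [blk for pages in groups.values() for _, blk in sorted(pages)]


def count_seeks(block_numbers):
    """Count transitions that do not go to the next adjacent block."""
    return sum(1 for a, b in zip(block_numbers, block_numbers[1:]) if b != a + 1)


# Four two-block files laid out back to back (blocks 0..7), dirtied in a
# kind-of random order. Each entry is (mapping, page index, block number).
DIRTIED = [
    ("f2", 0, 4), ("f0", 1, 1), ("f3", 0, 6), ("f1", 1, 3),
    ("f2", 1, 5), ("f0", 0, 0), ("f3", 1, 7), ("f1", 0, 2),
]

index = DirtyBlockIndex()
for mapping, page, block_nr in DIRTIED:
    index.mark_dirty(block_nr, mapping, page)

grouped = per_mapping_order(DIRTIED)
sorted_blocks = [b for b, _, _ in index.writeout_order()]
print(count_seeks(grouped), count_seeks(sorted_blocks))
```

In this toy layout the per-mapping order jumps between files three
times, while the block-number walk is fully sequential. It says nothing
about direct IO, sync IO or fsync, which is exactly the gap described
above.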