On Thu, Aug 23, 2007 at 08:13:41AM -0400, Chris Mason wrote:
> On Thu, 23 Aug 2007 12:47:23 +1000
> David Chinner <dgc@xxxxxxx> wrote:
> > On Wed, Aug 22, 2007 at 08:42:01AM -0400, Chris Mason wrote:
> > > I think we should assume a full scan of s_dirty is impossible in the
> > > presence of concurrent writers.  We want to be able to pick a start
> > > time (right now) and find all the inodes older than that start time.
> > > New things will come in while we're scanning.  But perhaps that's
> > > what you're saying...
> > >
> > > At any rate, we've got two types of lists now.  One keeps track of
> > > age and the other two keep track of what is currently being
> > > written.  I would try two things:
> > >
> > > 1) s_dirty stays a list for FIFO.  s_io becomes a radix tree that
> > > indexes by inode number (or some arbitrary field the FS can set in
> > > the inode).  Radix tree tags are used to indicate which things in
> > > s_io are already in progress or are pending (hand waving because
> > > I'm not sure exactly).
> > >
> > > inodes are pulled off s_dirty and the corresponding slot in s_io is
> > > tagged to indicate IO has started.  Any nearby inodes in s_io are
> > > also sent down.
> >
> > The problem with this approach is that it only looks at inode
> > locality.  Data locality is ignored completely here, and the data for
> > all the inodes that are close together could be splattered all over
> > the drive.  In that case, clustering by inode location is exactly the
> > wrong thing to do.
>
> Usually it won't be less wrong than clustering by time.
>
> > For example, XFS changes allocation strategy at 1TB for 32-bit inode
> > filesystems, which makes the data get placed far away from the inodes,
> > i.e. inodes in AGs below 1TB, all data in AGs above 1TB.  Clustering
> > by inode number for data writeback is mostly useless in the >1TB case.
>
> I agree we'll want a way to let the FS provide the clustering key.  But
> for the first cut of the patch, I would suggest keeping it simple.
>
> > The inode32 (for <1TB) and inode64 allocators both try to keep data
> > close to the inode (i.e. in the same AG), so clustering by inode number
> > might work better here.
> >
> > Also, it might be worthwhile allowing the filesystem to supply a
> > hint or mask for "closeness" for inode clustering.  This would help
> > the generic code only try to cluster inode writes to inodes that
> > fall into the same cluster as the first inode....
>
> Yes, also a good idea after things are working.
>
> > > > Notes:
> > > > (1) I'm not sure inode number is correlated to disk location in
> > > >     filesystems other than ext2/3/4.  Or parent dir?
> > >
> > > In general, it is a better assumption than sorting by time.  It may
> > > make sense to one day let the FS provide a clustering hint
> > > (corresponding to the first block in the file?), but for starters it
> > > makes sense to just go with the inode number.
> >
> > Perhaps multiple hints are needed - one for data locality and one
> > for inode cluster locality.
>
> So, my feature creep idea would have been more data clustering.  I'm
> mainly trying to solve this graph:
>
> http://oss.oracle.com/~mason/compilebench/makej/compare-create-dirs-0.png
>
> Where background writing of the block device inode is making ext3 do
> seeky writes while creating directory trees.  My simple idea was to kick
> off an "I've just written block X" callback to the FS, where it may
> decide to send down dirty chunks of the block device inode that also
> happen to be dirty.
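
Going back to the s_dirty/s_io idea above, here is a rough sketch of how
I read the radix tree side, just to make sure we mean the same thing.
The s_io_tree field and the WB_TAG_* names below are made up for
illustration (they are not existing symbols), and error handling is
omitted:

	/*
	 * Sketch only: s_dirty stays a time-ordered list; s_io becomes a
	 * radix tree keyed by inode number (or an FS-supplied clustering
	 * key), with tags telling pending entries from in-progress ones.
	 */
	#include <linux/fs.h>
	#include <linux/radix-tree.h>

	#define WB_TAG_PENDING		0	/* queued, IO not started yet */
	#define WB_TAG_IN_PROGRESS	1	/* IO already submitted */

	/* Move one expired inode from s_dirty into the (hypothetical) s_io tree. */
	static void wb_queue_inode(struct super_block *sb, struct inode *inode)
	{
		unsigned long key = inode->i_ino;	/* or an FS-supplied hint */

		radix_tree_insert(&sb->s_io_tree, key, inode);
		radix_tree_tag_set(&sb->s_io_tree, key, WB_TAG_PENDING);
	}

	/*
	 * When IO starts on @inode, also sweep up some pending neighbours
	 * with nearby keys so their writes go out together.
	 */
	static void wb_cluster_writeback(struct super_block *sb, struct inode *inode)
	{
		struct inode *batch[16];
		unsigned int found, i;

		found = radix_tree_gang_lookup_tag(&sb->s_io_tree, (void **)batch,
						   inode->i_ino, 16, WB_TAG_PENDING);
		for (i = 0; i < found; i++) {
			radix_tree_tag_clear(&sb->s_io_tree, batch[i]->i_ino,
					     WB_TAG_PENDING);
			radix_tree_tag_set(&sb->s_io_tree, batch[i]->i_ino,
					   WB_TAG_IN_PROGRESS);
			/* then actually write it out (__writeback_single_inode()
			 * or similar) */
		}
	}

The "closeness" hint/mask David mentions could then simply bound the gang
lookup so it only picks up keys in the same cluster as the first inode.
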
> But, maintaining the kupdate max dirty time and congestion limits in
> the face of all this clustering gets tricky.  So, I wasn't going to
> suggest it until the basic machinery was working.
>
> Fengguang, this isn't a small project ;) But, lots of people will be
> interested in the results.

Exactly, the current writeback logic is unsatisfactory in many ways.

As for writeback clustering, inode and data localities can be different.
But I'll follow your suggestion to start simple first and give the idea
a spin on ext3.

-fengguang