Hi,

On Mon, 2011-06-20 at 16:09 -0400, Christoph Hellwig wrote:
> On Sun, Jun 19, 2011 at 11:01:15PM +0800, Wu Fengguang wrote:
> > When there are only one (or several) dirtiers, dirty_exceeded is always
> > (or mostly) off. Converting to timestamp avoids this problem. It helps
> > to use smaller write_chunk for smoother throttling.
>
> In current mainline gfs2 has grown a non-trivial reference to
> backing_dev_info.dirty_exceeded, which needs to be dealt with.
>

So let me try and explain what's going on there... the basic issue is
that writeback is done on a per-inode basis, but pages are accounted for
on a per-address space basis.

In GFS2, glocks referring to inodes and rgrps (resource groups) both
have an address space associated with them. These address spaces contain
the metadata that would normally be in the block device address space,
but have been separated so that we can sync and/or invalidate metadata
easily on a per-inode basis. Note that we have the additional
requirement to be able to track clean metadata, so the existing
per-inode list of dirty metadata doesn't work for GFS2.

Due to the lifetime rules for the glocks, and the lack of an inode for
rgrps, the mapping->host for the glock address spaces has to point at
the block device inode.

Now in the normal inode case, that isn't a problem - writeback calls
->write_inode which can then write out the dirty metadata pages (if
any). The issue we've hit has been with rgrps, and in particular when
the total dirty data associated with rgrps exceeds the per-bdi dirty
limit. In that case we found that writeback was spinning without making
any progress, since it was trying to write back inodes (all by that
stage clean) and it didn't have any way to start writeback on rgrps.

So the simplest solution was to check the dirty_exceeded flag during
inode writeback, and if it is set, try writing back more data than
actually requested via the ail lists. This list contains all the dirty
metadata, so it includes the rgrps too.

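To make the stall and the workaround concrete, here is a rough
user-space toy model in C. It is only a sketch and not the GFS2 code:
every name in it (toy_sb, toy_mapping, toy_ail_flush and so on) is
hypothetical, the "dirty exceeded" test is a plain comparison against a
made-up limit, and the factor of 4 used for "more than requested" is an
arbitrary assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: none of these types or functions are the kernel's. */

enum owner_kind { OWNER_INODE, OWNER_RGRP };

struct toy_mapping {
	enum owner_kind kind;	/* what this glock address space belongs to */
	int dirty_pages;	/* dirty metadata pages held in the mapping */
};

struct toy_sb {
	struct toy_mapping *maps;	/* every mapping, in list (AIL-like) order */
	size_t nr_maps;
	int bdi_dirty_limit;		/* stand-in for the per-bdi dirty limit */
};

int toy_total_dirty(const struct toy_sb *sb)
{
	int total = 0;

	for (size_t i = 0; i < sb->nr_maps; i++)
		total += sb->maps[i].dirty_pages;
	return total;
}

/* Stand-in for testing backing_dev_info.dirty_exceeded. */
bool toy_dirty_exceeded(const struct toy_sb *sb)
{
	return toy_total_dirty(sb) > sb->bdi_dirty_limit;
}

/* Write up to nr pages from one mapping; returns the number written. */
int toy_write_mapping(struct toy_mapping *m, int nr)
{
	int n = m->dirty_pages < nr ? m->dirty_pages : nr;

	m->dirty_pages -= n;
	return n;
}

/* Walk the whole list, so rgrp metadata is reached as well as inodes. */
int toy_ail_flush(struct toy_sb *sb, int nr)
{
	int written = 0;

	for (size_t i = 0; i < sb->nr_maps && written < nr; i++)
		written += toy_write_mapping(&sb->maps[i], nr - written);
	return written;
}

/*
 * Per-inode writeback. Without the check, only the inode's own mapping
 * is written. With it, an exceeded limit triggers a list walk that asks
 * for more than was requested (factor of 4 is an assumption).
 */
int toy_write_inode(struct toy_sb *sb, size_t idx, int nr_to_write,
		    bool check_exceeded)
{
	assert(idx < sb->nr_maps && sb->maps[idx].kind == OWNER_INODE);

	if (check_exceeded && toy_dirty_exceeded(sb))
		return toy_ail_flush(sb, 4 * nr_to_write);
	return toy_write_mapping(&sb->maps[idx], nr_to_write);
}
```

With one clean inode and one rgrp holding more dirty pages than the
limit, calling toy_write_inode() with check_exceeded false writes
nothing however often it is called, which is the spinning described
above. With check_exceeded true, the same call drains rgrp pages through
the list walk.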
Due to the way in which rgrps are used, it is impossible to dirty one
without also dirtying at least one inode. In addition to that, the
ordering of data blocks on the ail list is often closer to optimal
(especially for workloads with lots of small files) and we get a
performance improvement by doing writeback that way too.

Having said that, I know it's not ideal, and I'm open to any suggestions
for better solutions,

Steve.