On Sun, Mar 01, 2009 at 02:50:57PM +0100, Nick Piggin wrote:
> On Sun, Mar 01, 2009 at 07:17:44PM +1100, Dave Chinner wrote:
> > On Wed, Feb 25, 2009 at 10:36:29AM +0100, Nick Piggin wrote:
> > > I need this in fsblock because I am working to ensure filesystem metadata
> > > can be correctly allocated and refcounted. This means that page cleaning
> > > should not require memory allocation (to be really robust).
> >
> > Which, unfortunately, is just a dream for any filesystem that uses
> > delayed allocation. i.e. they have to walk the free space trees
> > which may need to be read from disk and therefore require memory
> > to succeed....
>
> Well it's a dream because probably none of them get it right, but
> that doesn't mean its impossible.
>
> You don't need complete memory allocation up-front to be robust,
> but having reserves or degraded modes that simply guarantee
> forward progress is enough.
>
> For example, if you need to read/write filesystem metadata to find
> and allocate free space, then you really only need a page to do all
> the IO.

For journalling filesystems, dirty metadata is pinned for at least
the duration of the transaction and in many cases it is pinned for
multiple transactions (i.e. in memory aggregation of commits like
XFS does). And then once the transaction is complete, it can't be
reused until it is written to disk.

For the worst case usage in XFS, think about a complete btree split
of both free space trees, plus a complete btree split of the extent
tree. That is two buffers per level per btree that are pinned
by the transaction. The free space trees are bound in depth by
the AG size so the limit is (IIRC) 15 buffers per tree at 1TB AG
size. However, the inode extent tree can be deeper than that (bound
by filesystem size).

In effect, writing back a single page could require memory
allocation of 30-40 pages just for metadata that is dirtied by the
allocation transaction.
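
To make the arithmetic behind the 30-40 figure concrete, here is a rough
sketch in Python. It only uses the numbers quoted above (two buffers per
level, two free space trees, the IIRC limit of 15 buffers per tree at 1TB
AG size). The extent tree depth is not given in this mail, so
extent_tree_levels is a hypothetical input, and the function name is made
up for illustration; none of this is XFS code.

```python
# Figures taken from the mail above (the 15 is quoted there as "IIRC").
BUFFERS_PER_LEVEL = 2             # a full split pins two buffers per btree level
FREE_SPACE_TREES = 2              # both per-AG free space btrees split
FREE_SPACE_BUFFERS_PER_TREE = 15  # quoted limit per tree at 1TB AG size


def worst_case_pinned_buffers(extent_tree_levels: int) -> int:
    """Estimate metadata buffers pinned by one worst-case allocation transaction.

    extent_tree_levels is an assumption supplied by the caller: the mail
    only says the extent tree depth is bound by filesystem size.
    """
    if extent_tree_levels < 0:
        raise ValueError("extent_tree_levels must be non-negative")
    free_space = FREE_SPACE_TREES * FREE_SPACE_BUFFERS_PER_TREE
    extent = BUFFERS_PER_LEVEL * extent_tree_levels
    return free_space + extent


if __name__ == "__main__":
    # Hypothetical extent tree depths, chosen only to show the spread.
    for levels in (0, 3, 5):
        print(levels, worst_case_pinned_buffers(levels))
```

With the free space trees alone contributing 30 buffers, an extent tree
of up to five levels (again, an assumed depth) covers the 30-40 range
mentioned above.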

And then the next page written back goes into a different AG and
splits the trees there. And then the next does the same.

Luckily, this sort of thing doesn't happen very often, but it does
serve to demonstrate how difficult it is to quantify how much memory
the writeback path really needs to guarantee forward progress.
Hence the dream......

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx