On Mon, Mar 02, 2009 at 07:19:53PM +1100, Dave Chinner wrote:
> On Sun, Mar 01, 2009 at 02:50:57PM +0100, Nick Piggin wrote:
> > On Sun, Mar 01, 2009 at 07:17:44PM +1100, Dave Chinner wrote:
> > > On Wed, Feb 25, 2009 at 10:36:29AM +0100, Nick Piggin wrote:
> > > > I need this in fsblock because I am working to ensure
> > > > filesystem metadata can be correctly allocated and
> > > > refcounted. This means that page cleaning should not require
> > > > memory allocation (to be really robust).
> > >
> > > Which, unfortunately, is just a dream for any filesystem that
> > > uses delayed allocation. i.e. they have to walk the free space
> > > trees, which may need to be read from disk and therefore
> > > require memory to succeed....
> >
> > Well, it's a dream because probably none of them get it right,
> > but that doesn't mean it's impossible.
> >
> > You don't need complete memory allocation up-front to be robust;
> > having reserves or degraded modes that simply guarantee forward
> > progress is enough.
> >
> > For example, if you need to read/write filesystem metadata to
> > find and allocate free space, then you really only need a page
> > to do all the IO.
>
> For journalling filesystems, dirty metadata is pinned for at least
> the duration of the transaction, and in many cases it is pinned
> for multiple transactions (i.e. in-memory aggregation of commits
> like XFS does). And then once the transaction is complete, it
> can't be reused until it is written to disk.
>
> For the worst case usage in XFS, think about a complete btree
> split of both free space trees, plus a complete btree split of the
> extent tree. That is two buffers per level per btree that are
> pinned by the transaction.
>
> The free space trees are bound in depth by the AG size, so the
> limit is (IIRC) 15 buffers per tree at 1TB AG size. However, the
> inode extent tree can be deeper than that (bound by filesystem
> size). In effect, writing back a single page could require memory
> allocation of 30-40 pages just for metadata that is dirtied by the
> allocation transaction.
>
> And then the next page written back goes into a different AG and
> splits the trees there. And then the next does the same.

So assuming there is no reasonable way to do out-of-core algorithms
on the filesystem metadata (and likely you don't want to anyway,
because it would be a significant slowdown or a divergence of code
paths), you still only need to reserve one set of those 30-40 pages
for the entire kernel.

You only ever need to reserve enough memory for a *single* page to
be processed. In the worst case, where multiple pages are under
writeout but memory cannot be allocated, only one will be allowed
access to the reserves and the others will block until it is
finished and can unpin them all.

> Luckily, this sort of thing doesn't happen very often, but it does
> serve to demonstrate how difficult it is to quantify how much
> memory the writeback path really needs to guarantee forward
> progress. Hence the dream......

Well, I'm not saying it is an immediate problem, or that it would be
a good use of anybody's time to rush out and try to redesign their
fs code to fix it ;) But at least for any new core/generic library
functionality like fsblock, it would be silly not to close the hole
there (not least because the problem is simpler here than in a
complex fs).
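
To make that concrete, here is a minimal sketch of the sort of
single-reserve scheme I have in mind. All the names here
(WB_RESERVE_PAGES, wb_reserve_*) are made up for illustration; this
is not fsblock code:

#include <linux/gfp.h>
#include <linux/init.h>
#include <linux/mm.h>
#include <linux/mutex.h>

/*
 * Worst case from the numbers above: a complete split of both free
 * space btrees plus the extent btree dirties roughly 30-40 pages of
 * metadata in a single allocation transaction.
 */
#define WB_RESERVE_PAGES	40

static struct page *wb_reserve[WB_RESERVE_PAGES];
static DEFINE_MUTEX(wb_reserve_lock);	/* one reserve user at a time */

/* Populate the reserve once, at init time, while memory is plentiful. */
static int __init wb_reserve_init(void)
{
	int i;

	for (i = 0; i < WB_RESERVE_PAGES; i++) {
		wb_reserve[i] = alloc_page(GFP_KERNEL);
		if (!wb_reserve[i]) {
			while (--i >= 0)
				__free_page(wb_reserve[i]);
			return -ENOMEM;
		}
	}
	return 0;
}

/*
 * Called only when a normal allocation fails during page cleaning.
 * At most one writer holds the reserve; everyone else sleeps on the
 * mutex, which is the "others will block until it is finished"
 * behaviour described above. The holder may use all
 * WB_RESERVE_PAGES pages for the metadata its allocation
 * transaction dirties.
 */
static struct page **wb_reserve_acquire(void)
{
	mutex_lock(&wb_reserve_lock);
	return wb_reserve;
}

/*
 * Only called after the transaction's metadata has been written
 * back and every reserve page unpinned and returned, so the next
 * blocked writer sees a full reserve.
 */
static void wb_reserve_release(void)
{
	mutex_unlock(&wb_reserve_lock);
}

A mempool gets you something similar for single objects, but taking
a mutex around the whole reserve lets one allocation transaction own
all 30-40 pages at once, which is what the pinning behaviour above
requires.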