On Tue, Mar 03, 2009 at 05:25:36PM +0000, Jamie Lokier wrote:
> Nick Piggin wrote:
> > The block layer below the filesystem should be robust. Well
> > actually the core block layer is (except maybe for the new
> > bio integrity stuff that looks pretty nasty). Not sure about
> > md/dm, but they really should be safe (they use mempools etc).
>
> Are mempools fully safe, or just statistically safer?

They will guarantee forward progress if used correctly, so yes, fully
safe (a minimal sketch of the pattern is at the end of this mail).

> > > it so "we can always make forward progress". But it won't
> > > matter because once a real user drives the system off this
> > > cliff there is no difference between "hung" and "really slow
> > > progress". They are going to crash it and report a hang.
> >
> > I don't think that is the case. These are situations that
> > would be *really* rare and transient. It is not like thrashing
> > in that your working set size exceeds physical RAM, but just
> > a combination of conditions that causes an unusual spike in the
> > required memory to clean some dirty pages (eg. Dave's example
> > of several IOs requiring btree splits over several AGs). Could
> > cause a resource deadlock.
>
> Suppose the system has two pages to be written. The first must
> _reserve_ 40 pages of scratch space just in case the operation will
> need them. If the second page write is initiated concurrently with
> the first, the second must reserve another 40 pages concurrently.
>
> If 10 page writes are concurrent, that's 400 pages of scratch space
> needed in reserve...

You only need to guarantee forward progress, so you would reserve 40
pages up front for the entire machine (some mempools have more memory
than strictly needed to improve performance, so you could toy with
that, but let's just describe the baseline).

So allocations happen as normal, except that when an allocation fails,
the task that failed the allocation is given access to this reserve
memory; any other task requiring the reserve will then block. The
reserve provides enough pages to guarantee forward progress, so that
one task will be able to proceed, and eventually its pages will become
freeable and can be returned to the reserve. Once the writeout has
finished, the reserve becomes available to other tasks again.

So this way you only have to reserve enough to write out 1 page, and
you only start blocking things when their memory allocations would
have failed *anyway*. And you guarantee forward progress.
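
To make that concrete, here is a sketch of the scheme. All the
identifiers are made up for illustration (there is no such interface
in the kernel); try_writeout() and writeout_from_reserve() stand in
for the real filesystem writeout path:

/*
 * One reserve for the whole machine, sized for the worst case of
 * cleaning a single dirty page.  Illustrative only.
 */
#include <linux/gfp.h>
#include <linux/init.h>
#include <linux/mm.h>
#include <linux/mutex.h>

#define RESERVE_PAGES	40	/* worst case to clean one dirty page */

static DEFINE_MUTEX(reserve_lock);	/* at most one reserve user */
static struct page *reserve[RESERVE_PAGES];

/* Hypothetical stand-ins for the real writeout code. */
struct my_io;
int try_writeout(struct my_io *io, gfp_t gfp);	/* -ENOMEM on failure */
void writeout_from_reserve(struct my_io *io, struct page **scratch,
			   int nr);

/* Fill the reserve once at boot, while memory is still plentiful. */
static int __init reserve_init(void)
{
	int i;

	for (i = 0; i < RESERVE_PAGES; i++) {
		reserve[i] = alloc_page(GFP_KERNEL);
		if (!reserve[i])
			return -ENOMEM;
	}
	return 0;
}

static void clean_one_page(struct my_io *io)
{
	/* Fast path: allocations as normal, no serialisation at all. */
	if (try_writeout(io, GFP_NOFS | __GFP_NOWARN) == 0)
		return;

	/*
	 * The allocation would have failed anyway.  Become the one
	 * task entitled to the reserve; every other failing task
	 * sleeps on the mutex.  writeout_from_reserve() may use up
	 * to all RESERVE_PAGES scratch pages but returns them when
	 * the I/O completes, so the next waiter sees a full reserve
	 * again and forward progress is guaranteed.
	 */
	mutex_lock(&reserve_lock);
	writeout_from_reserve(io, reserve, RESERVE_PAGES);
	mutex_unlock(&reserve_lock);
}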
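
And for reference, the mempool pattern the first answer relies on.
The mempool_* and kmem_cache_* calls are the real kernel API; the
my_* names are made up:

/*
 * min_nr elements are allocated up front and never given back to the
 * system.  mempool_alloc() tries the slab first and only then dips
 * into the reserve; with a sleeping gfp mask it never fails, it waits
 * for a mempool_free() from another task instead.  That is what makes
 * forward progress deterministic rather than statistical.
 */
#include <linux/mempool.h>
#include <linux/slab.h>

struct my_obj {
	struct page *page;	/* whatever one I/O needs */
};

#define MY_MIN_RESERVE	2	/* enough elements to complete one I/O */

static struct kmem_cache *my_cache;
static mempool_t *my_pool;

static int my_pool_init(void)
{
	my_cache = kmem_cache_create("my_objs", sizeof(struct my_obj),
				     0, 0, NULL);
	if (!my_cache)
		return -ENOMEM;

	my_pool = mempool_create_slab_pool(MY_MIN_RESERVE, my_cache);
	if (!my_pool) {
		kmem_cache_destroy(my_cache);
		return -ENOMEM;
	}
	return 0;
}

static void my_submit_io(void)
{
	/* Cannot fail: sleeps until the slab or the reserve delivers. */
	struct my_obj *obj = mempool_alloc(my_pool, GFP_NOIO);

	/* ... drive the I/O with obj, wait for completion ... */

	mempool_free(obj, my_pool);	/* tops the reserve back up first */
}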