Re: A unresponsive file system can hang all I/O in the system on linux-2.6.23-rc6 (dirty_thresh problem?)

Peter Zijlstra <a.p.zijlstra@xxxxxxxxx> · Fri, 28 Sep 2007 10:40:00 +0200

[ please don't top-post! ]

On Fri, 2007-09-28 at 01:27 -0700, Chakri n wrote:

> On 9/27/07, Peter Zijlstra <a.p.zijlstra@xxxxxxxxx> wrote:
> > On Thu, 2007-09-27 at 23:50 -0700, Andrew Morton wrote:
> >
> > > What we _don't_ want to happen is for other processes which are writing to
> > > other, non-dead devices to get collaterally blocked.  We have patches which
> > > might fix that queued for 2.6.24.  Peter?
> >
> > Nasty problem, don't do that :-)
> >
> > But yeah, with per BDI dirty limits we get stuck at whatever ratio that
> > NFS server/mount (?) has - which could be 100%. Other processes will
> > then work almost synchronously against their BDIs but it should work.
> >
> > [ They will lower the NFS-BDI's ratio, but some fancy clipping code will
> >   limit the other BDIs their dirty limit to not exceed the total limit.
> >   And with all these NFS pages stuck, that will still be nothing. ]
> >
> Thanks.
> 
> The BDI dirty limits sounds like a good idea.
> 
> Is there already a patch for this, which I could try?

v2.6.23-rc8-mm2

> I believe it works like this,
> 
> Each BDI, will have a limit. If the dirty_thresh exceeds the limit,
> all the I/O on the block device will be synchronous.
> 
> so, if I have sda & a NFS mount, the dirty limit can be different for
> each of them.
> 
> I can set dirty limit for
>  -  sda to be 90% and
>  -  NFS mount to be 50%.
> 
> So, if the dirty limit is greater than 50%, NFS does synchronously,
> but sda can work asynchronously, till dirty limit reaches 90%.

Not quite, the system determines the limit itself in an adaptive
fashion.

  bdi_limit = total_limit * p_bdi

Where p is a faction [0,1], and is determined by the relative writeout
speed of the current BDI vs all other BDIs.

So if you were to have 3 BDIs (sda, sdb and 1 nfs mount), and sda is
idle, and the nfs mount gets twice as much traffic as sdb, the ratios
will look like:

 p_sda: 0
 p_sdb: 1/3
 p_nfs: 2/3

Once the traffic exceeds the write speed of the device we build up a
backlog and stuff gets throttled, so these proportions converge to the
relative write speed of the BDIs when saturated with data.

So what can happen in your case is that the NFS mount is the only one
with traffic is will get a fraction of 1. If it then disconnects like in
your case, it will still have all of the dirty limit pinned for NFS.

However other devices will at that moment try to maintain a limit of 0,
which ends up being similar to a sync mount.

So they'll not get stuck, but they will be slow.
Attachment:
signature.asc

Description: This is a digitally signed message part
_______________________________________________
linux-pm mailing list
linux-pm@xxxxxxxxxxxxxxxxxxxxxxxxxx
https://lists.linux-foundation.org/mailman/listinfo/linux-pm