Re: How to handle TIF_MEMDIE stalls?

Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx> · Thu, 19 Feb 2015 19:48:16 +0900

Michal Hocko wrote:
> Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > Tetsuo Handa wrote:
> > > > Michal Hocko wrote:
> > > > > Because they cannot perform any IO/FS transactions and that would lead
> > > > > to a premature OOM conditions way too easily. OOM killer is a _last
> > > > > resort_ reclaim opportunity not something that would happen just because
> > > > > you happen to be not able to flush dirty pages. 
> > > > 
> > > > But you should not have applied such change without making necessary
> > > > changes to GFP_NOFS / GFP_NOIO users with such expectation and testing
> > > > at linux-next.git . Applying such change after 3.19-rc6 is a sucker punch.
> > > 
> > > This is a nonsense. OOM was disbaled for !__GFP_FS for ages (since
> > > before git era).
> > >  
> > Then, at least I expect that filesystem error actions will not be taken so
> > trivially. Can we apply http://marc.info/?l=linux-mm&m=142418465615672&w=2 for
> > Linux 3.19-stable?
> 
> I do not understand. What kind of bug would be fixed by that change?

That change fixes significant loss of file I/O reliability under extreme
memory pressure.

Today I tested how frequent filesystem errors occurs using scripted environment.
( Source code of a.out is http://marc.info/?l=linux-fsdevel&m=142425860904849&w=2 )

----------
#!/bin/sh
: > ~/trial.log
for i in `seq 1 100`
do
    mkfs.ext4 -q /dev/sdb1 || exit 1
    mount -o errors=remount-ro /dev/sdb1 /tmp || exit 2
    chmod 1777 /tmp
    su - demo -c ~demo/a.out
    if [ -w /tmp/ ]
    then
        echo -n "S" >> ~/trial.log
    else
        echo -n "F" >> ~/trial.log
    fi
    umount /tmp
done
----------

We can see that filesystem errors are occurring frequently if GFP_NOFS / GFP_NOIO
allocations give up without retrying. On the other hand, as far as these trials,
TIF_MEMDIE stall was not observed if GFP_NOFS / GFP_NOIO allocations give up
without retrying. Maybe giving up without retrying is keeping away from hitting
stalls for this test case?

  Linux 3.19-rc6 (Console log is http://I-love.SAKURA.ne.jp/tmp/serial-20150219-3.19-rc6.txt.xz )

    0 filesystem errors out of 100 trials. 2 stalls.
    SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS

  Linux 3.19 (Console log is http://I-love.SAKURA.ne.jp/tmp/serial-20150219-3.19.txt.xz )

    44 filesystem errors out of 100 trials. 0 stalls.
    SSFFSSSFSSSFSFFFFSSFSSFSSSSSSFFFSFSFFSSSSSSFFFFSFSSFFFSSSSFSSFFFFFSSSSSFSSFSFSSFSFFFSFFFFFFFSSSSSSSS

  Linux 3.19 with http://marc.info/?l=linux-mm&m=142418465615672&w=2 applied.
  (Console log is http://I-love.SAKURA.ne.jp/tmp/serial-20150219-3.19-patched.txt.xz )

    0 filesystem errors out of 100 trials. 2 stalls.
    SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS

If result of Linux 3.19 is what you wanted, we should chime fs developers
for immediate action. (But __GFP_NOFAIL discussion between you and Dave
is in progress. I don't know whether ext4 and underlying subsystems should
start using __GFP_NOFAIL.)

P.S. Just for experimental purpose, Linux 3.19 with below change applied
gave better result than retrying GFP_NOFS / GFP_NOIO allocations without
invoking the OOM killer. Short-lived small GFP_NOFS / GFP_NOIO allocations
can use GFP_ATOMIC instead? How many bytes does blk_rq_map_kern() want?

  --- a/mm/page_alloc.c
  +++ b/mm/page_alloc.c
  @@ -2867,6 +2867,8 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
           int classzone_idx;

           gfp_mask &= gfp_allowed_mask;
  +        if (gfp_mask == GFP_NOFS || gfp_mask == GFP_NOIO)
  +                gfp_mask = GFP_ATOMIC;

           lockdep_trace_alloc(gfp_mask);

    0 filesystem errors out of 100 trials. 0 stalls.
    SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html