On Tue, 11 Jul 2023, Kent Overstreet wrote:

> On Tue, Jul 11, 2023 at 04:44:39PM -0700, Darrick J. Wong wrote:
> > On Tue, Jul 11, 2023 at 05:51:42PM +0200, Mikulas Patocka wrote:
> > > When I run the test 558 on bcachefs, it works like a fork-bomb and kills
> > > the machine. The reason is that the "while" loop spawns "create_file"
> > > subprocesses faster than they are able to complete.
> > >
> > > This patch fixes the crash by limiting the number of subprocesses to 128.
> > >
> > > Signed-off-by: Mikulas Patocka <mpatocka@xxxxxxxxxx>
> > >
> > > ---
> > >  tests/generic/558 |    1 +
> > >  1 file changed, 1 insertion(+)
> > >
> > > Index: xfstests-dev/tests/generic/558
> > > ===================================================================
> > > --- xfstests-dev.orig/tests/generic/558
> > > +++ xfstests-dev/tests/generic/558
> > > @@ -48,6 +48,7 @@ echo "Create $((loop * file_per_dir)) fi
> > >  while [ $i -lt $loop ]; do
> > >  	create_file $SCRATCH_MNT/testdir $file_per_dir $i >>$seqres.full 2>&1 &
> > >  	let i=$i+1
> > > +	if [ $((i % 128)) = 0 ]; then wait; fi
> >
> > Hm.  $loop is (roughly) the number of free inodes divided by 1000.  This
> > test completes nearly instantly on XFS; how many free inodes does
> > bcachefs report after _scratch_mount?
> >
> > XFS reports ~570k inodes, so it's "only" starting 570 processes.
> >
> > I think it's probably wise to clamp $loop to something sane, but let's
> > get to the bottom of how the math went wrong and we got a forkbomb.
>
> It's because:
>  - bcachefs doesn't even report a maximum number of inodes (IIRC);
>    inodes are small and variable size (most fields are varints, typical
>    inode size is 50-100 bytes).
>
>  - and the kernel has a sysctl to limit the maximum number of open
>    files, and it's got a sane default (this is what's supposed to save
>    us from pinned inodes eating up all RAM), but systemd conveniently
>    overwrites it to some absurd value...
>
> I'd prefer to see this fixed properly, rather than just "fixing" the
> test; userspace being able to pin all kernel memory this easily is a
> real bug.
>
> We could put a second hard cap on the maximum number of open files, and
> base that on a percentage of total memory; a VFS inode is somewhere in
> the ballpark of a kilobyte, so it's easy enough to calculate. And we
> could make that percentage itself a sysctl, for the people who are
> really crazy...

If we hit the limit of total open files, we have already killed the
system. At that point the user can't execute any program, because
executing a program requires opening files.

I think it is possible to set up cgroups so that a process inside a
cgroup can't kill the machine by exhausting resources. But distributions
don't do it, and they don't do it for the root user (the test runs as
root).

Mikulas
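
For reference, a minimal sketch of the kind of cgroup confinement the
last paragraph refers to, assuming cgroup v2 is mounted at
/sys/fs/cgroup with the memory and pids controllers available; the
group name "fstests" and the limit values are only illustrative, not a
recommendation:

    # create a cgroup for the test run and enable the controllers for it
    mkdir /sys/fs/cgroup/fstests
    echo "+memory +pids" > /sys/fs/cgroup/cgroup.subtree_control

    # cap memory usage and the number of tasks the group may create
    echo 4G   > /sys/fs/cgroup/fstests/memory.max
    echo 2048 > /sys/fs/cgroup/fstests/pids.max

    # move the current shell into the group; everything it forks
    # (including the create_file subprocesses) inherits the limits
    echo $$ > /sys/fs/cgroup/fstests/cgroup.procs

This contains the fork bomb even when run as root, although it does not
address the underlying point that pinned inodes shouldn't be able to eat
all kernel memory in the first place. (On a systemd-managed box,
something like "systemd-run --scope -p MemoryMax=4G -p TasksMax=2048
./check generic/558" would be the less intrusive way to get a similar
effect.)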