On Wed, Apr 26, 2023 at 10:38 AM Phillip Lougher <phillip@xxxxxxxxxxxxxxx> wrote:
>
>
> On 26/04/2023 17:44, Phillip Lougher wrote:
> >
> > On 26/04/2023 12:07, Hui Wang wrote:
> >>
> >> On 4/26/23 16:33, Michal Hocko wrote:
> >>> [CC squashfs maintainer]
> >>>
> >>> On Wed 26-04-23 13:10:30, Hui Wang wrote:
> >>>> If we run stress-ng on a squashfs filesystem, the system ends up
> >>>> in a hang-like state: stress-ng cannot finish running and the
> >>>> console stops reacting to user input.
> >>>>
> >>>> This issue happens on all the arm/arm64 platforms we are working
> >>>> on; through debugging, we found it is introduced by the OOM
> >>>> handling in the kernel.
> >>>>
> >>>> fs->readahead() is called between memalloc_nofs_save() and
> >>>> memalloc_nofs_restore(), and squashfs_readahead() calls
> >>>> alloc_page(). In this case, if there is no memory left,
> >>>> out_of_memory() is called without __GFP_FS, so the OOM killer is
> >>>> not triggered and the process loops endlessly, waiting for some
> >>>> other process to trigger the OOM killer and release memory. But on
> >>>> a system whose entire root filesystem is squashfs, nearly all
> >>>> userspace processes call out_of_memory() without __GFP_FS, so the
> >>>> system enters a hang-like state when running stress-ng.
> >>>>
> >>>> To fix it, we could trigger a kthread to call page_alloc() with
> >>>> __GFP_FS before returning from out_of_memory() when it was entered
> >>>> without __GFP_FS.
> >>> I do not think this is an appropriate way to deal with this issue.
> >>> Does it even make sense to trigger the OOM killer for something like
> >>> readahead? Would it be more mindful to fail the allocation instead?
> >>> That being said, should allocations from squashfs_readahead use
> >>> __GFP_RETRY_MAYFAIL instead?
> >>
> >> Thanks for your comment. This issue can hardly be reproduced on an
> >> ext4 filesystem, because ext4->readahead() does not call
> >> alloc_page(). If ext4->readahead() is changed as below, the issue
> >> becomes easy to reproduce on ext4 as well (repeatedly run:
> >> $stress-ng --bigheap ${num_of_cpu_threads} --sequential 0
> >> --timeout 30s --skip-silent --verbose)
> >>
> >> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> >> index ffbbd9626bd8..8b9db0b9d0b8 100644
> >> --- a/fs/ext4/inode.c
> >> +++ b/fs/ext4/inode.c
> >> @@ -3114,12 +3114,18 @@ static int ext4_read_folio(struct file *file, struct folio *folio)
> >>
> >>  static void ext4_readahead(struct readahead_control *rac)
> >>  {
> >>  	struct inode *inode = rac->mapping->host;
> >> +	struct page *tmp_page;
> >>
> >>  	/* If the file has inline data, no need to do readahead. */
> >>  	if (ext4_has_inline_data(inode))
> >>  		return;
> >>
> >> +	tmp_page = alloc_page(GFP_KERNEL);
> >> +
> >>  	ext4_mpage_readpages(inode, rac, NULL);
> >> +
> >> +	if (tmp_page)
> >> +		__free_page(tmp_page);
> >>  }
> >>
> >>
> >> BTW, I applied my patch to linux-next and ran the OOM stress-ng
> >> test cases overnight; there was no hang, oops or crash, so it looks
> >> like there is no big problem with using a kthread to trigger the OOM
> >> killer in this case.
> >>
> >> And hi squashfs maintainer, I checked the filesystem code, and it
> >> looks like most filesystems do not call alloc_page() in their
> >> readahead(). Could you please help take a look at this issue? Thanks.
> >
> >
> > This will be because most filesystems don't need to do so. Squashfs is
> > a compressed filesystem with large blocks covering much more than one
> > page, and it decompresses these blocks in squashfs_readahead(). If
> > __readahead_batch() does not return the full set of pages covering the
> > Squashfs block, it allocates a temporary page for the decompressors to
> > decompress into to "fill in the hole".
> >
> > What can be done here as far as Squashfs is concerned ... I could
> > move the page allocation out of the readahead path (e.g. do it at
> > mount time).
>
> You could try this patch which does that.  Compile tested only.

The kmalloc_array() may call alloc_page() and trigger this problem too,
IIUC. Should it be pre-allocated as well?

>  fs/squashfs/page_actor.c     | 10 +---------
>  fs/squashfs/page_actor.h     |  1 -
>  fs/squashfs/squashfs_fs_sb.h |  1 +
>  fs/squashfs/super.c          | 10 ++++++++++
>  4 files changed, 12 insertions(+), 10 deletions(-)
>
> diff --git a/fs/squashfs/page_actor.c b/fs/squashfs/page_actor.c
> index 81af6c4ca115..6cce239eca66 100644
> --- a/fs/squashfs/page_actor.c
> +++ b/fs/squashfs/page_actor.c
> @@ -110,15 +110,7 @@ struct squashfs_page_actor *squashfs_page_actor_init_special(struct squashfs_sb_
>  	if (actor == NULL)
>  		return NULL;
>
> -	if (msblk->decompressor->alloc_buffer) {
> -		actor->tmp_buffer = kmalloc(PAGE_SIZE, GFP_KERNEL);
> -
> -		if (actor->tmp_buffer == NULL) {
> -			kfree(actor);
> -			return NULL;
> -		}
> -	} else
> -		actor->tmp_buffer = NULL;
> +	actor->tmp_buffer = msblk->actor_page;
>
>  	actor->length = length ? : pages * PAGE_SIZE;
>  	actor->page = page;
> diff --git a/fs/squashfs/page_actor.h b/fs/squashfs/page_actor.h
> index 97d4983559b1..df5e999afa42 100644
> --- a/fs/squashfs/page_actor.h
> +++ b/fs/squashfs/page_actor.h
> @@ -34,7 +34,6 @@ static inline struct page *squashfs_page_actor_free(struct squashfs_page_actor *
>  {
>  	struct page *last_page = actor->last_page;
>
> -	kfree(actor->tmp_buffer);
>  	kfree(actor);
>  	return last_page;
>  }
> diff --git a/fs/squashfs/squashfs_fs_sb.h b/fs/squashfs/squashfs_fs_sb.h
> index 72f6f4b37863..8feddc9e6cce 100644
> --- a/fs/squashfs/squashfs_fs_sb.h
> +++ b/fs/squashfs/squashfs_fs_sb.h
> @@ -47,6 +47,7 @@ struct squashfs_sb_info {
>  	struct squashfs_cache *block_cache;
>  	struct squashfs_cache *fragment_cache;
>  	struct squashfs_cache *read_page;
> +	void *actor_page;
>  	int next_meta_index;
>  	__le64 *id_table;
>  	__le64 *fragment_index;
> diff --git a/fs/squashfs/super.c b/fs/squashfs/super.c
> index e090fae48e68..674dc187d961 100644
> --- a/fs/squashfs/super.c
> +++ b/fs/squashfs/super.c
> @@ -329,6 +329,15 @@ static int squashfs_fill_super(struct super_block *sb, struct fs_context *fc)
>  		goto failed_mount;
>  	}
>
> +
> +	/* Allocate page for squashfs_readahead()/squashfs_read_folio() */
> +	if (msblk->decompressor->alloc_buffer) {
> +		msblk->actor_page = kmalloc(PAGE_SIZE, GFP_KERNEL);
> +
> +		if (msblk->actor_page == NULL)
> +			goto failed_mount;
> +	}
> +
>  	msblk->stream = squashfs_decompressor_setup(sb, flags);
>  	if (IS_ERR(msblk->stream)) {
>  		err = PTR_ERR(msblk->stream);
> @@ -454,6 +463,7 @@ static int squashfs_fill_super(struct super_block *sb, struct fs_context *fc)
>  	squashfs_cache_delete(msblk->block_cache);
>  	squashfs_cache_delete(msblk->fragment_cache);
>  	squashfs_cache_delete(msblk->read_page);
> +	kfree(msblk->actor_page);
>  	msblk->thread_ops->destroy(msblk);
>  	kfree(msblk->inode_lookup_table);
>  	kfree(msblk->fragment_index);
> --
> 2.35.1
>
>
> > Adding __GFP_RETRY_MAYFAIL so the alloc() can fail will mean Squashfs
> > returning I/O failures due to no memory.  That will cause a lot of
> > applications to crash in a low memory situation.  So a crash rather
> > than a hang.
> >
> > Phillip