On Wed, Apr 26, 2023 at 10:38 AM Phillip Lougher <phillip@xxxxxxxxxxxxxxx> wrote:
>
>
> On 26/04/2023 17:44, Phillip Lougher wrote:
> >
> > On 26/04/2023 12:07, Hui Wang wrote:
> >>
> >> On 4/26/23 16:33, Michal Hocko wrote:
> >>> [CC squashfs maintainer]
> >>>
> >>> On Wed 26-04-23 13:10:30, Hui Wang wrote:
> >>>> If we run stress-ng on a squashfs filesystem, the system ends up
> >>>> in a hang-like state: stress-ng cannot finish running and the
> >>>> console stops reacting to user input.
> >>>>
> >>>> This issue happens on all the arm/arm64 platforms we are working
> >>>> on; through debugging, we found it is introduced by the OOM
> >>>> handling in the kernel.
> >>>>
> >>>> fs->readahead() is called between memalloc_nofs_save() and
> >>>> memalloc_nofs_restore(), and squashfs_readahead() calls
> >>>> alloc_page(). In this case, if there is no memory left,
> >>>> out_of_memory() is called without __GFP_FS, so the OOM killer is
> >>>> not triggered and the process loops endlessly, waiting for some
> >>>> other process to trigger the OOM killer and release memory. But on
> >>>> a system whose entire root filesystem is squashfs, nearly all
> >>>> userspace processes call out_of_memory() without __GFP_FS, so the
> >>>> system enters a hang-like state when running stress-ng.
> >>>>
> >>>> To fix it, we could trigger a kthread to call page_alloc() with
> >>>> __GFP_FS before returning from out_of_memory() when it was entered
> >>>> without __GFP_FS.
> >>> I do not think this is an appropriate way to deal with this issue.
> >>> Does it even make sense to trigger the OOM killer for something like
> >>> readahead? Would it be more mindful to fail the allocation instead?
> >>> That being said, should allocations from squashfs_readahead use
> >>> __GFP_RETRY_MAYFAIL instead?
> >>
> >> Thanks for your comment. This issue can hardly be reproduced on an
> >> ext4 filesystem, because ext4->readahead() does not call
> >> alloc_page(). If ext4->readahead() is changed as below, the issue
> >> becomes easy to reproduce on ext4 as well (repeatedly run:
> >> $stress-ng --bigheap ${num_of_cpu_threads} --sequential 0
> >> --timeout 30s --skip-silent --verbose)
> >>
> >> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> >> index ffbbd9626bd8..8b9db0b9d0b8 100644
> >> --- a/fs/ext4/inode.c
> >> +++ b/fs/ext4/inode.c
> >> @@ -3114,12 +3114,18 @@ static int ext4_read_folio(struct file *file, struct folio *folio)
> >>
> >>  static void ext4_readahead(struct readahead_control *rac)
> >>  {
> >>  	struct inode *inode = rac->mapping->host;
> >> +	struct page *tmp_page;
> >>
> >>  	/* If the file has inline data, no need to do readahead. */
> >>  	if (ext4_has_inline_data(inode))
> >>  		return;
> >>
> >> +	tmp_page = alloc_page(GFP_KERNEL);
> >> +
> >>  	ext4_mpage_readpages(inode, rac, NULL);
> >> +
> >> +	if (tmp_page)
> >> +		__free_page(tmp_page);
> >>  }
> >>
> >>
> >> BTW, I applied my patch to linux-next and ran the OOM stress-ng
> >> test cases overnight; there was no hang, oops or crash, so it looks
> >> like there is no big problem with using a kthread to trigger the OOM
> >> killer in this case.
> >>
> >> And hi squashfs maintainer, I checked the filesystem code, and it
> >> looks like most filesystems do not call alloc_page() in their
> >> readahead(). Could you please help take a look at this issue? Thanks.
> >
> >
> > This will be because most filesystems don't need to do so. Squashfs is
> > a compressed filesystem with large blocks covering much more than one
> > page, and it decompresses these blocks in squashfs_readahead(). If
> > __readahead_batch() does not return the full set of pages covering the
> > Squashfs block, it allocates a temporary page for the decompressors to
> > decompress into to "fill in the hole".
> >
> > What can be done here as far as Squashfs is concerned ... I could
> > move the page allocation out of the readahead path (e.g. do it at
> > mount time).
>
> You could try this patch which does that.  Compile tested only.

The kmalloc_array() may call alloc_page() and trigger this problem too,
IIUC. Should it be pre-allocated as well?

>  fs/squashfs/page_actor.c     | 10 +---------
>  fs/squashfs/page_actor.h     |  1 -
>  fs/squashfs/squashfs_fs_sb.h |  1 +
>  fs/squashfs/super.c          | 10 ++++++++++
>  4 files changed, 12 insertions(+), 10 deletions(-)
>
> diff --git a/fs/squashfs/page_actor.c b/fs/squashfs/page_actor.c
> index 81af6c4ca115..6cce239eca66 100644
> --- a/fs/squashfs/page_actor.c
> +++ b/fs/squashfs/page_actor.c
> @@ -110,15 +110,7 @@ struct squashfs_page_actor *squashfs_page_actor_init_special(struct squashfs_sb_
>  	if (actor == NULL)
>  		return NULL;
>
> -	if (msblk->decompressor->alloc_buffer) {
> -		actor->tmp_buffer = kmalloc(PAGE_SIZE, GFP_KERNEL);
> -
> -		if (actor->tmp_buffer == NULL) {
> -			kfree(actor);
> -			return NULL;
> -		}
> -	} else
> -		actor->tmp_buffer = NULL;
> +	actor->tmp_buffer = msblk->actor_page;
>
>  	actor->length = length ? : pages * PAGE_SIZE;
>  	actor->page = page;
> diff --git a/fs/squashfs/page_actor.h b/fs/squashfs/page_actor.h
> index 97d4983559b1..df5e999afa42 100644
> --- a/fs/squashfs/page_actor.h
> +++ b/fs/squashfs/page_actor.h
> @@ -34,7 +34,6 @@ static inline struct page *squashfs_page_actor_free(struct squashfs_page_actor *
>  {
>  	struct page *last_page = actor->last_page;
>
> -	kfree(actor->tmp_buffer);
>  	kfree(actor);
>  	return last_page;
>  }
> diff --git a/fs/squashfs/squashfs_fs_sb.h b/fs/squashfs/squashfs_fs_sb.h
> index 72f6f4b37863..8feddc9e6cce 100644
> --- a/fs/squashfs/squashfs_fs_sb.h
> +++ b/fs/squashfs/squashfs_fs_sb.h
> @@ -47,6 +47,7 @@ struct squashfs_sb_info {
>  	struct squashfs_cache *block_cache;
>  	struct squashfs_cache *fragment_cache;
>  	struct squashfs_cache *read_page;
> +	void *actor_page;
>  	int next_meta_index;
>  	__le64 *id_table;
>  	__le64 *fragment_index;
> diff --git a/fs/squashfs/super.c b/fs/squashfs/super.c
> index e090fae48e68..674dc187d961 100644
> --- a/fs/squashfs/super.c
> +++ b/fs/squashfs/super.c
> @@ -329,6 +329,15 @@ static int squashfs_fill_super(struct super_block *sb, struct fs_context *fc)
>  		goto failed_mount;
>  	}
>
> +
> +	/* Allocate page for squashfs_readahead()/squashfs_read_folio() */
> +	if (msblk->decompressor->alloc_buffer) {
> +		msblk->actor_page = kmalloc(PAGE_SIZE, GFP_KERNEL);
> +
> +		if (msblk->actor_page == NULL)
> +			goto failed_mount;
> +	}
> +
>  	msblk->stream = squashfs_decompressor_setup(sb, flags);
>  	if (IS_ERR(msblk->stream)) {
>  		err = PTR_ERR(msblk->stream);
> @@ -454,6 +463,7 @@ static int squashfs_fill_super(struct super_block *sb, struct fs_context *fc)
>  	squashfs_cache_delete(msblk->block_cache);
>  	squashfs_cache_delete(msblk->fragment_cache);
>  	squashfs_cache_delete(msblk->read_page);
> +	kfree(msblk->actor_page);
>  	msblk->thread_ops->destroy(msblk);
>  	kfree(msblk->inode_lookup_table);
>  	kfree(msblk->fragment_index);
> --
> 2.35.1
>
>
> > Adding __GFP_RETRY_MAYFAIL so the alloc() can fail will mean Squashfs
> > returning I/O failures due to no memory.  That will cause a lot of
> > applications to crash in a low memory situation.  So a crash rather
> > than a hang.
> >
> > Phillip