Hello All, There seems to be a below race with current ext4_mb_group_discard_preallocations() function. This could be fairly easily reproduced on a system with only 1 group. But internally this was even reported with more than 1 group. We could reproduce this on both 64K and 4K blocksize filesystem. This is a RFC patch sent out for reviews, feedback and/or any comments. Below mail provide all the necessary details. Test setup ========== 1. It's a multithreaded test case where each thread is trying to create a file using open() -> ftruncate(). 2. Then we are doing mmap of this file for filesize bytes. 3. Then start writing sequentially byte by byte for full filesize. Details for creating it on loop device:- ====================================== 1. truncate -s 240M filefs (easier on a smaller filesystem) 2. losetup /dev/loop0 filefs 3. mkfs.ext4 -F -b 65536 /dev/loop0 4. mkdir mnt 5. mount -t ext4 /dev/loop0 mnt/ 6. cd mnt/ 7. Start running the test case mentioned above in above "Test setup". (for our test we were keeping no. of threads to ~70 and filling the available filesystem space (df -h) to around ~80%/70%. Based on that each filesize is calculated). 8. cd .. (once the test finishes) 9. umount mnt 10. Go to step 3. Test (test-ext4-mballoc.c) file and script which does the unmount/mount and run the ./a.out is mentioned at [1], [2]. Analysis:- ========== It seems below race could be occurring P1 P2 ext4_mb_new_blocks() | | ext4_mb_new_blocks() | ext4_mb_group_discard_preallocations() | | ext4_mb_group_discard_preallocations() if (list_empty(&grp->bb_prealloc_list) return 0; | | | ext4_lock_group() | list_for_each_entry_safe() { | <..> | list_del(&pa->pa_group_list); | list_add(&pa->u.pa_tmp_list, &list) | } | | processing-local-list() if(list_empty(&grp->bb_prealloc_list) | return 0 <...> ext4_unlock_group() What we see here is that, there are multiple threads which are trying to allocate. But since there is not enough space, they try to discard the group preallocations. (will be easy to understand if we only consider group 0, though it could be reproduced with multiple groups as well). Now while more than 1 thread tries to free up the group preallocations, there could be 1 thread (P2) which sees that the bb_prealloc_list is already empty and will assume that there is nothing to free from here. Hence return 0. Now consider this happens with thread P2 for all other groups as well (where some other thread came in early and freed up the group preallocations). At that point, P2 sees that the total freed blocks returned by ext4_mb_discard_preallocations() back to ext4_mb_new_blocks() is 0 and hence it does not retry the allocation, instead it returns -ENOSPC error. This causes SIGBUS to the application who was doing an mmap write. Once the application crashes we could still see that the filesystem available space is more than ~70-80% (worst case scenario). So ideally P2 should have waited for P1 to get over and should have checked how much P1 could free up. Solution (based on the understanding of the mballoc code) ========================================================= We think that it is best to check if there is anything to be freed within ext4_group_lock(). i.e. to check if the bb_prealloc_list is empty. This patch attempts to fix this race by checking if nothing could be collected in the local list. This could mean that someone else might have freed all of this group PAs for us. So simply return group->bb_free which should also give us an upper bound on the total available space for allocation in this group. We need not worry about the fast path of whether the list is empty without taking ext4_group_lock(), since we are anyway in the slow path where the ext4_mb_regular_allocator() failed and hence we are now desperately trying to discard all the group PAs to free up some space for allocation. Please correct if any of below understanding is incorrect:- ========================================================== 1. grp->bb_free is the available number of free blocks in that group for allocation by anyone. 2. If grp->bb_free is non-zero and we call ext4_mb_regular_allocator(ac), then it will always return ac->ac_status == AC_STATUS_FOUND (and it could even allocate and return less than the requested no. of blocks). 3. There shouldn't be any infinte loop in ext4_mb_new_blocks() after we return grp->bb_free in this patch. (i.e. between ext4_mb_regular_allocator() and ext4_mb_discard_preallocations()) It could only happen if ext4_mb_regular_allocator cannot make any forward progress even if grp->bb_free is non-zero. But IIUC, that won't happen. Please correct here. Tests run:- ========== For now I have only done unit testing with the shared test code [1] [2]. Wanted to post this RFC for review comments/discussion. Resources: ========== [1] - https://raw.githubusercontent.com/riteshharjani/LinuxStudy/master/tools/test-ext4-mballoc.c [2] - https://raw.githubusercontent.com/riteshharjani/LinuxStudy/master/tools/test_mballoc.sh Ritesh Harjani (1): ext4: Fix race in ext4_mb_discard_group_preallocations() fs/ext4/mballoc.c | 31 ++++++++++++++++++++++++------- 1 file changed, 24 insertions(+), 7 deletions(-) -- 2.21.0