Re: [PATCH] ext4: handle NULL p_ext in ext4_ext_next_allocated_block()

Curt Wohlgemuth <curtw@xxxxxxxxxx> · Mon, 10 Oct 2011 08:28:23 -0700

Hi Lukas:

On Mon, Oct 10, 2011 at 12:19 AM, Lukas Czerner <lczerner@xxxxxxxxxx> wrote:
> On Sat, 8 Oct 2011, Curt Wohlgemuth wrote:
>
>> In ext4_ext_next_allocated_block(), the path[depth] might
>> have a p_ext that is NULL -- see ext4_ext_binsearch().  In
>> such a case, dereferencing it will crash the machine.
>>
>> This patch checks for p_ext == NULL in
>> ext4_ext_next_allocated_block() before dereferencinging it.
>>
>> Tested using a hand-crafted an inode with eh_entries == 0 in
>> an extent block, verified that running FIEMAP on it crashes
>> without this patch, works fine with it.
>
> Hi Curt,
>
> It seems to me that even that the patch fixes the NULL dereference, it
> is not a complete solution. Is it possible that in "normal" case p_ext
> would be NULL ? I think that this is only possible in extent split/add
> case (as noted in ext4_ext_binsearch()) which should be atomic to the
> other operations (locked i_mutex?).

Yes, unfortunately, it is possible in "normal" cases for p_ext to be NULL.

We've seen this problem during what appears to be a race between an
inode growth (or truncate?) and another task doing a FIEMAP ioctl.
The real problem is that FIEMAP handing in ext4 is just, well, buggy?

ext4_ext_walk_space() will get the i_data_sem, construct the path
array, then release the semaphore.  But then it does a bazillion
accesses on the extent/header/index pointers in the path array, with
no protection against truncate, growth, or any other changes.  As far
as I can tell, this is the only use of a path array retrieved from
ext4_ext_find_extent() that isn't completely covered by i_data_sem.

> However this seems like an inode corruption so we should probably be
> more verbose about it and print an appropriate EXT4_ERROR_INODE() or
> even better check for the corrupted tree in the ext4_ext_find_extent()
> (instead in ext4_ext_map_blocks()), however this will need to distinguish
> between the normal and the tree modification case.

What we've observed many times is a crash during a FIEMAP call to
ext4_ext_next_allocated_block(), which appears to me to be during a
race with another thread that's splitting the extent tree.  This
causes the machine to go down with the inode in a bad state.  But of
course, fsck won't detect and fix this, so when the machine comes back
up, and a FIEMAP call is done on this same inode -- without any other
threads -- it'll crash again.  Hence a nasty crash loop.

So you're right, in that this isn't the "real solution."  But devising
a safe, non-racy design for FIEMAP is not so simple, unless of course
you want to just hold the i_data_sem during the entire loop body of
ext4_ext_walk_space(), which would be pretty ugly.  Hence the
"band-aid" approach in my patch, which at least seems correct, if not
thorough.

Thanks,
Curt

>
> Thanks!
> -Lukas
>
>>
>> Signed-off-by: Curt Wohlgemuth <curtw@xxxxxxxxxx>
>> ---
>>  fs/ext4/extents.c |    4 +++-
>>  1 files changed, 3 insertions(+), 1 deletions(-)
>>
>> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
>> index 57cf568..063a5b8 100644
>> --- a/fs/ext4/extents.c
>> +++ b/fs/ext4/extents.c
>> @@ -1395,7 +1395,9 @@ ext4_ext_next_allocated_block(struct ext4_ext_path *path)
>>       while (depth >= 0) {
>>               if (depth == path->p_depth) {
>>                       /* leaf */
>> -                     if (path[depth].p_ext !=
>> +                     /* p_ext can be NULL */
>> +                     if (path[depth].p_ext &&
>> +                             path[depth].p_ext !=
>>                                       EXT_LAST_EXTENT(path[depth].p_hdr))
>>                         return le32_to_cpu(path[depth].p_ext[1].ee_block);
>>               } else {
>>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html