Re: [PATCH] mm: save/restore current->journal_info in handle_mm_fault

"Yan, Zheng" <ukernel@xxxxxxxxx> · Thu, 14 Dec 2017 22:30:26 +0800

On Thu, Dec 14, 2017 at 9:43 PM, Jan Kara <jack@xxxxxxx> wrote:
> On Thu 14-12-17 18:55:27, Yan, Zheng wrote:
>> We recently got an Oops report:
>>
>> BUG: unable to handle kernel NULL pointer dereference at (null)
>> IP: jbd2__journal_start+0x38/0x1a2
>> [...]
>> Call Trace:
>>   ext4_page_mkwrite+0x307/0x52b
>>   _ext4_get_block+0xd8/0xd8
>>   do_page_mkwrite+0x6e/0xd8
>>   handle_mm_fault+0x686/0xf9b
>>   mntput_no_expire+0x1f/0x21e
>>   __do_page_fault+0x21d/0x465
>>   dput+0x4a/0x2f7
>>   page_fault+0x22/0x30
>>   copy_user_generic_string+0x2c/0x40
>>   copy_page_to_iter+0x8c/0x2b8
>>   generic_file_read_iter+0x26e/0x845
>>   timerqueue_del+0x31/0x90
>>   ceph_read_iter+0x697/0xa33 [ceph]
>>   hrtimer_cancel+0x23/0x41
>>   futex_wait+0x1c8/0x24d
>>   get_futex_key+0x32c/0x39a
>>   __vfs_read+0xe0/0x130
>>   vfs_read.part.1+0x6c/0x123
>>   handle_mm_fault+0x831/0xf9b
>>   __fget+0x7e/0xbf
>>   SyS_read+0x4d/0xb5
>>
>> ceph_read_iter() uses current->journal_info to pass context info to
>> ceph_readpages(). Because ceph_readpages() needs to know if its caller
>> has already gotten capability of using page cache (distinguish read
>> from readahead/fadvise). ceph_read_iter() set current->journal_info,
>> then calls generic_file_read_iter().
>>
>> In above Oops, page fault happened when copying data to userspace.
>> Page fault handler called ext4_page_mkwrite(). Ext4 code read
>> current->journal_info and assumed it is journal handle.
>>
>> I checked other filesystems, btrfs probably suffers similar problem
>> for its readpage. (page fault happens when write() copies data from
>> userspace memory and the memory is mapped to a file in btrfs.
>> verify_parent_transid() can be called during readpage)
>>
>> Cc: stable@xxxxxxxxxxxxxxx
>> Signed-off-by: "Yan, Zheng" <zyan@xxxxxxxxxx>
>
> I agree with the analysis but the patch is too ugly too live. Ceph just
> should not be abusing current->journal_info for passing information between
> two random functions or when it does a hackery like this, it should just
> make sure the pieces hold together. Poluting generic code to accommodate
> this hack in Ceph is not acceptable. Also bear in mind there are likely
> other code paths (e.g. memory reclaim) which could recurse into another
> filesystem confusing it with non-NULL current->journal_info in the same
> way.

But ...

some filesystem set journal_info in its write_begin(), then clear it
in write_end(). If buffer for write is mapped to another filesystem,
current->journal can leak to the later filesystem's page_readpage().
The later filesystem may read current->journal and treat it as its own
journal handle.  Besides, most filesystem's vm fault handle is
filemap_fault(), filemap also may tigger memory reclaim.

>
> In this particular case I'm not sure why does ceph pass 'filp' into
> readpage() / readpages() handler when it already gets that pointer as part
> of arguments...

It actually a flag which tells ceph_readpages() if its caller is
ceph_read_iter or readahead/fadvise/madvise. because when there are
multiple clients read/write a file a the same time, page cache should
be disabled.

Regards
Yan, Zheng

>
>                                                                 Honza
>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index a728bed16c20..db2a50233c49 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -4044,6 +4044,7 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>>               unsigned int flags)
>>  {
>>       int ret;
>> +     void *old_journal_info;
>>
>>       __set_current_state(TASK_RUNNING);
>>
>> @@ -4065,11 +4066,24 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>>       if (flags & FAULT_FLAG_USER)
>>               mem_cgroup_oom_enable();
>>
>> +     /*
>> +      * Fault can happen when filesystem A's read_iter()/write_iter()
>> +      * copies data to/from userspace. Filesystem A may have set
>> +      * current->journal_info. If the userspace memory is MAP_SHARED
>> +      * mapped to a file in filesystem B, we later may call filesystem
>> +      * B's vm operation. Filesystem B may also want to read/set
>> +      * current->journal_info.
>> +      */
>> +     old_journal_info = current->journal_info;
>> +     current->journal_info = NULL;
>> +
>>       if (unlikely(is_vm_hugetlb_page(vma)))
>>               ret = hugetlb_fault(vma->vm_mm, vma, address, flags);
>>       else
>>               ret = __handle_mm_fault(vma, address, flags);
>>
>> +     current->journal_info = old_journal_info;
>> +
>>       if (flags & FAULT_FLAG_USER) {
>>               mem_cgroup_oom_disable();
>>               /*
>> --
>> 2.13.6
>>
> --
> Jan Kara <jack@xxxxxxxx>
> SUSE Labs, CR
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html