Re: [PATCH] mm: save/restore current->journal_info in handle_mm_fault

Andreas Dilger <adilger@xxxxxxxxx> · Thu, 14 Dec 2017 13:48:24 -0700

[remove stable@ as this is not really a stable patch]

On Dec 14, 2017, at 7:30 AM, Yan, Zheng <ukernel@xxxxxxxxx> wrote:
> 
> On Thu, Dec 14, 2017 at 9:43 PM, Jan Kara <jack@xxxxxxx> wrote:
>> On Thu 14-12-17 18:55:27, Yan, Zheng wrote:
>>> We recently got an Oops report:
>>> 
>>> BUG: unable to handle kernel NULL pointer dereference at (null)
>>> IP: jbd2__journal_start+0x38/0x1a2
>>> [...]
>>> Call Trace:
>>>  ext4_page_mkwrite+0x307/0x52b
>>>  _ext4_get_block+0xd8/0xd8
>>>  do_page_mkwrite+0x6e/0xd8
>>>  handle_mm_fault+0x686/0xf9b
>>>  mntput_no_expire+0x1f/0x21e
>>>  __do_page_fault+0x21d/0x465
>>>  dput+0x4a/0x2f7
>>>  page_fault+0x22/0x30
>>>  copy_user_generic_string+0x2c/0x40
>>>  copy_page_to_iter+0x8c/0x2b8
>>>  generic_file_read_iter+0x26e/0x845
>>>  timerqueue_del+0x31/0x90
>>>  ceph_read_iter+0x697/0xa33 [ceph]
>>>  hrtimer_cancel+0x23/0x41
>>>  futex_wait+0x1c8/0x24d
>>>  get_futex_key+0x32c/0x39a
>>>  __vfs_read+0xe0/0x130
>>>  vfs_read.part.1+0x6c/0x123
>>>  handle_mm_fault+0x831/0xf9b
>>>  __fget+0x7e/0xbf
>>>  SyS_read+0x4d/0xb5
>>> 
>>> ceph_read_iter() uses current->journal_info to pass context info to
>>> ceph_readpages(). Because ceph_readpages() needs to know if its caller
>>> has already gotten capability of using page cache (distinguish read
>>> from readahead/fadvise). ceph_read_iter() set current->journal_info,
>>> then calls generic_file_read_iter().
>>> 
>>> In above Oops, page fault happened when copying data to userspace.
>>> Page fault handler called ext4_page_mkwrite(). Ext4 code read
>>> current->journal_info and assumed it is journal handle.
>>> 
>>> I checked other filesystems, btrfs probably suffers similar problem
>>> for its readpage. (page fault happens when write() copies data from
>>> userspace memory and the memory is mapped to a file in btrfs.
>>> verify_parent_transid() can be called during readpage)
>>> 
>>> Cc: stable@xxxxxxxxxxxxxxx
>>> Signed-off-by: "Yan, Zheng" <zyan@xxxxxxxxxx>
>> 
>> I agree with the analysis but the patch is too ugly too live. Ceph just
>> should not be abusing current->journal_info for passing information between
>> two random functions or when it does a hackery like this, it should just
>> make sure the pieces hold together. Poluting generic code to accommodate
>> this hack in Ceph is not acceptable. Also bear in mind there are likely
>> other code paths (e.g. memory reclaim) which could recurse into another
>> filesystem confusing it with non-NULL current->journal_info in the same
>> way.
> 
> But ...
> 
> some filesystem set journal_info in its write_begin(), then clear it
> in write_end(). If buffer for write is mapped to another filesystem,
> current->journal can leak to the later filesystem's page_readpage().
> The later filesystem may read current->journal and treat it as its own
> journal handle.  Besides, most filesystem's vm fault handle is
> filemap_fault(), filemap also may tigger memory reclaim.

Shouldn't the memory reclaim be prevented from recursing into the other
filesystem by use of GFP_NOFS, or the new memalloc_nofs annotation?

I don't think that ext4 is ever using current->journal on any read paths,
only in case of writes.

>> In this particular case I'm not sure why does ceph pass 'filp' into
>> readpage() / readpages() handler when it already gets that pointer as part
>> of arguments...
> 
> It actually a flag which tells ceph_readpages() if its caller is
> ceph_read_iter or readahead/fadvise/madvise. because when there are
> multiple clients read/write a file a the same time, page cache should
> be disabled.

I've wanted something similar for other reasons.  It would be better to
have a separate fs-specific pointer in the task struct to handle this
kind of information.  This can be used by the filesystem "upper half" to
communicate with the "lower half" (doing the writeout or other IO below
the VFS), and the "lower half" can use ->journal for handling the writeout.

However, some care would be needed to ensure that other processes accessing
this pointer would only do so if it is their own.  Something like
->fs_private_sb and ->fs_private_data would allow this sanely.  If the
->fs_private_sb != sb in the filesystem then ->fs_private_data is not valid
for this fs and cannot be used by the current filesystem code.  Alternately,
we could have a single ->fs_private pointer to reduce impact on task_struct
so long as all filesystems used the first field of the structure to point to
"sb", probably with a library helper to ensure this was done consistently:

	data = current_fs_private_get(sb);
        current_fs_private_set(sb, data);
	data = current_fs_private_alloc(sb, size, gfp);

or whatever.

> Regards
> Yan, Zheng
> 
>> 
>>                                                                Honza
>> 
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index a728bed16c20..db2a50233c49 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -4044,6 +4044,7 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>>>              unsigned int flags)
>>> {
>>>      int ret;
>>> +     void *old_journal_info;
>>> 
>>>      __set_current_state(TASK_RUNNING);
>>> 
>>> @@ -4065,11 +4066,24 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>>>      if (flags & FAULT_FLAG_USER)
>>>              mem_cgroup_oom_enable();
>>> 
>>> +     /*
>>> +      * Fault can happen when filesystem A's read_iter()/write_iter()
>>> +      * copies data to/from userspace. Filesystem A may have set
>>> +      * current->journal_info. If the userspace memory is MAP_SHARED
>>> +      * mapped to a file in filesystem B, we later may call filesystem
>>> +      * B's vm operation. Filesystem B may also want to read/set
>>> +      * current->journal_info.
>>> +      */
>>> +     old_journal_info = current->journal_info;
>>> +     current->journal_info = NULL;
>>> +
>>>      if (unlikely(is_vm_hugetlb_page(vma)))
>>>              ret = hugetlb_fault(vma->vm_mm, vma, address, flags);
>>>      else
>>>              ret = __handle_mm_fault(vma, address, flags);
>>> 
>>> +     current->journal_info = old_journal_info;
>>> +
>>>      if (flags & FAULT_FLAG_USER) {
>>>              mem_cgroup_oom_disable();
>>>              /*
>>> --
>>> 2.13.6
>>> 
>> --
>> Jan Kara <jack@xxxxxxxx>
>> SUSE Labs, CR

Cheers, Andreas

Attachment:
signature.asc

Description: Message signed with OpenPGP