Dear
All:
Sorry
for disturbing all of you. Recently I was traped by a memory issue, and finally
I find it may be sync problems of mprotect sys call.
Background
of this problem:
Our
kernel version is 3.14.29, When playing Netflix and switch with other apps on
Android platform for stress test, after a whole day test, Netflix become no
response. Using top command we find it took 25% CPU usage on a 4x cortex-a53 cpu
platform. Which means that one cpu was 100% occupied by netfilx. We have tried
to debug it and found over 150000 page fault happen in binder thread
in one seconds. And fault addresses were always same. Finally we found
that attribute of PTE for this address was not match property of it's VMA
flags:
Log:
[00:11:40][ 8025.477715@1] f176f000-f1773000 r-xp 00000000 00:00 0
[00:11:40][ 8025.478935@1] (1)[8223:Binder:269_3]pgd = ffffffc034007000
[00:11:40][ 8025.480320@1] [f17701bc] pgd:ffffffc034007018, *pgd=000000002dbcd003, pmd:ffffffc02dbcdc58, *pmd=000000003408b003, pte:ffffffc03408bb80, *pte=006000001e270fd3 Property of
this vma was R+X, but PTE for fault address did not have execuate
permission, because BIT54(PTE_UXN ) was set. So MMU gave permmsion fault for
this address but fault handler of arm64 default code did not check this problem,
it found that this address had a non-zero pte and other flags seems ok, so
handler did nothing and return to user space to keep execute on this page. And
that caused page fault happen again and again, with 100% CPU.
After
many times of debug we got a log which show almost full steps of this
scenario(debug patch see "60fead12.diff.zip" in attach):
1. Bad pte was at address ffffffc034155eb8. First time set pte in this address was
here(see around line 10271 in attached log file),
from an empty value 0 to a write pte for virtual address e91d7000, at this time vma range was e91d7000-e91d8000, call stack show it
was a normal page fault.
[18297.354385@2] !(2)[6576:Binder:272_5]set pte at ffffffc034155eb8, e91d7000, 0->2e00000000acf53, ip:handle_mm_fault, mm:ffffffc034a3ea00
[18297.357234@2] !(2)[6576:Binder:272_5]vma:ffffffc0074b8d48 f:80100072 (null),[e91d7000-e91d8000] p:0, fc:1, ip:mmap_region
[18297.360162@2] !(2)[6576:Binder:272_5]Call trace:
[18297.361278@2] !(2)[6576:Binder:272_5][<ffffffc0010894a4>] dump_backtrace+0x0/0x174
[18297.363105@2] !(2)[6576:Binder:272_5][<ffffffc001089634>] show_stack+0x1c/0x28
[18297.364898@2] !(2)[6576:Binder:272_5][<ffffffc001096630>] set_pte+0xe8/0x124
[18297.366590@2] !(2)[6576:Binder:272_5][<ffffffc00119889c>] handle_mm_fault+0x74c/0xc74
[18297.368534@2] !(2)[6576:Binder:272_5][<ffffffc001096b58>] do_page_fault+0x220/0x40c
[18297.370370@2] !(2)[6576:Binder:272_5][<ffffffc0010812cc>] do_mem_abort+0x4c/0xb0
2. Then mprotect syscall to change it from write page to R+X page, pte value also change to execute attribute, with pte bit 54 PTE_UXN cleared, during mprotect, this vma range [e91d7000 - e91d8000] was merged with range [e91d8000 - e91db000] to [e91d7000 - e91db000], vma is R+X property:
[18297.398925@2] (2)[6576:Binder:272_5]mprotect_fixup, prop range[e91d7000-e91d8000], for vma:ffffffc0074b8d48,[e91d7000-e91db000], flag:80100072->80100075, fc:1, ip:mmap_region, mm:ffffffc034a3ea00
[18297.403127@2] !(2)[6576:Binder:272_5]set pte at ffffffc034155eb8, e91d7000, 2e00000000acf53->0, ip:change_protection_range, mm:ffffffc034a3ea00
[18297.406262@2] !(2)[6576:Binder:272_5]vma:ffffffc0074b8d48 f:80100075 (null),[e91d7000-e91db000] p:1, fc:1, ip:mmap_region
[18297.409140@2] !(2)[6576:Binder:272_5]set pte at ffffffc034155eb8, e91d7000, 0->a00000000acfd3, ip:change_protection_range, mm:ffffffc034a3ea00
[18297.412266@2] !(2)[6576:Binder:272_5]vma:ffffffc0074b8d48 f:80100075 (null),[e91d7000-e91db000] p:1, fc:1, ip:mmap_region
[18297.415141@2] !(2)[6576:Binder:272_5]change_pte_range 79, vma:ffffffc0074b8d48, modify e91d7000, pte:ffffffc034155eb8, 2e00000000acf53->a00000000acf53
3. Then another page fault happened for virtual address e91d5000, it's vma is ffffffc006c05190, range [e91d5000-e91d6000], only 4KB, with write property:
[18297.815796@1] !(1)[6576:Binder:272_5]set pte at ffffffc034155ea8, e91d5000, 0->2e000001c9f8f53, ip:handle_mm_fault, mm:ffffffc034a3ea00
[18297.818694@1] !(1)[6576:Binder:272_5]vma:ffffffc006c05190 f:80100072 (null),[e91d5000-e91d6000] p:0, fc:1, ip:mmap_region
[18297.821521@1] !(1)[6576:Binder:272_5]Call trace:
[18297.822635@1] !(1)[6576:Binder:272_5][<ffffffc0010894a4>] dump_backtrace+0x0/0x174
[18297.824474@1] !(1)[6576:Binder:272_5][<ffffffc001089634>] show_stack+0x1c/0x28
[18297.826210@1] !(1)[6576:Binder:272_5][<ffffffc001096630>] set_pte+0xe8/0x124
[18297.827955@1] !(1)[6576:Binder:272_5][<ffffffc00119889c>] handle_mm_fault+0x74c/0xc74
[18297.829846@1] !(1)[6576:Binder:272_5][<ffffffc001096b58>] do_page_fault+0x220/0x40c
[18297.831774@1] !(1)[6576:Binder:272_5][<ffffffc0010812cc>] do_mem_abort+0x4c/0xb0
4. Then pte at address ffffffc034155eb8 was modified by migration process. Migrate function first unmap vma with this pte, in function try_to_unmap, pte value was set from nomal R+X value to an swap entry. Ummaped vma was ffffffc006c05000, range [e91d6000-e91db000] with R+X property. Virtual address range for this migrate process is e91d7000.
Problem is in last step of fucntion remove_migration_pte, it find vma ffffffc006c05190, range [e91d5000-e91db000](24KB) with write-only proptery for virtual adress e91d7000. Which was not same as in try_to_unmap function. In step 3 we know that vma ffffffc006c05190 is only have 4KB range [e91d5000-e91d6000], with write-only property. But at here it grown up to e91db000. Before remove_migration_pte function vma ffffffc006c05000 holds vma range [e91d6000-e91db000] with R+X property.
After function remove_migration_pte, pte value at address ffffffc034155eb8 was changed to 6000000f9c6bd3, bit 54 PTE_UXN was set and it can't execute.
[18297.860182@3] !(3)[6396:ce.core.androi]set pte at ffffffc034155eb8, e91d7000, a00000000acfd3->0, ip:ptep_clear_flush, mm:ffffffc034a3ea00
[18297.863204@3] !(3)[6396:ce.core.androi]vma:ffffffc006c05000 f:80100075 (null),[e91d6000-e91db000] p:1, fc:1, ip:mmap_region
[18297.866143@3] !(3)[6396:ce.core.androi]set pte at ffffffc034155eb8, e91d7000, 0->158f0, ip:try_to_unmap_one, mm:ffffffc034a3ea00
[18297.868963@3] !(3)[6396:ce.core.androi]vma:ffffffc006c05000 f:80100075 (null),[e91d6000-e91db000] p:1, fc:1, ip:mmap_region
~~~~ These two steps were ok, it find right vma and right range, pte attribute were ok.
[18297.874197@3] !(3)[6396:ce.core.androi]set pte at ffffffc034155eb8, e91d7000, 158f0->6000000f9c6bd3, ip:remove_migration_pte, mm:ffffffc034a3ea00
[18297.876036@3] !(3)[6396:ce.core.androi]vma:ffffffc006c05190 f:80100072 (null),[e91d5000-e91db000] p:0, fc:1, ip:mmap_region
[18297.878995@3] !(3)[6396:ce.core.androi]Call trace:
[18297.880134@3] !(3)[6396:ce.core.androi][<ffffffc0010894a4>] dump_backtrace+0x0/0x174
[18297.882062@3] !(3)[6396:ce.core.androi][<ffffffc001089634>] show_stack+0x1c/0x28
[18297.883829@3] !(3)[6396:ce.core.androi][<ffffffc001096630>] set_pte+0xe8/0x124
[18297.885632@3] !(3)[6396:ce.core.androi][<ffffffc0011b966c>] remove_migration_pte+0x198/0x274
[18297.887682@3] !(3)[6396:ce.core.androi][<ffffffc0011a2c68>] rmap_walk+0x1fc/0x2bc
[18297.889483@3] !(3)[6396:ce.core.androi][<ffffffc0011b94c4>] remove_migration_ptes+0x48/0x58
[18297.891557@3] !(3)[6396:ce.core.androi][<ffffffc0011ba774>] move_to_new_page+0x1a4/0x27c
[18297.893498@3] !(3)[6396:ce.core.androi][<ffffffc0011baf78>] migrate_pages+0x574/0x648
[18297.895442@3] !(3)[6396:ce.core.androi][<ffffffc0011910e0>] compact_zone+0x380/0x4c4
[18297.897300@3] !(3)[6396:ce.core.androi][<ffffffc0011912f0>] compact_zone_order+0xcc/0x110
[18297.899332@3] !(3)[6396:ce.core.androi][<ffffffc00119160c>] try_to_compact_pages+0x108/0x1b8
[18297.901390@3] !(3)[6396:ce.core.androi][<ffffffc001174ab4>] __alloc_pages_direct_compact+0x8c/0x260
[18297.903571@3] !(3)[6396:ce.core.androi][<ffffffc001175298>] __alloc_pages_nodemask+0x610/0x978
[18297.905719@3] !(3)[6396:ce.core.androi][<ffffffc00109db20>] copy_process.part.49+0x12c/0x14a8
[18297.907795@3] !(3)[6396:ce.core.androi][<ffffffc00109f05c>] do_fork+0xec/0x3b0
[18297.909520@3] !(3)[6396:ce.core.androi][<ffffffc00109f40c>] SyS_clone+0x3c/0x4c
5.Just after this call stack an
mprotect sys call continue execute to change vma ffffffc006c05190,
range[e91d5000-e91d6000] from write only to R+X:
[18297.914242@1] (1)[6576:Binder:272_5]mprotect_fixup, prop range[e91d5000-e91d6000], for vma:ffffffc006c05190,[e91d5000-e91db000], flag:80100072->80100075, fc:1, ip:mmap_region, mm:ffffffc034a3ea00
[18297.917205@1] !(1)[6576:Binder:272_5]set pte at ffffffc034155ea8, e91d5000, 2e000001c9f8f53->0, ip:change_protection_range, mm:ffffffc034a3ea00
[18297.920369@1] !(1)[6576:Binder:272_5]vma:ffffffc006c05190 f:80100075 (null),[e91d5000-e91db000] p:1, fc:1, ip:mmap_region
[18297.923277@1] !(1)[6576:Binder:272_5]set pte at ffffffc034155ea8, e91d5000, 0->a000001c9f8fd3, ip:change_protection_range, mm:ffffffc034a3ea00
[18297.926367@1] !(1)[6576:Binder:272_5]vma:ffffffc006c05190 f:80100075 (null),[e91d5000-e91db000] p:1, fc:1, ip:mmap_region
[18297.929244@1] !(1)[6576:Binder:272_5]change_pte_range 79, vma:ffffffc006c05190, modify e91d5000, pte:ffffffc034155ea8, 2e000001c9f8f53->a000001c9f8f53
[18297.932559@1] !(1)[6576:Binder:272_5]change_pte_range,[e91d6000-e91d6000] vma ffffffc006c05190, changed page:1, ptes:1, empty:0, fc:1
[18297.935498@1] (1)[6576:Binder:272_5]SYSC_mprotect, vma:ffffffc006c05190[e91d5000-e91db000], tmp:e91d6000, start:e91d5000, prot:5, len:4096, prev:ffffffc006c05190 6. Finally application jump back to run at virtual adddress e91d70bc, but it's pte value is wrong, so page fault happen again and again.
[18298.889388@1] !(1)[6576:Binder:272_5]set pte at ffffffc034155eb8, e91d7000, 6000000f9c6bd3->6000000f9c6fd3, ip:ptep_set_access_flags, mm:ffffffc034a3ea00
[18298.892664@1] !(1)[6576:Binder:272_5]vma:ffffffc006c05bb8 f:80100075 (null),[e91d7000-e91db000] p:1, fc:2, ip:mmap_region
[18298.895511@1] !(1)[6576:Binder:272_5]Call trace:
[18298.896614@1] !(1)[6576:Binder:272_5][<ffffffc0010894a4>] dump_backtrace+0x0/0x174
[18298.898505@1] !(1)[6576:Binder:272_5][<ffffffc001089634>] show_stack+0x1c/0x28
[18298.900224@1] !(1)[6576:Binder:272_5][<ffffffc001096630>] set_pte+0xe8/0x124
[18298.901978@1] !(1)[6576:Binder:272_5][<ffffffc0011a6fe4>] ptep_set_access_flags+0x5c/0xd0
[18298.903978@1] !(1)[6576:Binder:272_5][<ffffffc001198578>] handle_mm_fault+0x428/0xc74
[18298.905890@1] !(1)[6576:Binder:272_5][<ffffffc001096b58>] do_page_fault+0x220/0x40c
[18298.907791@1] !(1)[6576:Binder:272_5][<ffffffc0010812cc>] do_mem_abort+0x4c/0xb0
[18298.909529@1] !(1)[6576:Binder:272_5]Exception stack(0xffffffc003797e30 to 0xffffffc003797f50)
[18298.911667@1] !(1)[6576:Binder:272_5]7e20: 00000000 00000000 00000000 00000000
[18298.914231@1] !(1)[6576:Binder:272_5]7e40: ffffffff ffffffff e91d70bc 00000000 34a3ea60 ffffffc0 01103fcc ffffffc0
[18298.916692@1] !(1)[6576:Binder:272_5]7e60: 03797ea0 ffffffc0 0119eb04 ffffffc0 e91d6000 00000000 00000000 00000000
[18298.919279@1] !(1)[6576:Binder:272_5]7e80: ffffffff ffffffff 0119eaf8 ffffffc0 0108e000 ffffffc0 000001f0 00000000
[18298.921796@1] !(1)[6576:Binder:272_5]7ea0: 00000000 00000000 0108426c ffffffc0 00000000 00000000 e91d5000 00000000
[18298.924328@1] !(1)[6576:Binder:272_5]7ec0: 00000000 00000000 000001f0 00000000 53266ef5 00000000 53266ef5 00000000
[18298.926805@1] !(1)[6576:Binder:272_5]7ee0: e73ff260 00000000 e7e9f525 00000000 e82968f8 00000000 e7e9f371 00000000
[18298.929396@1] !(1)[6576:Binder:272_5]7f00: 00000320 00000000 00000003 00000000 e73ff27c 00000000 00000000 00000000
[18298.931909@1] !(1)[6576:Binder:272_5]7f20: e73ff11c 00000000 e7effde0 00000000 e91d6000 00000000 e73ff108 00000000
[18298.934441@1] !(1)[6576:Binder:272_5]7f40: e91d70bd 00000000 00000000 00000000
[18298.936280@1] (1)[6576:Binder:272_5]check_pte 3832, pte:ffffffc034155eb8, 6000000f9c6fd3, 6000000f9c6fd3 vma:ffffffc006c05bb8, e91d7000 e91db000, addr:e91d7000, fc:2
[18298.939893@1] (1)[6576:Binder:272_5]check_pte 3832, ip:mmap_region, pte attr missmatch, back up pte:
[18298.942126@1] (1)[6576:Binder:272_5]ffffffc034155e98: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[18298.944794@1] (1)[6576:Binder:272_5]ffffffc034155eb8: 006000000f9c6bd3 00a0000031445fd3 00200000099edbd3 0060000023612bd3
Following picture show situation around step 4: vma@ffffffc006c05190 has range
e91d5000-e91d6000, vma@ffffffc006c05000 has range e91d6000 -
e91db000, they were neighbour, but property not same.
At this time mprtoect syscall change vma@ffffffc006c05190 from write only to
R+X, so it can be merged with vma@ffffffc006c05000, but during
merge process a migration of virtual address e91d7000 happended,
and at last step of migration process, it find vma@ffffffc006c05190 with property W(NOT
R+X) and set this property for it's PTE. That caused the problem.
I reviewed code on 3.14.29 kernel of mprotect function and migrate functin,
found that there was a race condtion during vma merge:
mm/mprotect.c
mprotect()
down_write(¤t->mm->mmap_sem);
...
mprotect_fixup()
254 *pprev = vma_merge(mm, *pprev, start, end, newflags,
255 vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
256 vma_get_anon_name(vma));
257 if (*pprev) {
258 vma = *pprev;
259 goto success;
260 }
261
262 *pprev = vma;
263
264 if (start != vma->vm_start) {
265 error = split_vma(mm, vma, start, 1);
266 if (error)
267 goto fail;
268 }
269
270 if (end != vma->vm_end) {
271 error = split_vma(mm, vma, end, 0);
272 if (error)
273 goto fail;
274 }
275
276 success:
277 /*
278 * vm_flags and vm_page_prot are protected by the mmap_sem
279 * held in write mode.
280 */
281 vma->vm_flags = newflags;
282 vma->vm_page_prot = pgprot_modify(vma->vm_page_prot,
283 vm_get_page_prot(newflags));
284
285 if (vma_wants_writenotify(vma)) {
286 vma->vm_page_prot = vm_get_page_prot(newflags & ~VM_SHARED);
287 dirty_accountable = 1;
288 }
289
290 change_protection(vma, start, end, vma->vm_page_prot,
291 dirty_accountable, 0); ...
up_write(¤t->mm->mmap_sem);
mm/migrate.c
move_to_new_page()
...
remove_migration_ptes()
rmap_walk_anon()
rmap_walk_anon_lock()
down_read(&anon_vma->root->rwsem);
remove_migration_pte()
223 get_page(new);
224 pte = pte_mkold(mk_pte(new, vma->vm_page_prot));
225 if (pte_swp_soft_dirty(*ptep))
226 pte = pte_mksoft_dirty(pte);
227
228 /* Recheck VMA as permissions can change since migration started */
229 if (is_write_migration_entry(entry))
230 pte = maybe_mkwrite(pte, vma);
231
232 #ifdef CONFIG_HUGETLB_PAGE
233 if (PageHuge(new)) {
234 pte = pte_mkhuge(pte);
235 pte = arch_make_huge_pte(pte, vma, new, 0);
236 }
237 #endif
238 flush_dcache_page(new);
239 set_pte_at(mm, addr, ptep, pte);
240 anon_vma_unlock_read
up_read(&anon_vma->root->rwsem);
...
During
mprotect syscall, it hold rw-lock of mm->mmap_sem, then try to merge vma with
nearby same vma. But if it just merged vma at line 257 in mm/mprotect.c and
interrupted by other tasks, at this time this vma has a new vma rage but
vm_page_prot did not changed(still keep old value). If at this time other
task triggered a migration process and just migrate page of this new vma, in
finction remove_migration_pte it hold anon_vma's root rwsem lock and get this
vma's old vm_page_prot and set it to it's PTE. Then
mprotect resume to running and this vma's vm_page_prot changed at line 282 in
mm/mprotect.c. It's too late for migration process. And step change_protection
only changed ptes in range with input parameter, not full range of this VMA. So
the wrong pte value set by migration step would not be corrected. Finally caused
page fault happen again and again on ARM64 platform. Problem may be migration
and mprotect hold different rwsem, and can't prevent this race condition.
I have
checked with latest linux stable code, and found migration and mprotect
operation almost same.
Please
help to check this problem or help me to find if something is wrong with my
conclude. |
<<attachment: 60fead12.diff.zip>>
Attachment:
EVT35-813-5163-boot4-0518_.log
Description: Binary data