This document describes the concept of Dept. Signed-off-by: Byungchul Park <byungchul@xxxxxx> --- Documentation/dependency/dept.txt | 735 ++++++++++++++++++++++++++++++ 1 file changed, 735 insertions(+) create mode 100644 Documentation/dependency/dept.txt diff --git a/Documentation/dependency/dept.txt b/Documentation/dependency/dept.txt new file mode 100644 index 000000000000..5dd358b96734 --- /dev/null +++ b/Documentation/dependency/dept.txt @@ -0,0 +1,735 @@ +DEPT(DEPendency Tracker) +======================== + +Started by Byungchul Park <max.byungchul.park@xxxxxx> + +How lockdep works +----------------- + +Lockdep detects a deadlock by checking lock acquisition order. For +example, a graph to track acquisition order built by lockdep might look +like: + + A -> B - + \ + -> E + / + C -> D - + + where 'A -> B' means that acquisition A is prior to acquisition B + with A still held. + +Lockdep keeps adding each new acquisition order into the graph in +runtime. For example, 'E -> C' will be added when the two locks have +been acquired in the order, E and then C. The graph will look like: + + A -> B - + \ + -> E - + / \ + -> C -> D - \ + / / + \ / + ------------------ + + where 'A -> B' means that acquisition A is prior to acquisition B + with A still held. + +This graph contains a subgraph that demonstrates a loop like: + + -> E - + / \ + -> C -> D - \ + / / + \ / + ------------------ + + where 'A -> B' means that acquisition A is prior to acquisition B + with A still held. + +Lockdep reports it as a deadlock on detection of a loop and stops its +working. + +CONCLUSION + +Lockdep detects a deadlock by checking if a loop has been created after +adding a new acquisition order into the graph. + + +Limitation of lockdep +--------------------- + +Lockdep deals with a deadlock by typical lock e.g. spinlock and mutex, +that are supposed to be released within the acquisition context. However, +when it comes to a deadlock by folio lock that is not supposed to be +released within the acquisition context or other general synchronization +mechanisms, lockdep doesn't work. + +Can lockdep detect the following deadlock? + + context X context Y context Z + + mutex_lock A + folio_lock B + folio_lock B <- DEADLOCK + mutex_lock A <- DEADLOCK + folio_unlock B + folio_unlock B + mutex_unlock A + mutex_unlock A + +No. What about the following? + + context X context Y + + mutex_lock A + mutex_lock A <- DEADLOCK + wait_for_complete B <- DEADLOCK + complete B + mutex_unlock A + mutex_unlock A + +No. + +CONCLUSION + +Lockdep cannot detect a deadlock by folio lock or other general +synchronization mechanisms. + + +What leads a deadlock +--------------------- + +A deadlock occurs when one or multi contexts are waiting for events that +will never happen. For example: + + context X context Y context Z + + | | | + v | | + 1 wait for A v | + . 2 wait for C v + event C . 3 wait for B + event B . + event A + +Event C cannot be triggered because context X is stuck at 1, event B +cannot be triggered because context Y is stuck at 2, and event A cannot +be triggered because context Z is stuck at 3. All the contexts are stuck. +We call this *deadlock*. + +If an event occurrence is a prerequisite to reaching another event, we +call it *dependency*. In this example: + + Event A occurrence is a prerequisite to reaching event C. + Event C occurrence is a prerequisite to reaching event B. + Event B occurrence is a prerequisite to reaching event A. + +In terms of dependency: + + Event C depends on event A. + Event B depends on event C. + Event A depends on event B. + +Dependency graph reflecting this example will look like: + + -> C -> A -> B - + / \ + \ / + ---------------- + + where 'A -> B' means that event A depends on event B. + +A circular dependency exists. Such a circular dependency leads a +deadlock since no waiters can have desired events triggered. + +CONCLUSION + +A circular dependency of events leads a deadlock. + + +Introduce DEPT +-------------- + +DEPT(DEPendency Tracker) tracks wait and event instead of lock +acquisition order so as to recognize the following situation: + + context X context Y context Z + + | | | + v | | + wait for A v | + . wait for C v + event C . wait for B + event B . + event A + +and builds up a dependency graph in runtime that is similar to lockdep. +The graph might look like: + + -> C -> A -> B - + / \ + \ / + ---------------- + + where 'A -> B' means that event A depends on event B. + +DEPT keeps adding each new dependency into the graph in runtime. For +example, 'B -> D' will be added when event D occurrence is a +prerequisite to reaching event B like: + + | + v + wait for D + . + event B + +After the addition, the graph will look like: + + -> D + / + -> C -> A -> B - + / \ + \ / + ---------------- + + where 'A -> B' means that event A depends on event B. + +DEPT is going to report a deadlock on detection of a new loop. + +CONCLUSION + +DEPT works on wait and event so as to theoretically detect all the +potential deadlocks. + + +How DEPT works +-------------- + +Let's take a look how DEPT works with the 1st example in the section +'Limitation of lockdep'. + + context X context Y context Z + + mutex_lock A + folio_lock B + folio_lock B <- DEADLOCK + mutex_lock A <- DEADLOCK + folio_unlock B + folio_unlock B + mutex_unlock A + mutex_unlock A + +Adding comments to describe DEPT's view in terms of wait and event: + + context X context Y context Z + + mutex_lock A + /* wait for A */ + folio_lock B + /* wait for A */ + /* start event A context */ + + folio_lock B + /* wait for B */ <- DEADLOCK + /* start event B context */ + + mutex_lock A + /* wait for A */ <- DEADLOCK + /* start event A context */ + + folio_unlock B + /* event B */ + folio_unlock B + /* event B */ + + mutex_unlock A + /* event A */ + mutex_unlock A + /* event A */ + +Adding more supplementary comments to describe DEPT's view in detail: + + context X context Y context Z + + mutex_lock A + /* might wait for A */ + /* start to take into account event A's context */ + /* 1 */ + folio_lock B + /* might wait for B */ + /* start to take into account event B's context */ + /* 2 */ + + folio_lock B + /* might wait for B */ <- DEADLOCK + /* start to take into account event B's context */ + /* 3 */ + + mutex_lock A + /* might wait for A */ <- DEADLOCK + /* start to take into account + event A's context */ + /* 4 */ + + folio_unlock B + /* event B that's been valid since 2 */ + folio_unlock B + /* event B that's been valid since 3 */ + + mutex_unlock A + /* event A that's been valid since 1 */ + + mutex_unlock A + /* event A that's been valid since 4 */ + +Let's build up dependency graph with this example. Firstly, context X: + + context X + + folio_lock B + /* might wait for B */ + /* start to take into account event B's context */ + /* 2 */ + +There are no events to create dependency. Next, context Y: + + context Y + + mutex_lock A + /* might wait for A */ + /* start to take into account event A's context */ + /* 1 */ + + folio_lock B + /* might wait for B */ + /* start to take into account event B's context */ + /* 3 */ + + folio_unlock B + /* event B that's been valid since 3 */ + + mutex_unlock A + /* event A that's been valid since 1 */ + +There are two events. For event B, folio_unlock B, since there are no +waits between 3 and the event, event B does not create dependency. For +event A, there is a wait, folio_lock B, between 1 and the event. Which +means event A cannot be triggered if event B does not wake up the wait. +Therefore, we can say event A depends on event B, say, 'A -> B'. The +graph will look like after adding the dependency: + + A -> B + + where 'A -> B' means that event A depends on event B. + +Lastly, context Z: + + context Z + + mutex_lock A + /* might wait for A */ + /* start to take into account event A's context */ + /* 4 */ + + folio_unlock B + /* event B that's been valid since 2 */ + + mutex_unlock A + /* event A that's been valid since 4 */ + +There are also two events. For event B, folio_unlock B, there is a +wait, mutex_lock A, between 2 and the event - remind 2 is at a very +start and before the wait in timeline. Which means event B cannot be +triggered if event A does not wake up the wait. Therefore, we can say +event B depends on event A, say, 'B -> A'. The graph will look like +after adding the dependency: + + -> A -> B - + / \ + \ / + ----------- + + where 'A -> B' means that event A depends on event B. + +A new loop has been created. So DEPT can report it as a deadlock. For +event A, mutex_unlock A, since there are no waits between 4 and the +event, event A does not create dependency. That's it. + +CONCLUSION + +DEPT works well with any general synchronization mechanisms by focusing +on wait, event and its context. + + +Interpret DEPT report +--------------------- + +The following is the example in the section 'How DEPT works'. + + context X context Y context Z + + mutex_lock A + /* might wait for A */ + /* start to take into account event A's context */ + /* 1 */ + folio_lock B + /* might wait for B */ + /* start to take into account event B's context */ + /* 2 */ + + folio_lock B + /* might wait for B */ <- DEADLOCK + /* start to take into account event B's context */ + /* 3 */ + + mutex_lock A + /* might wait for A */ <- DEADLOCK + /* start to take into account + event A's context */ + /* 4 */ + + folio_unlock B + /* event B that's been valid since 2 */ + folio_unlock B + /* event B that's been valid since 3 */ + + mutex_unlock A + /* event A that's been valid since 1 */ + + mutex_unlock A + /* event A that's been valid since 4 */ + +We can Simplify this by replacing each waiting point with [W], each +point where its event's context starts with [S] and each event with [E]. +This example will look like after the replacement: + + context X context Y context Z + + [W][S] mutex_lock A + [W][S] folio_lock B + [W][S] folio_lock B <- DEADLOCK + + [W][S] mutex_lock A <- DEADLOCK + [E] folio_unlock B + [E] folio_unlock B + [E] mutex_unlock A + [E] mutex_unlock A + +DEPT uses the symbols [W], [S] and [E] in its report as described above. +The following is an example reported by DEPT for a real problem. + + Link: https://lore.kernel.org/lkml/6383cde5-cf4b-facf-6e07-1378a485657d@xxxxxxxxxxxxxxxxxxx/#t + Link: https://lore.kernel.org/lkml/1674268856-31807-1-git-send-email-byungchul.park@xxxxxxx/ + + =================================================== + DEPT: Circular dependency has been detected. + 6.2.0-rc1-00025-gb0c20ebf51ac-dirty #28 Not tainted + --------------------------------------------------- + summary + --------------------------------------------------- + *** DEADLOCK *** + + context A + [S] lock(&ni->ni_lock:0) + [W] folio_wait_bit_common(PG_locked_map:0) + [E] unlock(&ni->ni_lock:0) + + context B + [S] (unknown)(PG_locked_map:0) + [W] lock(&ni->ni_lock:0) + [E] folio_unlock(PG_locked_map:0) + + [S]: start of the event context + [W]: the wait blocked + [E]: the event not reachable + --------------------------------------------------- + context A's detail + --------------------------------------------------- + context A + [S] lock(&ni->ni_lock:0) + [W] folio_wait_bit_common(PG_locked_map:0) + [E] unlock(&ni->ni_lock:0) + + [S] lock(&ni->ni_lock:0): + [<ffffffff82b396fb>] ntfs3_setattr+0x54b/0xd40 + stacktrace: + ntfs3_setattr+0x54b/0xd40 + notify_change+0xcb3/0x1430 + do_truncate+0x149/0x210 + path_openat+0x21a3/0x2a90 + do_filp_open+0x1ba/0x410 + do_sys_openat2+0x16d/0x4e0 + __x64_sys_creat+0xcd/0x120 + do_syscall_64+0x41/0xc0 + entry_SYSCALL_64_after_hwframe+0x63/0xcd + + [W] folio_wait_bit_common(PG_locked_map:0): + [<ffffffff81b228b0>] truncate_inode_pages_range+0x9b0/0xf20 + stacktrace: + folio_wait_bit_common+0x5e0/0xaf0 + truncate_inode_pages_range+0x9b0/0xf20 + truncate_pagecache+0x67/0x90 + ntfs3_setattr+0x55a/0xd40 + notify_change+0xcb3/0x1430 + do_truncate+0x149/0x210 + path_openat+0x21a3/0x2a90 + do_filp_open+0x1ba/0x410 + do_sys_openat2+0x16d/0x4e0 + __x64_sys_creat+0xcd/0x120 + do_syscall_64+0x41/0xc0 + entry_SYSCALL_64_after_hwframe+0x63/0xcd + + [E] unlock(&ni->ni_lock:0): + (N/A) + --------------------------------------------------- + context B's detail + --------------------------------------------------- + context B + [S] (unknown)(PG_locked_map:0) + [W] lock(&ni->ni_lock:0) + [E] folio_unlock(PG_locked_map:0) + + [S] (unknown)(PG_locked_map:0): + (N/A) + + [W] lock(&ni->ni_lock:0): + [<ffffffff82b009ec>] attr_data_get_block+0x32c/0x19f0 + stacktrace: + attr_data_get_block+0x32c/0x19f0 + ntfs_get_block_vbo+0x264/0x1330 + __block_write_begin_int+0x3bd/0x14b0 + block_write_begin+0xb9/0x4d0 + ntfs_write_begin+0x27e/0x480 + generic_perform_write+0x256/0x570 + __generic_file_write_iter+0x2ae/0x500 + ntfs_file_write_iter+0x66d/0x1d70 + do_iter_readv_writev+0x20b/0x3c0 + do_iter_write+0x188/0x710 + vfs_iter_write+0x74/0xa0 + iter_file_splice_write+0x745/0xc90 + direct_splice_actor+0x114/0x180 + splice_direct_to_actor+0x33b/0x8b0 + do_splice_direct+0x1b7/0x280 + do_sendfile+0xb49/0x1310 + + [E] folio_unlock(PG_locked_map:0): + [<ffffffff81f10222>] generic_write_end+0xf2/0x440 + stacktrace: + generic_write_end+0xf2/0x440 + ntfs_write_end+0x42e/0x980 + generic_perform_write+0x316/0x570 + __generic_file_write_iter+0x2ae/0x500 + ntfs_file_write_iter+0x66d/0x1d70 + do_iter_readv_writev+0x20b/0x3c0 + do_iter_write+0x188/0x710 + vfs_iter_write+0x74/0xa0 + iter_file_splice_write+0x745/0xc90 + direct_splice_actor+0x114/0x180 + splice_direct_to_actor+0x33b/0x8b0 + do_splice_direct+0x1b7/0x280 + do_sendfile+0xb49/0x1310 + __x64_sys_sendfile64+0x1d0/0x210 + do_syscall_64+0x41/0xc0 + entry_SYSCALL_64_after_hwframe+0x63/0xcd + --------------------------------------------------- + information that might be helpful + --------------------------------------------------- + CPU: 1 PID: 8060 Comm: a.out Not tainted + 6.2.0-rc1-00025-gb0c20ebf51ac-dirty #28 + Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), + BIOS Bochs 01/01/2011 + Call Trace: + <TASK> + dump_stack_lvl+0xf2/0x169 + print_circle.cold+0xca4/0xd28 + ? lookup_dep+0x240/0x240 + ? extend_queue+0x223/0x300 + cb_check_dl+0x1e7/0x260 + bfs+0x27b/0x610 + ? print_circle+0x240/0x240 + ? llist_add_batch+0x180/0x180 + ? extend_queue_rev+0x300/0x300 + ? __add_dep+0x60f/0x810 + add_dep+0x221/0x5b0 + ? __add_idep+0x310/0x310 + ? add_iecxt+0x1bc/0xa60 + ? add_iecxt+0x1bc/0xa60 + ? add_iecxt+0x1bc/0xa60 + ? add_iecxt+0x1bc/0xa60 + __dept_wait+0x600/0x1490 + ? add_iecxt+0x1bc/0xa60 + ? truncate_inode_pages_range+0x9b0/0xf20 + ? check_new_class+0x790/0x790 + ? dept_enirq_transition+0x519/0x9c0 + dept_wait+0x159/0x3b0 + ? truncate_inode_pages_range+0x9b0/0xf20 + folio_wait_bit_common+0x5e0/0xaf0 + ? filemap_get_folios_contig+0xa30/0xa30 + ? dept_enirq_transition+0x519/0x9c0 + ? lock_is_held_type+0x10e/0x160 + ? lock_is_held_type+0x11e/0x160 + truncate_inode_pages_range+0x9b0/0xf20 + ? truncate_inode_partial_folio+0xba0/0xba0 + ? setattr_prepare+0x142/0xc40 + truncate_pagecache+0x67/0x90 + ntfs3_setattr+0x55a/0xd40 + ? ktime_get_coarse_real_ts64+0x1e5/0x2f0 + ? ntfs_extend+0x5c0/0x5c0 + ? mode_strip_sgid+0x210/0x210 + ? ntfs_extend+0x5c0/0x5c0 + notify_change+0xcb3/0x1430 + ? do_truncate+0x149/0x210 + do_truncate+0x149/0x210 + ? file_open_root+0x430/0x430 + ? process_measurement+0x18c0/0x18c0 + ? ntfs_file_release+0x230/0x230 + path_openat+0x21a3/0x2a90 + ? path_lookupat+0x840/0x840 + ? dept_enirq_transition+0x519/0x9c0 + ? lock_is_held_type+0x10e/0x160 + do_filp_open+0x1ba/0x410 + ? may_open_dev+0xf0/0xf0 + ? find_held_lock+0x2d/0x110 + ? lock_release+0x43c/0x830 + ? dept_ecxt_exit+0x31a/0x590 + ? _raw_spin_unlock+0x3b/0x50 + ? alloc_fd+0x2de/0x6e0 + do_sys_openat2+0x16d/0x4e0 + ? __ia32_sys_get_robust_list+0x3b0/0x3b0 + ? build_open_flags+0x6f0/0x6f0 + ? dept_enirq_transition+0x519/0x9c0 + ? dept_enirq_transition+0x519/0x9c0 + ? lock_is_held_type+0x4e/0x160 + ? lock_is_held_type+0x4e/0x160 + __x64_sys_creat+0xcd/0x120 + ? __x64_compat_sys_openat+0x1f0/0x1f0 + do_syscall_64+0x41/0xc0 + entry_SYSCALL_64_after_hwframe+0x63/0xcd + RIP: 0033:0x7f8b9e4e4469 + Code: 00 f3 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 + 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> + 3d 01 f0 ff ff 73 01 c3 48 8b 0d ff 49 2b 00 f7 d8 64 89 01 48 + RSP: 002b:00007f8b9eea4ef8 EFLAGS: 00000202 ORIG_RAX: 0000000000000055 + RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f8b9e4e4469 + RDX: 0000000000737562 RSI: 0000000000000000 RDI: 0000000020000000 + RBP: 00007f8b9eea4f20 R08: 0000000000000000 R09: 0000000000000000 + R10: 0000000000000000 R11: 0000000000000202 R12: 00007fffa75511ee + R13: 00007fffa75511ef R14: 00007f8b9ee85000 R15: 0000000000000003 + </TASK> + +Let's take a look at the summary that is the most important part. + + --------------------------------------------------- + summary + --------------------------------------------------- + *** DEADLOCK *** + + context A + [S] lock(&ni->ni_lock:0) + [W] folio_wait_bit_common(PG_locked_map:0) + [E] unlock(&ni->ni_lock:0) + + context B + [S] (unknown)(PG_locked_map:0) + [W] lock(&ni->ni_lock:0) + [E] folio_unlock(PG_locked_map:0) + + [S]: start of the event context + [W]: the wait blocked + [E]: the event not reachable + +The summary shows the following scenario: + + context A context B context ?(unknown) + + [S] folio_lock(&f1) + [S] lock(&ni->ni_lock:0) + [W] folio_wait_bit_common(PG_locked_map:0) + + [W] lock(&ni->ni_lock:0) + [E] folio_unlock(&f1) + + [E] unlock(&ni->ni_lock:0) + +Adding supplementary comments to describe DEPT's view in detail: + + context A context B context ?(unknown) + + [S] folio_lock(&f1) + /* start to take into account context + B heading for folio_unlock(&f1) */ + /* 1 */ + [S] lock(&ni->ni_lock:0) + /* start to take into account this context heading for + unlock(&ni->ni_lock:0) */ + /* 2 */ + + [W] folio_wait_bit_common(PG_locked_map:0) (= folio_lock(&f1)) + /* might wait for folio_unlock(&f1) */ + + [W] lock(&ni->ni_lock:0) + /* might wait for unlock(&ni->ni_lock:0) */ + + [E] folio_unlock(&f1) + /* event that's been valid since 1 */ + + [E] unlock(&ni->ni_lock:0) + /* event that's been valid since 2 */ + +Let's build up dependency graph with this report. Firstly, context A: + + context A + + [S] lock(&ni->ni_lock:0) + /* start to take into account this context heading for + unlock(&ni->ni_lock:0) */ + /* 2 */ + + [W] folio_wait_bit_common(PG_locked_map:0) (= folio_lock(&f1)) + /* might wait for folio_unlock(&f1) */ + + [E] unlock(&ni->ni_lock:0) + /* event that's been valid since 2 */ + +There is one interesting event, unlock(&ni->ni_lock:0). There is a +wait, folio_lock(&f1), between 2 and the event. Which means +unlock(&ni->ni_lock:0) is not reachable if folio_unlock(&f1) does not +wake up the wait. Therefore, we can say unlock(&ni->ni_lock:0) depends +on folio_unlock(&f1), say, 'unlock(&ni->ni_lock:0) -> folio_unlock(&f1)'. +The graph will look like after adding the dependency: + + unlock(&ni->ni_lock:0) -> folio_unlock(&f1) + + where 'A -> B' means that event A depends on event B. + +Secondly, context B: + + context B + + [W] lock(&ni->ni_lock:0) + /* might wait for unlock(&ni->ni_lock:0) */ + + [E] folio_unlock(&f1) + /* event that's been valid since 1 */ + +There is also one interesting event, folio_unlock(&f1). There is a +wait, lock(&ni->ni_lock:0), between 1 and the event - remind 1 is at a +very start and before the wait in timeline. Which means folio_unlock(&f1) +is not reachable if unlock(&ni->ni_lock:0) does not wake up the wait. +Therefore, we can say folio_unlock(&f1) depends on unlock(&ni->ni_lock:0), +say, 'folio_unlock(&f1) -> unlock(&ni->ni_lock:0)'. The graph will look +like after adding the dependency: + + -> unlock(&ni->ni_lock:0) -> folio_unlock(&f1) - + / \ + \ / + ------------------------------------------------ + + where 'A -> B' means that event A depends on event B. + +A new loop has been created. So DEPT can report it as a deadlock! Cool! + +CONCLUSION + +DEPT works awesome! -- 2.17.1