Re: [4.9-rc1+] overlayfs lockdep

Miklos Szeredi <miklos@xxxxxxxxxx> · Mon, 24 Oct 2016 14:57:17 +0200

On Fri, Oct 21, 2016 at 5:38 PM, CAI Qian <caiqian@xxxxxxxxxx> wrote:
>
> ----- Original Message -----
>> From: "Jan Kara" <jack@xxxxxxx>
>> Sent: Friday, October 7, 2016 3:08:38 AM
>> Subject: Re: local DoS - systemd hang or timeout (WAS: Re: [RFC][CFT] splice_read reworked)
>>
>>
>> So I believe this may be just a problem in overlayfs lockdep annotation
>> (see below). Added Miklos to CC.
>>
>> > Wait. There is also a lockep happened before the xfs internal error as
>> > well.
>> >
>> > [ 5839.452325] ======================================================
>> > [ 5839.459221] [ INFO: possible circular locking dependency detected ]
>> > [ 5839.466215] 4.8.0-rc8-splice-fixw-proc+ #4 Not tainted
>> > [ 5839.471945] -------------------------------------------------------
>> > [ 5839.478937] trinity-c220/69531 is trying to acquire lock:
>> > [ 5839.484961]  (&p->lock){+.+.+.}, at: [<ffffffff812ac69c>]
>> > seq_read+0x4c/0x3e0
>> > [ 5839.492967]
>> > but task is already holding lock:
>> > [ 5839.499476]  (sb_writers#8){.+.+.+}, at: [<ffffffff81284be1>]
>> > __sb_start_write+0xd1/0xf0
>> > [ 5839.508560]
>> > which lock already depends on the new lock.
>> >
>> > [ 5839.517686]
>> > the existing dependency chain (in reverse order) is:
>> > [ 5839.526036]
>> > -> #3 (sb_writers#8){.+.+.+}:
>> > [ 5839.530751]        [<ffffffff810ff174>] lock_acquire+0xd4/0x240
>> > [ 5839.537368]        [<ffffffff810f8f4a>] percpu_down_read+0x4a/0x90
>> > [ 5839.544275]        [<ffffffff81284be1>] __sb_start_write+0xd1/0xf0
>> > [ 5839.551181]        [<ffffffff812a8544>] mnt_want_write+0x24/0x50
>> > [ 5839.557892]        [<ffffffffa04a398f>] ovl_want_write+0x1f/0x30
>> > [overlay]
>> > [ 5839.565577]        [<ffffffffa04a6036>] ovl_do_remove+0x46/0x480
>> > [overlay]
>> > [ 5839.573259]        [<ffffffffa04a64a3>] ovl_unlink+0x13/0x20 [overlay]
>> > [ 5839.580555]        [<ffffffff812918ea>] vfs_unlink+0xda/0x190
>> > [ 5839.586979]        [<ffffffff81293698>] do_unlinkat+0x268/0x2b0
>> > [ 5839.593599]        [<ffffffff8129419b>] SyS_unlinkat+0x1b/0x30
>> > [ 5839.600120]        [<ffffffff81003c9c>] do_syscall_64+0x6c/0x1e0
>> > [ 5839.606836]        [<ffffffff817d4a3f>] return_from_SYSCALL_64+0x0/0x7a
>> > [ 5839.614231]
>>
>> So here is IMO the real culprit: do_unlinkat() grabs fs freeze protection
>> through mnt_want_write(), we grab also i_rwsem in do_unlinkat() in
>> I_MUTEX_PARENT class a bit after that and further down in vfs_unlink() we
>> grab i_rwsem for the unlinked inode itself in default I_MUTEX class. Then
>> in ovl_want_write() we grab freeze protection again, but this time for the
>> upper filesystem. That establishes sb_writers (overlay) -> I_MUTEX_PARENT
>> (overlay) -> I_MUTEX (overlay) -> sb_writers (FS-A) lock ordering
>> (we maintain locking classes per fs type so that's why I'm showing fs type
>> in parenthesis).
>>
>> Now this nesting is nasty because once you add locks that are not tracked
>> per fs type into the mix, you get cycles. In this case we've got
>> seq_file->lock and cred_guard_mutex into the mix - the splice path is
>> doing sb_writers (FS-A) -> seq_file->lock -> cred_guard_mutex (splicing
>> from seq_file into the real filesystem). Exec path further establishes
>> cred_guard_mutex -> I_MUTEX (overlay) which closes the full cycle:
>>
>> sb_writers (FS-A) -> seq_file->lock -> cred_guard_mutex -> i_mutex
>> (overlay) -> sb_writers (FS-A)
>>
>> If I analyzed the lockdep trace, this looks like a real (although remote)
>> deadlock possibility. Miklos?

Yeah, you can leave out seq_file->lock, the chain can be made up from
just 3 parts:

unlink : i_mutex(ov) -> sb_writers(fs-a)
splice: sb_writers(fs-a) ->cred_guard_mutex (though proc_tgid_io_accounting)
exec:  cred_guard_mutex -> i_mutex(ov)

None of those are incorrect, but the cred_guard_mutex usage is also
pretty weird: taken outside path lookup as well as inside ->read() in
proc.

Doesn't look a serious worry in practice, I don't think anybody would
trigger the actual deadlock in a 1000years (an fs freeze is needed at
just the right moment in addition to the above, very unlikely chain).

Thanks,
Miklos
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html