Hi,
using drivers/mmc/host/tmio_mmc_core.c (on 4.14.x Renesas BSP) what is
quite similar to [1] we are trying to analyze a KASAN out of bounds
report [2] in
...
mmc_request_done+0xcc/0x30c
tmio_mmc_reset_work+0x1fc/0x248
...
Trying to resolve mmc_request_done+0xcc we think this is the access to
mrq->cap_cmd_during_tfr
in mmc_command_done() called inline from mmc_request_done() with mrq
becoming invalid. See [3].
Guessing about options why a wrong mrq might be used here, we have some
questions regarding tmio_mmc_reset_work() [1] what is a delayed work
handler (running asynchronously). Comparing that with similar delayed
work handlers like bcm2835_timeout()/bcm2835_finish_request() [4] the
(simplified) functions are
static void tmio_mmc_reset_work(struct work_struct *work)
{
struct mmc_request *mrq;
spin_lock_irqsave(&host->lock, flags);
mrq = host->mrq;
spin_unlock_irqrestore(&host->lock, flags);
/* Ready for new calls */
host->mrq = NULL;
mmc_request_done(host->mmc, mrq);
}
static void bcm2835_timeout(struct work_struct *work)
{
struct mmc_request *mrq;
mutex_lock(&host->mutex);
if (host->mrq) {
cancel_delayed_work(&host->timeout_work);
mrq = host->mrq;
host->mrq = NULL;
mmc_request_done(mmc_from_priv(host), mrq);
}
mutex_unlock(&host->mutex);
}
Comparing this, the questions are:
* tmio uses irqsave spin locks, while the bcm2835 uses mutex for
locking. Why is irqsave spin lock used here?
* In tmio the mmc_request_done() is *outside* the lock, while for
bcm2835 its *inside* the lock protected section. Why does tmio doesn't
lock the access to mmc_request_done()?
* How does tmio_mmc_reset_work() ensures that the content mrq points to
and passed to mmc_request_done() is still the correct content and hasn't
changed asynchronously in the background (due to being outside the lock)?
* Why does tmio doesn't use cancel_delayed_work()? Maybe because its
ensured by the interrupt locking no further work is scheduled?
In sum looking at the KASAN report and the locking in
tmio_mmc_reset_work() we have an uncomfortable feeling.
bcm2835_timeout() looks much safer regarding the locking. Of course that
is just a feeling and can't be proven. Therefore we want to ask here :)
Best regards
Dirk
[1]
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/mmc/host/tmio_mmc_core.c#n229
[2]
BUG: KASAN: stack-out-of-bounds in mmc_request_done+0xcc/0x30c
Read of size 1 at addr ffff8005df517738 by task kworker/4:1/2308
CPU: 4 PID: 2308 Comm: kworker/4:1 Tainted: G C 4.14.327-ltsi
Hardware name: custom board based on r8a7796 (DT)
Workqueue: events tmio_mmc_reset_work
Call trace:
dump_backtrace+0x0/0x1fc
show_stack+0x14/0x1c
dump_stack+0xcc/0x114
print_address_description+0x54/0x238
kasan_report+0x274/0x29c
__asan_load1+0x24/0x50
mmc_request_done+0xcc/0x30c
tmio_mmc_reset_work+0x1fc/0x248
process_one_work+0x324/0x6c0
worker_thread+0x358/0x4d4
kthread+0x1a4/0x1bc
ret_from_fork+0x10/0x18
The buggy address belongs to the page:
page:ffff7e00177d45c0 pfn:61f517 refcount:0 mapcount:0 mapping:
(null) index:0x0
flags: 0x4000000000000000()
raw: 4000000000000000 0000000000000000 0000000000000000 00000000ffffffff
raw: 0000000000000000 ffff7e00177d45e0 0000000000000000 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff8005df517600: 00 00 00 00 f1 f1 f1 f1 00 00 00 00 00 00 00 00
ffff8005df517680: f3 f3 f3 f3 00 00 00 00 00 00 00 00 00 00 00 00
>ffff8005df517700: 00 00 f1 f1 f1 f1 04 f2 f2 f2 00 00 00 00 00 00
^
ffff8005df517780: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
ffff8005df517800: 00 00 f1 f1 f1 f1 04 f2 f2 f2 f2 f2 f2 f2 00 00
==================================================================
Disabling lock debugging due to kernel taint
------------[ cut here ]------------
[3]
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/mmc/core/core.c#n113
[4]
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/mmc/host/bcm2835.c#n821