Re: [syzbot] KASAN: out-of-bounds Read in i801_isr

Jean Delvare <jdelvare@xxxxxxx> · Fri, 21 May 2021 15:48:20 +0200

On Tue, 11 May 2021 15:52:21 -0700, syzbot wrote:
> syzbot found the following issue on:
> 
> HEAD commit:    e1381380 Merge git://git.kernel.org/pub/scm/linux/kernel/g..
> git tree:       upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=10be778ad00000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=5adab0bdee099d7a
> dashboard link: https://syzkaller.appspot.com/bug?extid=b4d3fd1dfd53e90afd79
> userspace arch: i386
> 
> Unfortunately, I don't have any reproducer for this issue yet.
> 
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+b4d3fd1dfd53e90afd79@xxxxxxxxxxxxxxxxxxxxxxxxx
> 
> ==================================================================
> BUG: KASAN: out-of-bounds in i801_isr_byte_done drivers/i2c/busses/i2c-i801.c:608 [inline]
> BUG: KASAN: out-of-bounds in i801_isr drivers/i2c/busses/i2c-i801.c:668 [inline]
> BUG: KASAN: out-of-bounds in i801_isr+0xc70/0xd60 drivers/i2c/busses/i2c-i801.c:649
> Read of size 1 at addr ffffc90004237d68 by task systemd-udevd/5027
> 
> CPU: 2 PID: 5027 Comm: systemd-udevd Not tainted 5.12.0-rc4-syzkaller #0
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
> Call Trace:
>  <IRQ>
>  __dump_stack lib/dump_stack.c:79 [inline]
>  dump_stack+0x141/0x1d7 lib/dump_stack.c:120
>  print_address_description.constprop.0.cold+0x5/0x2f8 mm/kasan/report.c:232
>  __kasan_report mm/kasan/report.c:399 [inline]
>  kasan_report.cold+0x7c/0xd8 mm/kasan/report.c:416
>  i801_isr_byte_done drivers/i2c/busses/i2c-i801.c:608 [inline]
>  i801_isr drivers/i2c/busses/i2c-i801.c:668 [inline]
>  i801_isr+0xc70/0xd60 drivers/i2c/busses/i2c-i801.c:649
>  __handle_irq_event_percpu+0x303/0x8f0 kernel/irq/handle.c:156
>  handle_irq_event_percpu kernel/irq/handle.c:196 [inline]
>  handle_irq_event+0x102/0x290 kernel/irq/handle.c:213
>  handle_fasteoi_irq+0x22f/0x9f0 kernel/irq/chip.c:714
>  generic_handle_irq_desc include/linux/irqdesc.h:158 [inline]
>  handle_irq arch/x86/kernel/irq.c:231 [inline]
>  __common_interrupt+0x9e/0x200 arch/x86/kernel/irq.c:250
>  common_interrupt+0x4c/0xd0 arch/x86/kernel/irq.c:240
>  asm_common_interrupt+0x1e/0x40 arch/x86/include/asm/idtentry.h:623
> RIP: 0010:__do_softirq+0x196/0x9f6 kernel/softirq.c:331
> Code: 00 48 01 f0 48 89 44 24 18 48 c7 c7 80 39 6a 89 e8 7f 8d c6 ff 65 66 c7 05 35 46 c3 76 00 00 e8 00 df 39 f8 fb b8 ff ff ff ff <48> c7 c3 c0 a0 c0 8b 41 0f bc c5 41 89 c7 41 83 c7 01 75 47 e9 d2
> RSP: 0018:ffffc90000550f78 EFLAGS: 00000216
> RAX: 00000000ffffffff RBX: ffff888018491c40 RCX: 1ffffffff1b8be19
> RDX: 0000000000000000 RSI: 0000000000000100 RDI: 0000000000000000
> RBP: ffff888011159c40 R08: 0000000000000001 R09: 0000000000000001
> R10: ffffffff8179e0c8 R11: 0000000000000001 R12: 0000000000000000
> R13: 0000000000000200 R14: 0000000000000000 R15: 0000000000000000
>  invoke_softirq kernel/softirq.c:221 [inline]
>  __irq_exit_rcu kernel/softirq.c:422 [inline]
>  irq_exit_rcu+0x134/0x200 kernel/softirq.c:434
>  sysvec_apic_timer_interrupt+0x93/0xc0 arch/x86/kernel/apic/apic.c:1100
>  </IRQ>
>  asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:632
> RIP: 0010:check_kcov_mode kernel/kcov.c:163 [inline]
> RIP: 0010:__sanitizer_cov_trace_pc+0x7/0x60 kernel/kcov.c:197
> Code: ff ff b9 ff ff ff ff ba 08 00 00 00 4d 8b 03 48 0f bd ca 49 8b 45 00 48 63 c9 e9 64 ff ff ff 0f 1f 40 00 65 8b 05 49 fe 8d 7e <89> c1 48 8b 34 24 81 e1 00 01 00 00 65 48 8b 14 25 00 f0 01 00 a9
> RSP: 0018:ffffc90009ff7868 EFLAGS: 00000283
> RAX: 0000000080000000 RBX: 0000000000000004 RCX: ffff888018491c40
> RDX: 0000000000000000 RSI: ffff888018491c40 RDI: 0000000000000003
> RBP: ffff8880143aa100 R08: 0000000000000000 R09: 0000000000000004
> R10: ffffffff8392c68a R11: 0000000000000010 R12: 0000000000000002
> R13: 00000000000003e4 R14: dffffc0000000000 R15: 0000000000000000
>  tomoyo_domain_quota_is_ok+0x2f1/0x550 security/tomoyo/util.c:1093
>  tomoyo_supervisor+0x2f2/0xf00 security/tomoyo/common.c:2089
>  tomoyo_audit_path_log security/tomoyo/file.c:168 [inline]
>  tomoyo_path_permission security/tomoyo/file.c:587 [inline]
>  tomoyo_path_permission+0x270/0x3a0 security/tomoyo/file.c:573
>  tomoyo_path_perm+0x2f0/0x400 security/tomoyo/file.c:838
>  security_inode_getattr+0xcf/0x140 security/security.c:1288
>  vfs_getattr fs/stat.c:131 [inline]
>  vfs_statx+0x164/0x390 fs/stat.c:199
>  vfs_fstatat fs/stat.c:217 [inline]
>  vfs_lstat include/linux/fs.h:3240 [inline]
>  __do_sys_newlstat+0x91/0x110 fs/stat.c:372
>  do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
>  entry_SYSCALL_64_after_hwframe+0x44/0xae
> RIP: 0033:0x7fc3c1a88335
> Code: 69 db 2b 00 64 c7 00 16 00 00 00 b8 ff ff ff ff c3 0f 1f 40 00 83 ff 01 48 89 f0 77 30 48 89 c7 48 89 d6 b8 06 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 03 f3 c3 90 48 8b 15 31 db 2b 00 f7 d8 64 89
> RSP: 002b:00007ffd4b6fa5c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000006
> RAX: ffffffffffffffda RBX: 0000558fa852a210 RCX: 00007fc3c1a88335
> RDX: 00007ffd4b6fa600 RSI: 00007ffd4b6fa600 RDI: 0000558fa8529210
> RBP: 00007ffd4b6fa6c0 R08: 00007fc3c1d472c8 R09: 0000000000001010
> R10: 0000000000000020 R11: 0000000000000246 R12: 0000558fa8529210
> R13: 0000558fa852923a R14: 0000558fa8501ed1 R15: 0000558fa8501eda
> 
> 
> Memory state around the buggy address:
>  ffffc90004237c00: f2 f2 f2 00 00 f3 f3 00 00 00 00 00 00 00 00 00
>  ffffc90004237c80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >ffffc90004237d00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
>                                                           ^
>  ffffc90004237d80: 00 00 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1
>  ffffc90004237e00: 00 f2 f2 f2 00 00 00 00 00 00 00 00 00 00 00 f3
> ==================================================================

We get similar reports from Kasan since at least January 2020.
Sometimes about reads, sometimes about writes. Originally in vmalloc'd
area, then on the stack, then in vmalloc's aread again, and now just
"out-of-bound" without the memory type. What all these reports have in
common is that the problem happens in i801_isr_byte_done(). The exact
line differs though. Today's error is on this line:

	} else if (priv->count < priv->len - 1) {
		/* Write next byte, except for IRQ after last byte */
		outb_p(priv->data[++priv->count], SMBBLKDAT(priv));		<<--- HERE
	}

while the previous report was on that line:

		/* Read next byte */
		if (priv->count < priv->len)
			priv->data[priv->count++] = inb(SMBBLKDAT(priv));	<<--- HERE
		else
			dev_dbg(&priv->pci_dev->dev,
				"Discarding extra byte on block read\n");

All reports before that were on either location.

All reports also have in common that a previous transaction before the
bug triggers has failed with a timeout:

[  376.721552][T12286] i801_smbus 0000:00:1f.3: Timeout waiting for interrupt!
[  376.743857][T12286] i801_smbus 0000:00:1f.3: Transaction timeout
[  376.766812][T12286] i801_smbus 0000:00:1f.3: Failed terminating the transaction

I think the restoration of the SMBus controller to a working state
after a transaction failed like that never actually worked, typically
the controller would stop responding until next reboot. Most probably
because the problem that caused the timeout (like a broken I2C device
holding either SDA or SCL permanently up or down) is still present. But
it should definitely not cause a memory overrun.

Looking at the ICH9 datasheet, I see the following description for the
KILL bit (which is what we try to use to reset the SMBus controller):

  "Kills the current host transaction taking place, sets the FAILED
   status bit, and asserts the interrupt (or SMI#)."

At the time the recovery code was written, i2c-i801 was a polling-only
driver, interrupts were not supported, so asserting the interrupt had
no effect. Now that the driver does support interrupts, this would call
i801_isr(), right?

So my theory is that our attempt to kill a timed-out byte-by-byte block
transaction triggers an interrupt, which calls in i801_isr() with the
SMBHSTSTS_BYTE_DONE bit set. This in turn causes i801_isr_byte_done()
to be called while we are absolutely not ready nor even supposed to
process the next data byte.

I guess we should clear SMBHSTSTS_BYTE_DONE before issuing a
SMBHSTCNT_KILL. Alternatively we could add a check at the beginning of
i801_isr() to bail out immediately if SMBHSTCNT_KILL is set. While
possibly more robust, this approach has the drawback of increasing the
processing time of all interrupts, even standard/legitimate ones. So
maybe just clearing SMBHSTSTS_BYTE_DONE is more reasonable. Something
like:

--- linux-5.11.orig/drivers/i2c/busses/i2c-i801.c
+++ linux-5.11/drivers/i2c/busses/i2c-i801.c
@@ -393,6 +393,8 @@ static int i801_check_post(struct i801_p
 		dev_err(&priv->pci_dev->dev, "Transaction timeout\n");
 		/* try to stop the current command */
 		dev_dbg(&priv->pci_dev->dev, "Terminating the current operation\n");
+		/* Clear BYTE_DONE so as to not confuse i801_isr() */
+		outb_p(SMBHSTSTS_BYTE_DONE, SMBHSTSTS(priv));
 		outb_p(inb_p(SMBHSTCNT(priv)) | SMBHSTCNT_KILL,
 		       SMBHSTCNT(priv));
 		usleep_range(1000, 2000);

I must say I wonder why SMBHSTCNT_KILL generates an interrupt in the
first place, I can't see who would need this.

Jarkko, Andy, Mika, what do you think? I would appreciate a second pair
of eyes looking at the issue, in case I'm on the wrong track.

Of course, another question is why we hit the timeout in the first
place. But we would need a reproducer to investigate this. Is there any
place where I could see the actual I2C commands being sent by the bot?
The log only traces the bot-level functions, and I could guess how some
of the parameters are being mapped to struct members that are then fed
to i2c-dev, but some of the parameters are still a mystery to me.

To be honest, I find the bot logs pretty hard to read, as apparently a
lot of tests are running in parallel for various subsystems, and
everything ends mixed up in the log. Wouldn't it be more valuable to
serialize the tests on one given virtual host, so that the developers
can then read the log and more easily see what's going on?

Thanks,
-- 
Jean Delvare
SUSE L3 Support