Re: [PATCH] serial: flush ldisc after hangup

Josef Bacik <jbacik@xxxxxx> · Tue, 1 Mar 2016 13:21:26 -0500

On 03/01/2016 01:17 PM, Peter Hurley wrote:
Hi Josef,

On 03/01/2016 10:02 AM, Josef Bacik wrote:
We hit a panic pretty consistently in production that looked like this

PID: 461061  TASK: ffff880203f8bc00  CPU: 2   COMMAND: "kworker/u8:2"
  #0 [ffff88015834b940] machine_kexec at ffffffff8103c1c5
  #1 [ffff88015834b990] crash_kexec at ffffffff810cd448
  #2 [ffff88015834ba60] oops_end at ffffffff81006478
  #3 [ffff88015834ba90] no_context at ffffffff818c5262
  #4 [ffff88015834baf0] __bad_area_nosemaphore at ffffffff818c545a
  #5 [ffff88015834bb40] bad_area_nosemaphore at ffffffff818c548c
  #6 [ffff88015834bb50] __do_page_fault at ffffffff81045ad5
  #7 [ffff88015834bbc0] do_page_fault at ffffffff81045efc
  #8 [ffff88015834bbd0] page_fault at ffffffff818d6b82
     [exception RIP: __uart_start+0x1a]
     RIP: ffffffff8152f30a  RSP: ffff88015834bc80  RFLAGS: 00010046
     RAX: 0000000000000000  RBX: ffffffff822e9920  RCX: 0000000000000036
     RDX: 0000000000003636  RSI: 00000000000000fe  RDI: ffffffff822e9920
     RBP: ffff88015834bca8   R8: 0000000000000000   R9: 00000000ffffffff
     R10: ffff8802546f0d20  R11: 0000000000000000  R12: ffff880254712400
     R13: 0000000000000286  R14: 00000000000000fe  R15: ffff880254712400
     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
  #9 [ffff88015834bc80] uart_start at ffffffff8152fbf2

Thanks for the report, but where's the rest of the stack trace?

Woops sorry about that

crash> bt
PID: 461061  TASK: ffff880203f8bc00  CPU: 2   COMMAND: "kworker/u8:2"
 #0 [ffff88015834b940] machine_kexec at ffffffff8103c1c5
 #1 [ffff88015834b990] crash_kexec at ffffffff810cd448
 #2 [ffff88015834ba60] oops_end at ffffffff81006478
 #3 [ffff88015834ba90] no_context at ffffffff818c5262
 #4 [ffff88015834baf0] __bad_area_nosemaphore at ffffffff818c545a
 #5 [ffff88015834bb40] bad_area_nosemaphore at ffffffff818c548c
 #6 [ffff88015834bb50] __do_page_fault at ffffffff81045ad5
 #7 [ffff88015834bbc0] do_page_fault at ffffffff81045efc
 #8 [ffff88015834bbd0] page_fault at ffffffff818d6b82
    [exception RIP: __uart_start+0x1a]
    RIP: ffffffff8152f30a  RSP: ffff88015834bc80  RFLAGS: 00010046
    RAX: 0000000000000000  RBX: ffffffff822e9920  RCX: 0000000000000036
    RDX: 0000000000003636  RSI: 00000000000000fe  RDI: ffffffff822e9920
    RBP: ffff88015834bca8   R8: 0000000000000000   R9: 00000000ffffffff
    R10: ffff8802546f0d20  R11: 0000000000000000  R12: ffff880254712400
    R13: 0000000000000286  R14: 00000000000000fe  R15: ffff880254712400
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #9 [ffff88015834bc80] uart_start at ffffffff8152fbf2
#10 [ffff88015834bcb0] uart_flush_chars at ffffffff8152fc1e
#11 [ffff88015834bcc0] n_tty_receive_buf_common at ffffffff81516cf1
#12 [ffff88015834bd80] n_tty_receive_buf2 at ffffffff81517414
#13 [ffff88015834bd90] flush_to_ldisc at ffffffff8151ab6d
#14 [ffff88015834bdf0] process_one_work at ffffffff81069871
#15 [ffff88015834be40] worker_thread at ffffffff81069c53
#16 [ffff88015834bec0] kthread at ffffffff8106f429
#17 [ffff88015834bf50] ret_from_fork at ffffffff818d50c8

It was a NULL pointer dereference, the state->port.tty was NULL so when we go to
check tty->stopped in uart_tx_stopped() we panic.  Looking at the other CPU's we
were in the middle of uart_open(), and the core actually had a valid pointer in
state->port.tty, which points to a race between either close or hangup (the only
two places that set state->port.tty to NULL) and open.  Close already flushes
the ldisc but hangup does not, which means we could have some characters in the
receive buffer in between the hangup and the open, and we end up in this
situation.

Yeah, the race is that the ldisc should not be attempting i/o to
the driver at all. This problem is fixed in -next already, but in the
tty core rather than in each individual tty driver.

Great!  Which patch/patches fix this?  I looked at linux-next and 
there's a lot of refactoring stuff, do I need all the things or is there 
a specific one that fixes this problem?  Thanks,

Josef

--
To unsubscribe from this list: send the line "unsubscribe linux-serial" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html