Re: [PATCH 1/1] tty, serial, 8250: to avoid recv fifo overrun, read rx fifo during xmit

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Prasad,

On 02/26/2015 09:24 PM, Prasad Koya wrote:
> Hi Peter..sorry I wanted to try with higher baud rates before I got
> back. I do see the overrun even at higher rates like 38400 or 57600.
> While a printk of about 60 chars is being spit out, I pasted a rather
> large command(s) of about 200-1000 chars on the console and I see
> overrun. Pl see my answers within:

Thanks for confirming my suspicions.

> On Wed, Feb 18, 2015 at 6:32 AM, Peter Hurley <peter@xxxxxxxxxxxxxxxxxx> wrote:
>> On 02/18/2015 03:40 AM, Prasad Koya wrote:
>>> On Wed, Feb 18, 2015 at 12:35 AM, Prasad Koya <prasad@xxxxxxxxxx> wrote:
>>>> On Tue, Feb 17, 2015 at 7:34 PM, Peter Hurley <peter@xxxxxxxxxxxxxxxxxx>
>>>> wrote:
>>>>> On 02/17/2015 06:22 PM, prasad@xxxxxxxxxx wrote:
>>>>>> From: Prasad Koya <prasad@xxxxxxxxxx>
>>>>>>
>>>>>> At slow baud rates, say 9600, it is easy to run into tty buffer overrun
>>>>>> as interrupts could get disabled for a long time. To avoid that, if
>>>>>> UART_LSR_DR is set, pick up bytes from rx fifo while waiting for tx
>>>>>> buffer to be empty. Stop buffering if port->sysrq is true or
>>>>>> oops_in_progress is set.
>>>>>
>>>>> Some of the original information you supplied has been lost in this commit
>>>>> log. I think it would be helpful if you detailed how the overrun error can
>>>>> happen, like you did before:
>>>>>
>>>
>>> Yes, its been quite some my first patch or problem was posted..then holiday
>>> season etc.
>>>
>>>>
>>>>>
>>>>>> On 09/30/2014 01:30 PM, Prasad Koya wrote:
>>>>>>> At 9600 baud, we are seeing buffer overruns on our 16550A UARTs. If a
>>>>>>> command, say 40 chars long, is sent to console and if printk to
>>>>>>> console happens at same time, we running into receive buffer overrun.
>>>>>>> Thats because, at 9600 baud, it takes about a millisecond to xmit 1
>>>>>>> char and with receive FIFO only 16 chars deep, it is very easy to
>>>>>>> cause receive buffer overrun.
>>>>>
>>>>> What exactly is the use case here? I ask because
>>>>
>>>>
>>>
>>> In our testing infrastructure we have scripts shooting commands to the
>>> server (ethernet switch) via its console. Bunch of such servers are
>>> connected to a terminal server. These could be commands like "ip route add
>>> to 192.168.0.0/16 dev ma1 via 192.168.0.1" which are over 16 chars in
>>> length. If server happens to write something out to console in the middle of
>>> receiving above command, we are seeing receive fifo overrun.
>>
>> Thanks for explaining that.
>>
>> I just want to be honest right here: I'm reluctant to recommend such an
>> invasive and consequential change for an automated test harness that
>> can be worked around at the sender end.
>>
>> I'm not implying yours is not a valid use-case; it is. But there are a
>> number of factors which weigh against it:
>> 1. It makes serial console output less reliable. This is already a problem
>>    with printk().
>> 2. It imposes a significant maintenance burden by linking areas of code
>>    that were separate before.
>> 3. It's had really limited testing.
>> 4. There are other workarounds available.
> 
> I wanted to get feedback on the patch starting from whether this is a
> valid use case. There are ways to circumvent this overrun but I felt
> like kernel should be able to handle that (I may be wrong). If the
> reviewers here agree that this is a valid use-case I'm ok doing
> required testing/look into any scripts to test this.

Ok.

I'll reserve judgement for the next iteration that addresses my
earlier review; I'd like to see this patch pared down to minimally
invasive but still useful and addresses your specific use-case.

So to recap:
1. Remove the UPF_CONS_FLOW case; let that remain Someone Else's Problem.
2. Rework the commit log to accurately describe the use case and how the
   overrun error can occur.
   Be sure to note that overruns can occur at any baud rate but they're more
   frequent at 9600 baud (which is a common line speed in your industry).
   Also be sure to mention this happens on your motherboard/switch product;

3. Add a note in the commit log that you don't care about limiting overruns
   in sysrq and this doesn't address that problem. I usually make a note like
   that this way:

NB: Overruns are still possible if simultaneous rx occurs while outputting
    from a sysrq handler; this patch does not address that use-case.

4. Only perform rx from within serial8250_console_putchar() because that's
   the only source of rx overrun errors.
5. Add the CONFIG_ knob. Be sure it depends on CONFIG_SERIAL_8250_CONSOLE and
   build test both configurations.

Regards,
Peter Hurley


>> At a minimum this patch should have a CONFIG_ knob which is by default disabled.
>>
> 
> Sure. This can be disabled by default.
> 
>> FWIW, I carry a number of patchsets outside mainline because they're
>> simply not suitable, either because they're really only useful to me
>> or because they're too ugly or whatever.
>>
>>
>>>>> 1. the patch is trying to fix overrun during a random printk but
>>>>>    makes no effort to fix overrun during the potentially much, much
>>>>>    longer sysrq printk.
>>>>
>>>>
>>>
>>> The sysrq-key combinations are manual entry and we'd never see this in that
>>> case.
>>
>> That's what I thought; I just wanted to be sure.
>>
>>
>>>>> 2. I'm struggling to envision the circumstances where a printk is
>>>>>    happening frequently enough to collide with large tty input. Heaviest
>>>>>    printk usage is during boot but that is also typically the lightest
>>>>>    tty input load.
>>>>>
>>>>
>>>
>>> When we first ran into this issue it was soon after boot where script was
>>> configuring the server and one of our modules printed something on console.
>>> There could be other cases where one could run into such situation.
>>
>> Ok, this makes sense. The console goes online way earlier than the network
>> stack.
>>
>>
>>>>> Do you have a test case that I can use to reproduce this?
>>>>>
>>>>> Why is the line speed so slow?
>>>>>
>>>>
>>>
>>> Maybe this low speed requirement was something that came from our customers.
>>
>> So it is an actual requirement and not simply because the default console
>> speed is 9600?
>>
> 
> Apparently is the 9600 is the industry standard baud rate for
> switches/routers, so test our switch from console at same speed.
> 
>> If possible, I'd rather fix this problem by setting the console line speed
>> on the kernel command line from your bootloader.
>>
>> [...]
>>
>>>>>> +/*
>>>>>>   *   Wait for transmitter & holding register to empty
>>>>>> + *   If rx is true, then also poll for received characters if it's
>>>>>> + *   safe to process them.  This is only used in the SUPPORT_SYSRQ
>>>>>> + *   case, and hopefully the compiler can figure out it's dead code
>>>>>> + *   if that's not the case.
>>>>>>   */
>>>>>> -static void wait_for_xmitr(struct uart_8250_port *up, int bits)
>>>>>> +static void __wait_for_xmitr(struct uart_8250_port *up, int bits, bool
>>>>>> rx)
>>>>>>  {
>>>>>>       unsigned int status, tmout = 10000;
>>>>>>
>>>>>> @@ -1978,6 +2005,11 @@ static void wait_for_xmitr(struct uart_8250_port
>>>>>> *up, int bits)
>>>>>>                       break;
>>>>>>               if (--tmout == 0)
>>>>>>                       break;
>>>>>> +
>>>>>> +             if (rx && status & (UART_LSR_DR | UART_LSR_BI) &&
>>>>>> +                 !emergency_situation(&up->port))
>>>>>> +                     serial8250_rx_chars(up, status, true);
>>>>>> +
>>>>>
>>>>> serial8250_rx_chars() should not be used in this context because it loops
>>>>> as long as there is data to be read. That is unacceptably long with
>>>>> large fifos and fast line speeds.
>>>>>
>>>>> serial8250_rx_chars() will have to be refactored so that just a single
>>>>> char
>>>>> can be read from here. And what I said about not trying to handle anything
>>>>> other than data ready.
>>>>>
>>>>
>>>
>>> What I thought after your first comments was that, since it seems to take
>>> same amount of time for serial_in() compared to udelay(1), why not read one
>>> char from rx fifo instead of waiting with udelay().
>>
>> I did think about that too, but I'm not sure it matters much.
>> The line status is being heavily oversampled as it is.
>>
>>>>
>>>>>
>>>>>>               udelay(1);
>>>>>>       }
>>>>>>
>>>>>> @@ -1989,12 +2021,29 @@ static void wait_for_xmitr(struct uart_8250_port
>>>>>> *up, int bits)
>>>>>>                       up->msr_saved_flags |= msr & MSR_SAVE_FLAGS;
>>>>>>                       if (msr & UART_MSR_CTS)
>>>>>>                               break;
>>>>>> +
>>>>>> +                     if (rx && !emergency_situation(&up->port)) {
>>>>>> +                             status = serial_in(up, UART_LSR);
>>>>>> +
>>>>>> +                             up->lsr_saved_flags |= status &
>>>>>> LSR_SAVE_FLAGS;
>>>>>> +                             if (status & (UART_LSR_DR | UART_LSR_BI))
>>>>>> +                                     serial8250_rx_chars(up, status,
>>>>>> true);
>>>>>> +                     }
>>>>>> +
>>>>>
>>>>> Is UPF_CONS_FLOW part of the use case? If not, please drop.
>>>>
>>>
>>> No. not related to flow control.
>>
>> Ok.
>>
>> [...]
>>
>>>>>> +
>>>>>>  #ifdef CONFIG_CONSOLE_POLL
>>>>>>  /*
>>>>>>   * Console polling routines for writing and reading from the uart while
>>>>>> @@ -3168,7 +3217,7 @@ static void serial8250_console_putchar(struct
>>>>>> uart_port *port, int ch)
>>>>>>  {
>>>>>>       struct uart_8250_port *up = up_to_u8250p(port);
>>>>>>
>>>>>> -     wait_for_xmitr(up, UART_LSR_THRE);
>>>>>> +     __wait_for_xmitr(up, UART_LSR_THRE, true);
>>>>>
>>>>> Why is rx performed unconditionally if the problem set is limited
>>>>> to slow line speeds (I'm not convinced it is limited to slow
>>>>> line speeds though)?
>>>>>
>>>>
>>>
>>> not sure if i understood your question. we are doing tx and rx is done
>>> only if lsr has DR and BI
>>> bits set. not unconditional. the issue may be seen even at higher speeds.
>>
>> I meant unconditionally with respect to the observed problem:
>> the commit log indicates that the problem is limited to slow line speeds
>> on uarts with small fifos, but yet the patch applies to all uarts at
>> all line speeds.
>>
> 
> Got it.
> 
>> If the problem can happen at any line speed then the commit log shouldn't
>> say it only happens at slow line speeds.
>>
>>
> 
> Sure. The commit log could be changed to make it more generic.
> 
>>>>>
>>>>> What other port types and line speeds has this been tested on besides
>>>>> your 16550A design?
>>>>>
>>>>
>>>
>>> I can try at higher speeds as well and see if the issue gets reproduced. I
>>> don't have other port types. From what i read online, 16550s are quite
>>> popular commercially but maybe people use it at higher speeds.
>>
>> I have many and they do not all behave similarly, that's why I'm asking
>> which platforms and conditions this has been tested under.
>>
>>
> 
> UART is built into our motherboard on switches and we've seen this
> behavior on kernels 2.6.x and up to 3.19.
> 
> thanks.
> 
>>>>>
>>>>>>       serial_port_out(port, UART_TX, ch);
>>>>>>  }
>>>>>>
>>>>>> @@ -3176,6 +3225,10 @@ static void serial8250_console_putchar(struct
>>>>>> uart_port *port, int ch)
>>>>>>   *   Print a string to the serial port trying not to disturb
>>>>>>   *   any possible real use of the port...
>>>>>>   *
>>>>>> + *   In order to avoid dropped input if the output is longer than
>>>>>> + *   the available hardware FIFO space, we try to handle incoming
>>>>>> + *   characters during this polling loop.
>>>>>> + *
>>>>>>   *   The console_lock must be held when we get here.
>>>>>>   */
>>>>>>  static void
>>>>>> @@ -3214,7 +3267,7 @@ serial8250_console_write(struct console *co, const
>>>>>> char *s, unsigned int count)
>>>>>>        *      Finally, wait for transmitter to become empty
>>>>>>        *      and restore the IER
>>>>>>        */
>>>>>> -     wait_for_xmitr(up, BOTH_EMPTY);
>>>>>> +     __wait_for_xmitr(up, BOTH_EMPTY, true);
>>>>>
>>>>> Is rx actually necessary here? How big is the tx shifter? Because the tx
>>>>> fifo
>>>>> can have at most 1 char right now and any pending rx was done just before
>>>>> that.
>>>>>
>>>
>>> Yes, rx here is similar to the rx in serial8250_console_putchar().
>>
>> I think you may misunderstand what I mean here.
>>
>> I don't think rx could overflow while waiting for the transmitter to empty
>> here:
>>
>> 1. the last output char has just been written to the tx fifo by
>>    serial8250_console_char()
>> 2. this means the tx fifo can have at most 1 char in it right now
>> 3. the rx fifo was just read at most 1us before the tx fifo was written
>> 4. so this call to wait_for_xmitr() is only waiting for 1 char to be output
>>    plus how ever much data the shifter can hold, which probably isn't much
>>
>>
>> How could the rx fifo overflow in the time it takes the xmitr to send
>> 1-2 chars, especially since the rx fifo was being polled at 1000x oversample
>> (every 1us for each 1ms tx char)?

--
To unsubscribe from this list: send the line "unsubscribe linux-serial" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html




[Index of Archives]     [Kernel Newbies]     [Security]     [Netfilter]     [Bugtraq]     [Linux PPP]     [Linux FS]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Samba]     [Video 4 Linux]     [Linmodem]     [Device Mapper]     [Linux Kernel for ARM]

  Powered by Linux