Re: [PATCH v1 tty] 8250: microchip: pci1xxxx: Refactor TX Burst code to use pre-existing APIs

Jiri Slaby <jirislaby@xxxxxxxxxx> · Tue, 5 Mar 2024 08:19:27 +0100

On 05. 03. 24, 5:15, Rengarajan.S@xxxxxxxxxxxxx wrote:
Hi Jiri,

On Mon, 2024-03-04 at 07:19 +0100, Jiri Slaby wrote:

EXTERNAL EMAIL: Do not click links or open attachments unless you
know the content is safe

On 04. 03. 24, 5:37, Rengarajan.S@xxxxxxxxxxxxx wrote:
Hi Jiri,

On Fri, 2024-02-23 at 10:26 +0100, Jiri Slaby wrote:

On 23. 02. 24, 10:21, Rengarajan.S@xxxxxxxxxxxxx wrote:
On Fri, 2024-02-23 at 07:08 +0100, Jiri Slaby wrote:

On 22. 02. 24, 14:49, Rengarajan S wrote:
Updated the TX Burst implementation by changing the circular buffer
processing with the pre-existing APIs in kernel. Also updated
conditional statements and alignment issues for better readability.

Hi,

so why are you keeping the nested double loop?

Hi, in order to differentiate burst mode handling from byte mode, we
had separate loops for both. Since having a single while loop also does
not align with the RX implementation (where we have separate handling
for burst and byte), we have retained the double loop.

So obviously, align RX to a single loop if possible. The current TX
code is very hard to follow and sort of unmaintainable (and buggy). And
IMO it's unnecessary as I proposed [1]. And even if RX cannot be one
loop, you still can make TX easy to read as the two need not be the
same.

[1]
https://lore.kernel.org/all/b8325c3f-bf5b-4c55-8dce-ef395edce251@xxxxxxxxxx/

while (data_empty_count) {
     cnt = CIRC_CNT_TO_END();
     if (!cnt)
          break;
     if (cnt < UART_BURST_SIZE || (tail & 3)) { // is_unaligned()
          writeb();
          cnt = 1;
     } else {
          writel();
          cnt = UART_BURST_SIZE;
     }
     uart_xmit_advance(cnt);
     data_empty_count -= cnt;
}
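
The single-loop shape above can be tried outside the kernel with a small
userspace model. This is only a sketch under assumptions: fifo_write()
stands in for writeb()/writel(), circ_cnt_to_end() copies the arithmetic
of CIRC_CNT_TO_END(), and struct xmit_model, XMIT_SIZE and BURST_SIZE are
hypothetical names. The extra "room < BURST_SIZE" test is not in the
proposal; it is added here so the unsigned counter cannot wrap.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define XMIT_SIZE  16u  /* stand-in for UART_XMIT_SIZE; power of two */
#define BURST_SIZE 4u   /* stand-in for UART_BURST_SIZE */

/* Hypothetical model of the circular buffer plus a fake TX FIFO. */
struct xmit_model {
	uint8_t buf[XMIT_SIZE];
	unsigned int head, tail;
	uint8_t out[64];        /* bytes the fake FIFO has received */
	unsigned int out_len;
	unsigned int byte_writes, dword_writes;
};

/* Same arithmetic as CIRC_CNT_TO_END(): bytes readable before the wrap. */
static unsigned int circ_cnt_to_end(unsigned int head, unsigned int tail)
{
	unsigned int end = XMIT_SIZE - tail;
	unsigned int n = (head + end) & (XMIT_SIZE - 1);

	return n < end ? n : end;
}

/* Stand-in for writeb()/writel(): copy len bytes into the fake FIFO. */
static void fifo_write(struct xmit_model *m, const uint8_t *src,
		       unsigned int len)
{
	assert(m->out_len + len <= sizeof(m->out));
	memcpy(&m->out[m->out_len], src, len);
	m->out_len += len;
	if (len == 1)
		m->byte_writes++;
	else
		m->dword_writes++;
}

/* Single loop: pick byte or burst per iteration, then advance the tail. */
static void drain(struct xmit_model *m, unsigned int room)
{
	while (room) {
		unsigned int cnt = circ_cnt_to_end(m->head, m->tail);

		if (!cnt)
			break;
		/* "room < BURST_SIZE" is this sketch's addition (no underflow). */
		if (cnt < BURST_SIZE || room < BURST_SIZE || (m->tail & 3))
			cnt = 1;
		else
			cnt = BURST_SIZE;
		fifo_write(m, &m->buf[m->tail], cnt);
		/* Models what uart_xmit_advance() does to the tail index. */
		m->tail = (m->tail + cnt) & (XMIT_SIZE - 1);
		room -= cnt;
	}
}
```

With 10 bytes queued starting at tail = 2, the model issues two byte
writes to reach a 4-byte boundary and then two burst writes.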

With the above implementation we are observing a performance drop of
2 Mbps at a baud rate of 4 Mbps. The reason is that on each iteration we
check whether the data needs to be processed via DWORDs or bytes. The
condition check on each iteration is causing the drop in performance.

Hi,

the check is by several orders of magnitude faster than the I/O proper.
So I don't think that's the root cause.

With the previous implementation (with nested loops) the performance is
found to be around 4 Mbps at a baud rate of 4 Mbps. In that
implementation we send DWORDs continuously until the transfer size is
< 4. Can you let us know of any other alternatives to avoid the above
performance drop?

Could you attach the patch you are testing?

Please find the updated pci1xxxx_process_write_data:

         u32 xfer_cnt;

         while (*valid_byte_count) {
                 xfer_cnt = CIRC_CNT_TO_END(xmit->head, xmit->tail,
                                            UART_XMIT_SIZE);

                 if (!xfer_cnt)
                         break;

                 if (xfer_cnt < UART_BURST_SIZE || (xmit->tail & 3)) {

Hi,

OK, is it different if you remove the alignment checking (which should
be the correct™ thing to do, but may/will slow down things on platforms
which don't care)?

                         writeb(xmit->buf[xmit->tail], port->membase +
                                UART_TX_BYTE_FIFO);
                         xfer_cnt = UART_BYTE_SIZE;
                 } else {
                         writel(*(u32 *)&xmit->buf[xmit->tail],

If you remove the "tail & 3" check, you can use get_unaligned() here and 
need not care about unaligned accesses after all...

                                port->membase + UART_TX_BURST_FIFO);
                         xfer_cnt = UART_BURST_SIZE;
                 }

                 uart_xmit_advance(port, xfer_cnt);
                 *data_empty_count -= xfer_cnt;
                 *valid_byte_count -= xfer_cnt;
         }
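
To illustrate the get_unaligned() suggestion made inline above: the idea
is to replace the "*(u32 *)" cast with a load that is valid at any
address, so the "tail & 3" test is no longer needed for correctness. A
minimal userspace sketch of that idea follows. It uses memcpy() as a
stand-in for the kernel helper, and load_u32_unaligned() is a
hypothetical name, not a kernel API.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Read 32 bits from an address that may not be 4-byte aligned.
 * memcpy() lets the compiler pick a safe access sequence for the
 * target; on platforms that tolerate unaligned loads it usually
 * compiles down to a single load instruction.
 */
static uint32_t load_u32_unaligned(const void *p)
{
	uint32_t v;

	memcpy(&v, p, sizeof(v));
	return v;
}
```

In the loop above, the burst branch would then pass the result of such a
load on &xmit->buf[xmit->tail] to writel() instead of dereferencing a
cast pointer.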

Testing is done via minicom by transferring a 10 MB file at 4 Mbps.

After the minicom transfer with a single instance:

Previous implementation (nested while loops):
Transferred 10 MB at 3900000 CPS

Current implementation:
Transferred 10 MB at 2459999 CPS
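
For reference, the relative difference between the two minicom readings
can be computed as below. This is plain arithmetic on the reported CPS
figures; drop_percent() is a hypothetical helper name. The two values
above work out to a drop of roughly 37%.

```c
#include <assert.h>

/* Relative throughput drop, in percent, between two CPS readings. */
static double drop_percent(double before_cps, double after_cps)
{
	return (before_cps - after_cps) / before_cps * 100.0;
}
```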

--
js
suse labs