> Not so much confused as simply merged. Input processing is inherently
> single-threaded; it makes sense to rely on that at the highest level
> possible.

I would disagree entirely. You want to minimise the areas affected by a
given lock. You also want to lock data not code. Correctness comes before
speed. You optimise it when it's right, otherwise you end up in a nasty
mess when you discover you've optimised to assumptions that are flawed.

> On smp, locked instructions and cache-line contention on the tty_buffer
> list ptrs and read_buf indices account for more than 90% of the time cost
> in the read path for real hardware (and over 95% for ptys).

Yes I'm uncomfortably aware of that for modern SMP hardware, and also that
simply ripping out the buffering will screw the real low end people (eg
M68K and friends)

> Firewire, which is capable of sustained throughput in excess of 40MB/sec,
> struggles to get over 5MB/sec through the tty layer. [And drm output
> is orders-of-magnitude slower than that, which is just sad...]

And what protocols do you care about 5MB/second - n_tty - no ? For the
high speed protocols you are trying to fix a lost cause. By the time we've
gone piddling around with tty buffers and serialized tty queues firing
bytes through tasks and the like you already lost.

For drm I assume you mean the framebuffer console logic ? Last time I
benched that except for the Poulsbo it was bottlenecked on the GPU - not
that I can type at 5MB/second anyway. Not that fixing the performance of
the various bits wouldn't be a good thing too especially on the output
end.

> While that would work, it's expensive extra locking in a path that 99.999%
> of the time doesn't need it. I'd rather explore other solutions.

How about getting the high speed paths out of the whole tty buffer layer ?
Almost every line discipline can be a fastpath directly to the network
layer.
If optimisation is the new obsession then we can cut the crap entirely by
optimising for networking not making it a slave of n_tty.

Starting at the beginning

we have locks on rx because
- we want serialized rx
- we have buffer lifetimes
- we have buffer queues
- we have loads of flow control parameters

Only n_tty needs the buffers (maybe some of irda but irda hasn't worked
for years afaik). IRQ receive paths are serialized (and as a bonus can be
pinned to a CPU). Flow control is n_tty stuff, everyone else simply fires
it at their network layer as fast as possible and net already does the
work.

Keep a single tty_buf in the tty for batching at any given time, and
private so no locks at all

Have a wrapper via

	ld->receive(tty, buf)

which fires the tty_buf at the ldisc and allocates a new empty one

	tty_queue_bytes(tty, buf, flags, len)

which adds to the buffer, and if full calls ld->queue and then carries on
the copying cycle

and

	ld->receive_direct(tty, buf, flags, len)

which allows block mode devices to blast bytes directly at the queue (ie
all the USB 3G stuff, firewire, etc) without going via any additional
copies.

For almost all ldiscs

	ld->receive would be

		ld->receive_direct(tty, buf->buf, buf->flags, buf->len);
		free buffer

For n_tty type stuff

	ld->receive is basically much of tty_flip_buffer_push
	ld->receive_direct allocates tty_buffers and copies into it

We may even be able to optimise some of the n_tty cases into the fastpath
afterwards (notably raw, no echo)

For anything receiving in blocks that puts us close to (but not quite at)
ethernet kinds of cleanness for network buffer delivery.

Worth me looking into ?

> The clock/generation method seems like it might yield a lockless solution
> for this problem, but maybe creates another one because the driver-side
> would need to stamp the buffer (in essence, a flush could affect data
> that has not yet been copied from the driver).

But it has arrived in the driver so might not matter. That requires a
little thought!
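For concreteness, here is a rough userspace sketch of the receive path
proposed earlier in this mail: one private batching buffer per tty, the
push wrapper, tty_queue_bytes, and a network-style ldisc whose receive is
just receive_direct plus a free. None of this is existing kernel code.
The struct layouts, TTY_BUF_SIZE, tty_init, tty_push and net_ldisc are all
made up for illustration, the flags argument is assumed to be one flag
byte per data byte, and the "ld->queue" step is assumed to be the same
thing as the ld->receive wrapper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define TTY_BUF_SIZE 8	/* hypothetical; tiny so the demo fills it quickly */

struct tty;

/* Hypothetical batching buffer: data plus one flag byte per data byte. */
struct tty_buf {
	unsigned char buf[TTY_BUF_SIZE];
	unsigned char flags[TTY_BUF_SIZE];
	size_t len;
};

/* Hypothetical ldisc ops, named after ld->receive / ld->receive_direct. */
struct ldisc_ops {
	void (*receive)(struct tty *tty, struct tty_buf *b); /* owns b */
	void (*receive_direct)(struct tty *tty, const unsigned char *buf,
			       const unsigned char *flags, size_t len);
};

struct tty {
	const struct ldisc_ops *ld;
	struct tty_buf *batch;	/* single private buffer, rx path only */
	size_t delivered;	/* demo bookkeeping: bytes seen by the ldisc */
	size_t calls;		/* demo bookkeeping: receive_direct calls */
};

static struct tty_buf *tty_buf_new(void)
{
	struct tty_buf *b = calloc(1, sizeof(*b));

	if (!b)
		abort();	/* real code needs a proper failure path */
	return b;
}

void tty_init(struct tty *tty, const struct ldisc_ops *ld)
{
	tty->ld = ld;
	tty->batch = tty_buf_new();
	tty->delivered = 0;
	tty->calls = 0;
}

/* The wrapper: fire the current tty_buf at the ldisc, start an empty one. */
void tty_push(struct tty *tty)
{
	struct tty_buf *full = tty->batch;

	if (full->len == 0)
		return;
	tty->batch = tty_buf_new();
	tty->ld->receive(tty, full);
}

/* Add bytes to the batch; when it fills, push it and carry on copying. */
void tty_queue_bytes(struct tty *tty, const unsigned char *buf,
		     const unsigned char *flags, size_t len)
{
	while (len > 0) {
		struct tty_buf *b = tty->batch;
		size_t room = TTY_BUF_SIZE - b->len;
		size_t n = len < room ? len : room;

		assert(room > 0);	/* a full batch is always pushed below */
		memcpy(b->buf + b->len, buf, n);
		memcpy(b->flags + b->len, flags, n);
		b->len += n;
		buf += n;
		flags += n;
		len -= n;
		if (b->len == TTY_BUF_SIZE)
			tty_push(tty);
	}
}

/* Network-style ldisc: stand-in for handing bytes to the network layer. */
static void net_receive_direct(struct tty *tty, const unsigned char *buf,
			       const unsigned char *flags, size_t len)
{
	(void)buf;
	(void)flags;
	tty->delivered += len;
	tty->calls++;
}

/* For almost all ldiscs: receive is receive_direct, then free the buffer. */
static void net_receive(struct tty *tty, struct tty_buf *b)
{
	tty->ld->receive_direct(tty, b->buf, b->flags, b->len);
	free(b);
}

const struct ldisc_ops net_ldisc = {
	.receive = net_receive,
	.receive_direct = net_receive_direct,
};
```

With the buffer size set to 8, queueing 20 bytes through tty_queue_bytes
hands the ldisc two full buffers and leaves 4 bytes batched until the next
tty_push. A block mode driver would skip all of that and call
tty->ld->receive_direct on its own buffer. The sketch takes no locks
because it assumes, as above, that the rx path is already serialized.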
Alan