On Mon, Jan 16, 2023 at 6:08 AM David Howells <dhowells@xxxxxxxxxx> wrote:
>
> I'm not sure how relevant it is to the topic, but I seem to remember you
> having a go at implementing spinlocks with x86_64 memory transaction
> instructions a while back. Do you have any thoughts on whether these
> instructions are ever likely to become something we can use?

Nope. Not only are they buggy, the actual cost of them was prohibitive
too.

The only implementation I ever did was what then became 'lockref' (so
this was about a decade ago), and using transactional memory was
actually quite noticeably *slower* than just using a spinlock in the
common cases (particularly the uncontended case - which is actually the
common one by far, despite some benchmarks).

And while the argument could be that transactional memory has improved
in the last decade (narrator: "It hasn't"), the problem is actually
quite fundamental.

The whole "best case scenario" for transactional memory is literally
optimized and designed for the rare circumstance - the case where you
have contention, but where the locking part is basically an idempotent
operation (so "lock+unlock" ends up undoing each other and can use
shared cachelines rather than bouncing the cacheline).

And contention pretty much happens in one situation, and one situation
only: in a combination of (a) bad locking and (b) benchmarks.

And for the kernel, where we don't have bad locking, and where we
actually use fine-grained locks that are _near_ the data that we are
locking (the lockref of the dcache is obviously one example of that,
but the skbuff queue you mention is almost certainly exactly the same
situation): the lock is right by the data that the lock protects, and
the "shared lock cacheline" model simply does not work. You'll bounce
the data, and most likely you'll also touch the same lock cacheline
too.

I was quite excited about transactional memory, but have (obviously)
come to the conclusion that it was one of those things that look good
in theory, but are just garbage in reality.

So this isn't a "the x86 implementation was bad" issue. This is a
"transactional memory is a bad idea" situation. It complicates the most
important part of the core CPU (the memory pipeline) enormously, and it
doesn't actually really work.

Now, with that "transactional memory is garbage" in mind, let me just
mention that there are obviously other circumstances:

 (a) horrendously bad locking

     If you have a single centralized lock, and you really have serious
     contention on it in real life all the time (and not just under
     microbenchmarks), and the data that gets touched is spread out all
     over, and you simply cannot fix the locking, then all those
     theoretical advantages of transactional memory can actually come
     into play.

     But again: even then, there's an absolutely huge cost to the
     memory pipeline. So that advantage really doesn't come free.

     Nobody has ever gotten transactional memory to actually work well
     in real life. Not Intel, not IBM with powerpc HTM. That should
     tell you something. The hw complexity cost is very very real.

     But Intel actually had some loads where TSX helped. I think it was
     SAP HANA, and I really think that what you should take away from
     that is that SAP HANA locking is likely horrible garbage inside.

     But maybe I'm biased, and maybe SAP HANA is lovely, and maybe it
     just happened to work really well for TSX. I've never actually
     looked at that load, but if I were a betting man...

 (b) very *limited* transactions are probably a great idea

     LL/SC is arguably a form of "transactional memory", just with a
     single word.

     Now, I think atomics are generally better than LL/SC when it
     really is just a single word, because most of the time "single
     word" also means "simple semantics". And while CMPXCHG loops
     aren't lovely when the semantics are a bit more complex, they are
     nice and portable, and work at a higher level than assembler
     sequences - and as such are actually a *lot* better than LL/SC in
     practice.
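     To make that concrete, here is a minimal sketch of that kind of
     cmpxchg loop - essentially the 'lockref' idea mentioned above,
     with a spinlock and a reference count packed into one 64-bit
     word. It assumes GCC-style __atomic builtins, and the layout and
     names are simplified for illustration; this is not the actual
     lib/lockref.c code:

	#include <stdint.h>
	#include <stdbool.h>

	struct lockref_sketch {
		/* low 32 bits: lock word, high 32 bits: count */
		uint64_t lock_count;
	};

	static bool lockref_get_fast(struct lockref_sketch *lr)
	{
		uint64_t old = __atomic_load_n(&lr->lock_count,
					       __ATOMIC_RELAXED);

		/* Only try while the lock half looks unlocked */
		while (!(old & 0xffffffffull)) {
			uint64_t new = old + (1ull << 32); /* count++ */

			if (__atomic_compare_exchange_n(&lr->lock_count,
							&old, new, false,
							__ATOMIC_ACQUIRE,
							__ATOMIC_RELAXED))
				return true; /* count bumped, lock untouched */
			/* failed cmpxchg reloaded 'old'; retry */
		}
		return false; /* locked: fall back to spin_lock + count++ */
	}

     The fast path bumps the count with a single cmpxchg on the very
     word the spinlock lives in; only if the lock is observed held (or
     the cmpxchg keeps failing) does the caller fall back to actually
     taking the spinlock. Note that the lock and the count share a
     cacheline by design - exactly the "lock right by the data" layout
     that makes the shared-lock-cacheline model of TSX pointless.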
     But an "N-word LL/SC with forward progress guarantees" would be
     objectively stronger than atomics, and would allow for doing
     low-level data structures in ways that atomics don't. Doubly
     linked lists together with a lock check would argue for "N" being
     ~5 memory operations (the two neighbours' link fields, the new
     node's two link fields, plus the lock word itself).

So I do believe in potentially *limited* memory transactions, when they
can be made limited enough that you have

 (1) forward progress guarantees, with no "separate fallback on
     contention" garbage, and

 (2) a hw implementation that stays simple - just a store buffer and
     an (architecturally) very limited number of cacheline accesses
     that you can actually order in HW, so that you don't have ABBA
     situations.

But the generic kind of transactional memory that Intel tried with TSX
(and IBM with HTM) is pure garbage, and almost certainly always will
be.

And all of the above is obviously just my opinionated input on it.

But I just happen to be right.

                Linus