On Thu, Feb 28, 2013 at 7:13 AM, Rik van Riel <riel@xxxxxxxxxx> wrote:
>
> Btw, the IPC lock is already fairly fine grained. One ipc
> lock is allocated for each set of semaphores allocated through
> sys_semget. Looking up those semaphores in the namespace, when
> they are used later, is done under RCU.

Bullshit. There is no fine-grained locking, and the code has never ever
been optimized for concurrency at all.

Yes, the lookup is done with RCU, and that's good. And I can tell you
*why* it's done with RCU: all the trivial sysv semaphore benchmarks I've
seen tend to have just one lock per semget, and then they test the
performance of that semaphore.

But I'm suspecting that since there is clearly contention going on (and
hopefully the real-world people aren't just all pounding on one single
mutex), maybe these loads actually allocate multiple semaphores with
semget (that 'nsems' argument). And then we scale really badly, because
we use a single spinlock for all of them, even if we didn't have to.

THERE IS ABSOLUTELY ZERO SCALING. None. Nada.

The scaling that exists (your RCU lookup example) is for *unrelated*
locks, which is exactly what you'd expect if all the simple benchmarks
around did a single sem-lock, and then the "scaling" was tested by
running a thousand copies of that program - all using their own private
single ipc semaphore.

Plus the code in ipc/sem.c really is not very good. It does that
"sem_lock()" (which takes the ipc spinlock) way too eagerly, because the
semaphore lock also protects some trivial ipc stuff. So it takes the
lock *first*, then it does all the random checks it wants to do
(security checks, you name it), and then it does all the queue undo crap
etc. All while holding that lock.

So no. There's no scaling, and the SINGLE lock that is held is held way
too f*cking long.

And that's what I'm talking about. We've never had people actually
bother about the IPC locking, because we've never had people who cared.
Everybody hates the crazy SysV IPC code, often for good reason. People
are told not to use it (use shared memory and futexes instead, which
actually *do* scale, and which people *do* care about). But of course,
lots of projects do use the horrid SysV IPC code because it's portable
and it does have those undo features etc.

And I'm sure we *could* fix it, but judging by past performance nobody
really cares enough. Which is why I do think that fixing it the wrong
way (by just reducing the contention on the ipc_lock()) may well be the
right thing here for once.

Of course, if the real-world applications really only use a single SysV
semaphore, then we can't scale any better. But even then we could still
try to lower the hold times of that ipc_lock().

And hold time really tends to be one big killer for contention. A lot of
hardware tends to have ping-pong avoidance logic, which means that if
you unlock quickly after getting the lock, the line will still be in the
local node caches. Hardware features like that make it much harder to
hit the really bad contention cases even if there are tons of CPUs
trying to access the same lock. But if the lock hold times are long,
it's *much* easier to get bad contention.

Looking at the code, if we could just get the security hook outside of
the locked region, that would probably be a big improvement. Maybe do
that part in the RCU lookup part? Because selinux really is very
expensive, with the whole ipc_has_perm -> avc_has_perm_flags thing etc
etc. This is exactly the kind of thing we did with pathnames.

I'm sure there are other things we could do to improve ipc lock times
even if we don't actually split the lock, but the security one might be
a good first step.

                  Linus