The patch titled posix-timers: RCU conversion has been added to the -mm tree. Its filename is posix-timers-rcu-conversion.patch Before you just go and hit "reply", please: a) Consider who else should be cc'ed b) Prefer to cc a suitable mailing list as well c) Ideally: find the original patch on the mailing list and do a reply-to-all to that, adding suitable additional cc's *** Remember to use Documentation/SubmitChecklist when testing your code *** See http://userweb.kernel.org/~akpm/stuff/added-to-mm.txt to find out what to do about this The current -mm tree may be found at http://userweb.kernel.org/~akpm/mmotm/ ------------------------------------------------------ Subject: posix-timers: RCU conversion From: Eric Dumazet <eric.dumazet@xxxxxxxxx> Ben Nagy reported a scalability problem with KVM/QEMU that hit very hard a single spinlock (idr_lock) in posix-timers code, on its 48 core machine. Even on a 16 cpu machine (2x4x2), a single test can show 98% of cpu time used in ticket_spin_lock, from lock_timer Ref: http://www.spinics.net/lists/kvm/msg51526.html Switching to RCU is quite easy, IDR being already RCU ready. idr_lock should be locked only for an insert/delete, not a lookup. Benchmark on a 2x4x2 machine, 16 processes calling timer_gettime(). Before : real 1m18.669s user 0m1.346s sys 1m17.180s After : real 0m3.296s user 0m1.366s sys 0m1.926s Reported-by: Ben Nagy <ben@xxxxxxxx> Signed-off-by: Eric Dumazet <eric.dumazet@xxxxxxxxx> Tested-by: Ben Nagy <ben@xxxxxxxx> Cc: Avi Kivity <avi@xxxxxxxxxx> Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx> Cc: John Stultz <johnstul@xxxxxxxxxx> Cc: Richard Cochran <richard.cochran@xxxxxxxxxx> Cc: Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx> Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> --- include/linux/posix-timers.h | 1 + kernel/posix-timers.c | 25 ++++++++++++++----------- 2 files changed, 15 insertions(+), 11 deletions(-) diff -puN include/linux/posix-timers.h~posix-timers-rcu-conversion include/linux/posix-timers.h --- a/include/linux/posix-timers.h~posix-timers-rcu-conversion +++ a/include/linux/posix-timers.h @@ -81,6 +81,7 @@ struct k_itimer { unsigned long expires; } mmtimer; } it; + struct rcu_head rcu; }; struct k_clock { diff -puN kernel/posix-timers.c~posix-timers-rcu-conversion kernel/posix-timers.c --- a/kernel/posix-timers.c~posix-timers-rcu-conversion +++ a/kernel/posix-timers.c @@ -491,6 +491,13 @@ static struct k_itimer * alloc_posix_tim return tmr; } +static void k_itimer_rcu_free(struct rcu_head *head) +{ + struct k_itimer *tmr = container_of(head, struct k_itimer, rcu); + + kmem_cache_free(posix_timers_cache, tmr); +} + #define IT_ID_SET 1 #define IT_ID_NOT_SET 0 static void release_posix_timer(struct k_itimer *tmr, int it_id_set) @@ -503,7 +510,7 @@ static void release_posix_timer(struct k } put_pid(tmr->it_pid); sigqueue_free(tmr->sigq); - kmem_cache_free(posix_timers_cache, tmr); + call_rcu(&tmr->rcu, k_itimer_rcu_free); } static struct k_clock *clockid_to_kclock(const clockid_t id) @@ -631,22 +638,18 @@ out: static struct k_itimer *__lock_timer(timer_t timer_id, unsigned long *flags) { struct k_itimer *timr; - /* - * Watch out here. We do a irqsave on the idr_lock and pass the - * flags part over to the timer lock. Must not let interrupts in - * while we are moving the lock. - */ - spin_lock_irqsave(&idr_lock, *flags); + + rcu_read_lock(); timr = idr_find(&posix_timers_id, (int)timer_id); if (timr) { - spin_lock(&timr->it_lock); + spin_lock_irqsave(&timr->it_lock, *flags); if (timr->it_signal == current->signal) { - spin_unlock(&idr_lock); + rcu_read_unlock(); return timr; } - spin_unlock(&timr->it_lock); + spin_unlock_irqrestore(&timr->it_lock, *flags); } - spin_unlock_irqrestore(&idr_lock, *flags); + rcu_read_unlock(); return NULL; } _ Patches currently in -mm which might be from eric.dumazet@xxxxxxxxx are linux-next.patch vfs-avoid-large-kmallocs-for-the-fdtable.patch posix-timers-rcu-conversion.patch net-convert-%p-usage-to-%pk.patch percpu_counter-change-return-value-and-add-comments.patch -- To unsubscribe from this list: send the line "unsubscribe mm-commits" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html