Re: [RFC PATCH 0/3] restartable sequences benchmarks

On Thu, Oct 22, 2015 at 11:06 AM, Dave Watson <davejwatson@xxxxxx> wrote:
> We've been testing out restartable sequences + malloc changes for use
> at Facebook.  Below are some test results, as well as some possible
> changes based on Paul Turner's original patches

Thanks!  I'll stare at this some time between now and Kernel Summit.

>
> https://lkml.org/lkml/2015/6/24/665
>
> I ran one service with several permutations of various mallocs.  The
> service is CPU-bound, and hits the allocator quite hard.  Requests/s
> are held constant at the source, so we use CPU idle time and latency
> as an indicator of service quality. These are average numbers over
> several hours.  Machines were dual E5-2660, total 16 cores +
> hyperthreading.  This service has ~400 total threads, 70-90 of which
> are doing work at any particular time.
>
>                                    RSS CPUIDLE LATENCY(ms)
> jemalloc 4.0.0                     31G   33%     390
> jemalloc + this patch              25G   33%     390
> jemalloc + this patch using lsl    25G   30%     420
> jemalloc + PT's rseq patch         25G   32%     405
> glibc malloc 2.20                  27G   30%     420
> tcmalloc gperftools trunk (2.2)    21G   30%     480

Slightly confused.  This is showing a space efficiency improvement but
not a performance improvement?  Is the idea that percpu free lists are
more space efficient than per-thread free lists?

>
> jemalloc rseq patch used for testing:
> https://github.com/djwatson/jemalloc
>
> lsl test - using lsl segment limit to get cpu (i.e. inlined vdso
> getcpu on x86) instead of using the thread caching as in this patch.
> There have been some suggestions to add the thread-cached getcpu()
> feature separately.  It does seem to move the needle in a real service
> by about 3% to have a thread-cached getcpu vs. not.  I don't think we
> can use restartable sequences in production without a faster getcpu.

If nothing else, I'd like to replace the thread-cached getcpu thing
with percpu gsbase, at least on x86.  That doesn't necessarily have to
be exclusive with restartable sequences.
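
To illustrate why the cost of getcpu matters for this kind of workload,
here is a minimal sketch (not taken from any of the patches in this
thread) of a per-CPU counter whose fast path starts with a CPU lookup.
It uses glibc's sched_getcpu(); MAX_CPUS, percpu_counter,
current_cpu_slot, percpu_inc and percpu_sum are hypothetical names, and
the 64-byte cache line is an assumption.  The thread can migrate between
the lookup and the update, so the sketch still needs an atomic add to
stay correct.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdatomic.h>

/* Declared again in case <sched.h> was included before _GNU_SOURCE was set. */
int sched_getcpu(void);

#define MAX_CPUS 1024 /* assumption: upper bound on logical CPU ids */

/* One slot per CPU, padded to an assumed 64-byte cache line. */
struct percpu_counter {
    _Alignas(64) atomic_long value;
};

static struct percpu_counter counters[MAX_CPUS];

/* The CPU lookup sits on the fast path of every per-CPU operation. */
static int current_cpu_slot(void)
{
    int cpu = sched_getcpu();
    if (cpu < 0)
        cpu = 0; /* lookup failed: fall back to slot 0 */
    return cpu % MAX_CPUS;
}

/*
 * Increment the slot of the CPU we were on at lookup time.  We may have
 * migrated since, so the add stays atomic.
 */
static void percpu_inc(void)
{
    atomic_fetch_add_explicit(&counters[current_cpu_slot()].value, 1,
                              memory_order_relaxed);
}

/* Slow path: fold all per-CPU slots into one total. */
static long percpu_sum(void)
{
    long total = 0;
    for (int i = 0; i < MAX_CPUS; i++)
        total += atomic_load_explicit(&counters[i].value,
                                      memory_order_relaxed);
    return total;
}
```

Calling percpu_inc() from many threads and percpu_sum() from a
reporting thread gives the usual sharded-counter pattern; every
increment pays for one CPU lookup, which is the cost being discussed
above.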

>
> GS-segment / migration only tests
>
> There's been some interest in seeing if we can do this with only the gs
> segment; here are some numbers for that.  This doesn't have to be gs;
> it could just as well be a migration signal sent to userspace, and the
> same approaches would apply.
>
> GS patch: https://lkml.org/lkml/2014/9/13/59
>
>                                    RSS CPUIDLE LATENCY(ms)
> jemalloc 4.0.0                     31G   33%     390
> jemalloc + percpu locking          25G   25%     420
> jemalloc + preempt lock / signal   25G   32%     415

Neat!

--Andy
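
As a rough illustration of what a "percpu locking" scheme like the one
in the table above might look like, here is a minimal sketch.  It is
not the code that was benchmarked: percpu_cache, cache_lock,
cache_push_on, cache_pop_on, cache_push, cache_pop and cache_total are
hypothetical names, the spinlock is a stand-in for whatever lock the
real patch uses, and MAX_CPUS and the 64-byte alignment are assumptions.
Each CPU gets its own free list guarded by its own lock, so a thread
that migrates after the CPU lookup still operates safely, just on a
remote list.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Declared again in case <sched.h> was included before _GNU_SOURCE was set. */
int sched_getcpu(void);

#define MAX_CPUS 1024 /* assumption: upper bound on logical CPU ids */

struct free_node {
    struct free_node *next;
};

/* One free list per CPU, each with its own lock (64-byte line assumed). */
struct percpu_cache {
    _Alignas(64) atomic_bool locked;
    struct free_node *head;
    size_t count;
};

static struct percpu_cache caches[MAX_CPUS];

/* Spin until we own the lock for the given CPU's cache. */
static struct percpu_cache *cache_lock(int cpu)
{
    /* The unsigned cast maps a failed lookup (-1) to a valid slot. */
    struct percpu_cache *c = &caches[(unsigned)cpu % MAX_CPUS];
    while (atomic_exchange_explicit(&c->locked, true, memory_order_acquire))
        ; /* spin until the previous holder releases */
    return c;
}

static void cache_unlock(struct percpu_cache *c)
{
    atomic_store_explicit(&c->locked, false, memory_order_release);
}

/* Push a freed object (at least pointer-sized) onto one CPU's list. */
static void cache_push_on(int cpu, void *obj)
{
    struct free_node *n = obj;
    struct percpu_cache *c = cache_lock(cpu);
    n->next = c->head;
    c->head = n;
    c->count++;
    cache_unlock(c);
}

/* Pop an object from one CPU's list, or NULL if that list is empty. */
static void *cache_pop_on(int cpu)
{
    struct percpu_cache *c = cache_lock(cpu);
    struct free_node *n = c->head;
    if (n) {
        c->head = n->next;
        c->count--;
    }
    cache_unlock(c);
    return n;
}

/* Fast-path wrappers: look up the current CPU, then lock its list. */
static void cache_push(void *obj) { cache_push_on(sched_getcpu(), obj); }
static void *cache_pop(void) { return cache_pop_on(sched_getcpu()); }

/* Count cached objects across all CPUs (slow path). */
static size_t cache_total(void)
{
    size_t total = 0;
    for (int i = 0; i < MAX_CPUS; i++) {
        struct percpu_cache *c = cache_lock(i);
        total += c->count;
        cache_unlock(c);
    }
    return total;
}
```

Every push and pop here pays for a CPU lookup plus an atomic exchange
on the lock word.  In the table above the percpu locking variant shows
lower CPU idle than the baseline (25% vs. 33%), and per-operation
overhead of this kind is one plausible contributor.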