On 10/01/2015 10:32 AM, Curley, Matthew wrote:
We've been trying to reproduce the allocator performance impact on 4K random reads seen in the Hackathon (and more recent tests). At this point though, we're not seeing any significant difference between tcmalloc and jemalloc so we're looking for thoughts on what we're doing wrong. Or at least some suggestions to try out.
More detail here:
https://drive.google.com/file/d/0B2kp18maR7axTmU5WG9WclNKQlU/view?usp=sharing
Thanks for any input!
--MC
Hi Matthew,
I can point out a couple of differences in our setups:
1) I have 4 NVMe cards with 4 OSDs per card in each node, i.e. 16 OSDs
total per node. I'm also running the fio processes on the same nodes as
the OSDs, so there is far less CPU available per OSD in my setup.
2) You have more memory per node than I do (and far more memory per OSD)
3) I'm using fio with the librbd engine, not fio+libaio on kernel RBD.
It would be interesting to know if this is having an effect (a rough
sketch of both invocations follows this list).
4) I'm using RBD cache (and allowing writeback before flush)
5) I'm not using nobarriers
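For reference, here is roughly how I'd invoke the two setups. This is only
a sketch; the pool name, image name, client name and /dev/rbd0 device are
placeholders for whatever you actually use:

   # fio with the librbd engine: fio talks to the cluster directly via librbd
   fio --name=randread-librbd --ioengine=rbd --clientname=admin \
       --pool=rbd --rbdname=testimg \
       --rw=randread --bs=4k --iodepth=32 --direct=1 \
       --runtime=300 --time_based

   # fio+libaio on kernel RBD: map the image, then treat it as a block device
   rbd map rbd/testimg                      # shows up as e.g. /dev/rbd0
   fio --name=randread-krbd --ioengine=libaio --filename=/dev/rbd0 \
       --rw=randread --bs=4k --iodepth=32 --direct=1 \
       --runtime=300 --time_based

The librbd path keeps the client-side work (including the RBD cache) inside
the fio process, while the krbd path pushes it into the kernel, so the two
can load the client CPUs quite differently.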
I suspect that in my setup I am very much bound by things other than the
NVMe cards. I think we should look at this in terms of per-node
throughput rather than per-OSD. What I find very interesting is that
you are seeing much higher per-node tcmalloc performance than I am but
fairly similar per-node jemalloc performance. For 4K random reads I saw
about 14K IOPS per node with tcmalloc + 32MB thread cache and around 40K
IOPS per node with tcmalloc + 128MB thread cache or jemalloc. It appears
to me that for both tcmalloc and jemalloc you saw around 50K IOPS per node
in the 4 OSD per card case.
A couple of thoughts:
1) Did you happen to record any CPU usage data during your tests?
Perhaps with only 4 OSDs per node there is less CPU contention.
2) Did you test 4K random writes? It would be interesting to see if
those results show the same behavior.
3) Since you saw performance differences at different queue depths, I'm
going to assume these were O_DIRECT reads? Did you sync and drop caches
on the OSD nodes before the tests? Was the data pre-filled on the RBD
volumes? (Some example commands for these checks are sketched after this
list.)
4) Even given the above, you have a lot more memory available for buffer
cache. Did you happen to look at how many of the IOs were actually
hitting the NVMe devices?
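In case it's useful, this is roughly what I run around each test for points
1, 3 and 4. A sketch only, assuming the sysstat tools are installed and
using placeholder device names (/dev/rbd0, nvme0n1):

   # 1) record per-process CPU usage for the OSDs during the run
   pidstat -u -p $(pgrep -d, ceph-osd) 1 > osd_cpu.log &

   # 3) sync and drop caches on the OSD nodes before each run
   sync
   echo 3 > /proc/sys/vm/drop_caches

   # 3) pre-fill the RBD image so the random reads hit allocated data
   fio --name=prefill --ioengine=libaio --filename=/dev/rbd0 \
       --rw=write --bs=4m --iodepth=16 --direct=1

   # 4) count how many reads actually reach the NVMe device
   iostat -x nvme0n1 1 > nvme_iostat.log &

Comparing the r/s that iostat reports on the NVMe devices against the IOPS
fio reports should show how much of the workload is being absorbed by
buffer cache on the OSD nodes.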
Mark