Re: hashing variation in rados bench runs

On 04/20/2015 07:37 AM, Deneau, Tom wrote:
I have been trying to run rados bench runs and I've noticed a lot of variation from run to run.
The runs generally write data with --no-cleanup and then read it back (seq), dropping the caches in between.
I admit this is on a single-node "cluster" with 5 data disks, so maybe not realistic, but...

In my runs I also collect disk activity traces.  When I look at the seq read scores, I've noticed
the low-scoring runs always have a "hot" disk that maxes out while others might be at 30% to 40% usage,
whereas in the high-scoring runs the disk activity is much more evenly distributed.
I realize the hashing of objects to primary OSDs depends on the object names, which are different for each run
(in rados bench, the object names include the pid), but I was surprised at the sometimes marked unevenness
in the hashing.

Have others seen this and is there a good workaround?

So I'd start by narrowing down why that one disk is hot. The first step is to just look at the amount of data on each disk and see if that one disk has a lot more than the others. Next might be to try our pool distribution quality script and see how the PG distribution looks:

https://github.com/ceph/cbt/blob/master/tools/readpgdump.py

run it like:

ceph pg dump | ./readpgdump.py

That will give you a ton of statistics on pool distribution quality and possible OSD weighting strategies.
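If you want a quicker sanity check first, something along the lines of the rough sketch below (my own stand-in, not the cbt tool; the JSON field names are from memory and may vary between Ceph releases) will show whether one OSD is primary for far more PGs or bytes than the others:

#!/usr/bin/env python
# Quick-and-dirty sketch (not the readpgdump.py linked above): tally how many
# PGs and how many bytes each OSD is primary for, to spot an obviously
# overloaded disk. Reads "ceph pg dump --format=json" on stdin; the JSON field
# names are my best recollection and may differ between releases.
import json
import sys
from collections import defaultdict

data = json.load(sys.stdin)
# Some releases put pg_stats at the top level, others nest it under pg_map.
pg_stats = data.get("pg_stats") or data.get("pg_map", {}).get("pg_stats", [])

pgs_per_osd = defaultdict(int)
bytes_per_osd = defaultdict(int)
for pg in pg_stats:
    # Treat acting_primary (or the first OSD in the acting set) as the primary.
    primary = pg.get("acting_primary", pg["acting"][0])
    pgs_per_osd[primary] += 1
    bytes_per_osd[primary] += pg["stat_sum"]["num_bytes"]

for osd in sorted(pgs_per_osd):
    print("osd.%d: %4d primary PGs, %14d bytes" %
          (osd, pgs_per_osd[osd], bytes_per_osd[osd]))

Run it as something like "ceph pg dump --format=json | python pg_primary_summary.py" (the script name is just an example). If one OSD owns a disproportionate share of the primary PGs or bytes, that's probably your hot disk.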

Beyond that, like you said, there is some inherent randomness not just in the PG distribution but in the way that objects are distributed to PGs. There is a test that would be very interesting to do, but I'm not sure anyone has done it yet for CRUSH/straw. Basically the idea is: if you have very similar names for objects with only one input bit flipped, how does the mapping change? I.e., say you have objects with names [object1, object2, object3, ..., objectN]. Even if the distribution is random for inputs up to N, how much clumpiness is there as you iterate over objects within the distribution? This kind of testing is often done for hashing algorithms, and folks have done it for the Jenkins hash, but it might be nice to see something like this repeated for Ceph/CRUSH.
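For illustration, such a test might look roughly like the sketch below. It uses md5 as a stand-in hash purely to show the shape of the experiment (hash a run of near-identical names and look at how uneven the per-bucket counts get); to say anything about Ceph itself you'd have to substitute the real rjenkins/CRUSH mapping, for example by looping "ceph osd map <pool> <object>" over the same names against a real pool.

#!/usr/bin/env python
# Sketch of the "clumpiness" test described above. md5 is only a stand-in
# hash to show the shape of the experiment; Ceph itself uses rjenkins + CRUSH.
import hashlib
from collections import Counter

NUM_PGS = 256         # pretend pool with 256 PGs
NUM_OBJECTS = 4096    # sequentially named objects, like rados bench writes

def bucket_for(name):
    # Map an object name to a bucket; swap in the real mapping to test Ceph.
    return int(hashlib.md5(name.encode()).hexdigest(), 16) % NUM_PGS

counts = Counter(bucket_for("object%d" % i) for i in range(NUM_OBJECTS))
loads = [counts.get(pg, 0) for pg in range(NUM_PGS)]
mean = float(NUM_OBJECTS) / NUM_PGS
print("objects per bucket: min=%d mean=%.1f max=%d" %
      (min(loads), mean, max(loads)))
# A max far above the mean for these near-identical names would suggest
# the kind of clumpiness you're seeing.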

Beyond that, there's also the concern that at the OS or even hardware level there is some contention for IOs as requests come in for multiple disks, which could cause some to end up waiting more than others. We've seen this with network gear, and suspect we've seen it with disks on SAS expander backplanes, especially with SATA drives mixed in. On the other hand, we've seen setups that work well too, so there's no easy answer.
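One way to check whether that's what you're hitting is to compare per-disk utilization from the traces you're already collecting. As a rough example (assuming plain "iostat -x" output and that %util is the last column, which varies between sysstat versions), something like this would flag the outlier:

#!/usr/bin/env python
# Rough sketch: feed it "iostat -x <interval> <count>" output and it flags
# devices whose average %util runs well above the median, to help spot a disk
# that is being kept busier than its peers. Column layout differs between
# sysstat versions, so this just assumes %util is the last field per line.
import sys
from collections import defaultdict

PREFIXES = ("sd", "vd", "nvme")   # device-name prefixes to consider (adjust to taste)
util = defaultdict(list)

for line in sys.stdin:
    fields = line.split()
    if fields and fields[0].startswith(PREFIXES):
        try:
            util[fields[0]].append(float(fields[-1]))
        except ValueError:
            pass  # header or otherwise unparsable line

averages = sorted((sum(v) / len(v), dev) for dev, v in util.items())
if averages:
    median = averages[len(averages) // 2][0]
    for avg, dev in averages:
        flag = "  <-- hot?" if median and avg > 1.5 * median else ""
        print("%-8s avg %%util %5.1f%s" % (dev, avg, flag))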

Mark


-- Tom Deneau, AMD
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html
