Hi Marcel,

This is great!

On Tue, 12 Jan 2016, Marcel Lauhoff wrote:
> Hi,
>
> I wrote a Master's Thesis about Ceph and cold storage last year. One of
> the things I looked at was modifications to object placement.
>
> Among others, what would happen to balance (e.g. objects / OSD) when
> all objects of a file end up on the same OSD. I also ran tests with a
> different hash algorithm (Linux dcache).
>
> I wrote an article on my website with the analysis, changes to the
> source and how I ran the tests:
>
> http://irq0.org/articles/ceph/object_name_hashing

The interesting thing to me is the error bars for linux prefix (the
right-most set of bars on the last graph). Their range is significantly
wider than rjenkins + prefix, running from 2.1 TiB to 4.0 TiB (vs
2.3-3.7ish for the others).

The reason we switched away from the linux dcache hash (it was the
original choice) is that it is very weak. I suspect that the average +
standard deviation hides some of the badness; looking at the 99th or
99.9th percentile, or simply a plot of the OSD utilization distribution,
will show that there are more low- and high-utilization outliers.

The other thing to keep in mind is that beyond a certain size locality
doesn't buy you that much... the disk seek overhead is no longer
significant once you've read several megabytes of data. At the same
time, concentrating all of a file's (or rbd image's) data on a single
device means that a large, busy, hot file can focus a lot of traffic on
a single OSD.

What might be more useful is the ability to take the data for several
smaller files that are thought to be related (e.g., in the same
directory, or created at the same time) and try to store them together.
In that case, since we know the files are small, the impact on balance
would not be significant. On the other hand, what we currently do with
(very) small files in CephFS is just inline the data in the inode
anyway, so we already get that locality (and more)--the main limitation
there being that the max inline size is quite small (a KB or two, IIRC).

sage
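P.S.: To put "very weak" in slightly more concrete terms, here is the
dcache-style string hash written out in Python from memory; treat it as
an illustration rather than the exact Ceph implementation:

  # Rough sketch of the Linux dcache-style string hash, folded to 32 bits.
  def dcache_hash(name):
      h = 0
      for ch in name:
          c = ord(ch)
          h = ((h + (c << 4) + (c >> 4)) * 11) & 0xffffffff  # wrap like a fixed-width int
      return h

  # Names that differ in a single character barely perturb the output
  # (weak avalanche), so structured object names tend not to spread
  # evenly across the hash space.
  for i in range(4):
      name = "obj.%08x" % i
      print("%s -> %08x" % (name, dcache_hash(name)))

rjenkins mixes much more aggressively, which is consistent with the
tighter spread you see for the rjenkins + prefix bars.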
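P.P.S.: And a quick sketch of what I mean by looking past the mean +
stddev: feed it per-OSD utilization figures, one number per line
(bytes, TiB, percent--whatever you have handy; the input format here is
just an assumption), and it prints the tail percentiles where the
outliers show up.

  # Quick-and-dirty tail percentiles for a per-OSD utilization dump.
  import sys

  def percentile(sorted_vals, p):
      # nearest-rank percentile on an already-sorted list; good enough here
      k = int(round((p / 100.0) * (len(sorted_vals) - 1)))
      return sorted_vals[min(k, len(sorted_vals) - 1)]

  util = sorted(float(line) for line in sys.stdin if line.strip())
  mean = sum(util) / len(util)
  stddev = (sum((u - mean) ** 2 for u in util) / len(util)) ** 0.5
  print("mean %.3f  stddev %.3f" % (mean, stddev))
  for p in (50, 90, 99, 99.9):
      print("p%-5s %.3f" % (p, percentile(util, p)))
  print("min %.3f  max %.3f" % (util[0], util[-1]))

Run it as something like: python percentiles.py < per-osd-utilization.txt
(both file names are just placeholders).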