On 25/05/18 20:26, Paul Emmerich wrote:
> Answers inline.
>
>> 2018-05-25 17:57 GMT+02:00 Jesus Cea <jcea@xxxxxxx>:
>> [...] recommendation. Would be nice to know too whether being "close"
>> to a power of two is better than being far away, and whether it is
>> better to be close but below or close but a little above. If the ideal
>> value is 128 but I can only have 120 or 130, which should I choose,
>> 120 or 130? Why?
>
> Go for the next larger power of two under the assumption that your
> cluster will grow.

I now know better; check my other emails. Not being a power of two always
creates imbalance, and you cannot overcome that. If you are close to a
power of two but under it (120), most of your PGs will be of size "X" and
a few will be of size "2*X". If you are close to a power of two but over
it (130), most of your PGs will be of size "X" and a few will be of size
"X/2".
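A quick way to see this is to fold the hash space the way Ceph does. This
is only a small Python sketch of my reading of ceph_stable_mod() in the
Ceph source (src/include/rados.h), so treat it as an illustration rather
than the authoritative mapping:

    from collections import Counter

    def stable_mod(x, b, bmask):
        # b = pg_num; bmask = containing power of two, minus one (120 -> 127)
        return x & bmask if (x & bmask) < b else x & (bmask >> 1)

    def slices_per_pg(pg_num):
        """How many equal slices of the object hash space land in each PG."""
        bmask = (1 << (pg_num - 1).bit_length()) - 1
        slices = Counter(stable_mod(s, pg_num, bmask) for s in range(bmask + 1))
        return Counter(slices.values())     # {slices per PG: number of PGs}

    print(slices_per_pg(120))  # Counter({1: 112, 2: 8}) -> 112 PGs of size X, 8 of size 2X
    print(slices_per_pg(130))  # Counter({2: 126, 1: 4}) -> 126 PGs of size X, 4 of size X/2
    print(slices_per_pg(128))  # Counter({1: 128})       -> perfectly even

Only an exact power of two gives every PG the same share of the hash space.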
>> 3. Is there any negative effect on CRUSH of using erasure code 8+2
>> instead of 6+2 or 14+2 (powers of two)? I have 25 OSDs, so requiring 16
>> for a single operation seems a bad idea, even more so when my OSD
>> capacities are very spread out (from 150 GB to 1 TB) and filling a
>> small OSD would block writes in the entire pool.
>
> EC rules don't have to be powers of two. And yes, too many chunks
> for EC pools is a bad idea. It's rarely advisable to have a total of
> k + m larger than 8 or so.

I verified it. My objects are 4 MB fixed size and immutable (no rewrites),
so each OSD serves 512 KB per object. Seems fine; I could even use wider
EC codes in my particular environment. If your objects are small, the
per-OSD requests will be tiny and performance will suffer, so you would be
better off with narrower EC codes.

> Also, you should have at least k + m + 1 servers, otherwise full
> server failures cannot be handled properly.

Good advice, of course. "crush-failure-domain=host" (or a bigger failure
domain) is also important, if you have enough resources.

> A large spread between the OSD capacities within one crush rule is
> also usually a bad idea, 150 GB to 1 TB is typically too big.

I know. Legacy sins. I spend my days reweighting.

> Well, you reduced the number of PGs by a factor of 64, so you'll of
> course see a large skew here. The option mon_pg_warn_max_object_skew
> controls when this warning is shown, default is 10.

So you are advising me to increase that value to silence the warning? What
I am thinking is that mixing regular replicated pools and EC pools in the
same cluster will always generate this "warning". It is almost a natural
effect.

>> What is the actual memory-hungry factor in an OSD: PGs, or objects
>> per PG?
>
> PGs typically impose a bigger overhead. But PGs with a large number
> of objects can become annoying...

I find this difficult to believe, but you have far more experience with
Ceph than me. Do you have any reference where I can learn the details,
besides the source code? :-)

Using EC will inevitably create PGs with a large number of objects. My
pools have around 240,000 immutable 4 MB objects (~1 TB). A replicated
pool would be configured with 128 PGs, each PG holding 1,875 objects
(7.5 GB). The same pool using EC 8+2 would use 13 PGs (internally that is
130 "pseudo PGs", close to the original 128; spare me the power-of-two
rule for now). 240,000 objects in 13 PGs is 18,461 objects per PG, 92 GB
each (74 GB * 10/8; internally each PG is stored across 10 OSDs, each
providing 9.2 GB). I am actually using 8 PGs, so in my configuration it is
more like 30,000 objects per PG, 150 GB per PG and 15 GB per OSD per PG
(the arithmetic is sketched below).

This compares badly with the original 1,875 objects per PG, although each
OSD used to take care of 7.5 GB per PG and that has only grown to 15 GB.
Is 30,000 objects per PG an issue? What price am I paying here? Can I do
something to improve the situation? Increasing pg_num to 16 would be
better, but not by much, and going to 32 would push the PG count per OSD
well over the <500 PGs-per-OSD advice, considering that I have quite a few
of these EC pools.

Advice? Thanks!
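For reference, here is the arithmetic behind those figures as a throwaway
Python sketch. The pool numbers are mine (240,000 immutable 4 MB objects,
~960 GB of data, EC 8+2); nothing here touches the Ceph API:

    OBJECTS = 240_000
    OBJECT_MB = 4
    K, M = 8, 2                               # EC 8+2
    DATA_GB = OBJECTS * OBJECT_MB / 1000      # ~960 GB of user data

    # Replicated baseline: 128 PGs -> 1,875 objects and 7.5 GB per PG,
    # and each replica OSD carries the whole PG.
    print(f"replicated, 128 PGs: {OBJECTS / 128:,.0f} objects/PG, "
          f"{DATA_GB / 128:.1f} GB/PG")

    # EC pool: each PG spans k+m OSDs, raw size is user data * (k+m)/k.
    for pg_num in (8, 13, 16, 32):
        objs_per_pg = OBJECTS / pg_num
        gb_per_pg = DATA_GB / pg_num * (K + M) / K
        print(f"EC {K}+{M}, {pg_num:3d} PGs: {objs_per_pg:8,.0f} objects/PG, "
              f"{gb_per_pg:6.1f} GB/PG, {gb_per_pg / (K + M):5.1f} GB per OSD per PG")

Going from 8 to 16 PGs only halves the 30,000 objects per PG, which is why
I say it helps but not by much.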
--
Jesús Cea Avión / jcea@xxxxxxx - http://www.jcea.es/ / Twitter: @jcea
jabber / xmpp:jcea@xxxxxxxxxx
"Things are not so easy" / "My name is Dump, Core Dump"
"Love is placing your happiness in the happiness of another" - Leibniz