I have an erasure-coded 8+2 pool with 8 PGs. Each PG is spread across 10 OSDs using Reed-Solomon (the erasure code). When I rebalance the cluster I see two PGs moving, in state "active+remapped+backfilling". A "pg dump" shows this:

"""
root@jcea:/srv# ceph --id jcea pg dump|grep backf
dumped all
75.5 25536 0 0 18690 0 107105746944 3816 3816 active+remapped+backfilling 2018-05-25 23:53:06.341894 117576'47616 117576:61186 [1,11,0,18,19,21,4,5,15,12] 1 [3,11,0,18,19,21,4,5,15,12] 3 0'0 2018-05-25 14:01:30.889768 0'0 2018-05-25 14:01:30.889768 0
73.7 29849 0 0 21587 0 125195780096 1537 1537 active+remapped+backfilling 2018-05-25 23:49:47.085736 117466'60337 117576:77332 [18,21,4,12,6,17,2,15,10,23] 18 [18,3,4,12,6,17,2,15,10,23] 18 117466'60337 2018-05-25 15:53:40.005828 0'0 2018-05-24 10:47:07.592897 0
"""

In my application, each file is 4 MB in size. The erasure code (8+2) expands it by 25%, so each OSD stores 512 KB of it, including erasure overhead.

We can see that PG 75.5 is moving from OSDs [3,11,0,18,19,21,4,5,15,12] to OSDs [1,11,0,18,19,21,4,5,15,12]. Comparing the two lists, we see that the content on OSD 3 is moving to OSD 1. That should be 512 KB per object (the slice assigned to this OSD).

My "ceph -s" shows:

"""
  io:
    recovery: 41308 kB/s, 10 objects/s
"""

So each object moved requires about 4 MB instead of 512 KB. According to this, we are moving complete objects instead of only the slices belonging to the evacuated OSD.

Am I interpreting this correctly? Is this something known? Will it be solved in the future?

This is especially costly because the number of PGs in an erasure-coded pool is quite small compared to a regular replicated pool, so each PG is huge (in my case, 150 GB each, including EC overhead). Rebuilding the file from the EC slices to move a single slice seems like overkill.

Am I missing anything?

Thanks for your time and expertise!
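For anyone who wants to check the arithmetic above, here is a minimal Python sketch using only the figures quoted in this post. The helper names (chunk_size, moved_shards, bytes_per_object) are made up for illustration and are not part of any Ceph tool. It assumes "kB" in the "ceph -s" output means 1024 bytes; with 1000 bytes the ratio is still close to 8. It does not say what the recovery counter actually measures (bytes read for reconstruction versus bytes written to the new OSD); it only reproduces the comparison.

```python
# Figures copied from the post; helper names are hypothetical.
OBJECT_SIZE = 4 * 1024 * 1024   # 4 MB application object, in bytes
K, M = 8, 2                     # EC profile: 8 data chunks + 2 coding chunks

# "up" and "acting" sets for PG 75.5, as printed by "pg dump"
UP_75_5 = [1, 11, 0, 18, 19, 21, 4, 5, 15, 12]
ACTING_75_5 = [3, 11, 0, 18, 19, 21, 4, 5, 15, 12]


def chunk_size(object_size, k):
    """Bytes of one object stored on each OSD of the PG."""
    return object_size // k


def moved_shards(up, acting):
    """Return (shard index, source OSD, target OSD) for each position that differs."""
    return [(i, a, u) for i, (u, a) in enumerate(zip(up, acting)) if u != a]


def bytes_per_object(recovery_kb_per_s, objects_per_s, kb=1024):
    """Average bytes accounted per recovered object (assumes kB = 1024 bytes)."""
    return recovery_kb_per_s * kb / objects_per_s


if __name__ == "__main__":
    per_osd = chunk_size(OBJECT_SIZE, K)
    observed = bytes_per_object(41308, 10)
    print("expansion factor:", (K + M) / K)
    print("expected bytes per object on one OSD:", per_osd)
    print("shards that change OSD:", moved_shards(UP_75_5, ACTING_75_5))
    print("observed bytes per recovered object:", observed)
    print("observed / expected:", observed / per_osd)
```

Only one of the ten positions changes in PG 75.5 (OSD 3 to OSD 1), yet the reported rate works out to roughly eight chunks' worth of data per object, which is the discrepancy I am asking about.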
-- 
Jesús Cea Avión
jcea@xxxxxxx - http://www.jcea.es/
Twitter: @jcea
jabber / xmpp:jcea@xxxxxxxxxx
"Things are not so easy"
"My name is Dump, Core Dump"
"Love is putting your happiness in the happiness of another" - Leibniz
_______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com