On Fri, May 25, 2018 at 5:36 PM Jesus Cea <jcea@xxxxxxx> wrote:
I have an Erasure Coded 8+2 pool with 8 PGs.
Each PG is spread across 10 OSDs using Reed-Solomon (the erasure code).
When I rebalance the cluster, I see two PGs moving:
"active+remapped+backfilling".
A "pg dump" shows this:
"""
root@jcea:/srv# ceph --id jcea pg dump|grep backf
dumped all
75.5 25536 0 0 18690 0 107105746944 3816 3816 active+remapped+backfilling 2018-05-25 23:53:06.341894 117576'47616 117576:61186 [1,11,0,18,19,21,4,5,15,12] 1 [3,11,0,18,19,21,4,5,15,12] 3 0'0 2018-05-25 14:01:30.889768 0'0 2018-05-25 14:01:30.889768 0
73.7 29849 0 0 21587 0 125195780096 1537 1537 active+remapped+backfilling 2018-05-25 23:49:47.085736 117466'60337 117576:77332 [18,21,4,12,6,17,2,15,10,23] 18 [18,3,4,12,6,17,2,15,10,23] 18 117466'60337 2018-05-25 15:53:40.005828 0'0 2018-05-24 10:47:07.592897 0
"""
In my application, each file is 4 MB in size. Erasure coding (8+2) adds 25%
overhead, so each OSD stores a 512 KB shard of the object (5 MB on disk in
total across the 10 OSDs).
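Just to make the arithmetic explicit, a trivial back-of-the-envelope sketch
in Python (nothing Ceph-specific, just the numbers above):
"""
# Back-of-the-envelope numbers for a 4 MB object in an 8+2 erasure-coded pool.
OBJECT_SIZE = 4 * 1024 * 1024   # 4 MB per object
K, M = 8, 2                     # 8 data shards + 2 coding shards

shard_size = OBJECT_SIZE // K            # 524288 bytes = 512 KB on each OSD
stored_total = shard_size * (K + M)      # 5 MB on disk across the 10 OSDs
overhead = stored_total / OBJECT_SIZE - 1

print(shard_size // 1024, "KB per shard")                               # 512 KB per shard
print(stored_total / 2**20, "MB stored,", f"{overhead:.0%} overhead")   # 5.0 MB stored, 25% overhead
"""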
We can see that PG 75.5 is moving from OSDs [3,11,0,18,19,21,4,5,15,12]
to OSDs [1,11,0,18,19,21,4,5,15,12]. Comparing the two sets, the content
in OSD 3 is moving to OSD 1. That should be 512 KB per object (the shard
assigned to this OSD).
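For reference, the moving shard can be read straight off the up/acting sets
in the pg dump above (plain list comparison, not a Ceph API):
"""
# Up and acting sets for PG 75.5, copied from the pg dump above.
up     = [1, 11, 0, 18, 19, 21, 4, 5, 15, 12]
acting = [3, 11, 0, 18, 19, 21, 4, 5, 15, 12]

# Position i in each list is EC shard i; compare them position by position.
for shard, (dst, src) in enumerate(zip(up, acting)):
    if dst != src:
        print(f"shard {shard}: osd.{src} -> osd.{dst}")
# prints: shard 0: osd.3 -> osd.1
"""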
My "ceph -s" shows:
"""
io:
recovery: 41308 kB/s, 10 objects/s
"""
So each object moved accounts for about 4 MB of recovery traffic instead of
512 KB. According to this, we are moving complete objects instead of only
the shards belonging to the evacuated OSD.
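That figure is just the ratio of the two numbers in "ceph -s" (a rough
check, since those counters are averaged):
"""
# Recovery figures reported by "ceph -s".
recovery_kb_per_s = 41308
objects_per_s     = 10

observed_kb_per_object = recovery_kb_per_s / objects_per_s   # ~4131 kB per object
expected_kb_per_shard  = 4 * 1024 // 8                        # 512 kB if only one shard moved

print(f"observed ~{observed_kb_per_object:.0f} kB/object,"
      f" expected {expected_kb_per_shard} kB/object,"
      f" ratio ~{observed_kb_per_object / expected_kb_per_shard:.1f}x")
# observed ~4131 kB/object, expected 512 kB/object, ratio ~8.1x
"""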
Am I interpreting this correctly?
Is this a known issue? Will it be solved in the future?
I'm not sure what's happening here, but there are two likely cases:
1) the recovery io stats are lying: they multiply the "size" of each object by the number of objects recovering, rather than reporting the amount of data actually moving; or
2) recovering a single shard involves doing a read of the whole object and then writing out only the appropriate shard. I don't *think* this should be happening normally if it's just a migration, but it definitely can happen if you've lost a shard.
Basically, whichever of those scenarios is at play, boiling everything down to that single number is pretty limiting.
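To put rough numbers on case 2 (only an illustration of the arithmetic, not
the actual OSD recovery code path): reconstructing one missing shard means
reading k surviving shards, so per 4 MB object you read ~4 MB to write out
512 KB.
"""
# Rough cost comparison: copying an existing shard vs reconstructing it,
# for a 4 MB object in an 8+2 pool. Illustration only, not the OSD code path.
K = 8
shard_kb = 4 * 1024 // K          # 512 kB per shard

copy_kb           = shard_kb      # just ship the surviving shard: 512 kB
reconstruct_read  = K * shard_kb  # read k shards to rebuild one: 4096 kB
reconstruct_write = shard_kb      # then write the single rebuilt shard

print(f"plain copy:  {copy_kb} kB moved")
print(f"reconstruct: {reconstruct_read} kB read + {reconstruct_write} kB written")
"""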
-Greg
This is especially costly because the number of PGs in an Erasure Coded
pool is quite small compared to a regular replicated pool, so each PG is
huge (in my case, 150 GB each, including EC overhead). Rebuilding each
object from its EC shards just to move a single shard seems way overkill.
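For what it's worth, those ~150 GB are consistent with the byte counts in
the pg dump above, assuming (my assumption) that the BYTES column reports
user data before EC expansion:
"""
# PG sizes taken from the pg dump above. Assumes the BYTES column is user
# data before EC expansion; if it already includes the parity shards, the
# on-disk numbers below overestimate by 25%.
pg_bytes = {"75.5": 107_105_746_944, "73.7": 125_195_780_096}
EC_FACTOR = (8 + 2) / 8   # 1.25x on-disk expansion for 8+2

for pg, b in pg_bytes.items():
    print(f"PG {pg}: {b / 1e9:.0f} GB of data, ~{b * EC_FACTOR / 1e9:.0f} GB on disk")
# PG 75.5: 107 GB of data, ~134 GB on disk
# PG 73.7: 125 GB of data, ~156 GB on disk
"""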
Am I missing anything?
Thanks for your time and expertise!
--
Jesús Cea Avión _/_/ _/_/_/ _/_/_/
jcea@xxxxxxx - http://www.jcea.es/ _/_/ _/_/ _/_/ _/_/ _/_/
Twitter: @jcea _/_/ _/_/ _/_/_/_/_/
jabber / xmpp:jcea@xxxxxxxxxx _/_/ _/_/ _/_/ _/_/ _/_/
"Things are not so easy" _/_/ _/_/ _/_/ _/_/ _/_/ _/_/
"My name is Dump, Core Dump" _/_/_/ _/_/_/ _/_/ _/_/
"El amor es poner tu felicidad en la felicidad de otro" - Leibniz
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com