Re: Rebalancing an Erasure coded pool seems to move far more data than necessary

On Fri, May 25, 2018 at 5:36 PM Jesus Cea <jcea@xxxxxxx> wrote:
I have an Erasure Coded 8+2 pool with 8 PGs.

Each PG is spread across 10 OSDs using Reed-Solomon (the erasure code).

When I rebalance the cluster I see two PGs moving:
"active+remapped+backfilling".

A "pg dump" shows this:

"""
root@jcea:/srv# ceph --id jcea pg dump|grep backf
dumped all
75.5      25536                  0        0     18690       0
107105746944 3816     3816 active+remapped+backfilling 2018-05-25
23:53:06.341894    117576'47616    117576:61186
[1,11,0,18,19,21,4,5,15,12]          1   [3,11,0,18,19,21,4,5,15,12]
         3             0'0 2018-05-25 14:01:30.889768             0'0
2018-05-25 14:01:30.889768             0
73.7      29849                  0        0     21587       0
125195780096 1537     1537 active+remapped+backfilling 2018-05-25
23:49:47.085736    117466'60337    117576:77332
[18,21,4,12,6,17,2,15,10,23]         18   [18,3,4,12,6,17,2,15,10,23]
         18    117466'60337 2018-05-25 15:53:40.005828             0'0
2018-05-24 10:47:07.592897             0
"""

In my application each file is 4 MB in size. The erasure code (8+2) expands
it by 25%, so each of the 10 OSDs stores a 512 KB slice of it (erasure
overhead included).
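
Back of the envelope, those numbers work out as follows (just arithmetic in
Python, not Ceph code; k, m and the object size are the ones described above):

"""
# Shard-size arithmetic for an 8+2 erasure coded pool.
k, m = 8, 2                      # data shards, coding shards
object_size = 4 * 1024 * 1024    # 4 MiB per object

shard_size = object_size / k                 # bytes held by each of the k+m OSDs
overhead = m / k                             # extra space relative to raw data
total_on_disk = object_size * (k + m) / k    # object footprint across all shards

print(f"per-OSD slice: {shard_size / 1024:.0f} KiB")             # 512 KiB
print(f"EC overhead:   {overhead:.0%}")                          # 25%
print(f"on-disk total: {total_on_disk / 1024 / 1024:.1f} MiB")   # 5.0 MiB
"""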

We can see that PG 75.5 is moving from [3,11,0,18,19,21,4,5,15,12] to
[1,11,0,18,19,21,4,5,15,12]. Comparing the two tuples, the content on OSD 3
is moving to OSD 1. That should be 512 KB per object (the slice assigned to
that OSD).
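
A quick way to see which slice is moving is to compare the "up" and "acting"
sets position by position (in an EC pool the list position is the shard
index). A minimal sketch, using the sets copied from the pg dump above:

"""
# Compare the "up" and "acting" OSD sets of PG 75.5 from the dump above.
up     = [1, 11, 0, 18, 19, 21, 4, 5, 15, 12]
acting = [3, 11, 0, 18, 19, 21, 4, 5, 15, 12]

for shard, (dst, src) in enumerate(zip(up, acting)):
    if dst != src:
        print(f"shard {shard}: moving from osd.{src} to osd.{dst}")
# -> shard 0: moving from osd.3 to osd.1
"""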

My "ceph -s" shows:

"""
io:
   recovery: 41308 kB/s, 10 objects/s
"""

So each object moved requires about 4 MB of traffic instead of 512 KB.
According to this, we are moving complete objects instead of only the slices
belonging to the evacuated OSD.
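
Dividing the reported throughput by the reported object rate makes the
mismatch explicit (back-of-the-envelope only, not a statistic Ceph reports
directly):

"""
# Sanity check on the "ceph -s" recovery figures quoted above.
recovery_rate_kb = 41308        # kB/s reported by "ceph -s"
objects_per_sec = 10

per_object_kb = recovery_rate_kb / objects_per_sec
print(f"~{per_object_kb:.0f} kB moved per recovered object")   # ~4131 kB

# Expected if only the evacuated slice moved: 4 MB / 8 = 512 kB.
# Observed: ~4 MB per object, i.e. roughly the whole object.
"""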

Am I interpreting this correctly?

Is this a known issue? Will it be solved in the future?

I'm not sure what's happening here, but there are two likely cases:
1) the recovery IO stats are lying: they multiply the "size" of objects by the number of them recovering, rather than counting the amount of data actually moving; or
2) recovering a single shard involves doing a read of the whole object and then writing out only the appropriate shard. I don't *think* this should be happening normally if it's just a migration, but it definitely can happen if you've lost a shard.

Basically, when you boil down the potential scenarios, summing everything up in that single number turns out to be pretty limiting.
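
For illustration, here is roughly what that single number would look like
under each hypothesis (a sketch using the figures from the original mail; the
stat itself doesn't expose which case applies):

"""
# What the aggregated recovery figure would show in each case, for
# 10 objects/s with 4 MiB objects and 512 KiB slices (figures from above).
objects_per_sec = 10
object_kib, shard_kib = 4 * 1024, 512

case1 = objects_per_sec * object_kib       # stats count whole-object sizes
case2 = objects_per_sec * object_kib       # whole object read to rebuild one shard
shard_only = objects_per_sec * shard_kib   # what "shard bytes only" would show

print(case1, case2, shard_only)   # 40960 40960 5120 (KiB/s)
# The observed ~41308 kB/s is in the ballpark of the first two and cannot
# tell them apart.
"""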
-Greg
 

This is especially costly because an Erasure Coded pool typically has far
fewer PGs than a regular replicated pool, so each PG is huge (in my case,
about 150 GB each, including EC overhead). Rebuilding the whole file from its
EC slices just to move a single slice seems way overkill.
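
Those PG sizes roughly check out against the pg dump output above (a rough
estimate only: the byte counts in the dump are user data, scaled here by the
8+2 overhead):

"""
# Estimate the on-disk footprint of the two backfilling PGs from the BYTES
# column in the pg dump above, scaled for 8+2 erasure coding.
k, m = 8, 2
pg_user_bytes = {"75.5": 107105746944, "73.7": 125195780096}

for pg, user_bytes in pg_user_bytes.items():
    raw = user_bytes * (k + m) / k
    print(f"PG {pg}: ~{raw / 1e9:.0f} GB on disk including EC overhead")
# -> PG 75.5: ~134 GB, PG 73.7: ~156 GB, in the ballpark of the ~150 GB above.
"""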

Am I missing anything?

Thanks for your time and expertise!

--
Jesús Cea Avión                         _/_/      _/_/_/        _/_/_/
jcea@xxxxxxx - http://www.jcea.es/     _/_/    _/_/  _/_/    _/_/  _/_/
Twitter: @jcea                        _/_/    _/_/          _/_/_/_/_/
jabber / xmpp:jcea@xxxxxxxxxx  _/_/  _/_/    _/_/          _/_/  _/_/
"Things are not so easy"      _/_/  _/_/    _/_/  _/_/    _/_/  _/_/
"My name is Dump, Core Dump"   _/_/_/        _/_/_/      _/_/  _/_/
"El amor es poner tu felicidad en la felicidad de otro" - Leibniz

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
