Re: Rebalancing an Erasure coded pool seems to move far more data than necessary

On Fri, May 25, 2018 at 5:36 PM Jesus Cea <jcea@xxxxxxx> wrote:
I have an Erasure Coded 8+2 pool with 8 PGs.

Each PG is spread across 10 OSDs using Reed-Solomon (the erasure code).

When I rebalance the cluster I see two PGs moving:
"active+remapped+backfilling".

A "pg dump" shows this:

"""
root@jcea:/srv# ceph --id jcea pg dump|grep backf
dumped all
75.5      25536                  0        0     18690       0
107105746944 3816     3816 active+remapped+backfilling 2018-05-25
23:53:06.341894    117576'47616    117576:61186
[1,11,0,18,19,21,4,5,15,12]          1   [3,11,0,18,19,21,4,5,15,12]
         3             0'0 2018-05-25 14:01:30.889768             0'0
2018-05-25 14:01:30.889768             0
73.7      29849                  0        0     21587       0
125195780096 1537     1537 active+remapped+backfilling 2018-05-25
23:49:47.085736    117466'60337    117576:77332
[18,21,4,12,6,17,2,15,10,23]         18   [18,3,4,12,6,17,2,15,10,23]
         18    117466'60337 2018-05-25 15:53:40.005828             0'0
2018-05-24 10:47:07.592897             0
"""

In my application each file is 4 MB in size. The erasure code (8+2) expands
it by 25%, so each of the 10 OSDs stores a 512 KB slice of it (erasure
overhead included).
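
Back of the envelope, those numbers work out as follows (just arithmetic in
Python, not Ceph code; k, m and the object size are the ones described above):

"""
# Shard-size arithmetic for an 8+2 erasure coded pool.
k, m = 8, 2                      # data shards, coding shards
object_size = 4 * 1024 * 1024    # 4 MiB per object

shard_size = object_size / k                 # bytes held by each of the k+m OSDs
overhead = m / k                             # extra space relative to raw data
total_on_disk = object_size * (k + m) / k    # object footprint across all shards

print(f"per-OSD slice: {shard_size / 1024:.0f} KiB")             # 512 KiB
print(f"EC overhead:   {overhead:.0%}")                          # 25%
print(f"on-disk total: {total_on_disk / 1024 / 1024:.1f} MiB")   # 5.0 MiB
"""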

We can see that PG 75.5 is moving from [3,11,0,18,19,21,4,5,15,12] to
[1,11,0,18,19,21,4,5,15,12]. Comparing the two tuples, the content on OSD 3
is moving to OSD 1. That should be 512 KB per object (the slice assigned to
that OSD).
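
A quick way to see which slice is moving is to compare the "up" and "acting"
sets position by position (in an EC pool the list position is the shard
index). A minimal sketch, using the sets copied from the pg dump above:

"""
# Compare the "up" and "acting" OSD sets of PG 75.5 from the dump above.
up     = [1, 11, 0, 18, 19, 21, 4, 5, 15, 12]
acting = [3, 11, 0, 18, 19, 21, 4, 5, 15, 12]

for shard, (dst, src) in enumerate(zip(up, acting)):
    if dst != src:
        print(f"shard {shard}: moving from osd.{src} to osd.{dst}")
# -> shard 0: moving from osd.3 to osd.1
"""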

My "ceph -s" shows:

"""
io:
   recovery: 41308 kB/s, 10 objects/s
"""

So each object moved requires about 4 MB of traffic instead of 512 KB.
According to this, we are moving complete objects instead of only the slices
belonging to the evacuated OSD.
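
Dividing the reported throughput by the reported object rate makes the
mismatch explicit (back-of-the-envelope only, not a statistic Ceph reports
directly):

"""
# Sanity check on the "ceph -s" recovery figures quoted above.
recovery_rate_kb = 41308        # kB/s reported by "ceph -s"
objects_per_sec = 10

per_object_kb = recovery_rate_kb / objects_per_sec
print(f"~{per_object_kb:.0f} kB moved per recovered object")   # ~4131 kB

# Expected if only the evacuated slice moved: 4 MB / 8 = 512 kB.
# Observed: ~4 MB per object, i.e. roughly the whole object.
"""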

Am I interpreting this correctly?

Is this a known issue? Will it be solved in the future?

I'm not sure what's happening here, but there are two likely cases:
1) the recovery IO stats are lying: they multiply the "size" of objects by the number of them recovering, rather than counting the amount of data actually moving; or
2) recovering a single shard involves doing a read of the whole object and then writing out only the appropriate shard. I don't *think* this should be happening normally if it's just a migration, but it definitely can happen if you've lost a shard.

Basically, when you boil down the potential scenarios, summing everything up in that single number turns out to be pretty limiting.
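
For illustration, here is roughly what that single number would look like
under each hypothesis (a sketch using the figures from the original mail; the
stat itself doesn't expose which case applies):

"""
# What the aggregated recovery figure would show in each case, for
# 10 objects/s with 4 MiB objects and 512 KiB slices (figures from above).
objects_per_sec = 10
object_kib, shard_kib = 4 * 1024, 512

case1 = objects_per_sec * object_kib       # stats count whole-object sizes
case2 = objects_per_sec * object_kib       # whole object read to rebuild one shard
shard_only = objects_per_sec * shard_kib   # what "shard bytes only" would show

print(case1, case2, shard_only)   # 40960 40960 5120 (KiB/s)
# The observed ~41308 kB/s is in the ballpark of the first two and cannot
# tell them apart.
"""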
-Greg
 

This is especially costly because an Erasure Coded pool typically has far
fewer PGs than a regular replicated pool, so each PG is huge (in my case,
about 150 GB each, including EC overhead). Rebuilding the whole file from its
EC slices just to move a single slice seems way overkill.
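
Those PG sizes roughly check out against the pg dump output above (a rough
estimate only: the byte counts in the dump are user data, scaled here by the
8+2 overhead):

"""
# Estimate the on-disk footprint of the two backfilling PGs from the BYTES
# column in the pg dump above, scaled for 8+2 erasure coding.
k, m = 8, 2
pg_user_bytes = {"75.5": 107105746944, "73.7": 125195780096}

for pg, user_bytes in pg_user_bytes.items():
    raw = user_bytes * (k + m) / k
    print(f"PG {pg}: ~{raw / 1e9:.0f} GB on disk including EC overhead")
# -> PG 75.5: ~134 GB, PG 73.7: ~156 GB, in the ballpark of the ~150 GB above.
"""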

Am I missing anything?

Thanks for your time and expertise!

--
Jesús Cea Avión                         _/_/      _/_/_/        _/_/_/
jcea@xxxxxxx - http://www.jcea.es/     _/_/    _/_/  _/_/    _/_/  _/_/
Twitter: @jcea                        _/_/    _/_/          _/_/_/_/_/
jabber / xmpp:jcea@xxxxxxxxxx  _/_/  _/_/    _/_/          _/_/  _/_/
"Things are not so easy"      _/_/  _/_/    _/_/  _/_/    _/_/  _/_/
"My name is Dump, Core Dump"   _/_/_/        _/_/_/      _/_/  _/_/
"El amor es poner tu felicidad en la felicidad de otro" - Leibniz

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
