Rebalancing an erasure-coded pool seems to move far more data than necessary

I have an erasure-coded 8+2 pool with 8 PGs.

Each PG is spread across 10 OSDs using Reed-Solomon (the erasure code).

When I rebalance the cluster I see two PGs moving, both in state
"active+remapped+backfilling".

A "pg dump" shows this:

"""
root@jcea:/srv# ceph --id jcea pg dump|grep backf
dumped all
75.5      25536                  0        0     18690       0
107105746944 3816     3816 active+remapped+backfilling 2018-05-25
23:53:06.341894    117576'47616    117576:61186
[1,11,0,18,19,21,4,5,15,12]          1   [3,11,0,18,19,21,4,5,15,12]
         3             0'0 2018-05-25 14:01:30.889768             0'0
2018-05-25 14:01:30.889768             0
73.7      29849                  0        0     21587       0
125195780096 1537     1537 active+remapped+backfilling 2018-05-25
23:49:47.085736    117466'60337    117576:77332
[18,21,4,12,6,17,2,15,10,23]         18   [18,3,4,12,6,17,2,15,10,23]
         18    117466'60337 2018-05-25 15:53:40.005828             0'0
2018-05-24 10:47:07.592897             0
"""

In my application, each file is 4 MB in size. The erasure code (8+2) adds
25% overhead (2 coding slices for every 8 data slices), so each of the 10
OSDs stores a 512 KB slice of the file (4 MB / 8).
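The arithmetic behind those figures can be sketched as follows. This is only
an illustration: the function names are made up for this example, and it
assumes the object splits evenly into k data slices, ignoring any stripe
padding Ceph may add.

```python
# Sketch of the slice-size arithmetic for a k+m erasure-coded object.
# Assumption: the object divides evenly into k data slices (no padding).

K = 8                          # data slices
M = 2                          # coding slices
OBJECT_SIZE = 4 * 1024 * 1024  # 4 MB object, as in my application


def slice_size(object_size, k):
    """Bytes stored on each OSD for one object."""
    return object_size // k


def overhead(k, m):
    """Extra space used by the coding slices, relative to the data."""
    return m / k


per_osd = slice_size(OBJECT_SIZE, K)
print(per_osd // 1024, "KB per OSD")
print(f"{overhead(K, M):.0%} overhead")
print((K + M) * per_osd / 2**20, "MB stored in total per object")
```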

We can see that PG 75.5 is moving from OSDs [3,11,0,18,19,21,4,5,15,12]
to OSDs [1,11,0,18,19,21,4,5,15,12]. Comparing the two tuples, only the
first position differs: the content on OSD 3 is moving to OSD 1. That
should be 512 KB per object (the slice assigned to this OSD).
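The tuple comparison can be done mechanically. A minimal sketch, assuming
(as I read the "pg dump" output above) that the first list on each row is
the target set and the second list is the set currently serving the PG; the
helper name moved_slices is hypothetical:

```python
# Sketch: find which slice positions change OSD between two PG mappings.
# Assumption: first list in "pg dump" = target OSDs, second = current OSDs.


def moved_slices(target, current):
    """Return (position, from_osd, to_osd) for every slice that moves."""
    return [
        (pos, cur, tgt)
        for pos, (tgt, cur) in enumerate(zip(target, current))
        if tgt != cur
    ]


# OSD lists copied from the "pg dump" output above.
pg_75_5 = moved_slices(
    [1, 11, 0, 18, 19, 21, 4, 5, 15, 12],
    [3, 11, 0, 18, 19, 21, 4, 5, 15, 12],
)
pg_73_7 = moved_slices(
    [18, 21, 4, 12, 6, 17, 2, 15, 10, 23],
    [18, 3, 4, 12, 6, 17, 2, 15, 10, 23],
)
print("75.5:", pg_75_5)
print("73.7:", pg_73_7)
```

In both PGs a single position differs, so only one of the ten slices per
object should need to change OSD.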

My "ceph -s" shows:

"""
io:
   recovery: 41308 kB/s, 10 objects/s
"""

So each object moved accounts for about 4 MB (41308 kB/s divided by
10 objects/s) instead of 512 KB. According to this, we are moving complete
objects instead of only the slices belonging to the evacuated OSD.
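A quick check of that estimate, as a sketch only. It takes the "ceph -s"
figures at face value, assumes "kB" means 1000 bytes, and remembers that the
objects/s value is rounded; it reproduces my arithmetic and says nothing
about what the recovery counter actually measures.

```python
# Sketch: data per recovered object, from the "ceph -s" recovery line.
# Assumptions: kB = 1000 bytes; the objects/s figure is a rounded value.

RECOVERY_KB_PER_S = 41308
RECOVERY_OBJECTS_PER_S = 10
EXPECTED_SLICE_BYTES = 512 * 1024  # one slice of a 4 MB object in 8+2


def kb_per_object(kb_per_s, objects_per_s):
    """Average kB reported per recovered object."""
    return kb_per_s / objects_per_s


observed_kb = kb_per_object(RECOVERY_KB_PER_S, RECOVERY_OBJECTS_PER_S)
ratio = observed_kb * 1000 / EXPECTED_SLICE_BYTES
print(f"{observed_kb:.1f} kB per object")
print(f"about {ratio:.1f} times the expected slice size")
```

The ratio comes out close to 8, the number of data slices, which is what
made me suspect whole objects are being accounted for.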

Am I interpreting this correctly?

Is this a known issue? Will it be solved in the future?

This is especially costly because an erasure-coded pool has far fewer PGs
than a regular replicated pool, so each PG is huge (in my case, 150 GB
each, including EC overhead). Rebuilding the whole file from the EC slices
just to move a single slice seems like overkill.

Am I missing anything?

Thanks for your time and expertise!

-- 
Jesús Cea Avión                         _/_/      _/_/_/        _/_/_/
jcea@xxxxxxx - http://www.jcea.es/     _/_/    _/_/  _/_/    _/_/  _/_/
Twitter: @jcea                        _/_/    _/_/          _/_/_/_/_/
jabber / xmpp:jcea@xxxxxxxxxx  _/_/  _/_/    _/_/          _/_/  _/_/
"Things are not so easy"      _/_/  _/_/    _/_/  _/_/    _/_/  _/_/
"My name is Dump, Core Dump"   _/_/_/        _/_/_/      _/_/  _/_/
"El amor es poner tu felicidad en la felicidad de otro" - Leibniz
