On 11/06/2012 12:31 PM, Gandalf Corvotempesta wrote:
2012/11/6 Stefan Kleijkers <stefan@xxxxxxxxxxxxxxxxxxxx>:
Well, you have to keep in mind that when a node fails, the PGs that resided
on that node have to be redistributed over all the other nodes. So you start
moving about 1% of the data between all the remaining nodes/OSDs (from an
OSD that holds the remaining replica of the PG to the new OSD that will get
a replica). So you move from and to all the remaining OSDs, which gives you
a lot of aggregate bandwidth and therefore fast recovery to a consistent
state.
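
A rough way to picture that fan-out is a small stand-alone simulation. The
node count, OSDs per node, PG count, and replica count below are made-up
example values, and the random placement is only a crude stand-in for CRUSH,
so treat it as a sketch of the pattern, not of Ceph's actual algorithm:

import random

NODES = 10          # hypothetical cluster size
OSDS_PER_NODE = 4   # hypothetical OSDs per node
PGS = 4096          # hypothetical number of placement groups
REPLICAS = 2

osds = [(n, d) for n in range(NODES) for d in range(OSDS_PER_NODE)]

def place_pg():
    # Pick REPLICAS distinct nodes, then one OSD on each (crude CRUSH stand-in).
    nodes = random.sample(range(NODES), REPLICAS)
    return [(n, random.randrange(OSDS_PER_NODE)) for n in nodes]

pgs = [place_pg() for _ in range(PGS)]

failed_node = 0
sources, targets = set(), set()
for replicas in pgs:
    if any(n == failed_node for n, _ in replicas):
        # The surviving replica is the read side of recovery ...
        sources.update(osd for osd in replicas if osd[0] != failed_node)
        # ... and a fresh OSD on another remaining node is the write side.
        used = {n for n, _ in replicas}
        candidates = [o for o in osds if o[0] != failed_node and o[0] not in used]
        targets.add(random.choice(candidates))

remaining = len(osds) - OSDS_PER_NODE
print(f"recovery reads hit {len(sources)} OSDs, writes hit {len(targets)} OSDs "
      f"(out of {remaining} remaining)")

With those toy numbers, essentially every remaining OSD ends up acting as a
recovery source, a recovery target, or both, which is where the aggregate
bandwidth comes from.
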
Ok, but in this case, 1% is still 36TB of data.
There is no difference between 3 nodes with 36TB of data each and 90
nodes with 36TB of data each: in case of a node failure, you always have
to move 36TB of data, no matter how many nodes you have.
True, but it makes a huge difference whether you have to redistribute the
36TB between 2 remaining nodes or between 89 remaining nodes. And with so
few nodes you will probably hit a couple of other bottlenecks, like CPU
power per node, network bandwidth per node, etc. I learned this the hard
way with 3 nodes and 24 disks/OSDs per node.
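
To put rough numbers on that, here is a back-of-envelope sketch. The
10 Gbit/s per-node network figure is an assumed example value, not from this
thread, and it ignores CPU, disk throughput, and backfill throttling:

TB = 1e12  # bytes

def recovery_hours(lost_bytes, surviving_nodes, node_bw_bits_per_s=10e9):
    # Each surviving node rebuilds roughly its equal share of the lost data,
    # limited here only by its own network link (assumed 10 Gbit/s).
    per_node_bytes = lost_bytes / surviving_nodes
    seconds = per_node_bytes / (node_bw_bits_per_s / 8)
    return seconds / 3600

for total_nodes in (3, 90):
    hours = recovery_hours(36 * TB, total_nodes - 1)
    print(f"{total_nodes}-node cluster: ~{hours:.1f} h to restore redundancy")

Under those assumptions the 3-node cluster spends hours per failed node
re-replicating, while the 90-node cluster finishes in minutes, so the window
of reduced redundancy shrinks as the cluster grows.
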
Stefan