I realize we're probably kind of pushing it. It was the only option
I could think of, however, that would satisfy these requirements:
- Have separate servers for HDD and NVMe storage, spread out over 3
  data centers.
- Always select 1 NVMe and 2 HDDs, in separate data centers (and make
  sure the NVMe is primary).
- If one data center goes down, only lose 1/3 of the NVMes.
I tried making a CRUSH rule to first select an NVMe based on device
class, and then select 2 HDDs based on class. I couldn't make it
guarantee, however, that they would end up in separate data centers,
probably because of the two separate chooseleaf statements. Sometimes
one of the HDDs would land in the same data center as the NVMe. I did
play around with this for some time; a sketch of what I tried is below.
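Roughly like this (rule id and bucket names are made up here, not our
real ones):

rule hybrid_attempt {
        id 5
        type replicated
        min_size 3
        max_size 3
        # primary: one NVMe, spread by datacenter
        step take default class nvme
        step chooseleaf firstn 1 type datacenter
        step emit
        # remaining (size - 1) replicas on HDDs; this second descent
        # knows nothing about which datacenter the NVMe step picked,
        # so it can collide with it
        step take default class hdd
        step chooseleaf firstn -1 type datacenter
        step emit
}

Each emit starts a fresh descent from its own take bucket, so nothing
stops the HDD step from re-using the NVMe's data center.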
Just selecting 3 separate ones instead, without distinguishing by
class, sometimes resulted in 2 or 3 NVMes, or no NVMe at all. We do in
fact have a separate pool with 3x NVMe for the high-performance
requirements, but that uses a traditional "default" tree.
Rearranging the OSD map and reducing the rule to a single chooseleaf
seems to work, though, and we will manually alter the weights outside
of the hosts to make life easier for CRUSH :).
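Very roughly, the rearranged layout and rule look something like this
(names, ids and weights are only illustrative, not our real ones):

# each "virtual datacenter" bundles 1 NVMe host with 2 HDD hosts that
# all live in different physical locations; vdc2 and vdc3 are built
# the same way from the remaining combinations
datacenter vdc1 {
        id -20
        alg straw2
        hash 0
        item nvme-host-a weight 1.000
        item hdd-host-b weight 10.000
        item hdd-host-c weight 10.000
}

root vdcs {
        id -30
        alg straw2
        hash 0
        item vdc1 weight 21.000
        item vdc2 weight 21.000
        item vdc3 weight 21.000
}

rule hybrid {
        id 1
        type replicated
        min_size 3
        max_size 3
        step take vdcs
        # descend into exactly one virtual datacenter ...
        step choose firstn 1 type datacenter
        # ... then a single chooseleaf picks the hosts inside it
        step chooseleaf firstn 0 type host
        step emit
}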
If we want to add more servers, we will just add another layer in
between and make sure the weights there do not differ too much when
we plan it out.
/Peter
On 2018-01-29 at 17:52, Gregory Farnum wrote:
CRUSH is a pseudorandom, probabilistic algorithm.
That can lead to problems with extreme input.
In this case, you've given it a bucket in which one child
contains ~3.3% of the total weight, and there are only three
weights. So on only 3% of "draws", as it tries to choose a
child bucket to descend into, will it choose that small one
first.
And then you've forced it to select...each of the hosts in
that data center, for all inputs? How can that even work in
terms of actual data storage, if some of them are an order of
magnitude larger than the others?
Anyway, leaving that bit aside since it looks like you're
mapping each host to multiple DCs, you're giving CRUSH a very
difficult problem to solve. You can probably "fix" it by
turning up the choose_retries value (or whatever it is) to a
high enough level that trying to map a PG eventually actually
grabs the small host. But I wouldn't be very confident in a
solution like this; it seems very fragile and subject to input
error.
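If you do go down that road, I believe the knobs are the
set_choose_tries / set_chooseleaf_tries steps in the rule, something
like this (names and numbers are just an example):

rule hybrid {
        id 1
        type replicated
        min_size 3
        max_size 3
        # give CRUSH a much bigger retry budget before it gives up on
        # a descent, so the tiny bucket eventually gets picked
        step set_choose_tries 200
        step set_chooseleaf_tries 100
        step take vdcs
        step choose firstn 1 type datacenter
        step chooseleaf firstn 0 type host
        step emit
}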
-Greg
We kind of turned the crushmap inside out a little bit.
Instead of the traditional "for 1 PG, select OSDs from 3 separate
data centers" we did "force selection from only one datacenter (out
of 3) and leave only enough options to make sure precisely 1 SSD and
2 HDDs are selected".
We then organized these "virtual datacenters" in the hierarchy so
that each of them in fact contains 3 options that lead to 3
physically separate servers in different locations.
Every physical datacenter has both SSDs and HDDs. The idea is that if
one datacenter is lost, 2/3 of the SSDs still remain (and can be
mapped to by marking the missing ones "out"), so performance is
maintained.
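As a sketch of that failover step (the bucket name dc1 is just an
example; this assumes ceph osd ls-tree is available on your release):

        # if dc1 is lost, mark all of its OSDs out so CRUSH remaps
        # their PGs onto the surviving hosts
        for id in $(ceph osd ls-tree dc1); do
                ceph osd out "$id"
        done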
On 2018-01-29 at 13:35, Niklas wrote:
> Yes.
> It is a hybrid solution where a placement group is always located on
> one NVMe drive and two HDD drives. The advantage is great read
> performance and cost savings. The disadvantage is low write
> performance. Still, the write performance is good thanks to RocksDB
> on Intel Optane disks in the HDD servers.
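> (Roughly, each HDD OSD is created with its DB on the Optane device;
> something like the following, where the device paths are only
> placeholders:)
>
>         # example only: put the BlueStore RocksDB/WAL of an HDD OSD
>         # on an Optane partition
>         ceph-volume lvm create --bluestore \
>             --data /dev/sdb \
>             --block.db /dev/nvme0n1p1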
>
> The real world looks more like what I described in a previous
> question (2018-01-23) here on the ceph-users list, "Ruleset for
> optimized Ceph hybrid storage". Nobody answered, so I am guessing it
> is not possible to create the rule I want. Now I am trying to solve
> it with virtual datacenters in the crush map, which works but is
> maybe not the most optimal solution.
>
>
> On 2018-01-29 13:21, Wido den Hollander wrote:
>>
>>
>> On 01/29/2018 01:14 PM, Niklas wrote:
>>> ...
>>>
>>
>> Is it your intention to put all copies of an object in only one DC?
>>
>> What is your exact idea behind this rule? What's the
purpose?
>>
>> Wido
>>
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com