Hi Sage,
The more I think about explicit-only mapping (i.e. clients and OSDs
no longer doing CRUSH computations themselves), the more interesting
use cases I can see; better balance is only one of them.
The other major advantage I see is that clients would become
independent of internal CRUSH calculation changes (straw->straw2 was
a bit problematic on our production cluster, for example), which I
believe would also simplify keeping backward compatibility,
especially for the kernel client. It also looks like the code would
get simpler.
Such explicit mapping could also help with all kinds of cluster
reconfiguration work - e.g. growing the cluster by a significant
number of OSDs could be spread over time to reduce backfill impact,
and the same goes for increasing pgp_num. I also believe peering
could benefit here too, because the mon/mgr could ensure that only a
limited number of PGs is blocked in peering at any one time.
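As a purely hypothetical sketch of what I mean (not actual Ceph code;
RemapChange and apply_batch are made-up names), the mon/mgr could keep
a queue of pending explicit remappings and only release a bounded
batch per round, so backfill and peering load stays limited:

  // Hypothetical sketch, not Ceph code: spread explicit remappings over
  // time so only a bounded number of PGs is backfilling/peering at once.
  #include <cstdint>
  #include <deque>
  #include <iostream>
  #include <utility>
  #include <vector>

  struct RemapChange {            // one pending "move this PG to these OSDs"
    uint64_t pgid;
    std::vector<int> target_osds;
  };

  // Release at most max_in_flight changes per round; the rest stay queued.
  // A real throttle would live in the mon/mgr and track actual recovery
  // state, not just a counter.
  std::vector<RemapChange> apply_batch(std::deque<RemapChange>& pending,
                                       size_t max_in_flight) {
    std::vector<RemapChange> batch;
    while (!pending.empty() && batch.size() < max_in_flight) {
      batch.push_back(std::move(pending.front()));
      pending.pop_front();
    }
    return batch;
  }

  int main() {
    std::deque<RemapChange> pending;
    for (uint64_t pg = 0; pg < 10; ++pg)
      pending.push_back({pg, {1, 2, 3}});
    while (!pending.empty()) {
      auto batch = apply_batch(pending, 4);  // e.g. at most 4 PGs in flight
      std::cout << "applying " << batch.size() << " remaps this round\n";
    }
  }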
About memory overhead - I agree that it is the most important factor
to consider. I don't know the Ceph internals well enough yet to be
authoritative here, but I believe that during a cluster rebalance the
amount of state that has to be maintained can already grow well
beyond what explicit mapping would need. Would it be possible for
such an explicit map to also store the list of potential OSDs for as
long as the PG is not fully recovered? That would mean the osdmap
history could be trimmed much faster, it would be easier to predict
mon resource requirements, and monitors would sync much faster
during recovery.
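A rough sketch of what I have in mind (hypothetical types, not the
real OSDMap structures):

  // Hypothetical sketch: an explicit per-PG entry that also remembers the
  // potential OSDs until the PG is fully recovered, so the osdmap history
  // would not need to be replayed to find them.
  #include <cstdint>
  #include <unordered_map>
  #include <vector>

  struct ExplicitPgEntry {
    std::vector<int> target_osds;     // where the PG should end up
    std::vector<int> potential_osds;  // OSDs that may still hold copies
    bool fully_recovered = false;     // once true, potential_osds can go
  };

  using ExplicitPgMap = std::unordered_map<uint64_t, ExplicitPgEntry>;

  // Once recovery of a PG completes, drop the extra bookkeeping so the map
  // shrinks back to just the target mapping.
  void mark_recovered(ExplicitPgMap& m, uint64_t pgid) {
    auto it = m.find(pgid);
    if (it != m.end()) {
      it->second.fully_recovered = true;
      it->second.potential_osds.clear();
      it->second.potential_osds.shrink_to_fit();
    }
  }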
Regards,
Bartek
On 02.03.2017 at 19:09, Sage Weil wrote:
On Fri, 3 Mar 2017, Tiger Hu wrote:
Hi Sage,
I am very glad to know you raised this issue. In some cases, users
may want to accurately control the number of PGs on every OSD. Is it
possible to add/implement a new policy to support fixed mapping? This
may be useful for performance tuning or test purposes. Thanks.
In principle this could be used to map every pg explicitly without regard
for CRUSH. I think in practice, though, we can have CRUSH map 80-90% of
the PGs, and then have explicit mappings for 10-20% in order to achieve
the same "perfect" balance of PGs.
The result is a smaller OSDMap. It might not matter that much for small
clusters, but for large clusters, it is helpful to keep the OSDMap small.
OTOH, maybe through all of this we end up in a place where the OSDMap is
just an explicit map and the mgr agent that's managing it is using CRUSH
to generate it; who knows! That might make for faster mappings on the
clients since it's a table lookup instead of a mapping calculation.
I'm thinking that if we implement the ability to have the explicit
mappings in the OSDMap we open up both possibilities. The hard part is
always getting the mapping compatibility into the client, so if we do that
right, luminous+ clients will support explicit mapping (overrides) in
OSDMap and would be able to work with future versions that go all-in on
explicit...
sage
Tiger
On March 3, 2017, at 1:40 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
On Thu, 2 Mar 2017, Kamble, Nitin A wrote:
Hi Sage,
The CRUSH algorithm handles the mapping of PGs, and it will continue
to do so even with the addition of explicit mappings. I presume that
finding which PGs belong to which OSDs will involve additional
computation for each additional explicit mapping. What would be the
penalty of this additional computation? For a small number of
explicit mappings the penalty would be small, but IMO it can get
quite expensive with a large number of explicit mappings. The
implementation will need to manage the count of explicit mappings,
reverting some of them as the distribution changes. Understanding the
additional overhead of the explicit mappings would have a great
influence on the implementation.
Yeah, we'll want to be careful with the implementation, e.g., by
using an unordered_map (hash table) for looking up the mappings (an
rbtree won't scale particularly well).
Note that pg_temp is already a map<pg_t,vector<int>> and can get big
when you're doing a lot of rebalancing, resulting in O(log n)
lookups. If we do something optimized here we should use the same
strategy for pg_temp too.
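As a rough illustration of the data-structure point (with a
simplified pg_t stand-in, not the real type):

  // Sketch: pg_temp today is an ordered map<pg_t,vector<int>> (O(log n)
  // lookups); a hash table gives O(1) expected lookups when the number of
  // explicit entries grows large.
  #include <cstdint>
  #include <functional>
  #include <map>
  #include <unordered_map>
  #include <vector>

  struct pg_t_lite {              // simplified stand-in for Ceph's pg_t
    uint64_t pool;
    uint32_t seed;
    bool operator==(const pg_t_lite& o) const {
      return pool == o.pool && seed == o.seed;
    }
    bool operator<(const pg_t_lite& o) const {
      return pool != o.pool ? pool < o.pool : seed < o.seed;
    }
  };

  struct pg_t_lite_hash {
    size_t operator()(const pg_t_lite& p) const {
      return std::hash<uint64_t>()(p.pool) ^
             (std::hash<uint32_t>()(p.seed) << 1);
    }
  };

  // O(log n) per lookup, like today's pg_temp.
  using PgTempOrdered = std::map<pg_t_lite, std::vector<int>>;

  // O(1) expected per lookup; the same structure could back pg_temp too.
  using PgTempHashed =
      std::unordered_map<pg_t_lite, std::vector<int>, pg_t_lite_hash>;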
sage
Nitin
On 3/1/17, 11:44 AM, "ceph-devel-owner@xxxxxxxxxxxxxxx on
behalf of Sage Weil" <ceph-devel-owner@xxxxxxxxxxxxxxx on
behalf of sweil@xxxxxxxxxx> wrote:
There's been a longstanding desire to improve the balance of PGs and
data across OSDs to better utilize storage and balance workload. We
had a few ideas about this in a meeting last week and I wrote up a
summary/proposal here:
http://pad.ceph.com/p/osdmap-explicit-mapping
The basic idea is to have the ability to explicitly map individual
PGs to certain OSDs so that we can move PGs from overfull to
underfull devices. The idea is that the mon or mgr would do this
based on some heuristics or policy, and it should result in a better
distribution than the current osd weight adjustments we make now with
reweight-by-utilization.
The other key property is that one reason why we need as many PGs as
we do now is to get a good balance; if we can remap some of them
explicitly, we can get a better balance with fewer. In essence, CRUSH
gives an approximate distribution, and then we correct it to make it
perfect (or close to it).
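A toy sketch of the correction step (illustrative only, not the
actual mon/mgr logic; balance() and the single-primary simplification
are made up): a greedy pass that moves PGs off overfull OSDs until
the per-OSD counts even out.

  // Toy sketch: compute per-OSD PG counts from the CRUSH result, then add
  // explicit overrides moving PGs from the most-full OSDs to the least-full
  // ones until counts are roughly even.
  #include <algorithm>
  #include <cstdint>
  #include <map>
  #include <unordered_map>
  #include <vector>

  using PgId = uint64_t;

  std::unordered_map<PgId, int> balance(
      const std::map<PgId, int>& crush_primary,  // pg -> primary OSD via CRUSH
      int num_osds) {
    std::vector<int> count(num_osds, 0);
    for (const auto& kv : crush_primary)
      ++count[kv.second];
    int target = static_cast<int>(crush_primary.size()) / num_osds;

    std::unordered_map<PgId, int> overrides;     // pg -> new primary OSD
    for (const auto& [pg, osd] : crush_primary) {
      if (count[osd] <= target)
        continue;                                // source OSD is not overfull
      auto dst = std::min_element(count.begin(), count.end()) - count.begin();
      if (count[dst] >= target)
        break;                                   // nothing underfull remains
      overrides[pg] = static_cast<int>(dst);     // the explicit remap entry
      --count[osd];
      ++count[dst];
    }
    return overrides;
  }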
The main challenge is less about figuring out when/how to remap PGs
to correct the balance, and more about figuring out when to remove
those remappings after CRUSH map changes. Some simple greedy
strategies are obvious starting points (e.g., to move PGs off OSD X,
first adjust or remove existing remap entries targeting OSD X before
adding new ones), but there are a few ways we could structure the
remap entries themselves so that they more gracefully disappear after
a change.
For example, a remap entry might move a PG from OSD A to B if it
maps to A; if the CRUSH topology changes and the PG no longer maps to
A, the entry would be removed or ignored. There are a few ways to do
this in the pad; I'm sure there are other options.
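One way to encode that, sketched very roughly (placeholder types, not
a real OSDMap encoding), is a conditional "remap A->B" entry that only
applies while CRUSH still maps the PG to A:

  // Sketch of a self-cleaning remap entry: the override only applies while
  // CRUSH still maps the PG to the source OSD, so it effectively disappears
  // after a topology change.
  #include <algorithm>
  #include <vector>

  struct RemapIf {
    int from_osd;   // apply only if CRUSH currently maps the PG here
    int to_osd;     // ...and then substitute this OSD
  };

  std::vector<int> apply_remap(std::vector<int> crush_result,
                               const RemapIf& r) {
    auto it = std::find(crush_result.begin(), crush_result.end(), r.from_osd);
    if (it == crush_result.end())
      return crush_result;      // CRUSH no longer maps to A: entry is ignored
    *it = r.to_osd;             // otherwise move the PG from A to B
    return crush_result;
  }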
I put this on the agenda for CDM tonight. If anyone has any other
ideas about this we'd love to hear them!
sage