Re: Backfill stops after a while after OSD reweight

Hi,

Have a look at "ceph pg dump" to see which PGs are stuck in the remapped state.
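To narrow that down, something like the following should work on a Luminous-era CLI (the grep filter is just one way to slice the output):

```
# list PGs that have been stuck unclean (includes remapped ones)
ceph pg dump_stuck unclean

# or filter the brief PG dump for the remapped state
ceph pg dump pgs_brief | grep remapped
```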

But my guess here is that you are running a CRUSH rule that distributes across 3 racks
while you only have 3 racks in total.
CRUSH will sometimes fail to find a mapping in this scenario. There are a few parameters
you can tune in your CRUSH rule to increase the number of retries.
For example, the settings set_chooseleaf_tries and set_choose_tries can help; they are
set by default for erasure-coding rules (where this scenario is more common). The values
used for EC are set_chooseleaf_tries = 5 and set_choose_tries = 100.
You can configure them by adding them as the first steps of the rule.
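For illustration, a replicated rule with those retry steps added up front might look like this in a decompiled crushmap (rule name, id, and size limits are placeholders; use whatever your map already has):

```
rule replicated_racks {
    id 1
    type replicated
    min_size 1
    max_size 10
    # increase retries before the steps that pick buckets
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take default
    step chooseleaf firstn 0 type rack
    step emit
}
```

You would apply this with the usual round trip: "ceph osd getcrushmap -o map.bin", "crushtool -d map.bin -o map.txt", edit the rule, "crushtool -c map.txt -o map.new", "ceph osd setcrushmap -i map.new".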

You can also configure an upmap exception.
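For example (the PG id and OSD numbers below are made up; pick a stuck PG from your "ceph pg dump" output and OSDs from "ceph osd df tree"):

```
# upmap requires all clients to speak Luminous or newer
ceph osd set-require-min-compat-client luminous

# force PG 1.a3 to place the replica currently on osd.4 onto osd.12 instead
ceph osd pg-upmap-items 1.a3 4 12
```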

But in general, having only 3 racks for replica = 3 is often not the best idea if you
want to achieve a good data balance.



Paul


2018-06-20 16:50 GMT+02:00 Oliver Schulz <oliver.schulz@xxxxxxxxxxxxxx>:
Dear Paul,

thanks, here goes (output of "ceph -s", etc.):

    https://gist.github.com/oschulz/7d637c7a1dfa28660b1cdd5cc5dffbcb

> Also please run "ceph pg X.YZ query" on one of the PGs not backfilling.

Silly question: How do I get a list of the PGs not backfilling?



On 06/20/2018 04:00 PM, Paul Emmerich wrote:
Can you post the full output of "ceph -s", "ceph health detail", and "ceph osd df tree"?
Also please run "ceph pg X.YZ query" on one of the PGs not backfilling.


Paul

2018-06-20 15:25 GMT+02:00 Oliver Schulz <oliver.schulz@xxxxxxxxxxxxxx>:

    Dear all,

    we (somewhat) recently extended our Ceph cluster,
    and updated it to Luminous. By now, the fill level
    on some OSDs is quite high again, so I'd like to
    re-balance via "OSD reweight".

    I'm running into the following problem, however:
    No matter what I do (reweight a little, or a lot,
    or only reweight a single OSD by 5%) - after a
    while, backfilling simply stops and lots of objects
    stay misplaced.

    I do have up to 250 PGs per OSD (early sins from
    the first days of the cluster), but I've set
    "mon_max_pg_per_osd = 400" and
    "osd_max_pg_per_osd_hard_ratio = 1.5" to compensate.

    How can I find out why backfill stops? Any advice
    would be very much appreciated.


    Cheers,

    Oliver
    _______________________________________________
    ceph-users mailing list
    ceph-users@xxxxxxxxxxxxxx
    http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90




_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
