Re: [Ceph-community] Pool broke after increase pg_num

Hi,

The pool is back up and running. I took the following actions (a rough sketch of the commands is below the list):

        - Increased the maximum number of PGs per OSD (ceph tell mon.* injectargs '--mon_max_pg_per_osd=400'). The cluster was still frozen afterwards. (I already had OSDs with 251 PGs, so I am not sure this was my problem.)
        - Restarted all daemons, including the OSDs. On one specific host, restarting an OSD daemon took a long time, and after that I saw the pool start to rebuild.
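
For reference, roughly the commands involved (assuming a systemd deployment; osd.3 is just an example id, not my actual OSD):

        # raise the per-OSD PG limit on all monitors (runtime only; put it in
        # ceph.conf if it should survive a monitor restart)
        ceph tell mon.* injectargs '--mon_max_pg_per_osd=400'

        # restart a single OSD daemon on its host
        systemctl restart ceph-osd@3

        # watch the cluster state while it recovers
        ceph -s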

I don't have a firm conclusion about what happened, but at least it is working. I will now read the logs more calmly to understand exactly what happened.
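
In case it helps anyone else, where I plan to look (again assuming systemd; osd.3 is a placeholder id):

        # journal for a single OSD daemon
        journalctl -u ceph-osd@3 --since "2018-11-08"

        # or the classic on-disk log file
        less /var/log/ceph/ceph-osd.3.log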

Thank you all for your help.


Gesiel





On Fri, Nov 9, 2018 at 03:37, Ashley Merrick <singapore@xxxxxxxxxxxxxx> wrote:
Are you sure the down OSD didn't happen to hold data required for the rebalance to complete? How long has the down (now removed) OSD been out? Was it removed before or after you increased the PG count?

If you run "ceph health detail" and then pick a stuck PG, what does "ceph pg PG query" output?
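
For example (0.1a is just a placeholder PG id; use one that "ceph health detail" reports as stuck):

        ceph health detail
        # pick a stuck/inactive PG from the output, then:
        ceph pg 0.1a query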

Has your ceph -s output changed at all since the last paste?

On Fri, Nov 9, 2018 at 12:08 AM Gesiel Galvão Bernardes <gesiel.bernardes@xxxxxxxxx> wrote:
On Thu, Nov 8, 2018 at 10:00, Joao Eduardo Luis <joao@xxxxxxx> wrote:
Hello Gesiel,

Welcome to Ceph!

In the future, you may want to address the ceph-users list
(`ceph-users@xxxxxxxxxxxxxx`) for this sort of issue.


Thank you, I will do that.

On 11/08/2018 11:18 AM, Gesiel Galvão Bernardes wrote:
> Hi everyone,
>
> I am a beginner with Ceph. I increased pg_num on a pool, and
> after the cluster rebalanced I increased pgp_num (a confession: I had
> not read the complete documentation on this operation :-( ). After
> this my cluster broke and everything stopped. The cluster does not
> rebalance, and my impression is that everything is stalled.
>
> Below is my "ceph -s". Can anyone help me?

You have two osds down. Depending on how your data is mapped, your pgs
may be waiting for those to come back up before they finish being
cleaned up.
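
A quick way to check which OSDs are down and what a stuck PG is waiting on (the PG id is a placeholder):

        ceph osd tree | grep down
        ceph pg 2.5f query    # inspect the "recovery_state" section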

 
 After I removed the down OSDs, the cluster tried to rebalance, but it is "frozen" again, in this status:

  cluster:
    id:     ab5dcb0c-480d-419c-bcb8-013cbcce5c4d
    health: HEALTH_WARN
            12840/988707 objects misplaced (1.299%)
            Reduced data availability: 358 pgs inactive, 325 pgs peering
 
  services:
    mon: 3 daemons, quorum cmonitor,thanos,cmonitor2
    mgr: thanos(active), standbys: cmonitor
    osd: 17 osds: 17 up, 17 in; 221 remapped pgs
 
  data:
    pools:   1 pools, 1024 pgs
    objects: 329.6 k objects, 1.3 TiB
    usage:   3.8 TiB used, 7.4 TiB / 11 TiB avail
    pgs:     1.660% pgs unknown
             33.301% pgs not active
             12840/988707 objects misplaced (1.299%)
             666 active+clean
             188 remapped+peering
             137 peering
             17  unknown
             16  activating+remapped
 
Any other ideas?


Gesiel

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
