Nautilus - inconsistent PGs - stat mismatch

We have a new ceph Nautilus setup (Nautilus from scratch - not upgraded):

# ceph versions
{
    "mon": {
        "ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)": 3
    },
    "mgr": {
        "ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)": 3
    },
    "osd": {
        "ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)": 169
    },
    "mds": {},
    "overall": {
        "ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus (stable)": 175
    }
}

We only have CephFS on it, with the two required triple-replicated pools.  After creating the pools we increased their PG counts, but did not turn autoscaling on:

# ceph osd pool ls detail
pool 1 'cephfs_data' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 8192 pgp_num 8192 autoscale_mode warn last_change 2017 lfor 0/0/886 flags hashpspool stripe_width 0 application cephfs
pool 2 'cephfs_metadata' replicated size 3 min_size 2 crush_rule 2 object_hash rjenkins pg_num 1024 pgp_num 1024 autoscale_mode warn last_change 1995 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs
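
(For reference, the PG increase was done with the standard pool set commands, roughly along these lines, with the target values being the ones visible above:)

# ceph osd pool set cephfs_data pg_num 8192
# ceph osd pool set cephfs_data pgp_num 8192
# ceph osd pool set cephfs_metadata pg_num 1024
# ceph osd pool set cephfs_metadata pgp_num 1024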

After a few days of running, we started seeing inconsistent placement groups:

# ceph pg dump | grep incons
dumped all
1.bb4        21                  0        0         0       0 67108864           0          0 3059     3059 active+clean+inconsistent 2019-10-20 11:13:43.270022 5830'5655 5831:18901 [346,426,373]        346 [346,426,373]            346 5830'5655 2019-10-20 11:13:43.269992       1763'5424 2019-10-18 08:14:37.582180             0
1.795        29                  0        0         0       0 96468992           0          0 3081     3081 active+clean+inconsistent 2019-10-20 17:06:45.876483 5830'5472 5831:17921 [468,384,403]        468 [468,384,403]            468 5830'5472 2019-10-20 17:06:45.876455       1763'5235 2019-10-18 08:16:07.166754             0
1.1fa        18                  0        0         0       0 33554432           0          0 3065     3065 active+clean+inconsistent 2019-10-20 15:35:29.755622 5830'5268 5831:17139 [337,401,455]        337 [337,401,455]            337 5830'5268 2019-10-20 15:35:29.755588       1763'5084 2019-10-18 08:17:17.962888             0
1.579        26                  0        0         0       0 75497472           0          0 3068     3068 active+clean+inconsistent 2019-10-20 21:45:42.914200 5830'5218 5831:15405 [477,364,332]        477 [477,364,332]            477 5830'5218 2019-10-20 21:45:42.914173       5830'5218 2019-10-19 12:13:53.259686             0
1.11c5       21                  0        0         0       0 71303168           0          0 3010     3010 active+clean+inconsistent 2019-10-20 23:31:36.183053 5831'5183 5831:16214 [458,370,416]        458 [458,370,416]            458 5831'5183 2019-10-20 23:31:36.183030       5831'5183 2019-10-19 16:35:17.195721             0
1.128d       17                  0        0         0       0 46137344           0          0 3073     3073 active+clean+inconsistent 2019-10-20 19:14:55.459236 5830'5368 5831:17584 [441,422,377]        441 [441,422,377]            441 5830'5368 2019-10-20 19:14:55.459209       1763'5110 2019-10-18 08:12:51.062548             0
1.19ef       16                  0        0         0       0 41943040           0          0 3076     3076 active+clean+inconsistent 2019-10-20 23:33:02.020050 5830'5502 5831:18244 [323,431,439]        323 [323,431,439]            323 5830'5502 2019-10-20 23:33:02.020025       1763'5220 2019-10-18 08:12:51.117020             0
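
To dig into what exactly is inconsistent in one of these PGs, something like the following can be run (though for a pure stat mismatch it may well come back empty, since the error is in the PG-level accounting rather than on any specific object):

# rados list-inconsistent-obj 1.bb4 --format=json-pretty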

The logs look like this (taking PG 1.bb4 as an example):

2019-10-20 11:13:43.261 7fffd3633700  0 log_channel(cluster) log [DBG] : 1.bb4 scrub starts
2019-10-20 11:13:43.265 7fffd3633700 -1 log_channel(cluster) log [ERR] : 1.bb4 scrub : stat mismatch, got 21/21 objects, 0/0 clones, 21/21 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 88080384/67108864 bytes, 0/0 manifest objects, 0/0 hit_set_archive bytes.
2019-10-20 11:13:43.265 7fffd3633700 -1 log_channel(cluster) log [ERR] : 1.bb4 scrub 1 errors
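
If I read the mismatch right, the scrub counted 88080384 bytes (84 MiB) of object data against 67108864 bytes (64 MiB) recorded in the PG stats, a difference of 20971520 bytes (20 MiB, i.e. exactly 5 of the default 4 MiB CephFS objects), while the object count itself agrees (21/21), so it looks like stats accounting drift rather than damaged data.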

It looks like doing a pg repair fixes the issue:

2019-10-21 09:17:50.125 7fffd3633700  0 log_channel(cluster) log [DBG] : 1.bb4 repair starts
2019-10-21 09:17:50.653 7fffd3633700 -1 log_channel(cluster) log [ERR] : 1.bb4 repair : stat mismatch, got 21/21 objects, 0/0 clones, 21/21 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 88080384/67108864 bytes, 0/0 manifest objects, 0/0 hit_set_archive bytes.
2019-10-21 09:17:50.653 7fffd3633700 -1 log_channel(cluster) log [ERR] : 1.bb4 repair 1 errors, 1 fixed
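
Assuming the data itself is fine and only the stats are off, one stop-gap (if I understand the option correctly) would be to let scrub repair these automatically, something like:

# ceph config set osd osd_scrub_auto_repair true

which, as far as I know, only kicks in when the number of scrub errors does not exceed osd_scrub_auto_repair_num_errors (5 by default).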

Is this a known issue with Nautilus?  We have other Luminous/Mimic clusters where I haven't seen this come up.

Thanks,

Andras

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



