# ceph pg stat
v1109889: 8192 pgs: 1 active+clean+inconsistent, 8191 active+clean; 31041 GB data, 42473 GB used, 1177 TB / 1218 TB avail; 411 kB/s rd, 28102 kB/s wr, 322 op/s
# ceph health detail | grep 'inconsistent'
HEALTH_ERR 1 pgs inconsistent; 5 scrub errors
pg 1.5c4 is active+clean+inconsistent, acting [258,268,156,14,67]
# ceph pg deep-scrub 1.5c4
instructing pg 1.5c4 on osd.258 to deep-scrub
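For the record, the per-object detail of the inconsistency should be visible with the following, if that output is useful to anyone:
# rados list-inconsistent-obj 1.5c4 --format=json-pretty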
From ceph-osd.258.log, when the problem pops up:
=====================================
2017-04-04 13:08:08.288728 7f6b34fe6700 -1 log_channel(cluster) log [ERR] : 1.5c4s0 scrub stat mismatch, got 1989/1988 objects, 0/0 clones, 1989/1988 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 4113960196/4110943924 bytes, 0/0 hit_set_archive bytes.
2017-04-04 13:08:08.288744 7f6b34fe6700 -1 log_channel(cluster) log [ERR] : 1.5c4s0 scrub 1 missing, 0 inconsistent objects
2017-04-04 13:08:08.288747 7f6b34fe6700 -1 log_channel(cluster) log [ERR] : 1.5c4 scrub 5 errors
From mon logs
===========
2017-04-05 07:18:17.665821 7f0fb55b5700 0 log_channel(cluster) log [INF] : pgmap v1134421: 8192 pgs: 1 activating+degraded+inconsistent, 4 activating, 27 active+recovery_wait+degraded, 3 active+recovering+degraded, 8157 active+clean; 31619 GB data, 43163 GB used, 1176 TB / 1218 TB avail; 16429 kB/s rd, 109 MB/s wr, 1208 op/s; 554/87717275 objects degraded (0.001%)
2017-04-05 07:18:18.643905 7f0fb55b5700 1 mon.cn1@0(leader).osd e4701 e4701: 335 osds: 334 up, 335 in
2017-04-05 07:18:18.656143 7f0fb55b5700 0 mon.cn1@0(leader).osd e4701 crush map has features 288531987042664448, adjusting msgr requires
2017-04-05 07:18:18.676477 7f0fb55b5700 0 log_channel(cluster) log [INF] : osdmap e4701: 335 osds: 334 up, 335 in
2017-04-05 07:18:18.731719 7f0fb55b5700 0 log_channel(cluster) log [INF] : pgmap v1134422: 8192 pgs: 1 activating+degraded+inconsistent, 4 activating, 27 active+recovery_wait+degraded, 3 active+recovering+degraded, 8157 active+clean; 31619 GB data, 43163 GB used, 1176 TB / 1218 TB avail; 21108 kB/s rd, 105 MB/s wr, 1165 op/s; 554/87717365 objects degraded (0.001%)
2017-04-05 07:18:19.963217 7f0fb55b5700 1 mon.cn1@0(leader).osd e4702 e4702: 335 osds: 335 up, 335 in
2017-04-05 07:18:19.975002 7f0fb55b5700 0 mon.cn1@0(leader).osd e4702 crush map has features 288531987042664448, adjusting msgr requires
2017-04-05 07:18:19.978696 7f0fb55b5700 0 log_channel(cluster) log [INF] : osd.258 10.139.4.84:6815/27074 boot --->>>>>>>>>>>>>>>>>>>>>>>>>>
2017-04-05 07:18:19.983349 7f0fb55b5700 0 log_channel(cluster) log [INF] : osdmap e4702: 335 osds: 335 up, 335 in
2017-04-05 07:18:20.079996 7f0fb55b5700 0 log_channel(cluster) log [INF] : pgmap v1134423: 8192 pgs: 1 inconsistent+peering, 15 peering, 23 active+recovery_wait+degraded, 3 active+recovering+degraded, 8150 active+clean; 31619 GB data, 43163 GB used, 1176 TB / 1218 TB avail; 6888 kB/s rd, 72604 kB/s wr, 654 op/s; 489/87717555 objects degraded (0.001%)
2017-04-05 07:18:20.135081 7f0fb55b5700 0 log_channel(cluster) log [INF] : pgmap v1134424: 8192 pgs: 19 stale+active+clean, 1 inconsistent+peering, 15 peering, 23 active+recovery_wait+degraded, 3 active+recovering+degraded, 8131 active+clean; 31619 GB data, 43163 GB used, 1176 TB / 1218 TB avail; 3452 kB/s rd, 97092 kB/s wr, 834 op/s; 489/87717565 objects degraded (0.001%)
2017-04-05 07:18:20.303398 7f0fb55b5700 1 mon.cn1@0(leader).osd e4703 e4703: 335 osds: 335 up, 335 in
-5299> 2017-04-05 07:18:20.151610 7f7c9ed5e700 1 osd.258 4702 state: booting -> active
-5298> 2017-04-05 07:18:20.151622 7f7c9ed5e700 0 osd.258 4702 crush map has features 288531987042664448, adjusting msgr requires for osds
-5297> 2017-04-05 07:18:20.151759 7f7c92d46700 5 osd.258 pg_epoch: 4701 pg[1.894s2( v 4699'7995 lc 4260'7863 (2052'4800,4699'7995] local-les=4699 n=2058 ec=1713 les/c/f 4699/4295/0 4698/4698/4088) [288,191,258,27,125] r=2 lpr=4699 pi=1713-4697/359 crt=4687'7993 lcod 0'0 inactive NOTIFY NIBBLEWISE m=25] exit Started/Stray 2.766060 3 0.000228
-5296> 2017-04-05 07:18:20.151764 7f7c94549700 5 osd.258 pg_epoch: 4701 pg[1.636s2( v 4683'8066 (2052'4800,4683'8066] local-les=4699 n=2039 ec=1713 les/c/f 4699/4699/0 4698/4698/4088) [292,186,258,28,106] r=2 lpr=4699 pi=1713-4697/354 crt=4683'8065 lcod 0'0 inactive NOTIFY NIBBLEWISE] exit Started/Stray 2.765980 3 0.000292
# ceph pg repair 1.5c4
While the pg repair was running, the primary OSD of this PG (osd.258) crashed, and it then tries to boot again.
# ceph pg stat
v1139236: 8192 pgs: 1 inconsistent+peering, 5 remapped+peering, 2 stale+active+clean, 16 peering, 8168 active+clean; 31701 GB data, 43302 GB used, 1176 TB / 1218 TB avail; 44501 kB/s rd, 108 MB/s wr, 1193 op/s
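If the backtrace from the crash is useful, it should be retrievable from the OSD log with something like this (default log path assumed):
# grep -E -B2 -A25 'FAILED assert|Caught signal' /var/log/ceph/ceph-osd.258.log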
Initially deep-scrubbing was disabled in the cluster, and we have now enabled it.
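For reference, the cluster-wide toggle for this is:
# ceph osd set nodeep-scrub
# ceph osd unset nodeep-scrub
(there is also a separate 'noscrub' flag for regular scrubs).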
We also ran 'ceph osd deep-scrub 258', but the OSD keeps booting over and over and we never see a scrub actually start.
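Scrub starts would normally show up in the cluster log ('ceph -w') and in the OSD log itself; we looked for them roughly like this (default log path assumed):
# ceph -w
# grep -i scrub /var/log/ceph/ceph-osd.258.log | tail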
Is removing OSD.258 from the cluster, reformatting it, and re-injecting it into the cluster the only fix for this issue? Or is there another way to get rid of it?
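In case it matters, the rebuild we have in mind is the usual remove/recreate sequence, roughly as follows (not executed yet; the stop command runs on the OSD host, and systemd units are assumed):
# ceph osd out 258
# systemctl stop ceph-osd@258
# ceph osd crush remove osd.258
# ceph auth del osd.258
# ceph osd rm 258
then wipe and re-provision the disk and let backfill finish.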
Awaiting your comments.
Thanks
Jayaram