Hi Blade,

You can try setting min_size to 1 to get it back online, and if/when the error
vanishes (maybe after another repair command) you can set min_size back to 2.

Alternatively, you can try to simply mark the osd the pg is on out/down (or maybe
remove it).

--
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:info@xxxxxxxxxxxxxxxxx

Address:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402, District Court of Hanau
Managing director: Oliver Dzombic

Tax No.: 35 236 3622 1
VAT ID: DE274086107


On 04.05.2016 at 22:46, Blade Doyle wrote:
>
> When I issue the "ceph pg repair 1.32" command I *do* see it reported in
> the "ceph -w" output, but I *do not* see any new messages about pg 1.32
> in the log of osd.6 - even if I turn debug messages way up.
>
> # ceph pg repair 1.32
> instructing pg 1.32 on osd.6 to repair
>
> (ceph -w shows)
> 2016-05-04 11:19:50.528355 mon.0 [INF] from='client.?
> 192.168.2.224:0/1341169978' entity='client.admin'
> cmd=[{"prefix": "pg repair", "pgid": "1.32"}]: dispatch
>
> ---
>
> Yes, I also noticed that there is only one copy of that pg. I have no
> idea how it happened, but my pools (all of them) got set to replication
> size=1. I re-set them back to the intended values as soon as I noticed
> it. Currently the pools are configured like this:
>
> # ceph osd pool ls detail
> pool 0 'rbd' replicated size 2 min_size 2 crush_ruleset 0 object_hash
> rjenkins pg_num 64 pgp_num 64 last_change 349499 flags hashpspool
> stripe_width 0
>         removed_snaps [1~d]
> pool 1 'cephfs_data' replicated size 2 min_size 2 crush_ruleset 0
> object_hash rjenkins pg_num 300 pgp_num 300 last_change 349490 lfor
> 25902 flags hashpspool crash_replay_interval 45 tiers 4 read_tier 4
> write_tier 4 stripe_width 0
> pool 2 'cephfs_metadata' replicated size 2 min_size 2 crush_ruleset 0
> object_hash rjenkins pg_num 300 pgp_num 300 last_change 349503 flags
> hashpspool stripe_width 0
> pool 4 'ssd_cache' replicated size 2 min_size 1 crush_ruleset 0
> object_hash rjenkins pg_num 256 pgp_num 256 last_change 349490 flags
> hashpspool,incomplete_clones tier_of 1 cache_mode writeback target_bytes
> 126701535232 target_objects 1000000 hit_set
> bloom{false_positive_probability: 0.05, target_size: 0, seed: 0} 3600s
> x2 min_read_recency_for_promote 1 stripe_width 0
>
> # ceph osd tree
> ID  WEIGHT  TYPE NAME          UP/DOWN REWEIGHT PRIMARY-AFFINITY
> -12 0.39999 root ssd_cache
>  -4 0.20000     host node11
>   8 0.20000         osd.8           up  1.00000          1.00000
> -11 0.20000     host node13
>   0 0.20000         osd.0           up  1.00000          1.00000
>  -1 2.99997 root default
>  -7 0.29999     host node6
>   7 0.29999         osd.7           up  0.72400          1.00000
>  -8 0.23000     host node5
>   5 0.23000         osd.5           up  0.67996          1.00000
>  -6 0.45999     host node12
>   9 0.45999         osd.9           up  0.72157          1.00000
> -10 0.67000     host node14
>  10 0.67000         osd.10          up  0.70659          1.00000
> -13 0.67000     host node22
>   6 0.67000         osd.6           up  0.69070          1.00000
> -15 0.67000     host node21
>  11 0.67000         osd.11          up  0.69788          1.00000
>
> --
>
> For the most part, the data in my ceph cluster is not critical. Also, I have
> a recent backup. At this point I would be happy to resolve the pg
> problems "any way possible" in order to get it working again. Can I
> just delete the problematic pg (or the versions of it that are broken)?
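
For reference, the temporary min_size change suggested at the top of this reply
would look roughly like the following for pool 1 (cephfs_data, which pg 1.32
belongs to according to the listing above). This is only a sketch of the idea,
not a verified procedure; adjust the pool name if the stuck pg lives elsewhere:

    # temporarily allow the pool to serve I/O from a single replica
    ceph osd pool set cephfs_data min_size 1

    # once pg 1.32 goes active, re-issue the repair / deep-scrub
    ceph pg repair 1.32
    ceph pg deep-scrub 1.32

    # when the pg is active+clean again, restore the intended value
    ceph osd pool set cephfs_data min_size 2

With size 2 and min_size 2, a pool stops serving I/O as soon as one replica is
missing, which is why the pg sits in undersized+degraded+peered instead of going
active with only osd.6.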
>
> I tried some commands to "accept the missing objects as lost" but it
> tells me:
>
> # ceph pg 1.32 mark_unfound_lost delete
> pg has no unfound objects
>
> The osd log for that is:
> 2016-05-04 11:31:03.742453 9b088350  0 osd.6 350327 do_command r=0
> 2016-05-04 11:31:03.763017 9b088350  0 osd.6 350327 do_command r=0 pg
> has no unfound objects
> 2016-05-04 11:31:03.763066 9b088350  0 log_channel(cluster) log [INF] :
> pg has no unfound objects
>
> I also tried to "force create" the pg:
> # ceph pg force_create_pg 1.32
> pg 1.32 now creating, ok
>
> In that case, I do see a dispatch:
> 2016-05-04 11:32:42.073625 mon.4 [INF] from='client.?
> 192.168.2.224:0/208882728' entity='client.admin'
> cmd=[{"prefix": "pg force_create_pg", "pgid": "1.32"}]: dispatch
> 2016-05-04 11:32:42.075024 mon.0 [INF] from='client.17514719 :/0'
> entity='client.admin' cmd=[{"prefix": "pg force_create_pg", "pgid":
> "1.32"}]: dispatch
> 2016-05-04 11:32:42.183389 mon.0 [INF] from='client.17514719 :/0'
> entity='client.admin' cmd='[{"prefix": "pg force_create_pg", "pgid":
> "1.32"}]': finished
>
> That puts the pg in a new state for a while:
> # ceph health detail | grep 1.32
> pg 1.32 is stuck inactive since forever, current state creating, last
> acting []
> pg 1.32 is stuck unclean since forever, current state creating, last
> acting []
>
> But after a few minutes it returns to the previous state:
>
> # ceph health detail | grep 1.32
> pg 1.32 is stuck inactive for 160741.831891, current state
> undersized+degraded+peered, last acting [6]
> pg 1.32 is stuck unclean for 1093042.263678, current state
> undersized+degraded+peered, last acting [6]
> pg 1.32 is stuck undersized for 57229.481051, current state
> undersized+degraded+peered, last acting [6]
> pg 1.32 is stuck degraded for 57229.481382, current state
> undersized+degraded+peered, last acting [6]
> pg 1.32 is undersized+degraded+peered, acting [6]
>
> Blade.
>
>
> On Tue, May 3, 2016 at 10:45 AM, Oliver Dzombic <info@xxxxxxxxxxxxxxxxx> wrote:
>
> Hi Blade,
>
> if you don't see anything in the logs, then you should raise the debug
> level/frequency.
>
> You must at least see that the repair command has been issued
> (started).
>
> Also, I am wondering about the [6] from your output.
>
> That means that there is only 1 copy of it (on osd.6).
>
> What is your setting for the minimum required copies?
>
> osd_pool_default_min_size = ??
>
> And what is the setting for the number of copies to create?
>
> osd_pool_default_size = ???
>
> Please give us the output of
>
> ceph osd pool ls detail
>
>
> --
> Mit freundlichen Gruessen / Best regards
>
> Oliver Dzombic
> IP-Interactive
>
> mailto:info@xxxxxxxxxxxxxxxxx
>
> Address:
>
> IP Interactive UG ( haftungsbeschraenkt )
> Zum Sonnenberg 1-3
> 63571 Gelnhausen
>
> HRB 93402, District Court of Hanau
> Managing director: Oliver Dzombic
>
> Tax No.: 35 236 3622 1
> VAT ID: DE274086107
>
>
> On 03.05.2016 at 19:11, Blade Doyle wrote:
> > Hi Oliver,
> >
> > Thanks for your reply.
> >
> > The problem could have been caused by crashing/flapping OSDs. The
> > cluster is stable now, but lots of pg problems remain.
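
The debug raise mentioned above can be done at runtime; something like the
following should work (the values are just examples, injected settings are lost
when the daemon restarts, and the admin-socket queries have to be run on the
host that carries osd.6, node22 in the tree above):

    # turn up logging on osd.6 while re-running the repair
    ceph tell osd.6 injectargs '--debug-osd 20 --debug-ms 1'
    ceph pg repair 1.32

    # default log location assumed; watch for lines mentioning pg 1.32
    tail -f /var/log/ceph/ceph-osd.6.log | grep '1\.32'

    # the pool defaults asked about above, read via the admin socket on node22
    ceph daemon osd.6 config get osd_pool_default_size
    ceph daemon osd.6 config get osd_pool_default_min_size

    # turn logging back down afterwards
    ceph tell osd.6 injectargs '--debug-osd 0/5 --debug-ms 0/5'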
> >
> > $ ceph health
> > HEALTH_ERR 4 pgs degraded; 158 pgs inconsistent; 4 pgs stuck degraded; 1
> > pgs stuck inactive; 10 pgs stuck unclean; 4 pgs stuck undersized; 4 pgs
> > undersized; recovery 1489/523934 objects degraded (0.284%); recovery
> > 2620/523934 objects misplaced (0.500%); 158 scrub errors
> >
> > Example: for pg 1.32:
> >
> > $ ceph health detail | grep "pg 1.32"
> > pg 1.32 is stuck inactive for 13260.118985, current state
> > undersized+degraded+peered, last acting [6]
> > pg 1.32 is stuck unclean for 945560.550800, current state
> > undersized+degraded+peered, last acting [6]
> > pg 1.32 is stuck undersized for 12855.304944, current state
> > undersized+degraded+peered, last acting [6]
> > pg 1.32 is stuck degraded for 12855.305305, current state
> > undersized+degraded+peered, last acting [6]
> > pg 1.32 is undersized+degraded+peered, acting [6]
> >
> > I tried various things like:
> >
> > $ ceph pg repair 1.32
> > instructing pg 1.32 on osd.6 to repair
> >
> > $ ceph pg deep-scrub 1.32
> > instructing pg 1.32 on osd.6 to deep-scrub
> >
> > It's odd that I never do see any log on osd.6 about scrubbing or
> > repairing that pg (after waiting many hours). I attached "ceph pg
> > query" and a grep of the osd logs for that pg. If there is a better way
> > to provide large logs, please let me know.
> >
> > For reference, the last mention of that pg in the logs is:
> >
> > 2016-04-30 09:24:44.703785 975b9350 20 osd.6 349418 kicking pg 1.32
> > 2016-04-30 09:24:44.703880 975b9350 30 osd.6 pg_epoch: 349418 pg[1.32( v
> > 338815'7745 (20981'4727,338815'7745] local-les=349347 n=435 ec=17 les/c
> > 349347/349347 349418/349418/349418) [] r=-1 lpr=349418
> > pi=349346-349417/1 crt=338815'7743 lcod 0'0 inactive NOTIFY] lock
> >
> >
> > Suggestions appreciated,
> > Blade.
> >
> >
> > On Sat, Apr 30, 2016 at 9:31 AM, Blade Doyle <blade.doyle@xxxxxxxxx> wrote:
> >
> > Hi Ceph-Users,
> >
> > Help with how to resolve these would be appreciated.
> >
> > 2016-04-30 09:25:58.399634 9b809350  0 log_channel(cluster) log
> > [INF] : 4.97 deep-scrub starts
> > 2016-04-30 09:26:00.041962 93009350  0 -- 192.168.2.52:6800/6640
> > >> 192.168.2.32:0/3983425916 pipe(0x27406000 sd=111 :6800 s=0
> > pgs=0 cs=0 l=0 c=0x272da0a0).accept peer addr is really
> > 192.168.2.32:0/3983425916 (socket is 192.168.2.32:38514/0)
> > 2016-04-30 09:26:15.415883 9b809350 -1 log_channel(cluster) log
> > [ERR] : 4.97 deep-scrub stat mismatch, got 284/282 objects, 0/0
> > clones, 145/145 dirty, 0/0 omap, 4/2 hit_set_archive, 137/137
> > whiteouts, 365855441/365855441 bytes, 340/340 hit_set_archive bytes.
> > 2016-04-30 09:26:15.415953 9b809350 -1 log_channel(cluster) log
> > [ERR] : 4.97 deep-scrub 1 errors
> > 2016-04-30 09:26:15.416425 9b809350  0 log_channel(cluster) log
> > [INF] : 4.97 scrub starts
> > 2016-04-30 09:26:15.682311 9b809350 -1 log_channel(cluster) log
> > [ERR] : 4.97 scrub stat mismatch, got 284/282 objects, 0/0 clones,
> > 145/145 dirty, 0/0 omap, 4/2 hit_set_archive, 137/137 whiteouts,
> > 365855441/365855441 bytes, 340/340 hit_set_archive bytes.
> > 2016-04-30 09:26:15.682392 9b809350 -1 log_channel(cluster) log
> > [ERR] : 4.97 scrub 1 errors
> >
> > Thanks Much,
> > Blade.
> >

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
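
As for the 158 inconsistent pgs and the 4.97 stat mismatch quoted earlier in the
thread: a stat mismatch found by (deep-)scrub can usually be cleared by issuing a
repair for each inconsistent pg. A rough sketch, assuming the usual
"pg <id> is ... inconsistent, acting [...]" line format in ceph health detail;
review the list before looping over it:

    # collect the inconsistent pgs and ask their primary OSDs to repair them
    ceph health detail | awk '$1 == "pg" && /inconsistent/ {print $2}' | sort -u |
    while read pg; do
        ceph pg repair "$pg"
    done

Repair only runs on pgs that are active, so pg 1.32 will stay untouched in its
undersized+degraded+peered state until the min_size / second-replica situation
above is sorted out.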