Re: Slow Process on backfilling and recovering

Hello,

On Mon, 22 Feb 2016 07:06:21 +0800 Vlad Blando wrote:

> Hi Guys,
> 
> After I adjusted PGs on my volume pool, these processes have been running
> for 2 days now. How do I speed things up?
> 
> ---
> [root@controller-node ~]# ceph -s
>     cluster 99cb7f6f-3441-4a94-bd4b-828183ecc393
>      health HEALTH_ERR 330 pgs backfill; 41 pgs backfill_toofull; 2 pgs
> backfilling; 3 pgs inconsistent; 1 pgs recovering; 71 pgs recovery_wait;
> 405 pgs stuck unclean; recovery 4835319/22635543 objects degraded
> (21.362%); 11 near full osd(s); 3 scrub errors
>      monmap e2: 3 mons at {ceph-node-1=
> 10.107.200.1:6789/0,ceph-node-2=10.107.200.2:6789/0,ceph-node-3=10.107.200.3:6789/0},
> election epoch 496, quorum 0,1,2 ceph-node-1,ceph-node-2,ceph-node-3
>      osdmap e11356: 27 osds: 27 up, 27 in
>       pgmap v42430325: 1536 pgs, 2 pools, 27592 GB data, 5783 kobjects
>             83068 GB used, 17484 GB / 100553 GB avail
>             4835319/22635543 objects degraded (21.362%)
>                    1 active+recovering+remapped
>                   36 active+recovery_wait
>                    1 active+remapped+backfill_toofull
>                 1128 active+clean
>                    2 active+clean+inconsistent
>                  289 active+remapped+wait_backfill
>                    1 active+remapped+inconsistent+wait_backfill
>                    1 active+clean+scrubbing+deep
>                   35 active+recovery_wait+remapped
>                   40 active+remapped+wait_backfill+backfill_toofull
>                    2 active+remapped+backfilling
>   client io 2555 kB/s rd, 73369 B/s wr, 177 op/s
> [root@controller-node ~]#
> ---
> 
IIRC, you were asking about a nearly full cluster earlier.
The best and safest thing to do before anything else is to get out of that
state, either by adding more OSDs or by deleting objects.
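
To get a quick picture of where the space actually is, something along
these lines ("ceph osd df" needs a recent enough release, the other two
work anywhere):

  ceph df                              # per-pool usage
  ceph osd df                          # per-OSD utilisation, if available
  ceph health detail | grep -i full    # which OSDs are near full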

That said, googling "backfill_toofull" turns up several helpful links. One
of the two top results states the obvious: make space (or raise the full
ratios, if that is possible and sensible).
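
For reference, the knobs look roughly like this; double-check the exact
command names and defaults on your release, and note that injectargs
changes don't survive an OSD restart:

  # backfill_toofull is triggered by osd_backfill_full_ratio (default 0.85)
  ceph tell osd.* injectargs '--osd-backfill-full-ratio 0.90'
  # cluster-wide nearfull/full thresholds (defaults 0.85 / 0.95)
  ceph pg set_nearfull_ratio 0.90
  ceph pg set_full_ratio 0.97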

The second mentions a bug (in Firefly, though) where PGs stayed stuck in
backfill_toofull even after the OSD had dropped back below the threshold;
restarting the affected OSDs cleared it up.
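
If you want to try that restart workaround, it looks roughly like this on
the OSD's host (osd.12 is just a placeholder, and which form applies
depends on your init system and packages):

  # sysvinit packages
  /etc/init.d/ceph restart osd.12
  # systemd-based installs
  systemctl restart ceph-osd@12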

Either way, those 41 backfill_toofull PGs won't be going anywhere until
something is done.
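
To see exactly which PGs and OSDs those are, something like:

  ceph health detail | grep backfill_toofull
  ceph pg dump_stuck unclean
  ceph pg map <pgid>                   # which OSDs a given PG maps to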

If you do a "watch ceph -s", do you see at least some recovery activity at
this point?
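
I.e. something along these lines, to see whether the degraded/backfilling
numbers are actually shrinking:

  watch -n 10 ceph -s
  ceph -w                              # or stream cluster events as they happen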

The 3 scrub errors wouldn't fill me with confidence either.
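
Once the space situation is sorted you'll want to look at those as well;
the usual approach is something like the following, though be careful with
repair, as it essentially trusts the primary copy:

  ceph health detail | grep inconsistent   # find the inconsistent PGs
  ceph pg repair <pgid>                    # repair them one by one, with care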

Lastly, for the duration of the backfill I'd turn off scrubs, to speed up
recovery and reduce its performance impact.
But of course recovery has to be possible in the first place (enough space).
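
Turning scrubs off and back on is just:

  ceph osd set noscrub
  ceph osd set nodeep-scrub
  # and once the cluster is healthy again:
  ceph osd unset noscrub
  ceph osd unset nodeep-scrub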


Christian
-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/