What are size and min_size for pool '7'... and why?

On Fri, Apr 7, 2017 at 4:20 AM, David Welch <dwelch@xxxxxxxxxxxx> wrote:
> Hi,
> We had a disk on the cluster that was not responding properly and causing
> 'slow requests'. The osd on the disk was stopped and the osd was marked down
> and then out. Rebalancing succeeded but (some?) pgs from that osd are now
> stuck in stale+active+clean state, which is not being resolved (see below
> for query results).
>
> My question: is it better to mark this osd as "lost" (i.e. 'ceph osd lost
> 14') or to remove the osd as detailed here:
> https://www.sebastien-han.fr/blog/2015/12/11/ceph-properly-remove-an-osd/
>
> Thanks,
> David
>
>
> $ ceph health detail
> HEALTH_ERR 17 pgs are stuck inactive for more than 300 seconds; 17 pgs stale; 17 pgs stuck stale
> pg 7.f3 is stuck stale for 6138.330316, current state stale+active+clean, last acting [14]
> pg 7.bd is stuck stale for 6138.330365, current state stale+active+clean, last acting [14]
> pg 7.b6 is stuck stale for 6138.330374, current state stale+active+clean, last acting [14]
> pg 7.c5 is stuck stale for 6138.330363, current state stale+active+clean, last acting [14]
> pg 7.ac is stuck stale for 6138.330385, current state stale+active+clean, last acting [14]
> pg 7.5b is stuck stale for 6138.330678, current state stale+active+clean, last acting [14]
> pg 7.1b4 is stuck stale for 6138.330409, current state stale+active+clean, last acting [14]
> pg 7.182 is stuck stale for 6138.330445, current state stale+active+clean, last acting [14]
> pg 7.1f8 is stuck stale for 6138.330720, current state stale+active+clean, last acting [14]
> pg 7.53 is stuck stale for 6138.330697, current state stale+active+clean, last acting [14]
> pg 7.1d2 is stuck stale for 6138.330663, current state stale+active+clean, last acting [14]
> pg 7.70 is stuck stale for 6138.330742, current state stale+active+clean, last acting [14]
> pg 7.14f is stuck stale for 6138.330585, current state stale+active+clean, last acting [14]
> pg 7.23 is stuck stale for 6138.330610, current state stale+active+clean, last acting [14]
> pg 7.153 is stuck stale for 6138.330600, current state stale+active+clean, last acting [14]
> pg 7.cc is stuck stale for 6138.330409, current state stale+active+clean, last acting [14]
> pg 7.16b is stuck stale for 6138.330509, current state stale+active+clean, last acting [14]
>
> $ ceph pg dump_stuck stale
> ok
> pg_stat  state               up    up_primary  acting  acting_primary
> 7.f3     stale+active+clean  [14]  14          [14]    14
> 7.bd     stale+active+clean  [14]  14          [14]    14
> 7.b6     stale+active+clean  [14]  14          [14]    14
> 7.c5     stale+active+clean  [14]  14          [14]    14
> 7.ac     stale+active+clean  [14]  14          [14]    14
> 7.5b     stale+active+clean  [14]  14          [14]    14
> 7.1b4    stale+active+clean  [14]  14          [14]    14
> 7.182    stale+active+clean  [14]  14          [14]    14
> 7.1f8    stale+active+clean  [14]  14          [14]    14
> 7.53     stale+active+clean  [14]  14          [14]    14
> 7.1d2    stale+active+clean  [14]  14          [14]    14
> 7.70     stale+active+clean  [14]  14          [14]    14
> 7.14f    stale+active+clean  [14]  14          [14]    14
> 7.23     stale+active+clean  [14]  14          [14]    14
> 7.153    stale+active+clean  [14]  14          [14]    14
> 7.cc     stale+active+clean  [14]  14          [14]    14
> 7.16b    stale+active+clean  [14]  14          [14]    14

--
Cheers,
Brad

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
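
[Editor's note: for readers following the thread, a minimal sketch of the commands being discussed. It shows how pool 7's size/min_size could be read back (what Brad is asking about) and the two paths David mentions for osd.14. The <pool-name> placeholder is not from the thread, and this is not advice for this particular cluster; if pool 7's pgs had replicas only on osd.14, either path abandons that data.]

# Read back size and min_size for pool id 7 (pool name is a placeholder):
$ ceph osd dump | grep "^pool 7 "
$ ceph osd pool get <pool-name> size
$ ceph osd pool get <pool-name> min_size

# Option 1: declare osd.14 permanently lost
$ ceph osd lost 14 --yes-i-really-mean-it

# Option 2: remove osd.14 outright, as in the linked blog post
$ ceph osd crush remove osd.14
$ ceph auth del osd.14
$ ceph osd rm 14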