Re: pg remapped+peering forever and MDS trimming behind

Just before your response, I decided to take the chance and restarted the primary OSD for the PG (osd.153).

At this point, the MDS trimming error is gone and the cluster is down to a warning state. The PG has moved from remapped+peering to active+degraded+remapped+backfilling.

I'd say we're probably nearly back to a normal state. Thanks for the hint regarding the pool ID.
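In case it helps anyone else, I'm keeping an eye on the backfill with the usual status and PG query commands, nothing cluster-specific:

# watch -n 10 'ceph -s'
# ceph pg 1.efa query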

Version Details:
[root@osd1 brady]# cat /etc/centos-release
CentOS Linux release 7.2.1511 (Core) 

[root@osd1 brady]# uname -a
Linux osd1.ceph.laureateinstitute.org 3.10.0-327.22.2.el7.x86_64 #1 SMP Thu Jun 23 17:05:11 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

[root@osd1 brady]# rpm -qa|grep ceph
ceph-base-10.2.3-0.el7.x86_64
ceph-10.2.3-0.el7.x86_64
ceph-release-1-1.el7.noarch
python-cephfs-10.2.3-0.el7.x86_64
ceph-selinux-10.2.3-0.el7.x86_64
ceph-osd-10.2.3-0.el7.x86_64
ceph-mds-10.2.3-0.el7.x86_64
ceph-radosgw-10.2.3-0.el7.x86_64
ceph-deploy-1.5.34-0.noarch
libcephfs1-10.2.3-0.el7.x86_64
ceph-common-10.2.3-0.el7.x86_64
ceph-mon-10.2.3-0.el7.x86_64

Thanks!

On Wed, Oct 26, 2016 at 2:02 PM, Wido den Hollander <wido@xxxxxxxx> wrote:

> Op 26 oktober 2016 om 20:44 schreef Brady Deetz <bdeetz@xxxxxxxxx>:
>
>
> Summary:
> This is a production CephFS cluster. I had an OSD node crash. The cluster
> rebalanced successfully. I brought the down node back online. Everything
> has rebalanced except one hung PG, and MDS trimming is now behind. No hardware
> failures have become apparent yet.
>
> Questions:
> 1) Is there a way to see what pool a placement group belongs to?

A PG's ID always starts with the ID of the pool it belongs to. In your case that's pool '1'.

# ceph osd dump|grep pool

You will see the pool ID there.
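If you just want the ID/name pairs, 'ceph osd lspools' should print those as well and is a bit quicker to read:

# ceph osd lspools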

> 2) How should I move forward with unsticking my 1 pg in a constant
> remapped+peering state?
>

Looking at the PG query, have you tried restarting the primary OSD of the PG? If that doesn't help, try restarting the other OSDs in the acting set as well: [153,162,5]
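If you're on a systemd-based install, that would be something like the following, run on whichever host carries the OSD (osd.153 here is just the example from your output):

# systemctl restart ceph-osd@153
# systemctl status ceph-osd@153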

Which version of Ceph are you running?

> Based on the remapped+peering PG not going away and the MDS trimming
> getting further and further behind, I'm guessing that the PG belongs to the
> CephFS metadata pool.
>

Probably the case indeed. The MDS is blocked by this single PG.
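Once the PG goes active again, the MDS journal should start trimming on its own; you can watch the segment count in the health output drop with something like:

# watch 'ceph health detail | grep -i trimming'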

> Any help you can provide is greatly appreciated.
>
> Details:
> OSD Node Description:
> -2 VLANs over 40GbE for the public/private networks
> -256 GB RAM
> -2x Xeon 2660v4
> -2x P3700 (journal)
> -24x OSD
> Primary monitor is dedicated, with a configuration similar to the OSD nodes
> Primary MDS is dedicated, with a configuration similar to the OSD nodes
>
> [brady@mon0 ~]$ ceph health detail
> HEALTH_ERR 1 pgs are stuck inactive for more than 300 seconds; 1 pgs
> peering; 1 pgs stuck inactive; 47 requests are blocked > 32 sec; 1 osds
> have slow requests; mds0: Behind on trimming (76/30)
> pg 1.efa is stuck inactive for 174870.396769, current state
> remapped+peering, last acting [153,162,5]
> pg 1.efa is remapped+peering, acting [153,162,5]
> 34 ops are blocked > 268435 sec on osd.153
> 13 ops are blocked > 134218 sec on osd.153
> 1 osds have slow requests
> mds0: Behind on trimming (76/30)(max_segments: 30, num_segments: 76)
>
>
> [brady@mon0 ~]$ ceph pg dump_stuck
> ok
> pg_stat state   up      up_primary      acting  acting_primary
> 1.efa   remapped+peering        [153,10,162]    153     [153,162,5]     153
>
> [brady@mon0 ~]$ ceph pg 1.efa query
> http://pastebin.com/Rz0ZRfSb

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
