Input/output error mounting

Daniel Davidson <danield@xxxxxxxxxxxxxxxx> · Fri, 23 Jun 2017 12:15:44 -0500

Two of our OSD systems hit 75% disk utilization, so I added another 
system to try to bring that back down.  The file system was usable for 
a day while the data was being migrated, but now it is not responding 
when I try to mount it:

mount -t ceph ceph-0,ceph-1,ceph-2,ceph-3:6789:/ /home -o name=admin,secretfile=/etc/ceph/admin.secret
mount error 5 = Input/output error

Here is our Ceph health:

[root@ceph-3 ~]# ceph -s
    cluster 7bffce86-9d7b-4bdf-a9c9-67670e68ca77
     health HEALTH_ERR
            2 pgs are stuck inactive for more than 300 seconds
            58 pgs backfill_wait
            20 pgs backfilling
            3 pgs degraded
            2 pgs stuck inactive
            76 pgs stuck unclean
            2 pgs undersized
            100 requests are blocked > 32 sec
            recovery 1197145/653713908 objects degraded (0.183%)
            recovery 47420551/653713908 objects misplaced (7.254%)
            mds0: Behind on trimming (180/30)
            mds0: Client biologin-0 failing to respond to capability release
            mds0: Many clients (20) failing to respond to cache pressure
     monmap e3: 4 mons at {ceph-0=172.16.31.1:6789/0,ceph-1=172.16.31.2:6789/0,ceph-2=172.16.31.3:6789/0,ceph-3=172.16.31.4:6789/0}
            election epoch 542, quorum 0,1,2,3 ceph-0,ceph-1,ceph-2,ceph-3
      fsmap e17666: 1/1/1 up {0=ceph-0=up:active}, 3 up:standby
     osdmap e25535: 32 osds: 32 up, 32 in; 78 remapped pgs
            flags sortbitwise,require_jewel_osds
      pgmap v19199544: 1536 pgs, 2 pools, 786 TB data, 299 Mobjects
            1595 TB used, 1024 TB / 2619 TB avail
            1197145/653713908 objects degraded (0.183%)
            47420551/653713908 objects misplaced (7.254%)
                1448 active+clean
                  58 active+remapped+wait_backfill
                  17 active+remapped+backfilling
                  10 active+clean+scrubbing+deep
                   2 undersized+degraded+remapped+backfilling+peered
                   1 active+degraded+remapped+backfilling
recovery io 906 MB/s, 331 objects/s

Checking in on the inactive PGs:

[root@ceph-control ~]# ceph health detail |grep inactive
HEALTH_ERR 2 pgs are stuck inactive for more than 300 seconds; 58 pgs backfill_wait; 20 pgs backfilling; 3 pgs degraded; 2 pgs stuck inactive; 78 pgs stuck unclean; 2 pgs undersized; 100 requests are blocked > 32 sec; 1 osds have slow requests; recovery 1197145/653713908 objects degraded (0.183%); recovery 47390082/653713908 objects misplaced (7.249%); mds0: Behind on trimming (180/30); mds0: Client biologin-0 failing to respond to capability release; mds0: Many clients (20) failing to respond to cache pressure
pg 2.1b5 is stuck inactive for 77215.112164, current state undersized+degraded+remapped+backfilling+peered, last acting [13]
pg 2.145 is stuck inactive for 76910.328647, current state undersized+degraded+remapped+backfilling+peered, last acting [13]
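
To pull just the stuck PGs and their acting OSDs out of that output, something like the following might help. It is only a sketch: it assumes the `pg <id> is stuck inactive ... last acting [...]` line format shown above, and the function name `stuck_inactive_pgs` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read `ceph health detail` text on stdin and print
# "<pgid> <acting set>" for every PG reported as stuck inactive.
# Assumes each PG entry is on a single line, as ceph prints it.
stuck_inactive_pgs() {
    awk '$1 == "pg" && /is stuck inactive/ {
        # The acting set is the last field, e.g. "[13]"
        print $2, $NF
    }'
}

# Usage (on a node with the admin keyring):
#   ceph health detail | stuck_inactive_pgs
```

For the two PGs above this should show that both have only OSD 13 in their acting set.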

If I query one of them, I don't get a response:

[root@ceph-control ~]# ceph pg 2.1b5 query
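
Since the query just hangs, it can be wrapped in a time limit so the shell comes back on its own. A minimal sketch, assuming `timeout` from GNU coreutils is installed; the wrapper name `run_with_limit` and the 30-second limit are arbitrary choices, not anything Ceph-specific.

```shell
#!/bin/sh
# Hypothetical wrapper: run a command but give up after N seconds.
# Relies on timeout(1) from GNU coreutils, which exits with status 124
# when the time limit is reached.
run_with_limit() {
    limit=$1
    shift
    timeout "$limit" "$@"
    status=$?
    if [ "$status" -eq 124 ]; then
        echo "timed out after ${limit}s: $*" >&2
    fi
    return "$status"
}

# Usage:
#   run_with_limit 30 ceph pg 2.1b5 query
```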

Any ideas on what to do?

Dan
