Re: Input/output error mounting

Your min_size=2 is why the cluster is blocking IO and why you can't mount CephFS.  While the cluster is backfilling, those 2 PGs currently have only 1 OSD (osd.13) in their acting set.  That is not enough OSDs to satisfy min_size, so any requests for data in those PGs will block and wait until a second OSD is up for them.
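
You can confirm both halves of that from a monitor node (just a sketch; output formats vary a bit by release).  The first command should list 2.1b5 and 2.145, and the second shows a PG's up/acting sets, which right now is only [13]:

# ceph pg dump_stuck undersized
# ceph pg map 2.1b5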

That's what's wrong; now what do you do to fix it?  If you change the pool for these PGs to min_size=1, IO will stop blocking and you will be able to mount CephFS.  I would recommend setting the nodown flag while you are in this state so that OSDs won't be marked down or flap; instead they will block any IO that should be going to them until they are back up.  That should avoid the main problem of running with min_size=1, which is the chance of ending up with inconsistent data.
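
For example (just a sketch; substitute your actual cephfs data pool name for <pool>, I don't know what you called it):

# ceph osd set nodown
# ceph osd pool set <pool> min_size 1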

As soon as you no longer have any undersized PGs, you can switch back to min_size=2.  But honestly, while backfilling it is fairly likely that PGs will end up in this state, so you will need to keep monitoring things during this and future backfills as long as you are running with size=2 and min_size=2.
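
Once 'ceph pg dump_stuck undersized' comes back empty, put it back the way it was (same pool name placeholder as above):

# ceph osd pool set <pool> min_size 2
# ceph osd unset nodown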

That's the cost of running with only 2 copies of your data.  The alternative is to run with min_size=1 at all times so your data stays accessible whenever one copy is down.  That carries the inherent risk of inconsistent data... but maybe 100% data integrity isn't a requirement for the project and a higher chance of some data loss is acceptable.

I personally would go with Erasure Coding over choosing size=2.  Slower performance, but it saves more space and is more fault tolerant.
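
Roughly, and only as an illustration (the profile name, k/m values, and pg count here are made up, and keep in mind that on Jewel a CephFS data pool on EC generally needs a cache tier in front of it):

# ceph osd erasure-code-profile set ec-4-2 k=4 m=2
# ceph osd pool create ecpool 1024 1024 erasure ec-4-2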

On Fri, Jun 23, 2017 at 5:08 PM Daniel Davidson <danield@xxxxxxxxxxxxxxxx> wrote:
We are using replica 2 and min size is 2.  A small amount of data is sitting around from when we were running the default 3.

Looks like the problem started around here:

2017-06-22 14:54:29.173982 7f3c39f6f700  0 log_channel(cluster) log [INF] : 1.2c9 deep-scrub ok
2017-06-22 14:54:29.690401 7f3c6e03d700 -1 osd.13 25313 heartbeat_check: no reply from osd.8 since back 2017-06-22 14:53:13.582897 front 2017-06-22 14:53:13.582897 (cutoff 2017-06-22 14:54:09.690398)
2017-06-22 14:54:29.690423 7f3c6e03d700 -1 osd.13 25313 heartbeat_check: no reply from osd.10 since back 2017-06-22 14:53:13.582897 front 2017-06-22 14:53:13.582897 (cutoff 2017-06-22 14:54:09.690398)
2017-06-22 14:54:29.690429 7f3c6e03d700 -1 osd.13 25313 heartbeat_check: no reply from osd.11 since back 2017-06-22 14:53:13.582897 front 2017-06-22 14:53:13.582897 (cutoff 2017-06-22 14:54:09.690398)
2017-06-22 14:54:29.907210 7f3c3776a700 -1 osd.13 25313 heartbeat_check: no reply from osd.8 since back 2017-06-22 14:53:13.582897 front 2017-06-22 14:53:13.582897 (cutoff 2017-06-22 14:54:09.907207)
2017-06-22 14:54:29.907221 7f3c3776a700 -1 osd.13 25313 heartbeat_check: no reply from osd.10 since back 2017-06-22 14:53:13.582897 front 2017-06-22 14:53:13.582897 (cutoff 2017-06-22 14:54:09.907207)
2017-06-22 14:54:29.907227 7f3c3776a700 -1 osd.13 25313 heartbeat_check: no reply from osd.11 since back 2017-06-22 14:53:13.582897 front 2017-06-22 14:53:13.582897 (cutoff 2017-06-22 14:54:09.907207)
2017-06-22 14:54:30.690551 7f3c6e03d700 -1 osd.13 25313 heartbeat_check: no reply from osd.8 since back 2017-06-22 14:53:13.582897 front 2017-06-22 14:53:13.582897 (cutoff 2017-06-22 14:54:10.690548)
2017-06-22 14:54:30.690573 7f3c6e03d700 -1 osd.13 25313 heartbeat_check: no reply from osd.10 since back 2017-06-22 14:53:13.582897 front 2017-06-22 14:53:13.582897 (cutoff 2017-06-22 14:54:10.690548)
2017-06-22 14:54:30.690579 7f3c6e03d700 -1 osd.13 25313 heartbeat_check: no reply from osd.11 since back 2017-06-22 14:53:13.582897 front 2017-06-22 14:53:13.582897 (cutoff 2017-06-22 14:54:10.690548)
2017-06-22 14:54:31.690708 7f3c6e03d700 -1 osd.13 25313 heartbeat_check: no reply from osd.8 since back 2017-06-22 14:53:13.582897 front 2017-06-22 14:53:13.582897 (cutoff 2017-06-22 14:54:11.690706)
2017-06-22 14:54:31.690729 7f3c6e03d700 -1 osd.13 25313 heartbeat_check: no reply from osd.10 since back 2017-06-22 14:53:13.582897 front 2017-06-22 14:53:13.582897 (cutoff 2017-06-22 14:54:11.690706)
2017-06-22 14:54:31.690735 7f3c6e03d700 -1 osd.13 25313 heartbeat_check: no reply from osd.11 since back 2017-06-22 14:53:13.582897 front 2017-06-22 14:53:13.582897 (cutoff 2017-06-22 14:54:11.690706)
2017-06-22 14:54:32.690862 7f3c6e03d700 -1 osd.13 25313 heartbeat_check: no reply from osd.8 since back 2017-06-22 14:53:13.582897 front 2017-06-22 14:53:13.582897 (cutoff 2017-06-22 14:54:12.690860)
2017-06-22 14:54:32.690884 7f3c6e03d700 -1 osd.13 25313 heartbeat_check: no reply from osd.10 since back 2017-06-22 14:53:13.582897 front 2017-06-22 14:53:13.582897 (cutoff 2017-06-22 14:54:12.690860)
2017-06-22 14:54:32.690890 7f3c6e03d700 -1 osd.13 25313 heartbeat_check: no reply from osd.11 since back 2017-06-22 14:53:13.582897 front 2017-06-22 14:53:13.582897 (cutoff 2017-06-22 14:54:12.690860)
2017-06-22 14:54:32.955768 7f3c5675c700  0 -- 172.16.31.7:6805/7128624 >> 172.16.31.3:6804/54002870 pipe(0x7f3ca7475400 sd=116 :6805 s=2 pgs=15531 cs=1 l=0 c=0x7f3c935ee700).fault with nothing to send, going to standby
2017-06-22 14:54:32.958675 7f3c2ea0e700  0 -- 172.16.31.7:0/2128624 >> 172.16.31.3:6808/54002870 pipe(0x7f3c9c150000 sd=189 :0 s=1 pgs=0 cs=0 l=1 c=0x7f3c97726880).fault
2017-06-22 14:54:32.958712 7f3c2c3e8700  0 -- 172.16.31.7:0/2128624 >> 172.16.31.3:6810/54002870 pipe(0x7f3ca1727400 sd=233 :0 s=1 pgs=0 cs=0 l=1 c=0x7f3c9cb16300).fault
2017-06-22 14:54:34.176427 7f3c33a5e700  0 -- 172.16.31.7:6805/7128624 >> 172.16.31.3:6800/55002870 pipe(0x7f3c99679400 sd=216 :6805 s=0 pgs=0 cs=0 l=0 c=0x7f3c9532d200).accept connect_seq 0 vs existing 0 state connecting
2017-06-22 14:54:34.545873 7f3c3ef79700  0 log_channel(cluster) log [INF] : 2.1b5 continuing backfill to osd.30 from (25014'10407450,25294'10411861] MIN to 25294'10411861
2017-06-22 14:54:34.546531 7f3c3e778700  0 log_channel(cluster) log [INF] : 2.145 continuing backfill to osd.30 from (25014'10399385,25294'10404028] MIN to 25294'10404028
2017-06-22 14:54:34.546551 7f3c43782700  0 log_channel(cluster) log [INF] : 1.2b3 continuing backfill to osd.30 from (24856'173854,25294'177823] MIN to 25294'177823
2017-06-22 14:54:57.873097 7f3c27763700  0 -- 172.16.31.7:6805/7128624 >> 172.16.31.4:6803/57002857 pipe(0x7f3c95e0b400 sd=188 :6805 s=0 pgs=0 cs=0 l=0 c=0x7f3c9fc71f80).accept we reset (peer sent cseq 1), sending RESETSESSION
2017-06-22 14:54:57.874965 7f3c27763700  0 -- 172.16.31.7:6805/7128624 >> 172.16.31.4:6803/57002857 pipe(0x7f3c95e0b400 sd=188 :6805 s=2 pgs=15769 cs=1 l=0 c=0x7f3c9fc71f80).reader missed message?  skipped from seq 0 to 1739054688
2017-06-22 14:54:57.875902 7f3c27763700  0 -- 172.16.31.7:6805/7128624 >> 172.16.31.4:6803/57002857 pipe(0x7f3c9af11400 sd=188 :6805 s=0 pgs=0 cs=0 l=0 c=0x7f3c9fc72e80).accept we reset (peer sent cseq 2), sending RESETSESSION
2017-06-22 14:54:57.878969 7f3c27763700  0 -- 172.16.31.7:6805/7128624 >> 172.16.31.4:6803/57002857 pipe(0x7f3c9af11400 sd=188 :6805 s=2 pgs=15771 cs=1 l=0 c=0x7f3c9fc72e80).reader missed message?  skipped from seq 0 to 2095419103
2017-06-22 14:54:57.880075 7f3c27763700  0 -- 172.16.31.7:6805/7128624 >> 172.16.31.4:6803/57002857 pipe(0x7f3c9af10000 sd=188 :6805 s=0 pgs=0 cs=0 l=0 c=0x7f3c87f3b480).accept we reset (peer sent cseq 2), sending RESETSESSION
2017-06-22 14:54:57.880781 7f3c27763700  0 -- 172.16.31.7:6805/7128624 >> 172.16.31.4:6803/57002857 pipe(0x7f3c9af10000 sd=188 :6805 s=2 pgs=15772 cs=1 l=0 c=0x7f3c87f3b480).reader missed message?  skipped from seq 0 to 1022945821
2017-06-22 14:54:57.881842 7f3c27763700  0 -- 172.16.31.7:6805/7128624 >> 172.16.31.4:6803/57002857 pipe(0x7f3c99679400 sd=188 :6805 s=0 pgs=0 cs=0 l=0 c=0x7f3c91431a80).accept we reset (peer sent cseq 2), sending RESETSESSION
2017-06-22 14:54:57.902933 7f3c27763700  0 -- 172.16.31.7:6805/7128624 >> 172.16.31.4:6803/57002857 pipe(0x7f3c99679400 sd=188 :6805 s=2 pgs=15773 cs=1 l=0 c=0x7f3c91431a80).fault with nothing to send, going to standby
2017-06-22 14:56:52.538631 7f3c6e03d700  0 log_channel(cluster) log [WRN] : 2 slow requests, 2 included below; oldest blocked for > 31.862172 secs
2017-06-22 14:56:52.538641 7f3c6e03d700  0 log_channel(cluster) log [WRN] : slow request 31.665672 seconds old, received at 2017-06-22 14:56:20.872915: osd_op(client.365488.1:6212453 2.debad545 10009cc83a6.00000666 [read 0~4194304 [1@-1]] snapc 0=[] ack+read+known_if_redirected e25348) currently waiting for active
2017-06-22 14:56:52.538646 7f3c6e03d700  0 log_channel(cluster) log [WRN] : slow request 31.862172 seconds old, received at 2017-06-22 14:56:20.676415: osd_op(client.365488.1:6212450 2.f781b45 10009cc83a6.00000664 [read 0~4194304 [1@-1]] snapc 0=[] ack+read+known_if_redirected e25348) currently waiting for active
2017-06-22 14:57:18.140672 7f3c6e03d700  0 log_channel(cluster) log [WRN] : 3 slow requests, 1 included below; oldest blocked for > 57.464203 secs
2017-06-22 14:57:18.140683 7f3c6e03d700  0 log_channel(cluster) log [WRN] : slow request 30.554865 seconds old, received at 2017-06-22 14:56:47.585754: osd_op(client.364255.1:1681646 2.b387afb5 1000a234aea.00000136 [write 0~4194304 [1@-1]] snapc 1=[] ondisk+write+known_if_redirected e25351) currently waiting for active




On 06/23/2017 03:28 PM, David Turner wrote:
Something about osd.13 is blocking the cluster.  I would first try running this command.  If that doesn't work, then I would restart the daemon.

# ceph osd down 13

Marking it down should force it to reassert itself to the cluster without restarting the daemon or dropping any operations it's working on.  Also, while it's down, the secondary OSDs for the PGs should be able to handle the requests that are currently blocked.  Check its log to see what it's doing.
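
Assuming the default log location, on the node hosting osd.13 that would be something like:

# tail -f /var/log/ceph/ceph-osd.13.log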

You didn't answer what your size and min_size are for your 2 pools.

On Fri, Jun 23, 2017 at 3:11 PM Daniel Davidson <danield@xxxxxxxxxxxxxxxx> wrote:
Thanks for the response:

[root@ceph-control ~]# ceph health detail | grep 'ops are blocked'
100 ops are blocked > 134218 sec on osd.13
[root@ceph-control ~]# ceph osd blocked-by
osd num_blocked

A problem with osd.13?

Dan


On 06/23/2017 02:03 PM, David Turner wrote:
# ceph health detail | grep 'ops are blocked'
# ceph osd blocked-by

My guess is that you have an OSD in a funky state that is blocking the requests and the peering.  Let me know what the output of those commands is.

Also what are the replica sizes of your 2 pools?  It shows that only 1 OSD was last active for the 2 inactive PGs.  Not sure yet if that is anything of concern, but didn't want to ignore it.
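
If you're not sure where to look, this prints size and min_size for every pool:

# ceph osd dump | grep pool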

On Fri, Jun 23, 2017 at 1:16 PM Daniel Davidson <danield@xxxxxxxxxxxxxxxx> wrote:
Two of our OSD systems hit 75% disk utilization, so I added another
system to try and bring that back down.  The system was usable for a day
while the data was being migrated, but now the system is not responding
when I try to mount it:

  mount -t ceph ceph-0,ceph-1,ceph-2,ceph-3:6789:/ /home -o name=admin,secretfile=/etc/ceph/admin.secret
mount error 5 = Input/output error

Here is our ceph health

[root@ceph-3 ~]# ceph -s
     cluster 7bffce86-9d7b-4bdf-a9c9-67670e68ca77
      health HEALTH_ERR
             2 pgs are stuck inactive for more than 300 seconds
             58 pgs backfill_wait
             20 pgs backfilling
             3 pgs degraded
             2 pgs stuck inactive
             76 pgs stuck unclean
             2 pgs undersized
             100 requests are blocked > 32 sec
             recovery 1197145/653713908 objects degraded (0.183%)
             recovery 47420551/653713908 objects misplaced (7.254%)
             mds0: Behind on trimming (180/30)
             mds0: Client biologin-0 failing to respond to capability
release
             mds0: Many clients (20) failing to respond to cache pressure
      monmap e3: 4 mons at

             election epoch 542, quorum 0,1,2,3 ceph-0,ceph-1,ceph-2,ceph-3
       fsmap e17666: 1/1/1 up {0=ceph-0=up:active}, 3 up:standby
      osdmap e25535: 32 osds: 32 up, 32 in; 78 remapped pgs
             flags sortbitwise,require_jewel_osds
       pgmap v19199544: 1536 pgs, 2 pools, 786 TB data, 299 Mobjects
             1595 TB used, 1024 TB / 2619 TB avail
             1197145/653713908 objects degraded (0.183%)
             47420551/653713908 objects misplaced (7.254%)
                 1448 active+clean
                   58 active+remapped+wait_backfill
                   17 active+remapped+backfilling
                   10 active+clean+scrubbing+deep
                    2 undersized+degraded+remapped+backfilling+peered
                    1 active+degraded+remapped+backfilling
recovery io 906 MB/s, 331 objects/s

Checking in on the inactive PGs

[root@ceph-control ~]# ceph health detail |grep inactive
HEALTH_ERR 2 pgs are stuck inactive for more than 300 seconds; 58 pgs
backfill_wait; 20 pgs backfilling; 3 pgs degraded; 2 pgs stuck inactive;
78 pgs stuck unclean; 2 pgs undersized; 100 requests are blocked > 32
sec; 1 osds have slow requests; recovery 1197145/653713908 objects
degraded (0.183%); recovery 47390082/653713908 objects misplaced
(7.249%); mds0: Behind on trimming (180/30); mds0: Client biologin-0
failing to respond to capability release; mds0: Many clients (20)
failing to respond to cache pressure
pg 2.1b5 is stuck inactive for 77215.112164, current state
undersized+degraded+remapped+backfilling+peered, last acting [13]
pg 2.145 is stuck inactive for 76910.328647, current state
undersized+degraded+remapped+backfilling+peered, last acting [13]

If I query the PG, I don't get a response:

[root@ceph-control ~]# ceph pg 2.1b5 query

Any ideas on what to do?

Dan


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
