Hi,
You have a problem with the MGR.
The ceph-mgr hasn't yet received any information about the PGs' state from any OSD since the mgr started up.
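If only the mgr's view is stale, the PGs themselves are usually fine (your pg query below shows 1.49 as active+clean). A minimal sketch of how you could check the mgr and force a failover, assuming standard ceph CLI commands and systemd unit names (adjust the daemon name to whatever is active on your cluster):

# see which mgr is active and which are standbys
ceph mgr dump
# fail the active mgr so a standby takes over (name taken from your ceph -s output)
ceph mgr fail yak0.planwerk6.de
# alternatively, restart the mgr daemon on its node; the unit name may differ on your setup
systemctl restart ceph-mgr@yak0.planwerk6.de
# then watch whether the PGs leave the 'unknown' state
watch ceph -s

Once a freshly started mgr has received reports from both OSDs, the 'unknown' PGs should disappear from ceph -s.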
Thu, 21 Feb 2019 at 09:04, Irek Fasikhov <malmyzh@xxxxxxxxx>:
Hi, you have a problem with the MGR. The ceph-mgr hasn't yet received any information about the PGs' state from any OSD since the mgr started up.
Wed, 20 Feb 2019 at 23:10, Ranjan Ghosh <ghosh@xxxxxx>:
Hi all,
I hope someone can help me. After restarting a node of my 2-node cluster, I suddenly get this:
root@yak2 /var/www/projects # ceph -s
cluster:
id: 749b2473-9300-4535-97a6-ee6d55008a1b
health: HEALTH_WARN
Reduced data availability: 200 pgs inactive
services:
mon: 3 daemons, quorum yak1,yak2,yak0
mgr: yak0.planwerk6.de(active), standbys: yak1.planwerk6.de, yak2.planwerk6.de
mds: cephfs-1/1/1 up {0=yak1.planwerk6.de=up:active}, 1 up:standby
osd: 2 osds: 2 up, 2 in
data:
pools: 2 pools, 200 pgs
objects: 0 objects, 0 B
usage: 0 B used, 0 B / 0 B avail
pgs: 100.000% pgs unknown
200 unknown
And this:
root@yak2 /var/www/projects # ceph health detail
HEALTH_WARN Reduced data availability: 200 pgs inactive
PG_AVAILABILITY Reduced data availability: 200 pgs inactive
pg 1.34 is stuck inactive for 3506.815664, current state unknown, last acting []
pg 1.35 is stuck inactive for 3506.815664, current state unknown, last acting []
pg 1.36 is stuck inactive for 3506.815664, current state unknown, last acting []
pg 1.37 is stuck inactive for 3506.815664, current state unknown, last acting []
pg 1.38 is stuck inactive for 3506.815664, current state unknown, last acting []
pg 1.39 is stuck inactive for 3506.815664, current state unknown, last acting []
pg 1.3a is stuck inactive for 3506.815664, current state unknown, last acting []
pg 1.3b is stuck inactive for 3506.815664, current state unknown, last acting []
pg 1.3c is stuck inactive for 3506.815664, current state unknown, last acting []
pg 1.3d is stuck inactive for 3506.815664, current state unknown, last acting []
pg 1.3e is stuck inactive for 3506.815664, current state unknown, last acting []
pg 1.3f is stuck inactive for 3506.815664, current state unknown, last acting []
pg 1.40 is stuck inactive for 3506.815664, current state unknown, last acting []
pg 1.41 is stuck inactive for 3506.815664, current state unknown, last acting []
pg 1.42 is stuck inactive for 3506.815664, current state unknown, last acting []
pg 1.43 is stuck inactive for 3506.815664, current state unknown, last acting []
pg 1.44 is stuck inactive for 3506.815664, current state unknown, last acting []
pg 1.45 is stuck inactive for 3506.815664, current state unknown, last acting []
pg 1.46 is stuck inactive for 3506.815664, current state unknown, last acting []
pg 1.47 is stuck inactive for 3506.815664, current state unknown, last acting []
pg 1.48 is stuck inactive for 3506.815664, current state unknown, last acting []
pg 1.49 is stuck inactive for 3506.815664, current state unknown, last acting []
pg 1.4a is stuck inactive for 3506.815664, current state unknown, last acting []
pg 1.4b is stuck inactive for 3506.815664, current state unknown, last acting []
pg 1.4c is stuck inactive for 3506.815664, current state unknown, last acting []
pg 1.4d is stuck inactive for 3506.815664, current state unknown, last acting []
pg 2.34 is stuck inactive for 3506.815664, current state unknown, last acting []
pg 2.35 is stuck inactive for 3506.815664, current state unknown, last acting []
pg 2.36 is stuck inactive for 3506.815664, current state unknown, last acting []
pg 2.38 is stuck inactive for 3506.815664, current state unknown, last acting []
pg 2.39 is stuck inactive for 3506.815664, current state unknown, last acting []
pg 2.3a is stuck inactive for 3506.815664, current state unknown, last acting []
pg 2.3b is stuck inactive for 3506.815664, current state unknown, last acting []
pg 2.3c is stuck inactive for 3506.815664, current state unknown, last acting []
pg 2.3d is stuck inactive for 3506.815664, current state unknown, last acting []
pg 2.3e is stuck inactive for 3506.815664, current state unknown, last acting []
pg 2.3f is stuck inactive for 3506.815664, current state unknown, last acting []
pg 2.40 is stuck inactive for 3506.815664, current state unknown, last acting []
pg 2.41 is stuck inactive for 3506.815664, current state unknown, last acting []
pg 2.42 is stuck inactive for 3506.815664, current state unknown, last acting []
pg 2.43 is stuck inactive for 3506.815664, current state unknown, last acting []
pg 2.44 is stuck inactive for 3506.815664, current state unknown, last acting []
pg 2.45 is stuck inactive for 3506.815664, current state unknown, last acting []
pg 2.46 is stuck inactive for 3506.815664, current state unknown, last acting []
pg 2.47 is stuck inactive for 3506.815664, current state unknown, last acting []
pg 2.48 is stuck inactive for 3506.815664, current state unknown, last acting []
pg 2.49 is stuck inactive for 3506.815664, current state unknown, last acting []
pg 2.4a is stuck inactive for 3506.815664, current state unknown, last acting []
pg 2.4b is stuck inactive for 3506.815664, current state unknown, last acting []
pg 2.4e is stuck inactive for 3506.815664, current state unknown, last acting []
pg 2.4f is stuck inactive for 3506.815664, current state unknown, last acting []
But if I query an individual PG I get this:
root@yak1 /var/www/projects # ceph pg 1.49 query
{
"state": "active+clean",
"snap_trimq": "[]",
"snap_trimq_len": 0,
"epoch": 162,
"up": [
0,
1
],
"acting": [
0,
1
],
"acting_recovery_backfill": [
"0",
"1"
],
"info": {
"pgid": "1.49",
"last_update": "127'38077",
"last_complete": "127'38077",
"log_tail": "127'35000",
"last_user_version": 38077,
"last_backfill": "MAX",
"last_backfill_bitwise": 0,
"purged_snaps": [],
"history": {
"epoch_created": 10,
"epoch_pool_created": 10,
"last_epoch_started": 159,
"last_interval_started": 158,
"last_epoch_clean": 159,
"last_interval_clean": 158,
"last_epoch_split": 0,
"last_epoch_marked_full": 0,
"same_up_since": 158,
"same_interval_since": 158,
"same_primary_since": 135,
"last_scrub": "127'36909",
"last_scrub_stamp": "2019-02-20 15:02:45.204342",
"last_deep_scrub": "127'36714",
"last_deep_scrub_stamp": "2019-02-16 07:55:15.205861",
"last_clean_scrub_stamp": "2019-02-20 15:02:45.204342"
},
"stats": {
"version": "127'38077",
"reported_seq": "58934",
"reported_epoch": "162",
"state": "active+clean",
"last_fresh": "2019-02-20 19:56:56.740536",
"last_change": "2019-02-20 19:52:27.063812",
"last_active": "2019-02-20 19:56:56.740536",
"last_peered": "2019-02-20 19:56:56.740536",
"last_clean": "2019-02-20 19:56:56.740536",
"last_became_active": "2019-02-20 19:52:27.062689",
"last_became_peered": "2019-02-20 19:52:27.062689",
"last_unstale": "2019-02-20 19:56:56.740536",
"last_undegraded": "2019-02-20 19:56:56.740536",
"last_fullsized": "2019-02-20 19:56:56.740536",
"mapping_epoch": 158,
"log_start": "127'35000",
"ondisk_log_start": "127'35000",
"created": 10,
"last_epoch_clean": 159,
"parent": "0.0",
"parent_split_bits": 0,
"last_scrub": "127'36909",
"last_scrub_stamp": "2019-02-20 15:02:45.204342",
"last_deep_scrub": "127'36714",
"last_deep_scrub_stamp": "2019-02-16 07:55:15.205861",
"last_clean_scrub_stamp": "2019-02-20 15:02:45.204342",
"log_size": 3077,
"ondisk_log_size": 3077,
"stats_invalid": false,
"dirty_stats_invalid": false,
"omap_stats_invalid": false,
"hitset_stats_invalid": false,
"hitset_bytes_stats_invalid": false,
"pin_stats_invalid": false,
"manifest_stats_invalid": true,
"snaptrimq_len": 0,
"stat_sum": {
"num_bytes": 478347970,
"num_objects": 12052,
"num_object_clones": 0,
"num_object_copies": 24104,
"num_objects_missing_on_primary": 0,
"num_objects_missing": 0,
"num_objects_degraded": 0,
"num_objects_misplaced": 0,
"num_objects_unfound": 0,
"num_objects_dirty": 12052,
"num_whiteouts": 0,
"num_read": 20186,
"num_read_kb": 1952018,
"num_write": 38927,
"num_write_kb": 484756,
"num_scrub_errors": 0,
"num_shallow_scrub_errors": 0,
"num_deep_scrub_errors": 0,
"num_objects_recovered": 6,
"num_bytes_recovered": 4101,
"num_keys_recovered": 0,
"num_objects_omap": 0,
"num_objects_hit_set_archive": 0,
"num_bytes_hit_set_archive": 0,
"num_flush": 0,
"num_flush_kb": 0,
"num_evict": 0,
"num_evict_kb": 0,
"num_promote": 0,
"num_flush_mode_high": 0,
"num_flush_mode_low": 0,
"num_evict_mode_some": 0,
"num_evict_mode_full": 0,
"num_objects_pinned": 0,
"num_legacy_snapsets": 0,
"num_large_omap_objects": 0,
"num_objects_manifest": 0
},
"up": [
0,
1
],
"acting": [
0,
1
],
"blocked_by": [],
"up_primary": 0,
"acting_primary": 0,
"purged_snaps": []
},
"empty": 0,
"dne": 0,
"incomplete": 0,
"last_epoch_started": 159,
"hit_set_history": {
"current_last_update": "0'0",
"history": []
}
},
"peer_info": [
{
"peer": "1",
"pgid": "1.49",
"last_update": "127'38077",
"last_complete": "127'38077",
"log_tail": "127'35000",
"last_user_version": 38077,
"last_backfill": "MAX",
"last_backfill_bitwise": 0,
"purged_snaps": [],
"history": {
"epoch_created": 10,
"epoch_pool_created": 10,
"last_epoch_started": 159,
"last_interval_started": 158,
"last_epoch_clean": 159,
"last_interval_clean": 158,
"last_epoch_split": 0,
"last_epoch_marked_full": 0,
"same_up_since": 158,
"same_interval_since": 158,
"same_primary_since": 135,
"last_scrub": "127'36909",
"last_scrub_stamp": "2019-02-20 15:02:45.204342",
"last_deep_scrub": "127'36714",
"last_deep_scrub_stamp": "2019-02-16 07:55:15.205861",
"last_clean_scrub_stamp": "2019-02-20 15:02:45.204342"
},
"stats": {
"version": "127'38077",
"reported_seq": "58745",
"reported_epoch": "134",
"state": "active+undersized+degraded",
"last_fresh": "2019-02-20 19:06:19.180016",
"last_change": "2019-02-20 19:04:39.483332",
"last_active": "2019-02-20 19:06:19.180016",
"last_peered": "2019-02-20 19:06:19.180016",
"last_clean": "2019-02-20 18:23:33.675145",
"last_became_active": "2019-02-20 19:04:39.483332",
"last_became_peered": "2019-02-20 19:04:39.483332",
"last_unstale": "2019-02-20 19:06:19.180016",
"last_undegraded": "2019-02-20 19:04:39.477829",
"last_fullsized": "2019-02-20 19:04:39.477717",
"mapping_epoch": 158,
"log_start": "127'35000",
"ondisk_log_start": "127'35000",
"created": 10,
"last_epoch_clean": 124,
"parent": "0.0",
"parent_split_bits": 0,
"last_scrub": "127'36909",
"last_scrub_stamp": "2019-02-20 15:02:45.204342",
"last_deep_scrub": "127'36714",
"last_deep_scrub_stamp": "2019-02-16 07:55:15.205861",
"last_clean_scrub_stamp": "2019-02-20 15:02:45.204342",
"log_size": 3077,
"ondisk_log_size": 3077,
"stats_invalid": false,
"dirty_stats_invalid": false,
"omap_stats_invalid": false,
"hitset_stats_invalid": false,
"hitset_bytes_stats_invalid": false,
"pin_stats_invalid": false,
"manifest_stats_invalid": true,
"snaptrimq_len": 0,
"stat_sum": {
"num_bytes": 478347970,
"num_objects": 12052,
"num_object_clones": 0,
"num_object_copies": 24104,
"num_objects_missing_on_primary": 0,
"num_objects_missing": 0,
"num_objects_degraded": 12052,
"num_objects_misplaced": 0,
"num_objects_unfound": 0,
"num_objects_dirty": 12052,
"num_whiteouts": 0,
"num_read": 20186,
"num_read_kb": 1952018,
"num_write": 38927,
"num_write_kb": 484756,
"num_scrub_errors": 0,
"num_shallow_scrub_errors": 0,
"num_deep_scrub_errors": 0,
"num_objects_recovered": 6,
"num_bytes_recovered": 4101,
"num_keys_recovered": 0,
"num_objects_omap": 0,
"num_objects_hit_set_archive": 0,
"num_bytes_hit_set_archive": 0,
"num_flush": 0,
"num_flush_kb": 0,
"num_evict": 0,
"num_evict_kb": 0,
"num_promote": 0,
"num_flush_mode_high": 0,
"num_flush_mode_low": 0,
"num_evict_mode_some": 0,
"num_evict_mode_full": 0,
"num_objects_pinned": 0,
"num_legacy_snapsets": 0,
"num_large_omap_objects": 0,
"num_objects_manifest": 0
},
"up": [
0,
1
],
"acting": [
0,
1
],
"blocked_by": [],
"up_primary": 0,
"acting_primary": 0,
"purged_snaps": []
},
"empty": 0,
"dne": 0,
"incomplete": 0,
"last_epoch_started": 159,
"hit_set_history": {
"current_last_update": "0'0",
"history": []
}
}
],
"recovery_state": [
{
"name": "Started/Primary/Active",
"enter_time": "2019-02-20 19:52:27.027151",
"might_have_unfound": [],
"recovery_progress": {
"backfill_targets": [],
"waiting_on_backfill": [],
"last_backfill_started": "MIN",
"backfill_info": {
"begin": "MIN",
"end": "MIN",
"objects": []
},
"peer_backfill_info": [],
"backfills_in_flight": [],
"recovering": [],
"pg_backend": {
"pull_from_peer": [],
"pushing": []
}
},
"scrub": {
"scrubber.epoch_start": "0",
"scrubber.active": false,
"scrubber.state": "INACTIVE",
"scrubber.start": "MIN",
"scrubber.end": "MIN",
"scrubber.max_end": "MIN",
"scrubber.subset_last_update": "0'0",
"scrubber.deep": false,
"scrubber.waiting_on_whom": []
}
},
{
"name": "Started",
"enter_time": "2019-02-20 19:52:25.976144"
}
],
"agent_state": {}
}
I wonder what this all means and how to get out of this situation. The cluster seems to be working normally, but it's quite disconcerting, as you can probably imagine. Could it be a firewall issue? I'm not aware of any changes, and I don't see any peering problems...
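If it were a firewall issue, I guess I could rule it out roughly like this, assuming the daemons use the default 6800-7300 port range (hostnames and ports adjusted as needed):

# on each node, list the ports the local ceph daemons are listening on
ss -tlnp | grep ceph
# from the other node, test that one of those ports is reachable
nc -zv yak1 6800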
Thank you
Ranjan
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com