Urgent: Reduced data availability / All pgs inactive

Hi all,

I hope someone can help me. After restarting one node of my 2-node cluster, I suddenly get this:

root@yak2 /var/www/projects # ceph -s
  cluster:
    id:     749b2473-9300-4535-97a6-ee6d55008a1b
    health: HEALTH_WARN
            Reduced data availability: 200 pgs inactive

  services:
    mon: 3 daemons, quorum yak1,yak2,yak0
    mgr: yak0.planwerk6.de(active), standbys: yak1.planwerk6.de, yak2.planwerk6.de
    mds: cephfs-1/1/1 up  {0=yak1.planwerk6.de=up:active}, 1 up:standby
    osd: 2 osds: 2 up, 2 in

  data:
    pools:   2 pools, 200 pgs
    objects: 0  objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:     100.000% pgs unknown
             200 unknown

And this:


root@yak2 /var/www/projects # ceph health detail
HEALTH_WARN Reduced data availability: 200 pgs inactive
PG_AVAILABILITY Reduced data availability: 200 pgs inactive
    pg 1.34 is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 1.35 is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 1.36 is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 1.37 is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 1.38 is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 1.39 is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 1.3a is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 1.3b is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 1.3c is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 1.3d is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 1.3e is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 1.3f is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 1.40 is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 1.41 is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 1.42 is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 1.43 is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 1.44 is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 1.45 is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 1.46 is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 1.47 is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 1.48 is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 1.49 is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 1.4a is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 1.4b is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 1.4c is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 1.4d is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 2.34 is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 2.35 is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 2.36 is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 2.38 is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 2.39 is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 2.3a is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 2.3b is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 2.3c is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 2.3d is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 2.3e is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 2.3f is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 2.40 is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 2.41 is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 2.42 is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 2.43 is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 2.44 is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 2.45 is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 2.46 is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 2.47 is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 2.48 is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 2.49 is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 2.4a is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 2.4b is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 2.4e is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 2.4f is stuck inactive for 3506.815664, current state unknown, last acting []

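To tally a listing like this per pool and state, a small script can help. This is only a minimal sketch: the function name `summarize_stuck_pgs` and the regular expression are hypothetical, and the line format is assumed from the output pasted above. None of it is a Ceph API.

```python
import re
from collections import Counter

# Line format assumed from the `ceph health detail` output quoted above.
PG_LINE = re.compile(
    r"pg (?P<pool>\d+)\.(?P<seed>[0-9a-f]+) is stuck (?P<what>\w+) "
    r"for (?P<secs>[\d.]+), current state (?P<state>[^,]+), "
    r"last acting (?P<acting>\[.*\])"
)

def summarize_stuck_pgs(text):
    """Count stuck PGs per (pool id, reported state); hypothetical helper."""
    counts = Counter()
    for line in text.splitlines():
        m = PG_LINE.search(line)
        if m:  # header lines such as HEALTH_WARN do not match and are skipped
            counts[(int(m.group("pool")), m.group("state"))] += 1
    return counts

# Two lines copied from the output above, plus one header line.
sample = """\
PG_AVAILABILITY Reduced data availability: 200 pgs inactive
    pg 1.34 is stuck inactive for 3506.815664, current state unknown, last acting []
    pg 2.4f is stuck inactive for 3506.815664, current state unknown, last acting []
"""

if __name__ == "__main__":
    for (pool, state), n in sorted(summarize_stuck_pgs(sample).items()):
        print(f"pool {pool}: {n} pg(s) in state {state}")
```

To run it on real output, the text of `ceph health detail` would be passed in place of `sample`.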
But if I query an individual PG, it reports active+clean:

root@yak1 /var/www/projects # ceph pg 1.49 query
{
    "state": "active+clean",
    "snap_trimq": "[]",
    "snap_trimq_len": 0,
    "epoch": 162,
    "up": [
        0,
        1
    ],
    "acting": [
        0,
        1
    ],
    "acting_recovery_backfill": [
        "0",
        "1"
    ],
    "info": {
        "pgid": "1.49",
        "last_update": "127'38077",
        "last_complete": "127'38077",
        "log_tail": "127'35000",
        "last_user_version": 38077,
        "last_backfill": "MAX",
        "last_backfill_bitwise": 0,
        "purged_snaps": [],
        "history": {
            "epoch_created": 10,
            "epoch_pool_created": 10,
            "last_epoch_started": 159,
            "last_interval_started": 158,
            "last_epoch_clean": 159,
            "last_interval_clean": 158,
            "last_epoch_split": 0,
            "last_epoch_marked_full": 0,
            "same_up_since": 158,
            "same_interval_since": 158,
            "same_primary_since": 135,
            "last_scrub": "127'36909",
            "last_scrub_stamp": "2019-02-20 15:02:45.204342",
            "last_deep_scrub": "127'36714",
            "last_deep_scrub_stamp": "2019-02-16 07:55:15.205861",
            "last_clean_scrub_stamp": "2019-02-20 15:02:45.204342"
        },
        "stats": {
            "version": "127'38077",
            "reported_seq": "58934",
            "reported_epoch": "162",
            "state": "active+clean",
            "last_fresh": "2019-02-20 19:56:56.740536",
            "last_change": "2019-02-20 19:52:27.063812",
            "last_active": "2019-02-20 19:56:56.740536",
            "last_peered": "2019-02-20 19:56:56.740536",
            "last_clean": "2019-02-20 19:56:56.740536",
            "last_became_active": "2019-02-20 19:52:27.062689",
            "last_became_peered": "2019-02-20 19:52:27.062689",
            "last_unstale": "2019-02-20 19:56:56.740536",
            "last_undegraded": "2019-02-20 19:56:56.740536",
            "last_fullsized": "2019-02-20 19:56:56.740536",
            "mapping_epoch": 158,
            "log_start": "127'35000",
            "ondisk_log_start": "127'35000",
            "created": 10,
            "last_epoch_clean": 159,
            "parent": "0.0",
            "parent_split_bits": 0,
            "last_scrub": "127'36909",
            "last_scrub_stamp": "2019-02-20 15:02:45.204342",
            "last_deep_scrub": "127'36714",
            "last_deep_scrub_stamp": "2019-02-16 07:55:15.205861",
            "last_clean_scrub_stamp": "2019-02-20 15:02:45.204342",
            "log_size": 3077,
            "ondisk_log_size": 3077,
            "stats_invalid": false,
            "dirty_stats_invalid": false,
            "omap_stats_invalid": false,
            "hitset_stats_invalid": false,
            "hitset_bytes_stats_invalid": false,
            "pin_stats_invalid": false,
            "manifest_stats_invalid": true,
            "snaptrimq_len": 0,
            "stat_sum": {
                "num_bytes": 478347970,
                "num_objects": 12052,
                "num_object_clones": 0,
                "num_object_copies": 24104,
                "num_objects_missing_on_primary": 0,
                "num_objects_missing": 0,
                "num_objects_degraded": 0,
                "num_objects_misplaced": 0,
                "num_objects_unfound": 0,
                "num_objects_dirty": 12052,
                "num_whiteouts": 0,
                "num_read": 20186,
                "num_read_kb": 1952018,
                "num_write": 38927,
                "num_write_kb": 484756,
                "num_scrub_errors": 0,
                "num_shallow_scrub_errors": 0,
                "num_deep_scrub_errors": 0,
                "num_objects_recovered": 6,
                "num_bytes_recovered": 4101,
                "num_keys_recovered": 0,
                "num_objects_omap": 0,
                "num_objects_hit_set_archive": 0,
                "num_bytes_hit_set_archive": 0,
                "num_flush": 0,
                "num_flush_kb": 0,
                "num_evict": 0,
                "num_evict_kb": 0,
                "num_promote": 0,
                "num_flush_mode_high": 0,
                "num_flush_mode_low": 0,
                "num_evict_mode_some": 0,
                "num_evict_mode_full": 0,
                "num_objects_pinned": 0,
                "num_legacy_snapsets": 0,
                "num_large_omap_objects": 0,
                "num_objects_manifest": 0
            },
            "up": [
                0,
                1
            ],
            "acting": [
                0,
                1
            ],
            "blocked_by": [],
            "up_primary": 0,
            "acting_primary": 0,
            "purged_snaps": []
        },
        "empty": 0,
        "dne": 0,
        "incomplete": 0,
        "last_epoch_started": 159,
        "hit_set_history": {
            "current_last_update": "0'0",
            "history": []
        }
    },
    "peer_info": [
        {
            "peer": "1",
            "pgid": "1.49",
            "last_update": "127'38077",
            "last_complete": "127'38077",
            "log_tail": "127'35000",
            "last_user_version": 38077,
            "last_backfill": "MAX",
            "last_backfill_bitwise": 0,
            "purged_snaps": [],
            "history": {
                "epoch_created": 10,
                "epoch_pool_created": 10,
                "last_epoch_started": 159,
                "last_interval_started": 158,
                "last_epoch_clean": 159,
                "last_interval_clean": 158,
                "last_epoch_split": 0,
                "last_epoch_marked_full": 0,
                "same_up_since": 158,
                "same_interval_since": 158,
                "same_primary_since": 135,
                "last_scrub": "127'36909",
                "last_scrub_stamp": "2019-02-20 15:02:45.204342",
                "last_deep_scrub": "127'36714",
                "last_deep_scrub_stamp": "2019-02-16 07:55:15.205861",
                "last_clean_scrub_stamp": "2019-02-20 15:02:45.204342"
            },
            "stats": {
                "version": "127'38077",
                "reported_seq": "58745",
                "reported_epoch": "134",
                "state": "active+undersized+degraded",
                "last_fresh": "2019-02-20 19:06:19.180016",
                "last_change": "2019-02-20 19:04:39.483332",
                "last_active": "2019-02-20 19:06:19.180016",
                "last_peered": "2019-02-20 19:06:19.180016",
                "last_clean": "2019-02-20 18:23:33.675145",
                "last_became_active": "2019-02-20 19:04:39.483332",
                "last_became_peered": "2019-02-20 19:04:39.483332",
                "last_unstale": "2019-02-20 19:06:19.180016",
                "last_undegraded": "2019-02-20 19:04:39.477829",
                "last_fullsized": "2019-02-20 19:04:39.477717",
                "mapping_epoch": 158,
                "log_start": "127'35000",
                "ondisk_log_start": "127'35000",
                "created": 10,
                "last_epoch_clean": 124,
                "parent": "0.0",
                "parent_split_bits": 0,
                "last_scrub": "127'36909",
                "last_scrub_stamp": "2019-02-20 15:02:45.204342",
                "last_deep_scrub": "127'36714",
                "last_deep_scrub_stamp": "2019-02-16 07:55:15.205861",
                "last_clean_scrub_stamp": "2019-02-20 15:02:45.204342",
                "log_size": 3077,
                "ondisk_log_size": 3077,
                "stats_invalid": false,
                "dirty_stats_invalid": false,
                "omap_stats_invalid": false,
                "hitset_stats_invalid": false,
                "hitset_bytes_stats_invalid": false,
                "pin_stats_invalid": false,
                "manifest_stats_invalid": true,
                "snaptrimq_len": 0,
                "stat_sum": {
                    "num_bytes": 478347970,
                    "num_objects": 12052,
                    "num_object_clones": 0,
                    "num_object_copies": 24104,
                    "num_objects_missing_on_primary": 0,
                    "num_objects_missing": 0,
                    "num_objects_degraded": 12052,
                    "num_objects_misplaced": 0,
                    "num_objects_unfound": 0,
                    "num_objects_dirty": 12052,
                    "num_whiteouts": 0,
                    "num_read": 20186,
                    "num_read_kb": 1952018,
                    "num_write": 38927,
                    "num_write_kb": 484756,
                    "num_scrub_errors": 0,
                    "num_shallow_scrub_errors": 0,
                    "num_deep_scrub_errors": 0,
                    "num_objects_recovered": 6,
                    "num_bytes_recovered": 4101,
                    "num_keys_recovered": 0,
                    "num_objects_omap": 0,
                    "num_objects_hit_set_archive": 0,
                    "num_bytes_hit_set_archive": 0,
                    "num_flush": 0,
                    "num_flush_kb": 0,
                    "num_evict": 0,
                    "num_evict_kb": 0,
                    "num_promote": 0,
                    "num_flush_mode_high": 0,
                    "num_flush_mode_low": 0,
                    "num_evict_mode_some": 0,
                    "num_evict_mode_full": 0,
                    "num_objects_pinned": 0,
                    "num_legacy_snapsets": 0,
                    "num_large_omap_objects": 0,
                    "num_objects_manifest": 0
                },
                "up": [
                    0,
                    1
                ],
                "acting": [
                    0,
                    1
                ],
                "blocked_by": [],
                "up_primary": 0,
                "acting_primary": 0,
                "purged_snaps": []
            },
            "empty": 0,
            "dne": 0,
            "incomplete": 0,
            "last_epoch_started": 159,
            "hit_set_history": {
                "current_last_update": "0'0",
                "history": []
            }
        }
    ],
    "recovery_state": [
        {
            "name": "Started/Primary/Active",
            "enter_time": "2019-02-20 19:52:27.027151",
            "might_have_unfound": [],
            "recovery_progress": {
                "backfill_targets": [],
                "waiting_on_backfill": [],
                "last_backfill_started": "MIN",
                "backfill_info": {
                    "begin": "MIN",
                    "end": "MIN",
                    "objects": []
                },
                "peer_backfill_info": [],
                "backfills_in_flight": [],
                "recovering": [],
                "pg_backend": {
                    "pull_from_peer": [],
                    "pushing": []
                }
            },
            "scrub": {
                "scrubber.epoch_start": "0",
                "scrubber.active": false,
                "scrubber.state": "INACTIVE",
                "scrubber.start": "MIN",
                "scrubber.end": "MIN",
                "scrubber.max_end": "MIN",
                "scrubber.subset_last_update": "0'0",
                "scrubber.deep": false,
                "scrubber.waiting_on_whom": []
            }
        },
        {
            "name": "Started",
            "enter_time": "2019-02-20 19:52:25.976144"
        }
    ],
    "agent_state": {}
}
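Only a handful of fields in that dump matter for the comparison with `ceph health detail`. The sketch below pulls them out; `pg_summary` is a hypothetical helper, and the key names are assumed from the JSON pasted above rather than from any documented schema.

```python
import json

def pg_summary(query_json):
    """Extract state, epoch, and up/acting sets from `ceph pg <pgid> query` JSON.

    Hypothetical helper; field names are taken from the output quoted above.
    """
    q = json.loads(query_json)
    return {
        "state": q["state"],
        "epoch": q["epoch"],
        "up": q["up"],
        "acting": q["acting"],
        # Per-peer stats can lag behind the primary; keep state and epoch together.
        "peer_states": {
            p["peer"]: (p["stats"]["state"], p["stats"]["reported_epoch"])
            for p in q.get("peer_info", [])
        },
    }

# Trimmed copy of the query output above, reduced to the fields the helper reads.
sample = """
{
    "state": "active+clean",
    "epoch": 162,
    "up": [0, 1],
    "acting": [0, 1],
    "peer_info": [
        {
            "peer": "1",
            "stats": {
                "reported_epoch": "134",
                "state": "active+undersized+degraded"
            }
        }
    ]
}
"""

if __name__ == "__main__":
    print(pg_summary(sample))
```

For PG 1.49 this shows the primary reporting active+clean at epoch 162 with up and acting sets [0, 1], while the peer stats shown for OSD 1 were last reported at epoch 134.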

I wonder what all this means and how to get out of this situation. The cluster seems to work normally, but as you can probably imagine, it's quite disconcerting. Could it be a firewall issue? I'm not aware of any changes, and I don't see any peering problems.

Thank you

Ranjan







_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
