Re: New firefly tiny cluster stuck unclean

Hi all,

In the end, this was fixed as follows:
# ceph osd pool set rbd size 1
(wait a few seconds for HEALTH_OK)
# ceph osd pool set rbd size 2
(wait almost an hour for HEALTH_OK after backfilling)
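
For anyone wanting to script the same workaround, here is a minimal sketch. The helper names (wait_for_health_ok, bounce_pool_size), the polling intervals, and the timeouts are assumptions made up for illustration; only "ceph health" and "ceph osd pool set <pool> size <n>" correspond to commands used above. Keep in mind that the pool has no redundancy while size is 1.

```shell
#!/bin/sh
# Poll `ceph health` until it reports HEALTH_OK or the timeout expires.
# Hypothetical helper: the name, interval and timeout are illustrative.
wait_for_health_ok() {
    timeout="${1:-3600}"   # total seconds to wait
    interval="${2:-10}"    # seconds between polls
    waited=0
    while [ "$waited" -le "$timeout" ]; do
        case "$(ceph health 2>/dev/null)" in
            HEALTH_OK*) return 0 ;;
        esac
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 1
}

# Drop the pool to one replica, wait for a clean state, then restore the
# replica count and wait for backfilling to finish.
# WARNING: while size is 1 the pool's data has no redundancy.
bounce_pool_size() {
    pool="$1"
    final_size="$2"
    ceph osd pool set "$pool" size 1 || return 1
    wait_for_health_ok 600 5 || return 1
    ceph osd pool set "$pool" size "$final_size" || return 1
    wait_for_health_ok 7200 30
}
```

With these definitions loaded, "bounce_pool_size rbd 2" would repeat the two steps above for the rbd pool.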

I wanted to avoid this, but I didn't want to leave the cluster in a bad state all night :)

I really think there is some kind of bug that sometimes prevents Ceph from backfilling correctly; this is quite similar to another problem I reported in December (that time the pool was originally size=3 and did not become clean after being changed to size=2).

This time the default pools were deleted and a new "rbd" pool was created with size=2. This was done before adding the OSDs of one of the nodes.

Thanks
Eneko

On 20/01/15 16:23, Eneko Lacunza wrote:
Hi all,

I've just created a new Ceph cluster for RBD with the latest firefly:
- 3 monitors
- 2 OSD nodes, each with 1 S3700 (journals) + 2 x 3TB WD Red (OSDs)

The network is 1 Gbit, with different physical interfaces for the public and private networks. There is only one pool, "rbd", with size=2. Only 5 RBD devices have been created.

Somehow I reached the following status:
    cluster 8f839a95-d5e3-4a31-981e-497f9a0e4991
     health HEALTH_WARN 16 pgs stuck unclean; recovery 2986/47638 objects degraded (6.268%)
     monmap e3: 3 mons at {0=172.16.1.3:6789/0,1=172.16.1.1:6789/0,2=172.16.1.2:6789/0}, election epoch 10, quorum 0,1,2 1,2,0
     osdmap e38: 4 osds: 4 up, 4 in
      pgmap v4347: 128 pgs, 1 pools, 95232 MB data, 23819 objects
            186 GB used, 10985 GB / 11171 GB avail
            2986/47638 objects degraded (6.268%)
                  16 active
                 112 active+clean
  client io 43854 B/s wr, 10 op/s
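
As a rough cross-check of the figures above: the denominator 47638 appears to be the number of object copies (23819 objects x 2 replicas), and 2986 degraded copies out of 47638 is about 6.268%. A minimal sketch of that arithmetic follows; the function name degraded_pct is made up for illustration.

```shell
#!/bin/sh
# Percentage of degraded object copies, printed with three decimals.
# Hypothetical helper, only meant to reproduce the arithmetic.
degraded_pct() {
    awk -v d="$1" -v t="$2" 'BEGIN { printf "%.3f\n", d / t * 100 }'
}

degraded_pct 2986 47638    # degraded / total copies, from the status output
echo $((23819 * 2))        # objects x pool size
```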

I can't see why these 16 PGs are stuck unclean. Can anybody give me a hint?

# cat /etc/pve/ceph.conf
[global]
     auth client required = cephx
     auth cluster required = cephx
     auth service required = cephx
     auth supported = cephx
     cluster network = 172.16.2.0/24
     filestore xattr use omap = true
     fsid = 8f839a95-d5e3-4a31-981e-497f9a0e4991
     keyring = /etc/pve/priv/$cluster.$name.keyring
     osd journal size = 5120
     osd pool default min size = 1
     public network = 172.16.1.0/24

[osd]
     keyring = /var/lib/ceph/osd/ceph-$id/keyring
     osd max backfills = 1
     osd recovery max active = 1

[mon.0]
     host = proxmox3
     mon addr = 172.16.1.3:6789

[mon.1]
     host = proxmox1
     mon addr = 172.16.1.1:6789

[mon.2]
     host = proxmox2
     mon addr = 172.16.1.2:6789


# ceph pg dump_stuck
ok
pg_stat objects mip degr unf bytes log disklog state state_stamp v reported up up_primary acting acting_primary last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp
3.8 155 0 155 0 650117120 359 359 active 2015-01-20 12:44:19.545685 38'359 38:1593 [1,3] 1 [1,3] 1 0'0 2015-01-20 12:44:15.677078 0'0 2015-01-20 12:44:15.677078
3.22 217 0 217 0 910163968 987 987 active 2015-01-20 12:44:19.539596 38'987 38:1312 [3,1] 3 [3,1] 3 0'0 2015-01-20 12:44:15.676128 0'0 2015-01-20 12:44:15.676128
3.1e 179 0 179 0 750780416 3001 3001 active 2015-01-20 12:44:19.539570 38'5410 38:5961 [3,0] 3 [3,0] 3 0'0 2015-01-20 12:44:15.675939 0'0 2015-01-20 12:44:15.675939
3.62 182 0 182 0 763363328 588 588 active 2015-01-20 12:44:19.539713 38'588 38:932 [3,1] 3 [3,1] 3 0'0 2015-01-20 12:44:15.680806 0'0 2015-01-20 12:44:15.680806
3.63 170 0 170 0 713031680 340 340 active 2015-01-20 12:44:19.540329 38'340 38:512 [3,0] 3 [3,0] 3 0'0 2015-01-20 12:44:15.681099 0'0 2015-01-20 12:44:15.681099
3.18 190 0 190 0 796917760 589 589 active 2015-01-20 12:44:19.539550 38'589 38:852 [3,0] 3 [3,0] 3 0'0 2015-01-20 12:44:15.675345 0'0 2015-01-20 12:44:15.675345
3.1b 200 0 200 0 838860800 734 734 active 2015-01-20 12:44:19.539514 38'734 38:1882 [3,0] 3 [3,0] 3 0'0 2015-01-20 12:44:15.675738 0'0 2015-01-20 12:44:15.675738
3.14 185 0 185 0 775946240 393 393 active 2015-01-20 12:44:19.539492 38'393 38:965 [3,0] 3 [3,0] 3 0'0 2015-01-20 12:44:15.675138 0'0 2015-01-20 12:44:15.675138
3.10 187 0 187 0 780140560 606 606 active 2015-01-20 12:44:19.545741 38'606 38:925 [1,3] 1 [1,3] 1 0'0 2015-01-20 12:44:15.678035 0'0 2015-01-20 12:44:15.678035
3.11 186 0 186 0 780140544 301 301 active 2015-01-20 12:44:20.838550 38'301 38:686 [0,2] 0 [0,2] 0 0'0 2015-01-20 12:44:15.676908 0'0 2015-01-20 12:44:15.676908
3.12 187 0 187 0 784334848 601 601 active 2015-01-20 12:44:19.499264 38'601 38:1228 [2,0] 2 [2,0] 2 0'0 2015-01-20 12:44:15.675128 0'0 2015-01-20 12:44:15.675128
3.2b 218 0 218 0 914358272 536 536 active 2015-01-20 12:44:20.582636 38'536 38:1027 [0,2] 0 [0,2] 0 0'0 2015-01-20 12:44:15.677528 0'0 2015-01-20 12:44:15.677528
3.13 187 0 187 0 784334848 1217 1217 active 2015-01-20 12:44:19.545722 38'1217 38:1459 [1,3] 1 [1,3] 1 0'0 2015-01-20 12:44:15.678256 0'0 2015-01-20 12:44:15.678256
3.d 177 0 177 0 742391808 257 257 active 2015-01-20 12:44:19.545712 38'257 38:399 [1,3] 1 [1,3] 1 0'0 2015-01-20 12:44:15.677267 0'0 2015-01-20 12:44:15.677267
3.26 182 0 182 0 763363328 684 684 active 2015-01-20 12:44:20.582621 38'684 38:1118 [0,3] 0 [0,3] 0 0'0 2015-01-20 12:44:15.677425 0'0 2015-01-20 12:44:15.677425
3.e 184 0 184 0 771751936 567 567 active 2015-01-20 12:44:19.499224 38'567 38:814 [2,0] 2 [2,0] 2 0'0 2015-01-20 12:44:15.674915 0'0 2015-01-20 12:44:15.674915
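
The sixteen "degr" values in this listing add up to 2986, the same number of degraded objects reported in the cluster status, so the stuck PGs seem to account for all of the degradation. A small sketch for doing that sum on real "ceph pg dump_stuck" output (one PG per line) is below; the function name sum_degraded is hypothetical, and the assumption that "degr" is the 4th field is taken from the header shown here.

```shell
#!/bin/sh
# Sum the "degr" column (4th field) over the PG rows read from stdin.
# Rows are recognised by a pgid such as "3.e" in the first field, so the
# "ok" line and the column header are skipped.
sum_degraded() {
    awk '$1 ~ /^[0-9]+\.[0-9a-f]+$/ { pgs++; degr += $4 }
         END { printf "%d pgs, %d degraded\n", pgs, degr }'
}
```

It would be used as "ceph pg dump_stuck | sum_degraded".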

# ceph pg 3.e query
{ "state": "active",
  "snap_trimq": "[]",
  "epoch": 38,
  "up": [
        2,
        0],
  "acting": [
        2,
        0],
  "actingbackfill": [
        "0",
        "2"],
  "info": { "pgid": "3.e",
      "last_update": "38'568",
      "last_complete": "38'568",
      "log_tail": "0'0",
      "last_user_version": 568,
      "last_backfill": "MAX",
      "purged_snaps": "[]",
      "history": { "epoch_created": 35,
          "last_epoch_started": 36,
          "last_epoch_clean": 36,
          "last_epoch_split": 0,
          "same_up_since": 35,
          "same_interval_since": 35,
          "same_primary_since": 35,
          "last_scrub": "0'0",
          "last_scrub_stamp": "2015-01-20 12:44:15.674915",
          "last_deep_scrub": "0'0",
          "last_deep_scrub_stamp": "2015-01-20 12:44:15.674915",
          "last_clean_scrub_stamp": "0.000000"},
      "stats": { "version": "38'568",
          "reported_seq": "815",
          "reported_epoch": "38",
          "state": "active",
          "last_fresh": "2015-01-20 16:00:39.580463",
          "last_change": "2015-01-20 12:44:19.499224",
          "last_active": "2015-01-20 16:00:39.580463",
          "last_clean": "0.000000",
          "last_became_active": "0.000000",
          "last_unstale": "2015-01-20 16:00:39.580463",
          "mapping_epoch": 35,
          "log_start": "0'0",
          "ondisk_log_start": "0'0",
          "created": 35,
          "last_epoch_clean": 36,
          "parent": "0.0",
          "parent_split_bits": 0,
          "last_scrub": "0'0",
          "last_scrub_stamp": "2015-01-20 12:44:15.674915",
          "last_deep_scrub": "0'0",
          "last_deep_scrub_stamp": "2015-01-20 12:44:15.674915",
          "last_clean_scrub_stamp": "0.000000",
          "log_size": 568,
          "ondisk_log_size": 568,
          "stats_invalid": "0",
          "stat_sum": { "num_bytes": 771751936,
              "num_objects": 184,
              "num_object_clones": 0,
              "num_object_copies": 368,
              "num_objects_missing_on_primary": 0,
              "num_objects_degraded": 184,
              "num_objects_unfound": 0,
              "num_objects_dirty": 184,
              "num_whiteouts": 0,
              "num_read": 217,
              "num_read_kb": 6892,
              "num_write": 1136,
              "num_write_kb": 759572,
              "num_scrub_errors": 0,
              "num_shallow_scrub_errors": 0,
              "num_deep_scrub_errors": 0,
              "num_objects_recovered": 0,
              "num_bytes_recovered": 0,
              "num_keys_recovered": 0,
              "num_objects_omap": 0,
              "num_objects_hit_set_archive": 0},
          "stat_cat_sum": {},
          "up": [
                2,
                0],
          "acting": [
                2,
                0],
          "up_primary": 2,
          "acting_primary": 2},
      "empty": 0,
      "dne": 0,
      "incomplete": 0,
      "last_epoch_started": 36,
      "hit_set_history": { "current_last_update": "0'0",
          "current_last_stamp": "0.000000",
          "current_info": { "begin": "0.000000",
              "end": "0.000000",
              "version": "0'0"},
          "history": []}},
  "peer_info": [
        { "peer": "0",
          "pgid": "3.e",
          "last_update": "38'568",
          "last_complete": "38'568",
          "log_tail": "0'0",
          "last_user_version": 0,
          "last_backfill": "MAX",
          "purged_snaps": "[]",
          "history": { "epoch_created": 35,
              "last_epoch_started": 36,
              "last_epoch_clean": 36,
              "last_epoch_split": 0,
              "same_up_since": 0,
              "same_interval_since": 0,
              "same_primary_since": 0,
              "last_scrub": "0'0",
              "last_scrub_stamp": "2015-01-20 12:44:15.674915",
              "last_deep_scrub": "0'0",
              "last_deep_scrub_stamp": "2015-01-20 12:44:15.674915",
              "last_clean_scrub_stamp": "0.000000"},
          "stats": { "version": "0'0",
              "reported_seq": "0",
              "reported_epoch": "0",
              "state": "inactive",
              "last_fresh": "0.000000",
              "last_change": "0.000000",
              "last_active": "0.000000",
              "last_clean": "0.000000",
              "last_became_active": "0.000000",
              "last_unstale": "0.000000",
              "mapping_epoch": 0,
              "log_start": "0'0",
              "ondisk_log_start": "0'0",
              "created": 0,
              "last_epoch_clean": 0,
              "parent": "0.0",
              "parent_split_bits": 0,
              "last_scrub": "0'0",
              "last_scrub_stamp": "0.000000",
              "last_deep_scrub": "0'0",
              "last_deep_scrub_stamp": "0.000000",
              "last_clean_scrub_stamp": "0.000000",
              "log_size": 0,
              "ondisk_log_size": 0,
              "stats_invalid": "0",
              "stat_sum": { "num_bytes": 0,
                  "num_objects": 0,
                  "num_object_clones": 0,
                  "num_object_copies": 0,
                  "num_objects_missing_on_primary": 0,
                  "num_objects_degraded": 0,
                  "num_objects_unfound": 0,
                  "num_objects_dirty": 0,
                  "num_whiteouts": 0,
                  "num_read": 0,
                  "num_read_kb": 0,
                  "num_write": 0,
                  "num_write_kb": 0,
                  "num_scrub_errors": 0,
                  "num_shallow_scrub_errors": 0,
                  "num_deep_scrub_errors": 0,
                  "num_objects_recovered": 0,
                  "num_bytes_recovered": 0,
                  "num_keys_recovered": 0,
                  "num_objects_omap": 0,
                  "num_objects_hit_set_archive": 0},
              "stat_cat_sum": {},
              "up": [],
              "acting": [],
              "up_primary": -1,
              "acting_primary": -1},
          "empty": 0,
          "dne": 0,
          "incomplete": 0,
          "last_epoch_started": 36,
          "hit_set_history": { "current_last_update": "0'0",
              "current_last_stamp": "0.000000",
              "current_info": { "begin": "0.000000",
                  "end": "0.000000",
                  "version": "0'0"},
              "history": []}}],
  "recovery_state": [
        { "name": "Started\/Primary\/Active",
          "enter_time": "2015-01-20 12:44:17.543538",
          "might_have_unfound": [],
          "recovery_progress": { "backfill_targets": [],
              "waiting_on_backfill": [],
              "last_backfill_started": "0\/\/0\/\/-1",
              "backfill_info": { "begin": "0\/\/0\/\/-1",
                  "end": "0\/\/0\/\/-1",
                  "objects": []},
              "peer_backfill_info": [],
              "backfills_in_flight": [],
              "recovering": [],
              "pg_backend": { "pull_from_peer": [],
                  "pushing": []}},
          "scrub": { "scrubber.epoch_start": "0",
              "scrubber.active": 0,
              "scrubber.block_writes": 0,
              "scrubber.finalizing": 0,
              "scrubber.waiting_on": 0,
              "scrubber.waiting_on_whom": []}},
        { "name": "Started",
          "enter_time": "2015-01-20 12:44:15.675011"}],
  "agent_state": {}}

I have tried the following with no luck:
- ceph pg repair 3.e
- ceph pg scrub 3.e


Thanks a lot
Eneko



--
Zuzendari Teknikoa / Director Técnico
Binovo IT Human Project, S.L.
Telf. 943575997
      943493611
Astigarraga bidea 2, planta 6 dcha., ofi. 3-2; 20180 Oiartzun (Gipuzkoa)
www.binovo.es

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



