root@red-compute:~# ceph osd tree
ID WEIGHT  TYPE NAME                     UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 1.00000 root default
-4 1.00000     rack rack-1
-2 1.00000         host blue-compute
 0 1.00000             osd.0                down        0          1.00000
 2 1.00000             osd.2                down        0          1.00000
-3 1.00000         host red-compute
 1 1.00000             osd.1                down        0          1.00000
 3 0.50000             osd.3                  up  1.00000          1.00000
 4 1.00000             osd.4                down        0          1.00000
- Once upgraded, I discovered that the daemons now run as the ceph user. I just ran chown on the ceph directories (roughly the commands sketched after this list) and it worked.
- The firewall is fully disabled. I checked connectivity with nc and nmap (example checks after this list).
- The configuration seems to be right. I can post it if you want.
- Enabling logging on the OSDs shows that, for example, osd.1 is reconnecting all the time:
2016-05-10 14:35:48.199573 7f53e8f1a700 1 -- 0.0.0.0:6806/13962 >> :/0 pipe(0x556f99413400 sd=84 :6806 s=0 pgs=0 cs=0 l=0 c=0x556f993b3a80).accept sd=84 172.16.0.119:35388/0
2016-05-10 14:35:48.199966 7f53e8f1a700 2 -- 0.0.0.0:6806/13962 >> :/0 pipe(0x556f99413400 sd=84 :6806 s=4 pgs=0 cs=0 l=0 c=0x556f993b3a80).fault (0) Success
2016-05-10 14:35:48.200018 7f53fb941700 1 osd.1 2468 ms_handle_reset con 0x556f993b3a80 session 0
- osd.3 is the only one that stays OK, because it was never marked out, due to a Ceph restriction.
- I restarted all services at once so that all OSDs would be available at the same time and not get marked down (essentially the restart shown after this list). It didn't work.
- I forced them in from the command line: ceph osd in 1-5. They appear as in for a while and then go out again.
- We tried ceph-disk activate-all to bring everything up. It didn't work.
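
For reference, the ownership fix mentioned above was roughly the following (default directories assumed):

chown -R ceph:ceph /var/lib/ceph
chown -R ceph:ceph /var/log/ceph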
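
The connectivity checks were along these lines (monitor port plus the default OSD port range, addresses as in the cluster above):

nc -zv 172.16.0.100 6789
nc -zv 172.16.0.119 6789
nmap -p 6800-7300 172.16.0.119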
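
The "restart everything at once" attempt was essentially this, run on both nodes at more or less the same time:

systemctl restart ceph.target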
The strange thing is that the cluster worked just fine right after the upgrade, but the systemctl restart broke both servers.
root@blue-compute:~# ceph -w
    cluster 9028f4da-0d77-462b-be9b-dbdf7fa57771
     health HEALTH_ERR
            694 pgs are stuck inactive for more than 300 seconds
            694 pgs stale
            694 pgs stuck stale
            too many PGs per OSD (1528 > max 300)
            mds cluster is degraded
            crush map has straw_calc_version=0
     monmap e10: 2 mons at {blue-compute=172.16.0.119:6789/0,red-compute=172.16.0.100:6789/0}
            election epoch 3600, quorum 0,1 red-compute,blue-compute
      fsmap e673: 1/1/1 up {0:0=blue-compute=up:replay}
     osdmap e2495: 5 osds: 1 up, 1 in; 5 remapped pgs
      pgmap v40765481: 764 pgs, 6 pools, 410 GB data, 103 kobjects
            87641 MB used, 212 GB / 297 GB avail
                 694 stale+active+clean
                  70 active+clean
2016-05-10 17:03:55.822440 mon.0 [INF] HEALTH_ERR; 694 pgs are stuck inactive for more than 300 seconds; 694 pgs stale; 694 pgs stuck stale; too many PGs per OSD (1528 > max 300); mds cluster is degraded; crush map has straw_calc_version=0
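
(As a side note on the "too many PGs per OSD" warning: with 764 PGs at size 2 and only one OSD still in, every copy maps to osd.3, i.e. roughly 764 * 2 / 1 = 1528, which seems to be where the 1528 in the message comes from.)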
cat /etc/ceph/ceph.conf
[global]
fsid = 9028f4da-0d77-462b-be9b-dbdf7fa57771
mon_initial_members = blue-compute, red-compute
mon_host = 172.16.0.119, 172.16.0.100
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
public_network = 172.16.0.0/24
osd_pool_default_pg_num = 100
osd_pool_default_pgp_num = 100
osd_pool_default_size = 2 # Write an object 2 times.
osd_pool_default_min_size = 1 # Allow writing one copy in a degraded state.
## Required upgrade
osd max object name len = 256
osd max object namespace len = 64
[mon.]
debug mon = 9
caps mon = "allow *"
Any help on this? Any clue as to what's going wrong?
Best regards,