Hello,

I just upgraded my cluster to version 10.1.2 and it worked well for a while, until I saw that ceph-disk@dev-sdc1.service was in a failed state in systemctl and I reran it. From there the OSDs stopped working. This is Ubuntu 16.04.

I connected to IRC looking for help, where people pointed me to one place or another, but none of the investigations helped to resolve it.

My configuration is rather simple:
root@red-compute:~# ceph osd tree
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 1.00000 root default
-4 1.00000 rack rack-1
-2 1.00000 host blue-compute
0 1.00000 osd.0 down 0 1.00000
2 1.00000 osd.2 down 0 1.00000
-3 1.00000 host red-compute
1 1.00000 osd.1 down 0 1.00000
3 0.50000 osd.3 up 1.00000 1.00000
4 1.00000 osd.4 down 0 1.00000

This is what I have got so far:
- Once upgraded, I discovered that the daemons now run as the ceph user. I ran chown on the ceph directories and it worked.
- The firewall is fully disabled. I checked connectivity with nc and nmap.
- The configuration seems to be right. It is included at the end of this message.
- Enabling logging on the OSDs shows that, for example, osd.1 is reconnecting all the time:
  2016-05-10 14:35:48.199573 7f53e8f1a700 1 -- 0.0.0.0:6806/13962 >> :/0 pipe(0x556f99413400 sd=84 :6806 s=0 pgs=0 cs=0 l=0 c=0x556f993b3a80).accept sd=84 172.16.0.119:35388/0
  2016-05-10 14:35:48.199966 7f53e8f1a700 2 -- 0.0.0.0:6806/13962 >> :/0 pipe(0x556f99413400 sd=84 :6806 s=4 pgs=0 cs=0 l=0 c=0x556f993b3a80).fault (0) Success
  2016-05-10 14:35:48.200018 7f53fb941700 1 osd.1 2468 ms_handle_reset con 0x556f993b3a80 session 0
- osd.3 is OK because it was never marked out, due to a Ceph restriction.
- I restarted all services at once so that all OSDs would be available at the same time and would not be marked down. It didn't work.
- I forced them in from the command line: ceph osd in 1-5. They appear as in for a while, then go out again.
- We tried ceph-disk activate-all to bring everything up. It didn't work.

The strange thing is that the cluster worked fine right after the upgrade, but the systemctl command broke both servers.
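
In case it helps anyone double-check the ownership fix from the first item above, this is a minimal sketch of how leftover files with the wrong owner could be listed. The function name not_owned_by is made up for illustration, and /var/lib/ceph is only the usual default data directory; I have not confirmed that every path on these hosts lives there.

```shell
#!/bin/sh
# List every path under a directory that is NOT owned by the given user.
# Prints nothing when ownership is already consistent.
# not_owned_by is a hypothetical helper, not part of any Ceph tool.
not_owned_by() {
    owner="$1"   # user name or numeric uid
    dir="$2"     # directory to scan recursively
    find "$dir" ! -user "$owner" -print
}

# Assumed usage on an OSD host, run as root:
#   not_owned_by ceph /var/lib/ceph
```

An empty result would suggest the chown covered everything under that directory; any path printed would be a candidate for a second chown.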
root@blue-compute:~# ceph -w
cluster 9028f4da-0d77-462b-be9b-dbdf7fa57771
health HEALTH_ERR
694 pgs are stuck inactive for more than 300 seconds
694 pgs stale
694 pgs stuck stale
too many PGs per OSD (1528 > max 300)
mds cluster is degraded
crush map has straw_calc_version=0
monmap e10: 2 mons at {blue-compute=172.16.0.119:6789/0,red-compute=172.16.0.100:6789/0}
election epoch 3600, quorum 0,1 red-compute,blue-compute
fsmap e673: 1/1/1 up {0:0=blue-compute=up:replay}
osdmap e2495: 5 osds: 1 up, 1 in; 5 remapped pgs
pgmap v40765481: 764 pgs, 6 pools, 410 GB data, 103 kobjects
87641 MB used, 212 GB / 297 GB avail
694 stale+active+clean
70 active+clean

2016-05-10 17:03:55.822440 mon.0 [INF] HEALTH_ERR; 694 pgs are stuck inactive for more than 300 seconds; 694 pgs stale; 694 pgs stuck stale; too many PGs per OSD (1528 > max 300); mds cluster is degraded; crush map has straw_calc_version=
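
If I read the "too many PGs per OSD" warning right, the 1528 figure looks like a side effect of only one OSD being in rather than a separate problem: 764 PGs times 2 replicas divided by 1 OSD gives 1528. A rough sketch of that arithmetic, assuming every pool uses a replica count of 2 (the osd_pool_default_size in my ceph.conf; I have not checked each pool individually). The helper name pgs_per_osd is made up for illustration.

```shell
#!/bin/sh
# PG replicas per OSD = total PGs * replica count / number of OSDs that are "in".
# Integer division, so the result is rounded down.
# pgs_per_osd is a hypothetical helper, not a Ceph command.
pgs_per_osd() {
    total_pgs="$1"
    replicas="$2"
    osds_in="$3"
    echo $(( total_pgs * replicas / osds_in ))
}

pgs_per_osd 764 2 1   # current state: only one OSD is in
pgs_per_osd 764 2 5   # what it would be if all five OSDs were in
```

So that warning should shrink a lot once the other OSDs come back, although it would still sit close to the 300 limit.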
cat /etc/ceph/ceph.conf
[global]
fsid = 9028f4da-0d77-462b-be9b-dbdf7fa57771
mon_initial_members = blue-compute, red-compute
mon_host = 172.16.0.119, 172.16.0.100
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
public_network = 172.16.0.0/24
osd_pool_default_pg_num = 100
osd_pool_default_pgp_num = 100
osd_pool_default_size = 2 # Write an object 3 times.
osd_pool_default_min_size = 1 # Allow writing one copy in a degraded state.
## Required upgrade
osd max object name len = 256
osd max object namespace len = 64
[mon.]
debug mon = 9
caps mon = "allow *"
Any help on this? Any clue of what's going wrong?
Best regards,