On 2018-12-17 20:16, Brad Hubbard wrote:
On Tue, Dec 18, 2018 at 10:23 AM Mike O'Connor <mike@xxxxxxxxxx> wrote:
Hi All
I have a Ceph cluster which has been working without issues for about 2
years now; it was upgraded about 6 months ago to 10.2.11.
root@blade3:/var/lib/ceph/mon# ceph status
2018-12-18 10:42:39.242217 7ff770471700 0 -- 10.1.5.203:0/1608630285 >> 10.1.5.207:6789/0 pipe(0x7ff768000c80 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7ff768001f90).fault
2018-12-18 10:42:45.242745 7ff770471700 0 -- 10.1.5.203:0/1608630285 >> 10.1.5.207:6789/0 pipe(0x7ff7680051e0 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7ff768002410).fault
2018-12-18 10:42:51.243230 7ff770471700 0 -- 10.1.5.203:0/1608630285 >> 10.1.5.207:6789/0 pipe(0x7ff7680051e0 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7ff768002f40).fault
2018-12-18 10:42:54.243452 7ff770572700 0 -- 10.1.5.203:0/1608630285 >> 10.1.5.205:6789/0 pipe(0x7ff768000c80 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7ff768008060).fault
2018-12-18 10:42:57.243715 7ff770471700 0 -- 10.1.5.203:0/1608630285 >> 10.1.5.207:6789/0 pipe(0x7ff7680051e0 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7ff768003580).fault
2018-12-18 10:43:03.244280 7ff7781b9700 0 -- 10.1.5.203:0/1608630285 >> 10.1.5.205:6789/0 pipe(0x7ff7680051e0 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7ff768003670).fault
All systems can ping each other. I simply cannot see why it's failing.
ceph.conf
[global]
auth client required = cephx
auth cluster required = cephx
auth service required = cephx
cluster network = 10.1.5.0/24
filestore xattr use omap = true
fsid = 42a0f015-76da-4f47-b506-da5cdacd030f
keyring = /etc/pve/priv/$cluster.$name.keyring
osd journal size = 5120
osd pool default min size = 1
public network = 10.1.5.0/24
mon_pg_warn_max_per_osd = 0
[client]
rbd cache = true
[osd]
keyring = /var/lib/ceph/osd/ceph-$id/keyring
osd max backfills = 1
osd recovery max active = 1
osd_disk_threads = 1
osd_disk_thread_ioprio_class = idle
osd_disk_thread_ioprio_priority = 7
[mon.2]
host = blade5
mon addr = 10.1.5.205:6789
[mon.1]
host = blade3
mon addr = 10.1.5.203:6789
[mon.3]
host = blade7
mon addr = 10.1.5.207:6789
[mon.0]
host = blade1
mon addr = 10.1.5.201:6789
[mds]
mds data = /var/lib/ceph/mds/mds.$id
keyring = /var/lib/ceph/mds/mds.$id/mds.$id.keyring
[mds.0]
host = blade1
[mds.1]
host = blade3
[mds.2]
host = blade5
[mds.3]
host = blade7
Any ideas? Do you need more information?
The system on which you are running the "ceph" client, blade3
(10.1.5.203), is trying to contact monitors on 10.1.5.207 (blade7) port
6789 and 10.1.5.205 (blade5) port 6789. You need to check that the
ceph-mon binary is running on blade7 and blade5, that they are
listening on port 6789, and that that port is accessible from blade3.
The simplest explanation is that the MONs are not running. The next
simplest is that there is a firewall interfering with blade3's ability
to connect to port 6789 on those machines. Check the above and see
what you find.
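
A minimal sketch of those checks (the systemd unit and log file names
below are assumptions based on the posted ceph.conf, where mon.2 =
blade5 and mon.3 = blade7; adjust to however your mons were deployed):

# On blade5 and blade7: is a ceph-mon process running, and is it listening?
ps aux | grep '[c]eph-mon'
ss -tlnp | grep 6789

# From blade3: is port 6789 on the mon hosts reachable at all?
nc -zv 10.1.5.205 6789
nc -zv 10.1.5.207 6789

# If a mon is down, start it and watch its log (unit/log names assumed):
systemctl start ceph-mon@2              # mon.2 = blade5 per ceph.conf
tail -f /var/log/ceph/ceph-mon.2.log

If ping works but nc fails, that points at a firewall or a daemon that
is not listening; if nc succeeds, the mon log should say why clients
still fault.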
Hi,
Following on from what Brad wrote, here is what could cause your MONs
to not be running...
Check kernel logs / dmesg... bad blocks? (Unlikely to knock out both MONs.)
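
For instance (generic checks, nothing Ceph-specific assumed):

# Look for I/O or disk errors on each mon host
dmesg | grep -iE 'i/o error|bad sector|read error'
grep -i error /var/log/kern.log     # path assumes a Debian-style layout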
Check disk space on /var/lib/ceph/mon/... Did it fill up? (Check both
blocks and inodes.)
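
For example:

df -h /var/lib/ceph/mon     # free blocks
df -i /var/lib/ceph/mon     # free inodes

Note that a mon shuts itself down when free space on its data partition
drops below a threshold (mon_data_avail_crit, 5% by default), so
"almost full" can be enough to take it out.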
You said it was running without issues... just to double check... were
ALL your PGs healthy (i.e. active+clean)? MONs will not trim their
logs if any PG is not healthy. Newer versions of Ceph do not grow
their logs as fast as the older versions did.
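
Once you have at least a partial quorum back, something like this would
confirm it (store.db is the default mon data layout):

ceph -s           # overall health and which mons are in quorum
ceph pg stat      # are all PGs active+clean?
du -sh /var/lib/ceph/mon/ceph-*/store.db    # has a mon store ballooned?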
Good luck!
Dyweni
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com