On 2018-12-17 20:16, Brad Hubbard wrote:
On Tue, Dec 18, 2018 at 10:23 AM Mike O'Connor <mike@xxxxxxxxxx> wrote:
Hi All
I have a Ceph cluster which has been working without issues for about 2
years now; it was upgraded about 6 months ago to 10.2.11.
root@blade3:/var/lib/ceph/mon# ceph status
2018-12-18 10:42:39.242217 7ff770471700 0 -- 10.1.5.203:0/1608630285 >> 10.1.5.207:6789/0 pipe(0x7ff768000c80 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7ff768001f90).fault
2018-12-18 10:42:45.242745 7ff770471700 0 -- 10.1.5.203:0/1608630285 >> 10.1.5.207:6789/0 pipe(0x7ff7680051e0 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7ff768002410).fault
2018-12-18 10:42:51.243230 7ff770471700 0 -- 10.1.5.203:0/1608630285 >> 10.1.5.207:6789/0 pipe(0x7ff7680051e0 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7ff768002f40).fault
2018-12-18 10:42:54.243452 7ff770572700 0 -- 10.1.5.203:0/1608630285 >> 10.1.5.205:6789/0 pipe(0x7ff768000c80 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7ff768008060).fault
2018-12-18 10:42:57.243715 7ff770471700 0 -- 10.1.5.203:0/1608630285 >> 10.1.5.207:6789/0 pipe(0x7ff7680051e0 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7ff768003580).fault
2018-12-18 10:43:03.244280 7ff7781b9700 0 -- 10.1.5.203:0/1608630285 >> 10.1.5.205:6789/0 pipe(0x7ff7680051e0 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7ff768003670).fault
All systems can ping each other. I simply cannot see why it's failing.
ceph.conf
[global]
auth client required = cephx
auth cluster required = cephx
auth service required = cephx
cluster network = 10.1.5.0/24
filestore xattr use omap = true
fsid = 42a0f015-76da-4f47-b506-da5cdacd030f
keyring = /etc/pve/priv/$cluster.$name.keyring
osd journal size = 5120
osd pool default min size = 1
public network = 10.1.5.0/24
mon_pg_warn_max_per_osd = 0
[client]
rbd cache = true
[osd]
keyring = /var/lib/ceph/osd/ceph-$id/keyring
osd max backfills = 1
osd recovery max active = 1
osd_disk_threads = 1
osd_disk_thread_ioprio_class = idle
osd_disk_thread_ioprio_priority = 7
[mon.2]
host = blade5
mon addr = 10.1.5.205:6789
[mon.1]
host = blade3
mon addr = 10.1.5.203:6789
[mon.3]
host = blade7
mon addr = 10.1.5.207:6789
[mon.0]
host = blade1
mon addr = 10.1.5.201:6789
[mds]
mds data = /var/lib/ceph/mds/mds.$id
keyring = /var/lib/ceph/mds/mds.$id/mds.$id.keyring
[mds.0]
host = blade1
[mds.1]
host = blade3
[mds.2]
host = blade5
[mds.3]
host = blade7
Any ideas? Do you need more information?
The system on which you are running the "ceph" client, blade3
(10.1.5.203), is trying to contact monitors on 10.1.5.207 (blade7) port
6789 and 10.1.5.205 (blade5) port 6789. You need to check that the
ceph-mon binary is running on blade7 and blade5, that they are
listening on port 6789, and that that port is accessible from blade3.
The simplest explanation is that the MONs are not running. The next
simplest is that there is a firewall interfering with blade3's ability
to connect to port 6789 on those machines. Check the above and see
what you find.
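
A minimal sketch of those checks (the systemd unit and log file names
below are assumptions based on the posted ceph.conf, where mon.2 =
blade5 and mon.3 = blade7; adjust to however your mons were deployed):

# On blade5 and blade7: is a ceph-mon process running, and is it listening?
ps aux | grep '[c]eph-mon'
ss -tlnp | grep 6789

# From blade3: is port 6789 on the mon hosts reachable at all?
nc -zv 10.1.5.205 6789
nc -zv 10.1.5.207 6789

# If a mon is down, start it and watch its log (unit/log names assumed):
systemctl start ceph-mon@2              # mon.2 = blade5 per ceph.conf
tail -f /var/log/ceph/ceph-mon.2.log

If ping works but nc fails, that points at a firewall or a daemon that
is not listening; if nc succeeds, the mon log should say why clients
still fault.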
Hi,
Following on from what Brad wrote, here is what could cause your MONs
to not be running...
Check kernel logs / dmesg... bad blocks? (Unlikely to knock out both MONs.)
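
For instance (generic checks, nothing Ceph-specific assumed):

# Look for I/O or disk errors on each mon host
dmesg | grep -iE 'i/o error|bad sector|read error'
grep -i error /var/log/kern.log     # path assumes a Debian-style layout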
Check disk space on /var/lib/ceph/mon/... Did it fill up? (Check both
blocks and inodes.)
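
For example:

df -h /var/lib/ceph/mon     # free blocks
df -i /var/lib/ceph/mon     # free inodes

Note that a mon shuts itself down when free space on its data partition
drops below a threshold (mon_data_avail_crit, 5% by default), so
"almost full" can be enough to take it out.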
You said it was running without issues... just to double check... were
ALL your PGs healthy (i.e. active+clean)? MONs will not trim their
logs if any PG is not healthy. Newer versions of Ceph do not grow
their logs as fast as the older versions did.
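
Once you have at least a partial quorum back, something like this would
confirm it (store.db is the default mon data layout):

ceph -s           # overall health and which mons are in quorum
ceph pg stat      # are all PGs active+clean?
du -sh /var/lib/ceph/mon/ceph-*/store.db    # has a mon store ballooned?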
Good luck!
Dyweni
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com