Re: OSDs wrongly marked down

Maged Mokhtar <mmokhtar@xxxxxxxxxxx> · Wed, 20 Dec 2017 17:30:06 +0200

It could also be that your hardware is underpowered for the I/O load you have. Try checking your resource load during peak workload, with recovery and scrubbing going on at the same time.
On 2017-12-20 17:03, David Turner wrote:

When I have OSDs wrongly marked down it's usually to do with the filestore_split_multiple and filestore_merge_threshold in a thing I call PG subfolder splitting.  This is no longer a factor with bluestore, but as you're running hammer, it's worth a look.  http://docs.ceph.com/docs/hammer/rados/configuration/filestore-config-ref/
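For a rough idea of when that splitting kicks in: the filestore config reference gives the per-subfolder file limit as filestore_split_multiple * abs(filestore_merge_threshold) * 16. A minimal sketch of that arithmetic follows; the function name is made up for illustration, and the defaults (2 and 10) are the documented ones as far as I recall, so check them against your own ceph.conf.

```python
def split_threshold(filestore_split_multiple: int = 2,
                    filestore_merge_threshold: int = 10) -> int:
    """Approximate number of files a PG subfolder holds before it is split.

    Formula taken from the filestore config reference:
    filestore_split_multiple * abs(filestore_merge_threshold) * 16
    The helper name and the default arguments are illustrative assumptions.
    """
    return filestore_split_multiple * abs(filestore_merge_threshold) * 16


# With the assumed defaults a subfolder splits at about 320 files.
print(split_threshold())
# Larger values postpone splitting, e.g. multiple=8, threshold=40.
print(split_threshold(8, 40))
```

Raising these values makes splits rarer, but each split then has more files to move, so treat any change as something to test rather than a guaranteed fix.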

On Wed, Dec 20, 2017 at 9:31 AM Garuti, Lorenzo <garuti.l@xxxxxxxxxx> wrote:

Hi Sergio, 

In my case it was a network problem: occasionally mon.{id} couldn't reach osd.{id}.
The messages "fault, initiating reconnect" and "failed lossy con" in your logs suggest a network problem.

See also:

http://docs.ceph.com/docs/giant/rados/troubleshooting/troubleshooting-osd/#flapping-osds
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/2/html/troubleshooting_guide/troubleshooting-osds#flapping-osds

Lorenzo

2017-12-20 15:13 GMT+01:00 Sergio Morales <smorales@xxxxxxxxx>:

Hi.

I'm having a problem with the OSDs in my cluster.

Randomly, some OSDs get wrongly marked down. I set "mon osd min down reporters" to OSD +1, but I still get this problem.

Any tips or ideas for troubleshooting? I'm using Ceph 0.94.5 on CentOS 7.

The logs show this:

2017-12-19 16:59:26.357707 7fa9177d3700  0 -- 172.17.4.2:6830/4775054 >> 172.17.4.3:6800/2009784 pipe(0x7fa8a0907000 sd=43 :45955 s=1 pgs=1089 cs=1 l=0 c=0x7fa8a0965f00).connect got RESETSESSION
2017-12-19 16:59:26.360240 7fa8e5652700  0 -- 172.17.4.2:6830/4775054 >> 172.17.4.1:6808/6007742 pipe(0x7fa9310e3000 sd=26 :53375 s=2 pgs=5272 cs=1 l=0 c=0x7fa931045680).fault, initiating reconnect

2017-12-19 16:59:25.716758 7fa8e74c1700  0 -- 172.17.4.2:6830/4775054 >> 172.17.4.1:6826/1007559 pipe(0x7fa907052000 sd=17 :45743 s=1 pgs=2105 cs=1 l=0 c=0x7fa8a051a180).connect got RESETSESSION
2017-12-19 16:59:25.716308 7fa9849ed700  0 -- 172.17.3.2:6802/3775054 submit_message osd_op_reply(392 rbd_data.129d2042eabc234.0000000000000605 [set-alloc-hint object_size 4194304 write_size 4194304,write 0~126976] v26497'18879046 uv18879046 _ondisk_ = 0) v6 remote, 172.17.1.3:0/5911141, failed lossy con, dropping message 0x7fa8830edb00
2017-12-19 16:59:25.718694 7fa9849ed700  0 -- 172.17.3.2:6802/3775054 submit_message osd_op_reply(10610054 rbd_data.6ccd3348ab9aac.000000000000011d [set-alloc-hint object_size 8388608 write_size 8388608,write 876544~4096] v26497'15075797 uv15075797 _ondisk_ = 0) v6 remote, 172.17.1.4:0/1028032, failed lossy con, dropping message 0x7fa87a911700
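One way to see whether messages like these cluster around particular peers is to tally them per remote address. The following is only a sketch: the regular expressions are guessed from the few log lines above, not taken from any Ceph tooling, and tally_faults is a made-up helper name.

```python
import re
from collections import Counter

# Pattern for lines of the form "-- <local> >> <peer> pipe(...).<event>".
# Derived from the sample lines above; an assumption, not an official format.
PIPE_RE = re.compile(r"-- (?P<local>\S+) >> (?P<peer>\S+) pipe\(.*?\)\.(?P<event>.+)$")
# Pattern for "remote, <peer>, failed lossy con" lines.
LOSSY_RE = re.compile(r"remote, (?P<peer>\S+), failed lossy con")


def tally_faults(lines):
    """Count connection-related events per (peer IP, event) pair."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        m = PIPE_RE.search(line)
        if m and ("fault" in m["event"] or "RESETSESSION" in m["event"]):
            peer_ip = m["peer"].split(":")[0]
            counts[(peer_ip, m["event"].strip())] += 1
            continue
        m = LOSSY_RE.search(line)
        if m:
            peer_ip = m["peer"].split(":")[0]
            counts[(peer_ip, "failed lossy con")] += 1
    return counts
```

Feeding it an OSD log, for example tally_faults(open(path)) with path pointing at whichever log file you are inspecting, gives a Counter; if most entries point at one host or one subnet, that would be consistent with a network problem on that link.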

-- 

Sergio A. Morales
Ingeniero de Sistemas
LINETS CHILE - 56 2 2412 5858

_______________________________________________
 ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-- 

Lorenzo Garuti
CED MaxMara
email: garuti.l@xxxxxxxxxx
tel: 0522 3993772 - 335 8416054
