Hi cephers.
I think this is solved.
The issue was caused by puppet and the new interface naming in CentOS 7.
In our puppet configs we defined an iptables module which restricts
access to the private ceph network based on the source and on the
destination interface. We had eth1 hardwired, and on this new server
the interface comes up as p2p1.
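One way to avoid hardwiring the interface name would be to look up which interface actually carries an address on the cluster network and feed that name into the firewall rule. What follows is only a minimal sketch of that idea, not what our puppet module does. It assumes the one-line-per-address text printed by `ip -o -4 addr show`; the function name `find_cluster_iface`, the public interface name `em1` and the 192.0.2.171 address in the sample are hypothetical.

```python
import ipaddress


def find_cluster_iface(ip_output, cluster_net="10.100.1.0/24"):
    """Return the first interface with an IPv4 address inside cluster_net.

    ip_output is expected to be the text printed by `ip -o -4 addr show`,
    one address per line, e.g.
    "3: p2p1    inet 10.100.1.171/24 brd 10.100.1.255 scope global p2p1".
    Returns None when no interface matches.
    """
    net = ipaddress.ip_network(cluster_net)
    for line in ip_output.splitlines():
        fields = line.split()
        if "inet" not in fields:
            continue  # skip blank or unexpected lines
        name = fields[1]
        # The token after "inet" is "address/prefixlen".
        addr = fields[fields.index("inet") + 1].split("/")[0]
        if ipaddress.ip_address(addr) in net:
            return name
    return None


# Hypothetical sample, shaped like `ip -o -4 addr show` output.
sample = "\n".join([
    "1: lo    inet 127.0.0.1/8 scope host lo",
    "2: em1    inet 192.0.2.171/24 brd 192.0.2.255 scope global em1",
    "3: p2p1    inet 10.100.1.171/24 brd 10.100.1.255 scope global p2p1",
])
print(find_cluster_iface(sample))
```

With a lookup like this, the rule would follow the address rather than the name, so a rename from eth1 to p2p1 would not silently cut the node off from the cluster network.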
I understood there was a communication issue because the osds shut down
by themselves; they do not crash. Inspecting the logs of the osds
themselves, I saw a message saying that the osd had been marked down. Then
I looked at dmesg, which was full of iptables messages about hosts on the
10.100.1.0/24 network being blocked.
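For anyone wanting to run the same check, here is a rough sketch of how the dmesg output could be summarised. It assumes the firewall logs dropped packets through the iptables LOG target, whose kernel messages carry `SRC=` and `DST=` fields. The function name `count_blocked_sources` and the sample lines (including the "DROP:" prefix, which depends on the local `--log-prefix` setting) are hypothetical and are not copied from our servers.

```python
import ipaddress
import re
from collections import Counter

# The iptables LOG target writes the source address as "SRC=a.b.c.d".
SRC_RE = re.compile(r"\bSRC=(\d{1,3}(?:\.\d{1,3}){3})\b")


def count_blocked_sources(dmesg_text, cluster_net="10.100.1.0/24"):
    """Count logged packets per source address inside cluster_net."""
    net = ipaddress.ip_network(cluster_net)
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = SRC_RE.search(line)
        if not match:
            continue
        try:
            addr = ipaddress.ip_address(match.group(1))
        except ValueError:
            continue  # looked like an address but is not a valid one
        if addr in net:
            counts[str(addr)] += 1
    return counts


# Hypothetical dmesg lines, for illustration only.
sample_dmesg = "\n".join([
    "[1001.1] DROP: IN=p2p1 OUT= SRC=10.100.1.12 DST=10.100.1.171 PROTO=TCP DPT=6802",
    "[1001.2] DROP: IN=p2p1 OUT= SRC=10.100.1.12 DST=10.100.1.171 PROTO=TCP DPT=6809",
    "[1001.3] DROP: IN=p2p1 OUT= SRC=10.100.1.15 DST=10.100.1.171 PROTO=TCP DPT=6802",
    "[1001.4] DROP: IN=em1 OUT= SRC=192.0.2.9 DST=192.0.2.171 PROTO=TCP DPT=22",
])
for src, hits in sorted(count_blocked_sources(sample_dmesg).items()):
    print(src, hits)
```

Many distinct sources from the cluster network showing up here would point at the firewall rather than at the osds themselves.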
I think I had the firewall disabled when I bootstrapped the osds on the
machines, and that might explain why there was some transfer of data.
Sorry for the entropy.
Cheers
G.
On 07/27/2016 08:44 AM, Goncalo Borges wrote:
Hi cephers...
Our production cluster is running Jewel 10.2.2.
We were running a production cluster with 8 servers, each with 8 osds, making a grand total of 64 osds. Each server also hosts 2 ssds for journals. Each ssd supports 4 journals.
We had 1/3 of our osds above 80% occupied, and we decided that we had to reweight and then add more osds.
We have added a new node with 16 osds. The setup is similar to the other servers: 4 ssds (instead of 2) for journals containing 4 partitions each. All osds are of the same size as in our previous setup. All servers have a public interface and private one for data migration at 10 GE.
I've installed the new server, and after restarting all osds, the migration of data started. It has been going on during the night, but now I see that the osds in that server stop. If I restart them, they work for a while and then stop again. I've tried to have 8 running, with the same behaviour. I then tried to have 4 running (each on separate journals), again with the same behaviour. Currently I am only running two, but I am unsure how long it will last.
The tail of the log of one of the osds before it shuts down follows (X.X.X represents the public ip prefix of the infrastructure nodes). Let me know if you need more:
My config is the following:
# cat /etc/ceph/ceph.conf
[global]
auth_service_required = cephx
auth_client_required = cephx
auth_cluster_required = cephx
mon_host = X.X.X.8,X.X.X.34,X.X.X.26
mon_initial_members = rccephmon1, rccephmon2, rccephmon3
fsid = a9431bc6-3ee1-4b0a-8d21-0ad883a4d2ed
public network = X.X.X.0/24
cluster network = 10.100.1.0/24
filestore xattr use omap = true
filestore journal writeahead = true
osd journal size = 20000
osd pool default size = 3
osd pool default min size = 2
osd pool default pg num = 4096
osd pool default pgp num = 4096
osd crush chooseleaf type = 1
osd map cache size = 100
osd max write size = 512
osd max backfills = 1
osd recovery max active = 5
osd mount options xfs = "rw,largeio,inode64,swalloc,logbufs=8,logbsize=256k,attr2,noatime,nodiratime,noquota"
# osd backfill full ratio = 0.85
osd backfill full ratio = 0.92
mds beacon grace = 15
mds session timeout = 60
mds reconnect timeout = 45
mds session autoclose = 300
mds cache size = 2000000
# mon osd full ratio = 0.95
# mon osd nearfull ratio = 0.85
mon osd nearfull ratio = 0.90
# debug client = 20
# debug objectcacher = 20
[mds.rccephmds]
host = rccephmds
mds standby replay = true
[mds.rccephmds2]
host = rccephmds2
mds standby_for_rank = rccephmds
mds standby replay = true
Help in trying to recover would be much appreciated.
Cheers
Goncalo
2016-07-27 08:08:55.530271 7f7a75a1b700 30 osd.68 pg_epoch: 4869 pg[6.242( v 4707'275156 (1455'272156,4707'275156] lb MIN (bitwise) local-les=4735 n=0 ec=341 les/c/f 4855/4841/0 4868/4868/2145) [29,1]/[29,1,35] r=-1 lpr=4868 pi=2168-4867/246 crt=3153'275118 lcod 0'0 remapped NOTIFY] lock
2016-07-27 08:08:55.531819 7f7a75a1b700 20 osd.68 4869 kicking pg 6.330
2016-07-27 08:08:55.531824 7f7a75a1b700 30 osd.68 pg_epoch: 4869 pg[6.330( v 3159'219778 (1410'216775,3159'219778] lb 6:0cc695f6:::1000023bb3d.0000001d:head (bitwise) local-les=3075 n=1376 ec=341 les/c/f 4822/4817/0 4868/4868/2145) [24,63]/[24,63,9] r=-1 lpr=4868 pi=2161-4867/240 crt=3159'219778 lcod 0'0 remapped NOTIFY] lock
2016-07-27 08:08:55.532831 7f7a75a1b700 20 osd.68 4869 kicking pg 5.ac
2016-07-27 08:08:55.532837 7f7a75a1b700 30 osd.68 pg_epoch: 4869 pg[5.ac( v 2109'70695 (1963'67695,2109'70695] local-les=4823 n=232 ec=339 les/c/f 4823/4823/0 4868/4868/2168) [7,40] r=-1 lpr=4868 pi=2168-4867/250 crt=2109'70695 lcod 0'0 inactive NOTIFY] lock
2016-07-27 08:08:55.533882 7f7a75a1b700 20 osd.68 4869 kicking pg 5.175
2016-07-27 08:08:55.533888 7f7a75a1b700 30 osd.68 pg_epoch: 4869 pg[5.175( v 2109'51022 (1963'48022,2109'51022] local-les=4848 n=228 ec=339 les/c/f 4848/4848/0 4868/4868/4868) [41,54] r=-1 lpr=4868 pi=4847-4867/3 crt=2109'51022 lcod 0'0 inactive NOTIFY] lock
2016-07-27 08:08:55.534846 7f7a75a1b700 20 osd.68 4869 kicking pg 4.34b
2016-07-27 08:08:55.534852 7f7a75a1b700 30 osd.68 pg_epoch: 4869 pg[4.34b( empty local-les=4855 n=0 ec=337 les/c/f 4855/4856/0 4868/4868/2131) [42,49] r=-1 lpr=4868 pi=2153-4867/287 crt=0'0 inactive NOTIFY] lock
2016-07-27 08:08:55.534998 7f7a75a1b700 20 osd.68 4869 kicking pg 6.9e
2016-07-27 08:08:55.535003 7f7a75a1b700 30 osd.68 pg_epoch: 4869 pg[6.9e( v 3153'194378 (1005'191355,3153'194378] lb 6:791ed959:::1000023fa34.00000001:head (bitwise) local-les=2874 n=6300 ec=341 les/c/f 4847/4841/0 4868/4868/2116) [61,17]/[61,17,45] r=-1 lpr=4868 pi=2154-4867/240 crt=3153'194378 lcod 0'0 remapped NOTIFY] lock
2016-07-27 08:08:55.536160 7f7a75a1b700 1 -- X.X.X.171:6802/19137 mark_down 0x7f7abd7db600 -- 0x7f7abd394000
2016-07-27 08:08:55.536175 7f7a75a1b700 10 -- X.X.X.171:6802/19137 >> X.X.X.26:6789/0 pipe(0x7f7abd394000 sd=158 :38770 s=2 pgs=47454 cs=1 l=1 c=0x7f7abd7db600).unregister_pipe
2016-07-27 08:08:55.536181 7f7a75a1b700 10 -- X.X.X.171:6802/19137 >> X.X.X.26:6789/0 pipe(0x7f7abd394000 sd=158 :38770 s=2 pgs=47454 cs=1 l=1 c=0x7f7abd7db600).stop
2016-07-27 08:08:55.536246 7f7a67798700 20 -- X.X.X.171:6802/19137 >> X.X.X.26:6789/0 pipe(0x7f7abd394000 sd=158 :38770 s=4 pgs=47454 cs=1 l=1 c=0x7f7abd7db600).writer finishing
2016-07-27 08:08:55.536306 7f7a67798700 10 -- X.X.X.171:6802/19137 >> X.X.X.26:6789/0 pipe(0x7f7abd394000 sd=158 :38770 s=4 pgs=47454 cs=1 l=1 c=0x7f7abd7db600).writer done
2016-07-27 08:08:55.536320 7f7a92fdd700 2 -- X.X.X.171:6802/19137 >> X.X.X.26:6789/0 pipe(0x7f7abd394000 sd=158 :38770 s=4 pgs=47454 cs=1 l=1 c=0x7f7abd7db600).reader couldn't read tag, (0) Success
2016-07-27 08:08:55.536377 7f7a92fdd700 2 -- X.X.X.171:6802/19137 >> X.X.X.26:6789/0 pipe(0x7f7abd394000 sd=158 :38770 s=4 pgs=47454 cs=1 l=1 c=0x7f7abd7db600).fault (0) Success
2016-07-27 08:08:55.536390 7f7a92fdd700 10 -- X.X.X.171:6802/19137 >> X.X.X.26:6789/0 pipe(0x7f7abd394000 sd=158 :38770 s=4 pgs=47454 cs=1 l=1 c=0x7f7abd7db600).fault already closed|closing
2016-07-27 08:08:55.536408 7f7a92fdd700 10 -- X.X.X.171:6802/19137 queue_reap 0x7f7abd394000
2016-07-27 08:08:55.536418 7f7a92fdd700 10 -- X.X.X.171:6802/19137 >> X.X.X.26:6789/0 pipe(0x7f7abd394000 sd=158 :38770 s=4 pgs=47454 cs=1 l=1 c=0x7f7abd7db600).reader done
2016-07-27 08:08:55.536521 7f7a982f1700 10 -- X.X.X.171:6802/19137 reaper
2016-07-27 08:08:55.536556 7f7a982f1700 10 -- X.X.X.171:6802/19137 reaper reaping pipe 0x7f7abd394000 X.X.X.26:6789/0
2016-07-27 08:08:55.536572 7f7a982f1700 10 -- X.X.X.171:6802/19137 >> X.X.X.26:6789/0 pipe(0x7f7abd394000 sd=158 :38770 s=4 pgs=47454 cs=1 l=1 c=0x7f7abd7db600).discard_queue
2016-07-27 08:08:55.536601 7f7a982f1700 10 -- X.X.X.171:6802/19137 >> X.X.X.26:6789/0 pipe(0x7f7abd394000 sd=158 :38770 s=4 pgs=47454 cs=1 l=1 c=0x7f7abd7db600).unregister_pipe - not registered
2016-07-27 08:08:55.536616 7f7a982f1700 20 -- X.X.X.171:6802/19137 >> X.X.X.26:6789/0 pipe(0x7f7abd394000 sd=158 :38770 s=4 pgs=47454 cs=1 l=1 c=0x7f7abd7db600).join
2016-07-27 08:08:55.536677 7f7a982f1700 10 -- X.X.X.171:6802/19137 reaper reaped pipe 0x7f7abd394000 X.X.X.26:6789/0
2016-07-27 08:08:55.536706 7f7a982f1700 10 -- X.X.X.171:6802/19137 reaper deleted pipe 0x7f7abd394000
2016-07-27 08:08:55.536710 7f7a982f1700 10 -- X.X.X.171:6802/19137 reaper done
2016-07-27 08:08:55.538307 7f7a75a1b700 10 -- X.X.X.171:6802/19137 shutdown X.X.X.171:6802/19137
2016-07-27 08:08:55.538319 7f7a75a1b700 1 -- X.X.X.171:6802/19137 mark_down_all
2016-07-27 08:08:55.538327 7f7a75a1b700 10 -- 10.100.1.171:6802/8019137 shutdown 10.100.1.171:6802/8019137
2016-07-27 08:08:55.538330 7f7a75a1b700 1 -- 10.100.1.171:6802/8019137 mark_down_all
2016-07-27 08:08:55.538341 7f7a75a1b700 10 -- X.X.X.171:0/19137 shutdown X.X.X.171:0/19137
2016-07-27 08:08:55.538346 7f7a75a1b700 1 -- X.X.X.171:0/19137 mark_down_all
2016-07-27 08:08:55.538356 7f7a75a1b700 10 -- :/19137 shutdown :/19137
2016-07-27 08:08:55.538358 7f7a75a1b700 1 -- :/19137 mark_down_all
2016-07-27 08:08:55.538369 7f7a75a1b700 10 -- X.X.X.171:6807/8019137 shutdown X.X.X.171:6807/8019137
2016-07-27 08:08:55.538388 7f7a75a1b700 1 -- X.X.X.171:6807/8019137 mark_down_all
2016-07-27 08:08:55.538406 7f7a75a1b700 10 -- 10.100.1.171:6809/8019137 shutdown 10.100.1.171:6809/8019137
2016-07-27 08:08:55.538410 7f7a75a1b700 1 -- 10.100.1.171:6809/8019137 mark_down_all
2016-07-27 08:08:55.540063 7f7a9fd83800 10 -- X.X.X.171:6802/19137 wait: dispatch queue is stopped
2016-07-27 08:08:55.540109 7f7a9fd83800 20 -- X.X.X.171:6802/19137 wait: stopping accepter thread
2016-07-27 08:08:55.540116 7f7a9fd83800 10 accepter.stop accepter
2016-07-27 08:08:55.540211 7f7a88240700 20 accepter.accepter poll got 1
2016-07-27 08:08:55.540241 7f7a88240700 20 accepter.accepter closing
2016-07-27 08:08:55.540273 7f7a88240700 10 accepter.accepter stopping
2016-07-27 08:08:55.540407 7f7a9fd83800 20 -- X.X.X.171:6802/19137 wait: stopped accepter thread
2016-07-27 08:08:55.540439 7f7a9fd83800 20 -- X.X.X.171:6802/19137 wait: stopping reaper thread
2016-07-27 08:08:55.540517 7f7a982f1700 10 -- X.X.X.171:6802/19137 reaper_entry done
2016-07-27 08:08:55.540693 7f7a9fd83800 20 -- X.X.X.171:6802/19137 wait: stopped reaper thread
2016-07-27 08:08:55.540725 7f7a9fd83800 10 -- X.X.X.171:6802/19137 wait: closing pipes
2016-07-27 08:08:55.540731 7f7a9fd83800 10 -- X.X.X.171:6802/19137 reaper
2016-07-27 08:08:55.540743 7f7a9fd83800 10 -- X.X.X.171:6802/19137 reaper done
2016-07-27 08:08:55.540756 7f7a9fd83800 10 -- X.X.X.171:6802/19137 wait: waiting for pipes to close
2016-07-27 08:08:55.540761 7f7a9fd83800 10 -- X.X.X.171:6802/19137 wait: done.
2016-07-27 08:08:55.540768 7f7a9fd83800 1 -- X.X.X.171:6802/19137 shutdown complete.
2016-07-27 08:08:55.540776 7f7a9fd83800 10 -- X.X.X.171:0/19137 wait: waiting for dispatch queue
2016-07-27 08:08:55.540874 7f7a9fd83800 10 -- X.X.X.171:0/19137 wait: dispatch queue is stopped
2016-07-27 08:08:55.540891 7f7a9fd83800 20 -- X.X.X.171:0/19137 wait: stopping reaper thread
2016-07-27 08:08:55.540956 7f7a97af0700 10 -- X.X.X.171:0/19137 reaper_entry done
2016-07-27 08:08:55.541236 7f7a9fd83800 20 -- X.X.X.171:0/19137 wait: stopped reaper thread
2016-07-27 08:08:55.541266 7f7a9fd83800 10 -- X.X.X.171:0/19137 wait: closing pipes
2016-07-27 08:08:55.541273 7f7a9fd83800 10 -- X.X.X.171:0/19137 reaper
2016-07-27 08:08:55.541287 7f7a9fd83800 10 -- X.X.X.171:0/19137 reaper done
2016-07-27 08:08:55.541297 7f7a9fd83800 10 -- X.X.X.171:0/19137 wait: waiting for pipes to close
2016-07-27 08:08:55.541305 7f7a9fd83800 10 -- X.X.X.171:0/19137 wait: done.
2016-07-27 08:08:55.541311 7f7a9fd83800 1 -- X.X.X.171:0/19137 shutdown complete.
2016-07-27 08:08:55.541319 7f7a9fd83800 10 -- X.X.X.171:6807/8019137 wait: waiting for dispatch queue
2016-07-27 08:08:55.541437 7f7a9fd83800 10 -- X.X.X.171:6807/8019137 wait: dispatch queue is stopped
2016-07-27 08:08:55.541453 7f7a9fd83800 20 -- X.X.X.171:6807/8019137 wait: stopping accepter thread
2016-07-27 08:08:55.541459 7f7a9fd83800 10 accepter.stop accepter
2016-07-27 08:08:55.541533 7f7a84238700 20 accepter.accepter poll got 1
2016-07-27 08:08:55.541551 7f7a84238700 20 accepter.accepter closing
2016-07-27 08:08:55.541570 7f7a84238700 10 accepter.accepter stopping
2016-07-27 08:08:55.541677 7f7a9fd83800 20 -- X.X.X.171:6807/8019137 wait: stopped accepter thread
2016-07-27 08:08:55.541710 7f7a9fd83800 20 -- X.X.X.171:6807/8019137 wait: stopping reaper thread
2016-07-27 08:08:55.541762 7f7a972ef700 10 -- X.X.X.171:6807/8019137 reaper_entry done
2016-07-27 08:08:55.541889 7f7a9fd83800 20 -- X.X.X.171:6807/8019137 wait: stopped reaper thread
2016-07-27 08:08:55.541918 7f7a9fd83800 10 -- X.X.X.171:6807/8019137 wait: closing pipes
2016-07-27 08:08:55.541930 7f7a9fd83800 10 -- X.X.X.171:6807/8019137 reaper
2016-07-27 08:08:55.541945 7f7a9fd83800 10 -- X.X.X.171:6807/8019137 reaper done
2016-07-27 08:08:55.541954 7f7a9fd83800 10 -- X.X.X.171:6807/8019137 wait: waiting for pipes to close
2016-07-27 08:08:55.541961 7f7a9fd83800 10 -- X.X.X.171:6807/8019137 wait: done.
2016-07-27 08:08:55.541965 7f7a9fd83800 1 -- X.X.X.171:6807/8019137 shutdown complete.
2016-07-27 08:08:55.541973 7f7a9fd83800 10 -- 10.100.1.171:6809/8019137 wait: waiting for dispatch queue
2016-07-27 08:08:55.542071 7f7a9fd83800 10 -- 10.100.1.171:6809/8019137 wait: dispatch queue is stopped
2016-07-27 08:08:55.542085 7f7a9fd83800 20 -- 10.100.1.171:6809/8019137 wait: stopping accepter thread
2016-07-27 08:08:55.542090 7f7a9fd83800 10 accepter.stop accepter
2016-07-27 08:08:55.542147 7f7a82a35700 20 accepter.accepter poll got 1
2016-07-27 08:08:55.542169 7f7a82a35700 20 accepter.accepter closing
2016-07-27 08:08:55.542190 7f7a82a35700 10 accepter.accepter stopping
2016-07-27 08:08:55.542264 7f7a9fd83800 20 -- 10.100.1.171:6809/8019137 wait: stopped accepter thread
2016-07-27 08:08:55.542277 7f7a9fd83800 20 -- 10.100.1.171:6809/8019137 wait: stopping reaper thread
2016-07-27 08:08:55.542344 7f7a96aee700 10 -- 10.100.1.171:6809/8019137 reaper_entry done
2016-07-27 08:08:55.542484 7f7a9fd83800 20 -- 10.100.1.171:6809/8019137 wait: stopped reaper thread
2016-07-27 08:08:55.542512 7f7a9fd83800 10 -- 10.100.1.171:6809/8019137 wait: closing pipes
2016-07-27 08:08:55.542519 7f7a9fd83800 10 -- 10.100.1.171:6809/8019137 reaper
2016-07-27 08:08:55.542524 7f7a9fd83800 10 -- 10.100.1.171:6809/8019137 reaper done
2016-07-27 08:08:55.542533 7f7a9fd83800 10 -- 10.100.1.171:6809/8019137 wait: waiting for pipes to close
2016-07-27 08:08:55.542540 7f7a9fd83800 10 -- 10.100.1.171:6809/8019137 wait: done.
2016-07-27 08:08:55.542548 7f7a9fd83800 1 -- 10.100.1.171:6809/8019137 shutdown complete.
2016-07-27 08:08:55.542554 7f7a9fd83800 10 -- 10.100.1.171:6802/8019137 wait: waiting for dispatch queue
2016-07-27 08:08:55.542790 7f7a9fd83800 10 -- 10.100.1.171:6802/8019137 wait: dispatch queue is stopped
2016-07-27 08:08:55.542815 7f7a9fd83800 20 -- 10.100.1.171:6802/8019137 wait: stopping accepter thread
2016-07-27 08:08:55.542823 7f7a9fd83800 10 accepter.stop accepter
2016-07-27 08:08:55.542889 7f7a86a3d700 20 accepter.accepter poll got 1
2016-07-27 08:08:55.542908 7f7a86a3d700 20 accepter.accepter closing
2016-07-27 08:08:55.542931 7f7a86a3d700 10 accepter.accepter stopping
2016-07-27 08:08:55.543004 7f7a9fd83800 20 -- 10.100.1.171:6802/8019137 wait: stopped accepter thread
2016-07-27 08:08:55.543036 7f7a9fd83800 20 -- 10.100.1.171:6802/8019137 wait: stopping reaper thread
2016-07-27 08:08:55.543103 7f7a962ed700 10 -- 10.100.1.171:6802/8019137 reaper_entry done
2016-07-27 08:08:55.543928 7f7a9fd83800 20 -- 10.100.1.171:6802/8019137 wait: stopped reaper thread
2016-07-27 08:08:55.543944 7f7a9fd83800 10 -- 10.100.1.171:6802/8019137 wait: closing pipes
2016-07-27 08:08:55.543949 7f7a9fd83800 10 -- 10.100.1.171:6802/8019137 reaper
2016-07-27 08:08:55.543954 7f7a9fd83800 10 -- 10.100.1.171:6802/8019137 reaper done
2016-07-27 08:08:55.543958 7f7a9fd83800 10 -- 10.100.1.171:6802/8019137 wait: waiting for pipes to close
2016-07-27 08:08:55.543963 7f7a9fd83800 10 -- 10.100.1.171:6802/8019137 wait: done.
2016-07-27 08:08:55.543975 7f7a9fd83800 1 -- 10.100.1.171:6802/8019137 shutdown complete.
2016-07-27 08:08:55.543980 7f7a9fd83800 10 -- :/19137 wait: waiting for dispatch queue
2016-07-27 08:08:55.544030 7f7a9fd83800 10 -- :/19137 wait: dispatch queue is stopped
2016-07-27 08:08:55.544036 7f7a9fd83800 20 -- :/19137 wait: stopping reaper thread
2016-07-27 08:08:55.544135 7f7a95aec700 10 -- :/19137 reaper_entry done
2016-07-27 08:08:55.544195 7f7a9fd83800 20 -- :/19137 wait: stopped reaper thread
2016-07-27 08:08:55.544204 7f7a9fd83800 10 -- :/19137 wait: closing pipes
2016-07-27 08:08:55.544207 7f7a9fd83800 10 -- :/19137 reaper
2016-07-27 08:08:55.544209 7f7a9fd83800 10 -- :/19137 reaper done
2016-07-27 08:08:55.544212 7f7a9fd83800 10 -- :/19137 wait: waiting for pipes to close
2016-07-27 08:08:55.544214 7f7a9fd83800 10 -- :/19137 wait: done.
2016-07-27 08:08:55.544216 7f7a9fd83800 1 -- :/19137 shutdown complete.
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
--
Goncalo Borges
Research Computing
ARC Centre of Excellence for Particle Physics at the Terascale
School of Physics A28 | University of Sydney, NSW 2006
T: +61 2 93511937