Hi cephers...

Our production cluster is running Jewel 10.2.2.

We were running a production cluster with 8 servers, each with 8 osds, making a grand total of 64 osds. Each server also hosts 2 ssds for journals. Each ssd supports 4 journals.

We had 1/3 of our osds above 80% occupied, and we decided that we had to reweight and then add more osds.

We have added a new node with 16 osds. The setup is similar to the other servers: 4 ssds (instead of 2) for journals, containing 4 partitions each. All osds are of the same size as in our previous setup. All servers have a public interface and a private one for data migration at 10 GE.

I've installed the new server, and after restarting all osds, the migration of data started. It has been going on during the night, but now I see that the osds in that server stop. If I restart them, they work for a while and then stop again. I've tried to have 8 running, with the same behaviour. I then tried to have 4 running (each on separate journals), with the same behaviour. Currently I am only running two, but I am unsure how long they will last.

The tail of the log of one of the osds before it shuts down follows (X.X.X represents the public ip prefix of the infrastructure nodes).
Let me know if you need anything further.

My config is the following:

# cat /etc/ceph/ceph.conf
[global]
auth_service_required = cephx
auth_client_required = cephx
auth_cluster_required = cephx
mon_host = X.X.X.8,X.X.X.34,X.X.X.26
mon_initial_members = rccephmon1, rccephmon2, rccephmon3
fsid = a9431bc6-3ee1-4b0a-8d21-0ad883a4d2ed
public network = X.X.X.0/24
cluster network = 10.100.1.0/24
filestore xattr use omap = true
filestore journal writeahead = true
osd journal size = 20000
osd pool default size = 3
osd pool default min size = 2
osd pool default pg num = 4096
osd pool default pgp num = 4096
osd crush chooseleaf type = 1
osd map cache size = 100
osd max write size = 512
osd max backfills = 1
osd recovery max active = 5
osd mount options xfs = "rw,largeio,inode64,swalloc,logbufs=8,logbsize=256k,attr2,noatime,nodiratime,noquota"
# osd backfill full ratio = 0.85
osd backfill full ratio = 0.92
mds beacon grace = 15
mds session timeout = 60
mds reconnect timeout = 45
mds session autoclose = 300
mds cache size = 2000000
# mon osd full ratio = 0.95
# mon osd nearfull ratio = 0.85
mon osd nearfull ratio = 0.90
# debug client = 20
# debug objectcacher = 20

[mds.rccephmds]
host = rccephmds
mds standby replay = true

[mds.rccephmds2]
host = rccephmds2
mds standby_for_rank = rccephmds
mds standby replay = true

Help in trying to recover would be much appreciated.
Cheers
Goncalo

2016-07-27 08:08:55.530271 7f7a75a1b700 30 osd.68 pg_epoch: 4869 pg[6.242( v 4707'275156 (1455'272156,4707'275156] lb MIN (bitwise) local-les=4735 n=0 ec=341 les/c/f 4855/4841/0 4868/4868/2145) [29,1]/[29,1,35] r=-1 lpr=4868 pi=2168-4867/246 crt=3153'275118 lcod 0'0 remapped NOTIFY] lock
2016-07-27 08:08:55.531819 7f7a75a1b700 20 osd.68 4869 kicking pg 6.330
2016-07-27 08:08:55.531824 7f7a75a1b700 30 osd.68 pg_epoch: 4869 pg[6.330( v 3159'219778 (1410'216775,3159'219778] lb 6:0cc695f6:::1000023bb3d.0000001d:head (bitwise) local-les=3075 n=1376 ec=341 les/c/f 4822/4817/0 4868/4868/2145) [24,63]/[24,63,9] r=-1 lpr=4868 pi=2161-4867/240 crt=3159'219778 lcod 0'0 remapped NOTIFY] lock
2016-07-27 08:08:55.532831 7f7a75a1b700 20 osd.68 4869 kicking pg 5.ac
2016-07-27 08:08:55.532837 7f7a75a1b700 30 osd.68 pg_epoch: 4869 pg[5.ac( v 2109'70695 (1963'67695,2109'70695] local-les=4823 n=232 ec=339 les/c/f 4823/4823/0 4868/4868/2168) [7,40] r=-1 lpr=4868 pi=2168-4867/250 crt=2109'70695 lcod 0'0 inactive NOTIFY] lock
2016-07-27 08:08:55.533882 7f7a75a1b700 20 osd.68 4869 kicking pg 5.175
2016-07-27 08:08:55.533888 7f7a75a1b700 30 osd.68 pg_epoch: 4869 pg[5.175( v 2109'51022 (1963'48022,2109'51022] local-les=4848 n=228 ec=339 les/c/f 4848/4848/0 4868/4868/4868) [41,54] r=-1 lpr=4868 pi=4847-4867/3 crt=2109'51022 lcod 0'0 inactive NOTIFY] lock
2016-07-27 08:08:55.534846 7f7a75a1b700 20 osd.68 4869 kicking pg 4.34b
2016-07-27 08:08:55.534852 7f7a75a1b700 30 osd.68 pg_epoch: 4869 pg[4.34b( empty local-les=4855 n=0 ec=337 les/c/f 4855/4856/0 4868/4868/2131) [42,49] r=-1 lpr=4868 pi=2153-4867/287 crt=0'0 inactive NOTIFY] lock
2016-07-27 08:08:55.534998 7f7a75a1b700 20 osd.68 4869 kicking pg 6.9e
2016-07-27 08:08:55.535003 7f7a75a1b700 30 osd.68 pg_epoch: 4869 pg[6.9e( v 3153'194378 (1005'191355,3153'194378] lb 6:791ed959:::1000023fa34.00000001:head (bitwise) local-les=2874 n=6300 ec=341 les/c/f 4847/4841/0 4868/4868/2116) [61,17]/[61,17,45] r=-1 lpr=4868 pi=2154-4867/240 crt=3153'194378 lcod 0'0 remapped NOTIFY] lock
2016-07-27 08:08:55.536160 7f7a75a1b700 1 -- X.X.X.171:6802/19137 mark_down 0x7f7abd7db600 -- 0x7f7abd394000
2016-07-27 08:08:55.536175 7f7a75a1b700 10 -- X.X.X.171:6802/19137 >> X.X.X.26:6789/0 pipe(0x7f7abd394000 sd=158 :38770 s=2 pgs=47454 cs=1 l=1 c=0x7f7abd7db600).unregister_pipe
2016-07-27 08:08:55.536181 7f7a75a1b700 10 -- X.X.X.171:6802/19137 >> X.X.X.26:6789/0 pipe(0x7f7abd394000 sd=158 :38770 s=2 pgs=47454 cs=1 l=1 c=0x7f7abd7db600).stop
2016-07-27 08:08:55.536246 7f7a67798700 20 -- X.X.X.171:6802/19137 >> X.X.X.26:6789/0 pipe(0x7f7abd394000 sd=158 :38770 s=4 pgs=47454 cs=1 l=1 c=0x7f7abd7db600).writer finishing
2016-07-27 08:08:55.536306 7f7a67798700 10 -- X.X.X.171:6802/19137 >> X.X.X.26:6789/0 pipe(0x7f7abd394000 sd=158 :38770 s=4 pgs=47454 cs=1 l=1 c=0x7f7abd7db600).writer done
2016-07-27 08:08:55.536320 7f7a92fdd700 2 -- X.X.X.171:6802/19137 >> X.X.X.26:6789/0 pipe(0x7f7abd394000 sd=158 :38770 s=4 pgs=47454 cs=1 l=1 c=0x7f7abd7db600).reader couldn't read tag, (0) Success
2016-07-27 08:08:55.536377 7f7a92fdd700 2 -- X.X.X.171:6802/19137 >> X.X.X.26:6789/0 pipe(0x7f7abd394000 sd=158 :38770 s=4 pgs=47454 cs=1 l=1 c=0x7f7abd7db600).fault (0) Success
2016-07-27 08:08:55.536390 7f7a92fdd700 10 -- X.X.X.171:6802/19137 >> X.X.X.26:6789/0 pipe(0x7f7abd394000 sd=158 :38770 s=4 pgs=47454 cs=1 l=1 c=0x7f7abd7db600).fault already closed|closing
2016-07-27 08:08:55.536408 7f7a92fdd700 10 -- X.X.X.171:6802/19137 queue_reap 0x7f7abd394000
2016-07-27 08:08:55.536418 7f7a92fdd700 10 -- X.X.X.171:6802/19137 >> X.X.X.26:6789/0 pipe(0x7f7abd394000 sd=158 :38770 s=4 pgs=47454 cs=1 l=1 c=0x7f7abd7db600).reader done
2016-07-27 08:08:55.536521 7f7a982f1700 10 -- X.X.X.171:6802/19137 reaper
2016-07-27 08:08:55.536556 7f7a982f1700 10 -- X.X.X.171:6802/19137 reaper reaping pipe 0x7f7abd394000 X.X.X.26:6789/0
2016-07-27 08:08:55.536572 7f7a982f1700 10 -- X.X.X.171:6802/19137 >> X.X.X.26:6789/0 pipe(0x7f7abd394000 sd=158 :38770 s=4 pgs=47454 cs=1 l=1 c=0x7f7abd7db600).discard_queue
2016-07-27 08:08:55.536601 7f7a982f1700 10 -- X.X.X.171:6802/19137 >> X.X.X.26:6789/0 pipe(0x7f7abd394000 sd=158 :38770 s=4 pgs=47454 cs=1 l=1 c=0x7f7abd7db600).unregister_pipe - not registered
2016-07-27 08:08:55.536616 7f7a982f1700 20 -- X.X.X.171:6802/19137 >> X.X.X.26:6789/0 pipe(0x7f7abd394000 sd=158 :38770 s=4 pgs=47454 cs=1 l=1 c=0x7f7abd7db600).join
2016-07-27 08:08:55.536677 7f7a982f1700 10 -- X.X.X.171:6802/19137 reaper reaped pipe 0x7f7abd394000 X.X.X.26:6789/0
2016-07-27 08:08:55.536706 7f7a982f1700 10 -- X.X.X.171:6802/19137 reaper deleted pipe 0x7f7abd394000
2016-07-27 08:08:55.536710 7f7a982f1700 10 -- X.X.X.171:6802/19137 reaper done
2016-07-27 08:08:55.538307 7f7a75a1b700 10 -- X.X.X.171:6802/19137 shutdown X.X.X.171:6802/19137
2016-07-27 08:08:55.538319 7f7a75a1b700 1 -- X.X.X.171:6802/19137 mark_down_all
2016-07-27 08:08:55.538327 7f7a75a1b700 10 -- 10.100.1.171:6802/8019137 shutdown 10.100.1.171:6802/8019137
2016-07-27 08:08:55.538330 7f7a75a1b700 1 -- 10.100.1.171:6802/8019137 mark_down_all
2016-07-27 08:08:55.538341 7f7a75a1b700 10 -- X.X.X.171:0/19137 shutdown X.X.X.171:0/19137
2016-07-27 08:08:55.538346 7f7a75a1b700 1 -- X.X.X.171:0/19137 mark_down_all
2016-07-27 08:08:55.538356 7f7a75a1b700 10 -- :/19137 shutdown :/19137
2016-07-27 08:08:55.538358 7f7a75a1b700 1 -- :/19137 mark_down_all
2016-07-27 08:08:55.538369 7f7a75a1b700 10 -- X.X.X.171:6807/8019137 shutdown X.X.X.171:6807/8019137
2016-07-27 08:08:55.538388 7f7a75a1b700 1 -- X.X.X.171:6807/8019137 mark_down_all
2016-07-27 08:08:55.538406 7f7a75a1b700 10 -- 10.100.1.171:6809/8019137 shutdown 10.100.1.171:6809/8019137
2016-07-27 08:08:55.538410 7f7a75a1b700 1 -- 10.100.1.171:6809/8019137 mark_down_all
2016-07-27 08:08:55.540063 7f7a9fd83800 10 -- X.X.X.171:6802/19137 wait: dispatch queue is stopped
2016-07-27 08:08:55.540109 7f7a9fd83800 20 -- X.X.X.171:6802/19137 wait: stopping accepter thread
2016-07-27 08:08:55.540116 7f7a9fd83800 10 accepter.stop accepter
2016-07-27 08:08:55.540211 7f7a88240700 20 accepter.accepter poll got 1
2016-07-27 08:08:55.540241 7f7a88240700 20 accepter.accepter closing
2016-07-27 08:08:55.540273 7f7a88240700 10 accepter.accepter stopping
2016-07-27 08:08:55.540407 7f7a9fd83800 20 -- X.X.X.171:6802/19137 wait: stopped accepter thread
2016-07-27 08:08:55.540439 7f7a9fd83800 20 -- X.X.X.171:6802/19137 wait: stopping reaper thread
2016-07-27 08:08:55.540517 7f7a982f1700 10 -- X.X.X.171:6802/19137 reaper_entry done
2016-07-27 08:08:55.540693 7f7a9fd83800 20 -- X.X.X.171:6802/19137 wait: stopped reaper thread
2016-07-27 08:08:55.540725 7f7a9fd83800 10 -- X.X.X.171:6802/19137 wait: closing pipes
2016-07-27 08:08:55.540731 7f7a9fd83800 10 -- X.X.X.171:6802/19137 reaper
2016-07-27 08:08:55.540743 7f7a9fd83800 10 -- X.X.X.171:6802/19137 reaper done
2016-07-27 08:08:55.540756 7f7a9fd83800 10 -- X.X.X.171:6802/19137 wait: waiting for pipes to close
2016-07-27 08:08:55.540761 7f7a9fd83800 10 -- X.X.X.171:6802/19137 wait: done.
2016-07-27 08:08:55.540768 7f7a9fd83800 1 -- X.X.X.171:6802/19137 shutdown complete.
2016-07-27 08:08:55.540776 7f7a9fd83800 10 -- X.X.X.171:0/19137 wait: waiting for dispatch queue
2016-07-27 08:08:55.540874 7f7a9fd83800 10 -- X.X.X.171:0/19137 wait: dispatch queue is stopped
2016-07-27 08:08:55.540891 7f7a9fd83800 20 -- X.X.X.171:0/19137 wait: stopping reaper thread
2016-07-27 08:08:55.540956 7f7a97af0700 10 -- X.X.X.171:0/19137 reaper_entry done
2016-07-27 08:08:55.541236 7f7a9fd83800 20 -- X.X.X.171:0/19137 wait: stopped reaper thread
2016-07-27 08:08:55.541266 7f7a9fd83800 10 -- X.X.X.171:0/19137 wait: closing pipes
2016-07-27 08:08:55.541273 7f7a9fd83800 10 -- X.X.X.171:0/19137 reaper
2016-07-27 08:08:55.541287 7f7a9fd83800 10 -- X.X.X.171:0/19137 reaper done
2016-07-27 08:08:55.541297 7f7a9fd83800 10 -- X.X.X.171:0/19137 wait: waiting for pipes to close
2016-07-27 08:08:55.541305 7f7a9fd83800 10 -- X.X.X.171:0/19137 wait: done.
2016-07-27 08:08:55.541311 7f7a9fd83800 1 -- X.X.X.171:0/19137 shutdown complete.
2016-07-27 08:08:55.541319 7f7a9fd83800 10 -- X.X.X.171:6807/8019137 wait: waiting for dispatch queue
2016-07-27 08:08:55.541437 7f7a9fd83800 10 -- X.X.X.171:6807/8019137 wait: dispatch queue is stopped
2016-07-27 08:08:55.541453 7f7a9fd83800 20 -- X.X.X.171:6807/8019137 wait: stopping accepter thread
2016-07-27 08:08:55.541459 7f7a9fd83800 10 accepter.stop accepter
2016-07-27 08:08:55.541533 7f7a84238700 20 accepter.accepter poll got 1
2016-07-27 08:08:55.541551 7f7a84238700 20 accepter.accepter closing
2016-07-27 08:08:55.541570 7f7a84238700 10 accepter.accepter stopping
2016-07-27 08:08:55.541677 7f7a9fd83800 20 -- X.X.X.171:6807/8019137 wait: stopped accepter thread
2016-07-27 08:08:55.541710 7f7a9fd83800 20 -- X.X.X.171:6807/8019137 wait: stopping reaper thread
2016-07-27 08:08:55.541762 7f7a972ef700 10 -- X.X.X.171:6807/8019137 reaper_entry done
2016-07-27 08:08:55.541889 7f7a9fd83800 20 -- X.X.X.171:6807/8019137 wait: stopped reaper thread
2016-07-27 08:08:55.541918 7f7a9fd83800 10 -- X.X.X.171:6807/8019137 wait: closing pipes
2016-07-27 08:08:55.541930 7f7a9fd83800 10 -- X.X.X.171:6807/8019137 reaper
2016-07-27 08:08:55.541945 7f7a9fd83800 10 -- X.X.X.171:6807/8019137 reaper done
2016-07-27 08:08:55.541954 7f7a9fd83800 10 -- X.X.X.171:6807/8019137 wait: waiting for pipes to close
2016-07-27 08:08:55.541961 7f7a9fd83800 10 -- X.X.X.171:6807/8019137 wait: done.
2016-07-27 08:08:55.541965 7f7a9fd83800 1 -- X.X.X.171:6807/8019137 shutdown complete.
2016-07-27 08:08:55.541973 7f7a9fd83800 10 -- 10.100.1.171:6809/8019137 wait: waiting for dispatch queue
2016-07-27 08:08:55.542071 7f7a9fd83800 10 -- 10.100.1.171:6809/8019137 wait: dispatch queue is stopped
2016-07-27 08:08:55.542085 7f7a9fd83800 20 -- 10.100.1.171:6809/8019137 wait: stopping accepter thread
2016-07-27 08:08:55.542090 7f7a9fd83800 10 accepter.stop accepter
2016-07-27 08:08:55.542147 7f7a82a35700 20 accepter.accepter poll got 1
2016-07-27 08:08:55.542169 7f7a82a35700 20 accepter.accepter closing
2016-07-27 08:08:55.542190 7f7a82a35700 10 accepter.accepter stopping
2016-07-27 08:08:55.542264 7f7a9fd83800 20 -- 10.100.1.171:6809/8019137 wait: stopped accepter thread
2016-07-27 08:08:55.542277 7f7a9fd83800 20 -- 10.100.1.171:6809/8019137 wait: stopping reaper thread
2016-07-27 08:08:55.542344 7f7a96aee700 10 -- 10.100.1.171:6809/8019137 reaper_entry done
2016-07-27 08:08:55.542484 7f7a9fd83800 20 -- 10.100.1.171:6809/8019137 wait: stopped reaper thread
2016-07-27 08:08:55.542512 7f7a9fd83800 10 -- 10.100.1.171:6809/8019137 wait: closing pipes
2016-07-27 08:08:55.542519 7f7a9fd83800 10 -- 10.100.1.171:6809/8019137 reaper
2016-07-27 08:08:55.542524 7f7a9fd83800 10 -- 10.100.1.171:6809/8019137 reaper done
2016-07-27 08:08:55.542533 7f7a9fd83800 10 -- 10.100.1.171:6809/8019137 wait: waiting for pipes to close
2016-07-27 08:08:55.542540 7f7a9fd83800 10 -- 10.100.1.171:6809/8019137 wait: done.
2016-07-27 08:08:55.542548 7f7a9fd83800 1 -- 10.100.1.171:6809/8019137 shutdown complete.
2016-07-27 08:08:55.542554 7f7a9fd83800 10 -- 10.100.1.171:6802/8019137 wait: waiting for dispatch queue
2016-07-27 08:08:55.542790 7f7a9fd83800 10 -- 10.100.1.171:6802/8019137 wait: dispatch queue is stopped
2016-07-27 08:08:55.542815 7f7a9fd83800 20 -- 10.100.1.171:6802/8019137 wait: stopping accepter thread
2016-07-27 08:08:55.542823 7f7a9fd83800 10 accepter.stop accepter
2016-07-27 08:08:55.542889 7f7a86a3d700 20 accepter.accepter poll got 1
2016-07-27 08:08:55.542908 7f7a86a3d700 20 accepter.accepter closing
2016-07-27 08:08:55.542931 7f7a86a3d700 10 accepter.accepter stopping
2016-07-27 08:08:55.543004 7f7a9fd83800 20 -- 10.100.1.171:6802/8019137 wait: stopped accepter thread
2016-07-27 08:08:55.543036 7f7a9fd83800 20 -- 10.100.1.171:6802/8019137 wait: stopping reaper thread
2016-07-27 08:08:55.543103 7f7a962ed700 10 -- 10.100.1.171:6802/8019137 reaper_entry done
2016-07-27 08:08:55.543928 7f7a9fd83800 20 -- 10.100.1.171:6802/8019137 wait: stopped reaper thread
2016-07-27 08:08:55.543944 7f7a9fd83800 10 -- 10.100.1.171:6802/8019137 wait: closing pipes
2016-07-27 08:08:55.543949 7f7a9fd83800 10 -- 10.100.1.171:6802/8019137 reaper
2016-07-27 08:08:55.543954 7f7a9fd83800 10 -- 10.100.1.171:6802/8019137 reaper done
2016-07-27 08:08:55.543958 7f7a9fd83800 10 -- 10.100.1.171:6802/8019137 wait: waiting for pipes to close
2016-07-27 08:08:55.543963 7f7a9fd83800 10 -- 10.100.1.171:6802/8019137 wait: done.
2016-07-27 08:08:55.543975 7f7a9fd83800 1 -- 10.100.1.171:6802/8019137 shutdown complete.
2016-07-27 08:08:55.543980 7f7a9fd83800 10 -- :/19137 wait: waiting for dispatch queue
2016-07-27 08:08:55.544030 7f7a9fd83800 10 -- :/19137 wait: dispatch queue is stopped
2016-07-27 08:08:55.544036 7f7a9fd83800 20 -- :/19137 wait: stopping reaper thread
2016-07-27 08:08:55.544135 7f7a95aec700 10 -- :/19137 reaper_entry done
2016-07-27 08:08:55.544195 7f7a9fd83800 20 -- :/19137 wait: stopped reaper thread
2016-07-27 08:08:55.544204 7f7a9fd83800 10 -- :/19137 wait: closing pipes
2016-07-27 08:08:55.544207 7f7a9fd83800 10 -- :/19137 reaper
2016-07-27 08:08:55.544209 7f7a9fd83800 10 -- :/19137 reaper done
2016-07-27 08:08:55.544212 7f7a9fd83800 10 -- :/19137 wait: waiting for pipes to close
2016-07-27 08:08:55.544214 7f7a9fd83800 10 -- :/19137 wait: done.
2016-07-27 08:08:55.544216 7f7a9fd83800 1 -- :/19137 shutdown complete.
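
In case it helps anyone reading the log above: at this debug level most of the output is noise, and the lines that matter are probably the low-level ones. Below is a minimal Python sketch for splitting a pasted (possibly re-wrapped) log into entries and keeping only those at or below a given debug level. It assumes every entry has the layout seen above (date, time, thread id, debug level, message) and that no message contains a timestamp of its own. `LogEntry`, `parse_entries` and `at_most_level` are hypothetical helper names, not part of Ceph.

```python
import re
from typing import List, NamedTuple


class LogEntry(NamedTuple):
    timestamp: str  # e.g. "2016-07-27 08:08:55.536160"
    thread: str     # hex thread id, e.g. "7f7a75a1b700"
    level: int      # debug level of the message (1, 10, 20, 30, ...)
    message: str    # everything after the level


# Assumed entry header: "<date> <time.usec> <hex thread id> <level> ".
_ENTRY_HEAD = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6}) ([0-9a-f]+) (-?\d+) "
)


def parse_entries(text: str) -> List[LogEntry]:
    """Split a log dump into entries, tolerating arbitrary line wrapping."""
    # Collapse all whitespace first so entries wrapped mid-line (even in the
    # middle of a timestamp) are rejoined before matching.
    flat = " ".join(text.split())
    heads = list(_ENTRY_HEAD.finditer(flat))
    entries = []
    for i, head in enumerate(heads):
        # An entry's message runs until the next header, or to the end.
        end = heads[i + 1].start() if i + 1 < len(heads) else len(flat)
        entries.append(
            LogEntry(
                timestamp=head.group(1),
                thread=head.group(2),
                level=int(head.group(3)),
                message=flat[head.end():end].strip(),
            )
        )
    return entries


def at_most_level(entries: List[LogEntry], max_level: int) -> List[LogEntry]:
    """Keep only entries logged at debug level <= max_level."""
    return [e for e in entries if e.level <= max_level]
```

For example, `at_most_level(parse_entries(open("osd.log").read()), 1)` would keep only the level-1 lines (the `mark_down` and `shutdown complete.` messages in the tail above); the file name here is only a placeholder.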