John,

> 2016-02-12 12:53:43.340526 7f149bc71940 -1 journal FileJournal::_open: unable to setup io_context (0) Success

Try increasing aio-max-nr:

echo 131072 > /proc/sys/fs/aio-max-nr

(See the sketch after the quoted message below for checking the limit and making it persistent.)

Best regards,
Alexey

On Fri, Feb 12, 2016 at 4:51 PM, John Hogenmiller (yt) <john@xxxxxxxxxxx> wrote:
>
> I have 7 servers, each containing 60 x 6TB drives in JBOD mode. When I first
> started, I only activated a couple of drives on 3 nodes as Ceph OSDs.
> Yesterday, I went to expand to the remaining nodes as well as prepare and
> activate all the drives.
>
> ceph-disk prepare worked just fine. However, ceph-disk activate-all managed
> to activate only 33 drives and failed on the rest. This is consistent across
> all 7 nodes (existing and newly installed). At the end of the day, I have 33
> Ceph OSDs activated per server and can't activate any more. I did have to
> bump up the pg_num and pgp_num on the pool to accommodate the drives that did
> activate. I don't know whether having a low PG count during the mass influx
> of OSDs caused an issue within the pool. I don't think so, because I can only
> set pg_num to a maximum value determined by the number of known OSDs. But
> maybe you have to expand gradually: increase PGs, expand OSDs, increase PGs
> again. I certainly have not seen anything to suggest a magic "33 per node"
> limit, and I've seen references to servers with up to 72 Ceph OSDs on them.
>
> I then attempted to activate individual Ceph OSDs and got the same set of
> errors. I even wiped a drive and re-ran `ceph-disk prepare` and
> `ceph-disk activate`, only to have it fail in the same way.
>
> Status:
> ```
> root@ljb01:/home/ceph/rain-cluster# ceph status
>     cluster 4ebe7995-6a33-42be-bd4d-20f51d02ae45
>      health HEALTH_OK
>      monmap e5: 5 mons at {hail02-r01-06=172.29.4.153:6789/0,hail02-r01-08=172.29.4.155:6789/0,rain02-r01-01=172.29.4.148:6789/0,rain02-r01-03=172.29.4.150:6789/0,rain02-r01-04=172.29.4.151:6789/0}
>             election epoch 12, quorum 0,1,2,3,4 rain02-r01-01,rain02-r01-03,rain02-r01-04,hail02-r01-06,hail02-r01-08
>      osdmap e1116: 420 osds: 232 up, 232 in
>             flags sortbitwise
>       pgmap v397198: 10872 pgs, 14 pools, 101 MB data, 8456 objects
>             38666 MB used, 1264 TB / 1264 TB avail
>                10872 active+clean
> ```
>
> Here is what I get when I run ceph-disk prepare on a blank drive:
>
> ```
> root@rain02-r01-01:/etc/ceph# ceph-disk prepare /dev/sdbh1
> The operation has completed successfully.
> The operation has completed successfully.
> meta-data=/dev/sdbh1             isize=2048   agcount=6, agsize=268435455 blks
>          =                       sectsz=512   attr=2, projid32bit=0
> data     =                       bsize=4096   blocks=1463819665, imaxpct=5
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0
> log      =internal log           bsize=4096   blocks=521728, version=2
>          =                       sectsz=512   sunit=0 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> The operation has completed successfully.
>
> root@rain02-r01-01:/etc/ceph# parted /dev/sdh print
> Model: ATA HUS726060ALA640 (scsi)
> Disk /dev/sdh: 6001GB
> Sector size (logical/physical): 512B/512B
> Partition Table: gpt
>
> Number  Start   End     Size    File system  Name          Flags
>  2      1049kB  5369MB  5368MB               ceph journal
>  1      5370MB  6001GB  5996GB  xfs          ceph data
> ```
>
> And finally, the errors from attempting to activate the drive.
>
> ```
> root@rain02-r01-01:/etc/ceph# ceph-disk activate /dev/sdbh1
> got monmap epoch 5
> 2016-02-12 12:53:43.340526 7f149bc71940 -1 journal FileJournal::_open: unable to setup io_context (0) Success
> 2016-02-12 12:53:43.340748 7f1493f83700 -1 journal io_submit to 0~4096 got (22) Invalid argument
> 2016-02-12 12:53:43.341186 7f149bc71940 -1 filestore(/var/lib/ceph/tmp/mnt.KRphD_) could not find -1/23c2fcde/osd_superblock/0 in index: (2) No such file or directory
> os/FileJournal.cc: In function 'int FileJournal::write_aio_bl(off64_t&, ceph::bufferlist&, uint64_t)' thread 7f1493f83700 time 2016-02-12 12:53:43.341355
> os/FileJournal.cc: 1469: FAILED assert(0 == "io_submit got unexpected error")
> ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7f149b767f2b]
>  2: (FileJournal::write_aio_bl(long&, ceph::buffer::list&, unsigned long)+0x5ad) [0x7f149b5fe27d]
>  3: (FileJournal::do_aio_write(ceph::buffer::list&)+0x263) [0x7f149b602e63]
>  4: (FileJournal::write_thread_entry()+0x4e4) [0x7f149b607394]
>  5: (FileJournal::Writer::entry()+0xd) [0x7f149b44bddd]
>  6: (()+0x8182) [0x7f1499d87182]
>  7: (clone()+0x6d) [0x7f14980ce47d]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> 2016-02-12 12:53:43.345434 7f1493f83700 -1 os/FileJournal.cc: In function 'int FileJournal::write_aio_bl(off64_t&, ceph::bufferlist&, uint64_t)' thread 7f1493f83700 time 2016-02-12 12:53:43.341355
> os/FileJournal.cc: 1469: FAILED assert(0 == "io_submit got unexpected error")
>
> ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7f149b767f2b]
>  2: (FileJournal::write_aio_bl(long&, ceph::buffer::list&, unsigned long)+0x5ad) [0x7f149b5fe27d]
>  3: (FileJournal::do_aio_write(ceph::buffer::list&)+0x263) [0x7f149b602e63]
>  4: (FileJournal::write_thread_entry()+0x4e4) [0x7f149b607394]
>  5: (FileJournal::Writer::entry()+0xd) [0x7f149b44bddd]
>  6: (()+0x8182) [0x7f1499d87182]
>  7: (clone()+0x6d) [0x7f14980ce47d]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
>     -4> 2016-02-12 12:53:43.340526 7f149bc71940 -1 journal FileJournal::_open: unable to setup io_context (0) Success
>     -3> 2016-02-12 12:53:43.340748 7f1493f83700 -1 journal io_submit to 0~4096 got (22) Invalid argument
>     -1> 2016-02-12 12:53:43.341186 7f149bc71940 -1 filestore(/var/lib/ceph/tmp/mnt.KRphD_) could not find -1/23c2fcde/osd_superblock/0 in index: (2) No such file or directory
>      0> 2016-02-12 12:53:43.345434 7f1493f83700 -1 os/FileJournal.cc: In function 'int FileJournal::write_aio_bl(off64_t&, ceph::bufferlist&, uint64_t)' thread 7f1493f83700 time 2016-02-12 12:53:43.341355
> os/FileJournal.cc: 1469: FAILED assert(0 == "io_submit got unexpected error")
>
> ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7f149b767f2b]
>  2: (FileJournal::write_aio_bl(long&, ceph::buffer::list&, unsigned long)+0x5ad) [0x7f149b5fe27d]
>  3: (FileJournal::do_aio_write(ceph::buffer::list&)+0x263) [0x7f149b602e63]
>  4: (FileJournal::write_thread_entry()+0x4e4) [0x7f149b607394]
>  5: (FileJournal::Writer::entry()+0xd) [0x7f149b44bddd]
>  6: (()+0x8182) [0x7f1499d87182]
>  7: (clone()+0x6d) [0x7f14980ce47d]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
> terminate called after throwing an instance of 'ceph::FailedAssertion'
> *** Caught signal (Aborted) **
>  in thread 7f1493f83700
> ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
>  1: (()+0x7d02ca) [0x7f149b67b2ca]
>  2: (()+0x10340) [0x7f1499d8f340]
>  3: (gsignal()+0x39) [0x7f149800acc9]
>  4: (abort()+0x148) [0x7f149800e0d8]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f1498915535]
>  6: (()+0x5e6d6) [0x7f14989136d6]
>  7: (()+0x5e703) [0x7f1498913703]
>  8: (()+0x5e922) [0x7f1498913922]
>  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x278) [0x7f149b768118]
>  10: (FileJournal::write_aio_bl(long&, ceph::buffer::list&, unsigned long)+0x5ad) [0x7f149b5fe27d]
>  11: (FileJournal::do_aio_write(ceph::buffer::list&)+0x263) [0x7f149b602e63]
>  12: (FileJournal::write_thread_entry()+0x4e4) [0x7f149b607394]
>  13: (FileJournal::Writer::entry()+0xd) [0x7f149b44bddd]
>  14: (()+0x8182) [0x7f1499d87182]
>  15: (clone()+0x6d) [0x7f14980ce47d]
> 2016-02-12 12:53:43.348498 7f1493f83700 -1 *** Caught signal (Aborted) **
>  in thread 7f1493f83700
>
> ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
>  1: (()+0x7d02ca) [0x7f149b67b2ca]
>  2: (()+0x10340) [0x7f1499d8f340]
>  3: (gsignal()+0x39) [0x7f149800acc9]
>  4: (abort()+0x148) [0x7f149800e0d8]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f1498915535]
>  6: (()+0x5e6d6) [0x7f14989136d6]
>  7: (()+0x5e703) [0x7f1498913703]
>  8: (()+0x5e922) [0x7f1498913922]
>  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x278) [0x7f149b768118]
>  10: (FileJournal::write_aio_bl(long&, ceph::buffer::list&, unsigned long)+0x5ad) [0x7f149b5fe27d]
>  11: (FileJournal::do_aio_write(ceph::buffer::list&)+0x263) [0x7f149b602e63]
>  12: (FileJournal::write_thread_entry()+0x4e4) [0x7f149b607394]
>  13: (FileJournal::Writer::entry()+0xd) [0x7f149b44bddd]
>  14: (()+0x8182) [0x7f1499d87182]
>  15: (clone()+0x6d) [0x7f14980ce47d]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
>      0> 2016-02-12 12:53:43.348498 7f1493f83700 -1 *** Caught signal (Aborted) **
>  in thread 7f1493f83700
>
> ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
>  1: (()+0x7d02ca) [0x7f149b67b2ca]
>  2: (()+0x10340) [0x7f1499d8f340]
>  3: (gsignal()+0x39) [0x7f149800acc9]
>  4: (abort()+0x148) [0x7f149800e0d8]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f1498915535]
>  6: (()+0x5e6d6) [0x7f14989136d6]
>  7: (()+0x5e703) [0x7f1498913703]
>  8: (()+0x5e922) [0x7f1498913922]
>  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x278) [0x7f149b768118]
>  10: (FileJournal::write_aio_bl(long&, ceph::buffer::list&, unsigned long)+0x5ad) [0x7f149b5fe27d]
>  11: (FileJournal::do_aio_write(ceph::buffer::list&)+0x263) [0x7f149b602e63]
>  12: (FileJournal::write_thread_entry()+0x4e4) [0x7f149b607394]
>  13: (FileJournal::Writer::entry()+0xd) [0x7f149b44bddd]
>  14: (()+0x8182) [0x7f1499d87182]
>  15: (clone()+0x6d) [0x7f14980ce47d]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
> ERROR:ceph-disk:Failed to activate
> Traceback (most recent call last):
>   File "/usr/sbin/ceph-disk", line 3576, in <module>
>     main(sys.argv[1:])
>   File "/usr/sbin/ceph-disk", line 3532, in main
>     main_catch(args.func, args)
>   File "/usr/sbin/ceph-disk", line 3554, in main_catch
>     func(args)
>   File "/usr/sbin/ceph-disk", line 2424, in main_activate
>     dmcrypt_key_dir=args.dmcrypt_key_dir,
>   File "/usr/sbin/ceph-disk", line 2197, in mount_activate
>     (osd_id, cluster) = activate(path, activate_key_template, init)
>   File "/usr/sbin/ceph-disk", line 2360, in activate
>     keyring=keyring,
>   File "/usr/sbin/ceph-disk", line 1950, in mkfs
>     '--setgroup', get_ceph_user(),
>   File "/usr/sbin/ceph-disk", line 349, in command_check_call
>     return subprocess.check_call(arguments)
>   File "/usr/lib/python2.7/subprocess.py", line 540, in check_call
>     raise CalledProcessError(retcode, cmd)
> subprocess.CalledProcessError: Command '['/usr/bin/ceph-osd', '--cluster', 'ceph', '--mkfs', '--mkkey', '-i', '165', '--monmap', '/var/lib/ceph/tmp/mnt.KRphD_/activate.monmap', '--osd-data', '/var/lib/ceph/tmp/mnt.KRphD_', '--osd-journal', '/var/lib/ceph/tmp/mnt.K
>
> root@rain02-r01-01:/etc/ceph# ls -l /var/lib/ceph/tmp/
> total 0
> -rw-r--r-- 1 root root 0 Feb 12 12:58 ceph-disk.activate.lock
> -rw-r--r-- 1 root root 0 Feb 12 12:58 ceph-disk.prepare.lock
> ```
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
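For anyone who hits the same io_context failure on a dense JBOD host: the error above comes from io_setup() being refused once the kernel-wide pool of AIO contexts is exhausted, and fs.aio-max-nr is the cap on that pool. Each FileStore journal opened with aio reserves part of it, so a host bringing up 60 OSDs can run out partway through activation, which would explain a hard, consistent per-node ceiling like the 33 OSDs seen here. Below is a minimal sketch of checking the limit and making Alexey's fix persistent; 131072 is the value from his reply, the sysctl.d filename is illustrative, and you may want a larger value for your OSD count.

```
# Kernel-wide AIO context usage vs. the cap; io_setup() starts failing once
# aio-nr would exceed aio-max-nr
cat /proc/sys/fs/aio-nr /proc/sys/fs/aio-max-nr

# Raise the cap on the running kernel (takes effect immediately)
sysctl -w fs.aio-max-nr=131072

# Persist it across reboots; the filename here is illustrative
echo 'fs.aio-max-nr = 131072' > /etc/sysctl.d/90-aio-max-nr.conf
sysctl -p /etc/sysctl.d/90-aio-max-nr.conf
```

After raising the limit, re-running `ceph-disk activate` on one of the failed drives (or `ceph-disk activate-all`) should show whether the journals now come up cleanly.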