On Sunday 18 September 2016 12:12 PM, Kamble, Nitin A wrote:
>> On Sep 16, 2016, at 4:15 PM, Somnath Roy <Somnath.Roy@xxxxxxxxxxx> wrote:
>>
>> The partition sizes in ceph.conf are not mandatory.
>> If you specify only the db/wal/data partitions, the entire size of each partition should be used. If you also specify the sizes, it will truncate to those sizes. Here is one sample config:
>>
>> [osd.0]
>> host = emsnode10
>> devs = /dev/sdb1
>> bluestore_block_db_path = /dev/sdb2
>> bluestore_block_wal_path = /dev/sdb3
>> bluestore_block_path = /dev/sdb4
>>
>> I guess devs is not required for ceph-disk. I am using the old mkcephfs setup, and that's why it is needed.
>>
>> Thanks & Regards
>> Somnath
>>
> Instead of using the config for OSD definitions, I am creating the OSDs manually using a script in which all
> the wal, db, and block links are created. Even if one does not specify the partition sizes, there are
> predefined defaults. Is there a way to ignore the size from the config and use the size obtained by probing the device?

If the size is not specified in the config (ceph.conf), by default the drive/partition is probed and its whole capacity is used.

varada
>
> Thanks,
> Nitin
>
>> -----Original Message-----
>> From: Kamble, Nitin A [mailto:Nitin.Kamble@xxxxxxxxxxxx]
>> Sent: Friday, September 16, 2016 4:09 PM
>> To: Sage Weil
>> Cc: Somnath Roy; Ceph Development
>> Subject: Re: Bluestore OSD support in ceph-disk
>>
>>
>>> On Sep 16, 2016, at 1:54 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>>>
>>> On Fri, 16 Sep 2016, Kamble, Nitin A wrote:
>>>>> On Sep 16, 2016, at 12:23 PM, Somnath Roy <Somnath.Roy@xxxxxxxxxxx> wrote:
>>>>>
>>>>> How did you configure BlueStore, all defaults? i.e. everything in a single partition, with no separate partition for db/wal?
>>>> It uses separate partitions for data (SSD), wal (SSD), rocksdb (SSD), and block store (HDD).
>>>>
>>>>> I wonder if you are out of db space/disk space?
>>>> I noticed a misconfiguration on the cluster just now.
>>>> The wal and db partitions got swapped, so the db is getting just a 128MB partition now. This is probably the cause of the assert.
>>> FWIW bluefs is supposed to fall back on any allocation failure to the
>>> next larger/slower device (wal -> db -> primary), so having a tiny wal
>>> or tiny db shouldn't actually matter. A bluefs log (debug bluefs = 10
>>> or 20) leading up to any crash there would be helpful.
>>>
>>> Thanks!
>>> sage
>>>
>> Good to know about this fallback mechanism.
>> In my previous run, the partitions and the sizes in the config did not match. I see the ceph-daemon-dump showing 900MB+ used for db while the db partition was 128MB. I thought it had started overwriting onto the next partition, but instead, per this fallback logic, it started using the HDD for db.
>>
>> One issue I see is that ceph.conf lists the sizes of the wal, db, and block, but the actual partitions may have different sizes. From the ceph-daemon-dump output, it looks like the code is not looking at the partition's real size; instead it assumes the sizes from the config file are the partition sizes. I think probing the size of existing devices/files would be better than blindly taking the sizes from the config file.
>>
>> After 5 hours or so, 6+ of the 30 OSDs were down.
>> We will run the stress test once again with the fixed partition configuration and a debug level of 0, to get maximum performance. If that fails, I will switch to debug level 10 or 20 and gather detailed logs.
>>
>> Thanks,
>> Nitin
>>
>>>>> We had some issues on this front some time back, which were fixed; this
>>>>> may be a new issue (?). We need a verbose log for at least bluefs
>>>>> (debug_bluefs = 20/20).
>>>> Let me fix the cluster configuration to give more space to the DB partition. If the issue still comes up after that, I will try capturing detailed logs.
>>>>
>>>>> BTW, what is your workload (block size, IO pattern)?
>>>> The workload is an internal Teradata benchmark, which simulates the IO pattern of database disk access with various block sizes and IO patterns.
>>>>
>>>> Thanks,
>>>> Nitin
>>>>
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: Kamble, Nitin A [mailto:Nitin.Kamble@xxxxxxxxxxxx]
>>>>> Sent: Friday, September 16, 2016 12:00 PM
>>>>> To: Somnath Roy
>>>>> Cc: Sage Weil; Ceph Development
>>>>> Subject: Re: Bluestore OSD support in ceph-disk
>>>>>
>>>>>
>>>>>> On Sep 16, 2016, at 11:43 AM, Somnath Roy <Somnath.Roy@xxxxxxxxxxx> wrote:
>>>>>>
>>>>>> Please send the snippet (the very first trace; go up in the log) where it is actually printing the assert.
>>>>>> BTW, what workload are you running?
>>>>>>
>>>>>> Thanks & Regards
>>>>>> Somnath
>>>>>>
>>>>> Here it is.
>>>>>
>>>>> 2016-09-16 08:49:30.605845 7fb5a96ba700 -1 /build/nitin/nightly_builds/20160914_125459-master/ceph.git/rpmbuild/BUILD/ceph-v11.0.0-2309.g9096ad3/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_allocate(uint8_t, uint64_t, std::vector<bluefs_extent_t>*)' thread 7fb5a96ba700 time 2016-09-16 08:49:30.602139
>>>>> /build/nitin/nightly_builds/20160914_125459-master/ceph.git/rpmbuild/BUILD/ceph-v11.0.0-2309.g9096ad3/src/os/bluestore/BlueFS.cc: 1686: FAILED assert(0 == "allocate failed... wtf")
>>>>>
>>>>> ceph version v11.0.0-2309-g9096ad3 (9096ad37f2c0798c26d7784fb4e7a781feb72cb8)
>>>>> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7fb5bf43a11b]
>>>>> 2: (BlueFS::_allocate(unsigned char, unsigned long, std::vector<bluefs_extent_t, std::allocator<bluefs_extent_t> >*)+0x8ad) [0x7fb5bf2735dd]
>>>>> 3: (BlueFS::_flush_and_sync_log(std::unique_lock<std::mutex>&, unsigned long, unsigned long)+0xb4f) [0x7fb5bf27aa1f]
>>>>> 4: (BlueFS::_fsync(BlueFS::FileWriter*, std::unique_lock<std::mutex>&)+0x29b) [0x7fb5bf27bc9b]
>>>>> 5: (BlueRocksWritableFile::Sync()+0x4e) [0x7fb5bf29125e]
>>>>> 6: (rocksdb::WritableFileWriter::SyncInternal(bool)+0x139) [0x7fb5bf388699]
>>>>> 7: (rocksdb::WritableFileWriter::Sync(bool)+0x88) [0x7fb5bf389238]
>>>>> 8: (rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, unsigned long, bool)+0x13cf) [0x7fb5bf2e0a2f]
>>>>> 9: (rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, rocksdb::WriteBatch*)+0x27) [0x7fb5bf2e1637]
>>>>> 10: (RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::TransactionImpl>)+0x5b) [0x7fb5bf21a14b]
>>>>> 11: (BlueStore::_kv_sync_thread()+0xf5a) [0x7fb5bf1e7ffa]
>>>>> 12: (BlueStore::KVSyncThread::entry()+0xd) [0x7fb5bf1f5a6d]
>>>>> 13: (()+0x80a4) [0x7fb5bb4a70a4]
>>>>> 14: (clone()+0x6d) [0x7fb5ba32004d]
>>>>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>>>>
>>>>>
>>>>>
>>>>> Thanks,
>>>>> Nitin
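Varada notes that when no size is set in ceph.conf, the drive or partition is probed and its whole capacity is used, and Nitin suggests probing device sizes rather than trusting configured sizes blindly. The following is a minimal sketch, assuming Linux, of one way such probing can be done: `fstat` for regular (file-backed) stores and the `BLKGETSIZE64` ioctl for block devices. The names `probe_size` and `effective_size` are hypothetical, not BlueStore functions, and the policy in `effective_size` (truncate to a configured size, as Somnath describes, but never exceed the probed capacity, as Nitin proposes) is an illustration rather than Ceph's actual behavior.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>

#include <fcntl.h>
#include <linux/fs.h>   // BLKGETSIZE64
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper (not a Ceph API): returns the capacity in bytes of a
// regular file or block device, or std::nullopt if it cannot be determined.
std::optional<uint64_t> probe_size(const char* path) {
  int fd = ::open(path, O_RDONLY);
  if (fd < 0) {
    return std::nullopt;
  }
  std::optional<uint64_t> result;
  struct stat st {};
  if (::fstat(fd, &st) == 0) {
    if (S_ISBLK(st.st_mode)) {
      // Block devices report st_size == 0; ask the kernel for the byte size.
      uint64_t bytes = 0;
      if (::ioctl(fd, BLKGETSIZE64, &bytes) == 0) {
        result = bytes;
      }
    } else if (S_ISREG(st.st_mode)) {
      // File-backed stores: the file length is the usable capacity.
      result = static_cast<uint64_t>(st.st_size);
    }
  }
  ::close(fd);
  return result;
}

// Hypothetical sizing policy: 0 means "no size configured", so use the whole
// probed capacity; otherwise truncate to the configured size, but never
// claim more than the device actually has.
uint64_t effective_size(uint64_t configured, uint64_t probed) {
  return configured == 0 ? probed : std::min(configured, probed);
}
```

Called on the db partition from the sample config (`/dev/sdb2`), `probe_size` would return its capacity in bytes, provided the process has read access to the device. With a policy like `effective_size`, a config that lists a larger db size than the partition actually holds would be clamped to the real partition size instead of being trusted as-is.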
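Sage describes BlueFS as falling back, on any allocation failure, to the next larger/slower device in the order wal -> db -> primary. The toy model below illustrates that fallback order. It is a hypothetical sketch, not BlueFS's code: each device is reduced to a free-byte counter (real BlueFS allocates extents), and `Tier`, `TieredSpace`, and `allocate` are illustrative names. In this model a request fails only when every tier is exhausted, which is why a tiny wal or db partition by itself should not trigger the "allocate failed" assert, and it is consistent with Nitin seeing 900MB+ of db usage while the db partition was only 128MB.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical model of tiered allocation with fallback, in the order Sage
// describes: WAL -> DB -> primary (slow) device. Not BlueFS's actual code.
enum Tier : int { WAL = 0, DB = 1, SLOW = 2, NUM_TIERS = 3 };

struct TieredSpace {
  // Free bytes remaining on each tier (a simplification of extent tracking).
  std::array<uint64_t, NUM_TIERS> free_bytes{};

  // Try to take `len` bytes starting at `preferred`, falling back to each
  // larger/slower tier in turn. Returns the tier that satisfied the request,
  // or std::nullopt if every tier is exhausted; in this simplified model that
  // is the only situation analogous to the "allocate failed" assert.
  std::optional<Tier> allocate(Tier preferred, uint64_t len) {
    assert(preferred >= WAL && preferred < NUM_TIERS);
    for (int t = preferred; t < NUM_TIERS; ++t) {
      if (free_bytes[t] >= len) {
        free_bytes[t] -= len;
        return static_cast<Tier>(t);
      }
    }
    return std::nullopt;
  }
};
```

For example, with 128 bytes free on the db tier, a first 100-byte db request is served from DB, while a second 100-byte request no longer fits and spills over to the slow device instead of failing.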