RE: Anybody else hitting this panic in latest master with bluestore?

Thanks, Kevan, for confirming this.
After I properly reformatted the drives, I didn't hit the issue, so I didn't bother chasing it.
Ramesh,
Could you please look into this?

Regards
Somnath

-----Original Message-----
From: Kevan Rehm [mailto:krehm@xxxxxxxx]
Sent: Sunday, July 10, 2016 6:53 AM
To: Somnath Roy
Cc: ceph-devel
Subject: Re: Anybody else hitting this panic in latest master with bluestore?

Somnath,

I hit this same bug while testing bluestore with a PMEM device; ceph-deploy created a partition whose size did not fall on a 4096-byte boundary.

I opened ceph issue 16644 to document the problem; see the issue for a 3-line patch I proposed that fixes it.
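
The gist, as a rough sketch (see the issue for the actual 3-line patch; the helper below is illustrative only), is to stop the freelist from ever seeing a size that is not a multiple of its block size:

  #include <cstdint>

  // Sketch only, not the committed patch (see issue 16644 for that):
  // round a device/partition size down to the freelist block size so
  // every offset handed to the bitmap stays block-aligned.
  uint64_t round_down_to_block(uint64_t size, uint64_t bytes_per_block) {
    return size & ~(bytes_per_block - 1);  // e.g. keep only whole 4K blocks
  }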

Kevan


On 6/8/16, 2:14 AM, "ceph-devel-owner@xxxxxxxxxxxxxxx on behalf of Somnath Roy" <Somnath.Roy@xxxxxxxxxxx> wrote:

>To reproduce it, format a device with a 512-byte sector size. I will revert
>the same device to 512-byte sectors tomorrow and see if I can still reproduce.
>Here is the verbose log I collected; see if that helps.
>
>2016-06-07 13:32:25.431373 7fce0cee28c0 10 stupidalloc commit_start releasing 0 in extents 0
>2016-06-07 13:32:25.431580 7fce0cee28c0 10 stupidalloc commit_finish released 0 in extents 0
>2016-06-07 13:32:25.431733 7fce0cee28c0 10 stupidalloc reserve need 1048576 num_free 306824863744 num_reserved 0
>2016-06-07 13:32:25.431743 7fce0cee28c0 10 stupidalloc allocate want_size 1048576 alloc_unit 1048576 hint 0
>2016-06-07 13:32:25.435021 7fce0cee28c0  4 rocksdb: DB pointer 0x7fce08909200
>2016-06-07 13:32:25.435049 7fce0cee28c0  1 bluestore(/var/lib/ceph/osd/ceph-15) _open_db opened rocksdb path db options compression=kNoCompression,max_write_buffer_number=16,min_write_buffer_number_to_merge=3,recycle_log_file_num=16
>2016-06-07 13:32:25.435057 7fce0cee28c0 20 bluestore(/var/lib/ceph/osd/ceph-15) _open_fm initializing freespace
>2016-06-07 13:32:25.435066 7fce0cee28c0 10 freelist _init_misc bytes_per_key 0x80000, key_mask 0xfffffffffff80000
>2016-06-07 13:32:25.435074 7fce0cee28c0 10 freelist create rounding blocks up from 0x6f9fd151e00 to 0x6f9fd180000 (0x6f9fd180 blocks)
>2016-06-07 13:32:25.438853 7fce0cee28c0 -1 os/bluestore/BitmapFreelistManager.cc: In function 'void BitmapFreelistManager::_xor(uint64_t, uint64_t, KeyValueDB::Transaction)' thread 7fce0cee28c0 time 2016-06-07 13:32:25.435087
>os/bluestore/BitmapFreelistManager.cc: 477: FAILED assert((offset & block_mask) == offset)
>
> ceph version 10.2.0-2021-g55cb608 (55cb608f63787f7969514ad0d7222da68ab84d88)
> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x562bdda880a0]
> 2: (BitmapFreelistManager::_xor(unsigned long, unsigned long, std::shared_ptr<KeyValueDB::TransactionImpl>)+0x12ed) [0x562bdd75a96d]
> 3: (BitmapFreelistManager::create(unsigned long, std::shared_ptr<KeyValueDB::TransactionImpl>)+0x33f) [0x562bdd75b34f]
> 4: (BlueStore::_open_fm(bool)+0xcd3) [0x562bdd641683]
> 5: (BlueStore::mkfs()+0x8b9) [0x562bdd6839b9]
> 6: (OSD::mkfs(CephContext*, ObjectStore*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, uuid_d, int)+0x117) [0x562bdd3226c7]
> 7: (main()+0x1003) [0x562bdd2b4533]
> 8: (__libc_start_main()+0xf0) [0x7fce09946830]
> 9: (_start()+0x29) [0x562bdd3038b9]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
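>For reference, the failing check can be reproduced in isolation (a minimal sketch, assuming block_mask is the usual ~(block_size - 1) and a 4K bdev_block_size):
>
>  #include <cassert>
>  #include <cstdint>
>
>  int main() {
>    uint64_t block_size = 4096;               // assumed bdev_block_size
>    uint64_t block_mask = ~(block_size - 1);  // 0xfffffffffffff000
>    uint64_t offset = 0x6f9fd151e00;          // size from the freelist log above
>    // 0xe00 = 7 * 512: 512-byte aligned but not 4K-aligned, so this fires:
>    assert((offset & block_mask) == offset);
>    return 0;
>  }
>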
>Thanks & Regards
>Somnath
>
>
>-----Original Message-----
>From: Ramesh Chander
>Sent: Tuesday, June 07, 2016 11:01 PM
>To: Somnath Roy; Mark Nelson; Sage Weil
>Cc: ceph-devel
>Subject: RE: Anybody else hitting this panic in latest master with
>bluestore?
>
>Hi Somnath,
>
>I think setting the 4k block size is intentional.
>
>
>  // Operate as though the block size is 4 KB.  The backing file
>  // blksize doesn't strictly matter except that some file systems may
>  // require a read/modify/write if we write something smaller than
>  // it.
>  block_size = g_conf->bdev_block_size;
>  if (block_size != (unsigned)st.st_blksize) {
>    dout(1) << __func__ << " backing device/file reports st_blksize "
>      << st.st_blksize << ", using bdev_block_size "
>      << block_size << " anyway" << dendl;
>  }
>
>
>Other than more fragmentation, we should not see any issue from taking
>the block size as 4k instead of 512; at least I am not aware of any.
>
>How to reproduce it? I can have a look.
>
>-Ramesh
>
>> -----Original Message-----
>> From: Somnath Roy
>> Sent: Wednesday, June 08, 2016 5:04 AM
>> To: Somnath Roy; Mark Nelson; Sage Weil
>> Cc: Ramesh Chander; ceph-devel
>> Subject: RE: Anybody else hitting this panic in latest master with
>>bluestore?
>>
>> Ok, I think I found out what is happening in my environment. This
>> drive is formatted with a 512-byte logical block size.
>> The bitmap allocator works with a 4K block size by default, and the
>> calculation is breaking (?). I have reformatted the device with 4K and
>> it worked fine.
>> I don't think taking this logical block size parameter as user input
>> is *wise*, since the OS requires every device to advertise its correct
>> logical block size here:
>>
>> /sys/block/sdb/queue/logical_block_size
>>
>> The allocator needs to read the correct size from the above location.
>> Sage/Ramesh ?
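>>
>> For illustration, a minimal sketch of reading it (the helper below is hypothetical, not ceph code):
>>
>>   #include <cstdint>
>>   #include <fstream>
>>   #include <string>
>>
>>   // Read the kernel-reported logical block size for a device such as "sdb";
>>   // fall back to 4096 if the sysfs attribute cannot be read.
>>   uint64_t logical_block_size(const std::string& dev) {
>>     std::ifstream f("/sys/block/" + dev + "/queue/logical_block_size");
>>     uint64_t size = 0;
>>     if (f >> size && size > 0)
>>       return size;
>>     return 4096;  // assumed conservative default
>>   }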
>>
>> Thanks & Regards
>> Somnath
>>
>> -----Original Message-----
>> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Somnath Roy
>> Sent: Tuesday, June 07, 2016 1:12 PM
>> To: Mark Nelson; Sage Weil
>> Cc: Ramesh Chander; ceph-devel
>> Subject: RE: Anybody else hitting this panic in latest master with
>>bluestore?
>>
>> Mark/Sage,
>> That problem seems to be gone. BTW, the rocksdb folder is not cleaned
>> by 'make clean'. I took latest master and manually cleaned the rocksdb
>> folder as you suggested.
>> But now I am hitting the following crash on some of my drives. It
>> seems to be related to block alignment.
>>
>>      0> 2016-06-07 11:50:12.353375 7f5c0fe938c0 -1 os/bluestore/BitmapFreelistManager.cc: In function 'void BitmapFreelistManager::_xor(uint64_t, uint64_t, KeyValueDB::Transaction)' thread 7f5c0fe938c0 time 2016-06-07 11:50:12.349722
>> os/bluestore/BitmapFreelistManager.cc: 477: FAILED assert((offset & block_mask) == offset)
>>
>>  ceph version 10.2.0-2021-g55cb608 (55cb608f63787f7969514ad0d7222da68ab84d88)
>>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x5652219dd0a0]
>>  2: (BitmapFreelistManager::_xor(unsigned long, unsigned long, std::shared_ptr<KeyValueDB::TransactionImpl>)+0x12ed) [0x5652216af96d]
>>  3: (BitmapFreelistManager::create(unsigned long, std::shared_ptr<KeyValueDB::TransactionImpl>)+0x33f) [0x5652216b034f]
>>  4: (BlueStore::_open_fm(bool)+0xcd3) [0x565221596683]
>>  5: (BlueStore::mkfs()+0x8b9) [0x5652215d89b9]
>>  6: (OSD::mkfs(CephContext*, ObjectStore*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, uuid_d, int)+0x117) [0x5652212776c7]
>>  7: (main()+0x1003) [0x565221209533]
>>  8: (__libc_start_main()+0xf0) [0x7f5c0c8f7830]
>>  9: (_start()+0x29) [0x5652212588b9]
>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>
>> Here are my disk partitions.
>>
>> Osd.15 on /dev/sdi crashed:
>>
>>
>> sdi       8:128  0     7T  0 disk
>> ├─sdi1    8:129  0    10G  0 part /var/lib/ceph/osd/ceph-15
>> └─sdi2    8:130  0     7T  0 part
>> nvme0n1 259:0    0  15.4G  0 disk
>> root@emsnode11:~/ceph-master/src# fdisk /dev/sdi
>>
>> Welcome to fdisk (util-linux 2.27.1).
>> Changes will remain in memory only, until you decide to write them.
>> Be careful before using the write command.
>>
>>
>> Command (m for help): p
>> Disk /dev/sdi: 7 TiB, 7681501126656 bytes, 15002931888 sectors
>> Units: sectors of 1 * 512 = 512 bytes
>> Sector size (logical/physical): 512 bytes / 16384 bytes
>> I/O size (minimum/optimal): 16384 bytes / 16384 bytes
>> Disklabel type: gpt
>> Disk identifier: 4A3182B9-23EA-441A-A113-FE904E81BF3E
>>
>> Device        Start         End     Sectors Size Type
>> /dev/sdi1      2048    20973567    20971520  10G Linux filesystem
>> /dev/sdi2  20973568 15002931854 14981958287   7T Linux filesystem
>>
>> Seems to be aligned properly; what alignment is the bitmap allocator
>> looking for (Ramesh?).
>> I will debug further and update.
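>>
>> A quick check of the numbers (a sketch; these compile-time assertions all hold):
>>
>>   // sdi2 is 14981958287 sectors of 512 bytes = 0x6f9fd151e00 bytes,
>>   // exactly the size the freelist log above rounds up from.
>>   static_assert(14981958287ull * 512 == 0x6f9fd151e00ull, "sizes match");
>>   // That size is 512-byte aligned but not 4K-aligned (remainder 0xe00):
>>   static_assert(0x6f9fd151e00ull % 512 == 0, "512-byte aligned");
>>   static_assert(0x6f9fd151e00ull % 4096 == 0xe00, "not 4K-aligned");
>>
>> So the partition *start* is aligned; it is the partition *size* (an odd
>> number of 512-byte sectors) that breaks the 4K block math.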
>>
>> Thanks & Regards
>> Somnath
>>
>> -----Original Message-----
>> From: Somnath Roy
>> Sent: Tuesday, June 07, 2016 11:06 AM
>> To: 'Mark Nelson'; Sage Weil
>> Cc: Ramesh Chander; ceph-devel
>> Subject: RE: Anybody else hitting this panic in latest master with
>>bluestore?
>>
>> I will try now and let you know.
>>
>> Thanks & Regards
>> Somnath
>>
>> -----Original Message-----
>> From: Mark Nelson [mailto:mnelson@xxxxxxxxxx]
>> Sent: Tuesday, June 07, 2016 10:57 AM
>> To: Somnath Roy; Sage Weil
>> Cc: Ramesh Chander; ceph-devel
>> Subject: Re: Anybody else hitting this panic in latest master with
>>bluestore?
>>
>> Hi Somnath,
>>
>> Did Sage's suggestion fix it for you?  In my tests rocksdb wasn't
>> building properly after an upstream commit to detect when jemalloc
>> isn't
>> present:
>>
>> https://github.com/facebook/rocksdb/commit/0850bc514737a64dc8ca13de8510fcad4756616a
>>
>> I've submitted a fix that is now in master.  If you clean the rocksdb
>> folder and try again with current master I believe it should work for you.
>>
>> Thanks,
>> Mark
>>
>> On 06/07/2016 09:23 AM, Somnath Roy wrote:
>> > Sage,
>> > I did a global 'make clean' before building; isn't that sufficient? Do I
>> > still need to go into the rocksdb folder and clean?
>> >
>> >
>> > -----Original Message-----
>> > From: Sage Weil [mailto:sage@xxxxxxxxxxxx]
>> > Sent: Tuesday, June 07, 2016 6:06 AM
>> > To: Mark Nelson
>> > Cc: Somnath Roy; Ramesh Chander; ceph-devel
>> > Subject: Re: Anybody else hitting this panic in latest master with
>>bluestore?
>> >
>> > On Tue, 7 Jun 2016, Mark Nelson wrote:
>> >> I believe this is due to the rocksdb submodule update in PR #9466.
>> >> I'm working on tracking down the commit in rocksdb that's causing it.
>> >
>> > Is it possible that the problem is that your build *didn't* update rocksdb?
>> >
>> > The ceph makefile isn't smart enough to notice changes in the rocksdb/
>> > dir and rebuild.  You have to 'cd rocksdb ; make clean ; cd ..' after the
>> > submodule updates to get a fresh build.
>> >
>> > Maybe you didn't do that, and some of the ceph code is built using the
>> > new headers and data structures that don't match the previously
>> > compiled rocksdb code?
>> >
>> > sage
