Ramesh,

Yes, and I'm not suggesting a change to that. Bluestore already has some logic in it to "round down" the size of the block device to a blocks_per_key boundary, by marking any trailing blocks as "in-use". I just tweaked the code to detect any trailing partial block and include it in the range that gets marked as in-use.

Kevan
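For illustration, here is a minimal standalone sketch of that idea (hypothetical names and simplified arithmetic, not the actual BitmapFreelistManager code), using the device size and bytes_per_key from the log quoted further down:

    // Sketch: round the device size down to a whole 4K block, and treat
    // everything from there up to the next blocks_per_key boundary as
    // "in-use" so the allocator never hands it out. Names are illustrative.
    #include <cstdint>
    #include <cstdio>

    int main() {
      const uint64_t block_size     = 4096;      // bitmap block, bytes
      const uint64_t bytes_per_key  = 0x80000;   // from the freelist log below
      const uint64_t blocks_per_key = bytes_per_key / block_size;  // 128

      uint64_t dev_size = 0x6f9fd151e00;         // device size from the log; not 4K-aligned

      // Largest 4K-aligned size <= dev_size; the trailing partial block
      // (0xe00 bytes here) starts at this offset and must be marked in-use.
      uint64_t usable = dev_size & ~(block_size - 1);

      // The existing create() code rounds the block count *up* to a
      // blocks_per_key boundary (0x6f9fd180000 in the log); those trailing
      // whole blocks are marked in-use as well.
      uint64_t blocks  = (dev_size + block_size - 1) / block_size;
      uint64_t rounded = (blocks + blocks_per_key - 1) / blocks_per_key * blocks_per_key;

      printf("mark in-use: [0x%llx, 0x%llx)\n",
             (unsigned long long)usable,
             (unsigned long long)(rounded * block_size));
      return 0;
    }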
On 7/10/16, 10:15 AM, "Ramesh Chander" <Ramesh.Chander@xxxxxxxxxxx> wrote:

>I think there are some calculations that expect storage to be 4K-aligned
>in both allocators.
>
>I will look into it.
>
>-Ramesh
>
>> -----Original Message-----
>> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-
>> owner@xxxxxxxxxxxxxxx] On Behalf Of Somnath Roy
>> Sent: Sunday, July 10, 2016 8:22 PM
>> To: Kevan Rehm
>> Cc: ceph-devel
>> Subject: RE: Anybody else hitting this panic in latest master with
>> bluestore?
>>
>> Thanks, Kevan, for confirming this.
>> After I properly reformatted the drives I didn't hit the issue, so I
>> didn't bother chasing it.
>> Ramesh,
>> Could you please look into this?
>>
>> Regards
>> Somnath
>>
>> -----Original Message-----
>> From: Kevan Rehm [mailto:krehm@xxxxxxxx]
>> Sent: Sunday, July 10, 2016 6:53 AM
>> To: Somnath Roy
>> Cc: ceph-devel
>> Subject: Re: Anybody else hitting this panic in latest master with
>> bluestore?
>>
>> Somnath,
>>
>> I hit this same bug while testing bluestore with a PMEM device;
>> ceph-deploy created a partition whose size did not fall on a
>> 4096-byte boundary.
>>
>> I opened ceph issue 16644 to document the problem; see the issue for a
>> three-line patch I proposed that fixes it.
>>
>> Kevan
>>
>>
>> On 6/8/16, 2:14 AM, "ceph-devel-owner@xxxxxxxxxxxxxxx on behalf of
>> Somnath Roy" <ceph-devel-owner@xxxxxxxxxxxxxxx on behalf of
>> Somnath.Roy@xxxxxxxxxxx> wrote:
>>
>> >To reproduce it, format a device with a 512-byte sector size. I will
>> >revert the same device back to 512-byte sectors tomorrow and see if I
>> >can still reproduce. Here is the verbose log I collected; see if that
>> >helps.
>> >
>> >2016-06-07 13:32:25.431373 7fce0cee28c0 10 stupidalloc commit_start releasing 0 in extents 0
>> >2016-06-07 13:32:25.431580 7fce0cee28c0 10 stupidalloc commit_finish released 0 in extents 0
>> >2016-06-07 13:32:25.431733 7fce0cee28c0 10 stupidalloc reserve need 1048576 num_free 306824863744 num_reserved 0
>> >2016-06-07 13:32:25.431743 7fce0cee28c0 10 stupidalloc allocate want_size 1048576 alloc_unit 1048576 hint 0
>> >2016-06-07 13:32:25.435021 7fce0cee28c0  4 rocksdb: DB pointer 0x7fce08909200
>> >2016-06-07 13:32:25.435049 7fce0cee28c0  1 bluestore(/var/lib/ceph/osd/ceph-15) _open_db opened rocksdb path db options compression=kNoCompression,max_write_buffer_number=16,min_write_buffer_number_to_merge=3,recycle_log_file_num=16
>> >2016-06-07 13:32:25.435057 7fce0cee28c0 20 bluestore(/var/lib/ceph/osd/ceph-15) _open_fm initializing freespace
>> >2016-06-07 13:32:25.435066 7fce0cee28c0 10 freelist _init_misc bytes_per_key 0x80000, key_mask 0xfffffffffff80000
>> >2016-06-07 13:32:25.435074 7fce0cee28c0 10 freelist create rounding blocks up from 0x6f9fd151e00 to 0x6f9fd180000 (0x6f9fd180 blocks)
>> >2016-06-07 13:32:25.438853 7fce0cee28c0 -1 os/bluestore/BitmapFreelistManager.cc: In function 'void BitmapFreelistManager::_xor(uint64_t, uint64_t, KeyValueDB::Transaction)' thread 7fce0cee28c0 time 2016-06-07 13:32:25.435087
>> >os/bluestore/BitmapFreelistManager.cc: 477: FAILED assert((offset & block_mask) == offset)
>> >
>> > ceph version 10.2.0-2021-g55cb608 (55cb608f63787f7969514ad0d7222da68ab84d88)
>> > 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x562bdda880a0]
>> > 2: (BitmapFreelistManager::_xor(unsigned long, unsigned long, std::shared_ptr<KeyValueDB::TransactionImpl>)+0x12ed) [0x562bdd75a96d]
>> > 3: (BitmapFreelistManager::create(unsigned long, std::shared_ptr<KeyValueDB::TransactionImpl>)+0x33f) [0x562bdd75b34f]
>> > 4: (BlueStore::_open_fm(bool)+0xcd3) [0x562bdd641683]
>> > 5: (BlueStore::mkfs()+0x8b9) [0x562bdd6839b9]
>> > 6: (OSD::mkfs(CephContext*, ObjectStore*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, uuid_d, int)+0x117) [0x562bdd3226c7]
>> > 7: (main()+0x1003) [0x562bdd2b4533]
>> > 8: (__libc_start_main()+0xf0) [0x7fce09946830]
>> > 9: (_start()+0x29) [0x562bdd3038b9]
>> > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>> >
>> >Thanks & Regards
>> >Somnath
>> >
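As a sanity check, the arithmetic behind the failed assert, using the sizes from the log above (block_mask itself is not printed in the log; the value below assumes the default 4K block size):

    // Demonstrates the alignment test the assert performs: the device
    // size from the log (0x6f9fd151e00) is only 512-byte aligned, so for
    // a 4K block_mask, (offset & block_mask) != offset.
    #include <cstdint>
    #include <cstdio>

    int main() {
      const uint64_t block_size = 4096;
      const uint64_t block_mask = ~(block_size - 1);  // 0xfffffffffffff000

      uint64_t offset = 0x6f9fd151e00;                // device size from the log
      printf("aligned: %s, trailing bytes: 0x%llx\n",
             ((offset & block_mask) == offset) ? "yes" : "no",
             (unsigned long long)(offset & (block_size - 1)));  // no, 0xe00
      return 0;
    }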
>> >
>> >-----Original Message-----
>> >From: Ramesh Chander
>> >Sent: Tuesday, June 07, 2016 11:01 PM
>> >To: Somnath Roy; Mark Nelson; Sage Weil
>> >Cc: ceph-devel
>> >Subject: RE: Anybody else hitting this panic in latest master with
>> >bluestore?
>> >
>> >Hi Somnath,
>> >
>> >I think the 4K block size is set intentionally:
>> >
>> >  // Operate as though the block size is 4 KB.  The backing file
>> >  // blksize doesn't strictly matter except that some file systems may
>> >  // require a read/modify/write if we write something smaller than
>> >  // it.
>> >  block_size = g_conf->bdev_block_size;
>> >  if (block_size != (unsigned)st.st_blksize) {
>> >    dout(1) << __func__ << " backing device/file reports st_blksize "
>> >            << st.st_blksize << ", using bdev_block_size "
>> >            << block_size << " anyway" << dendl;
>> >  }
>> >
>> >Other than more fragmentation, we should not see any issue from using
>> >a 4K block size instead of 512 bytes; at least, none that I am aware
>> >of.
>> >
>> >How do I reproduce it? I can have a look.
>> >
>> >-Ramesh
>> >
>> >> -----Original Message-----
>> >> From: Somnath Roy
>> >> Sent: Wednesday, June 08, 2016 5:04 AM
>> >> To: Somnath Roy; Mark Nelson; Sage Weil
>> >> Cc: Ramesh Chander; ceph-devel
>> >> Subject: RE: Anybody else hitting this panic in latest master with
>> >> bluestore?
>> >>
>> >> OK, I think I found out what is happening in my environment. This
>> >> drive is formatted with a 512-byte logical block size.
>> >> The bitmap allocator works with a 4K block size by default, and the
>> >> calculation breaks (?). I reformatted the device with 4K sectors and
>> >> it worked fine.
>> >> I don't think taking this logical block size as a user-input
>> >> parameter is *wise*, since the OS requires every device to advertise
>> >> its correct logical block size here:
>> >>
>> >> /sys/block/sdb/queue/logical_block_size
>> >>
>> >> The allocator needs to read the correct size from the above location.
>> >> Sage/Ramesh?
>> >>
>> >> Thanks & Regards
>> >> Somnath
>> >>
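A minimal sketch of what reading that sysfs attribute could look like (the device name is hard-coded purely for illustration; real code would derive it from the OSD's block device path):

    // Read the logical block size the kernel advertises for a device.
    #include <fstream>
    #include <iostream>

    int main() {
      std::ifstream f("/sys/block/sdb/queue/logical_block_size");
      unsigned block_size = 0;
      if (f >> block_size)
        std::cout << "logical block size: " << block_size << " bytes\n";  // e.g. 512 or 4096
      else
        std::cerr << "could not read sysfs attribute\n";
      return 0;
    }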
>> >> -----Original Message-----
>> >> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-
>> >> owner@xxxxxxxxxxxxxxx] On Behalf Of Somnath Roy
>> >> Sent: Tuesday, June 07, 2016 1:12 PM
>> >> To: Mark Nelson; Sage Weil
>> >> Cc: Ramesh Chander; ceph-devel
>> >> Subject: RE: Anybody else hitting this panic in latest master with
>> >> bluestore?
>> >>
>> >> Mark/Sage,
>> >> That problem seems to be gone. BTW, the rocksdb folder is not
>> >> cleaned by 'make clean'. I took the latest master and manually
>> >> cleaned the rocksdb folder as you suggested.
>> >> But now I am hitting the following crash on some of my drives. It
>> >> seems to be related to block alignment.
>> >>
>> >> 0> 2016-06-07 11:50:12.353375 7f5c0fe938c0 -1 os/bluestore/BitmapFreelistManager.cc: In function 'void BitmapFreelistManager::_xor(uint64_t, uint64_t, KeyValueDB::Transaction)' thread 7f5c0fe938c0 time 2016-06-07 11:50:12.349722
>> >> os/bluestore/BitmapFreelistManager.cc: 477: FAILED assert((offset & block_mask) == offset)
>> >>
>> >> ceph version 10.2.0-2021-g55cb608 (55cb608f63787f7969514ad0d7222da68ab84d88)
>> >> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x5652219dd0a0]
>> >> 2: (BitmapFreelistManager::_xor(unsigned long, unsigned long, std::shared_ptr<KeyValueDB::TransactionImpl>)+0x12ed) [0x5652216af96d]
>> >> 3: (BitmapFreelistManager::create(unsigned long, std::shared_ptr<KeyValueDB::TransactionImpl>)+0x33f) [0x5652216b034f]
>> >> 4: (BlueStore::_open_fm(bool)+0xcd3) [0x565221596683]
>> >> 5: (BlueStore::mkfs()+0x8b9) [0x5652215d89b9]
>> >> 6: (OSD::mkfs(CephContext*, ObjectStore*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, uuid_d, int)+0x117) [0x5652212776c7]
>> >> 7: (main()+0x1003) [0x565221209533]
>> >> 8: (__libc_start_main()+0xf0) [0x7f5c0c8f7830]
>> >> 9: (_start()+0x29) [0x5652212588b9]
>> >> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>> >>
>> >> Here are my disk partitions; osd.15 on /dev/sdi crashed:
>> >>
>> >> sdi       8:128  0    7T  0 disk
>> >> ├─sdi1    8:129  0   10G  0 part /var/lib/ceph/osd/ceph-15
>> >> └─sdi2    8:130  0    7T  0 part
>> >> nvme0n1 259:0    0 15.4G  0 disk
>> >>
>> >> root@emsnode11:~/ceph-master/src# fdisk /dev/sdi
>> >>
>> >> Welcome to fdisk (util-linux 2.27.1).
>> >> Changes will remain in memory only, until you decide to write them.
>> >> Be careful before using the write command.
>> >>
>> >> Command (m for help): p
>> >> Disk /dev/sdi: 7 TiB, 7681501126656 bytes, 15002931888 sectors
>> >> Units: sectors of 1 * 512 = 512 bytes
>> >> Sector size (logical/physical): 512 bytes / 16384 bytes
>> >> I/O size (minimum/optimal): 16384 bytes / 16384 bytes
>> >> Disklabel type: gpt
>> >> Disk identifier: 4A3182B9-23EA-441A-A113-FE904E81BF3E
>> >>
>> >> Device         Start         End     Sectors Size Type
>> >> /dev/sdi1       2048    20973567    20971520  10G Linux filesystem
>> >> /dev/sdi2   20973568 15002931854 14981958287   7T Linux filesystem
>> >>
>> >> The partitions seem to be aligned properly; what alignment is the
>> >> bitmap allocator looking for (Ramesh?)?
>> >> I will debug further and update.
>> >>
>> >> Thanks & Regards
>> >> Somnath
>> >>
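One thing worth noting from the fdisk output: the partition starts (2048 and 20973568 sectors, both multiples of 8) are 4K-aligned, but the *size* of /dev/sdi2 is not. 14981958287 sectors times 512 bytes works out to exactly the 0x6f9fd151e00 offset seen in the failing assert. A quick check of that arithmetic:

    // /dev/sdi2: 14981958287 sectors * 512 bytes = 0x6f9fd151e00 bytes.
    // 14981958287 is not a multiple of 8 sectors, so 0xe00 bytes
    // (7 sectors) are left over past the last whole 4K block.
    #include <cstdint>
    #include <cstdio>

    int main() {
      uint64_t sectors = 14981958287ULL;   // /dev/sdi2 size in 512-byte sectors
      uint64_t bytes   = sectors * 512;    // 0x6f9fd151e00
      printf("size = 0x%llx bytes, remainder past 4K = %llu bytes\n",
             (unsigned long long)bytes,
             (unsigned long long)(bytes % 4096));  // remainder 3584
      return 0;
    }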
>> >> -----Original Message-----
>> >> From: Somnath Roy
>> >> Sent: Tuesday, June 07, 2016 11:06 AM
>> >> To: 'Mark Nelson'; Sage Weil
>> >> Cc: Ramesh Chander; ceph-devel
>> >> Subject: RE: Anybody else hitting this panic in latest master with
>> >> bluestore?
>> >>
>> >> I will try now and let you know.
>> >>
>> >> Thanks & Regards
>> >> Somnath
>> >>
>> >> -----Original Message-----
>> >> From: Mark Nelson [mailto:mnelson@xxxxxxxxxx]
>> >> Sent: Tuesday, June 07, 2016 10:57 AM
>> >> To: Somnath Roy; Sage Weil
>> >> Cc: Ramesh Chander; ceph-devel
>> >> Subject: Re: Anybody else hitting this panic in latest master with
>> >> bluestore?
>> >>
>> >> Hi Somnath,
>> >>
>> >> Did Sage's suggestion fix it for you? In my tests rocksdb wasn't
>> >> building properly after an upstream commit to detect when jemalloc
>> >> isn't present:
>> >>
>> >> https://github.com/facebook/rocksdb/commit/0850bc514737a64dc8ca13de8510fcad4756616a
>> >>
>> >> I've submitted a fix that is now in master. If you clean the rocksdb
>> >> folder and try again with current master, I believe it should work
>> >> for you.
>> >>
>> >> Thanks,
>> >> Mark
>> >>
>> >> On 06/07/2016 09:23 AM, Somnath Roy wrote:
>> >> > Sage,
>> >> > I did a global 'make clean' before the build; isn't that
>> >> > sufficient? Do I still need to go to the rocksdb folder and clean?
>> >> >
>> >> > -----Original Message-----
>> >> > From: Sage Weil [mailto:sage@xxxxxxxxxxxx]
>> >> > Sent: Tuesday, June 07, 2016 6:06 AM
>> >> > To: Mark Nelson
>> >> > Cc: Somnath Roy; Ramesh Chander; ceph-devel
>> >> > Subject: Re: Anybody else hitting this panic in latest master with
>> >> > bluestore?
>> >> >
>> >> > On Tue, 7 Jun 2016, Mark Nelson wrote:
>> >> >> I believe this is due to the rocksdb submodule update in PR #9466.
>> >> >> I'm working on tracking down the commit in rocksdb that's causing it.
>> >> >
>> >> > Is it possible that the problem is that your build *didn't* update
>> >> > rocksdb?
>> >> >
>> >> > The ceph makefile isn't smart enough to notice changes in the
>> >> > rocksdb/ dir and rebuild. You have to 'cd rocksdb ; make clean ;
>> >> > cd ..' after the submodule updates to get a fresh build.
>> >> >
>> >> > Maybe you didn't do that, and some of the ceph code was built using
>> >> > the new headers and data structures that don't match the previously
>> >> > compiled rocksdb code?
>> >> >
>> >> > sage