Re: mkfs.ext4 hang on RBD volume

We found the issue. It was simply the "max open files" limit for the qemu
user, which was being reached. When we run a lot of mkfs operations in
series, many sockets are opened to the ceph backend and qemu reaches its
max open files limit. So we increased max open files in qemu.conf and the
problem disappeared.
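
For reference, a minimal sketch of the change, assuming libvirt's
/etc/libvirt/qemu.conf is the file in question (the value is
illustrative, sized to the number of OSD sessions):

  # /etc/libvirt/qemu.conf
  # raise the per-qemu-process open file limit; librbd keeps one
  # socket per OSD session, which adds up with many volumes attached
  max_files = 32768

followed by a libvirtd restart. The limit actually applied to a running
qemu process can be checked with:

  grep "open files" /proc/<qemu-pid>/limits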

2017-01-16 19:19 GMT+01:00 Jason Dillaman <jdillama@xxxxxxxxxx>:
Can you ensure that you have the "admin socket" configured for your
librbd-backed VM so that you can do the following when you hit that
condition:

ceph --admin-daemon <path to librbd asok file> objecter_requests

That will dump out any hung IO requests between librbd and the OSDs. I
would also check your librbd logs to see if you are seeing an error
like "heartbeat_map is_healthy 'tp_librbd thread tp_librbd' had timed
out after 60" being logged periodically, which would indicate a thread
deadlock within librbd.
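
For example, a minimal client section in ceph.conf on the hypervisor
(paths are illustrative) might look like:

  [client]
      admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok
      log file = /var/log/ceph/qemu-guest-$pid.log

after which one asok file per qemu/librbd instance appears under
/var/run/ceph/ (the qemu user needs write access to that directory) and
can be passed to the ceph --admin-daemon command above.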

On Mon, Jan 16, 2017 at 1:12 PM, Vincent Godin <vince.mlist@xxxxxxxxx> wrote:
> We are using librbd on a host with CentOS 7.2 via virtio-blk. This server
> hosts the VMs on which we are doing our tests. But we see exactly the same
> behaviour as in #9071. We tried to follow the thread back to bug #8818 but
> couldn't reproduce the issue with a lot of dd runs. Each time we try with
> mkfs.ext4, there is always one process out of the 16 (we have 16 volumes)
> which hangs!
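>
> A sketch of the serial test described above (device names are
> illustrative; 16 data volumes attached after the root disk):
>
>   for dev in /dev/vd{b..q}; do
>       mkfs.ext4 $dev
>   done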
>
> 2017-01-16 17:45 GMT+01:00 Jason Dillaman <jdillama@xxxxxxxxxx>:
>>
>> Are you using krbd directly within the VM or librbd via
>> virtio-blk/scsi? Ticket #9071 is against krbd.
>>
>> On Mon, Jan 16, 2017 at 11:34 AM, Vincent Godin <vince.mlist@xxxxxxxxx>
>> wrote:
>> > In fact, we can reproduce the problem from VMs with CentOS 6.7, 7.2 or
>> > 7.3. We can reproduce it every time with this config: one VM (here on
>> > CentOS 6.7) with 16 RBD volumes of 100GB attached. When we launch
>> > mkfs.ext4 serially on each of these volumes, we always encounter the
>> > problem on one of them. We tried with the option -E nodiscard but we
>> > still have the problem. It looks exactly like bug #9071, with the same
>> > dmesg message:
>> >
>> >  vdh: unknown partition table
>> > EXT4-fs (vdf): mounted filesystem with ordered data mode. Opts:
>> > EXT4-fs (vdg): mounted filesystem with ordered data mode. Opts:
>> > INFO: task flush-252:112:2903 blocked for more than 120 seconds.
>> >       Not tainted 2.6.32-573.18.1.el6.x86_64 #1
>> > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
>> > message.
>> > flush-252:112 D 0000000000000000     0  2903      2 0x00000080
>> >  ffff8808328bf6e0 0000000000000046 ffff8808ffffffff 000000003d697f73
>> >  0000000000000000 ffff88082fbd7ec0 0000000000021454 ffffffffa78356ec
>> >  000000002b9db4fe ffffffff81aa6700 ffff88082efc9ad8 ffff8808328bffd8
>> > Call Trace:
>> >  [<ffffffff81539673>] io_schedule+0x73/0xc0
>> >  [<ffffffff81276598>] get_request_wait+0x108/0x1d0
>> >  [<ffffffff810a1460>] ? autoremove_wake_function+0x0/0x40
>> >  [<ffffffff812766f9>] blk_queue_bio+0x99/0x610
>> >  [<ffffffff81274ec0>] generic_make_request+0x240/0x5a0
>> >  [<ffffffff81129cf5>] ? mempool_alloc_slab+0x15/0x20
>> >  [<ffffffff81129e93>] ? mempool_alloc+0x63/0x140
>> >  [<ffffffff81275290>] submit_bio+0x70/0x120
>> >  [<ffffffff811c7dcd>] submit_bh+0x11d/0x1f0
>> >  [<ffffffff811ca588>] __block_write_full_page+0x1c8/0x330
>> >  [<ffffffff811c9550>] ? end_buffer_async_write+0x0/0x190
>> >  [<ffffffff811ce450>] ? blkdev_get_block+0x0/0x20
>> >  [<ffffffff811ce450>] ? blkdev_get_block+0x0/0x20
>> >  [<ffffffff811ca7d0>] block_write_full_page_endio+0xe0/0x120
>> >  [<ffffffff81126ff0>] ? find_get_pages_tag+0x40/0x130
>> >  [<ffffffff811ca825>] block_write_full_page+0x15/0x20
>> >  [<ffffffff811cf5e8>] blkdev_writepage+0x18/0x20
>> >  [<ffffffff8113b387>] __writepage+0x17/0x40
>> >  [<ffffffff8113c64d>] write_cache_pages+0x1fd/0x4c0
>> >  [<ffffffff8113b370>] ? __writepage+0x0/0x40
>> >  [<ffffffff8113c934>] generic_writepages+0x24/0x30
>> >  [<ffffffff8113c961>] do_writepages+0x21/0x40
>> >  [<ffffffff811bf01d>] writeback_single_inode+0xdd/0x290
>> >  [<ffffffff811bf41d>] writeback_sb_inodes+0xbd/0x170
>> >  [<ffffffff811bf57b>] writeback_inodes_wb+0xab/0x1b0
>> >  [<ffffffff811bf973>] wb_writeback+0x2f3/0x410
>> >  [<ffffffff811bfb4b>] wb_do_writeback+0xbb/0x240
>> >  [<ffffffff811bfd33>] bdi_writeback_task+0x63/0x1b0
>> >  [<ffffffff810a12e7>] ? bit_waitqueue+0x17/0xd0
>> >  [<ffffffff8114b760>] ? bdi_start_fn+0x0/0x100
>> >  [<ffffffff8114b7e6>] bdi_start_fn+0x86/0x100
>> >  [<ffffffff8114b760>] ? bdi_start_fn+0x0/0x100
>> >  [<ffffffff810a0fce>] kthread+0x9e/0xc0
>> >  [<ffffffff8100c28a>] child_rip+0xa/0x20
>> >  [<ffffffff810a0f30>] ? kthread+0x0/0xc0
>> >  [<ffffffff8100c280>] ? child_rip+0x0/0x20
>> > INFO: task mkfs.ext4:3040 blocked for more than 120 seconds.
>> >       Not tainted 2.6.32-573.18.1.el6.x86_64 #1
>> > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
>> > message.
>> > mkfs.ext4     D 0000000000000002     0  3040   3038 0x00000080
>> >  ffff88075e79f4d8 0000000000000082 ffff8808ffffffff 000000003d697f73
>> >  0000000000000000 ffff88082fb73130 0000000000021472 ffffffffa78356ec
>> >  000000002b9db4fe ffffffff81aa6700 ffff88082e787068 ffff88075e79ffd8
>> > Call Trace:
>> >  [<ffffffff81539673>] io_schedule+0x73/0xc0
>> >  [<ffffffff81276598>] get_request_wait+0x108/0x1d0
>> >  [<ffffffff810a1460>] ? autoremove_wake_function+0x0/0x40
>> >  [<ffffffff812766f9>] blk_queue_bio+0x99/0x610
>> >
>> > Ceph version is Jewel 10.2.3.
>> > Ceph clients, mons and servers run kernel 3.10.0-327.36.3.el7.x86_64
>> > on CentOS 7.2.
>> >
>> > 2017-01-13 20:07 GMT+01:00 Jason Dillaman <jdillama@xxxxxxxxxx>:
>> >>
>> >> You might be hitting this issue [1] where mkfs is issuing lots of
>> >> discard operations. If you get a chance, can you retest w/ the "-E
>> >> nodiscard" option?
>> >>
>> >> Thanks
>> >>
>> >> [1] http://tracker.ceph.com/issues/16689
>> >>
>> >> On Fri, Jan 13, 2017 at 12:57 PM, Vincent Godin <vince.mlist@xxxxxxxxx>
>> >> wrote:
>> >> > Thanks Jason,
>> >> >
>> >> > We observed a curious behavior: we have some VMs on CentOS 6.x hosted
>> >> > on our Openstack compute nodes, which run CentOS 7.2. If we try to
>> >> > run mkfs.ext4 on a volume created with the Jewel default features
>> >> > (61), the VM hangs and we have to reboot it to get a responsive
>> >> > system. This is strange because the libvirt process is launched from
>> >> > the host, which is on CentOS 7.2. If we disable some features, the
>> >> > mkfs.ext4 succeeds. If the VM is on CentOS 7.x, there is no problem
>> >> > at all. Maybe the kernel of CentOS 6.x is unable to use the
>> >> > exclusive-lock feature?
>> >> > I think we will have to stay with a very conservative
>> >> > rbd_default_features such as 1, because we don't use striping and the
>> >> > other features are not compatible with our old CentOS 6.x VMs ..
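>> >> >
>> >> > A sketch of how we would strip an image back down to layering only
>> >> > (image spec is illustrative; dependent features such as object-map
>> >> > and fast-diff have to be disabled before exclusive-lock):
>> >> >
>> >> >   rbd feature disable volumes/myvolume deep-flatten fast-diff \
>> >> >       object-map exclusive-lock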
>> >> >
>> >> > A last question: is the rbd object-map rebuild a long process? In
>> >> > other words, does it take the same time as a delete (which reads all
>> >> > the possible blocks of an image without the object-map feature)? Is
>> >> > it a good idea to enable the object-map feature on an already used
>> >> > image? (I know that during the rebuild process, the VM will have to
>> >> > be stopped.)
>> >> >
>> >> >
>> >> >
>> >> > 2017-01-13 15:09 GMT+01:00 Jason Dillaman <jdillama@xxxxxxxxxx>:
>> >> >>
>> >> >> On Fri, Jan 13, 2017 at 5:11 AM, Vincent Godin
>> >> >> <vince.mlist@xxxxxxxxx>
>> >> >> wrote:
>> >> >> > We are using a production cluster which started on Firefly, then
>> >> >> > moved to Giant, Hammer and finally Jewel. So our images have
>> >> >> > different features corresponding to the value of
>> >> >> > "rbd_default_features" in the version under which they were
>> >> >> > created. We currently have three sets of features activated,
>> >> >> > images with:
>> >> >> > - layering ~ 1
>> >> >> > - layering, striping ~ 3
>> >> >> > - layering, exclusive-lock, object-map, fast-diff, deep-flatten ~ 61
>> >> >> >
>> >> >> > 1) Is it a good idea to try to give all images the same features?
>> >> >>
>> >> >> It isn't needed.
>> >> >>
>> >> >> > 2) Is it possible to disable the striping feature on an already
>> >> >> > created image (we never specified any stripe-unit or stripe-count)?
>> >> >>
>> >> >> Negative -- striping cannot be dynamically disabled because it would
>> >> >> result in potentially altering the structure and placement of the
>> >> >> data
>> >> >> within the image. If your stripe-unit is the object size and the
>> >> >> stripe count is 1, that's a special case where the flag is
>> >> >> essentially
>> >> >> ignored.
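>> >> >>
>> >> >> A quick way to check (image spec is illustrative) is to compare the
>> >> >> "stripe unit" and "stripe count" values reported by:
>> >> >>
>> >> >>   rbd info mypool/myimage
>> >> >>
>> >> >> against the object size implied by the "order" field (order 22
>> >> >> means 4 MiB objects).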
>> >> >>
>> >> >> > 3) What is the behaviour of an already created image on which we
>> >> >> > activate the object-map feature? Will a process try to rebuild an
>> >> >> > index of used blocks? If not, and if we later delete the image,
>> >> >> > will ceph try to remove all the blocks or only the blocks
>> >> >> > referenced by the object-map index?
>> >> >>
>> >> >> You would need to run "rbd object-map rebuild <image-spec>" to
>> >> >> rebuild
>> >> >> the object map. Until it is rebuilt, it will be considered invalid
>> >> >> and
>> >> >> won't be used for reference. You can determine the object map state
>> >> >> by
>> >> >> running "rbd info <image-spec>"
>> >> >>
>> >> >> > 4) We are on Jewel but with tunables set to hammer (CentOS 7.2).
>> >> >> > What are the best default features to set in that case? (We use
>> >> >> > Ceph under Openstack for glance, nova and cinder.)
>> >> >>
>> >> >> We feel like the current defaults are a good mix of features for
>> >> >> everyday use of non-shared images or non-krbd images. Most
>> >> >> importantly, all the default features can be dynamically disabled if
>> >> >> your needs for the image change.
>> >> >>
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > _______________________________________________
>> >> >> > ceph-users mailing list
>> >> >> > ceph-users@xxxxxxxxxxxxxx
>> >> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >> >> >
>> >> >>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> Jason
>> >> >
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Jason
>> >
>> >
>>
>>
>>
>> --
>> Jason
>
>



--
Jason

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
