Re: Terrible RBD performance with Jewel

Mark Nelson <mnelson@xxxxxxxxxx> · Wed, 13 Jul 2016 21:33:44 -0500

As Somnath mentioned, you've got a lot of tunables set there.  Are you 
sure those are all doing what you think they are doing?

FWIW, the xfs -n size=64k option is probably not a good idea. 
Unfortunately it can't be changed without making a new filesystem.

See:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-February/007645.html

Typically that seems to manifest as suicide timeouts on the OSDs, though.
You'd also see kernel log messages that look like:

kernel: XFS: possible memory allocation deadlock in kmem_alloc (mode:0x8250)
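
A quick way to check for that message is to filter the kernel log for it. The sketch below is a minimal Python example, assuming the log text has already been captured (for example from dmesg). The function name and the second sample line are made up for illustration; only the first sample line comes from this thread.

```python
import re

# Matches the XFS warning quoted above. The mode value can differ between
# systems, so it is deliberately not part of the pattern.
XFS_DEADLOCK = re.compile(r"XFS: possible memory allocation deadlock in kmem_alloc")


def find_xfs_deadlock_lines(log_text):
    """Return the kernel log lines that contain the XFS allocation warning."""
    return [line for line in log_text.splitlines() if XFS_DEADLOCK.search(line)]


if __name__ == "__main__":
    # First line: the message from this thread. Second line: an unrelated,
    # invented log line to show that it is filtered out.
    sample = (
        "kernel: XFS: possible memory allocation deadlock in kmem_alloc (mode:0x8250)\n"
        "kernel: usb 1-1: new high-speed USB device number 2 using xhci_hcd\n"
    )
    for hit in find_xfs_deadlock_lines(sample):
        print(hit)
```

If this turns up hits on the OSD hosts around the time IOPS drop to zero, the directory block size becomes a much more likely suspect.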

Mark

On 07/13/2016 08:39 PM, Garg, Pankaj wrote:
I agree, but I’m dealing with something else out here with this setup.

I just ran a test, and within 3 seconds my IOPS went to 0 and stayed
there for 90 seconds. Then it started again, and within seconds went back to 0.

This doesn’t seem normal at all. Here is my ceph.conf:

[global]
fsid = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
public_network = xxxxxxxxxxxxxxxxxxxxxxxx
cluster_network = xxxxxxxxxxxxxxxxxxxxxxxxxxxx
mon_initial_members = ceph1
mon_host = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
osd_mkfs_options = -f -i size=2048 -n size=64k
osd_mount_options_xfs = inode64,noatime,logbsize=256k
filestore_merge_threshold = 40
filestore_split_multiple = 8
osd_op_threads = 12
osd_pool_default_size = 2
mon_pg_warn_max_object_skew = 100000
mon_pg_warn_min_per_osd = 0
mon_pg_warn_max_per_osd = 32768
filestore_op_threads = 6

[osd]
osd_enable_op_tracker = false
osd_op_num_shards = 2
filestore_wbthrottle_enable = false
filestore_max_sync_interval = 1
filestore_odsync_write = true
filestore_max_inline_xattr_size = 254
filestore_max_inline_xattrs = 6
filestore_queue_committing_max_bytes = 1048576000
filestore_queue_committing_max_ops = 5000
filestore_queue_max_bytes = 1048576000
filestore_queue_max_ops = 500
journal_max_write_bytes = 1048576000
journal_max_write_entries = 1000
journal_queue_max_bytes = 1048576000
journal_queue_max_ops = 3000
filestore_fd_cache_shards = 32
filestore_fd_cache_size = 64
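
For a config with this many tunables, a small script can help see what is actually set per section and whether the -n size=64k mkfs option is present. This is only a sketch using Python's generic INI parser: it reads option names exactly as spelled in the posted file, it does not replicate Ceph's own config parsing (which, for instance, is more lenient about spaces versus underscores in names), and the function names are invented for the example.

```python
import configparser


def load_ceph_conf(text):
    """Parse ceph.conf-style text with Python's generic INI parser."""
    # interpolation=None keeps any '%' characters in values literal.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return parser


def uses_64k_dir_blocks(parser):
    """True if the mkfs options, as spelled in the posted config, include -n size=64k."""
    mkfs = parser.get("global", "osd_mkfs_options", fallback="")
    return "-n size=64k" in mkfs


def count_settings(parser):
    """Number of options set in each section."""
    return {section: len(parser[section]) for section in parser.sections()}


if __name__ == "__main__":
    # Short excerpt of the config posted above.
    excerpt = """
[global]
osd_mkfs_options = -f -i size=2048 -n size=64k
osd_mount_options_xfs = inode64,noatime,logbsize=256k

[osd]
filestore_max_sync_interval = 1
filestore_odsync_write = true
"""
    conf = load_ceph_conf(excerpt)
    print("64k directory blocks:", uses_64k_dir_blocks(conf))
    print("settings per section:", count_settings(conf))
```

Note that this only tells you what the file asks for. Whether a running OSD picked up a given value is a separate question, and the mkfs options only matter at the time the filesystem was created.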

*From:*Somnath Roy [mailto:Somnath.Roy@xxxxxxxxxxx]
*Sent:* Wednesday, July 13, 2016 6:06 PM
*To:* Garg, Pankaj; ceph-users@xxxxxxxxxxxxxx
*Subject:* RE: Terrible RBD performance with Jewel

You should do that first to get stable performance out of filestore.

A 1M sequential write over the entire image should be sufficient to precondition it.
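
One way to run such a preconditioning pass is a sequential 1M write with fio's rbd engine, once per image. The sketch below only builds the command lines and does not run them. The pool name, image names, client name, and queue depth are placeholders, not values from this thread, and it is worth confirming that the job really covers the whole image before trusting the results.

```python
import shlex


def precondition_cmd(pool, image, client="admin", iodepth=32):
    """Build an fio command for a sequential 1M write over one RBD image.

    pool, image, client and iodepth are placeholders; adjust to the cluster.
    """
    return [
        "fio",
        f"--name=precondition-{image}",
        "--ioengine=rbd",           # fio's librbd engine, as used in the test
        f"--clientname={client}",   # cephx user without the "client." prefix
        f"--pool={pool}",
        f"--rbdname={image}",
        "--rw=write",               # sequential write
        "--bs=1M",                  # block size suggested above
        f"--iodepth={iodepth}",
    ]


if __name__ == "__main__":
    # Eight images, matching the test setup; the names are hypothetical.
    for i in range(8):
        print(shlex.join(precondition_cmd("rbd", f"image{i}")))
```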

*From:*Garg, Pankaj [mailto:Pankaj.Garg@xxxxxxxxxx]
*Sent:* Wednesday, July 13, 2016 6:04 PM
*To:* Somnath Roy; ceph-users@xxxxxxxxxxxxxx
*Subject:* RE: Terrible RBD performance with Jewel

No I have not.

*From:*Somnath Roy [mailto:Somnath.Roy@xxxxxxxxxxx]
*Sent:* Wednesday, July 13, 2016 6:00 PM
*To:* Garg, Pankaj; ceph-users@xxxxxxxxxxxxxx
*Subject:* RE: Terrible RBD performance with Jewel

In fact, I was wrong; I missed that you are running 12 OSDs
(assuming one OSD per SSD). In that case, it will take ~250 seconds to
fill up the journal.
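
The journal fill-time estimates in this thread can be reproduced roughly with back-of-the-envelope arithmetic. The sketch below assumes 15K IOPS (the upper end of what the original post reports), 4K writes, 3X replication, a 5 GiB journal per OSD, writes spread evenly over the journals, and nothing being flushed from the journal in the meantime. None of these inputs were confirmed in the thread, so treat the numbers as order-of-magnitude only.

```python
def journal_fill_seconds(journal_bytes, iops, block_bytes, replicas, num_journals=1):
    """Seconds until the journals fill, assuming nothing is flushed meanwhile."""
    # Every client write lands in `replicas` journals across the cluster.
    total_write_rate = iops * block_bytes * replicas  # bytes per second
    # Assume the load is spread evenly over all journals.
    per_journal_rate = total_write_rate / num_journals
    return journal_bytes / per_journal_rate


if __name__ == "__main__":
    five_gib = 5 * 1024**3
    single = journal_fill_seconds(five_gib, 15_000, 4096, 3)
    spread = journal_fill_seconds(five_gib, 15_000, 4096, 3, num_journals=12)
    print(f"one journal: {single:.0f} s, twelve journals: {spread:.0f} s")
```

With these inputs the result is about 29 seconds for a single journal and about 350 seconds across twelve. That is in the same range as the ~30 and ~250 second figures quoted in the thread; the exact inputs behind those figures were not stated, so the gap on the second number is expected.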

Have you preconditioned the entire image with a bigger block size, say 1M,
before doing any real test?

*From:*Garg, Pankaj [mailto:Pankaj.Garg@xxxxxxxxxx]
*Sent:* Wednesday, July 13, 2016 5:55 PM
*To:* Somnath Roy; ceph-users@xxxxxxxxxxxxxx
*Subject:* RE: Terrible RBD performance with Jewel

Thanks Somnath. I will try all these, but I think there is something
else going on too.

Firstly, my test sometimes reaches 0 IOPS within 10 seconds.

Secondly, when I’m at 0 IOPS, I see NO disk activity in iostat and no
CPU activity either. This part is strange.

Thanks

Pankaj

*From:*Somnath Roy [mailto:Somnath.Roy@xxxxxxxxxxx]
*Sent:* Wednesday, July 13, 2016 5:49 PM
*To:* Somnath Roy; Garg, Pankaj; ceph-users@xxxxxxxxxxxxxx
*Subject:* RE: Terrible RBD performance with Jewel

Also increase the following:

filestore_op_threads

*From:*ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] *On Behalf
Of *Somnath Roy
*Sent:* Wednesday, July 13, 2016 5:47 PM
*To:* Garg, Pankaj; ceph-users@xxxxxxxxxxxxxx
*Subject:* Re:  Terrible RBD performance with Jewel

Pankaj,

Could be related to the new throttle parameters introduced in Jewel. By
default these throttles are off; you need to tune them according to your
setup.

What are your journal size and fio block size?

If the journal is the default 5GB, then at the rate you mentioned (assuming
4K random writes) and with 3X replication, it can fill up your journal and
stall IO within ~30 seconds or so.

If you think this is what is happening in your system, you need to turn
this throttle on (see
https://github.com/ceph/ceph/blob/jewel/src/doc/dynamic-throttle.txt )
and also lower filestore_max_sync_interval to ~1 (or even
lower). Since you are testing on SSDs, I would also recommend turning the
following parameter on for stable performance.

filestore_odsync_write = true

Thanks & Regards

Somnath

*From:*ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] *On Behalf
Of *Garg, Pankaj
*Sent:* Wednesday, July 13, 2016 4:57 PM
*To:* ceph-users@xxxxxxxxxxxxxx
*Subject:*  Terrible RBD performance with Jewel

Hi,

I just installed Jewel on a small cluster of 3 machines with 4 SSDs
each. I created 8 RBD images, and use a single client with 8 threads
to do random writes (using FIO with the RBD engine) on the images (1 thread
per image).

The cluster has 3X replication and 10G cluster and client networks.

FIO prints the aggregate IOPS every second for the cluster. Before
Jewel, I got roughly 10K IOPS. It went up and down, but still kept going.

Now I see IOPS that go to 13-15K, but then it drops, and eventually
drops to ZERO for several seconds, and then starts back up again.
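
To put numbers on stalls like these, the per-second IOPS values can be scanned for runs of zeros. The sketch below is a minimal Python example, assuming the per-second values have already been extracted from the fio output into a list. The function name, the threshold, and the sample series are invented for illustration and are not measurements from this cluster.

```python
def find_stalls(iops_per_second, min_seconds=3):
    """Return (start_index, length) for each run of zero-IOPS samples
    that lasts at least min_seconds."""
    stalls = []
    run_start = None
    for i, value in enumerate(iops_per_second):
        if value == 0:
            if run_start is None:
                run_start = i  # a new run of zeros begins here
        elif run_start is not None:
            if i - run_start >= min_seconds:
                stalls.append((run_start, i - run_start))
            run_start = None
    # Handle a run of zeros that continues to the end of the series.
    if run_start is not None and len(iops_per_second) - run_start >= min_seconds:
        stalls.append((run_start, len(iops_per_second) - run_start))
    return stalls


if __name__ == "__main__":
    # Made-up series: a four-second stall, then a one-second dip that is
    # too short to count with the default threshold.
    samples = [14000, 13500, 9000, 0, 0, 0, 0, 12000, 0, 13000]
    print(find_stalls(samples))  # [(3, 4)]
```

Lining up the stall start times with OSD and kernel logs on the storage nodes makes it easier to tell a journal or throttle stall apart from something like a filesystem hang.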

What am I missing?

Thanks

Pankaj

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
