I've seen something similar if you are using RBD caching: I found that if you can fill the RBD cache faster than it can flush, you get these stalls. I increased the size of the cache and also the flush threshold, and this solved the problem. I didn't spend much time looking into it, but it seemed like a smaller cache didn't have enough working space to accept new writes whilst the older ones were being flushed.

> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Mark Nelson
> Sent: 14 July 2016 03:34
> To: ceph-users@xxxxxxxxxxxxxx
> Subject: Re: Terrible RBD performance with Jewel
>
> As Somnath mentioned, you've got a lot of tunables set there. Are you sure those are all doing what you think they are doing?
>
> FWIW, the xfs -n size=64k option is probably not a good idea.
> Unfortunately it can't be changed without making a new filesystem.
>
> See:
>
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-February/007645.html
>
> Typically that seems to manifest as suicide timeouts on the OSDs, though.
> You'd also see kernel log messages that look like:
>
> kernel: XFS: possible memory allocation deadlock in kmem_alloc (mode:0x8250)
>
> Mark
>
> On 07/13/2016 08:39 PM, Garg, Pankaj wrote:
> > I agree, but I'm dealing with something else here with this setup.
> >
> > I just ran a test, and within 3 seconds my IOPS went to 0 and stayed
> > there for 90 seconds... then it started again and within seconds went back to 0.
> >
> > This doesn't seem normal at all. Here is my ceph.conf:
> >
> > [global]
> > fsid = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
> > public_network = xxxxxxxxxxxxxxxxxxxxxxxx
> > cluster_network = xxxxxxxxxxxxxxxxxxxxxxxxxxxx
> > mon_initial_members = ceph1
> > mon_host = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
> > auth_cluster_required = cephx
> > auth_service_required = cephx
> > auth_client_required = cephx
> > filestore_xattr_use_omap = true
> > osd_mkfs_options = -f -i size=2048 -n size=64k
> > osd_mount_options_xfs = inode64,noatime,logbsize=256k
> > filestore_merge_threshold = 40
> > filestore_split_multiple = 8
> > osd_op_threads = 12
> > osd_pool_default_size = 2
> > mon_pg_warn_max_object_skew = 100000
> > mon_pg_warn_min_per_osd = 0
> > mon_pg_warn_max_per_osd = 32768
> > filestore_op_threads = 6
> >
> > [osd]
> > osd_enable_op_tracker = false
> > osd_op_num_shards = 2
> > filestore_wbthrottle_enable = false
> > filestore_max_sync_interval = 1
> > filestore_odsync_write = true
> > filestore_max_inline_xattr_size = 254
> > filestore_max_inline_xattrs = 6
> > filestore_queue_committing_max_bytes = 1048576000
> > filestore_queue_committing_max_ops = 5000
> > filestore_queue_max_bytes = 1048576000
> > filestore_queue_max_ops = 500
> > journal_max_write_bytes = 1048576000
> > journal_max_write_entries = 1000
> > journal_queue_max_bytes = 1048576000
> > journal_queue_max_ops = 3000
> > filestore_fd_cache_shards = 32
> > filestore_fd_cache_size = 64
> >
> > *From:* Somnath Roy [mailto:Somnath.Roy@xxxxxxxxxxx]
> > *Sent:* Wednesday, July 13, 2016 6:06 PM
> > *To:* Garg, Pankaj; ceph-users@xxxxxxxxxxxxxx
> > *Subject:* RE: Terrible RBD performance with Jewel
> >
> > You should do that first to get stable performance out of filestore.
> > A 1M sequential write over the entire image should be sufficient to precondition it.
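For reference, a 1M sequential preconditioning pass with fio's RBD engine could look something like the sketch below; the client, pool and image names are placeholders rather than values taken from this thread:

    fio --name=precondition --ioengine=rbd --clientname=admin --pool=rbd \
        --rbdname=image1 --rw=write --bs=1M --iodepth=32

Running one pass like this per image before the random-write tests means the RADOS objects backing each image already exist, so the first measured run isn't also paying the allocation cost.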
> > *From:* Garg, Pankaj [mailto:Pankaj.Garg@xxxxxxxxxx]
> > *Sent:* Wednesday, July 13, 2016 6:04 PM
> > *To:* Somnath Roy; ceph-users@xxxxxxxxxxxxxx <mailto:ceph-users@xxxxxxxxxxxxxx>
> > *Subject:* RE: Terrible RBD performance with Jewel
> >
> > No, I have not.
> >
> > *From:* Somnath Roy [mailto:Somnath.Roy@xxxxxxxxxxx]
> > *Sent:* Wednesday, July 13, 2016 6:00 PM
> > *To:* Garg, Pankaj; ceph-users@xxxxxxxxxxxxxx <mailto:ceph-users@xxxxxxxxxxxxxx>
> > *Subject:* RE: Terrible RBD performance with Jewel
> >
> > In fact, I was wrong; I missed that you are running with 12 OSDs
> > (considering one OSD per SSD). In that case, it will take ~250 seconds
> > to fill up the journal.
> >
> > Have you preconditioned the entire image with a bigger block size, say 1M,
> > before doing any real test?
> >
> > *From:* Garg, Pankaj [mailto:Pankaj.Garg@xxxxxxxxxx]
> > *Sent:* Wednesday, July 13, 2016 5:55 PM
> > *To:* Somnath Roy; ceph-users@xxxxxxxxxxxxxx <mailto:ceph-users@xxxxxxxxxxxxxx>
> > *Subject:* RE: Terrible RBD performance with Jewel
> >
> > Thanks Somnath. I will try all these, but I think there is something
> > else going on too.
> >
> > Firstly, my test sometimes reaches 0 IOPS within 10 seconds.
> >
> > Secondly, when I'm at 0 IOPS, I see NO disk activity in iostat and no
> > CPU activity either. This part is strange.
> >
> > Thanks
> >
> > Pankaj
> >
> > *From:* Somnath Roy [mailto:Somnath.Roy@xxxxxxxxxxx]
> > *Sent:* Wednesday, July 13, 2016 5:49 PM
> > *To:* Somnath Roy; Garg, Pankaj; ceph-users@xxxxxxxxxxxxxx <mailto:ceph-users@xxxxxxxxxxxxxx>
> > *Subject:* RE: Terrible RBD performance with Jewel
> >
> > Also increase the following:
> >
> > filestore_op_threads
> >
> > *From:* ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] *On Behalf Of* Somnath Roy
> > *Sent:* Wednesday, July 13, 2016 5:47 PM
> > *To:* Garg, Pankaj; ceph-users@xxxxxxxxxxxxxx <mailto:ceph-users@xxxxxxxxxxxxxx>
> > *Subject:* Re: Terrible RBD performance with Jewel
> >
> > Pankaj,
> >
> > This could be related to the new throttle parameters introduced in Jewel. By
> > default these throttles are off; you need to tweak them according to
> > your setup.
> >
> > What is your journal size and fio block size?
> >
> > If the journal is the default 5GB, then at the rate you mentioned (assuming 4K
> > random writes) and considering 3X replication, it can fill up and stall
> > I/O within ~30 seconds or so.
> >
> > If you think this is what is happening in your system, you need to
> > turn this throttle on (see
> > https://github.com/ceph/ceph/blob/jewel/src/doc/dynamic-throttle.txt )
> > and also lower filestore_max_sync_interval to ~1 (or even lower).
> > Since you are testing on SSDs, I would also recommend turning the
> > following parameter on for stable performance:
> >
> > filestore_odsync_write = true
> >
> > Thanks & Regards
> >
> > Somnath
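As a rough back-of-the-envelope check on the journal-fill estimates above (assuming ~15K client IOPS, the peak rate reported in this thread, a 4K block size as Somnath assumes, 3X replication, 12 OSDs with the default 5GB journals, and ignoring whatever the filestore manages to flush in the meantime):

    15,000 IOPS x 4 KB x 3 replicas  ~=  180 MB/s of journal writes across the cluster
    180 MB/s / 12 OSDs               ~=   15 MB/s into each journal
    5 GB / 15 MB/s                   ~=  340 s to fill a journal

That is the same ballpark as the ~250 seconds mentioned above; with fewer OSDs absorbing the same client load, the per-journal rate rises and the fill time shrinks, which is presumably where the earlier ~30-second estimate came from.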
> > *From:* ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] *On Behalf Of* Garg, Pankaj
> > *Sent:* Wednesday, July 13, 2016 4:57 PM
> > *To:* ceph-users@xxxxxxxxxxxxxx <mailto:ceph-users@xxxxxxxxxxxxxx>
> > *Subject:* Terrible RBD performance with Jewel
> >
> > Hi,
> >
> > I just installed Jewel on a small cluster of 3 machines with 4 SSDs
> > each. I created 8 RBD images and use a single client with 8 threads
> > to do random writes (using fio with the RBD engine) on the images
> > (1 thread per image).
> >
> > The cluster has 3X replication and 10G cluster and client networks.
> >
> > FIO prints the aggregate IOPS every second for the cluster. Before
> > Jewel, I got roughly 10K IOPS. It was up and down, but still kept going.
> >
> > Now I see IOPS that go up to 13-15K, but then they drop, eventually
> > reaching ZERO for several seconds, and then start back up again.
> >
> > What am I missing?
> >
> > Thanks
> >
> > Pankaj
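The original post doesn't include the fio job file, but a run matching that description (the RBD engine, one random-write job per image, all driven from one client) might look roughly like the sketch below; the pool name, image names, block size, queue depth and runtime are assumptions, not values taken from the thread:

    [global]
    ioengine=rbd
    clientname=admin
    pool=rbd
    rw=randwrite
    bs=4k
    iodepth=32
    time_based
    runtime=300

    [image1]
    rbdname=image1

    [image2]
    rbdname=image2

    # ...and so on, one job section per image, up to [image8]

With numjobs left at its default of 1 in each section, this gives the eight single-threaded writers described above; adding group_reporting to the [global] section collapses the final output into a single aggregate result.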