I've seen something similar if you are using RBD caching: I found that if you can fill the RBD cache faster than it can flush, you get these stalls. I increased the size of the cache and also the flush threshold, and this solved the problem. I didn't spend much time looking into it, but it seemed like a smaller cache didn't have enough working space to accept new writes whilst the older ones were being flushed.

> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Mark Nelson
> Sent: 14 July 2016 03:34
> To: ceph-users@xxxxxxxxxxxxxx
> Subject: Re: Terrible RBD performance with Jewel
>
> As Somnath mentioned, you've got a lot of tunables set there. Are you sure those are all doing what you think they are doing?
>
> FWIW, the xfs -n size=64k option is probably not a good idea.
> Unfortunately it can't be changed without making a new filesystem.
>
> See:
>
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-February/007645.html
>
> Typically that seems to manifest as suicide timeouts on the OSDs, though.
> You'd also see kernel log messages that look like:
>
> kernel: XFS: possible memory allocation deadlock in kmem_alloc (mode:0x8250)
>
> Mark
>
> On 07/13/2016 08:39 PM, Garg, Pankaj wrote:
> > I agree, but I'm dealing with something else here with this setup.
> >
> > I just ran a test, and within 3 seconds my IOPS went to 0 and stayed
> > there for 90 seconds... then it started again and within seconds went back to 0.
> >
> > This doesn't seem normal at all. Here is my ceph.conf:
> >
> > [global]
> > fsid = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
> > public_network = xxxxxxxxxxxxxxxxxxxxxxxx
> > cluster_network = xxxxxxxxxxxxxxxxxxxxxxxxxxxx
> > mon_initial_members = ceph1
> > mon_host = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
> > auth_cluster_required = cephx
> > auth_service_required = cephx
> > auth_client_required = cephx
> > filestore_xattr_use_omap = true
> > osd_mkfs_options = -f -i size=2048 -n size=64k
> > osd_mount_options_xfs = inode64,noatime,logbsize=256k
> > filestore_merge_threshold = 40
> > filestore_split_multiple = 8
> > osd_op_threads = 12
> > osd_pool_default_size = 2
> > mon_pg_warn_max_object_skew = 100000
> > mon_pg_warn_min_per_osd = 0
> > mon_pg_warn_max_per_osd = 32768
> > filestore_op_threads = 6
> >
> > [osd]
> > osd_enable_op_tracker = false
> > osd_op_num_shards = 2
> > filestore_wbthrottle_enable = false
> > filestore_max_sync_interval = 1
> > filestore_odsync_write = true
> > filestore_max_inline_xattr_size = 254
> > filestore_max_inline_xattrs = 6
> > filestore_queue_committing_max_bytes = 1048576000
> > filestore_queue_committing_max_ops = 5000
> > filestore_queue_max_bytes = 1048576000
> > filestore_queue_max_ops = 500
> > journal_max_write_bytes = 1048576000
> > journal_max_write_entries = 1000
> > journal_queue_max_bytes = 1048576000
> > journal_queue_max_ops = 3000
> > filestore_fd_cache_shards = 32
> > filestore_fd_cache_size = 64
> >
> > *From:* Somnath Roy [mailto:Somnath.Roy@xxxxxxxxxxx]
> > *Sent:* Wednesday, July 13, 2016 6:06 PM
> > *To:* Garg, Pankaj; ceph-users@xxxxxxxxxxxxxx
> > *Subject:* RE: Terrible RBD performance with Jewel
> >
> > You should do that first to get stable performance out of filestore.
> > A 1M sequential write over the entire image should be sufficient to precondition it.
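For reference, a 1M sequential preconditioning pass with fio's RBD engine could look something like the sketch below; the client, pool and image names are placeholders rather than values taken from this thread:

    fio --name=precondition --ioengine=rbd --clientname=admin --pool=rbd \
        --rbdname=image1 --rw=write --bs=1M --iodepth=32

Running one pass like this per image before the random-write tests means the RADOS objects backing each image already exist, so the first measured run isn't also paying the allocation cost.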
> > *From:* Garg, Pankaj [mailto:Pankaj.Garg@xxxxxxxxxx]
> > *Sent:* Wednesday, July 13, 2016 6:04 PM
> > *To:* Somnath Roy; ceph-users@xxxxxxxxxxxxxx <mailto:ceph-users@xxxxxxxxxxxxxx>
> > *Subject:* RE: Terrible RBD performance with Jewel
> >
> > No, I have not.
> >
> > *From:* Somnath Roy [mailto:Somnath.Roy@xxxxxxxxxxx]
> > *Sent:* Wednesday, July 13, 2016 6:00 PM
> > *To:* Garg, Pankaj; ceph-users@xxxxxxxxxxxxxx <mailto:ceph-users@xxxxxxxxxxxxxx>
> > *Subject:* RE: Terrible RBD performance with Jewel
> >
> > In fact, I was wrong; I missed that you are running with 12 OSDs
> > (considering one OSD per SSD). In that case, it will take ~250 seconds
> > to fill up the journal.
> >
> > Have you preconditioned the entire image with a bigger block size, say 1M,
> > before doing any real test?
> >
> > *From:* Garg, Pankaj [mailto:Pankaj.Garg@xxxxxxxxxx]
> > *Sent:* Wednesday, July 13, 2016 5:55 PM
> > *To:* Somnath Roy; ceph-users@xxxxxxxxxxxxxx <mailto:ceph-users@xxxxxxxxxxxxxx>
> > *Subject:* RE: Terrible RBD performance with Jewel
> >
> > Thanks Somnath. I will try all these, but I think there is something
> > else going on too.
> >
> > Firstly, my test sometimes reaches 0 IOPS within 10 seconds.
> >
> > Secondly, when I'm at 0 IOPS, I see NO disk activity in iostat and no
> > CPU activity either. This part is strange.
> >
> > Thanks
> >
> > Pankaj
> >
> > *From:* Somnath Roy [mailto:Somnath.Roy@xxxxxxxxxxx]
> > *Sent:* Wednesday, July 13, 2016 5:49 PM
> > *To:* Somnath Roy; Garg, Pankaj; ceph-users@xxxxxxxxxxxxxx <mailto:ceph-users@xxxxxxxxxxxxxx>
> > *Subject:* RE: Terrible RBD performance with Jewel
> >
> > Also increase the following:
> >
> > filestore_op_threads
> >
> > *From:* ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] *On Behalf Of* Somnath Roy
> > *Sent:* Wednesday, July 13, 2016 5:47 PM
> > *To:* Garg, Pankaj; ceph-users@xxxxxxxxxxxxxx <mailto:ceph-users@xxxxxxxxxxxxxx>
> > *Subject:* Re: Terrible RBD performance with Jewel
> >
> > Pankaj,
> >
> > This could be related to the new throttle parameters introduced in Jewel. By
> > default these throttles are off; you need to tweak them according to
> > your setup.
> >
> > What is your journal size and fio block size?
> >
> > If the journal is the default 5GB, then at the rate you mentioned (assuming 4K
> > random writes) and considering 3X replication, it can fill up and stall
> > I/O within ~30 seconds or so.
> >
> > If you think this is what is happening in your system, you need to
> > turn this throttle on (see
> > https://github.com/ceph/ceph/blob/jewel/src/doc/dynamic-throttle.txt )
> > and also lower filestore_max_sync_interval to ~1 (or even lower).
> > Since you are testing on SSDs, I would also recommend turning the
> > following parameter on for stable performance:
> >
> > filestore_odsync_write = true
> >
> > Thanks & Regards
> >
> > Somnath
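As a rough back-of-the-envelope check on the journal-fill estimates above (assuming ~15K client IOPS, the peak rate reported in this thread, a 4K block size as Somnath assumes, 3X replication, 12 OSDs with the default 5GB journals, and ignoring whatever the filestore manages to flush in the meantime):

    15,000 IOPS x 4 KB x 3 replicas  ~=  180 MB/s of journal writes across the cluster
    180 MB/s / 12 OSDs               ~=   15 MB/s into each journal
    5 GB / 15 MB/s                   ~=  340 s to fill a journal

That is the same ballpark as the ~250 seconds mentioned above; with fewer OSDs absorbing the same client load, the per-journal rate rises and the fill time shrinks, which is presumably where the earlier ~30-second estimate came from.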
> > *From:* ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] *On Behalf Of* Garg, Pankaj
> > *Sent:* Wednesday, July 13, 2016 4:57 PM
> > *To:* ceph-users@xxxxxxxxxxxxxx <mailto:ceph-users@xxxxxxxxxxxxxx>
> > *Subject:* Terrible RBD performance with Jewel
> >
> > Hi,
> >
> > I just installed Jewel on a small cluster of 3 machines with 4 SSDs
> > each. I created 8 RBD images and use a single client with 8 threads
> > to do random writes (using fio with the RBD engine) on the images
> > (1 thread per image).
> >
> > The cluster has 3X replication and 10G cluster and client networks.
> >
> > FIO prints the aggregate IOPS every second for the cluster. Before
> > Jewel, I got roughly 10K IOPS. It was up and down, but still kept going.
> >
> > Now I see IOPS that go up to 13-15K, but then they drop, eventually
> > reaching ZERO for several seconds, and then start back up again.
> >
> > What am I missing?
> >
> > Thanks
> >
> > Pankaj
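The original post doesn't include the fio job file, but a run matching that description (the RBD engine, one random-write job per image, all driven from one client) might look roughly like the sketch below; the pool name, image names, block size, queue depth and runtime are assumptions, not values taken from the thread:

    [global]
    ioengine=rbd
    clientname=admin
    pool=rbd
    rw=randwrite
    bs=4k
    iodepth=32
    time_based
    runtime=300

    [image1]
    rbdname=image1

    [image2]
    rbdname=image2

    # ...and so on, one job section per image, up to [image8]

With numjobs left at its default of 1 in each section, this gives the eight single-threaded writers described above; adding group_reporting to the [global] section collapses the final output into a single aggregate result.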