On Sat, Jul 29, 2017 at 8:34 PM, Mark Nelson <mark.a.nelson@xxxxxxxxx> wrote:
>
> On 07/28/2017 03:57 PM, Jianjian Huo wrote:
>>
>> Hi Mark,
>>
>> On Wed, Jul 26, 2017 at 8:55 PM, Mark Nelson <mark.a.nelson@xxxxxxxxx>
>> wrote:
>>>
>>> Yeah, metrics and profiling data would be good at this point. The
>>> standard gauntlet of collectl/iostat, gdbprof or poor man's profiling,
>>> perf, blktrace, etc. We don't necessarily need everything, but if
>>> anything interesting shows up it would be good to see it.
>>>
>>> Also, turning on rocksdb bloom filters is worth doing if it hasn't
>>> been done yet (happening in master soon via
>>> https://github.com/ceph/ceph/pull/16450).
>>>
>>> FWIW, I'm tracking down what I think is a sequential write regression
>>> vs earlier versions of bluestore, but I haven't figured out what's
>>> going on yet, or even how much of a regression we are facing (these
>>> tests are on much bigger volumes than previously tested).
>>>
>>> Mark
>>
>> For bluestore sequential writes, from our testing with the master
>> branch two days ago, EC sequential writes (16K and 128K) were 2~3
>> times slower than 3x sequential writes. In your earlier testing,
>> bluestore EC sequential writes were faster than 3x for all IO sizes.
>> Is this some sort of regression you are aware of?
>>
>> Jianjian
>
> I wouldn't necessarily expect small EC sequential writes to do well vs
> 3x replication. It might depend on the disk configuration, and
> definitely on the client-side WB cache (this is tricky because RBD
> cache has some locking limitations that become apparent at high IOPS
> rates per volume). For large writes, though, I've seen EC faster
> (somewhere between 2x and 3x replication). These numbers are almost 5
> months old now (and there have been some bluestore performance
> improvements since then), but here's what I was seeing for RBD EC
> overwrites last March (scroll to the right for graphs):
>
> https://drive.google.com/uc?export=download&id=0B2gTBZrkrnpZbE50QUdtZlBxdFU

Thanks for sharing this data, Mark.

From your data of last March, for RBD EC overwrites on NVMe, EC
sequential writes are faster than 3X for all IO sizes, including small
4K/16K. Is that right? I am not seeing this on my setup (all NVMe
drives, 12 of them per node); in my case EC sequential writes are 2~3
times slower than 3X. Maybe I have too many drives per node?

Jianjian

> FWIW, the regression I might be seeing (if it is actually a
> regression) appears to be limited to RBD block creation rather than
> writes to existing blocks. I.e., pre-filling volumes is slower than
> just creating objects of the same size via rados bench. It's pretty
> limited in scope.
>
> Mark
>
>>> On 07/26/2017 09:40 PM, Brad Hubbard wrote:
>>>>
>>>> Bumping this, as I was talking to Junqin on IRC today and he
>>>> reported that it is still an issue. I suggested analysis of metrics
>>>> and profiling data to try to determine the bottleneck for
>>>> bluestore, and also suggested Junqin open a tracker so we can
>>>> investigate this thoroughly.
>>>>
>>>> Mark, did you have any additional thoughts on how this might best
>>>> be attacked?
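>>>>
>>>> For concreteness, a first capture of that data might look something
>>>> like the sketch below (the OSD pid, device names, and durations are
>>>> placeholders to adjust for the actual setup):
>>>>
>>>>   iostat -xm 1                                # per-device throughput/latency
>>>>   perf record -g -p <osd-pid> -- sleep 30     # CPU profile of one OSD
>>>>   perf report
>>>>   gdb -batch -ex 'thread apply all bt' -p <osd-pid>   # poor man's profiler
>>>>   blktrace -d /dev/sdd -o - | blkparse -i -   # block-layer trace of one HDD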
>>>> On Thu, Jul 13, 2017 at 11:37 PM, Junqin JQ7 Zhang <zhangjq7@xxxxxxxxxx>
>>>> wrote:
>>>>>
>>>>> Hi Mark,
>>>>>
>>>>> Thanks for your reply.
>>>>>
>>>>> Our SSD model is:
>>>>> Device Model: SSDSC2BA800G4N
>>>>> Intel SSD DC S3710 Series 800GB
>>>>>
>>>>> And the BlueStore OSD configuration is as I posted before:
>>>>> [osd.0]
>>>>> host = ceph-1
>>>>> osd data = /var/lib/ceph/osd/ceph-0    # a 100M SSD partition
>>>>> bluestore block db path = /dev/sda5    # a 10G SSD partition
>>>>> bluestore block wal path = /dev/sda6   # a 10G SSD partition
>>>>> bluestore block path = /dev/sdd        # a HDD disk
>>>>>
>>>>> The iostat output is a quick snapshot of the terminal screen during
>>>>> an 8K write. I forget the detailed test configuration; all I can
>>>>> confirm is that it was an 8K random write.
>>>>> But we have since re-set up the cluster, so I can't get the data
>>>>> right now; we will run the test again in the coming days.
>>>>>
>>>>> Is there any special BlueStore configuration in your lab tests? For
>>>>> example, how are the BlueStore OSDs configured in your lab?
>>>>> Could you share your lab's BlueStore configuration, e.g. the
>>>>> ceph.conf file?
>>>>>
>>>>> Thanks a lot!
>>>>>
>>>>> B.R.
>>>>> Junqin Zhang
>>>>>
>>>>> -----Original Message-----
>>>>> From: ceph-devel-owner@xxxxxxxxxxxxxxx
>>>>> [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Mark Nelson
>>>>> Sent: Wednesday, July 12, 2017 11:29 PM
>>>>> To: Junqin JQ7 Zhang; Mark Nelson; Ceph Development
>>>>> Subject: Re: Ceph Bluestore OSD CPU utilization
>>>>>
>>>>> Hi Junqin,
>>>>>
>>>>> On 07/12/2017 05:21 AM, Junqin JQ7 Zhang wrote:
>>>>>>
>>>>>> Hi Mark,
>>>>>>
>>>>>> We also compared iostat of filestore and bluestore.
>>>>>> The disk write rate of bluestore is only around 10% of filestore
>>>>>> in the same test case.
>>>>>>
>>>>>> Here is the FileStore iostat during the write:
>>>>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>>>>           13.06    0.00    9.84   11.52    0.00   65.58
>>>>>>
>>>>>> Device:  rrqm/s   wrqm/s     r/s      w/s    rkB/s     wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>>>>> sda        0.00     0.00    0.00  8196.00     0.00  73588.00    17.96     0.52    0.06    0.00    0.06   0.04  31.90
>>>>>> sdb        0.00     0.00    0.00  8298.00     0.00  75572.00    18.21     0.54    0.07    0.00    0.07   0.04  33.00
>>>>>> sdh        0.00  4894.00    0.00   741.00     0.00  30504.00    82.33   207.60  314.51    0.00  314.51   1.35 100.10
>>>>>> sdj        0.00  1282.00    0.00   938.00     0.00  15652.00    33.37    14.40   16.04    0.00   16.04   0.90  84.10
>>>>>> sdk        0.00  5156.00    0.00   847.00     0.00  34560.00    81.61   199.04  283.83    0.00  283.83   1.18 100.10
>>>>>> sdd        0.00  6889.00    0.00   729.00     0.00  38216.00   104.84   138.60  198.14    0.00  198.14   1.37 100.00
>>>>>> sde        0.00  6909.00    0.00   763.00     0.00  38608.00   101.20   139.16  190.55    0.00  190.55   1.31 100.00
>>>>>> sdf        0.00  3237.00    0.00   708.00     0.00  30548.00    86.29   175.15  310.36    0.00  310.36   1.41  99.80
>>>>>> sdg        0.00  4875.00    0.00   745.00     0.00  32312.00    86.74   207.70  291.26    0.00  291.26   1.34 100.00
>>>>>> sdi        0.00  7732.00    0.00   812.00     0.00  42136.00   103.78   140.94  181.96    0.00  181.96   1.23 100.00
>>>>>>
>>>>>> Here is the BlueStore iostat during the write:
>>>>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>>>>            6.50    0.00    3.22    2.36    0.00   87.91
>>>>>>
>>>>>> Device:  rrqm/s   wrqm/s     r/s      w/s    rkB/s     wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>>>>> sda        0.00     0.00    0.00  2938.00     0.00  25072.00    17.07     0.14    0.05    0.00    0.05   0.04  12.70
>>>>>> sdb        0.00     0.00    0.00  2821.00     0.00  26112.00    18.51     0.15    0.05    0.00    0.05   0.05  12.90
>>>>>> sdh        0.00     1.00    0.00   510.00     0.00   3600.00    14.12     5.45   10.68    0.00   10.68   0.24  12.00
>>>>>> sdj        0.00     0.00    0.00   424.00     0.00   3072.00    14.49     4.24   10.00    0.00   10.00   0.22   9.30
>>>>>> sdk        0.00     0.00    0.00   496.00     0.00   3584.00    14.45     4.10    8.26    0.00    8.26   0.18   9.10
>>>>>> sdd        0.00     0.00    0.00   419.00     0.00   3080.00    14.70     3.60    8.60    0.00    8.60   0.19   7.80
>>>>>> sde        0.00     0.00    0.00   650.00     0.00   3784.00    11.64    24.39   40.19    0.00   40.19   1.15  74.60
>>>>>> sdf        0.00     0.00    0.00   494.00     0.00   3584.00    14.51     5.92   11.98    0.00   11.98   0.26  12.90
>>>>>> sdg        0.00     0.00    0.00   493.00     0.00   3584.00    14.54     5.11   10.37    0.00   10.37   0.23  11.20
>>>>>> sdi        0.00     0.00    0.00   744.00     0.00   4664.00    12.54   121.41  177.66    0.00  177.66   1.35 100.10
>>>>>>
>>>>>> sda and sdb are SSDs; the others are HDDs.
>>>>>
>>>>> Earlier it looked like you were posting the configuration for an 8k
>>>>> randrw test, but this is a pure write test? Can you provide the
>>>>> test configuration for these results? Also, the SSD model would be
>>>>> useful to know.
>>>>>
>>>>> Having said that, these results look pretty different from what I
>>>>> typically see in the lab. A big clue is the avgrq-sz. On filestore
>>>>> you are seeing much larger write requests than with bluestore.
>>>>> That might indicate that metadata writes are going to the HDD. Is
>>>>> this still with the 10GB DB partition?
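>>>>>
>>>>> If it helps, one quick way to check for that rollover is to look at
>>>>> the bluefs counters on the OSD admin socket (a sketch; the exact
>>>>> counter names can vary a bit between releases):
>>>>>
>>>>>   ceph daemon osd.0 perf dump | grep -E '(db|wal|slow)_(total|used)_bytes'
>>>>>
>>>>> A non-zero slow_used_bytes would mean rocksdb data has already
>>>>> spilled onto the HDD.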
>>>>>
>>>>> Mark
>>>>>
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Junqin JQ7 Zhang
>>>>>> Sent: Wednesday, July 12, 2017 10:45 AM
>>>>>> To: 'Mark Nelson'; Mark Nelson; Ceph Development
>>>>>> Subject: RE: Ceph Bluestore OSD CPU utilization
>>>>>>
>>>>>> Hi Mark,
>>>>>>
>>>>>> Actually, we tested filestore on the same Ceph version v12.1.0 and
>>>>>> the same cluster.
>>>>>> # ceph -v
>>>>>> ceph version 12.1.0 (262617c9f16c55e863693258061c5b25dea5b086)
>>>>>> luminous (dev)
>>>>>>
>>>>>> The CPU utilization of each OSD on filestore can reach a maximum
>>>>>> of around 200%, but the CPU utilization of an OSD on bluestore is
>>>>>> only around 30%.
>>>>>> BlueStore's performance is only about 20% of filestore's.
>>>>>> We think there must be something wrong with our configuration.
>>>>>>
>>>>>> I tried to change the ceph config, e.g.
>>>>>> osd op threads = 8
>>>>>> osd disk threads = 4
>>>>>>
>>>>>> but still can't get a good result.
>>>>>>
>>>>>> Any idea about this?
>>>>>>
>>>>>> BTW, we changed some filestore-related configuration during the
>>>>>> test:
>>>>>> filestore fd cache size = 2048576000
>>>>>> filestore fd cache shards = 16
>>>>>> filestore async threads = 0
>>>>>> filestore max sync interval = 15
>>>>>> filestore wbthrottle enable = false
>>>>>> filestore commit timeout = 1200
>>>>>> filestore_op_thread_suicide_timeout = 0
>>>>>> filestore queue max ops = 1048576
>>>>>> filestore queue max bytes = 17179869184
>>>>>> max open files = 262144
>>>>>> filestore fadvise = false
>>>>>> filestore ondisk finisher threads = 4
>>>>>> filestore op threads = 8
>>>>>>
>>>>>> Thanks a lot!
>>>>>>
>>>>>> B.R.
>>>>>> Junqin Zhang
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: ceph-devel-owner@xxxxxxxxxxxxxxx
>>>>>> [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Mark Nelson
>>>>>> Sent: Tuesday, July 11, 2017 11:47 PM
>>>>>> To: Junqin JQ7 Zhang; Mark Nelson; Ceph Development
>>>>>> Subject: Re: Ceph Bluestore OSD CPU utilization
>>>>>>
>>>>>> On 07/11/2017 10:31 AM, Junqin JQ7 Zhang wrote:
>>>>>>>
>>>>>>> Hi Mark,
>>>>>>>
>>>>>>> Thanks for your reply.
>>>>>>>
>>>>>>> The hardware is as below for each of the 3 hosts:
>>>>>>> 2 SATA SSDs and 8 HDDs
>>>>>>
>>>>>> The SSD model could potentially be very important here. The
>>>>>> devices we test in our lab are enterprise-grade SSDs with power
>>>>>> loss protection. That means they don't have to flush data on sync
>>>>>> requests, so O_DSYNC writes are much faster as a result. I don't
>>>>>> know how bad an impact this has on the rocksdb wal/db, but it
>>>>>> definitely hurts with filestore journals.
>>>>>>
>>>>>>> Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
>>>>>>> Network: 20000Mb/s
>>>>>>>
>>>>>>> I configured the OSDs like:
>>>>>>> [osd.0]
>>>>>>> host = ceph-1
>>>>>>> osd data = /var/lib/ceph/osd/ceph-0   # a 100M partition of SSD
>>>>>>> bluestore block db path = /dev/sda5   # a 10G partition of SSD
>>>>>>
>>>>>> Bluestore automatically rolls rocksdb data over to the HDD when
>>>>>> the db gets full. I bet with 10GB you'll see good performance at
>>>>>> first, and then you'll start seeing lots of extra reads/writes on
>>>>>> the HDD once it fills up with metadata (the more extents that are
>>>>>> written out, the more likely you'll hit this boundary). You'll
>>>>>> want to make the db partitions use the majority of the SSD(s).
>>>>>>
>>>>>>> bluestore block wal path = /dev/sda6   # a 10G partition of SSD
>>>>>>
>>>>>> The WAL can be smaller. 1-2GB is enough (potentially even less if
>>>>>> you adjust the rocksdb buffer settings, but 1-2GB should be small
>>>>>> enough to devote most of your SSDs to DB storage).
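>>>>>>
>>>>>> As a rough sketch (sizes are illustrative only; e.g., if an 800GB
>>>>>> SSD backs four HDD OSDs, each could get roughly a 190G DB
>>>>>> partition), that would look like:
>>>>>>
>>>>>> [osd.0]
>>>>>> host = ceph-1
>>>>>> osd data = /var/lib/ceph/osd/ceph-0   # small SSD partition, as before
>>>>>> bluestore block db path = /dev/sda5   # ~190G SSD partition (most of the SSD)
>>>>>> bluestore block wal path = /dev/sda6  # ~2G SSD partition
>>>>>> bluestore block path = /dev/sdd       # HDD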
>>>>>>
>>>>>>> bluestore block path = /dev/sdd   # a HDD disk
>>>>>>>
>>>>>>> We use fio to test one or more 100G RBDs; an example of our fio
>>>>>>> config:
>>>>>>> [global]
>>>>>>> ioengine=rbd
>>>>>>> clientname=admin
>>>>>>> pool=rbd
>>>>>>> rw=randrw
>>>>>>> bs=8k
>>>>>>> runtime=120
>>>>>>> iodepth=16
>>>>>>> numjobs=4
>>>>>>
>>>>>> With the rbd engine I try to avoid numjobs, as it can give
>>>>>> erroneous results in some cases. It's probably better generally to
>>>>>> stick with multiple independent fio processes, as in the sketch at
>>>>>> the end of this mail (though in this case, for a randrw workload,
>>>>>> it might not matter).
>>>>>>
>>>>>>> direct=1
>>>>>>> rwmixread=0
>>>>>>> new_group
>>>>>>> group_reporting
>>>>>>> [rbd_image0]
>>>>>>> rbdname=testimage_100GB_0
>>>>>>>
>>>>>>> Any suggestion?
>>>>>>
>>>>>> What kind of performance are you seeing, and what do you expect to
>>>>>> get?
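>>>>>>
>>>>>> For the multiple-process approach, something like this sketch works
>>>>>> (one job file per RBD image, each with the same [global] section as
>>>>>> yours minus numjobs; the file names are placeholders):
>>>>>>
>>>>>>   fio rbd_image0.fio --output=image0.log &
>>>>>>   fio rbd_image1.fio --output=image1.log &
>>>>>>   wait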
>>>>>>
>>>>>> Mark
>>>>>>
>>>>>>> Thanks.
>>>>>>>
>>>>>>> B.R.
>>>>>>> Junqin Zhang
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Mark Nelson [mailto:mnelson@xxxxxxxxxx]
>>>>>>> Sent: Tuesday, July 11, 2017 7:32 PM
>>>>>>> To: Junqin JQ7 Zhang; Ceph Development
>>>>>>> Subject: Re: Ceph Bluestore OSD CPU utilization
>>>>>>>
>>>>>>> Ugh, small sequential *reads* I meant to say. :)
>>>>>>>
>>>>>>> Mark
>>>>>>>
>>>>>>> On 07/11/2017 06:31 AM, Mark Nelson wrote:
>>>>>>>>
>>>>>>>> Hi Junqin,
>>>>>>>>
>>>>>>>> Can you tell us your hardware configuration (models and
>>>>>>>> quantities of CPUs, network cards, disks, SSDs, etc.) and the
>>>>>>>> command and options you used to measure performance?
>>>>>>>>
>>>>>>>> In many cases bluestore is faster than filestore, but there are
>>>>>>>> a couple of cases where it is notably slower, the big one being
>>>>>>>> when doing small sequential writes without client-side
>>>>>>>> readahead.
>>>>>>>>
>>>>>>>> Mark
>>>>>>>>
>>>>>>>> On 07/11/2017 05:34 AM, Junqin JQ7 Zhang wrote:
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I installed Ceph luminous v12.1.0 in a 3-node cluster with
>>>>>>>>> BlueStore and did some fio tests.
>>>>>>>>> During the tests, I found that each OSD's CPU utilization was
>>>>>>>>> only around 30%, and the performance seems not good to me.
>>>>>>>>> Is there any configuration to help increase OSD CPU utilization
>>>>>>>>> to improve performance?
>>>>>>>>> Change kernel.pid_max? Any BlueStore-specific configuration?
>>>>>>>>>
>>>>>>>>> Thanks a lot!
>>>>>>>>>
>>>>>>>>> B.R.
>>>>>>>>> Junqin Zhang
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html