Hello,

On Wed, 7 Dec 2016 12:39:11 +0100 Christian Theune wrote:

> Hi,
>
> I'm now working with the raw device and getting interesting results.
>
> For one, I went through all reviews about the Micron DC S610 again and, as always, the devil is in the detail. I noticed that the test results are quite favorable, but I didn't previously notice the caveat (which applies to SSDs in general) that preconditioning may be in order.
>
> See http://www.storagereview.com/seagate_12002_micron_s600dc_enterprise_sas_ssd_review
>
> The Micron in their tests shows quite extreme initial max latency until preconditioning settles.
>
> I can relate to that, as the SSDs that I put into the cluster last Friday (5 days ago) have quite different characteristics in my statistics compared to the ones I added this Monday evening (2 days ago).
>
Be that as it may, it's still lacking, see below.

> I took one of the early ones and evacuated the OSD to perform tests. Sebastian's fio call for testing journal ability ended up like this at the current time:
>
> | cartman06 ~ # fio --filename=/dev/sdl --direct=1 --sync=1 --rw=write --bs=128k --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test
> | journal-test: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=1
> | fio-2.0.14
> | Starting 1 process
> | Jobs: 1 (f=1): [W] [100.0% done] [0K/88852K/0K /s] [0/22.3K/0 iops] [eta 00m:00s]
> | journal-test: (groupid=0, jobs=1): err= 0: pid=28606: Wed Dec 7 11:59:36 2016
> |   write: io=5186.7MB, bw=88517KB/s, iops=22129, runt= 60001msec
> |     clat (usec): min=37, max=1519, avg=43.77, stdev=10.89
> |      lat (usec): min=37, max=1519, avg=43.94, stdev=10.90
> |     clat percentiles (usec):
> |      |  1.00th=[   39],  5.00th=[   40], 10.00th=[   40], 20.00th=[   41],
> |      | 30.00th=[   41], 40.00th=[   42], 50.00th=[   42], 60.00th=[   42],
> |      | 70.00th=[   43], 80.00th=[   44], 90.00th=[   47], 95.00th=[   53],
> |      | 99.00th=[   71], 99.50th=[   87], 99.90th=[  157], 99.95th=[  201],
> |      | 99.99th=[  478]
> |     bw (KB/s)  : min=81096, max=91312, per=100.00%, avg=88519.19, stdev=1762.43
> |     lat (usec) : 50=92.42%, 100=7.28%, 250=0.27%, 500=0.02%, 750=0.01%
> |     lat (usec) : 1000=0.01%
> |     lat (msec) : 2=0.01%
> |   cpu          : usr=5.43%, sys=14.64%, ctx=1327888, majf=0, minf=6
> |   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
> |      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> |      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
> |      issued    : total=r=0/w=1327777/d=0, short=r=0/w=0/d=0
> |
> | Run status group 0 (all jobs):
> |   WRITE: io=5186.7MB, aggrb=88516KB/s, minb=88516KB/s, maxb=88516KB/s, mint=60001msec, maxt=60001msec
> |
> | Disk stats (read/write):
> |   sdl: ios=15/1326283, merge=0/0, ticks=1/47203, in_queue=46970, util=78.29%
>
> That doesn't look too bad to me, specifically the 99.99th of 478 microseconds seems fine.
>
> The iostat during this run looks OK as well:
>
Both do look pretty bad to me. Your SSD with a nominal write speed of 850MB/s is doing 88MB/s at 80% utilization.
The puny 400GB DC S3610 in my example earlier can do 400MB/s per Intel specs and was at 70% with 300MB/s (so half of that being journal writes!).

My experience with Intel SSDs (as mentioned before on this ML) is that their stated speeds can actually be achieved within about a 10% margin when used with Ceph, be it for pure journaling or as OSDs with inline journals.
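A back-of-the-envelope check, just connecting the numbers you already posted: at iodepth=1 with sync writes, every 4KB write has to wait for the previous one to be acknowledged, so the throughput ceiling is simply the block size divided by the average commit latency:

    1 / 43.77 usec  ~= 22.8K synced writes/s
    22.8K * 4KB     ~= 90MB/s

which is essentially the 88MB/s you measured. The nice-looking percentiles and the low bandwidth are the same thing seen from two angles: roughly 44 usec per synced 4KB write is all this drive gives you at queue depth 1.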
Your numbers here basically prove that you can't expect much more than 100MB/s (50MB/s effective) from these SSDs under the best of circumstances.

How do these numbers change when you do "normal" FS-level writes without sync/direct in fio, so that the only SYNCs are the ones coming from the FS for its journal? Or writes straight to the device without a FS, an even more "friendly" test. Use large/long runs so you can observe things when the pagecache gets flushed (see the example invocations sketched further below).

If this gets significantly closer to the limits of your SSD, or in your case that of your 6Gb/s SAS2 link (so 600MB/s), then the proof is complete that the culprit is bad DSYNC handling on those SSDs.

> | cartman06 ~ # iostat -x 5 sdl
> | Linux 4.4.27-gentoo (cartman06)   12/07/2016   _x86_64_   (24 CPU)
> |
> | avg-cpu: %user %nice %system %iowait %steal %idle
> |           5.70  0.05    3.09    5.09   0.00 86.07
> |
> | Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> | sdl 0.00 2.92 31.24 148.16 1851.99 8428.66 114.61 1.61 9.00 0.68 10.75 0.22 4.03
> |
> | avg-cpu: %user %nice %system %iowait %steal %idle
> |           3.31  0.04    1.97    1.48   0.00 93.19
> |
> | Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> | sdl 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
> |
> | avg-cpu: %user %nice %system %iowait %steal %idle
> |           4.02  0.03    2.38    1.44   0.00 92.13
> |
> | Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> | sdl 0.00 0.00 12.40 3101.40 92.80 12405.60 8.03 0.11 0.04 0.10 0.04 0.04 11.12
> |
> | avg-cpu: %user %nice %system %iowait %steal %idle
> |           4.02  0.05    3.57    4.78   0.00 87.58
> |
> | Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> | sdl 0.00 0.00 0.00 22166.20 0.00 88664.80 8.00 0.80 0.04 0.00 0.04 0.04 79.58
> |
> | avg-cpu: %user %nice %system %iowait %steal %idle
> |           3.64  0.05    2.77    4.98   0.00 88.56
> |
> | Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> | sdl 0.00 0.00 0.00 22304.20 0.00 89216.80 8.00 0.78 0.04 0.00 0.04 0.04 78.08
> |
> | avg-cpu: %user %nice %system %iowait %steal %idle
> |           4.89  0.05    2.97   11.15   0.00 80.93
> |
> | Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> | sdl 0.00 0.00 0.00 22022.00 0.00 88088.00 8.00 0.79 0.04 0.00 0.04 0.04 78.68
> |
> | avg-cpu: %user %nice %system %iowait %steal %idle
> |           3.45  0.04    2.74    4.24   0.00 89.53
> |
> | Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> | sdl 0.00 0.00 0.00 22182.60 0.00 88730.40 8.00 0.78 0.04 0.00 0.04 0.04 77.66
> |
> | avg-cpu: %user %nice %system %iowait %steal %idle
> |           4.21  0.04    2.51    3.40   0.00 89.83
> |
> | Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> | sdl 0.00 0.00 0.00 22392.00 0.00 89568.00 8.00 0.79 0.04 0.00 0.04 0.04 79.26
> |
> | avg-cpu: %user %nice %system %iowait %steal %idle
> |           4.94  0.04    3.35    3.40   0.00 88.26
> |
> | Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> | sdl 0.00 0.00 0.00 22078.40 0.00 88313.60 8.00 0.79 0.04 0.00 0.04 0.04 78.70
> |
> | avg-cpu: %user %nice %system %iowait %steal %idle
> |           4.43  0.04    3.02    4.68   0.00 87.83
> |
> | Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> | sdl 0.00 0.00 0.00 22141.60 0.00 88566.40 8.00 0.77 0.04 0.00 0.04 0.03 77.24
> |
> | avg-cpu: %user %nice %system %iowait %steal %idle
> |           4.16  0.04    2.82    4.66   0.00 88.32
> |
> | Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> | sdl 0.00 0.00 0.00 22177.00 0.00 88708.00 8.00 0.78 0.04 0.00 0.04 0.04 78.24
> |
> | avg-cpu: %user %nice %system %iowait %steal %idle
> |           4.09  0.03    3.02   12.34   0.00 80.52
> |
> | Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> | sdl 0.00 0.00 0.00 22156.60 0.00 88626.40 8.00 0.78 0.04 0.00 0.04 0.04 78.36
> |
> | avg-cpu: %user %nice %system %iowait %steal %idle
> |           5.43  0.04    3.38    4.07   0.00 87.08
> |
> | Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> | sdl 0.00 0.00 0.00 22298.80 0.00 89195.20 8.00 0.77 0.03 0.00 0.03 0.03 77.36
> |
> | avg-cpu: %user %nice %system %iowait %steal %idle
> |           7.33  0.05    4.42    4.58   0.00 83.62
> |
> | Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> | sdl 0.00 0.00 0.00 21905.20 0.00 87620.80 8.00 0.79 0.04 0.00 0.04 0.04 79.20
> |
> | avg-cpu: %user %nice %system %iowait %steal %idle
> |           4.91  0.03    3.52    3.39   0.00 88.15
> |
> | Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
> | sdl 0.00 0.00 12.40 18629.40 92.80 74517.60 8.00 0.67 0.04 0.10 0.04 0.04 67.18
>
> I'm now running fio --filename=/dev/sdl --rw=write --bs=128k --numjobs=1 --iodepth=32 --group_reporting --name=journal-test to condition the device fully. After that I'll perform some more tests based on mixed loads.
>
> On 7 Dec 2016, at 12:20, Christian Balzer <chibi@xxxxxxx> wrote:
> >
> > I wasn't talking about abandoning Ceph up there, just that for this unit (storage and VMs) a freeze might be the better, safer option.
> > The way it's operated makes that a possibility; others will of course want/need to upgrade their clusters and keep them running as indefinitely as possible.
>
> I read your comment from a while ago about us all requiring some level of "insanity" to run mission-critical OSS storage. ;)
> We all tend to trust the OSS kernel and things like file systems (sometimes wrongly so), but the complexity of Ceph coupled with its comparatively small user base of course makes it a bigger risk.
>
> > Yup, something I did for our stuff (and not just Ceph SSDs) as well, there's a nice Nagios plugin for this.
>
> I'll see if I can get this into our collectd somehow to feed into our graphing.
>
That's also more or less straightforward, but it requires somebody to stare at things.
So it's a nice addition, but no replacement for proactive automatic monitoring and alerting.

Christian

> > Could be housekeeping, could be pagecache flushes or other XFS ops.
> > Probably best to test/compare with a standalone SSD.
>
> Hmm. I'll see when I introduce additional abstraction layers: raw device, LVM, files on XFS. After that maybe also a mixture of two concurrently running fios, one for files on XFS (OSD) and one for writing to the journal LVM.
>
> > If it were hooked up to the backplane (expander or individual connectors per drive?) with just one link/lane (6Gb/s) that would indeed be a noticeable bottleneck.
> > But I have a hard time imagining that.
> >
> > If it were with just one mini-SAS connector aka 4 lanes to an expander port, it would halve your potential bandwidth but still be more than what you're currently likely to produce there.
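For reference on the link math: a single 6Gb/s SAS2 lane nets roughly 600MB/s of payload after 8b/10b encoding, and a 4-lane mini-SAS uplink to an expander about 2.4GB/s shared across the drives behind it. Either number is far above the 90-100MB/s seen here, so the link is unlikely to be the limit until the drive itself gets much closer to its rated speed.

As for the comparison runs suggested further up, something along these lines would do. This is an untested sketch; the job names, the test directory and the block size/size/runtime values are placeholders to adjust to your environment:

  # raw device, buffered writes, no O_DIRECT/O_SYNC (the "friendliest" case)
  fio --filename=/dev/sdl --rw=write --bs=4m --numjobs=1 --iodepth=1 \
      --runtime=600 --time_based --group_reporting --name=raw-buffered-test

  # buffered writes to a file on XFS, so the only syncs come from the FS journal
  fio --directory=/srv/fio-test --size=50g --rw=write --bs=4m --numjobs=1 \
      --iodepth=1 --runtime=600 --time_based --group_reporting --name=fs-buffered-test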
> It's a far fetch, but then again it's a really big list of (some very small) things that can go wrong and screw everything up at the VM level …
>
> Cheers,
> Christian

-- 
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com