Hi,
Response inline below.
On 20 Apr 2016 7:45 a.m., "Christian Balzer" <chibi@xxxxxxx> wrote:
>
>
> Hello,
>
> On Wed, 20 Apr 2016 03:42:00 +0000 Stephen Lord wrote:
>
> >
> > OK, you asked ;-)
> >
>
> I certainly did. ^o^
>
> > This is all via RBD, I am running a single filesystem on top of 8 RBD
> > devices in an effort to get data striping across more OSDs, I had been
> > using that setup before adding the cache tier.
> >
> Nods.
> Depending on your use case (sequential writes), actual RADOS striping might
> be more advantageous than this setup, where consecutive 4MB writes still go
> to the same PG/OSD all the time.
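(As a hedged aside: with RBD that could look roughly like the following --
image name and sizes are made up, and the kernel RBD client of that era may
not support non-default striping, so this mostly applies to librbd users:

   rbd create video01 --pool rbd --size 102400 --image-format 2 \
       --stripe-unit 1048576 --stripe-count 8

i.e. 1MB stripe units spread round-robin across 8 objects, rather than
filling one 4MB object before moving on.)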
>
>
> > 3 nodes with eleven 6 Tbyte SATA drives each for a base RBD pool; this is
> > set up with replication size 3. No SSDs involved in those OSDs, since
> > ceph-disk does not let you break a bluestore configuration into more
> > than one device at the moment.
> >
> That's a pity, but supposedly just a limitation of ceph-disk.
> I'd venture you can work around that with symlinks to a raw SSD
> partition, same as with current filestore journals.
>
> As Sage recently wrote:
> ---
> BlueStore can use as many as three devices: one for the WAL (journal,
> though it can be much smaller than FileStores, e.g., 128MB), one for
> metadata (e.g., an SSD partition), and one for data.
> ---
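(Untested sketch of that symlink workaround for BlueStore -- OSD id, paths
and partition UUIDs are hypothetical, and the links would have to be in
place before the OSD's mkfs so BlueStore picks them up:

   ln -s /dev/disk/by-partuuid/<wal-part-uuid> /var/lib/ceph/osd/ceph-12/block.wal
   ln -s /dev/disk/by-partuuid/<db-part-uuid>  /var/lib/ceph/osd/ceph-12/block.db

leaving the large HDD partition as the "block" data device.)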
I believe he also mentioned the use of bcache and friends for the OSD; maybe a way forward in this case?
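Something along these lines, perhaps (device names purely hypothetical, one
NVMe partition caching one SATA disk):

   make-bcache -C /dev/nvme0n1p1 -B /dev/sdb    # should expose /dev/bcache0
   ceph-disk prepare --bluestore /dev/bcache0   # OSD then lives on the cached device

No idea yet how well bluestore behaves on top of bcache, though.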
Regards
Josef
>
> > The 600 Mbytes/sec is an approximate sustained number for the data rate I
> > can get going into this pool via RBD; that turns into three times as much
> > raw data, so across 33 drives that is mid-50s Mbytes/sec per drive. I have
> > pushed it harder than that from time to time, but the OSD really wants to
> > use fdatasync a lot, and that tends to eat up a lot of a device's
> > potential; these disks will do 160 Mbytes/sec if you stream data to them.
> >
> > I just checked with rados bench against this set of 33 OSDs with a
> > 3-replica pool, and 600 Mbytes/sec is what it will do from the same
> > client host.
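(For reference, that sort of check is just the stock benchmark, e.g.
something like the following -- pool name and runtime are examples only:

   rados bench -p rbd 60 write -b 4194304 -t 32

i.e. 4MB writes with 32 in flight from the client host.)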
> >
> This matches a cluster of mine with 32 OSDs (filestore of course) and SSD
> journals on 4 nodes with a replica of 3.
>
> So BlueStore is indeed faster than filestore.
>
> > All the networking is 40 Gb Ethernet, single port per host. Generally I
> > can push 2.2 Gbytes/sec in one direction between two hosts over a single
> > TCP link; the max I have seen is about 2.7 Gbytes/sec coming into a
> > node. Short of going to RDMA that appears to be about the limit for
> > these processors.
> >
> Yeah, didn't expect your network to be involved here bottleneck wise, but
> a good data point to have nevertheless.
>
> > There are a grand total of two 400 GB P3700s, which are running a pool
> > with a replication factor of 1; these are in 2 other nodes. Once I add in
> > replication, performance goes downhill. If I had more hardware I would be
> > running more of these and using replication, but I am out of network
> > cards right now.
> >
> Alright, so at 900MB/s you're pretty close to what one would expect from 2
> of these: 1080MB/s*2/2(journal).
>
> How much downhill is that?
>
> I have a production cache tier with 2 nodes (replica 2 of course) and 4
> 800GB DC S3610s each, IPoIB QDR (40Gbs) interconnect and the performance
> is pretty much what I would expect.
>
> > So 5 nodes running OSDs, and a 6th node running the RBD client using the
> > kernel implementation.
> >
> I assume there's a reason for using the kernel RBD client (which kernel?),
> given that it tends to be behind the curve in terms of features and speed?
>
> > Complete set of commands for creating the cache tier. I pulled this from
> > history; the line in the middle was actually a failed command, so sorry
> > for the red herring.
> >
> > 982 ceph osd pool create nvme 512 512 replicated_nvme
> > 983 ceph osd pool set nvme size 1
> > 984 ceph osd tier add rbd nvme
> > 985 ceph osd tier cache-mode nvme writeback
> > 986 ceph osd tier set-overlay rbd nvme
> > 987 ceph osd pool set nvme hit_set_type bloom
> > 988 ceph osd pool set target_max_bytes 500000000000   <<-- typo here, so never mind
> > 989 ceph osd pool set nvme target_max_bytes 500000000000
> > 990 ceph osd pool set nvme target_max_objects 500000
> > 991 ceph osd pool set nvme cache_target_dirty_ratio 0.5
> > 992 ceph osd pool set nvme cache_target_full_ratio 0.8
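(A quick way to confirm which of those actually stuck is to read them back;
the typo'd line 988 would simply have errored out, so nothing was set by it:

   ceph osd pool get nvme target_max_bytes
   ceph osd pool get nvme cache_target_dirty_ratio
   ceph osd pool get nvme cache_target_full_ratio

All of the above should return the values from the history listing.)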
> >
> > I wish the cache tier would raise a health warning if it does not have
> > a max size set; as it is, it lets you do that, flushes nothing, and fills
> > the OSDs.
> >
> Oh yes, people have been bitten by this over and over again.
> At least it's documented now.
>
> > As for what the actual test is, this is 4K uncompressed DPX video frames,
> > so 50 Mbyte files written at a rate of at least 24 per second on a good
> > day, ideally more. This needs to sustain around 1.3 Gbytes/sec in either
> > direction from a single application and needs to do it consistently.
> > There is a certain amount of buffering to deal with fluctuations in perf.
> > I am pushing 4096 of these files sequentially with a queue depth of 32,
> > so there is rather a lot of data in flight at any one time. I know I do
> > not have enough hardware to achieve this rate on writes.
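(Roughly approximated with fio, purely as a sketch -- the mount point and
tuning are hypothetical and this ignores the real application's file naming:

   fio --name=dpx --directory=/mnt/rbdfs --ioengine=libaio --direct=1 \
       --rw=write --bs=2m --iodepth=32 --nrfiles=4096 --filesize=50m \
       --file_service_type=sequential

4096 files of 50MB written sequentially with up to 32 requests in flight.)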
> >
> So this is your test AND actual intended use case I presume, right?
>
> > They are being written with direct I/O into a pool of 8 RBD LUNs. The 8
> > LUN setup will not really help here with the small number of OSDs in the
> > cache pool; it does help when the RBD LUNs are going directly to a large
> > pool of disk based OSDs, as it gets all the OSDs moving in parallel.
> >
> > My basic point here is that there is a lot more potential bandwidth to
> > be had in the backing pool, but I cannot get the cache tier to use more
> > than a small fraction of the available bandwidth when flushing content.
> > Since the front end of the cache can sustain around 900 Mbytes/sec over
> > RBD, I am somewhat out of balance here:
> >
> > cache input rate 900 Mbytes/sec
> > backing pool input rate 600 Mbytes/sec
> >
> > But not by a significant amount.
> >
> > The question is really whether there is anything I can do to get cache
> > flushing to take advantage of more of the bandwidth. If I do this
> > without the cache tier then the latency of the disk based OSDs is too
> > variable and you cannot sustain a consistent data rate.
>
> This should hopefully be reduced by having WAL and metadata on SSDs with
> bluestore.
> But HDD based storage will be more jittery, that's a given.
>
> > The NVMe devices
> > are better about consistent device latency, but the cache tier
> > implementation seems to have a problem driving the backing pool at
> > anything close to its capabilities. It really only needs to move 40 or
> > 50 objects in parallel to achieve that.
> >
> And this is clearly where the Ceph (cache-tier) code could probably use
> some attention.
> Mind you, most people have the exact OPPOSITE requirement of yours: they
> want a steady, slow, low-impact stream of flushes to the HDDs, not an
> avalanche. ^o^
>
> We know that this is all based on per-PG ratios, so if one PG goes over
> the dirty ratio the tier-agent will start flushing objects from it;
> hence the singular spikes you're seeing.
>
> My bet is that this process is sequential and not parallel (or at least
> not massively so), meaning that until one PG has finished flushing,
> the next dirty'ish one won't start flushing.
> Would be nice to have this confirmed by somebody familiar with the code,
> though.
>
> Of course having fast OSDs in the backing pool will alleviate this
> somewhat.
>
> What's your cache_target_dirty_high_ratio set to and have you tried
> setting it below/identical to cache_target_dirty_ratio?
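(For reference, a hedged example of what that would look like, values only
as an illustration:

   ceph osd pool get nvme cache_target_dirty_high_ratio
   ceph osd pool set nvme cache_target_dirty_high_ratio 0.5

With cache_target_dirty_ratio also at 0.5, the agent should switch to its
high-speed flush behaviour as soon as it starts flushing at all.)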
>
> Christian
>
> > I am not attempting to provision a cache tier large enough for the whole
> > workload, but more of a debounce zone to avoid jitter making it back
> > to the application. I am trying to categorize what can and cannot be
> > achieved with Ceph here for this type of workload, not build a complete
> > production setup. My test represents 170 seconds of content and
> > generates 209 Gbytes of data, so this is a small-scale test ;-)
> > Fortunately this stuff is not always used in real time.
> >
> > All of those extra config options look to be around how fast promotion
> > into the cache can go, not how fast you can get things out of it :-(
> >
> > I have been using readforward and that is working OK; there is
> > sufficient read bandwidth that it does not matter whether data is coming
> > from the cache pool or the disk backing pool.
> >
> > Steve
> >
> >
> > > On Apr 19, 2016, at 7:47 PM, Christian Balzer <chibi@xxxxxxx> wrote:
> > >
> > >
> > > Hello,
> > >
> > > On Tue, 19 Apr 2016 20:21:39 +0000 Stephen Lord wrote:
> > >
> > >>
> > >>
> > >> I have a setup using some Intel P3700 devices as a cache tier, and 33
> > >> SATA drives hosting the pool behind them.
> > >
> > > A bit more detail about the setup would be nice, as in how many nodes,
> > > interconnect, replication size of the cache tier and the backing HDD
> > > pool, etc.
> > > And "some" isn't a number, how many P3700s (which size?) in how many
> > > nodes? One assumes there are no further SSDs involved with those SATA
> > > HDDs?
> >
> > >
> > >> I setup the cache tier with
> > >> writeback, gave it a size and max object count etc:
> > >>
> > >> ceph osd pool set target_max_bytes 500000000000
> > > ^^^
> > > This should have given you an error, it needs the pool name, as in your
> > > next line.
> > >
> > >> ceph osd pool set nvme target_max_bytes 500000000000
> > >> ceph osd pool set nvme target_max_objects 500000
> > >> ceph osd pool set nvme cache_target_dirty_ratio 0.5
> > >> ceph osd pool set nvme cache_target_full_ratio 0.8
> > >>
> > >> This is all running Jewel using bluestore OSDs (I know experimental).
> > > Make sure to report all pyrotechnics, trap doors and sharp edges. ^_-
> > >
> > >> The cache tier will write at about 900 Mbytes/sec and read at 2.2
> > >> Gbytes/sec; the SATA pool can take writes at about 600 Mbytes/sec in
> > >> aggregate.
> > > ^^^^^^^^^
> > > Key word there.
> > >
> > > That's just 18MB/s per HDD (about 55MB/s with a replication of 3), a pretty
> > > disappointing result for the supposedly twice as fast BlueStore.
> > > Again, replication size and topology might explain that up to a point,
> > > but we don't know them (yet).
> > >
> > > Also exact methodology of your tests please, i.e. the fio command
> > > line, how was the RBD device (if you tested with one) mounted and
> > > where, etc...
> > >
> > >> However, it looks like the mechanism for cleaning the cache
> > >> down to the disk layer is being massively rate limited and I see about
> > >> 47 Mbytes/sec of read activity from each SSD while this is going on.
> > >>
> > > This number is meaningless w/o knowing how many NVMes you have.
> > > That being said, there are 2 levels of flushing past Hammer, but if you
> > > push the cache tier to the 2nd limit (cache_target_dirty_high_ratio)
> > > you will get full speed.
> > >
> > >> This means that while I could be pushing data into the cache at high
> > >> speed, It cannot evict old content very fast at all, and it is very
> > >> easy to hit the high water mark and the application I/O drops
> > >> dramatically as it becomes throttled by how fast the cache can flush.
> > >>
> > >> I suspect it is operating on a placement group at a time so ends up
> > >> targeting a very limited number of objects and hence disks at any one
> > >> time. I can see individual disk drives going busy for very short
> > >> periods, but most of them are idle at any one point in time. The only
> > >> way to drive the disk based OSDs fast is to hit a lot of them at once
> > >> which would mean issuing many cache flush operations in parallel.
> > >>
> > > Yes, it is all PG based, so your observations match the expectations
> > > and what everybody else is seeing.
> > > See also the thread "Cache tier operation clarifications" by me,
> > > version 2 is in the works.
> > > There are also some new knobs in Jewel that may be helpful, see:
> > > http://www.spinics.net/lists/ceph-users/msg25679.html
> > >
> > > If you have a use case with a clearly defined idle/low use time and a
> > > small enough growth in dirty objects, consider what I'm doing:
> > > dropping the cache_target_dirty_ratio a few percent (in my case 2-3%
> > > is enough for a whole day) via cron job, waiting a bit and then raising
> > > it again to its normal value.
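(As a sketch -- pool name, times and values purely illustrative:

   # crontab on a node with admin credentials
   0 2 * * * ceph osd pool set <cachepool> cache_target_dirty_ratio 0.47
   0 4 * * * ceph osd pool set <cachepool> cache_target_dirty_ratio 0.50

i.e. drop the ratio a few percent during the quiet window so flushing
happens then, and restore it before the busy period starts.)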
> > >
> > > That way flushes won't normally happen at all during your peak usage
> > > times, though in my case that's purely cosmetic, flushes are not
> > > problematic at any time in that cluster currently.
> > >
> > >> Are there any controls which can influence this behavior?
> > >>
> > > See above (cache_target_dirty_high_ratio).
> > >
> > > Aside from that, you might want to reflect on what your use case and
> > > workload are going to be and how well your testing represents them.
> > >
> > > As in, are you really going to write MASSIVE amounts of data at very
> > > high speeds or is it (like in 90% of common cases) the amount of small
> > > write IOPS that is really going to be the limiting factor.
> > > Which is something that cache tiers can deal with very well (or
> > > sufficiently large and well designed "plain" clusters).
> > >
> > > Another thing to think about is using the "readforward" cache mode,
> > > leaving your cache tier free to just handle writes and thus giving it
> > > more space to work with.
> > >
> > > Christian
> > > --
> > > Christian Balzer Network/Systems Engineer
> > > chibi@xxxxxxx Global OnLine Japan/Rakuten Communications
> > > http://www.gol.com/
> >
> >
>
>
> --
> Christian Balzer Network/Systems Engineer
> chibi@xxxxxxx Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com