Re: Ceph cache tier and rbd volumes/SSD primary, HDD replica crush rule!


 



Yes, I would recommend that you match the replication levels of the cache and base pools, although as SSDs can rebuild faster, there is an argument that you might be able to get away with 2x replication for them.
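For what it's worth, matching the replication levels is just a couple of pool settings; a sketch, assuming a cache pool named ssd-cache (substitute your own pool name) and a 3x base pool:

```shell
# Match the cache pool's replication to the base pool.
# "size" is the replica count, "min_size" the minimum number of
# replicas that must be up for the pool to serve I/O.
ceph osd pool set ssd-cache size 3
ceph osd pool set ssd-cache min_size 2
```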

 

Yes, it's fine for the journals to sit on the same SSD as the data. There is a slight performance penalty, assuming you are using decent SSDs, but this only shows towards the extremes and wouldn't be very noticeable in day-to-day operations.

 

From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Mihai Gheorghe
Sent: 12 January 2016 17:27
To: Nick Fisk <nick@xxxxxxxxxx>; ceph-users@xxxxxxxxxxxxxx
Subject: Re: [ceph-users] Ceph cache tier and rbd volumes/SSD primary, HDD replica crush rule!

 

One more question. Seeing that the cache tier holds data on it until it reaches the % full ratio, I suppose I must set replication to 2 or higher on the cache pool, so as not to lose hot data that hasn't yet been written to cold storage in case of a drive failure, right?

 

Also, will there be any performance penalty if I put the OSD journal on the same SSD as the OSD? I now have one SSD dedicated to journaling the SSD OSDs. I know that in the case of mechanical drives this is a problem!

 

And thank you for clearing these things up for me.

 

2016-01-12 18:03 GMT+02:00 Nick Fisk <nick@xxxxxxxxxx>:

> -----Original Message-----
> From: Mihai Gheorghe [mailto:mcapsali@xxxxxxxxx]
> Sent: 12 January 2016 15:42
> To: Nick Fisk <nick@xxxxxxxxxx>; ceph-users@xxxxxxxxxxxxxx
> Subject: Re: Ceph cache tier and rbd volumes/SSD primary, HDD
> replica crush rule!
>
>
> 2016-01-12 17:08 GMT+02:00 Nick Fisk <nick@xxxxxxxxxx>:
> > -----Original Message-----
> > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf
> Of
> > Mihai Gheorghe
> > Sent: 12 January 2016 14:56
> > To: Nick Fisk <nick@xxxxxxxxxx>; ceph-users@xxxxxxxxxxxxxx
> > Subject: Re: Ceph cache tier and rbd volumes/SSD primary,
> HDD
> > replica crush rule!
> >
> > Thank you very much for the quick answer.
> >
> > I suppose cache tier works the same way for object storage as well!?
>
> Yes, exactly the same; the cache actually operates at the object layer
> anyway. You can also pin/unpin objects in the cache if you are using it
> at the object level.
>
> https://github.com/ceph/ceph/pull/6326
> >
> > How is a delete of a Cinder volume handled? I ask because after the
> > volume got flushed to cold storage, I deleted it from Cinder. It got
> > deleted from the cache pool as well, but on the HDD pool, when issuing
> > rbd -p ls, the volumes were gone yet the space was still used (probably
> > RADOS data) until I manually ran a flush command on the cache pool (I
> > didn't wait long to see if the space would be cleared in time). It is
> > probably a misconfiguration on my end though.
>
> Ah yes, this is one of my pet hates. It's actually slightly worse than what you
> describe: all the objects have to be promoted into the cache tier to be
> deleted and then afterwards flushed, to remove them from the base tier as
> well. For a large image, this can take quite a long time. Hopefully this
> will be fixed at some point; I don't believe it would be too difficult to fix.
>
> I assume this is done automatically with no need for a manual flush, only if in a
> hurry, right?
> What if the image is larger than the whole cache pool? I assume the image
> will be promoted in smaller objects into the cache pool before deletion.
> I can live with the extra time to delete a volume from cold storage. My
> only grudge is the extra network load from the extra step of loading the
> image into the cache tier to be deleted (the SSD used for the cache pool resides on
> a different host), as I don't have 10Gb ports, only 1Gb, six of them on every
> host in LACP mode.

Yes, this is fine; the objects will just get promoted until the cache is full, then the deleted ones will be flushed out, and so on. The only problem is that it causes cache pollution, as it will force other objects out of the cache. Like I said, it's not the end of the world, but very annoying.
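If you do want the space back sooner rather than waiting for the agent, the cache pool can be flushed and evicted by hand; for example (pool name is illustrative):

```shell
# Flush dirty objects down to the base tier and evict clean copies
# from the cache pool (here named ssd-cache; substitute your own).
rados -p ssd-cache cache-flush-evict-all
```

Note this empties the whole cache, so expect cold reads to re-promote objects afterwards.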

>
> >
> > In your opinion, is cache tiering ready for production? I have read that bcache
> > (flashcache?) is used in favour of cache tiering, but it is not that simple to
> > set up, and there are disadvantages there as well.
>
> See my recent posts about cache tiering; there is a fairly major bug which
> limits performance if your working set doesn't fit in the cache. Assuming
> you are running the patch for this bug and you can live with the deletion
> problem above..... then yes, I would say that it's usable in production. I'm
> planning to enable it on the production pool in my cluster in the next couple
> of weeks.
>
> I'm sorry, I'm a bit new to the Ceph mailing list. Where can I see your recent
> posts? I really need to check that patch out!
>

Here is the patch; it's in master and is in the process of being backported to Hammer. I think for Infernalis you will need to manually patch and build.

https://github.com/zhouyuan/ceph/commit/8ffb4fba2086f5758a3b260c05d16552e995c452


> >
> > Also, is there a problem if I add a cache tier to an already existing pool
> > that has data on it? Or should the pool be empty prior to adding the cache
> > tier?
>
> Nope, that should be fine.
>
>
> I was asking this because I have a 5TB Cinder volume with data on it (mostly
> >3Gb in size). I added a cache tier to the pool that holds the volume and I can
> see chaotic behaviour from my W2012 instance, as in deleting files takes a
> very long time and not all subdirectories work (I get an error of not finding
> the directory with many small files).

This could be related to the patch I mentioned. Without it, no matter what the promote recency settings are set to, objects will be promoted at almost every read/write. After the patch, ceph will obey the settings. This can quickly overload the cluster with promotions/evictions as even small FS reads will cause 4MB promotions.

So you can set, for example:

hit_set_count = 10
hit_set_period = 60
min_read_recency_for_promote = 3
min_write_recency_for_promote = 5

This will generate a new hit set every minute and keep 10 of them. If the last 3 hit sets contain the object, it will be promoted on that read request; if the last 5 hit sets contain the object, it will be promoted on the write request.
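Applied via the CLI, those hit-set settings would look something like the following (the pool name ssd-cache is an example):

```shell
# Hit-set tuning on the cache pool (ssd-cache is a placeholder name).
ceph osd pool set ssd-cache hit_set_type bloom   # bloom filter hit sets
ceph osd pool set ssd-cache hit_set_count 10     # keep 10 hit sets
ceph osd pool set ssd-cache hit_set_period 60    # new hit set every 60s
ceph osd pool set ssd-cache min_read_recency_for_promote 3
ceph osd pool set ssd-cache min_write_recency_for_promote 5
```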



>
> >
> > 2016-01-12 16:30 GMT+02:00 Nick Fisk <nick@xxxxxxxxxx>:
> > > -----Original Message-----
> > > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On
> Behalf
> > Of
> > > Mihai Gheorghe
> > > Sent: 12 January 2016 14:25
> > > To: ceph-users@xxxxxxxxxxxxxx
> > > Subject: Ceph cache tier and rbd volumes/SSD primary, HDD
> > > replica crush rule!
> > >
> > > Hello,
> > >
> > > I have a question about how cache tier works with rbd volumes!?
> > >
> > > So I created a pool of SSDs for cache and a pool on HDDs for cold storage
> > > that acts as a backend for Cinder volumes. I create a volume in Cinder
> > > from an image and spawn an instance. The volume is created in the cache
> > > pool as expected, and it will be flushed to cold storage after a period
> > > of inactivity or after the cache pool reaches 40% full, as I understand.
> >
> > The cache won't be flushed after inactivity; the cache agent only works on
> > % full (either # of objects or bytes).
> >
> > >
> > > Now, after the volume is flushed to the HDDs and I make a read or write
> > > request in the guest OS, how does Ceph handle it? Does it promote the
> > > whole RBD volume from cold storage to the cache pool, or only the chunk
> > > of it where the request is made from the guest OS?
> >
> > The cache works on hot objects, so particular objects (normally 4MB) of the
> > RBD will be promoted/demoted over time depending on access patterns.
> >
> > >
> > > Also, is the replication in Ceph synchronous or async? If I set a CRUSH
> > > rule to use the SSD host as primary and the HDD host for replication,
> > > would the writes and reads on the SSDs be slowed down by the replication
> > > on the mechanical drives?
> > > Would this configuration be viable? (I ask this because I don't have the
> > > number of SSDs to make a pool of size 3 on them.)
> >
> > It's synchronous replication. If you have a very heavy read workload, you
> > can do what you suggest and set the SSD OSDs to be the primary copy for
> > each PG; writes will still be limited to the speed of the spinning disks,
> > but reads will be serviced from the SSDs. However, there is a risk in
> > degraded scenarios that your performance could drop dramatically if more
> > IO is diverted to the spinning disks.
> >
> > >
> > > Thank you!
>

 


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
