Re: Ceph cache tier and rbd volumes/SSD primary, HDD replica crush rule!

What are the recommended specs for an SSD used for journaling? It's a little bit tricky now to move the spinners' journals onto them, because I already have data on them.

I currently have all HDD journals on separate SSDs. The problem is that when I first built the cluster I assigned one journal SSD to 8x 4TB HDDs. Now I see that is too many spinners for one SSD.

So I am planning to assign one journal SSD to 4 OSDs, which also gives a bit of extra redundancy (if one journal SSD dies it only takes 4 OSDs with it, not 8).

Do sequential read/write specs matter, or do IOPS matter more? The journal SSDs I have now are, I believe, Intel 520s (240GB, not that great write speeds but high IOPS), and I have a couple of spares of the same type that I can use for journaling.

Also, what size should the journal partition be for one 4TB OSD? I have them set at 5GB now (the default that ceph-deploy creates).
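
For reference, this is roughly what the relevant bit of my ceph.conf looks like (5120 MB being the ceph-deploy default, I believe):

    [osd]
    osd journal size = 5120    # MB; size of the journal partition ceph-deploy creates

I've seen the docs suggest sizing it at roughly 2 x (disk throughput x filestore max sync interval), but I'm not sure whether that is still the recommendation.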



2016-01-12 21:43 GMT+02:00 Robert LeBlanc <robert@xxxxxxxxxxxxx>:

We are using cache tiering in two production clusters at the moment.
One cluster is running in forward mode due to the excessive
promotion/demotion. I've got Nick's patch backported to Hammer and am
running it through the test suite. Once it passes, I'll create a PR so
it hopefully makes it into the next Hammer release.

In response to your question about journals: once we introduced the
SSD cache tier, we moved our spindle journals off of the SSDs and onto
the spindles themselves. We found that the load on the spindles was a
fraction of what it was before the cache tier. When we started up a host
(five spindle journals on the same SSD as the cache pool) we would see
very long start-up times for the OSDs because the SSD was a bottleneck
for the recovery of so many OSDs. We are also finding that even though
the Micron M600 drives perform "good enough" under steady state, there
isn't as much headroom as there is on the Intel S3610s we are also
evaluating (6-8x less I/O time for the same IOPS on the S3610s compared
to the M600s). Being at the limits of the M600 may also contribute to
the inability of our busiest production clusters to run in writeback
mode permanently.

If your spindle pool is completely fronted by an SSD pool (or at least
your busy pools are; we don't front our glance pool, for instance), I'd
say keep the configuration simple and co-locate the journal on the
spindle.
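
For what it's worth, with ceph-deploy co-locating is just the prepare
call without a separate journal device, something like the following
(host and device names are placeholders):

    # journal co-located on the same disk as the OSD data
    ceph-deploy osd prepare node1:/dev/sdb

    # versus a dedicated journal device/partition
    ceph-deploy osd prepare node1:/dev/sdb:/dev/sdf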
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Tue, Jan 12, 2016 at 10:27 AM, Mihai Gheorghe <mcapsali@xxxxxxxxx> wrote:
> One more question. Seeing that the cache tier holds data on it until it
> reaches the configured % ratio, I suppose I must set replication to 2 or
> higher on the cache pool so I don't lose hot data that hasn't been written
> to the cold storage yet in case of a drive failure, right?
>
> Also, will there be any performance penalty if I put the OSD journal on the
> same SSD as the OSD? I now have one SSD dedicated to journaling the SSD
> OSDs. I know that in the case of mechanical drives this is a problem!
>
> And thank you for clearing these things up for me.
>
> 2016-01-12 18:03 GMT+02:00 Nick Fisk <nick@xxxxxxxxxx>:
>>
>> > -----Original Message-----
>> > From: Mihai Gheorghe [mailto:mcapsali@xxxxxxxxx]
>> > Sent: 12 January 2016 15:42
>> > To: Nick Fisk <nick@xxxxxxxxxx>; ceph-users@xxxxxxxxxxxxxx
>> > Subject: Re: Ceph cache tier and rbd volumes/SSD primary,
>> > HDD
>> > replica crush rule!
>> >
>> >
>> > 2016-01-12 17:08 GMT+02:00 Nick Fisk <nick@xxxxxxxxxx>:
>> > > -----Original Message-----
>> > > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf
>> > Of
>> > > Mihai Gheorghe
>> > > Sent: 12 January 2016 14:56
>> > > To: Nick Fisk <nick@xxxxxxxxxx>; ceph-users@xxxxxxxxxxxxxx
>> > > Subject: Re: Ceph cache tier and rbd volumes/SSD primary,
>> > HDD
>> > > replica crush rule!
>> > >
>> > > Thank you very much for the quick answer.
>> > >
>> > > I suppose the cache tier works the same way for object storage as well!?
>> >
>> > Yes, exactly the same. The cache is actually at the object layer anyway,
>> > so it works the same. You can actually pin/unpin objects from the cache
>> > as well if you are using it at the object level.
>> >
>> > https://github.com/ceph/ceph/pull/6326
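>> >
>> > Once that lands, I believe the rados CLI gets pin/unpin operations for
>> > this, something along the lines of (pool and object names here are just
>> > placeholders):
>> >
>> > rados -p cache-pool cache-pin my-object
>> > rados -p cache-pool cache-unpin my-object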
>> > >
>> > > How is a delete of a cinder volume handled? I ask because after the
>> > > volume got flushed to the cold storage, I then deleted it from cinder.
>> > > It got deleted from the cache pool as well, but on the HDD pool, when
>> > > issuing "rbd -p <pool> ls", the volumes were gone while the space was
>> > > still used (probably rados data) until I manually issued a flush
>> > > command on the cache pool (I didn't wait too long to see if the space
>> > > would be cleared on its own). It is probably a misconfiguration on my
>> > > end though.
>> >
>> > Ah yes, this is one of my pet hates. It's actually slightly worse than
>> > what you describe. All the objects have to be promoted into the cache
>> > tier to be deleted and then afterwards flushed to remove them from the
>> > base tier as well. For a large image, this can actually take quite a
>> > long time. Hopefully this will be fixed at some point; I don't believe
>> > it would be too difficult to fix.
>> >
>> > I assume this is done automatically and there is no need for a manual
>> > flush, only if in a hurry, right?
>> > What if the image is larger than the whole cache pool? I assume the image
>> > will be promoted in smaller objects into the cache pool before deletion.
>> > I can live with the extra time it takes to delete a volume from the cold
>> > storage. My only grudge is with the extra network load from the extra
>> > step of loading the image into the cache tier to be deleted (the SSD used
>> > for the cache pool resides on a different host), as I don't have 10Gb
>> > ports, only 1Gb (six of them on every host, in LACP mode).
>>
>> Yes this is fine, the objects will just get promoted until the cache is
>> full and then the deleted ones will be flushed out, and so on. The only
>> problem is that it causes cache pollution, as it will force other objects
>> out of the cache. Like I said, it's not the end of the world, but very
>> annoying.
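>>
>> If you don't want to wait for the agent, you should be able to flush and
>> evict the cache pool by hand with something like the following
>> ('cache-pool' being a placeholder for your cache pool name):
>>
>> rados -p cache-pool cache-flush-evict-all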
>>
>> >
>> > >
>> > > In your opinion, is the cache tier ready for production? I have read
>> > > that bcache (flashcache?) is often used instead of the cache tier, but
>> > > it is not that simple to set up and there are disadvantages there as
>> > > well.
>> >
>> > See my recent posts about cache tiering; there is a fairly major bug
>> > which limits performance if your working set doesn't fit in the cache.
>> > Assuming you are running the patch for this bug and you can live with
>> > the deletion problem above... then yes, I would say that it's usable in
>> > production. I'm planning to enable it on the production pool in my
>> > cluster in the next couple of weeks.
>> >
>> > I'm sorry, I'm a bit new to the Ceph mailing list. Where can I see your
>> > recent posts? I really need to check that patch out!
>> >
>>
>> Here is the patch; it's in master and is in the process of being
>> backported to Hammer. I think for Infernalis you will need to manually
>> patch and build.
>>
>>
>> https://github.com/zhouyuan/ceph/commit/8ffb4fba2086f5758a3b260c05d16552e995c452
>>
>>
>> > >
>> > > Also, is there a problem if I add a cache tier to an already existing
>> > > pool that has data on it? Or should the pool be empty prior to adding
>> > > the cache tier?
>> >
>> > Nope, that should be fine.
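>> >
>> > For the record, it's just the usual tiering commands run against the
>> > existing pool (the pool names below are placeholders):
>> >
>> > ceph osd tier add cold-pool cache-pool
>> > ceph osd tier cache-mode cache-pool writeback
>> > ceph osd tier set-overlay cold-pool cache-pool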
>> >
>> >
>> > I was asking this because I have a 5TB cinder volume with data on it
>> > (files mostly >3GB in size). I added a cache tier to the pool that holds
>> > the volume, and I can see chaotic behaviour from my W2012 instance:
>> > deleting files takes a very long time and not all subdirectories work (I
>> > get a "directory not found" error on directories with many small files).
>>
>> This could be related to the patch I mentioned. Without it, no matter what
>> the promote recency settings are set to, objects will be promoted on
>> almost every read/write, which can quickly overload the cluster with
>> promotions/evictions, as even small FS reads cause 4MB promotions. After
>> the patch, Ceph will obey the settings.
>>
>> So you can set, for example:
>>
>> hit_set_count = 10
>> hit_set_period = 60
>> read_recency = 3
>> write_recency = 5
>>
>> This will generate a new hit set every minute and keep 10 of them. If the
>> last 3 hit sets contain the object, it will be promoted on that read
>> request; if the last 5 hit sets contain the object, it will be promoted on
>> the write request.
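>>
>> On the command line that would be something along these lines
>> ('cache-pool' is a placeholder; the recency options are actually called
>> min_read_recency_for_promote and min_write_recency_for_promote, and I
>> believe the write one only has an effect with the patch applied):
>>
>> ceph osd pool set cache-pool hit_set_type bloom
>> ceph osd pool set cache-pool hit_set_count 10
>> ceph osd pool set cache-pool hit_set_period 60
>> ceph osd pool set cache-pool min_read_recency_for_promote 3
>> ceph osd pool set cache-pool min_write_recency_for_promote 5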
>>
>>
>> >
>> > >
>>
>> > > 2016-01-12 16:30 GMT+02:00 Nick Fisk <nick@xxxxxxxxxx>:
>> > > > -----Original Message-----
>> > > > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On
>> > Behalf
>> > > Of
>> > > > Mihai Gheorghe
>> > > > Sent: 12 January 2016 14:25
>> > > > To: ceph-users@xxxxxxxxxxxxxx
>> > > > Subject: Ceph cache tier and rbd volumes/SSD primary,
>> > > > HDD
>> > > > replica crush rule!
>> > > >
>> > > > Hello,
>> > > >
>> > > > I have a question about how the cache tier works with RBD volumes!?
>> > > >
>> > > > So I created a pool of SSDs for cache and a pool of HDDs for cold
>> > > > storage that acts as the backend for cinder volumes. I create a
>> > > > volume in cinder from an image and spawn an instance. The volume is
>> > > > created in the cache pool as expected, and it will be flushed to the
>> > > > cold storage after a period of inactivity or after the cache pool
>> > > > reaches 40% full, as I understand it.
>> > >
>> > > The cache won't be flushed after inactivity; the cache agent only
>> > > works on % full (either # of objects or bytes).
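>> > >
>> > > The knobs for that live on the cache pool itself, roughly like this
>> > > (the pool name and values are just examples):
>> > >
>> > > ceph osd pool set cache-pool target_max_bytes 200000000000
>> > > ceph osd pool set cache-pool cache_target_dirty_ratio 0.4
>> > > ceph osd pool set cache-pool cache_target_full_ratio 0.8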
>> > >
>> > > >
>> > > > Now, after the volume is flushed to the HDDs and I make a read or
>> > > > write request from the guest OS, how does Ceph handle it? Does it
>> > > > pull the whole RBD volume from the cold storage back into the cache
>> > > > pool, or only the chunk of it that the guest OS request touches?
>> > >
>> > > The cache works on hot objects, so particular objects (normally 4MB)
>> > > of the RBD will be promoted/demoted over time depending on access
>> > > patterns.
>> > >
>> > > >
>> > > > Also, is the replication in Ceph synchronous or async? If I set a
>> > > > CRUSH rule to use the SSD host as primary and the HDD host for the
>> > > > replica, would the writes and reads on the SSDs be slowed down by
>> > > > the replication on the mechanical drives?
>> > > > Would this configuration be viable? (I ask because I don't have
>> > > > enough SSDs to make a pool of size 3 on them.)
>> > >
>> > > It's sync replication. If you have a very heavy read workload, you can
>> > > do what you suggest and set the SSD OSD to be the primary copy for
>> > > each PG; writes will still be limited to the speed of the spinning
>> > > disks, but reads will be serviced from the SSDs. However, there is a
>> > > risk in degraded scenarios that your performance could drop
>> > > dramatically if more IO is diverted to the spinning disks.
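>> > >
>> > > The usual way to express that is a CRUSH rule with two take/emit
>> > > steps, roughly like the below, assuming you have separate "ssd" and
>> > > "hdd" roots in your CRUSH map (untested sketch; the ruleset number is
>> > > arbitrary):
>> > >
>> > > rule ssd-primary {
>> > >         ruleset 5
>> > >         type replicated
>> > >         min_size 1
>> > >         max_size 10
>> > >         step take ssd
>> > >         step chooseleaf firstn 1 type host
>> > >         step emit
>> > >         step take hdd
>> > >         step chooseleaf firstn -1 type host
>> > >         step emit
>> > > }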
>> > >
>> > > >
>> > > > Thank you!
>> >
>>
>>
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
