Re: journal or cache tier on SSDs ?

Re,

>>>> I'd like some advice about the setup of a new Ceph cluster. Here is the
>>>> use case:
>>>>
>>>> RadosGW (S3, and maybe Swift for Hadoop/Spark) will be the main usage.
>>>> Most of the access will be read-only. Writes will only be done by the
>>>> admin to update the datasets.
>>>>
>>>> We might use rbd at times to sync data as temp storage (when POSIX is
>>>> needed), but performance will not be an issue there. We might use cephfs
>>>> in the future if it can replace a filesystem on rbd.
>>>>
>>>> We're going to start with 16 nodes (up to 24). The configuration of each
>>>> node is:
>>>>
>>>> CPU : 2 x Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz (12c/24t each)
>>>> Memory : 128GB
>>>> OS Storage : 2 x SSD 240GB Intel S3500 DC (raid 1)
>>>
>>> Dedicated OS SSDs aren't really needed; I tend to share OS and
>>> cache/journal SSDs.
>>> That's of course with more durable (S3610) models.
>>
>> I already have those 24 servers running 2 Ceph clusters for testing right
>> now, so I cannot change anything. We were thinking about sharing the
>> journals, but as I mention below, the MONs will be on the storage servers,
>> so sharing the leveldb and the journals on the same SSD might use too much I/O.
>>
> Not really, the journal is sequential writes, the leveldb small, fast
> IOPS. Both of them on the same (decent) SSD should be fine.
> 
> But as your HW is fixed, let's not speculate about that.

Ok.

>>> Since you didn't mention dedicated MON nodes: if you plan to put MONs on
>>> the storage servers, make sure to have fast SSDs in them for the leveldb
>>> (again, DC S36xx or 37xx).
>>
>> Yes, the MONs will be shared with the storage servers. The MONs use the
>> 240GB SSDs for the leveldb right now.
>>
> Note that the lowest IP(s) become the MON leader, so if you put RADOSGW
> and other things on the storage nodes as well, spread things out
> accordingly.

Yes, for sure, we're going to spread the services over the nodes. The 3 RadosGW
daemons won't be on the MON nodes.
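
For reference, once the MONs are up, I suppose the leader can be checked with
something like this (field name from memory, so double-check the output):

---
ceph quorum_status --format json-pretty | grep quorum_leader_name
---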

>>> This will also free up 2 more slots in your (likely Supermicro) chassis
>>> for OSD HDDs.
>>
>> It's not a Supermicro enclosure, it's an Intel one with 12 3.5" slots in
>> front and 2 2.5" slots in the back, so I cannot add more disks. The 240GB
>> SSDs are in front.
>
> That sounds like an SM chassis. ^o^
> In fact, I can't find a chassis on Intel's page with 2 back 2.5" slots.

http://www.colfax-intl.com/nd/images/systems/servers/R2208WT-rear.gif

>>>> Journal or cache Storage : 2 x SSD 400GB Intel S3300 DC (no RAID)
>>>
>>> These SSDs do not exist according to the Intel site and the only
>>> references I can find for them are on "no longer available" European
>>> sites.
>>
>> I made a mistake, it's not 400 but 480GB. smartctl gives me model
>> SSDSC2BB480H4.
>>
> OK, that's not good.
> Firstly, that model number still doesn't get us any hits from Intel,
> strangely enough.
> 
> Secondly, it is 480GB (instead of 400, which denotes overprovisioning) and
> matches the 3510 480GB model up to the last 2 characters.
> And that has an endurance of 275TBW, not something you want to use for
> either journals or cache pools.

I see. Here is the information from the reseller:

"The S3300 series is the OEM version of S3510 and 1:1 the same drive"

>>> Since you're in the land of rich chocolate bankers, I assume that this
>>> model is something that just happened in Europe.
>>
>> I'm just a poor sysadmin with expensive toys in a university ;)
>>
> I know, I recognized the domain. ^.^

:)

>>> Without knowing the specifications for these SSDs, I can't recommend
>>> them. I'd use DC S3610 or 3710 instead; this very much depends on how
>>> much endurance (TBW) you need.
>>
>> As I wrote above, I already have those SSDs, so I'm looking for the best
>> setup with the hardware I have.
>>
> 
> Unless they have at least an endurance of 3 DWPD like the 361x (and their
> model number, size and the 3300 naming suggests they do NOT), your 480GB
> SSDs aren't suited for intense Ceph usage.
> 
> How much have you used them yet and what is their smartctl status, in
> particular these values (from a 800GB DC S3610 in my cache pool):
> ---
> 232 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
> 233 Media_Wearout_Indicator 0x0032   100   100   000    Old_age   Always       -       0
> 241 Host_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       869293
> 242 Host_Reads_32MiB        0x0032   100   100   000    Old_age   Always       -       43435
> 243 NAND_Writes_32MiB       0x0032   100   100   000    Old_age   Always       -       1300884
> ---
> 
> Not even 1% down after 40TBW, at which point your SSDs are likely to be
> 15% down...

More or less the same values on the 10 hosts I have in my beta cluster:

232 Available_Reservd_Space 0x0033 100 100 010 Pre-fail Always - 0
233 Media_Wearout_Indicator 0x0032 100 100 000 Old_age  Always - 0
241 Total_LBAs_Written      0x0032 100 100 000 Old_age  Always - 233252
242 Total_LBAs_Read         0x0032 100 100 000 Old_age  Always - 13
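
(For the archives: the values above come from something like the command
below, assuming /dev/sdb is one of those SSDs. The raw counters are in units
of 32 MiB on these Intel models, though attribute names and units can differ
per model/firmware, so check your own output.)

---
smartctl -A /dev/sdb | egrep '232|233|241|242|243'
# raw value x 32 MiB, e.g. for Christian's attribute 241:
#   869293 * 32 MiB / 1024 / 1024 =~ 26.5 TiB written by the host
# and for attribute 243: 1300884 * 32 MiB =~ 40 TiB written to NAND
# for comparison, 3 DWPD on a 480GB drive over 5 years:
#   3 * 480 GB * 365 * 5 =~ 2.6 PBW, vs 275 TBW for the S3510
---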

>>>> OSD Disk : 10 x HGST ultrastar-7k6000 6TB
>>>> Public Network : 1 x 10Gb/s
>>>> Private Network : 1 x 10Gb/s
>>>> OS : Ubuntu 16.04
>>>> Ceph version : Jewel
>>>>
>>>> The question is: journal or cache tier (read-only) on the SSD 400GB
>>>> Intel S3300 DC?
>>>>
>>> You said read-only, or read-mostly up there. 
>>
>> I mean, I'm thinking about using the cache tier for read operations; no
>> write operations are going to use it. I don't know yet which mode I'm
>> going to use, I have to do some tests.
>>
> As I said, your HDDs are unlikely to be slower (for sufficient parallel
> access, not short, sequential reads) than those SSDs.

Ok.

>>> So why journals (only helpful for writes) or cache tiers (your 2 SSDs
>>> may not be faster than your 10 HDDs for reads) at all?
>>
>> We expect heavy read access at times, so we're thinking about a cache tier
>> on SSD to speed up throughput and decrease the I/O pressure on the disks.
>> I might be wrong about that.
>>
> Unless it is repetitive reads that all fit into the cache, probably not.
> Reads that need to be promoted to the cache are actually slower than
> direct ones.

Makes sense.

>>> Mind, if you have the money, go for it!
>>
>> I don't have the money, I have the hardware :)
>>
>>>> Each disk is able to write sequentially at 220MB/s. The SSDs can write
>>>> at ~500MB/s. If we put 5 journals on each SSD, the SSDs will still be
>>>> the bottleneck (1GB/s vs 2GB/s).
>>>
>>> Your filestore based OSDs will never write Ceph data at 220MB/s, 100
>>> would be pushing it. 
>>> So no, your journal SSDs won't be the limiting factor, though 5
>>> journals on one SSD is pushing my comfort zone when it comes to SPoFs. 
>>>
>>>> If we put the journals on the OSDs, can we expect good read throughput
>>>> from the disks (in case data is not in the cache), and writes that
>>>> aren't too bad either, even with random reads hitting the OSDs during
>>>> writes?
>>>>
>>>> Using the SSDs as a cache tier seems to be a better use than just 5
>>>> journals on each? Is that correct?
>>>>
>>> Potentially, depends on your actual usage.
>>>
>>> Again, since you said read-mostly, the question with a cache-tier
>>> becomes, how much of your truly hot data can fit into it?
>>
>> That's the biggest point: many datasets will fit into the cache, but some
>> of them will definitely be too big (100TB+). In that case, though, our
>> users know what's going on.
>>
> 
> With "correct" configuration of Jewel, you may be able to keep those huge
> datasets out of the cache altogether. 

That would be great!
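
(If anyone lands on this thread later: I assume that's done with the hit-set /
recency knobs on the cache pool, something like the sketch below. The pool
name "cache" is a placeholder and the values are untested; the idea is that a
single pass over a huge dataset never shows up in enough hit sets to get
promoted.)

---
ceph osd pool set cache hit_set_type bloom
ceph osd pool set cache hit_set_count 4
ceph osd pool set cache hit_set_period 1200
# require an object to appear in 2 recent hit sets before a read promotes it
ceph osd pool set cache min_read_recency_for_promote 2
---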

>>> Remember that super-hot objects are likely to come from the pagecache
>>> of the storage node in question anyway.
>>
>> Yes I know that.
>>
>>> If you do care about fast writes after all, consider de-coupling writes
>>> and reads as much as possible.
>>
>> Write operations will only be done by the admins for dataset updates.
>> Those updates will be planned according to the usage of the cluster and
>> scheduled during low-usage periods.
> 
> Good, so the scheme below might work for you, at least the flushing of
> dirty data part.

Ok, that's exactly what I thought.

>>> As in, set your cache to "readforward" (undocumented, google for it),
>>> so all un-cached reads will go to the HDDs (they CAN read at near full
>>> speed), while all writes will go to the cache pool (and eventually to the
>>> HDDs; you can time that by lowering the dirty ratio during off-peak
>>> hours).
>>
>> I'm going to take a look at that, thanks for the tips.
>>
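
For the record, if I understand it right, that would be along these lines
(the pool name "cache" is a placeholder again, and readforward may want
--yes-i-really-mean-it depending on the exact version):

---
ceph osd tier cache-mode cache readforward
# during off-peak hours, lower the dirty ratio so writes get flushed to HDDs
ceph osd pool set cache cache_target_dirty_ratio 0.1
---
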
>>>> We're going to use an EC pool for big files (jerasure 8+2, I think) and
>>>> a replicated pool for small files.
>>>>
>>>> If I check http://ceph.com/pgcalc/ for this use case:
>>>>
>>>> replicated pool: pg_num = 8192 for 160 OSDs but 16384 for 240 OSDs
>>>> EC pool: pg_num = 4096
>>>> and pgp_num = pg_num
>>>>
>>>> Should I set pg_num to 8192 or 16384? What is the impact on the
>>>> cluster if we set pg_num to 16384 at the beginning? 16384 is
>>>> high, isn't it?
>>>>
>>> If 24 nodes is the absolute limit of your cluster, you want to set the
>>> target PGs per OSD to 100 in the calculator, which gives you 8192 again.
>>>
>>> Keep in mind that splitting PGs is an expensive operation, so if 24
>>> isn't a hard upper limit, you might be better off starting big.
>>
>> Yes, I did some tests on that; it's definitely an expensive operation :)
>>
>> Thanks for that really useful answer.
> 
> No worries, hopefully you can get some other SSDs for journals/cache pools.

I can't expect new SSDs right now, so I'm going to have to discuss with my
colleagues what we can do.
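
In the meantime, for anyone following along, my current plan for the pools
would look roughly like the sketch below. Jewel-era syntax; the profile and
pool names are placeholders, and the pg counts are the pgcalc numbers
discussed above:

---
# failure domain = host, so the 8+2 chunks land on 10 different nodes
ceph osd erasure-code-profile set ec-8-2 k=8 m=2 ruleset-failure-domain=host
ceph osd pool create big-files 4096 4096 erasure ec-8-2
ceph osd pool create small-files 8192 8192 replicated
---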

Thanks.

-- 
Yoann Moulin
EPFL IC-IT
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


