I ran an experiment with 1GB memory per OSD using Bluestore. 12.2.2 made a big difference.
In addition, you should have a look at your max object size. It looks like you will see a jump in memory usage if a particular OSD happens to be the primary for a number of objects being written in parallel. In our case reducing the number of clients reduced memory requirements. Reducing max object size should also reduce memory requirements on the OSD daemon.
Subhachandra
On Sun, Dec 10, 2017 at 1:01 PM, <ceph-users-request@xxxxxxxxxxxxxx> wrote:
Send ceph-users mailing list submissions to
ceph-users@xxxxxxxxxxxxxx
To subscribe or unsubscribe via the World Wide Web, visit
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph. com
or, via email, send a message with subject or body 'help' to
ceph-users-request@xxxxxxxxxx.com
You can reach the person managing the list at
ceph-users-owner@xxxxxxxxxx.com
When replying, please edit your Subject line so it is more specific
than "Re: Contents of ceph-users digest..."
Today's Topics:
1. Re: RBD+LVM -> iSCSI -> VMWare (Donny Davis)
2. Re: RBD+LVM -> iSCSI -> VMWare (Brady Deetz)
3. Re: RBD+LVM -> iSCSI -> VMWare (Donny Davis)
4. Re: RBD+LVM -> iSCSI -> VMWare (Brady Deetz)
5. The way to minimize osd memory usage? (shadow_lin)
6. Re: The way to minimize osd memory usage? (Konstantin Shalygin)
7. Re: The way to minimize osd memory usage? (shadow_lin)
8. Random checksum errors (bluestore on Luminous) (Martin Preuss)
9. Re: The way to minimize osd memory usage? (David Turner)
10. what's the maximum number of OSDs per OSD server? (Igor Mendelev)
11. Re: what's the maximum number of OSDs per OSD server? (Nick Fisk)
12. Re: what's the maximum number of OSDs per OSD server?
(Igor Mendelev)
13. Re: RBD+LVM -> iSCSI -> VMWare (He?in Ejdesgaard M?ller)
14. Re: Random checksum errors (bluestore on Luminous) (Martin Preuss)
15. Re: what's the maximum number of OSDs per OSD server? (Nick Fisk)
------------------------------------------------------------ ----------
Message: 1
Date: Sun, 10 Dec 2017 00:26:39 +0000
From: Donny Davis <donny@xxxxxxxxxxxxxx>
To: Brady Deetz <bdeetz@xxxxxxxxx>
Cc: Aaron Glenn <aglenn@xxxxxxxxxxxxxxxxxxxxx>, ceph-users
<ceph-users@xxxxxxxx>
Subject: Re: RBD+LVM -> iSCSI -> VMWare
Message-ID:
<CAMHmko_35Y0pRqFp89MLJCi+6Uv9BMtF=Z71pkq8YDhDR0E3Mw@ mail.gmail.com >
Content-Type: text/plain; charset="utf-8"
Just curious but why not just use a hypervisor with rbd support? Are there
VMware specific features you are reliant on?
On Fri, Dec 8, 2017 at 4:08 PM Brady Deetz <bdeetz@xxxxxxxxx> wrote:
> I'm testing using RBD as VMWare datastores. I'm currently testing with
> krbd+LVM on a tgt target hosted on a hypervisor.
>
> My Ceph cluster is HDD backed.
>
> In order to help with write latency, I added an SSD drive to my hypervisor
> and made it a writeback cache for the rbd via LVM. So far I've managed to
> smooth out my 4k write latency and have some pleasing results.
>
> Architecturally, my current plan is to deploy an iSCSI gateway on each
> hypervisor hosting that hypervisor's own datastore.
>
> Does anybody have any experience with this kind of configuration,
> especially with regard to LVM writeback caching combined with RBD?
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph. com
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.ceph.com/pipermail/ceph-users-ceph.com/ >attachments/20171210/4f055103/ attachment-0001.html
------------------------------
Message: 2
Date: Sat, 9 Dec 2017 18:56:53 -0600
From: Brady Deetz <bdeetz@xxxxxxxxx>
To: Donny Davis <donny@xxxxxxxxxxxxxx>
Cc: Aaron Glenn <aglenn@xxxxxxxxxxxxxxxxxxxxx>, ceph-users
<ceph-users@xxxxxxxx>
Subject: Re: RBD+LVM -> iSCSI -> VMWare
Message-ID:
<CADU_9qV6VVVbzxdbEBCofvON-Or9sajS-E0j_22Wf=RdRycBwQ@ mail.gmail.com >
Content-Type: text/plain; charset="utf-8"
We have over 150 VMs running in vmware. We also have 2PB of Ceph for
filesystem. With our vmware storage aging and not providing the IOPs we
need, we are considering and hoping to use ceph. Ultimately, yes we will
move to KVM, but in the short term, we probably need to stay on VMware.
On Dec 9, 2017 6:26 PM, "Donny Davis" <donny@xxxxxxxxxxxxxx> wrote:
> Just curious but why not just use a hypervisor with rbd support? Are there
> VMware specific features you are reliant on?
>
> On Fri, Dec 8, 2017 at 4:08 PM Brady Deetz <bdeetz@xxxxxxxxx> wrote:
>
>> I'm testing using RBD as VMWare datastores. I'm currently testing with
>> krbd+LVM on a tgt target hosted on a hypervisor.
>>
>> My Ceph cluster is HDD backed.
>>
>> In order to help with write latency, I added an SSD drive to my
>> hypervisor and made it a writeback cache for the rbd via LVM. So far I've
>> managed to smooth out my 4k write latency and have some pleasing results.
>>
>> Architecturally, my current plan is to deploy an iSCSI gateway on each
>> hypervisor hosting that hypervisor's own datastore.
>>
>> Does anybody have any experience with this kind of configuration,
>> especially with regard to LVM writeback caching combined with RBD?
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph. com
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.ceph.com/pipermail/ceph-users-ceph.com/ >attachments/20171209/8d02eb27/ attachment-0001.html
------------------------------
Message: 3
Date: Sun, 10 Dec 2017 01:09:39 +0000
From: Donny Davis <donny@xxxxxxxxxxxxxx>
To: Brady Deetz <bdeetz@xxxxxxxxx>
Cc: Aaron Glenn <aglenn@xxxxxxxxxxxxxxxxxxxxx>, ceph-users
<ceph-users@xxxxxxxx>
Subject: Re: RBD+LVM -> iSCSI -> VMWare
Message-ID:
<CAMHmko9bvQEcsPU3_crLeGkiiwtz5sY- >WgGHTe3T2UjBqg4xPA@mail.gmail. com
Content-Type: text/plain; charset="utf-8"
What I am getting at is that instead of sinking a bunch of time into this
bandaid, why not sink that time into a hypervisor migration. Seems well
timed if you ask me.
There are even tools to make that migration easier
http://libguestfs.org/virt-v2v.1.html
You should ultimately move your hypervisor instead of building a one off
case for ceph. Ceph works really well if you stay inside the box. So does
KVM. They work like Gang Buster's together.
I know that doesn't really answer your OP, but this is what I would do.
~D
On Sat, Dec 9, 2017 at 7:56 PM Brady Deetz <bdeetz@xxxxxxxxx> wrote:
> We have over 150 VMs running in vmware. We also have 2PB of Ceph for
> filesystem. With our vmware storage aging and not providing the IOPs we
> need, we are considering and hoping to use ceph. Ultimately, yes we will
> move to KVM, but in the short term, we probably need to stay on VMware.
> On Dec 9, 2017 6:26 PM, "Donny Davis" <donny@xxxxxxxxxxxxxx> wrote:
>
>> Just curious but why not just use a hypervisor with rbd support? Are
>> there VMware specific features you are reliant on?
>>
>> On Fri, Dec 8, 2017 at 4:08 PM Brady Deetz <bdeetz@xxxxxxxxx> wrote:
>>
>>> I'm testing using RBD as VMWare datastores. I'm currently testing with
>>> krbd+LVM on a tgt target hosted on a hypervisor.
>>>
>>> My Ceph cluster is HDD backed.
>>>
>>> In order to help with write latency, I added an SSD drive to my
>>> hypervisor and made it a writeback cache for the rbd via LVM. So far I've
>>> managed to smooth out my 4k write latency and have some pleasing results.
>>>
>>> Architecturally, my current plan is to deploy an iSCSI gateway on each
>>> hypervisor hosting that hypervisor's own datastore.
>>>
>>> Does anybody have any experience with this kind of configuration,
>>> especially with regard to LVM writeback caching combined with RBD?
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-users@xxxxxxxxxxxxxx
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph. com
>>>
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.ceph.com/pipermail/ceph-users-ceph.com/ >attachments/20171210/afb26767/ attachment-0001.html
------------------------------
Message: 4
Date: Sat, 9 Dec 2017 19:17:01 -0600
From: Brady Deetz <bdeetz@xxxxxxxxx>
To: Donny Davis <donny@xxxxxxxxxxxxxx>
Cc: Aaron Glenn <aglenn@xxxxxxxxxxxxxxxxxxxxx>, ceph-users
<ceph-users@xxxxxxxx>
Subject: Re: RBD+LVM -> iSCSI -> VMWare
Message-ID:
<CADU_9qXgqBODJc4pFGUoZuCeQfLk6d3nbh >oKa4xxPKKuB6O2VA@mail.gmail. com
Content-Type: text/plain; charset="utf-8"
That's not a bad position. I have concerns with what I'm proposing, so a
hypervisor migration may actually bring less risk than a storage
abomination.
On Dec 9, 2017 7:09 PM, "Donny Davis" <donny@xxxxxxxxxxxxxx> wrote:
> What I am getting at is that instead of sinking a bunch of time into this
> bandaid, why not sink that time into a hypervisor migration. Seems well
> timed if you ask me.
>
> There are even tools to make that migration easier
>
> http://libguestfs.org/virt-v2v.1.html
>
> You should ultimately move your hypervisor instead of building a one off
> case for ceph. Ceph works really well if you stay inside the box. So does
> KVM. They work like Gang Buster's together.
>
> I know that doesn't really answer your OP, but this is what I would do.
>
> ~D
>
> On Sat, Dec 9, 2017 at 7:56 PM Brady Deetz <bdeetz@xxxxxxxxx> wrote:
>
>> We have over 150 VMs running in vmware. We also have 2PB of Ceph for
>> filesystem. With our vmware storage aging and not providing the IOPs we
>> need, we are considering and hoping to use ceph. Ultimately, yes we will
>> move to KVM, but in the short term, we probably need to stay on VMware.
>> On Dec 9, 2017 6:26 PM, "Donny Davis" <donny@xxxxxxxxxxxxxx> wrote:
>>
>>> Just curious but why not just use a hypervisor with rbd support? Are
>>> there VMware specific features you are reliant on?
>>>
>>> On Fri, Dec 8, 2017 at 4:08 PM Brady Deetz <bdeetz@xxxxxxxxx> wrote:
>>>
>>>> I'm testing using RBD as VMWare datastores. I'm currently testing with
>>>> krbd+LVM on a tgt target hosted on a hypervisor.
>>>>
>>>> My Ceph cluster is HDD backed.
>>>>
>>>> In order to help with write latency, I added an SSD drive to my
>>>> hypervisor and made it a writeback cache for the rbd via LVM. So far I've
>>>> managed to smooth out my 4k write latency and have some pleasing results.
>>>>
>>>> Architecturally, my current plan is to deploy an iSCSI gateway on each
>>>> hypervisor hosting that hypervisor's own datastore.
>>>>
>>>> Does anybody have any experience with this kind of configuration,
>>>> especially with regard to LVM writeback caching combined with RBD?
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-users@xxxxxxxxxxxxxx
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph. com
>>>>
>>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.ceph.com/pipermail/ceph-users-ceph.com/ >attachments/20171209/e19aa6ab/ attachment-0001.html
------------------------------
Message: 5
Date: Sun, 10 Dec 2017 11:35:33 +0800
From: "shadow_lin"<shadow_lin@163.com >
To: "ceph-users"<ceph-users@lists.ceph.com >
Subject: The way to minimize osd memory usage?
Message-ID: <229639cd.27d.1603e7dff17.Coremail.shadow_lin@xxxxxxx >
Content-Type: text/plain; charset="utf-8"
Hi All,
I am testing running ceph luminous(12.2.1-249-g42172a4 (42172a443183ffe6b36e85770e53fe 678db293bf) on ARM server.
The ARM server has a two cores@1.4GHz cpu and 2GB ram and I am running 2 osd per ARM server with 2x8TB(or 2x10TB) hdd.
Now I am facing constantly oom problem.I have tried upgrade ceph(to fix osd memroy leak problem) and lower the bluestore cache setting.The oom problems did get better but still occurs constantly.
I am hoping someone can gives me some advice of the follow questions.
Is it impossible to run ceph in this config of hardware or Is it possible I can do some tunning the solve this problem(even to lose some performance to avoid the oom problem)?
Is it a good idea to use raid0 to combine the 2 HDD into one so I can only run one osd to save some memory?
How is memory usage of osd related to the size of HDD?
PS:my ceph.conf bluestore cache setting
[osd]
bluestore_cache_size = 104857600
bluestore_cache_kv_max = 67108864
osd client message size cap = 67108864
2017-12-10
lin.yunfan
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.ceph.com/pipermail/ceph-users-ceph.com/ >attachments/20171210/f096c25b/ attachment-0001.html
------------------------------
Message: 6
Date: Sun, 10 Dec 2017 11:29:23 +0700
From: Konstantin Shalygin <k0ste@xxxxxxxx>
To: ceph-users@xxxxxxxxxxxxxx
Cc: shadow_lin <shadow_lin@xxxxxxx>
Subject: Re: The way to minimize osd memory usage?
Message-ID: <1836996d-95cb-4834-d202-c61502089123@xxxxxxxx >
Content-Type: text/plain; charset=utf-8; format=flowed
> I am testing running ceph luminous(12.2.1-249-g42172a4 (42172a443183ffe6b36e85770e53fe 678db293bf) on ARM server.
Try new 12.2.2 - this release should fix memory issues with Bluestore.
------------------------------
Message: 7
Date: Sun, 10 Dec 2017 12:33:36 +0800
From: "shadow_lin"<shadow_lin@163.com >
To: "Konstantin Shalygin"<k0ste@xxxxxxxx>,
"ceph-users"<ceph-users@lists.ceph.com >
Subject: Re: The way to minimize osd memory usage?
Message-ID: <51e6e209.4ac350.1603eb32924.Coremail.shadow_lin@xxxxxxx >
Content-Type: text/plain; charset="utf-8"
The 12.2.1(12.2.1-249-g42172a4 (42172a443183ffe6b36e85770e53fe 678db293bf) we are running is with the memory issues fix.And we are working on to upgrade to 12.2.2 release to see if there is any furthermore improvement.
2017-12-10
lin.yunfan
????Konstantin Shalygin <k0ste@xxxxxxxx>
?????2017-12-10 12:29
???Re: The way to minimize osd memory usage?
????"ceph-users"<ceph-users@lists.ceph.com >
???"shadow_lin"<shadow_lin@163.com >
> I am testing running ceph luminous(12.2.1-249-g42172a4 (42172a443183ffe6b36e85770e53fe 678db293bf) on ARM server.
Try new 12.2.2 - this release should fix memory issues with Bluestore.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.ceph.com/pipermail/ceph-users-ceph.com/ >attachments/20171210/e5870ab8/ attachment-0001.html
------------------------------
Message: 8
Date: Sun, 10 Dec 2017 14:34:03 +0100
From: Martin Preuss <martin@xxxxxxxxxxxxx>
To: ceph-users@xxxxxxxxxxxxxx
Subject: Random checksum errors (bluestore on Luminous)
Message-ID: <4e50b57f-5881-e806-bb10-0d1e16e05365@xxxxxxxxxxxxx >
Content-Type: text/plain; charset="utf-8"
Hi,
I'm new to Ceph. I started a ceph cluster from scratch on DEbian 9,
consisting of 3 hosts, each host has 3-4 OSDs (using 4TB hdds, currently
totalling 10 hdds).
Right from the start I always received random scrub errors telling me
that some checksums didn't match the expected value, fixable with "ceph
pg repair".
I looked at the ceph-osd logfiles on each of the hosts and compared with
the corresponding syslogs. I never found any hardware error, so there
was no problem reading or writing a sector hardware-wise. Also there was
never any other suspicious syslog entry around the time of checksum
error reporting.
When I looked at the checksum error entries I found that the reported
bad checksum always was "0x6706be76".
Could someone please tell me where to look further for the source of the
problem?
I appended an excerpt of the osd logs.
Kind regards
Martin
--
"Things are only impossible until they're not"
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ceph-osd.log
Type: text/x-log
Size: 4645 bytes
Desc: not available
URL: <http://lists.ceph.com/pipermail/ceph-users-ceph.com/ >attachments/20171210/460992fe/ attachment-0001.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 181 bytes
Desc: OpenPGP digital signature
URL: <http://lists.ceph.com/pipermail/ceph-users-ceph.com/ >attachments/20171210/460992fe/ attachment-0001.sig
------------------------------
Message: 9
Date: Sun, 10 Dec 2017 15:05:16 +0000
From: David Turner <drakonstein@xxxxxxxxx>
To: shadow_lin <shadow_lin@xxxxxxx>
Cc: Konstantin Shalygin <k0ste@xxxxxxxx>, ceph-users
<ceph-users@xxxxxxxxxxxxxx>
Subject: Re: The way to minimize osd memory usage?
Message-ID:
<CAN-GepK8nyqRzKTTo4AVmnTqLYuXLCcWd >L_XC1LaGBPgQozQ_g@mail.gmail. com
Content-Type: text/plain; charset="utf-8"
The docs recommend 1GB/TB of OSDs. I saw people asking if this was still
accurate for bluestore and the answer was that it is more true for
bluestore than filestore. There might be a way to get this working at the
cost of performance. I would look at Linux kernel memory settings as much
as ceph and bluestore settings. Cache pressure is one that comes to mind
that an aggressive setting might help.
On Sat, Dec 9, 2017, 11:33 PM shadow_lin <shadow_lin@xxxxxxx> wrote:
> The 12.2.1(12.2.1-249-g42172a4 (42172a443183ffe6b36e85770e53fe 678db293bf)
> we are running is with the memory issues fix.And we are working on to
> upgrade to 12.2.2 release to see if there is any furthermore improvement.
>
> 2017-12-10
> ------------------------------
> lin.yunfan
> ------------------------------
>
> *????*Konstantin Shalygin <k0ste@xxxxxxxx>
> *?????*2017-12-10 12:29
> *???*Re: The way to minimize osd memory usage?
> *????*"ceph-users"<ceph-users@lists.ceph.com >
> *???*"shadow_lin"<shadow_lin@163.com >
>
>
>
> > I am testing running ceph luminous(12.2.1-249-g42172a4 (42172a443183ffe6b36e85770e53fe 678db293bf) on ARM server.
> Try new 12.2.2 - this release should fix memory issues with Bluestore.
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph. com
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.ceph.com/pipermail/ceph-users-ceph.com/ >attachments/20171210/534133c9/ attachment-0001.html
------------------------------
Message: 10
Date: Sun, 10 Dec 2017 10:38:53 -0500
From: Igor Mendelev <igmend@xxxxxxxxx>
To: ceph-users@xxxxxxxxxxxxxx
Subject: what's the maximum number of OSDs per OSD
server?
Message-ID:
<CAKtyfj_0NKQmPNO2C6CuU47xZhM_Xagm2WF4yLUdUhfSw2G7Qg@mail. >gmail.com
Content-Type: text/plain; charset="utf-8"
Given that servers with 64 CPU cores (128 threads @ 2.7GHz) and up to 2TB
RAM - as well as 12TB HDDs - are easily available and somewhat reasonably
priced I wonder what's the maximum number of OSDs per OSD server (if using
10TB or 12TB HDDs) and how much RAM does it really require if total storage
capacity for such OSD server is on the order of 1,000+ TB - is it still 1GB
RAM per TB of HDD or it could be less (during normal operations - and
extended with NVMe SSDs swap space for extra space during recovery)?
Are there any known scalability limits in Ceph Luminous (12.2.2 with
BlueStore) and/or Linux that'll make such high capacity OSD server not
scale well (using sequential IO speed per HDD as a metric)?
Thanks.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.ceph.com/pipermail/ceph-users-ceph.com/ >attachments/20171210/01aa76db/ attachment-0001.html
------------------------------
Message: 11
Date: Sun, 10 Dec 2017 16:17:40 -0000
From: Nick Fisk <nick@xxxxxxxxxx>
To: 'Igor Mendelev' <igmend@xxxxxxxxx>, ceph-users@xxxxxxxxxxxxxx
Subject: Re: what's the maximum number of OSDs per OSD
server?
Message-ID: <001d01d371d2$66f06de0$34d149a0$@fisk.me.uk>
Content-Type: text/plain; charset="utf-8"
From: ceph-users [mailto:ceph-users-bounces@lists.ceph.com ] On Behalf Of Igor Mendelev
Sent: 10 December 2017 15:39
To: ceph-users@xxxxxxxxxxxxxx
Subject: what's the maximum number of OSDs per OSD server?
Given that servers with 64 CPU cores (128 threads @ 2.7GHz) and up to 2TB RAM - as well as 12TB HDDs - are easily available and somewhat reasonably priced I wonder what's the maximum number of OSDs per OSD server (if using 10TB or 12TB HDDs) and how much RAM does it really require if total storage capacity for such OSD server is on the order of 1,000+ TB - is it still 1GB RAM per TB of HDD or it could be less (during normal operations - and extended with NVMe SSDs swap space for extra space during recovery)?
Are there any known scalability limits in Ceph Luminous (12.2.2 with BlueStore) and/or Linux that'll make such high capacity OSD server not scale well (using sequential IO speed per HDD as a metric)?
Thanks.
How many total OSD?s will you have? If you are planning on having thousands then dense nodes might make sense. Otherwise you are leaving yourself open to having a few number of very large nodes, which will likely shoot you in the foot further down the line. Also don?t forget, unless this is purely for archiving, you will likely need to scale the networking up per node, 2x10G won?t cut it when you have 10-20+ disks per node.
With Bluestore, you are probably looking at around 2-3GB of RAM per OSD, so say 4GB to be on the safe side.
7.2k HDD?s will likely only use a small proportion of a CPU core due to their limited IO potential. A would imagine that even with 90 bay JBOD?s, you will run into physical limitations before you hit CPU ones.
Without knowing your exact requirements, I would suggest that larger number of smaller nodes, might be a better idea. If you choose your hardware right, you can often get the cost down to comparable levels by not going with top of the range kit. Ie Xeon E3?s or D?s vs dual socket E5?s.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.ceph.com/pipermail/ceph-users-ceph.com/ >attachments/20171210/3f1a50cf/ attachment-0001.html
------------------------------
Message: 12
Date: Sun, 10 Dec 2017 12:37:05 -0500
From: Igor Mendelev <igmend@xxxxxxxxx>
To: nick@xxxxxxxxxx, ceph-users@xxxxxxxxxxxxxx
Subject: Re: what's the maximum number of OSDs per OSD
server?
Message-ID:
<CAKtyfj-zCAPpPANb-5S6gXet+XYX33HhOC_65FP6HrTWBKFfDw@ >mail.gmail.com
Content-Type: text/plain; charset="utf-8"
Expected number of nodes for initial setup is 10-15 and of OSDs -
1,500-2,000.
Networking is planned to be 2 100GbE or 2 dual 50GbE in x16 slots (per OSD
node).
JBODs are to be connected with 3-4 x8 SAS3 HBAs (4 4x SAS3 ports each)
Choice of hardware is done considering (non-trivial) per-server sw
licensing costs -
so small (12-24 HDD) nodes are certainly not optimal regardless of CPUs
cost (which
is estimated to be below 10% of the total cost in the setup I'm currently
considering).
EC (4+2 or 8+3 etc - TBD) - not 3x replication - is planned to be used for
most of the storage space.
Main applications are expected to be archiving and sequential access to
large (multiGB) files/objects.
Nick, which physical limitations you're referring to ?
Thanks.
On Sun, Dec 10, 2017 at 11:17 AM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> *From:* ceph-users [mailto:ceph-users-bounces@lists.ceph.com ] *On Behalf
> Of *Igor Mendelev
> *Sent:* 10 December 2017 15:39
> *To:* ceph-users@xxxxxxxxxxxxxx
> *Subject:* what's the maximum number of OSDs per OSD server?
>
>
>
> Given that servers with 64 CPU cores (128 threads @ 2.7GHz) and up to 2TB
> RAM - as well as 12TB HDDs - are easily available and somewhat reasonably
> priced I wonder what's the maximum number of OSDs per OSD server (if using
> 10TB or 12TB HDDs) and how much RAM does it really require if total storage
> capacity for such OSD server is on the order of 1,000+ TB - is it still 1GB
> RAM per TB of HDD or it could be less (during normal operations - and
> extended with NVMe SSDs swap space for extra space during recovery)?
>
>
>
> Are there any known scalability limits in Ceph Luminous (12.2.2 with
> BlueStore) and/or Linux that'll make such high capacity OSD server not
> scale well (using sequential IO speed per HDD as a metric)?
>
>
>
> Thanks.
>
>
>
> How many total OSD?s will you have? If you are planning on having
> thousands then dense nodes might make sense. Otherwise you are leaving
> yourself open to having a few number of very large nodes, which will likely
> shoot you in the foot further down the line. Also don?t forget, unless this
> is purely for archiving, you will likely need to scale the networking up
> per node, 2x10G won?t cut it when you have 10-20+ disks per node.
>
>
>
> With Bluestore, you are probably looking at around 2-3GB of RAM per OSD,
> so say 4GB to be on the safe side.
>
> 7.2k HDD?s will likely only use a small proportion of a CPU core due to
> their limited IO potential. A would imagine that even with 90 bay JBOD?s,
> you will run into physical limitations before you hit CPU ones.
>
>
>
> Without knowing your exact requirements, I would suggest that larger
> number of smaller nodes, might be a better idea. If you choose your
> hardware right, you can often get the cost down to comparable levels by not
> going with top of the range kit. Ie Xeon E3?s or D?s vs dual socket E5?s.
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.ceph.com/pipermail/ceph-users-ceph.com/ >attachments/20171210/9c3b98f0/ attachment-0001.html
------------------------------
Message: 13
Date: Sun, 10 Dec 2017 17:38:30 +0000
From: He?in Ejdesgaard M?ller <hej@xxxxxxxxx>
To: Brady Deetz <bdeetz@xxxxxxxxx>, Donny Davis <donny@xxxxxxxxxxxxxx>
Cc: Aaron Glenn <aglenn@xxxxxxxxxxxxxxxxxxxxx>, ceph-users
<ceph-users@xxxxxxxx>
Subject: Re: RBD+LVM -> iSCSI -> VMWare
Message-ID: <1512927510.642.70.camel@synack.fo >
Content-Type: text/plain; charset="UTF-8"
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256
Another option is to utilize the iscsi gateway, provided in 12.2 http://docs.ceph.com/docs/master/rbd/iscsi-overview/
Benefits:
You can EOL your old SAN wtihout having to simultaneously migrate to another hypervisor.
Any infrastructure that ties in to vSphere, is unaffected. (CEPH is just another set of datastores.)
If you have the appropriate vmware licenses etc. then your move to CEPH can be done without any downtime.
Drawback from my tests, using ceph-12.2-latest and ESXi-6.5, is that you get around 30% performance penalty, and the
latency is higher, compared to a direct rbd mount.
On ley, 2017-12-09 at 19:17 -0600, Brady Deetz wrote:
> That's not a bad position. I have concerns with what I'm proposing, so a hypervisor migration may actually bring less
> risk than a storage abomination.?
>
> On Dec 9, 2017 7:09 PM, "Donny Davis" <donny@xxxxxxxxxxxxxx> wrote:
> > What I am getting at is that instead of sinking a bunch of time into this bandaid, why not sink that time into a
> > hypervisor migration. Seems well timed if you ask me.
> >
> > There are even tools to make that migration easier
> >
> > http://libguestfs.org/virt-v2v.1.html
> >
> > You should ultimately move your hypervisor instead of building a one off case for ceph. Ceph works really well if
> > you stay inside the box. So does KVM. They work like Gang Buster's together.
> >
> > I know that doesn't really answer your OP, but this is what I would do.
> >
> > ~D
> >
> > On Sat, Dec 9, 2017 at 7:56 PM Brady Deetz <bdeetz@xxxxxxxxx> wrote:
> > > We have over 150 VMs running in vmware. We also have 2PB of Ceph for filesystem. With our vmware storage aging and
> > > not providing the IOPs we need, we are considering and hoping to use ceph. Ultimately, yes we will move to KVM,
> > > but in the short term, we probably need to stay on VMware.?
> > > On Dec 9, 2017 6:26 PM, "Donny Davis" <donny@xxxxxxxxxxxxxx> wrote:
> > > > Just curious but why not just use a hypervisor with rbd support? Are there VMware specific features you are
> > > > reliant on??
> > > >
> > > > On Fri, Dec 8, 2017 at 4:08 PM Brady Deetz <bdeetz@xxxxxxxxx> wrote:
> > > > > I'm testing using RBD?as VMWare datastores. I'm currently testing with krbd+LVM on a tgt target hosted on a
> > > > > hypervisor.
> > > > >
> > > > > My Ceph cluster is HDD backed.
> > > > >
> > > > > In order to help with write latency, I added an SSD drive to my hypervisor and made it a writeback cache for
> > > > > the rbd via LVM. So far I've managed to smooth out my 4k write latency and have some pleasing results.
> > > > >
> > > > > Architecturally, my current plan is to deploy an iSCSI gateway on each hypervisor hosting that hypervisor's
> > > > > own datastore.
> > > > >
> > > > > Does anybody have any experience with this kind of configuration, especially with regard to LVM writeback
> > > > > caching combined with RBD?
> > > > > _______________________________________________
> > > > > ceph-users mailing list
> > > > > ceph-users@xxxxxxxxxxxxxx
> > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph. com
> > > > >
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph. com
-----BEGIN PGP SIGNATURE-----
iQIzBAEBCAAdFiEElZWfRQVsNukQFi9Ko80MCbT/ An0FAlotcRYACgkQo80MCbT/
An36fQ//ULP6gwd4qUbXG3yKBHqMtcsTV76+ CfP8e3jcuEqyEzlCugoR10DXPELj
TLCnrBp4fDP5gTd1zIHcU+PMPcVJ91dBYUWoMZrSLAraM0+ 7kvNQ9Nsacsl6CsiZ
yq+506uOhwcLub55oLSpKgnaW1rEG6TAG / 6TNIBGakb2a79iC1xev16S3lJ8V7zI
cb3psUCePv/T753q/0E9B5SH9L5BiygsMT4DjiE09xGcFzH 3lqkMWm2HMCFXNogI
WbwqQVTTgk5Ch3oilz6cpOIqLK2VMkK0PPFXSGi1SAEjkw2c/ XIBykB9MclVQn+8
q5kO5g+uFcflEVnFhKTZknXVoOjrybhs4lMYm K4LJJ340Ay1uLyAlFdZdh+xAN3B
43QBKfcd1dL+EgKkMVuzGOaYOAqrFbh2/ DN5rAz3l1YUy5h3OtjrXlNU/ F7AkZfc
+UECf9wa6M7uS6DqaPMVxtLhROyMnHw + Z6jrKz7V8EamUduxQyNwOxBNIJYDmK VC
SHSkQi+oykPHWcOIXr1BNR2raaH1YVqXG+ 6mK8b6YV6sGtVeXA+KCa8RgrtabU3F
tgDW8cPkeTcPYi5BOVZeQ2OSD90A6eiC4fJbMcWVbUQim+0gSY2paoC8Rk/ HQkMF
ug8xc9Os7SXe/wEOGQAzRHjDi16eKC9JghrS7dH4JLP g4gvBn4E=
=auLW
-----END PGP SIGNATURE-----
------------------------------
Message: 14
Date: Sun, 10 Dec 2017 19:45:31 +0100
From: Martin Preuss <martin@xxxxxxxxxxxxx>
To: ceph-users@xxxxxxxxxxxxxx
Subject: Re: Random checksum errors (bluestore on
Luminous)
Message-ID: <f93ce725-a404-152e-700d-b847823b4be7@xxxxxxxxxxxxx >
Content-Type: text/plain; charset="utf-8"
Hi (again),
meanwhile I tried
"ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-0"
but that resulted in a segfault (please see attached console log).
Regards
Martin
Am 10.12.2017 um 14:34 schrieb Martin Preuss:
> Hi,
>
> I'm new to Ceph. I started a ceph cluster from scratch on DEbian 9,
> consisting of 3 hosts, each host has 3-4 OSDs (using 4TB hdds, currently
> totalling 10 hdds).
>
> Right from the start I always received random scrub errors telling me
> that some checksums didn't match the expected value, fixable with "ceph
> pg repair".
>
> I looked at the ceph-osd logfiles on each of the hosts and compared with
> the corresponding syslogs. I never found any hardware error, so there
> was no problem reading or writing a sector hardware-wise. Also there was
> never any other suspicious syslog entry around the time of checksum
> error reporting.
>
> When I looked at the checksum error entries I found that the reported
> bad checksum always was "0x6706be76".
>
> Could someone please tell me where to look further for the source of the
> problem?
>
> I appended an excerpt of the osd logs.
>
>
> Kind regards
> Martin
>
>
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph. com
>
--
"Things are only impossible until they're not"
-------------- next part --------------
A non-text attachment was scrubbed...
Name: fsck.log
Type: text/x-log
Size: 4314 bytes
Desc: not available
URL: <http://lists.ceph.com/pipermail/ceph-users-ceph.com/ >attachments/20171210/1a19349d/ attachment-0001.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 181 bytes
Desc: OpenPGP digital signature
URL: <http://lists.ceph.com/pipermail/ceph-users-ceph.com/ >attachments/20171210/1a19349d/ attachment-0001.sig
------------------------------
Message: 15
Date: Sun, 10 Dec 2017 20:32:45 -0000
From: Nick Fisk <nick@xxxxxxxxxx>
To: 'Igor Mendelev' <igmend@xxxxxxxxx>, ceph-users@xxxxxxxxxxxxxx
Subject: Re: what's the maximum number of OSDs per OSD
server?
Message-ID: <002201d371f6$09a38040$1cea80c0$@fisk.me.uk>
Content-Type: text/plain; charset="utf-8"
From: ceph-users [mailto:ceph-users-bounces@lists.ceph.com ] On Behalf Of Igor Mendelev
Sent: 10 December 2017 17:37
To: nick@xxxxxxxxxx; ceph-users@xxxxxxxxxxxxxx
Subject: Re: what's the maximum number of OSDs per OSD server?
Expected number of nodes for initial setup is 10-15 and of OSDs - 1,500-2,000.
Networking is planned to be 2 100GbE or 2 dual 50GbE in x16 slots (per OSD node).
JBODs are to be connected with 3-4 x8 SAS3 HBAs (4 4x SAS3 ports each)
Choice of hardware is done considering (non-trivial) per-server sw licensing costs -
so small (12-24 HDD) nodes are certainly not optimal regardless of CPUs cost (which
is estimated to be below 10% of the total cost in the setup I'm currently considering).
EC (4+2 or 8+3 etc - TBD) - not 3x replication - is planned to be used for most of the storage space.
Main applications are expected to be archiving and sequential access to large (multiGB) files/objects.
Nick, which physical limitations you're referring to ?
Thanks.
Hi Igor,
I guess I meant physical annoyances rather than limitations. Being able to pull out a 1 or 2U node is always much less of a chore vs dealing with several U of SAS interconnected JBOD?s.
If you have some license reason for larger nodes, then there is a very valid argument for larger nodes. Is this license cost related in some way to Ceph (I thought Redhat was capacity based) or is this some sort of collocated software? Just make sure you size the nodes to a point that if one has to be taken offline for any reason, that you are happy with the resulting state of the cluster, including the peering when suddenly taking ~200 OSD?s offline/online.
Nick
On Sun, Dec 10, 2017 at 11:17 AM, Nick Fisk <nick@xxxxxxxxxx <mailto:nick@xxxxxxxxxx> > wrote:
From: ceph-users [mailto:ceph-users-bounces@lists.ceph.com <mailto:ceph-users-bounces@lists.ceph.com > ] On Behalf Of Igor Mendelev
Sent: 10 December 2017 15:39
To: ceph-users@xxxxxxxxxxxxxx <mailto:ceph-users@xxxxxxxxxx.com >
Subject: what's the maximum number of OSDs per OSD server?
Given that servers with 64 CPU cores (128 threads @ 2.7GHz) and up to 2TB RAM - as well as 12TB HDDs - are easily available and somewhat reasonably priced I wonder what's the maximum number of OSDs per OSD server (if using 10TB or 12TB HDDs) and how much RAM does it really require if total storage capacity for such OSD server is on the order of 1,000+ TB - is it still 1GB RAM per TB of HDD or it could be less (during normal operations - and extended with NVMe SSDs swap space for extra space during recovery)?
Are there any known scalability limits in Ceph Luminous (12.2.2 with BlueStore) and/or Linux that'll make such high capacity OSD server not scale well (using sequential IO speed per HDD as a metric)?
Thanks.
How many total OSD?s will you have? If you are planning on having thousands then dense nodes might make sense. Otherwise you are leaving yourself open to having a few number of very large nodes, which will likely shoot you in the foot further down the line. Also don?t forget, unless this is purely for archiving, you will likely need to scale the networking up per node, 2x10G won?t cut it when you have 10-20+ disks per node.
With Bluestore, you are probably looking at around 2-3GB of RAM per OSD, so say 4GB to be on the safe side.
7.2k HDD?s will likely only use a small proportion of a CPU core due to their limited IO potential. A would imagine that even with 90 bay JBOD?s, you will run into physical limitations before you hit CPU ones.
Without knowing your exact requirements, I would suggest that larger number of smaller nodes, might be a better idea. If you choose your hardware right, you can often get the cost down to comparable levels by not going with top of the range kit. Ie Xeon E3?s or D?s vs dual socket E5?s.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.ceph.com/pipermail/ceph-users-ceph.com/ >attachments/20171210/1e954b89/ attachment-0001.html
------------------------------
Subject: Digest Footer
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph. com
------------------------------
End of ceph-users Digest, Vol 59, Issue 9
*****************************************
_______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com