Re: The way to minimize osd memory usage?

Subhachandra Chandra <schandra@xxxxxxxxxxxx> · Mon, 11 Dec 2017 10:08:08 -0800

I ran an experiment with 1GB memory per OSD using Bluestore. 12.2.2 made a big difference. 

In addition, you should have a look at your max object size. It looks like you will see a jump in memory usage if a particular OSD happens to be the primary for a number of objects being written in parallel. In our case reducing the number of clients reduced memory requirements. Reducing max object size should also reduce memory requirements on the OSD daemon.

Subhachandra

On Sun, Dec 10, 2017 at 1:01 PM,  <ceph-users-request@xxxxxxxxxxxxxx> wrote:
Send ceph-users mailing list submissions to

        ceph-users@xxxxxxxxxxxxxx

To subscribe or unsubscribe via the World Wide Web, visit

        http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

or, via email, send a message with subject or body 'help' to

        ceph-users-request@xxxxxxxxxx.com

You can reach the person managing the list at

        ceph-users-owner@xxxxxxxxxx.com

When replying, please edit your Subject line so it is more specific

than "Re: Contents of ceph-users digest..."

Today's Topics:

   1. Re: RBD+LVM -> iSCSI -> VMWare (Donny Davis)

   2. Re: RBD+LVM -> iSCSI -> VMWare (Brady Deetz)

   3. Re: RBD+LVM -> iSCSI -> VMWare (Donny Davis)

   4. Re: RBD+LVM -> iSCSI -> VMWare (Brady Deetz)

   5. The way to minimize osd memory usage? (shadow_lin)

   6. Re: The way to minimize osd memory usage? (Konstantin Shalygin)

   7. Re: The way to minimize osd memory usage? (shadow_lin)

   8. Random checksum errors (bluestore on Luminous) (Martin Preuss)

   9. Re: The way to minimize osd memory usage? (David Turner)

  10. what's the maximum number of OSDs per OSD server? (Igor Mendelev)

  11. Re: what's the maximum number of OSDs per OSD server? (Nick Fisk)

  12. Re: what's the maximum number of OSDs per OSD server?

      (Igor Mendelev)

  13. Re: RBD+LVM -> iSCSI -> VMWare (He?in Ejdesgaard M?ller)

  14. Re: Random checksum errors (bluestore on Luminous) (Martin Preuss)

  15. Re: what's the maximum number of OSDs per OSD server? (Nick Fisk)

----------------------------------------------------------------------

Message: 1

Date: Sun, 10 Dec 2017 00:26:39 +0000

From: Donny Davis <donny@xxxxxxxxxxxxxx>

To: Brady Deetz <bdeetz@xxxxxxxxx>

Cc: Aaron Glenn <aglenn@xxxxxxxxxxxxxxxxxxxxx>, ceph-users

        <ceph-users@xxxxxxxx>

Subject: Re:  RBD+LVM -> iSCSI -> VMWare

Message-ID:

        <CAMHmko_35Y0pRqFp89MLJCi+6Uv9BMtF=Z71pkq8YDhDR0E3Mw@mail.gmail.com>

Content-Type: text/plain; charset="utf-8"

Just curious but why not just use a hypervisor with rbd support? Are there

VMware specific features you are reliant on?

On Fri, Dec 8, 2017 at 4:08 PM Brady Deetz <bdeetz@xxxxxxxxx> wrote:

> I'm testing using RBD as VMWare datastores. I'm currently testing with

> krbd+LVM on a tgt target hosted on a hypervisor.

>

> My Ceph cluster is HDD backed.

>

> In order to help with write latency, I added an SSD drive to my hypervisor

> and made it a writeback cache for the rbd via LVM. So far I've managed to

> smooth out my 4k write latency and have some pleasing results.

>

> Architecturally, my current plan is to deploy an iSCSI gateway on each

> hypervisor hosting that hypervisor's own datastore.

>

> Does anybody have any experience with this kind of configuration,

> especially with regard to LVM writeback caching combined with RBD?

> _______________________________________________

> ceph-users mailing list

> ceph-users@xxxxxxxxxxxxxx

> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

>

-------------- next part --------------

An HTML attachment was scrubbed...

URL: <http://lists.ceph.com/pipermail/ceph-users-ceph.com/attachments/20171210/4f055103/attachment-0001.html>

------------------------------

Message: 2

Date: Sat, 9 Dec 2017 18:56:53 -0600

From: Brady Deetz <bdeetz@xxxxxxxxx>

To: Donny Davis <donny@xxxxxxxxxxxxxx>

Cc: Aaron Glenn <aglenn@xxxxxxxxxxxxxxxxxxxxx>, ceph-users

        <ceph-users@xxxxxxxx>

Subject: Re:  RBD+LVM -> iSCSI -> VMWare

Message-ID:

        <CADU_9qV6VVVbzxdbEBCofvON-Or9sajS-E0j_22Wf=RdRycBwQ@mail.gmail.com>

Content-Type: text/plain; charset="utf-8"

We have over 150 VMs running in vmware. We also have 2PB of Ceph for

filesystem. With our vmware storage aging and not providing the IOPs we

need, we are considering and hoping to use ceph. Ultimately, yes we will

move to KVM, but in the short term, we probably need to stay on VMware.

On Dec 9, 2017 6:26 PM, "Donny Davis" <donny@xxxxxxxxxxxxxx> wrote:

> Just curious but why not just use a hypervisor with rbd support? Are there

> VMware specific features you are reliant on?

>

> On Fri, Dec 8, 2017 at 4:08 PM Brady Deetz <bdeetz@xxxxxxxxx> wrote:

>

>> I'm testing using RBD as VMWare datastores. I'm currently testing with

>> krbd+LVM on a tgt target hosted on a hypervisor.

>>

>> My Ceph cluster is HDD backed.

>>

>> In order to help with write latency, I added an SSD drive to my

>> hypervisor and made it a writeback cache for the rbd via LVM. So far I've

>> managed to smooth out my 4k write latency and have some pleasing results.

>>

>> Architecturally, my current plan is to deploy an iSCSI gateway on each

>> hypervisor hosting that hypervisor's own datastore.

>>

>> Does anybody have any experience with this kind of configuration,

>> especially with regard to LVM writeback caching combined with RBD?

>> _______________________________________________

>> ceph-users mailing list

>> ceph-users@xxxxxxxxxxxxxx

>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

>>

>

-------------- next part --------------

An HTML attachment was scrubbed...

URL: <http://lists.ceph.com/pipermail/ceph-users-ceph.com/attachments/20171209/8d02eb27/attachment-0001.html>

------------------------------

Message: 3

Date: Sun, 10 Dec 2017 01:09:39 +0000

From: Donny Davis <donny@xxxxxxxxxxxxxx>

To: Brady Deetz <bdeetz@xxxxxxxxx>

Cc: Aaron Glenn <aglenn@xxxxxxxxxxxxxxxxxxxxx>, ceph-users

        <ceph-users@xxxxxxxx>

Subject: Re:  RBD+LVM -> iSCSI -> VMWare

Message-ID:

        <CAMHmko9bvQEcsPU3_crLeGkiiwtz5sY-WgGHTe3T2UjBqg4xPA@mail.gmail.com>

Content-Type: text/plain; charset="utf-8"

What I am getting at is that instead of sinking a bunch of time into this

bandaid, why not sink that time into a hypervisor migration. Seems well

timed if you ask me.

There are even tools to make that migration easier

http://libguestfs.org/virt-v2v.1.html

You should ultimately move your hypervisor instead of building a one off

case for ceph. Ceph works really well if you stay inside the box. So does

KVM. They work like Gang Buster's together.

I know that doesn't really answer your OP, but this is what I would do.

~D

On Sat, Dec 9, 2017 at 7:56 PM Brady Deetz <bdeetz@xxxxxxxxx> wrote:

> We have over 150 VMs running in vmware. We also have 2PB of Ceph for

> filesystem. With our vmware storage aging and not providing the IOPs we

> need, we are considering and hoping to use ceph. Ultimately, yes we will

> move to KVM, but in the short term, we probably need to stay on VMware.

> On Dec 9, 2017 6:26 PM, "Donny Davis" <donny@xxxxxxxxxxxxxx> wrote:

>

>> Just curious but why not just use a hypervisor with rbd support? Are

>> there VMware specific features you are reliant on?

>>

>> On Fri, Dec 8, 2017 at 4:08 PM Brady Deetz <bdeetz@xxxxxxxxx> wrote:

>>

>>> I'm testing using RBD as VMWare datastores. I'm currently testing with

>>> krbd+LVM on a tgt target hosted on a hypervisor.

>>>

>>> My Ceph cluster is HDD backed.

>>>

>>> In order to help with write latency, I added an SSD drive to my

>>> hypervisor and made it a writeback cache for the rbd via LVM. So far I've

>>> managed to smooth out my 4k write latency and have some pleasing results.

>>>

>>> Architecturally, my current plan is to deploy an iSCSI gateway on each

>>> hypervisor hosting that hypervisor's own datastore.

>>>

>>> Does anybody have any experience with this kind of configuration,

>>> especially with regard to LVM writeback caching combined with RBD?

>>> _______________________________________________

>>> ceph-users mailing list

>>> ceph-users@xxxxxxxxxxxxxx

>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

>>>

>>

-------------- next part --------------

An HTML attachment was scrubbed...

URL: <http://lists.ceph.com/pipermail/ceph-users-ceph.com/attachments/20171210/afb26767/attachment-0001.html>

------------------------------

Message: 4

Date: Sat, 9 Dec 2017 19:17:01 -0600

From: Brady Deetz <bdeetz@xxxxxxxxx>

To: Donny Davis <donny@xxxxxxxxxxxxxx>

Cc: Aaron Glenn <aglenn@xxxxxxxxxxxxxxxxxxxxx>, ceph-users

        <ceph-users@xxxxxxxx>

Subject: Re:  RBD+LVM -> iSCSI -> VMWare

Message-ID:

        <CADU_9qXgqBODJc4pFGUoZuCeQfLk6d3nbhoKa4xxPKKuB6O2VA@mail.gmail.com>

Content-Type: text/plain; charset="utf-8"

That's not a bad position. I have concerns with what I'm proposing, so a

hypervisor migration may actually bring less risk than a storage

abomination.

On Dec 9, 2017 7:09 PM, "Donny Davis" <donny@xxxxxxxxxxxxxx> wrote:

> What I am getting at is that instead of sinking a bunch of time into this

> bandaid, why not sink that time into a hypervisor migration. Seems well

> timed if you ask me.

>

> There are even tools to make that migration easier

>

> http://libguestfs.org/virt-v2v.1.html

>

> You should ultimately move your hypervisor instead of building a one off

> case for ceph. Ceph works really well if you stay inside the box. So does

> KVM. They work like Gang Buster's together.

>

> I know that doesn't really answer your OP, but this is what I would do.

>

> ~D

>

> On Sat, Dec 9, 2017 at 7:56 PM Brady Deetz <bdeetz@xxxxxxxxx> wrote:

>

>> We have over 150 VMs running in vmware. We also have 2PB of Ceph for

>> filesystem. With our vmware storage aging and not providing the IOPs we

>> need, we are considering and hoping to use ceph. Ultimately, yes we will

>> move to KVM, but in the short term, we probably need to stay on VMware.

>> On Dec 9, 2017 6:26 PM, "Donny Davis" <donny@xxxxxxxxxxxxxx> wrote:

>>

>>> Just curious but why not just use a hypervisor with rbd support? Are

>>> there VMware specific features you are reliant on?

>>>

>>> On Fri, Dec 8, 2017 at 4:08 PM Brady Deetz <bdeetz@xxxxxxxxx> wrote:

>>>

>>>> I'm testing using RBD as VMWare datastores. I'm currently testing with

>>>> krbd+LVM on a tgt target hosted on a hypervisor.

>>>>

>>>> My Ceph cluster is HDD backed.

>>>>

>>>> In order to help with write latency, I added an SSD drive to my

>>>> hypervisor and made it a writeback cache for the rbd via LVM. So far I've

>>>> managed to smooth out my 4k write latency and have some pleasing results.

>>>>

>>>> Architecturally, my current plan is to deploy an iSCSI gateway on each

>>>> hypervisor hosting that hypervisor's own datastore.

>>>>

>>>> Does anybody have any experience with this kind of configuration,

>>>> especially with regard to LVM writeback caching combined with RBD?

>>>> _______________________________________________

>>>> ceph-users mailing list

>>>> ceph-users@xxxxxxxxxxxxxx

>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

>>>>

>>>

-------------- next part --------------

An HTML attachment was scrubbed...

URL: <http://lists.ceph.com/pipermail/ceph-users-ceph.com/attachments/20171209/e19aa6ab/attachment-0001.html>

------------------------------

Message: 5

Date: Sun, 10 Dec 2017 11:35:33 +0800

From: "shadow_lin"<shadow_lin@163.com>

To: "ceph-users"<ceph-users@lists.ceph.com>

Subject:  The way to minimize osd memory usage?

Message-ID: <229639cd.27d.1603e7dff17.Coremail.shadow_lin@xxxxxxx>

Content-Type: text/plain; charset="utf-8"

Hi All,

I am testing running ceph luminous(12.2.1-249-g42172a4 (42172a443183ffe6b36e85770e53fe678db293bf) on ARM server.

The ARM server has a two cores@1.4GHz cpu and 2GB ram and I am running 2 osd per ARM server with 2x8TB(or 2x10TB) hdd.

Now I am facing constantly oom problem.I have tried upgrade ceph(to fix osd memroy leak problem) and lower the bluestore  cache setting.The oom problems did get better but still occurs constantly.

I am hoping someone can gives me some advice of the follow questions.

Is it impossible to run ceph in this config of hardware or Is it possible I can do some tunning the solve this problem(even to lose some performance to avoid the oom problem)?

Is it a good idea to use raid0 to combine the 2 HDD into one so I can only run one osd to save some memory?

How is memory usage of osd related to the size of HDD?

PS:my ceph.conf bluestore cache setting

[osd]

        bluestore_cache_size = 104857600

        bluestore_cache_kv_max = 67108864

        osd client message size cap = 67108864

2017-12-10

lin.yunfan

-------------- next part --------------

An HTML attachment was scrubbed...

URL: <http://lists.ceph.com/pipermail/ceph-users-ceph.com/attachments/20171210/f096c25b/attachment-0001.html>

------------------------------

Message: 6

Date: Sun, 10 Dec 2017 11:29:23 +0700

From: Konstantin Shalygin <k0ste@xxxxxxxx>

To: ceph-users@xxxxxxxxxxxxxx

Cc: shadow_lin <shadow_lin@xxxxxxx>

Subject: Re:  The way to minimize osd memory usage?

Message-ID: <1836996d-95cb-4834-d202-c61502089123@xxxxxxxx>

Content-Type: text/plain; charset=utf-8; format=flowed

> I am testing running ceph luminous(12.2.1-249-g42172a4 (42172a443183ffe6b36e85770e53fe678db293bf) on ARM server.

Try new 12.2.2 - this release should fix memory issues with Bluestore.

------------------------------

Message: 7

Date: Sun, 10 Dec 2017 12:33:36 +0800

From: "shadow_lin"<shadow_lin@163.com>

To: "Konstantin Shalygin"<k0ste@xxxxxxxx>,

        "ceph-users"<ceph-users@lists.ceph.com>

Subject: Re:  The way to minimize osd memory usage?

Message-ID: <51e6e209.4ac350.1603eb32924.Coremail.shadow_lin@xxxxxxx>

Content-Type: text/plain; charset="utf-8"

The 12.2.1(12.2.1-249-g42172a4 (42172a443183ffe6b36e85770e53fe678db293bf) we are running is with the memory issues fix.And we are working on to upgrade to 12.2.2 release to see if there is any furthermore improvement.

2017-12-10

lin.yunfan

????Konstantin Shalygin <k0ste@xxxxxxxx>

?????2017-12-10 12:29

???Re:  The way to minimize osd memory usage?

????"ceph-users"<ceph-users@lists.ceph.com>

???"shadow_lin"<shadow_lin@163.com>

> I am testing running ceph luminous(12.2.1-249-g42172a4 (42172a443183ffe6b36e85770e53fe678db293bf) on ARM server.

Try new 12.2.2 - this release should fix memory issues with Bluestore.

-------------- next part --------------

An HTML attachment was scrubbed...

URL: <http://lists.ceph.com/pipermail/ceph-users-ceph.com/attachments/20171210/e5870ab8/attachment-0001.html>

------------------------------

Message: 8

Date: Sun, 10 Dec 2017 14:34:03 +0100

From: Martin Preuss <martin@xxxxxxxxxxxxx>

To: ceph-users@xxxxxxxxxxxxxx

Subject:  Random checksum errors (bluestore on Luminous)

Message-ID: <4e50b57f-5881-e806-bb10-0d1e16e05365@xxxxxxxxxxxxx>

Content-Type: text/plain; charset="utf-8"

Hi,

I'm new to Ceph. I started a ceph cluster from scratch on DEbian 9,

consisting of 3 hosts, each host has 3-4 OSDs (using 4TB hdds, currently

totalling 10 hdds).

Right from the start I always received random scrub errors telling me

that some checksums didn't match the expected value, fixable with "ceph

pg repair".

I looked at the ceph-osd logfiles on each of the hosts and compared with

the corresponding syslogs. I never found any hardware error, so there

was no problem reading or writing a sector hardware-wise. Also there was

never any other suspicious syslog entry around the time of checksum

error reporting.

When I looked at the checksum error entries I found that the reported

bad checksum always was "0x6706be76".

Could someone please tell me where to look further for the source of the

problem?

I appended an excerpt of the osd logs.

Kind regards

Martin

--

"Things are only impossible until they're not"

-------------- next part --------------

A non-text attachment was scrubbed...

Name: ceph-osd.log

Type: text/x-log

Size: 4645 bytes

Desc: not available

URL: <http://lists.ceph.com/pipermail/ceph-users-ceph.com/attachments/20171210/460992fe/attachment-0001.bin>

-------------- next part --------------

A non-text attachment was scrubbed...

Name: signature.asc

Type: application/pgp-signature

Size: 181 bytes

Desc: OpenPGP digital signature

URL: <http://lists.ceph.com/pipermail/ceph-users-ceph.com/attachments/20171210/460992fe/attachment-0001.sig>

------------------------------

Message: 9

Date: Sun, 10 Dec 2017 15:05:16 +0000

From: David Turner <drakonstein@xxxxxxxxx>

To: shadow_lin <shadow_lin@xxxxxxx>

Cc: Konstantin Shalygin <k0ste@xxxxxxxx>, ceph-users

        <ceph-users@xxxxxxxxxxxxxx>

Subject: Re:  The way to minimize osd memory usage?

Message-ID:

        <CAN-GepK8nyqRzKTTo4AVmnTqLYuXLCcWdL_XC1LaGBPgQozQ_g@mail.gmail.com>

Content-Type: text/plain; charset="utf-8"

The docs recommend 1GB/TB of OSDs. I saw people asking if this was still

accurate for bluestore and the answer was that it is more true for

bluestore than filestore. There might be a way to get this working at the

cost of performance. I would look at Linux kernel memory settings as much

as ceph and bluestore settings. Cache pressure is one that comes to mind

that an aggressive setting might help.

On Sat, Dec 9, 2017, 11:33 PM shadow_lin <shadow_lin@xxxxxxx> wrote:

> The 12.2.1(12.2.1-249-g42172a4 (42172a443183ffe6b36e85770e53fe678db293bf)

> we are running is with the memory issues fix.And we are working on to

> upgrade to 12.2.2 release to see if there is any furthermore improvement.

>

> 2017-12-10

> ------------------------------

> lin.yunfan

> ------------------------------

>

> *????*Konstantin Shalygin <k0ste@xxxxxxxx>

> *?????*2017-12-10 12:29

> *???*Re:  The way to minimize osd memory usage?

> *????*"ceph-users"<ceph-users@lists.ceph.com>

> *???*"shadow_lin"<shadow_lin@163.com>

>

>

>

> > I am testing running ceph luminous(12.2.1-249-g42172a4 (42172a443183ffe6b36e85770e53fe678db293bf) on ARM server.

> Try new 12.2.2 - this release should fix memory issues with Bluestore.

>

> _______________________________________________

> ceph-users mailing list

> ceph-users@xxxxxxxxxxxxxx

> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

>

-------------- next part --------------

An HTML attachment was scrubbed...

URL: <http://lists.ceph.com/pipermail/ceph-users-ceph.com/attachments/20171210/534133c9/attachment-0001.html>

------------------------------

Message: 10

Date: Sun, 10 Dec 2017 10:38:53 -0500

From: Igor Mendelev <igmend@xxxxxxxxx>

To: ceph-users@xxxxxxxxxxxxxx

Subject:  what's the maximum number of OSDs per OSD

        server?

Message-ID:

        <CAKtyfj_0NKQmPNO2C6CuU47xZhM_Xagm2WF4yLUdUhfSw2G7Qg@mail.gmail.com>

Content-Type: text/plain; charset="utf-8"

Given that servers with 64 CPU cores (128 threads @ 2.7GHz) and up to 2TB

RAM - as well as 12TB HDDs - are easily available and somewhat reasonably

priced I wonder what's the maximum number of OSDs per OSD server (if using

10TB or 12TB HDDs) and how much RAM does it really require if total storage

capacity for such OSD server is on the order of 1,000+ TB - is it still 1GB

RAM per TB of HDD or it could be less (during normal operations - and

extended with NVMe SSDs swap space for extra space during recovery)?

Are there any known scalability limits in Ceph Luminous (12.2.2 with

BlueStore) and/or Linux that'll make such high capacity OSD server not

scale well (using sequential IO speed per HDD as a metric)?

Thanks.

-------------- next part --------------

An HTML attachment was scrubbed...

URL: <http://lists.ceph.com/pipermail/ceph-users-ceph.com/attachments/20171210/01aa76db/attachment-0001.html>

------------------------------

Message: 11

Date: Sun, 10 Dec 2017 16:17:40 -0000

From: Nick Fisk <nick@xxxxxxxxxx>

To: 'Igor Mendelev' <igmend@xxxxxxxxx>, ceph-users@xxxxxxxxxxxxxx

Subject: Re:  what's the maximum number of OSDs per OSD

        server?

Message-ID: <001d01d371d2$66f06de0$34d149a0$@fisk.me.uk>

Content-Type: text/plain; charset="utf-8"

From: ceph-users [mailto:ceph-users-bounces@lists.ceph.com] On Behalf Of Igor Mendelev

Sent: 10 December 2017 15:39

To: ceph-users@xxxxxxxxxxxxxx

Subject:  what's the maximum number of OSDs per OSD server?

Given that servers with 64 CPU cores (128 threads @ 2.7GHz) and up to 2TB RAM - as well as 12TB HDDs - are easily available and somewhat reasonably priced I wonder what's the maximum number of OSDs per OSD server (if using 10TB or 12TB HDDs) and how much RAM does it really require if total storage capacity for such OSD server is on the order of 1,000+ TB - is it still 1GB RAM per TB of HDD or it could be less (during normal operations - and extended with NVMe SSDs swap space for extra space during recovery)?

Are there any known scalability limits in Ceph Luminous (12.2.2 with BlueStore) and/or Linux that'll make such high capacity OSD server not scale well (using sequential IO speed per HDD as a metric)?

Thanks.

How many total OSD?s will you have? If you are planning on having thousands then dense nodes might make sense. Otherwise you are leaving yourself open to having a few number of very large nodes, which will likely shoot you in the foot further down the line. Also don?t forget, unless this is purely for archiving, you will likely need to scale the networking up per node, 2x10G won?t cut it when you have 10-20+ disks per node.

With Bluestore, you are probably looking at around 2-3GB of RAM per OSD, so say 4GB to be on the safe side.

7.2k HDD?s will likely only use a small proportion of a CPU core due to their limited IO potential. A would imagine that even with 90 bay JBOD?s, you will run into physical limitations before you hit CPU ones.

Without knowing your exact requirements, I would suggest that larger number of smaller nodes, might be a better idea. If you choose your hardware right, you can often get the cost down to comparable levels by not going with top of the range kit. Ie Xeon E3?s or D?s vs dual socket E5?s.

-------------- next part --------------

An HTML attachment was scrubbed...

URL: <http://lists.ceph.com/pipermail/ceph-users-ceph.com/attachments/20171210/3f1a50cf/attachment-0001.html>

------------------------------

Message: 12

Date: Sun, 10 Dec 2017 12:37:05 -0500

From: Igor Mendelev <igmend@xxxxxxxxx>

To: nick@xxxxxxxxxx, ceph-users@xxxxxxxxxxxxxx

Subject: Re:  what's the maximum number of OSDs per OSD

        server?

Message-ID:

        <CAKtyfj-zCAPpPANb-5S6gXet+XYX33HhOC_65FP6HrTWBKFfDw@mail.gmail.com>

Content-Type: text/plain; charset="utf-8"

Expected number of nodes for initial setup is 10-15 and of OSDs -

1,500-2,000.

Networking is planned to be 2 100GbE or 2 dual 50GbE in x16 slots (per OSD

node).

JBODs are to be connected with 3-4 x8 SAS3 HBAs (4 4x SAS3 ports each)

Choice of hardware is done considering (non-trivial) per-server sw

licensing costs -

so small (12-24 HDD) nodes are certainly not optimal regardless of CPUs

cost (which

is estimated to be below 10% of the total cost in the setup I'm currently

considering).

EC (4+2 or 8+3 etc - TBD) - not 3x replication - is planned to be used for

most of the storage space.

Main applications are expected to be archiving and sequential access to

large (multiGB) files/objects.

Nick, which physical limitations you're referring to ?

Thanks.

On Sun, Dec 10, 2017 at 11:17 AM, Nick Fisk <nick@xxxxxxxxxx> wrote:

> *From:* ceph-users [mailto:ceph-users-bounces@lists.ceph.com] *On Behalf

> Of *Igor Mendelev

> *Sent:* 10 December 2017 15:39

> *To:* ceph-users@xxxxxxxxxxxxxx

> *Subject:*  what's the maximum number of OSDs per OSD server?

>

>

>

> Given that servers with 64 CPU cores (128 threads @ 2.7GHz) and up to 2TB

> RAM - as well as 12TB HDDs - are easily available and somewhat reasonably

> priced I wonder what's the maximum number of OSDs per OSD server (if using

> 10TB or 12TB HDDs) and how much RAM does it really require if total storage

> capacity for such OSD server is on the order of 1,000+ TB - is it still 1GB

> RAM per TB of HDD or it could be less (during normal operations - and

> extended with NVMe SSDs swap space for extra space during recovery)?

>

>

>

> Are there any known scalability limits in Ceph Luminous (12.2.2 with

> BlueStore) and/or Linux that'll make such high capacity OSD server not

> scale well (using sequential IO speed per HDD as a metric)?

>

>

>

> Thanks.

>

>

>

> How many total OSD?s will you have? If you are planning on having

> thousands then dense nodes might make sense. Otherwise you are leaving

> yourself open to having a few number of very large nodes, which will likely

> shoot you in the foot further down the line. Also don?t forget, unless this

> is purely for archiving, you will likely need to scale the networking up

> per node, 2x10G won?t cut it when you have 10-20+ disks per node.

>

>

>

> With Bluestore, you are probably looking at around 2-3GB of RAM per OSD,

> so say 4GB to be on the safe side.

>

> 7.2k HDD?s will likely only use a small proportion of a CPU core due to

> their limited IO potential. A would imagine that even with 90 bay JBOD?s,

> you will run into physical limitations before you hit CPU ones.

>

>

>

> Without knowing your exact requirements, I would suggest that larger

> number of smaller nodes, might be a better idea. If you choose your

> hardware right, you can often get the cost down to comparable levels by not

> going with top of the range kit. Ie Xeon E3?s or D?s vs dual socket E5?s.

>

-------------- next part --------------

An HTML attachment was scrubbed...

URL: <http://lists.ceph.com/pipermail/ceph-users-ceph.com/attachments/20171210/9c3b98f0/attachment-0001.html>

------------------------------

Message: 13

Date: Sun, 10 Dec 2017 17:38:30 +0000

From: He?in Ejdesgaard M?ller  <hej@xxxxxxxxx>

To: Brady Deetz <bdeetz@xxxxxxxxx>, Donny Davis <donny@xxxxxxxxxxxxxx>

Cc: Aaron Glenn <aglenn@xxxxxxxxxxxxxxxxxxxxx>, ceph-users

        <ceph-users@xxxxxxxx>

Subject: Re:  RBD+LVM -> iSCSI -> VMWare

Message-ID: <1512927510.642.70.camel@synack.fo>

Content-Type: text/plain; charset="UTF-8"

-----BEGIN PGP SIGNED MESSAGE-----

Hash: SHA256

Another option is to utilize the iscsi gateway, provided in 12.2 http://docs.ceph.com/docs/master/rbd/iscsi-overview/

Benefits:

You can EOL your old SAN wtihout having to simultaneously migrate to another hypervisor.

Any infrastructure that ties in to vSphere, is unaffected. (CEPH is just another set of datastores.)

If you have the appropriate vmware licenses etc. then your move to CEPH can be done without any downtime.

Drawback from my tests, using ceph-12.2-latest and ESXi-6.5, is that you get around 30% performance penalty, and the

latency is higher, compared to a direct rbd mount.

On ley, 2017-12-09 at 19:17 -0600, Brady Deetz wrote:

> That's not a bad position. I have concerns with what I'm proposing, so a hypervisor migration may actually bring less

> risk than a storage abomination.?

>

> On Dec 9, 2017 7:09 PM, "Donny Davis" <donny@xxxxxxxxxxxxxx> wrote:

> > What I am getting at is that instead of sinking a bunch of time into this bandaid, why not sink that time into a

> > hypervisor migration. Seems well timed if you ask me.

> >

> > There are even tools to make that migration easier

> >

> > http://libguestfs.org/virt-v2v.1.html

> >

> > You should ultimately move your hypervisor instead of building a one off case for ceph. Ceph works really well if

> > you stay inside the box. So does KVM. They work like Gang Buster's together.

> >

> > I know that doesn't really answer your OP, but this is what I would do.

> >

> > ~D

> >

> > On Sat, Dec 9, 2017 at 7:56 PM Brady Deetz <bdeetz@xxxxxxxxx> wrote:

> > > We have over 150 VMs running in vmware. We also have 2PB of Ceph for filesystem. With our vmware storage aging and

> > > not providing the IOPs we need, we are considering and hoping to use ceph. Ultimately, yes we will move to KVM,

> > > but in the short term, we probably need to stay on VMware.?

> > > On Dec 9, 2017 6:26 PM, "Donny Davis" <donny@xxxxxxxxxxxxxx> wrote:

> > > > Just curious but why not just use a hypervisor with rbd support? Are there VMware specific features you are

> > > > reliant on??

> > > >

> > > > On Fri, Dec 8, 2017 at 4:08 PM Brady Deetz <bdeetz@xxxxxxxxx> wrote:

> > > > > I'm testing using RBD?as VMWare datastores. I'm currently testing with krbd+LVM on a tgt target hosted on a

> > > > > hypervisor.

> > > > >

> > > > > My Ceph cluster is HDD backed.

> > > > >

> > > > > In order to help with write latency, I added an SSD drive to my hypervisor and made it a writeback cache for

> > > > > the rbd via LVM. So far I've managed to smooth out my 4k write latency and have some pleasing results.

> > > > >

> > > > > Architecturally, my current plan is to deploy an iSCSI gateway on each hypervisor hosting that hypervisor's

> > > > > own datastore.

> > > > >

> > > > > Does anybody have any experience with this kind of configuration, especially with regard to LVM writeback

> > > > > caching combined with RBD?

> > > > > _______________________________________________

> > > > > ceph-users mailing list

> > > > > ceph-users@xxxxxxxxxxxxxx

> > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

> > > > >

>

> _______________________________________________

> ceph-users mailing list

> ceph-users@xxxxxxxxxxxxxx

> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-----BEGIN PGP SIGNATURE-----

iQIzBAEBCAAdFiEElZWfRQVsNukQFi9Ko80MCbT/An0FAlotcRYACgkQo80MCbT/

An36fQ//ULP6gwd4qUbXG3yKBHqMtcsTV76+CfP8e3jcuEqyEzlCugoR10DXPELj

TLCnrBp4fDP5gTd1zIHcU+PMPcVJ91dBYUWoMZrSLAraM0+7kvNQ9Nsacsl6CsiZ

yq+506uOhwcLub55oLSpKgnaW1rEG6TAG/6TNIBGakb2a79iC1xev16S3lJ8V7zI

cb3psUCePv/T753q/0E9B5SH9L5BiygsMT4DjiE09xGcFzH3lqkMWm2HMCFXNogI

WbwqQVTTgk5Ch3oilz6cpOIqLK2VMkK0PPFXSGi1SAEjkw2c/XIBykB9MclVQn+8

q5kO5g+uFcflEVnFhKTZknXVoOjrybhs4lMYmK4LJJ340Ay1uLyAlFdZdh+xAN3B

43QBKfcd1dL+EgKkMVuzGOaYOAqrFbh2/DN5rAz3l1YUy5h3OtjrXlNU/F7AkZfc

+UECf9wa6M7uS6DqaPMVxtLhROyMnHw+Z6jrKz7V8EamUduxQyNwOxBNIJYDmKVC

SHSkQi+oykPHWcOIXr1BNR2raaH1YVqXG+6mK8b6YV6sGtVeXA+KCa8RgrtabU3F

tgDW8cPkeTcPYi5BOVZeQ2OSD90A6eiC4fJbMcWVbUQim+0gSY2paoC8Rk/HQkMF

ug8xc9Os7SXe/wEOGQAzRHjDi16eKC9JghrS7dH4JLPg4gvBn4E=

=auLW

-----END PGP SIGNATURE-----

------------------------------

Message: 14

Date: Sun, 10 Dec 2017 19:45:31 +0100

From: Martin Preuss <martin@xxxxxxxxxxxxx>

To: ceph-users@xxxxxxxxxxxxxx

Subject: Re:  Random checksum errors (bluestore on

        Luminous)

Message-ID: <f93ce725-a404-152e-700d-b847823b4be7@xxxxxxxxxxxxx>

Content-Type: text/plain; charset="utf-8"

Hi (again),

meanwhile I tried

"ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-0"

but that resulted in a segfault (please see attached console log).

Regards

Martin

Am 10.12.2017 um 14:34 schrieb Martin Preuss:

> Hi,

>

> I'm new to Ceph. I started a ceph cluster from scratch on DEbian 9,

> consisting of 3 hosts, each host has 3-4 OSDs (using 4TB hdds, currently

> totalling 10 hdds).

>

> Right from the start I always received random scrub errors telling me

> that some checksums didn't match the expected value, fixable with "ceph

> pg repair".

>

> I looked at the ceph-osd logfiles on each of the hosts and compared with

> the corresponding syslogs. I never found any hardware error, so there

> was no problem reading or writing a sector hardware-wise. Also there was

> never any other suspicious syslog entry around the time of checksum

> error reporting.

>

> When I looked at the checksum error entries I found that the reported

> bad checksum always was "0x6706be76".

>

> Could someone please tell me where to look further for the source of the

> problem?

>

> I appended an excerpt of the osd logs.

>

>

> Kind regards

> Martin

>

>

>

>

> _______________________________________________

> ceph-users mailing list

> ceph-users@xxxxxxxxxxxxxx

> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

>

--

"Things are only impossible until they're not"

-------------- next part --------------

A non-text attachment was scrubbed...

Name: fsck.log

Type: text/x-log

Size: 4314 bytes

Desc: not available

URL: <http://lists.ceph.com/pipermail/ceph-users-ceph.com/attachments/20171210/1a19349d/attachment-0001.bin>

-------------- next part --------------

A non-text attachment was scrubbed...

Name: signature.asc

Type: application/pgp-signature

Size: 181 bytes

Desc: OpenPGP digital signature

URL: <http://lists.ceph.com/pipermail/ceph-users-ceph.com/attachments/20171210/1a19349d/attachment-0001.sig>

------------------------------

Message: 15

Date: Sun, 10 Dec 2017 20:32:45 -0000

From: Nick Fisk <nick@xxxxxxxxxx>

To: 'Igor Mendelev' <igmend@xxxxxxxxx>, ceph-users@xxxxxxxxxxxxxx

Subject: Re:  what's the maximum number of OSDs per OSD

        server?

Message-ID: <002201d371f6$09a38040$1cea80c0$@fisk.me.uk>

Content-Type: text/plain; charset="utf-8"

From: ceph-users [mailto:ceph-users-bounces@lists.ceph.com] On Behalf Of Igor Mendelev

Sent: 10 December 2017 17:37

To: nick@xxxxxxxxxx; ceph-users@xxxxxxxxxxxxxx

Subject: Re:  what's the maximum number of OSDs per OSD server?

Expected number of nodes for initial setup is 10-15 and of OSDs - 1,500-2,000.

Networking is planned to be 2 100GbE or 2 dual 50GbE in x16 slots (per OSD node).

JBODs are to be connected with 3-4 x8 SAS3 HBAs (4 4x SAS3 ports each)

Choice of hardware is done considering (non-trivial) per-server sw licensing costs -

so small (12-24 HDD) nodes are certainly not optimal regardless of CPUs cost (which

is estimated to be below 10% of the total cost in the setup I'm currently considering).

EC (4+2 or 8+3 etc - TBD) - not 3x replication - is planned to be used for most of the storage space.

Main applications are expected to be archiving and sequential access to large (multiGB) files/objects.

Nick, which physical limitations you're referring to ?

Thanks.

Hi Igor,

I guess I meant physical annoyances rather than limitations. Being able to pull out a 1 or 2U node is always much less of a chore vs dealing with several U of SAS interconnected JBOD?s.

If you have some license reason for larger nodes, then there is a very valid argument for larger nodes. Is this license cost  related in some way to Ceph (I thought Redhat was capacity based) or is this some sort of collocated software? Just make sure you size the nodes to a point that if one has to be taken offline for any reason, that you are happy with the resulting state of the cluster, including the peering when suddenly taking ~200 OSD?s offline/online.

Nick

On Sun, Dec 10, 2017 at 11:17 AM, Nick Fisk <nick@xxxxxxxxxx <mailto:nick@xxxxxxxxxx> > wrote:

From: ceph-users [mailto:ceph-users-bounces@lists.ceph.com <mailto:ceph-users-bounces@lists.ceph.com> ] On Behalf Of Igor Mendelev

Sent: 10 December 2017 15:39

To: ceph-users@xxxxxxxxxxxxxx <mailto:ceph-users@xxxxxxxxxx.com>

Subject:  what's the maximum number of OSDs per OSD server?

Given that servers with 64 CPU cores (128 threads @ 2.7GHz) and up to 2TB RAM - as well as 12TB HDDs - are easily available and somewhat reasonably priced I wonder what's the maximum number of OSDs per OSD server (if using 10TB or 12TB HDDs) and how much RAM does it really require if total storage capacity for such OSD server is on the order of 1,000+ TB - is it still 1GB RAM per TB of HDD or it could be less (during normal operations - and extended with NVMe SSDs swap space for extra space during recovery)?

Are there any known scalability limits in Ceph Luminous (12.2.2 with BlueStore) and/or Linux that'll make such high capacity OSD server not scale well (using sequential IO speed per HDD as a metric)?

Thanks.

How many total OSD?s will you have? If you are planning on having thousands then dense nodes might make sense. Otherwise you are leaving yourself open to having a few number of very large nodes, which will likely shoot you in the foot further down the line. Also don?t forget, unless this is purely for archiving, you will likely need to scale the networking up per node, 2x10G won?t cut it when you have 10-20+ disks per node.

With Bluestore, you are probably looking at around 2-3GB of RAM per OSD, so say 4GB to be on the safe side.

7.2k HDD?s will likely only use a small proportion of a CPU core due to their limited IO potential. A would imagine that even with 90 bay JBOD?s, you will run into physical limitations before you hit CPU ones.

Without knowing your exact requirements, I would suggest that larger number of smaller nodes, might be a better idea. If you choose your hardware right, you can often get the cost down to comparable levels by not going with top of the range kit. Ie Xeon E3?s or D?s vs dual socket E5?s.

-------------- next part --------------

An HTML attachment was scrubbed...

URL: <http://lists.ceph.com/pipermail/ceph-users-ceph.com/attachments/20171210/1e954b89/attachment-0001.html>

------------------------------

Subject: Digest Footer

_______________________________________________

ceph-users mailing list

ceph-users@xxxxxxxxxxxxxx

http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

------------------------------

End of ceph-users Digest, Vol 59, Issue 9

*****************************************

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com