Re: Ceph cluster stability

M Ranga Swami Reddy <swamireddy@xxxxxxxxx> · Fri, 22 Feb 2019 17:09:54 +0530

The Ceph mons look fine during the recovery. We are using HDDs with SSD
journals, with the recommended CPU and RAM numbers.
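
For reference, the runtime changes suggested further down the thread
(lowering osd max backfills / osd recovery max active, and setting debug
levels to 0) could probably be applied with `ceph tell osd.* injectargs`.
Here is a minimal Python sketch that only builds and prints the commands
and does not run them. The helper name build_injectargs_cmd, the values
of 1, and the choice of debug subsystems are assumptions for illustration,
not something tested on this cluster.

```python
import shlex


def build_injectargs_cmd(options, target="osd.*"):
    """Return the argv for: ceph tell <target> injectargs '<flags>'.

    `options` maps option names (underscores allowed) to values.
    This helper is hypothetical; it only formats the command.
    """
    flags = " ".join(
        "--{} {}".format(name.replace("_", "-"), value)
        for name, value in options.items()
    )
    return ["ceph", "tell", target, "injectargs", flags]


# Illustrative values only (assumption); pick numbers that suit the cluster.
recovery_throttle = {"osd_max_backfills": 1, "osd_recovery_max_active": 1}

# Debug subsystems chosen as examples (assumption); "0/0" disables both
# the log level and the in-memory level.
quiet_debug = {"debug_osd": "0/0", "debug_ms": "0/0", "debug_filestore": "0/0"}

if __name__ == "__main__":
    for opts in (recovery_throttle, quiet_debug):
        # Print a shell-safe command line for review instead of executing it.
        print(shlex.join(build_injectargs_cmd(opts)))
```

Printing rather than executing keeps this a dry run, so the commands can
be reviewed first. Values set via injectargs do not persist across an OSD
restart, so anything that helps would also need to go into ceph.conf.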

On Fri, Feb 22, 2019 at 4:40 PM David Turner <drakonstein@xxxxxxxxx> wrote:
>
> What about the system stats on your mons during recovery? If they are having a hard time keeping up with requests during a recovery, I could see that impacting client io. What disks are they running on? CPU? Etc.
>
> On Fri, Feb 22, 2019, 6:01 AM M Ranga Swami Reddy <swamireddy@xxxxxxxxx> wrote:
>>
>> We are using the default debug settings, i.e. 1/5 and 0/5 for almost all subsystems.
>> Shall I try with 0 for all debug settings?
>>
>> On Wed, Feb 20, 2019 at 9:17 PM Darius Kasparavičius <daznis@xxxxxxxxx> wrote:
>> >
>> > Hello,
>> >
>> >
>> > Check your CPU usage when you are doing those kinds of operations. We
>> > had a similar issue where our CPU monitoring was reporting a fine < 40%
>> > usage, but the load on our nodes was high, in the 60-80 range. If possible,
>> > try disabling HT (hyper-threading) and check the actual CPU usage.
>> > If you are hitting CPU limits you can try disabling crc on messages.
>> > ms_nocrc
>> > ms_crc_data
>> > ms_crc_header
>> >
>> > And set all your debug messages to 0.
>> > If you haven't done so already, you can also lower your recovery settings a little.
>> > osd recovery max active
>> > osd max backfills
>> >
>> > You can also lower your file store threads.
>> > filestore op threads
>> >
>> >
>> > If you can, also switch from filestore to bluestore. This will also
>> > lower your CPU usage. I'm not sure it is bluestore itself that does
>> > it, but I'm seeing lower CPU usage after moving to bluestore + rocksdb
>> > compared to filestore + leveldb.
>> >
>> >
>> > On Wed, Feb 20, 2019 at 4:27 PM M Ranga Swami Reddy
>> > <swamireddy@xxxxxxxxx> wrote:
>> > >
>> > > That's expected from Ceph by design. But in our case, we follow all
>> > > the recommendations (rack failure domain, replication n/w, etc.) and
>> > > still face client IO performance issues when one OSD is down.
>> > >
>> > > On Tue, Feb 19, 2019 at 10:56 PM David Turner <drakonstein@xxxxxxxxx> wrote:
>> > > >
>> > > > With a RACK failure domain, you should be able to have an entire rack powered down without noticing any major impact on the clients.  I regularly take down OSDs and nodes for maintenance and upgrades without seeing any problems with client IO.
>> > > >
>> > > > On Tue, Feb 12, 2019 at 5:01 AM M Ranga Swami Reddy <swamireddy@xxxxxxxxx> wrote:
>> > > >>
>> > > >> Hello - I have a couple of questions on ceph cluster stability, even
>> > > >> though we follow all the recommendations below:
>> > > >> - Having separate replication n/w and data n/w
>> > > >> - RACK is the failure domain
>> > > >> - Using SSDs for journals (1:4 ratio)
>> > > >>
>> > > >> Q1 - If one OSD goes down, cluster IO drops drastically and customer apps are impacted.
>> > > >> Q2 - What is the stability ratio? I.e., with the above setup, is the ceph
>> > > >> cluster in a workable condition if one OSD or one node is down, etc.?
>> > > >>
>> > > >> Thanks
>> > > >> Swami
>> > > >> _______________________________________________
>> > > >> ceph-users mailing list
>> > > >> ceph-users@xxxxxxxxxxxxxx
>> > > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com