Re: Gentoo & ceph 0.67 & pg stuck After fresh Installation

Philipp,

I have had issues with clock sync on machines before that I could usually alleviate by tweaking the kernel config. Changing CONFIG_HZ to 300 instead of 1000 can help. If you ever reboot the machines, make sure your init system writes the current software clock to the hardware clock on shutdown (if you use OpenRC, /etc/conf.d/hwclock should have 'clock_systohc="YES"'); that can help the situation too.
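For example, on an OpenRC box the relevant pieces of /etc/conf.d/hwclock would look something like this (assuming you use the hwclock service at all):

    clock="UTC"            # or "local", matching how the BIOS clock is set
    clock_hctosys="YES"    # load the hardware clock into the software clock at boot
    clock_systohc="YES"    # write the software clock back to the hardware clock at shutdown

And if your kernel exposes /proc/config.gz (CONFIG_IKCONFIG_PROC), you can check what HZ it was built with via 'zgrep CONFIG_HZ /proc/config.gz'.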

Some more hardware details might be helpful. On very, very overloaded systems I've seen the software clock drift a lot; you might just be trying to do too much with the number of cores you have. Also, cheap or badly-configured hardware can cause spurious interrupts, so keeping an eye on the context-switches-per-second and interrupts-per-second values over time might give you a clue about the clock drift as well.
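Something like this is enough to watch those numbers over time; in vmstat's output the "in" and "cs" columns are interrupts and context switches per second:

    vmstat 5               # one sample every 5 seconds
    cat /proc/interrupts   # per-device interrupt counters, good for spotting a noisy device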

Glad you found my notes helpful - I didn't write the majority of that howto, though, just the notes at the top :)

-Aaron


On Tue, Jan 28, 2014 at 2:32 PM, Philipp von Strobl-Albeg <philipp@xxxxxxxxxxxx> wrote:
Hi all,

thank you very much for your input.

I sync the clock on all hosts via 'ntpdate pool.ntp.org' and write that time to the hardware clock on every host.
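Roughly, what I run on each host is:

    ntpdate pool.ntp.org    # one-shot sync of the software clock from NTP
    hwclock --systohc       # copy the software clock to the hardware clock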
For some strange reason, one host is out of sync again after a few minutes. I can't say where this comes from...
Perhaps this is a Gentoo-specific thing or a "cheap PC" problem.

What is the worst I have to expect if I don't fix this?


Anyway, I managed to fix the stuck-PGs issue.
I redesigned the CRUSH map (mainly assigned each host to a rack and that rack to root default) and now the health is OK!
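Roughly what I did (rack1 is just the example name I use here):

    ceph osd crush add-bucket rack1 rack       # create a rack bucket
    ceph osd crush move rack1 root=default     # hang the rack under root default
    ceph osd crush move dp1 rack=rack1         # move the hosts under the rack
    ceph osd crush move dp2 rack=rack1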


Thank you again for your kind help, and great job, Inktank ;-)

PS: Aaron, your howto was really helpful.


Best
Philipp


On 20.01.2014 05:59, Sage Weil wrote:

On Sun, 19 Jan 2014, Sherry Shahbazi wrote:
Hi Philipp,

Installing "ntp" on each server might solve the clock skew problem.
At the very least a one-time 'ntpdate time.apple.com' should make that issue go away for the time being.
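To keep the clocks in sync after that, running the ntp daemon beats repeated one-shot syncs; on Gentoo with OpenRC that would be something like:

    emerge net-misc/ntp
    rc-update add ntpd default
    /etc/init.d/ntpd start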

s

  Best Regards
Sherry


On Sunday, January 19, 2014 6:34 AM, Philipp Strobl <philipp@xxxxxxxxxxxx> wrote:
Hi Aaron,

sorry for taking so long...

After I added the OSDs and buckets to the crushmap, I get:

ceph osd tree
# id    weight    type name    up/down    reweight
-3    1    host dp2
1    1        osd.1    up    1
-2    1    host dp1
0    1        osd.0    up    1
-1    0    root default
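For reference, I added them with the usual decompile/edit/recompile cycle, roughly:

    ceph osd getcrushmap -o crush.bin      # dump the current map
    crushtool -d crush.bin -o crush.txt    # decompile to text
    # ... edit crush.txt ...
    crushtool -c crush.txt -o crush.new    # recompile
    ceph osd setcrushmap -i crush.new      # inject the new map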


Both OSDs are up and in:

ceph osd stat
e25: 2 osds: 2 up, 2 in

ceph health detail says:

HEALTH_WARN 292 pgs stuck inactive; 292 pgs stuck unclean; clock skew detected on mon.vmsys-dp2
pg 3.f is stuck inactive since forever, current state creating, last acting []
pg 0.c is stuck inactive since forever, current state creating, last acting []
pg 1.d is stuck inactive since forever, current state creating, last acting []
pg 2.e is stuck inactive since forever, current state creating, last acting []
pg 3.8 is stuck inactive since forever, current state creating, last acting []
pg 0.b is stuck inactive since forever, current state creating, last acting []
pg 1.a is stuck inactive since forever, current state creating, last acting []
...
pg 2.c is stuck unclean since forever, current state creating, last acting []
pg 1.f is stuck unclean since forever, current state creating, last acting []
pg 0.e is stuck unclean since forever, current state creating, last acting []
pg 3.d is stuck unclean since forever, current state creating, last acting []
pg 2.f is stuck unclean since forever, current state creating, last acting []
pg 1.c is stuck unclean since forever, current state creating, last acting []
pg 0.d is stuck unclean since forever, current state creating, last acting []
pg 3.e is stuck unclean since forever, current state creating, last acting []
mon.vmsys-dp2 addr 10.0.0.22:6789/0 clock skew 16.4914s > max 0.05s (latency 0.00666228s)

All PGs have the same status.

Is the clock skew an important factor?

I compiled Ceph like this (eix ceph):
...
Installed versions:  0.67{tbz2}(00:54:50 01/08/14)(fuse -debug -gtk -libatomic -radosgw -static-libs -tcmalloc)
The cluster name is vmsys; the servers are dp1 and dp2.
config:

[global]
     auth cluster required = none
     auth service required = none
     auth client required = none
     auth supported = none
     fsid = 265d12ac-e99d-47b9-9651-05cb2b4387a6

[mon.vmsys-dp1]
     host = dp1
     mon addr = INTERNAL-IP1:6789
     mon data = ...dp1

[mon.vmsys-dp2]
     host = dp2
     mon addr = INTERNAL-IP2:6789
     mon data = ...dp2

[osd]
[osd.0]
     host = dp1
     devs = /dev/sdb1
     osd_mkfs_type = xfs
     osd data = ...
[osd.1]
     host = dp2
     devs = /dev/sdb1
     osd_mkfs_type = xfs
     osd data = ...
[mds.vmsys-dp1]
         host = dp1

[mds.vmsys-dp2]
         host = dp2



Hope this is helpful - I really don't know at the moment what is wrong.

Perhaps I should try the manual-deploy howto from Inktank, or do you have an idea?



Best Philipp

http://www.pilarkto.net
On 10.01.2014 20:50, Aaron Ten Clay wrote:
       Hi Philipp,

It sounds like perhaps you don't have any OSDs that are both "up" and
"in" the cluster. Can you provide the output of "ceph health detail"
and "ceph osd tree" for us?
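A quick way to see the up/in counts at a glance:

    ceph osd stat    # one-line summary: how many OSDs exist, how many are up, how many are in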

As for the "howto" you mentioned, I added some notes to the top but
never really updated the body of the document... I'm not entirely sure
it's straightforward or up to date any longer :) I'd be happy to make
changes as needed but I haven't manually deployed a cluster in several
months, and Inktank now has a manual deployment guide for Ceph at
http://ceph.com/docs/master/install/manual-deployment/

-Aaron



On Fri, Jan 10, 2014 at 6:57 AM, Philipp Strobl <philipp@xxxxxxxxxxxx>
wrote:
       Hi,

After managing to deploy Ceph manually on Gentoo (the ceph-disk tools
are under /usr/usr/sbin...), the daemons come up properly,
but "ceph health" shows a warning for all PGs stuck unclean.
This is strange behavior for a clean new installation, I guess.

So the question is: am I doing something wrong, or can I reset the
PGs to get the cluster running?

Also, the rbd client and mount.ceph hang with no answer.

I used this howto: https://github.com/aarontc/ansible-playbooks/blob/master/roles/ceph.notes-on-deployment.rst

or, alternatively, our German translation/expansion:
http://wiki.open-laboratory.de/Intern:IT:HowTo:Ceph

With "auth supported ... = none"


Best regards
And thank you in advance

Philipp Strobl







--
Aaron Ten Clay
http://www.aarontc.com/










--
Aaron Ten Clay
http://www.aarontc.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
