Re: One host with 24 OSDs is offline - best way to get it back online

Götz Reinicke <goetz.reinicke@xxxxxxxxxxxxxxx> · Sun, 27 Jan 2019 15:47:49 +0100

Dear all,
thanks for your feedback; I’ll try to take every suggestion into consideration.

I’ve rebooted the node in question and all 24 OSDs came online without any complaints.

But what makes me wonder is: during the downtime the objects got rebalanced and placed on the remaining nodes.

With the failed node back online, only a couple of hundred objects were misplaced, out of about 35 million.

The question for me is: what happens to the objects on the OSDs that went down, once those OSDs are back online?

	Thanks for feedback 

On 27.01.2019 at 04:17, Christian Balzer <chibi@xxxxxxx> wrote:

Hello,

this is where (depending on your topology) something like:
---
mon_osd_down_out_subtree_limit = host
---
can come in very handy.
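
To illustrate what that setting changes, here is a minimal toy model in Python. It is not Ceph code: the function name `auto_out_candidates` and the flat host-to-OSD dictionary are assumptions made up for this sketch, and it only models the host level. The idea it shows is that with the limit set to `host`, a host whose OSDs are all down is not marked out automatically, while a single failed OSD on an otherwise healthy host still is.

```python
# Toy model only (hypothetical names, not Ceph source code): which down OSDs
# remain eligible for automatic mark-out under mon_osd_down_out_subtree_limit.


def auto_out_candidates(topology, down_osds, subtree_limit="host"):
    """Return the set of down OSD ids that could still be marked out automatically.

    topology: dict mapping a host name to the list of OSD ids on that host.
    down_osds: iterable of OSD ids that are currently down.
    subtree_limit: "host" protects hosts that are down as a whole; any other
        value (for example "rack") offers no host-level protection in this
        flat, simplified model.
    """
    down = set(down_osds)
    candidates = set()
    for host, osds in topology.items():
        down_here = down & set(osds)
        whole_host_down = bool(osds) and down_here == set(osds)
        if subtree_limit == "host" and whole_host_down:
            # Entire host is down: leave its OSDs "in" and wait for an operator.
            continue
        candidates |= down_here
    return candidates


if __name__ == "__main__":
    # Ten hosts with 24 OSDs each, as in the cluster described in this thread.
    topology = {f"node{n}": list(range(n * 24, (n + 1) * 24)) for n in range(10)}
    failed_host = set(topology["node3"])
    print(len(auto_out_candidates(topology, failed_host, "host")))
    print(len(auto_out_candidates(topology, failed_host, "rack")))
```

In a real cluster the monitors make this decision themselves; the sketch is only meant to make the effect of the option easier to reason about.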

Provided you have correct monitoring, alerting and operations, a down
node can often be restored long before any recovery would have
finished, and you also avoid the data movement back and forth.
And if you see that recovering the node will take a long time, just
manually mark its OSDs out for the time being.

Christian

On Sun, 27 Jan 2019 00:02:54 +0100 Götz Reinicke wrote:

Dear Chris,

Thanks for your feedback. The node/OSDs in question are part of an erasure coded pool and during the weekend the workload should be close to none.

But anyway, I was able to take a look at the console and at the server: the power is on, but I can’t use any console; the login prompt is shown, but no keystroke is accepted.

I’ll have to reboot the server and check what it is complaining about tomorrow morning, as soon as I can access the server again.

	Fingers crossed and regards. Götz

On 26.01.2019 at 23:41, Chris <bitskrieg@xxxxxxxxxxxxx> wrote:

It sort of depends on your workload/use case. Recovery operations can be computationally expensive. If your load is light because it's the weekend, you should be able to turn that host back on as soon as you resolve whatever the issue is, with minimal impact. You can also increase the priority of the recovery operation to make it go faster if you feel you can spare additional I/O and it won't affect clients.

We do this in our cluster regularly and have yet to see an issue (given that we take care to do it during periods of lower client I/O).
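
The message above does not say which settings are meant by "priority". One common way to speed up recovery is to raise the OSD backfill and recovery throttles at runtime, and the Python sketch below only builds such a command line so it can be reviewed before anyone runs it. The helper name `recovery_tuning_command` and the values 2 and 4 are placeholders chosen for illustration, not recommendations; sensible values depend on the hardware and the Ceph release.

```python
import shlex


def recovery_tuning_command(max_backfills, recovery_max_active, target="osd.*"):
    """Build (but do not run) a `ceph tell ... injectargs` command.

    Assumption for this sketch: recovery is sped up via the osd_max_backfills
    and osd_recovery_max_active options. The thread does not name the options.
    """
    if max_backfills < 1 or recovery_max_active < 1:
        raise ValueError("both values must be positive integers")
    args = (
        f"--osd-max-backfills {max_backfills} "
        f"--osd-recovery-max-active {recovery_max_active}"
    )
    # Returned as an argv list so it could be passed to subprocess.run() as is.
    return ["ceph", "tell", target, "injectargs", args]


if __name__ == "__main__":
    cmd = recovery_tuning_command(max_backfills=2, recovery_max_active=4)
    # Print a shell-quoted form for review instead of executing it.
    print(shlex.join(cmd))
```

Values injected this way are runtime changes, so it is worth noting the previous values and setting them back once recovery has finished.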

On January 26, 2019 17:16:38 Götz Reinicke <goetz.reinicke@xxxxxxxxxxxxxxx> wrote:

Hi,

one host out of 10 is down for as yet unknown reasons. I guess a power failure. I have not been able to look at the server yet.

The cluster is recovering and remapping fine, but still has some objects to process.

My question: may I just switch the server back on so that, in the best case, the 24 OSDs come back online and recovery does the job without problems?

Or what might be a good way to handle that host? Should I first wait until the recovery has finished?

Thanks for feedback and suggestions - Happy Saturday Night  :) . Regards . Götz  

-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Rakuten Communications

  Götz Reinicke

 IT-Koordinator
IT-OfficeNet

 +49 7141 969 82420

    goetz.reinicke@xxxxxxxxxxxxxxx

      Filmakademie Baden-Württemberg GmbH

       Akademiehof 10
       71638 Ludwigsburg

      http://www.filmakademie.de


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com