Dear all,
thanks for your feedback; I'll try to take every suggestion into consideration.
I've rebooted the node in question and all 24 OSDs came online without any complaints.
But what makes me wonder is this: during the downtime the objects got rebalanced and placed on the remaining nodes.
With the failed node back online, only a couple of hundred objects were misplaced, out of about 35 million.
The question for me is: What happens to the objects on the OSDs that went down after the OSDs got back online?
Thanks for feedback
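One way to keep an eye on the misplaced count while the cluster settles is to read it from the JSON status output. The following is a minimal sketch, assuming the `pgmap` section of `ceph status --format json` exposes `misplaced_objects` and `misplaced_total` (field names as I understand them; please verify against your release). The sample document and its numbers are made up for illustration and are not taken from the cluster discussed here.

```python
import json

# Hypothetical, hand-written sample; real output has many more fields.
SAMPLE_STATUS_JSON = """
{"pgmap": {"misplaced_objects": 300, "misplaced_total": 35000000}}
"""


def misplaced_fraction(status):
    """Return misplaced / total as a float, or 0.0 if nothing is reported.

    The two keys are assumed to be absent when no objects are misplaced,
    so missing keys are treated as zero.
    """
    pgmap = status.get("pgmap", {})
    misplaced = pgmap.get("misplaced_objects", 0)
    total = pgmap.get("misplaced_total", 0)
    return misplaced / total if total else 0.0


status = json.loads(SAMPLE_STATUS_JSON)
print(f"misplaced: {misplaced_fraction(status):.6%}")
```

On a live cluster the sample string could be replaced by the output of `ceph status --format json`, for example via `subprocess.check_output`.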
Hello,

this is where (depending on your topology) something like:
---
mon_osd_down_out_subtree_limit = host
---
can come in very handy.

Provided you have correct monitoring, alerting and operations, a down node can often be restored long before any recovery would be finished, and you also avoid the data movement back and forth.

And if you see that recovering the node will take a long time, just manually set things out for the time being.

Christian

On Sun, 27 Jan 2019 00:02:54 +0100 Götz Reinicke wrote:

Dear Chris,
Thanks for your feedback. The node/OSDs in question are part of an erasure coded pool and during the weekend the workload should be close to none.
But anyway, I could take a look at the console and at the server: the power is on, but I can't use any console. The login prompt is shown, but no keystroke is accepted.
I'll have to reboot the server and check what it is complaining about tomorrow morning, as soon as I can access the server again.
Fingers crossed and regards. Götz
On 26.01.2019 at 23:41, Chris <bitskrieg@xxxxxxxxxxxxx> wrote:
It sort of depends on your workload/use case. Recovery operations can be computationally expensive. If your load is light because it's the weekend, you should be able to turn that host back on as soon as you resolve whatever the issue is, with minimal impact. You can also increase the priority of the recovery operation to make it go faster if you feel you can spare additional I/O and it won't affect clients.
We do this in our cluster regularly and have yet to see an issue (given that we take care to do it during periods of lower client I/O).
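Chris does not say which settings he changes. As a rough sketch, one common way to give recovery more room at runtime is `ceph tell osd.* injectargs` with options such as `osd_max_backfills` and `osd_recovery_max_active`. The helper below only builds the command line so it can be reviewed before running; the helper name and the values 2 and 4 are placeholders I chose, not recommendations.

```python
import shlex


def injectargs_command(options, target="osd.*"):
    """Build the argv for `ceph tell <target> injectargs '<flags>'`.

    `options` maps option names (underscore style) to non-negative ints.
    Nothing is executed here; the caller decides whether to run it.
    """
    if not options:
        raise ValueError("at least one option is required")
    flags = []
    for name, value in options.items():
        if not isinstance(value, int) or value < 0:
            raise ValueError(f"{name} must be a non-negative integer")
        # injectargs accepts the dashed form: --osd-max-backfills 2
        flags.append(f"--{name.replace('_', '-')} {value}")
    return ["ceph", "tell", target, "injectargs", " ".join(flags)]


# Placeholder values for illustration only.
cmd = injectargs_command({"osd_max_backfills": 2, "osd_recovery_max_active": 4})
print(shlex.join(cmd))  # shell-quoted form, ready to paste
```

Values injected this way are runtime changes, so they would need to be set back the same way once recovery has finished.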
On January 26, 2019 17:16:38 Götz Reinicke <goetz.reinicke@xxxxxxxxxxxxxxx> wrote:
Hi,
one host out of 10 is down for as yet unknown reasons. I guess a power failure. I have not been able to look at the server yet.
The cluster is recovering and remapping fine, but still has some objects to process.
My question: may I just switch the server back on, so that in the best case the 24 OSDs get back online and recovery does the job without problems?
Or what might be a good way to handle that host? Should I first wait until the recovery has finished?
Thanks for feedback and suggestions - Happy Saturday Night :) . Regards . Götz
--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Rakuten Communications
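To illustrate the effect of the `mon_osd_down_out_subtree_limit = host` setting from Christian's reply: with it, OSDs are not marked out automatically when their entire host is down, so no rebalancing starts for a whole-node outage. The following toy model shows that rule only. It is not Ceph's implementation, and the function name, host names and OSD numbering are hypothetical.

```python
def osds_eligible_for_auto_out(osds_by_host, down_osds):
    """Return the down OSDs that could still be marked out automatically.

    Toy model of a subtree limit of `host`: if every OSD of a host is
    down, that host's OSDs are skipped; otherwise its down OSDs remain
    candidates for automatic mark-out.
    """
    down = set(down_osds)
    eligible = []
    for host, osds in osds_by_host.items():
        host_down = [osd for osd in osds if osd in down]
        if host_down and len(host_down) == len(osds):
            continue  # whole host is down: left alone, no rebalancing
        eligible.extend(host_down)
    return sorted(eligible)


# Hypothetical layout: two hosts with 24 OSDs each.
cluster = {"node1": list(range(0, 24)), "node2": list(range(24, 48))}

# node1 fails completely: nothing is marked out automatically.
print(osds_eligible_for_auto_out(cluster, range(0, 24)))

# Single disks fail on otherwise healthy hosts: they remain candidates.
print(osds_eligible_for_auto_out(cluster, [3, 30]))
```

In this model a full-node outage leaves the data where it is until an operator either brings the node back or marks its OSDs out by hand, which matches the workflow described in the reply.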
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com