Re: OSD fast shutdown and allocator's map restoration (aka NCB)

Hi Igor,

It is true that fast shutdown can cause a lengthy restoration on the next boot.
Gabriel is currently working on it.
The goal is to make fast shutdown work in the following way:
- stop new ops from OSD
- finish processing BlueStore ops
- save allocator state
- exit()
We expect it to still be much faster than a regular shutdown, with the savings coming from
simply dropping elaborate data structures instead of unwinding them in an orderly fashion.
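
As a rough illustration of that ordering, here is a toy Python model. It is not
Ceph code: ToyOSD, submit, fast_shutdown and exit_fn are made-up names, and the
"allocator" is just a set of free block numbers. The only point is the sequence
of steps, with the exit call skipping any teardown.

```python
import os


class ToyOSD:
    """Toy stand-in for an OSD. Every name here is hypothetical, not a Ceph API."""

    def __init__(self, free_blocks):
        self.accepting_ops = True
        self.in_flight = []                  # ops accepted but not yet committed
        self.committed = []                  # ops the store has finished
        self.free_blocks = set(free_blocks)  # in-memory allocator map
        self.saved_allocator = None          # what would be persisted to disk
        self.steps = []                      # order in which the steps ran

    def submit(self, op, block):
        """Accept a write that consumes one block, unless we are stopping."""
        if not self.accepting_ops:
            return False
        self.free_blocks.discard(block)
        self.in_flight.append(op)
        return True

    def fast_shutdown(self, exit_fn=os._exit):
        # 1. Stop new ops from the OSD.
        self.accepting_ops = False
        self.steps.append("stop_new_ops")
        # 2. Finish processing ops already handed to the store.
        while self.in_flight:
            self.committed.append(self.in_flight.pop(0))
        self.steps.append("drain_store_ops")
        # 3. Save allocator state so the next start need not rebuild it.
        self.saved_allocator = frozenset(self.free_blocks)
        self.steps.append("save_allocator")
        # 4. Exit immediately, without unwinding data structures.
        self.steps.append("exit")
        exit_fn(0)


# Demo: record the exit code instead of really terminating the interpreter.
exit_codes = []
osd = ToyOSD(free_blocks=range(4))
osd.submit("write-a", block=0)
osd.submit("write-b", block=1)
osd.fast_shutdown(exit_fn=exit_codes.append)
rejected = osd.submit("write-c", block=2)  # refused: we already stopped new ops
```

The default exit_fn is os._exit, which leaves the process without running normal
interpreter cleanup; that is the closest Python analogue to "just exit()" here.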

Of course this new "fast shutdown" will never be as fast as the previous one.

Still, I agree with you that we must maintain an orderly shutdown path
so we can use tools like valgrind.

Regards,
Adam

On Mon, 22 Nov 2021 at 19:02, Neha Ojha <nojha@xxxxxxxxxx> wrote:
On Mon, Nov 22, 2021 at 9:55 AM Sage Weil <sage@xxxxxxxxxxxx> wrote:
>
> Hi Igor,
>
> Good point!  I forgot about the intersection of these two features.
>
> (Just to confirm: the allocator change is new in quincy, right?  And won't be backported?  Just want to confirm this isn't affecting any users.)

It is only in master/quincy.

>
> The fast shutdown was introduced partly because in Octopus we wanted to set the dead_epoch value in the OSDMap, and at the time the quickest way to do that was to kill the process so that peers would get ECONNREFUSED and report the peer dead.
>
> Given this new feature, I think it makes sense to change the clean shutdown process to (1) stop responding to messages immediately (play dead) and (2) ask the mon to mark us dead (MOSDMarkMeDead instead of MOSDMarkMeDown).  I think that's a matter of dropping incoming messages when we are in PREPARING_TO_STOP state or whatever it is.  Note that it is very important that we stop processing requests *before* we are marked dead in order to prevent potentially stale reads--"dead" means the OSD process is truly dead (vs unresponsive/slow or possibly partitioned away from us but still serving reads for some clients).
>
> Anyway, this relates to https://tracker.ceph.com/issues/53327.  I plan to take a closer look this week.
>
> sage
>
> On Mon, Nov 22, 2021 at 11:44 AM Igor Fedotov <igor.fedotov@xxxxxxxx> wrote:
>>
>> Hey folks,
>>
>> recently I realized that OSD's fast shutdown (which is the default
>> behavior) forces our new feature, dynamic restoration of the allocator's
>> map, to work in a suboptimal mode. Due to the non-graceful shutdown it
>> has to recover the allocator's map through onode enumeration on each OSD
>> startup, which might take some time. Moreover, RocksDB
>> apparently performs a sort of recovery in this case too, maybe not
>> that long but still noticeable.
>>
>> Please also note that one might miss the above issues when using
>> vstart.sh, since it sets osd_fast_shutdown to false.
>>
>> I created the following ticket to track the issue:
>> https://tracker.ceph.com/issues/53266
>>
>>
>> Additionally, we've already added some extra tricks to the code for
>> this fast shutdown mode, e.g. the osd_fast_shutdown_notify_mon option.
>>
>> Hence, given the above, shouldn't we reconsider the need for this fast
>> shutdown feature? IIUC the presence of various bugs along the regular
>> shutdown path was one of the primary rationales for introducing the new
>> mode. But IMO a properly running graceful shutdown is a sort of
>> code quality mark. And aren't we just moving the complexity/burden
>> from the shutdown procedure to the startup one this way? So maybe we'd better
>> invest in making shutdown clean enough?
>>
>>
>> Thanks,
>>
>> --
>> Igor Fedotov
>> Ceph Lead Developer
>>
> _______________________________________________
> Dev mailing list -- dev@xxxxxxx
> To unsubscribe send an email to dev-leave@xxxxxxx
