Re: Sequence rollover; local offset updated

William Brown <wbrown@xxxxxxx> · Sun, 22 Dec 2019 12:42:54 +1030

> On 22 Dec 2019, at 08:22, Christophe Trefois <trefex@xxxxxxxxx> wrote:
> 
> First off, apologies for double posting to here and ipa mailing list, but we are getting a bit uneasy, and also the issue seems to come from the code in 389-ds directly, so this seems more appropriate.

Hi there, thanks for contacting us. Happy for you to post here.

> 
> We are using ipa-server ipa-server-4.6.5-11.el7.centos.3.x86_64 with 389-ds-base-1.3.9.1-10.el7.x86_64 on CentOS 7.7.
> Since couple days some of our replicas are coming with "csngen_new_csn - Sequence rollover; local offset updated." messages in the slapd error logs.

This isn't a problem, but you should investigate the possible causes. The short answer is that we are pushing the Lamport clock ahead, due to either a high write rate or the system clock being stepped backwards.

To see the code, look at:

https://pagure.io/389-ds-base/blob/master/f/ldap/servers/slapd/csngen.c#_195

As a sanity check, you should probably investigate:

* Whether you have an unexpectedly high write load in your environment
* Whether you have issues with ntp consistency on your machines (the clock continually advancing or reversing)
* Whether a virtualised time sync service (e.g. vmware/libvirt) is conflicting with ntp and causing time jumps
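
One rough way to look for the clock-related causes above is to sample the wall clock and the monotonic clock together and flag any interval where they disagree. This is only a sketch: the function names and the 0.5 second tolerance are made up for illustration and are not part of 389-ds or any ntp tooling.

```python
import time


def find_clock_steps(samples, tolerance=0.5):
    """Return (index, drift_seconds) for each sample where the wall clock
    moved by a different amount than the monotonic clock.

    `samples` is a list of (wall_seconds, monotonic_seconds) pairs.
    A negative drift means the wall clock was stepped backwards.
    """
    steps = []
    for i in range(1, len(samples)):
        wall_delta = samples[i][0] - samples[i - 1][0]
        mono_delta = samples[i][1] - samples[i - 1][1]
        drift = wall_delta - mono_delta
        if abs(drift) > tolerance:
            steps.append((i, drift))
    return steps


def sample_clocks(count, interval):
    """Collect (time.time(), time.monotonic()) pairs `interval` seconds apart."""
    samples = []
    for _ in range(count):
        samples.append((time.time(), time.monotonic()))
        time.sleep(interval)
    return samples
```

Running `find_clock_steps(sample_clocks(600, 1.0))` on a replica would watch it for about ten minutes. Small gradual corrections from ntp stay under the tolerance, so only real steps are reported.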

For a slightly longer explanation: the CSN is a Lamport clock, i.e. it can only advance and never step back. It's based on the current unix time in seconds, with a 16-bit sub-counter, i.e. we can have 65535 writes "per second".

This is because if you have say:

Write object A
Ntp syncs clock backwards
Write object B

We need the CSNs of these to still reflect the true order of operations - that A occurs before B - as we use time as the sync source between replicas rather than locking/consensus. If the CSN weren't a Lamport clock, the changelog would show B before A, which is incorrect for reasons that are extremely complex and subtle.
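
The A/B scenario can be sketched in a few lines. This is a toy model, not the code in csngen.c: a CSN is reduced to a (seconds, counter) pair, the function names are hypothetical, and the 16-bit limit on the counter is ignored.

```python
def naive_csn(clock_seconds):
    """CSN taken straight from the system clock: (seconds, counter)."""
    return (clock_seconds, 0)


def lamport_csn(clock_seconds, last_csn):
    """Return a CSN strictly greater than last_csn, even if the clock
    reading is lower than the time already recorded in last_csn."""
    last_seconds, last_counter = last_csn
    if clock_seconds > last_seconds:
        # The clock moved forward: adopt it and restart the counter.
        return (clock_seconds, 0)
    # The clock stood still or went backwards: keep the highest time
    # seen so far and bump the counter instead.
    return (last_seconds, last_counter + 1)


# Write A at t=1000, ntp steps the clock back to t=990, then write B.
a_naive = naive_csn(1000)
b_naive = naive_csn(990)

a = lamport_csn(1000, (0, 0))
b = lamport_csn(990, a)

print("naive order preserved:  ", a_naive < b_naive)
print("lamport order preserved:", a < b)
```

With the naive version B sorts before A. With the Lamport-style version B keeps the time of A and gets a higher counter, so the order of the writes survives the clock step.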

So with the CSN being a Lamport clock, if ntp sets your time backwards, the CSN stays at the "highest" time and the sub-counter keeps incrementing. If this continues for a long time, we overflow the 16-bit sub-counter. We can't have duplicate CSNs, so the local offset (aka seconds) is increased to push the CSNs always forward.
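
To make the rollover concrete, here is a minimal sketch of a generator that behaves the way described above. It is an illustration under assumptions, not the real csngen.c logic: the class and attribute names are invented, and `seq_max` is a parameter only so that the rollover can be triggered with a handful of writes instead of 65535.

```python
class CsnGenerator:
    """Toy CSN generator: (seconds, seq) pairs that only move forward."""

    def __init__(self, seq_max=0xFFFF):
        self.seq_max = seq_max   # largest value the sub-counter can hold
        self.local_offset = 0    # seconds added on top of the system clock
        self.last_seconds = 0    # highest time handed out so far
        self.seq = 0             # sub-counter within last_seconds
        self.rollovers = 0       # how often "local offset updated" happened

    def new_csn(self, clock_seconds):
        now = clock_seconds + self.local_offset
        if now > self.last_seconds:
            # Time moved forward: adopt it and restart the sub-counter.
            self.last_seconds = now
            self.seq = 0
        elif self.seq < self.seq_max:
            # Same second, or the clock went backwards: bump the sub-counter.
            self.seq += 1
        else:
            # Sub-counter exhausted: push the time forward by one second
            # so that no CSN is ever issued twice.
            self.local_offset += 1
            self.last_seconds += 1
            self.seq = 0
            self.rollovers += 1
        return (self.last_seconds, self.seq)


# Six writes while the clock reads 1000, with room for only four per second.
gen = CsnGenerator(seq_max=3)
csns = [gen.new_csn(1000) for _ in range(6)]
print(csns, "rollovers:", gen.rollovers)
```

In this model the fifth write is the one that rolls over. The same thing happens whether the counter fills up because of a burst of writes or because the clock was stepped back and stayed behind, which is why both are worth checking.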

That's why I recommend you check your write load and ntp/system time.

Hope that helps, 

> 
> We use the python "ipa_check_consistency" and replication seems to be fine. 
> 
> We checked all replicas, and they are all in time sync with ntp (updated) with no visible offset. 
> 
> is this anything to worry about, and how can we make those messages to stop appearing?
> _______________________________________________
> 389-users mailing list -- 389-users@xxxxxxxxxxxxxxxxxxxxxxx
> To unsubscribe send an email to 389-users-leave@xxxxxxxxxxxxxxxxxxxxxxx
> Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
> List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
> List Archives: https://lists.fedoraproject.org/archives/list/389-users@xxxxxxxxxxxxxxxxxxxxxxx

--
Sincerely,

William Brown

Senior Software Engineer, 389 Directory Server
SUSE Labs