Re: CPU Scalability / Scaling

> On 17 Aug 2020, at 07:51, Mark Reynolds <mreynolds@xxxxxxxxxx> wrote:
> 
> 
> 
> On 8/16/20 1:14 PM, Ben Spencer wrote:
>> 
>> 
>> On Fri, Aug 14, 2020 at 6:19 PM Marc Sauton <msauton@xxxxxxxxxx> wrote:
>> 
>> 
>> 
>> On Fri, Aug 14, 2020 at 1:31 PM Ben Spencer <isatworktoday@xxxxxxxxx> wrote:
>> 
>> 
>> On Fri, Aug 14, 2020, 10:53 AM David Boreham <david@xxxxxxxxxxxxxxx> wrote:
>> 
>> On 8/14/2020 9:04 AM, Ben Spencer wrote:
>> > After a little investigation I didn't find any recent information on 
>> > how well / linearly 389 scales from a CPU perspective. I also realize 
>> > this is a more complicated topic with many factors which actually play 
>> > into it.

"complicated topic" may be the understatement of the century here :)

>> >
>> > Throwing the basic question out there: Does 389 scale fairly 
>> > linearly as the number of CPUs are increased? Is there a point where 
>> > it drops off?

"It depends", very much on your workload and your configuration IE plugins. I would say out of the box "yes". But the only way to know is in your environment with monitoring. 

Remember also that single server scaling is only one element of a DS install - scaling over a replication topology is just as crucial. For example, we know that openldap may be faster than 389-ds in a single server context, but we have heard from large deployments (in excess of hundreds of replicas) that as a replication topology grows, 389-ds will outperform openldap due to differences in how replication is managed.

>> 
>> Cached reads (cached anywhere : filesystem cache, db page pool, entry 
>> cache) should scale quite well, at least to 4/6/8 CPU. I'm not sure 
>> about today's 8+ CPU systems but would assume probably not great scaling 
>> beyond 8 until proven otherwise.
>> 
>> Interesting since we are currently sitting with 10 CPU per server. Things organically grew over time without much thought given.
>>  
>> Having more CPUs may simply displace the problem, for example to more thread contention and new performance problems.

If you have multiple sockets then NUMA may also become a factor, as accessing memory attached to the other NUMA region will add delays.
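If you want to see whether a box actually spans NUMA nodes, something like this shows the node layout (this assumes you have the numactl package installed, which is not always the case by default):

  numactl --hardware

That prints the available nodes, which CPUs belong to each node, and the per-node memory sizes, so you can tell whether your instance's memory could end up split across regions.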

>> 
>> That is one of the concerns we have.
>>  
>> Related to CPU and configurations, the "autotuning" for nsslapd-threadnumber is recommended.
>> http://www.port389.org/docs/389ds/design/autotuning.html
>> https://access.redhat.com/documentation/en-us/red_hat_directory_server/11/html/performance_tuning_guide/ds-threads
>> ( "excessive" manual thread setting will have a counter effect )
>> 
>> We use autotuning, but if the information is to be believed, 389 supports 512 threads. I could read this to mean that performance drops off at 512 threads, but it is not clear whether performance is linear up until that point. Do 512 threads handle 16x more work than 32 threads without an execution time drop-off or an increase in queuing?
> This is why we have autotuning, so you don't have to worry about it :-)  But you want to set the number of "worker" threads (nsslapd-threadnumber) to the number of CPUs.  Note that replication agreements also each generate a thread, so if you have 16 CPUs and 4 repl agreements, you would want to set nsslapd-threadnumber to 12.

I think we set the maximum of 512 because we didn't have operational experience with "anything larger" at that point. If you have a machine with more than 512 threads, I'd be curious to see how 389 performs.

But honestly, I think it's probably a better investment to get multiple single socket servers that you can distribute into a replication topology, rather than relying on a smaller number of big servers.

If you are worried autotuning isn't doing the best job, consider setting threads to match the number of sockets you have. This reduces thread context switching so that each task can complete faster. 
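For reference, if you do decide to set it by hand, a minimal sketch with ldapmodify (the value 12 matches Mark's 16 CPU / 4 agreement example above; cn=config and Directory Manager credentials are the usual defaults, adjust to your install):

  dn: cn=config
  changetype: modify
  replace: nsslapd-threadnumber
  nsslapd-threadnumber: 12

Save that as threads.ldif and apply it with something like:

  ldapmodify -x -D "cn=Directory Manager" -W -H ldap://localhost -f threads.ldif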

> 
> HTH,
> 
> Mark
> 
>>  
>> There is another aspect, on the LDAP client side:
>> High CPU use is often the result of "poorly" designed applications that are hammering an LDAP server with a constant flow of complex search filters with pattern matching.
>> And very often all the long server side CPU processing is useless.
>> Analysing the LDAP server access log can help you tune and change the filters those applications are sending, which can have a high impact on the server side.
>> Often only the global settings are kept, and there is an overlooked server side configuration that can really help optimize CPU and I/O: the "fine grained" ID list scan limit.
>> http://www.port389.org/docs/389ds/design/fine-grained-id-list-size.html
>> https://access.redhat.com/documentation/en-us/red_hat_directory_server/11/html/configuration_command_and_file_reference/database_plug_in_attributes#nsIndexIDListScanLimit
>> It can also reduce or optimize the load that index use puts on system resources.

IDListScanLimit is a really weird tuning setting. It *MAY* actually be better to tune IDListScanLimit *high* rather than lowering it. This is because the point of IDListScanLimit is to say "okay, if this index is really large it might be a full table scan anyway, let's move on and hope something more specific exists in the filter".

Where IDListScanLimit helps is with a filter such as:

(&(objectClass=account)(uid=william))

The idea is to stop loading the large index of objectClass=account, and "hope" the more specific index of uid=william is smaller, so that we can then load the single entry and filter test it as a performance optimisation. 

Of course, this doesn't always help - consider a search like "objectClass=group". If you have 10,000 users and 10,000 groups, you don't want to full table scan 20,000 entries (which would scan 4000 entries and then hit the size limit). You actually want to scan the index even though it's 10,000 elements, because then we can check the length of the candidate set and shortcut-return without needing to filter test 4000 entries. Sure, an index with 10,000 items probably isn't the cheapest to load, but it's only going to be about 40KB, which in the scheme of things is pretty cheap I/O- and memory-wise compared to filter testing 4000 entries.

That's why this tunable is so weird: you are either tuning for "let's make broad searches faster" by setting it high, at the expense that targeted searches may be "slower", OR you are tuning for "let's make targeted searches fast" at the expense that broad searches will have to go through the admin limit path, potentially filter testing lots of entries in the process.
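As a sketch of the fine-grained variant Marc linked earlier, you can scope the limit to a single index and even a single value. The syntax below is from my memory of that design page and assumes a backend named userRoot, so please verify against the doc before applying it:

  dn: cn=objectclass,cn=index,cn=userRoot,cn=ldbm database,cn=plugins,cn=config
  changetype: modify
  add: nsIndexIDListScanLimit
  nsIndexIDListScanLimit: limit=0 type=eq flags=AND values=account

The intent of a rule like this is "when objectClass=account appears inside an AND, treat its ID list as exceeded immediately (limit=0) so the more specific term in the filter drives the search", which is exactly the (&(objectClass=account)(uid=william)) case above.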

IDListScanLimit exists as a form of naive filter optimisation. Proper filter optimisation has been in development for a long time (it's currently stalled on some issues with VLV), but once it lands we'd basically be able to remove IDListScanLimit as a tunable and make both broad AND targeted searches faster.

The reason filter optimisation works is the filter test threshold: if the candidate set is already below a certain limit, we don't need to load more indexes, we just filter test the partial candidate set instead. It means that:

(&(uid=william)(objectClass=account))

Would resolve uid=william to a single entry in the index, then shortcut and never even load the objectClass index. Plus we'd re-arrange (&(objectClass=account)(uid=william)) to the more efficient uid-first version without any admin intervention needed.

So you may find in the interim that changing application queries to put more specific filter terms FIRST could actually net you some large gains. (Note: some applications, like SSSD, are known to always create slow/poorly arranged filters, and we can't do much to fix that :( )

The only way to know what will work here is to get 24 hours of access logs and spreadsheet out the filters and the etimes to work out your access patterns. Then you can choose which way you should tune your IDListScanLimit.
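If it helps, here is a rough sketch for pulling filter/etime pairs out of an access log so you can drop them into a spreadsheet (the path assumes an instance named "example"; logconv.pl, which ships with 389-ds, can also summarise access logs for you):

  awk '
    /SRCH/   { if (match($0, /conn=[0-9]+ op=[0-9]+/)) key = substr($0, RSTART, RLENGTH)
               if (match($0, /filter="[^"]*"/))        f[key] = substr($0, RSTART, RLENGTH) }
    /RESULT/ { if (match($0, /conn=[0-9]+ op=[0-9]+/)) key = substr($0, RSTART, RLENGTH)
               if ((key in f) && match($0, /etime=[0-9.]+/))
                   print f[key], substr($0, RSTART, RLENGTH) }
  ' /var/log/dirsrv/slapd-example/access > filters-etimes.txt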


>> 
>> Unfortunately we've gone through this exercise previously with no actionable items.

Depending on what version you are on, it can help to enable filter syntax validation, which can prevent certain types of bad queries that cause full table scans:

http://www.port389.org/docs/389ds/design/filter-syntax-verification.html

Also be sure to check for notes=U and notes=A in your logs, which indicate partially and fully unindexed searches respectively. (Counter-intuitively, notes=U can appear for a client query that is actually fully unindexed, because 389 internally wraps the filter, which produces the partially-indexed warning instead.)
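A quick way to find those in the first place (same assumption about the instance name / log path as above):

  grep -E 'notes=(A|U)' /var/log/dirsrv/slapd-example/access | tail -n 20

Any conn/op that shows up there is worth tracing back to its SRCH line to see exactly which filter triggered it.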

>>  
>> So there should be some investigation before adding more system resources: logs, pstacks or gdb stack traces, index configurations.
>> 
>> Writes are going to be heavily serialized, assume no CPU scaling. Fast 
>> I/O is what you need for write throughput.
>> > Where am I going with this?
>> > We are faced with either adding more CPUs to the existing servers or 
>> > adding more instances or a combination of the two. The current servers 
>> > have 10 CPU with the entire database fitting in RAM but, there is a 
>> > regular flow of writes. Sometimes somewhat heavy thanks to batch 
>> > updates. Gut feeling tells me to have more servers than a few huge 
>> > servers largely because of the writes/updates and lock contention. 
>> > Needing to balance the server sprawl as well.

An improvement we have seen in "huge" environments is to have your writes flow to only 2 - 4 servers, and then direct reads to other servers. That way your "read" servers only need to acknowledge and accept incoming writes via replication. 

Additionally, review what plugins you have enabled, as many plugins affect the write path and can add additional locking. If you aren't using a plugin's feature, consider disabling it. IIRC managed entries comes to mind here, but I can't remember if that's enabled by default or not.
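To review what is actually enabled, a simple sketch using a plain search against the plugin tree (assumes a local instance and Directory Manager credentials):

  ldapsearch -x -D "cn=Directory Manager" -W -H ldap://localhost \
      -b "cn=plugins,cn=config" -s one "(objectClass=*)" cn nsslapd-pluginEnabled

Anything reporting nsslapd-pluginEnabled: on that you know you don't use is a candidate to investigate, though check for plugin dependencies before turning anything off.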

>> 
>> I'd look at whether I/O throughput (Write IOPS particularly) can be 
>> upgraded as a first step. Then perhaps look at system design to see if 
>> the batch updates can be throttled/trickled to reduce the cross-traffic 
>> interference. Usually the write load is the limiting factor scaling 
>> because it has to be replayed on every server regardless of its read 
>> workload.
>> 
>> Something to consider. Hard to resolve in the environment where the servers are.
>> Large bulk updates always degrade LDAP server performance.
>> Adding more replicas will create more contention for replication sessions.
>> LDAP replication can be tuned to accommodate competition between replication sessions, but there is no dynamic tuning that adapts to traffic pattern changes or large bursts of modifications, so throttling the scheduled updates seems a good and easier approach.
>> 
>>  Since we are weighing adding CPUs vs adding more replicas, those tuning options may be useful. A couple of questions around this:
>> 1) is the contention on inbound only, inbound and outbound or the entire replication domain?

Replication will help you gain fractional increases in write IOPS, but large gains in read throughput. This is because for every write in your topology, that write must be replicated to every other server. Replication is batched though, making it fractionally more efficient.

Adding more read servers, though, will increase your read throughput almost linearly. So as mentioned, try to have more, smaller servers rather than fewer, larger servers. This also gives you more fault tolerance.


>> 2) What are some of the tuning knobs?

An unexpected tuning that may help is to put your logs onto fast write I/O devices. Even with log buffering there is a small section of log flushing that is blocking, so this helps shorten that window, which could net you more write/read performance.

You probably want to look at all of the various cn=monitor entries throughout cn=config, because they can give hints about what to tune. For example, the NDN cache may be too small; you could find in the monitor that you have high eviction rates, meaning you need to increase its size.
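For example, two monitor entries worth a look (again assuming a local instance and Directory Manager credentials; per-backend monitors also exist under each backend's own entry):

  # server-wide counters: connections, operations, threads
  ldapsearch -x -D "cn=Directory Manager" -W -H ldap://localhost \
      -b "cn=monitor" -s base "(objectClass=*)"

  # ldbm database monitor: db/page cache and normalized DN cache statistics
  ldapsearch -x -D "cn=Directory Manager" -W -H ldap://localhost \
      -b "cn=monitor,cn=ldbm database,cn=plugins,cn=config" -s base "(objectClass=*)"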


Hope this (very long) response gives you some ideas.


>> 
>> 
>> 
> -- 
> 
> 389 Directory Server Development Team
> 

--
Sincerely,

William Brown

Senior Software Engineer, 389 Directory Server
SUSE Labs
_______________________________________________
389-users mailing list -- 389-users@xxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to 389-users-leave@xxxxxxxxxxxxxxxxxxxxxxx
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/389-users@xxxxxxxxxxxxxxxxxxxxxxx



