Re: [389-devel] 389-devel Digest, Vol 99, Issue 6


 




On 09/11/2013 07:41 PM, Howard Chu wrote:
389-devel-request@xxxxxxxxxxxxxxxxxxxxxxx wrote:


Today's Topics:

    1. Re: RFC: New Design: Fine Grained ID List Size (Rich Megginson)
    2. Re: RFC: New Design: Fine Grained ID List Size (Ludwig Krispenz)


----------------------------------------------------------------------

Message: 1
Date: Tue, 10 Sep 2013 08:29:14 -0600
From: Rich Megginson <rmeggins@xxxxxxxxxx>
To: Ludwig Krispenz <lkrispen@xxxxxxxxxx>
Cc: "389 Directory server developer discussion."
    <389-devel@xxxxxxxxxxxxxxxxxxxxxxx>
Subject: Re: [389-devel] RFC: New Design: Fine Grained ID List Size
Message-ID: <522F2CBA.1040205@xxxxxxxxxx>
Content-Type: text/plain; charset=UTF-8; format=flowed

On 09/10/2013 01:47 AM, Ludwig Krispenz wrote:

On 09/09/2013 07:19 PM, Rich Megginson wrote:
On 09/09/2013 02:27 AM, Ludwig Krispenz wrote:

On 09/07/2013 05:02 AM, David Boreham wrote:
On 9/6/2013 8:49 PM, Nathan Kinder wrote:
This is a good idea, and it is something that we discussed briefly
off-list.  The only downside is that we need to change the index
format to keep a count of ids for each key. Implementing this
isn't a big problem, but it does mean that the existing indexes
need to be updated to populate the count based off of the contents
(as you mention above).

I don't think you need to do this (I certainly wasn't advocating
doing so). The "statistics" state is much the same as that proposed
in Rich's design. In fact you could probably just use that same
information. My idea is more about where and how you use the
information. All you need is something associated with each index
that says "not much point looking here if you're after something
specific, move along, look somewhere else instead". This is much
the same information as "don't use a high scan limit here".


In the short term, we are looking for a way to be able to improve
performance for specific search filters that are not possible to
modify on the client side (for whatever reason) while leaving the
index file format exactly as it is.  I still feel that there is
potentially great value in keeping a count of ids per key so we
can optimize things on the server side automatically without the
need for complex index configuration on the administrator's part.
I think we should consider this for an additional future enhancement.

I'm saying the same thing. Keeping a cardinality count per key is
way more than I'm proposing, and I'm not sure how useful that would
be anyway, unless you want to do OLAP in the DS ;)
we have the cardinality of the key in old-idl and this makes some
searches where parts of the filter are allids fast.

I'm late in the discussion, but I think Rich's proposal is very
promising to address all the problems related to allids in new-idl.

We could then eventually rework filter ordering based on these
configurations. Right now we only have filter ordering based on the
index type and try to postpone "<=" or similar filters, as they are
known to be costly, but this could be more elaborate.
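
(As a rough illustration of that idea only, and not the actual back-end
code: one could attach an estimated cost to each component of an AND
filter and sort by it before evaluation. All names below are made up,
and the cost could come from the index type or, later, from per-key
statistics.)

    #include <stdlib.h>

    /* Illustrative sketch: order the components of an AND filter so the
     * cheapest / most selective ones are evaluated first. */
    struct filter_component {
        const char *expr;      /* e.g. "(c3sUserID=...)" */
        long        est_cost;  /* estimated candidate-list size */
    };

    static int by_cost(const void *a, const void *b)
    {
        const struct filter_component *fa = a, *fb = b;
        return (fa->est_cost > fb->est_cost) - (fa->est_cost < fb->est_cost);
    }

    static void reorder_and_filter(struct filter_component *comps, size_t n)
    {
        qsort(comps, n, sizeof(*comps), by_cost);
    }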

An alternative would be to have some kind of index lookup caching.
In the example in ticket 47474 the filter is
(&(|(objectClass=organizationalPerson)(objectClass=inetOrgPerson)(objectClass=organization)(objectClass=organizationalUnit)(objectClass=groupOfNames)(objectClass=groupOfUniqueNames)(objectClass=group))(c3sUserID=EndUser0000078458))
and probably only the "c3sUserID=xxxxx" part will change; if we
cache the result for the (&(|(objectClass=... part, then even if it is
expensive, it would only be done once.
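
(A minimal sketch of such a cache, purely illustrative and not the
actual ldbm back-end code: the type and function names are made up, and
a real version would also need invalidation on updates, locking, and a
proper eviction policy. The idea is just to key the cached ID list by
the normalized, rarely-changing sub-filter string.)

    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical ID-list type standing in for the back-end's IDL. */
    typedef struct { size_t n; unsigned int *ids; } IDList;

    #define FILTER_CACHE_SLOTS 8

    /* One expensive, rarely-changing sub-filter result per slot, keyed by
     * the normalized sub-filter, e.g. "(|(objectclass=inetorgperson)...)". */
    static struct {
        char   *filter;  /* normalized sub-filter (cache key) */
        IDList *ids;     /* candidate ID list produced last time */
    } filter_cache[FILTER_CACHE_SLOTS];

    static IDList *filter_cache_get(const char *filter)
    {
        for (size_t i = 0; i < FILTER_CACHE_SLOTS; i++)
            if (filter_cache[i].filter &&
                strcmp(filter_cache[i].filter, filter) == 0)
                return filter_cache[i].ids;
        return NULL;  /* miss: caller evaluates the sub-filter, then calls put() */
    }

    static void filter_cache_put(const char *filter, IDList *ids)
    {
        static size_t next;
        size_t slot = next++ % FILTER_CACHE_SLOTS;   /* round-robin eviction */
        free(filter_cache[slot].filter);             /* evicted IDList is leaked
                                                        here; a real cache would
                                                        free or refcount it */
        filter_cache[slot].filter = strdup(filter);
        filter_cache[slot].ids = ids;
    }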

Thanks everyone for the comments.  I have added Noriko's suggestion:
http://port389.org/wiki/Design/Fine_Grained_ID_List_Size

David, Ludwig: Does the current design address your concerns, and/or
provide the necessary first step for further refinements?
yes, the topic of filter reordering or caching could be looked at
independently.

Just one concern about the syntax:

nsIndexIDListScanLimit:
maxsize[:indextype][:flag[,flag...]][:value[,value...]]

since everything is optional, how do you decide whether, in
nsIndexIDListScanLimit: 6:eq:AND, "AND" is a value or a flag?
And as it defines limits for specific keys, could the attribute name
reflect this, e.g. nsIndexKeyIDListScanLimit or nsIndexKeyScanLimit or
...?

Thanks, yes, it is ambiguous.
I think it may have to use keyword=value, so something like this:

nsIndexIDListScanLimit: limit=NNN [type=eq[,sub]] [flags=AND[,OR]]
[values=val[,val...]]

That should be easy to parse for both humans and machines.
For values, we will have to figure out a way to support escapes (e.g. if a
value contains a comma or an escape character).  I was thinking of using
LDAP escapes (e.g. \, or \032).
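
(For illustration only, a per-key limit in that keyword=value form might
end up looking like the following in an index entry. The attribute name,
the keywords, and the backend DN shown here are hypothetical; they are
exactly what is still being discussed in this thread.)

    dn: cn=objectclass,cn=index,cn=userRoot,cn=ldbm database,cn=plugins,cn=config
    nsIndexIDListScanLimit: limit=5 type=eq flags=AND values=inetOrgPerson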




--
389-devel mailing list
389-devel@xxxxxxxxxxxxxxxxxxxxxxx
https://admin.fedoraproject.org/mailman/listinfo/389-devel






------------------------------

Message: 2
Date: Tue, 10 Sep 2013 16:35:07 +0200
From: Ludwig Krispenz <lkrispen@xxxxxxxxxx>
To: Rich Megginson <rmeggins@xxxxxxxxxx>
Cc: "389 Directory server developer discussion."
    <389-devel@xxxxxxxxxxxxxxxxxxxxxxx>
Subject: Re: [389-devel] RFC: New Design: Fine Grained ID List Size
Message-ID: <522F2E1B.9010600@xxxxxxxxxx>
Content-Type: text/plain; charset=UTF-8; format=flowed


On 09/10/2013 04:29 PM, Rich Megginson wrote:
On 09/10/2013 01:47 AM, Ludwig Krispenz wrote:

On 09/09/2013 07:19 PM, Rich Megginson wrote:
On 09/09/2013 02:27 AM, Ludwig Krispenz wrote:

On 09/07/2013 05:02 AM, David Boreham wrote:
On 9/6/2013 8:49 PM, Nathan Kinder wrote:
This is a good idea, and it is something that we discussed
briefly off-list.  The only downside is that we need to change
the index format to keep a count of ids for each key.
Implementing this isn't a big problem, but it does mean that the
existing indexes need to be updated to populate the count based
off of the contents (as you mention above).

I don't think you need to do this (I certainly wasn't advocating
doing so). The "statistics" state is much the same as that
proposed in Rich's design. In fact you could probably just use
that same information. My idea is more about where and how you use
the information. All you need is something associated with each
index that says "not much point looking here if you're after
something specific, move along, look somewhere else instead". This
is much the same information as "don't use a high scan limit here".


In the short term, we are looking for a way to be able to improve
performance for specific search filters that are not possible to
modify on the client side (for whatever reason) while leaving the
index file format exactly as it is.  I still feel that there is
potentially great value in keeping a count of ids per key so we
can optimize things on the server side automatically without the
need for complex index configuration on the administrator's part.
I think we should consider this for an additional future
enhancement.

I'm saying the same thing. Keeping a cardinality count per key is
way more than I'm proposing, and I'm not sure how useful that
would be anyway, unless you want to do OLAP in the DS ;)
we have the cardinality of the key in old-idl and this makes some
searches where parts of the filter are allids fast.

Just out of curiosity, why is keeping a count per key a problem? If you're using BDB duplicate key support, can't you just use that to get this? I.e., BDB already maintains key counts internally, why not leverage that?
well, cursor->c_count() is a function that counts the duplicates, and to do this it has to access all the pages containing duplicates, so I am not sure we would gain much
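
(For context, a minimal sketch of what the lookup Howard describes would
involve with the BDB cursor API. The key formatting is hypothetical, and,
as Ludwig notes above, c_count() has to walk the duplicate pages to
produce the number, so it is not a cheap metadata read.)

    #include <string.h>
    #include <db.h>

    /* Ask BDB how many duplicate data items (IDs) exist for one index key. */
    static int count_ids_for_key(DB *db, const char *keystr, db_recno_t *countp)
    {
        DBC *cursor = NULL;
        DBT key, data;
        int rc;

        if ((rc = db->cursor(db, NULL, &cursor, 0)) != 0)
            return rc;

        memset(&key, 0, sizeof(key));
        memset(&data, 0, sizeof(data));
        key.data = (void *)keystr;           /* e.g. "=inetorgperson" (hypothetical) */
        key.size = (u_int32_t)strlen(keystr);

        rc = cursor->c_get(cursor, &key, &data, DB_SET);
        if (rc == 0)
            rc = cursor->c_count(cursor, countp, 0);  /* walks the duplicate pages */

        cursor->c_close(cursor);
        return rc;
    }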
--
389-devel mailing list
389-devel@xxxxxxxxxxxxxxxxxxxxxxx
https://admin.fedoraproject.org/mailman/listinfo/389-devel




