Re: Should I partition this table?

On Thu, 10 Jul 2014 08:40:59 -0700 (PDT) AlexK <alkuzo@xxxxxxxxx> wrote:

> Bill,
> 
> Regarding "SELECT performance improves nearly linearly with the number
> of partitions" - can you elaborate why? If I split my table into several
> partitions, even the index depth may stay the same, because the PK is
> narrow: it consists of only two 4-byte integers.

That statement is based on experimentation.  About a year ago, I did a
research project for a former employer to determine the best way to
store a large amount of data.  Like you, we were talking about narrow
rows, but we were looking at a total row count exceeding 10 billion.
Another difference: we didn't expect the number of unique "ParentID"s
to ever exceed 100,000.

I managed to borrow some beefy hardware from another project that
wouldn't need it for a few weeks and do some experimenting with
different partition configurations, all compared to a baseline of
an unpartitioned table.  I don't remember exactly, but I believe I
was populating the test databases with 1 billion rows.

That's where I came up with the "linear" determination.  A query that
averaged 100ms on a single table averaged 10ms on 10 partitions and
1ms on 100 partitions (etc).  Every test I concocted seemed to show
that the improvement was linear.

Another point I experimented with, and one worth noting: not all
queries are able to benefit from the partitioning, but the test queries
I ran that could not benefit saw only a few percent of a performance
hit.  Since our app design would take advantage of the partitioning
for 99% of its queries, it was a no-brainer.
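To make that concrete, here is a toy model in Python (not the actual
benchmark; the row counts, partition count, and column names are all
made up for illustration).  A filter on the partition key lets the
planner visit a single partition, while a filter on any other column
forces a scan of every partition:

```python
# Toy model of partition pruning.  Not the actual benchmark: the row
# counts, partition count, and column names are made up for illustration.

N_PARTS = 10
rows = [(parent, child) for parent in range(100) for child in range(100)]
parts = [[r for r in rows if r[0] % N_PARTS == i] for i in range(N_PARTS)]

def lookup_by_parent(parent_id):
    """Filter on the partition key: only one partition is examined."""
    part = parts[parent_id % N_PARTS]
    return len(part), [r for r in part if r[0] == parent_id]

def lookup_by_child(child_id):
    """Filter on a non-key column: every partition must be examined."""
    examined = sum(len(p) for p in parts)
    matches = [r for p in parts for r in p if r[1] == child_id]
    return examined, matches

print(lookup_by_parent(42)[0])   # 1000 rows examined (one partition)
print(lookup_by_child(42)[0])    # 10000 rows examined (all partitions)
```

The first lookup mimics the 99% case where pruning applies; the second
mimics the queries that only saw the small across-the-board hit.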

As a result, I can only theorize, but my theory is that since each
partition is actually a table, with indexes independent of the other
tables/partitions, both index searches and table scans required less
work on the smaller tables.  That theory doesn't 100% make sense,
since index performance should not degrade linearly with index size,
but that was the behavior I observed.  It's likely that, because of
the significant size of the hardware, the performance curve was simply
so close to flat that it looked linear at the sizes I was working
with.
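The index-depth caveat can be made concrete with a back-of-the-envelope
calculation (the fanout of 500 keys per index page is an assumed round
number, not a measured PostgreSQL figure).  B-tree search cost grows
with the logarithm of the row count, so shrinking a table 10x or even
100x removes at most a level or so:

```python
import math

def btree_depth(n_rows, fanout=500):
    """Rough B-tree height: smallest depth with fanout**depth >= n_rows.
    The fanout is an assumed figure, not measured from PostgreSQL."""
    return max(1, math.ceil(math.log(n_rows, fanout)))

for n in (1_000_000_000, 100_000_000, 10_000_000):
    print(f"{n:>13,} rows -> index depth ~{btree_depth(n)}")
# 1 billion rows -> ~4 levels; 100 million -> ~3; 10 million -> ~3
```

Which is why a strictly linear speedup is surprising: splitting 1
billion rows into 100 partitions of 10 million shaves roughly one
index level, not a factor of 100.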

> At this time we would rather not introduce bugs related to cache
> invalidation. Although we do need to read data fast, we do not want stale
> data, and we cannot afford any bugs in this application.

I can't make your decisions for you, but I'll make an attempt here
to help you avoid the same mistake I made.

On that same project, we made a similar decision: caches are sources
of bugs so we'll just make sure the DB is tuned well enough that we
don't need a cache.

That decision led to a lot of late night work under duress to get
caching reliably implemented at the last minute.  There are a lot of
things that partitioning doesn't speed up: parsing and planning the
queries, the time it takes the network to move data back and forth,
and the time it takes the application to unmarshal the results
provided by the SQL server.  Caching data in RAM completely removes
all of that overhead.
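For what it's worth, a cache that never serves stale data doesn't have
to be elaborate; the essential discipline is invalidating on every
write path.  A minimal sketch in Python (the class and key names are
mine, purely illustrative):

```python
class ReadThroughCache:
    """Read-through cache: misses go to the database, writes must
    explicitly invalidate.  A sketch, not production code."""

    def __init__(self, fetch_from_db):
        self._fetch = fetch_from_db   # stand-in for a real DB query
        self._cache = {}

    def get(self, key):
        if key not in self._cache:
            self._cache[key] = self._fetch(key)   # cache miss
        return self._cache[key]

    def invalidate(self, key):
        self._cache.pop(key, None)    # call this on every write path

db = {1: "alice", 2: "bob"}           # pretend this dict is the database
cache = ReadThroughCache(db.get)

print(cache.get(1))       # fetched from "db": alice
db[1] = "alice2"          # a write happens ...
cache.invalidate(1)       # ... and must invalidate, or reads go stale
print(cache.get(1))       # re-fetched: alice2
```

Miss the invalidate() call on any write path and you get exactly the
stale reads you're worried about, which is why QA discipline matters
more than the caching code itself.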

Of course, only you and your developers can make the determination as
to whether you need it.  I'm just recommending that you not dismiss
it simply because it's difficult to do reliably.  If you use the
"can't risk bugs" argument, you might benefit more from improving
your QA process than from anything else.

-- 
Bill Moran <wmoran@xxxxxxxxxxxxxxxxx>


