Re: MongoDB 3.2 beating Postgres 9.5.1?

<pbj@xxxxxxxxxx> · Tue, 15 Mar 2016 23:39:48 +0000 (UTC)

Your results are close enough to mine, I think, to prove the point.   And, I agree that the EDB benchmark is not necessary reflective of a real-world scenario.

However, the cache I'm referring to is PG's shared_buffer cache.   You can see the first run of the select causing a lot of disk reads.  The second identical run, reads purely from shared_buffers.

What I don't understand is, why does a slightly different select from the *same* table during the same session cause shared_buffers to be blown out and re-read??

I will see if I can try YCSB next week (I'm in workshops all week...)

Thanks!

    On Monday, March 14, 2016 3:34 AM, Dmitry Dolgov <9erthalion6@xxxxxxxxx> wrote:

 Hi, Paul
I agree with Oleg, EDB benchmarks are strange sometimes. I did the same benchmarks several months ago. I never noticed the cache influence back then, so I tried to reproduce your situation now (on a 5*10^6 records although). I started to play with db cache (using `echo 3 > /proc/sys/vm/drop_cache`), and I see difference in time execution for two subsequent queries, but `explain` info are almost identical, e.g. `shared hit & read`:

```
benchmark=# explain (buffers, analyze, verbose) select data from json_tables where data @> '{"name": "AC3 Case Red"}';
                                                                 QUERY PLAN                                                                 
--------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.json_tables  (cost=102.74..19001.47 rows=4999 width=1257) (actual time=740.556..215956.655 rows=454546 loops=1)
   Output: data
   Recheck Cond: (json_tables.data @> '{"name": "AC3 Case Red"}'::jsonb)
   Rows Removed by Index Recheck: 2114606
   Heap Blocks: exact=31624 lossy=422922
   Buffers: shared hit=1371 read=455551
   ->  Bitmap Index Scan on json_tables_idx  (cost=0.00..101.49 rows=4999 width=0) (actual time=731.010..731.010 rows=454547 loops=1)
         Index Cond: (json_tables.data @> '{"name": "AC3 Case Red"}'::jsonb)
         Buffers: shared hit=1371 read=1005
 Planning time: 6.352 ms
 Execution time: 216075.830 ms
(11 rows)

benchmark=# explain (buffers, analyze, verbose) select data from json_tables where data @> '{"name": "AC3 Case Red"}';
                                                                QUERY PLAN                                                                 
-------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.json_tables  (cost=102.74..19001.47 rows=4999 width=1257) (actual time=222.476..10692.703 rows=454546 loops=1)
   Output: data
   Recheck Cond: (json_tables.data @> '{"name": "AC3 Case Red"}'::jsonb)
   Rows Removed by Index Recheck: 2114606
   Heap Blocks: exact=31624 lossy=422922
   Buffers: shared hit=1371 read=455551
   ->  Bitmap Index Scan on json_tables_idx  (cost=0.00..101.49 rows=4999 width=0) (actual time=214.736..214.736 rows=454547 loops=1)
         Index Cond: (json_tables.data @> '{"name": "AC3 Case Red"}'::jsonb)
         Buffers: shared hit=1371 read=1005
 Planning time: 0.089 ms
 Execution time: 10767.739 ms
(11 rows)
```

But I see almost the same execution time from mongodb `explain` (216075ms for pg and 177784ms for mongo, which isn't so much I think):

```
DBQuery.shellBatchSize = 10000000000; db.json_tables.find({"name": "AC3 Case Red"}).explain(true)
{
        "queryPlanner" : {
                "plannerVersion" : 1,
                "namespace" : "benchmark.json_tables",
                "indexFilterSet" : false,
                "parsedQuery" : {
                        "name" : {
                                "$eq" : "AC3 Case Red"
                        }
                },
                "winningPlan" : {
                        "stage" : "FETCH",
                        "inputStage" : {
                                "stage" : "IXSCAN",
                                "keyPattern" : {
                                        "name" : 1
                                },
                                "indexName" : "name_1",
                                "isMultiKey" : false,
                                "isUnique" : false,
                                "isSparse" : false,
                                "isPartial" : false,
                                "indexVersion" : 1,
                                "direction" : "forward",
                                "indexBounds" : {
                                        "name" : [
                                                "[\"AC3 Case Red\", \"AC3 Case Red\"]"
                                        ]
                                }
                        }
                },
                "rejectedPlans" : [ ]
        },
       "executionStats" : {
                "executionSuccess" : true,
                "nReturned" : 454546,
                "executionTimeMillis" : 177784,
                "totalKeysExamined" : 454546,
                "totalDocsExamined" : 454546,
                "executionStages" : {
                        "stage" : "FETCH",
                        "nReturned" : 454546,
                        "executionTimeMillisEstimate" : 175590,
                        "works" : 454547,
                        "advanced" : 454546,
                        "needTime" : 0,
                        "needYield" : 0,
                        "saveState" : 8638,
                        "restoreState" : 8638,
                        "isEOF" : 1,
                        "invalidates" : 0,
                        "docsExamined" : 454546,
                        "alreadyHasObj" : 0,
                        "inputStage" : {
                                "stage" : "IXSCAN",
                                "nReturned" : 454546,
                                "executionTimeMillisEstimate" : 700,
                                "works" : 454547,
                                "advanced" : 454546,
                                "needTime" : 0,
                                "needYield" : 0,
                                "saveState" : 8638,
                                "restoreState" : 8638,
                                "isEOF" : 1,
                                "invalidates" : 0,
                                "keyPattern" : {
                                        "name" : 1
                                },
                                "indexName" : "name_1",
                                "isMultiKey" : false,
                                "isUnique" : false,
                                "isSparse" : false,
                                "isPartial" : false,
                                "indexVersion" : 1,
                                "direction" : "forward",
                                "indexBounds" : {
                                        "name" : [
                                                "[\"AC3 Case Red\", \"AC3 Case Red\"]"
                                        ]
                                },
                                "keysExamined" : 454546,
                                "dupsTested" : 0,
                                "dupsDropped" : 0,
                                "seenInvalidated" : 0
                        }
                },
                "allPlansExecution" : [ ]
        },
        "serverInfo" : {
                "host" : "ip-172-30-0-236",
                "port" : 27017,
                "version" : "3.2.4",
                "gitVersion" : "e2ee9ffcf9f5a94fad76802e28cc978718bb7a30"
        },
        "ok" : 1
}
```

I not missed anything, am I right? Are you sure that it took much more time for PostgreSQL?
Besides, everything is fine for queries with more small results (while the query {"name": "AC3 Case Red"} is almost 1/10 of entire dataset):

```
=# insert into json_tables values('{"name": "test name"}'::jsonb);

=# explain (buffers, analyze, verbose) select data from json_tables where data @> '{"name": "test name"}';                                                   
                                                          QUERY PLAN                                                           
-------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.json_tables  (cost=62.75..18965.16 rows=5000 width=1257) (actual time=0.020..0.021 rows=1 loops=1)
   Output: data
   Recheck Cond: (json_tables.data @> '{"name": "test name"}'::jsonb)
   Heap Blocks: exact=1
   Buffers: shared hit=5
   ->  Bitmap Index Scan on json_tables_idx  (cost=0.00..61.50 rows=5000 width=0) (actual time=0.011..0.011 rows=1 loops=1)
         Index Cond: (json_tables.data @> '{"name": "test name"}'::jsonb)
         Buffers: shared hit=4
 Planning time: 1.164 ms
 Execution time: 0.045 ms
(10 rows)
```

As far as I know there isn't much to do about caching. I don't know if it's appropriate, but you can manually warm-up the cache (something like `cat /var/lib/postgresql/9.5/main/base/*/* > /dev/null`).

On 14 March 2016 at 00:30, Oleg Bartunov <obartunov@xxxxxxxxx> wrote:

On Mar 11, 2016 4:40 PM, "Paul Jones" <pbj@xxxxxxxxxx> wrote:

>

> I have been running the EDB benchmark that compares Postgres and MongoDB.

> I believe EDB ran it against PG 9.4 and Mongo 2.6.  I am running it

> against PG 9.5.1 and Mongo 3.2 with WiredTiger storage using 10,000,000

> JSON records generated by the benchmark.  It looks like Mongo is winning,

> and apparently because of its cache management.
Dmitry was working on the same benchmarks. I think edb benchmark is broken by design. Better,  use ycsb benchmarks. I hope, Dmitry will share his 

results.

>

> The first queries on both run in ~30 min.  And, once PG fills its cache,

> it whips Mongo on repeats of the *same* query (vmstat shows no disk

> reads for PG).

>

> However, when different query on the same table is issued to both,

> vmstat shows that PG has to read the *entire* table again, and it takes

> ~30 min.  Mongo does a lot of reads initially but after about 5 minutes,

> it stops reading and completes the query, most likely because it is

> using its cache very effectively.

>

> Host:   Virtual Machine

>         4 CPUs

>         16 Gb RAM

>         200 Gb Disk

>         RHEL 6.6

>

> PG:     9.5.1 compiled from source

>         shared_buffers = 7GB

>         effectve_cache_size = 12GB

>

> Mongo:  3.2 installed with RPM from Mongo

>

> In PG, I created the table by:

>

> CREATE TABLE json_tables

> (

>         data    JSONB

> );

>

> After loading, it creates the index:

>

> CREATE INDEX json_tables_idx ON json_tables USING GIN (data jsonb_path_ops);

>

> After a lot of experimentation, I discovered that the benchmark was not

> using PG's index, so I modified the four queries to be:

>

> SELECT data FROM json_tables WHERE data @> '{"brand": "ACME"}';

> SELECT data FROM json_tables WHERE data @> '{"name": "Phone Service Basic Plan"}';

> SELECT data FROM json_tables WHERE data @> '{"name": "AC3 Case Red"}';

> SELECT data FROM json_tables WHERE data @> '{"type": "service"}';

>

> Here are two consecutive explain analyze for PG, for the same query.

> No functional difference in the plans that I can tell, but the effect

> of PG's cache on the second is dramatic.

>

> If anyone has ideas on how I can get PG to more effectively use the cache

> for subsequent queries, I would love to hear them.

>

> -------

>

> benchmark=# explain analyze select data from json_tables where data @> '{"name": "AC3 Case Red"}';

>

>                                                                QUERY PLAN

>

>

> -----------------------------------------------------------------------------------------------------------------------------------------

>

> Bitmap Heap Scan on json_tables  (cost=113.50..37914.64 rows=10000 width=1261)

> (actual time=2157.118..1259550.327 rows=909091 loops=1)

>    Recheck Cond: (data @> '{"name": "AC3 Case Red"}'::jsonb)

>    Rows Removed by Index Recheck: 4360296

>    Heap Blocks: exact=37031 lossy=872059

>    ->  Bitmap Index Scan on json_tables_idx  (cost=0.00..111.00 rows=10000 width =0) (actual time=2141.250..2141.250 rows=909091 loops=1)

>          Index Cond: (data @> '{"name": "AC3 Case Red"}'::jsonb)

> Planning time: 291.932 ms

> Execution time: 1259886.920 ms

> (8 rows)

>

> Time: 1261191.844 ms

>

> benchmark=# explain analyze select data from json_tables where data @> '{"name": "AC3 Case Red"}';

>                                                               QUERY PLAN

>

> ---------------------------------------------------------------------------------------------------------------------------------------

>

> Bitmap Heap Scan on json_tables  (cost=113.50..37914.64 rows=10000 width=1261) (actual time=779.261..29815.262 rows=909091 loops=1)

>    Recheck Cond: (data @> '{"name": "AC3 Case Red"}'::jsonb)

>    Rows Removed by Index Recheck: 4360296

>    Heap Blocks: exact=37031 lossy=872059

>    ->  Bitmap Index Scan on json_tables_idx  (cost=0.00..111.00 rows=10000 width =0) (actual time=769.081..769.081 rows=909091 loops=1)

>          Index Cond: (data @> '{"name": "AC3 Case Red"}'::jsonb)

> Planning time: 33.967 ms

> Execution time: 29869.381 ms

>

> (8 rows)

>

> Time: 29987.122 ms

>

>

> --

> Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)

> To make changes to your subscription:

> http://www.postgresql.org/mailpref/pgsql-general