Hi Stephen,

thanks for your feedback. I agree with you that compression is playing a role in the comparison. There is probably a price to pay when the load is high and the CPU is stressed by de/compressing data. If we are able to take our studies that far, this is definitely something we would like to measure.

I also agree that, at the moment, Postgres really shines on relational data. To be honest, after seeing the outcome of our research we are actually considering decoupling some (or all) fields from their JSON structure (a rough sketch of what that could look like is appended at the end of this mail). There will be a price to pay there too, since we receive the data in JSON format: time spent delivering such a solution, and time spent by the engine doing the conversion. It might not be that convenient after all. In any case, moving data from JSON to a relational model is off topic for the current discussion, since the question we are asking is whether Postgres is a good replacement for Mongo when handling JSON data.

As for sharing the dataset: as mentioned in the post, we are handling medical data. Even if the content is anonymized, we are not keen to share the data structure either, for security reasons. That's a pity, I know, but I cannot do anything about it. The queries we ran and the commands we used are mentioned in the blog post, but if you see gaps, feel free to ask.

regards,

fabio pardi


On 11/19/18 6:26 PM, Stephen Frost wrote:
> Greetings,
>
> * Fabio Pardi (f.pardi@xxxxxxxxxxxx) wrote:
>> We are open to any kind of feedback and we hope you enjoy the reading.
>
> Looks like a lot of the difference being seen and the comments made
> about one being faster than the other are because one system is
> compressing *everything*, while PG (quite intentionally...) only
> compresses the data sometimes- once it hits the TOAST limit. That
> likely also contributes to why you're seeing the on-disk size
> differences that you are.
>
> Of course, if you want to see where PG will really shine, you'd stop
> thinking of data as just blobs of JSON and actually define individual
> fields in PG instead of just one 'jsonb' column, especially when you
> know that field will always exist (which is obviously the case if you're
> building an index on it, such as your MarriageDate) and then remove
> those fields from the jsonb and just reconstruct the JSON when you
> query. Doing that you'll get the size down dramatically.
>
> And that's without even going to that next-level stuff of actual
> normalization where you pull out duplicate data from across the JSON
> and have just one instance of that data in another, smaller, table and
> use a JOIN to bring it all back together. Even better is when you
> realize that then you only have to update one row in this other table
> when something changes in that subset of data, unlike when you
> repeatedly store that data in individual JSON entries all across the
> system and such a change requires rewriting every single JSON object in
> the entire system...
>
> Lastly, as with any performance benchmark, please include full details-
> all scripts used, all commands run, all data used, so that others can
> reproduce your results. I'm sure it'd be fun to take your json data and
> create actual tables out of it and see what it'd be like then.
>
> Thanks!
>
> Stephen
>
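
PS: a few rough sketches of the ideas discussed above. All table, column and field names are invented for illustration, since the real schema cannot be shared.

First, the decoupling idea: promoting an always-present, indexed field (such as the MarriageDate Stephen mentions) to a real column and reconstructing the original JSON shape at query time. A minimal sketch could look like this:

    -- one promoted column plus the remainder of the document
    CREATE TABLE people (
        id            bigserial PRIMARY KEY,
        marriage_date date,   -- pulled out of the JSON
        doc           jsonb   -- the rest of the document, without MarriageDate
    );

    -- plain btree index instead of an expression index on the jsonb
    CREATE INDEX people_marriage_date_idx ON people (marriage_date);

    -- rebuild the full JSON when querying
    SELECT doc || jsonb_build_object('MarriageDate', marriage_date) AS full_doc
    FROM people
    WHERE marriage_date >= '2000-01-01';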
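
On the compression/TOAST point, a quick (approximate) way to compare the stored size of the jsonb with its raw text size, assuming the same hypothetical table as above:

    -- pg_column_size() reports the stored (possibly TOASTed/compressed) size,
    -- octet_length() of the text form gives the uncompressed size
    SELECT avg(pg_column_size(doc))::int     AS avg_stored_bytes,
           avg(octet_length(doc::text))::int AS avg_text_bytes
    FROM people;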
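
And a sketch of the normalization Stephen describes: data repeated across many JSON documents is pulled out into its own smaller table, updated in one place, and JOINed back in when the full document is needed. The hospitals/admissions example is made up, not our actual data:

    CREATE TABLE hospitals (
        hospital_id serial PRIMARY KEY,
        name        text NOT NULL,
        city        text
    );

    CREATE TABLE admissions (
        id          bigserial PRIMARY KEY,
        hospital_id int REFERENCES hospitals (hospital_id),
        doc         jsonb   -- document without the repeated hospital block
    );

    -- a change to a hospital is one UPDATE here, instead of rewriting
    -- every JSON document that embeds the hospital data
    UPDATE hospitals SET name = 'Renamed hospital' WHERE hospital_id = 42;

    -- JOIN to rebuild the full document
    SELECT a.doc || jsonb_build_object('Hospital', to_jsonb(h) - 'hospital_id') AS full_doc
    FROM admissions a
    JOIN hospitals h USING (hospital_id);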