Hi again,

On 11/20/18 2:34 PM, Stephen Frost wrote:

>> I agree with you that the compression is playing a role in the
>> comparison. Probably there is a toll to pay when the load is high and
>> the CPU is stressed from de/compressing data. If we are able to take
>> our studies further, this is definitely something we would like to
>> measure.
>
> I was actually thinking of the compression as having more of an impact
> with regard to the 'cold' cases because you're pulling fewer blocks when
> it's compressed. The decompression cost on CPU is typically much, much
> less than the cost to pull the data off of the storage medium. When
> things are 'hot' and in cache then it might be interesting to question
> if the compression/decompression is worth the cost.

True. When the data is present in RAM, Postgres is faster because, as you
say, the compression no longer gives a benefit on retrieving data from
disk. In my statement above about the CPU I was speculating whether the
speed Mongo gains thanks to block compression would act as a double-edged
sword under warm-cache scenarios and heavy load.

>> I also agree with you that at the moment Postgres really shines on
>> relational data. To be honest, after seeing the outcome of our
>> research, we are actually considering decoupling some (or all) fields
>> from their JSON structure. There will be a toll to be paid there too,
>> since we are receiving data in JSON format.
>
> PostgreSQL has tools to help with this, you might look into
> 'json_to_record' and friends.

It might turn out useful to us if we normalize our data, thanks. (I put a
small sketch of what I understand you mean in a P.S. at the bottom of
this mail.)

>> Anyway, bringing data from JSON to a relational model is off topic for
>> the current discussion, since we are actually questioning whether
>> Postgres is a good replacement for Mongo when handling JSON data.
>
> This narrow viewpoint isn't really sensible though- what you should be
> thinking about is what's appropriate for your *data*. JSON is just a
> data format, and while it's alright as a system inter-exchange format,
> it's rather terrible as a storage format.

I did not want to narrow the viewpoint; I am exploring possibilities.
Since Postgres supports JSON, it would have been nice to know how far one
can go in storing data without transforming it.

When we started our research, the only question was: is it possible to
replace Postgres with Mongo 1 to 1? All other considerations came
afterwards and, as a matter of fact, as already mentioned, we are now
considering transforming the data (maybe partially) to a relational
model.

Maybe we did not look around enough, but we did not find all the answers
to our questions on the internet, therefore we initiated something
ourselves.

>> As per sharing the dataset, as mentioned in the post we are handling
>> medical data. Even if the content is anonymized, we are not keen to
>> share the data structure either, for security reasons.
>
> If you really want people to take your analysis seriously, others must
> be able to reproduce your results. I certainly appreciate that there
> are very good reasons that you can't share this actual data, but your
> testing could be done with completely generated data which happens to be
> similar in structure to your data and have similar frequency of values.
>
> The way to approach generating such a data set would be to aggregate up
> the actual data to a point where the appropriate committee/board agree
> that it can be shared publicly, and then you build a randomly generated
> set of data which aggregates to the same result and then use that for
> testing.

Looking back, I would probably have generated data that is sharable with
everybody, to give people the opportunity to play with it and to get them
more involved (the P.P.S. at the bottom gives a rough idea of what such
generated data could look like). The fact is that we started very small
and ended up with quite a bunch of information we felt like sharing. Time
is a tyrant and at the moment we cannot re-run everything with sharable
data, so we all have to live with it. It is not optimal and not perfectly
academic, but it is still better than not sharing at all, in my opinion.

One good thing is that while testing and learning I found a similar
investigation which led to similar results (unfortunately there too one
can argue that the dataset, the scripts and all the rest are not shared).
In the jsquery section of the blog post there is a link pointing to:

https://github.com/postgrespro/jsquery/blob/master/README.md

which in turn points to:

http://www.sai.msu.su/~megera/postgres/talks/pgconfeu-2014-jsquery.pdf

On page 18 there are some results which are close to what we obtained,
even though the paper is from 2014 and a lot has changed in the landscape
since then. This is to say that I suspect that if we generate random JSON
data, we will probably draw the same conclusions.

>> That's a pity, I know, but I cannot do anything about it.
>> The queries we ran and the commands we used are mentioned in the blog
>> post but if you see gaps, feel free to ask.
>
> There were a lot of gaps that I saw when I looked through the article-
> starting with things like the actual CREATE TABLE command you used,

You are right: there was only the command I used to transform the table
to jsonb. A small detail, but I updated the post for clarity.

> and the complete size/structure of the JSON object, but really what a
> paper like this should include is a full script which creates all the
> tables, loads all the data, runs the analysis, calculates the results,
> etc.

The queries are shared, but without the data, sharing the rest is of
little use in my opinion.

regards,

fabio pardi
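P.S. About 'json_to_record' and friends: just to check that I understood
your suggestion correctly, I imagine it would let us pull typed columns
out of our documents with something along these lines (the table and
field names here are invented for the sake of the example, they are not
our real structure):

SELECT *
  FROM json_to_record('{"patient_id": 42, "taken_at": "2018-11-20 14:34:00", "value": 7.1}')
       AS r(patient_id int, taken_at timestamptz, value numeric);

-- or, against a hypothetical table "events" with a jsonb column "doc":
SELECT r.*
  FROM events e,
       jsonb_to_record(e.doc) AS r(patient_id int, taken_at timestamptz, value numeric);

If that is the idea, it would indeed save us a step should we decide to
normalize (part of) the data.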
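P.P.S. About generating a sharable dataset: if we find the time to redo
the tests, I imagine producing random documents with a structure vaguely
resembling ours with something along these lines (again, field names,
types and distributions are made up just to give the idea; our real
documents are nested and have many more fields):

CREATE TABLE test_data (doc jsonb);

INSERT INTO test_data (doc)
SELECT jsonb_build_object(
         'patient_id', (random() * 100000)::int,
         'taken_at',   now() - (random() * interval '365 days'),
         'type',       (ARRAY['blood_pressure','heart_rate','temperature'])[1 + floor(random() * 3)::int],
         'value',      round((random() * 200)::numeric, 1)
       )
  FROM generate_series(1, 1000000);  -- arbitrary number of documents

Skewed value frequencies could be mimicked with CASE expressions over
random(), so that the generated set aggregates roughly like the real
data.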