Hello Michael,
Thanks a lot for reaching out and for providing detailed feedback on the report. All of these comments will help in further augmenting/updating our report. We are definitely interested in learning more about Calc and also sharing our ideas on DB+Spreadsheet
integration. I am currently traveling and will send you a detailed response CCing the entire DataSpread team. We are also cleaning up the repo and the macros and will definitely share those ASAP. There are a number of valid concerns that have been raised and
need further examination---thanks for pointing those out.
We didn't even get a response from one of the other tools compared in the report when we reached out to them. So many thanks for taking our effort seriously and providing important details about Calc.
Regards
From: Michael Meeks <michael.meeks@xxxxxxxxxxxxx>
Sent: Saturday, December 7, 2019 12:05 PM To: Rahman, Sajjadur <srahman7@xxxxxxxxxxxx> Cc: Kohei Yoshida <kohei@xxxxxxxxxxxxxxx>; Dennis Francis <dennis.francis@xxxxxxxxxxxxx>; Lubos Lunak <l.lunak@xxxxxxxxxxxxx>; libreoffice-dev <libreoffice@xxxxxxxxxxxxxxxxxxxxx> Subject: benchmark of Excel, Calc, Google Docs Hi Sajjadur & team,
Great to connect on twitter; I noticed: https://blog.acolyer.org/2019/12/06/benchmarking-spreadsheet-systemsand your paper: https://people.eecs.berkeley.edu/~adityagp/papers/spreadsheet_bench.pdf I was interested in a number of things: particularly whether we can get your test sheets / macros so that we can run the tests under a profiler & of course see what stupid things jump out that we can fix =) I'd also like to query: "5.2 In memory Data Layout" "For random data access, we randomly select a row and then get the value of cell corresponding to column A within that row. We used three different row ranges of the Value-only dataset: 100k, 300k, and 500k. If spreadsheets use a columnar layout, the sequential access would be much faster than random access due to cache locality." We do have a columnular layout; checkout: https://gerrit.libreoffice.org/plugins/gitiles/core/+/master/sc/inc/column.hxx#111 // Cell values. sc::CellStoreType maCells; is an MDDS cell store - which is fairly well optimized for the test you're doing here jumping down rows. We're missing a final re-work there to make it log(log(N)) lookup with an interpolating search but ... assuming you have reasonably uniform, contiguous data types down a column your test shouldn't have concluded: "Therefore, none of the systems utilize any intelligent in-memory layout to speed up data access. Takeaway: Spreadsheet systems do not employ a columnar data layout to improve computational (e.g., aggregation) performance" I fact we use our columnar model to do SSE optimization of long columnar sums and optimize cache locality for other formulae etc. In contrast my expectation (from previous Excel binary data formats) is that Excel employs a row-based data storage. Of course we build on MDDS - which provides some nice building blocks for speadsheet re-use: https://gitlab.com/mdds/mdds Your general comments on missing [Global] Common Sub-_expression_ elimination in spreadsheets seem fair, and something that should be looked at as we improve our representations. For incremental updates - re-computation is frequently chosen over more smarts since correctness is a far more dominating concern than performance generally, and there are plenty of know performance optimizations that can be done before we try to complicate things further. Also the assumption that after deleting A100: SUM(A1:A100) - A100 === SUM(A1:A99) falls foul of potential precision problems. Then again so does adding the numbers in a different order even at FP64 (cf. OpenCL) - and of course there are lower hanging fruit than this right now. The idea of converting to SQL queries is an interesting one but I find it very hard to believe it would provide any performance advantage at the same memory footprint. Furthermore - I'd be interested to know how you do other spreadsheet operations: row & column insertion, addressing, and dependency work on top of a SQL database with any efficiency. It is also worth bearing in mind that dependency management and tracking of what to update when something changes consumes a far larger proportion of a spreadsheet than is commonly expected: what to re-calculate having changed A1 is far from trivial for large, twisty real-world sheets. Furthermore another big chunk of spreadsheet authors time is spent considering and handling all of the legacy mis-features such as what to do when you add an error to a bool, or a string containing a number that shows up in your sheet in an unwelcome way =) Anyhow - would be happy to have a chat with your team at some stage if you're interested in helping us to improving things here: a good start would be just getting more representative benchmarks for the workload you're interested in would be really useful. And finally of course, you used a really old version. I'd recommend using the 6.4 snapshots, or 6.3 if you must. https://gerrit.libreoffice.org/plugins/gitiles/core/+/46d0afba738d8ee7c9b63384fef513f42ee587f3 https://gerrit.libreoffice.org/plugins/gitiles/core/+/845e1cdca3349c72e3083186502285d5b776abbe ... And lots of others. Indeed, we have a customer who has large ~100k+ row spreadsheets with complex sorts across many rows who claims that Calc sorts the data in second to minutes vs. Excels' hours - so I was surprised to see your sort results; would be good to inspect what you do. I hope that helps, would love to get in touch with you & your team, there is plenty of fun stuff to do here to make things quicker & better =) All the best, Michael. -- michael.meeks@xxxxxxxxxxxxx <><, GM Collabora Productivity Hangout: mejmeeks@xxxxxxxxx, Skype: mmeeks (M) +44 7795 666 147 - timezone usually UK / Europe |
_______________________________________________ LibreOffice mailing list LibreOffice@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/libreoffice