Re: Fw: benchmark of Excel, Calc, Google Docs

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Michael,

Great to connect with you.  Sajjadur forwarded your email to me.  

Just a bit of context about the work that you're referring to -- 

I'm Aditya Parameswaran, an assistant professor at UC Berkeley.  Along with Prof. Karrie Karahalios at the University of Illinois and many Ph.D. student researchers, we've been working on developing a scalable spreadsheet system, DataSpread (http://dataspread.github.io), for about half a decade now. We've published a number of papers on this topic as well as developed a prototype system.  Sajjadur led the effort on benchmarking that you referenced in your email and a navigation plugin (more on that in a bit), while Mangesh (also cc-ed, when he was at the University of Illinois) led the efforts to integrate spreadsheets and databases and to build a more interactive (asynchronous) formula execution engine when he was at Illinois.  

We'd be very keen to collaborate to see if some of the ideas that we've developed and opportunities we've identified would make sense in Calc.  Our ultimate aim is to percolate some of these ideas back into popular spreadsheet systems like Calc, so I'm excited to have this opportunity.

Answers to some of your questions in-line below:

        I was interested in a number of things: particularly whether we can get
your test sheets / macros so that we can run the tests under a profiler
& of course see what stupid things jump out that we can fix =)

Yes, of course. Sajjadur, with Kelly's help, is looking into packaging this and sending it your way.

        We do have a columnular layout; checkout:

https://gerrit.libreoffice.org/plugins/gitiles/core/+/master/sc/inc/column.hxx#111

... assuming you have reasonably uniform, contiguous data types down a
column your test shouldn't have concluded:

        "Takeaway: Spreadsheet systems do not employ a columnar
data layout to improve computational (e.g., aggregation) performance"

Looking at Figure 10 in our preprint,  Calc does seem to be somewhat faster for sequential read than random read at least on larger sizes. 

So I am not sure why we concluded outright that none of the spreadsheet systems employ a columnar layout -- this is a good catch; we will fix.

That said, looking at Figure 10, it is surprising that the gains for the sequential read are not a lot more;  and the gains should increase proportionally.  So something funky is going on. Worth investigating. 

 
        For incremental updates - re-computation is frequently chosen over more
smarts  since correctness is a far more dominating concern than
performance generally, and there are plenty of know performance
optimizations that can be done before we try to complicate things
further. Also the assumption that after deleting A100:

        SUM(A1:A100) - A100 === SUM(A1:A99)

        falls foul of potential precision problems. 

Makes sense. 
 

        The idea of converting to SQL queries is an interesting one but I find
it very hard to believe it would provide any performance advantage at
the same memory footprint. Furthermore - I'd be interested to know how
you do other spreadsheet operations: row & column insertion, addressing,
and dependency work on top of a SQL database with any efficiency.

We started by having the relational database be a simple persistent storage layer, when coupled with an index to retrieve data by position, can allow us to scroll through large datasets of billions of rows at ease. We developed a new positional index to handle insertions and deletions in O(log(n)) -- https://arxiv.org/pdf/1708.06712.pdf. I agree that pushing the computation to the relational database does have overheads; but at the same time, it allows for scaling to arbitrarily large datasets. 

Happy to explain this on the phone. 
 
        It is also worth bearing in mind that dependency management and
tracking of what to update when something changes consumes a far larger
proportion of a spreadsheet than is commonly expected: what to
re-calculate having changed A1 is far from trivial for large, twisty
real-world sheets.

Ah, for this, we have a new asynchronous formula computation capability (led by Mangesh) that looks something like this: https://twitter.com/adityagp/status/1146105708024229888?s=20 (gif in tweet).  When a change is made, we use a compact dependency maintenance structure to identify dependent cells that are hidden away with a progress bar, and then compute them in the background. 

        Anyhow - would be happy to have a chat with your team at some stage if
you're interested in helping us to improving things here: a good start
would be just getting more representative benchmarks for the workload
you're interested in would be really useful.

Would love to chat and see if any of the work that we're doing can translate into Calc, and how we can contribute. 

One other project that may be of interest is one where we're trying to build a spreadsheet summarization and navigation tool, which can be especially helpful on very large spreadsheets.  http://srahman7.web.engr.illinois.edu/papers/NOAH.pdf
 

        And finally of course, you used a really old version. I'd recommend
using the 6.4 snapshots, or 6.3 if you must.

https://gerrit.libreoffice.org/plugins/gitiles/core/+/46d0afba738d8ee7c9b63384fef513f42ee587f3
https://gerrit.libreoffice.org/plugins/gitiles/core/+/845e1cdca3349c72e3083186502285d5b776abbe
...
 
Agreed. We started the benchmarking effort a couple years ago, and the old version was the new version back then :-) 

 
        And lots of others. Indeed, we have a customer who has large ~100k+ row
spreadsheets with complex sorts across many rows who claims that Calc
sorts the data in second to minutes vs. Excels' hours - so I was
surprised to see your sort results; would be good to inspect what you do.
        I hope that helps, would love to get in touch with you & your team,
there is plenty of fun stuff to do here to make things quicker & better =)

Again, happy to share what we know!  Let's find a time to chat.  I see that you're in Europe, so mornings for us (PT/CT) may work better?  Sajjadur is traveling, so I'm not entirely sure if he's around, but I should be able to find time to chat early in the morning any day next week. 

Aditya
 
_______________________________________________
LibreOffice mailing list
LibreOffice@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/libreoffice

[Index of Archives]     [LARTC]     [Bugtraq]     [Yosemite Forum]     [Photo]

  Powered by Linux