Hi Aditya,

On 08/12/2019 23:14, Aditya Parameswaran wrote:
> I'm Aditya Parameswaran, an assistant professor at UC Berkeley. Along
> with Prof. Karrie Karahalios at the University of Illinois and many
> Ph.D. student researchers, we've been working on developing a scalable
> spreadsheet system, DataSpread (http://dataspread.github.io), for about
> half a decade now.

Interesting stuff.

> We'd be very keen to collaborate to see if some of the ideas that we've
> developed and opportunities we've identified would make sense in Calc.

Sounds good. Very busy this week, but would you be up for a conference
call (with whoever is interested) sometime in the evening (UK time) of
the 17th or 19th? We could use e.g. https://meet.jit.si/CalcChat

> Our ultimate aim is to percolate some of these ideas back into popular
> spreadsheet systems like Calc, so I'm excited to have this opportunity.

Great. Some good ideas to include there, only a chunk of typing is
required =)

> Yes, of course. Sajjadur, with Kelly's help, is looking into packaging
> this and sending it your way.

Excellent; thanks.

> So I am not sure why we concluded outright that none of the spreadsheet
> systems employ a columnar layout -- this is a good catch; we will fix.

=)

> That said, looking at Figure 10, it is surprising that the gains for the
> sequential read are not a lot more; and the gains should increase
> proportionally. So something funky is going on. Worth investigating.

Ah - well ... so ;-) as I said, it depends on your data-set and, to a
degree, its type homogeneity down the column; we can also improve our
lookup algorithm there.

> We started by having the relational database be a simple persistent
> storage layer, when coupled with an index to retrieve data by position,
> can allow us to scroll through large datasets of billions of rows at
> ease. We developed a new positional index to handle insertions and
> deletions in O(log(n)) -- https://arxiv.org/pdf/1708.06712.pdf. I agree
> that pushing the computation to the relational database does have
> overheads; but at the same time, it allows for scaling to arbitrarily
> large datasets.

Ooh - nice paper. Your crawled data-set looks quite interesting too: we
run wide-scale crash-testing on the LibreOffice code-base across ~100k
files, and enlarging our corpus there, or better, getting some
statistical view of which OOXML attributes (and thus features) are most
used out there, would be extremely useful to us as we develop the core.

I like the data on spreadsheet and formula shape - that is very useful.
Do you have data on the geometry of formulae - as in rows vs. columns?
[ we switched to columnar storage based mostly on experience rather
than hard data ;-].

It is also interesting to have access to very large (1.3m row)
data-sets that can have useful analysis done on them - would love to
see the source data there.

> Would love to chat and see if any of the work that we're doing can
> translate into Calc, and how we can contribute.

Great.

> One other project that may be of interest is one where we're trying to
> build a spreadsheet summarization and navigation tool, which can be
> especially helpful on very large
> spreadsheets. http://srahman7.web.engr.illinois.edu/papers/NOAH.pdf

Sounds good too. Of course, most useful on the huge corpus of existing
sheets out there in XLS[X] / ODS format.

> Agreed. We started the benchmarking effort a couple years ago, and the
> old version was the new version back then :-)

Heh ;-)

> Again, happy to share what we know! Let's find a time to chat. I see
> that you're in Europe, so mornings for us (PT/CT) may work better?
> Sajjadur is traveling, so I'm not entirely sure if he's around, but I
> should be able to find time to chat early in the morning any day next week.

Sounds good, cf. above - if we can't make that - early in the new year
would be great.

I look forward to talking,

	Michael.

--
michael.meeks@xxxxxxxxxxxxx <><, GM Collabora Productivity
Hangout: mejmeeks@xxxxxxxxx, Skype: mmeeks
(M) +44 7795 666 147 - timezone usually UK / Europe