Re: Fw: benchmark of Excel, Calc, Google Docs

Wol <antlists@xxxxxxxxxxxxxxx> · Wed, 11 Dec 2019 02:30:06 +0000

On 10/12/2019 17:30, Aditya Parameswaran wrote:
Wols,

Thanks for sharing your concerns, replete with quotes from Einstein :-) !

I believe I share those concerns... but there are of course some 
liberties we can take as an academic group that you folks managing a 
popular spreadsheet tool cannot take (e.g., doing a proof of concept as 
opposed to a robust implementation.)   From a research/academic 
standpoint it is valuable to note that something is possible, even if 
the solution is not ideal given other pragmatic considerations.  BTW, I 
don't believe that anything we're doing *requires* a relational database 
-- a NoSQL setup would work just fine.

Bear in mind I'm not a regular developer. My name is on the list of 
credits, and I follow the mailing list, but I'm more one of those 
annoying people who moans about it. However, as they say, "you want 
customers who moan, because they're the people who want your product to 
improve. It's the people who leave without moaning that you should be 
concerned about".

I'd be happy to discuss more.  Our goal is to understand the stumbling 
blocks in translating our work into spreadsheet tools such as Calc and 
see how we can best help with what we've learned.

I'm more in to proving that relational databases are basically rubbish 
(as is the mathematical justification behind them) :-) That said, the 
maths itself is good, provided you don't use it to build a real-world 
database engine!

Okay, to demolish RDBMSs :-) I don't know which rule it is (no 1?) that 
says "Data comes in rows and columns". So your first job in building 
your relational application is to call in a data analyst, to whom you 
give the job of bashing square pegs in to round holes...

To me, data is what the user gives you. metadata is what the analyst 
deduces about the data. AND YOU DON'T MIX THE TWO!!!

Given that an RDBMS stores data as sets, and a lot of data comes as 
lists (which contain a lot of metadata) the only way to store a list in 
an RDBMS is to muddle data and metadata. Put mathematically, if you 
store a list when you want a set, you can just throw away the 
information you don't want (or need). However, once you've thrown that 
information away (ie the order of the items) so you can store your list 
in a set-based RDBMS, *you can't get it back*!

My database of choice is Pick, where the basic unit of storage is the 
object, not the row. And because an object is an n-dimensional array, 
rather than the 1-dimensional row, I can store an entire relational view 
in it. (Which is what you should do, relational maths is okay, RDBMSs 
aren't :-) And because I have an n-dimensional array, I can store lists 
of lists :-)

The problem I have at this point is that I'm now discussing the merit of 
databases. I'm not actually helping you "translate your work into 
spreadsheet tools". It's easy for me to extract a two-dimensional table 
from Pick and, truth be told, I can probably extract it faster 
(including massaging it into shape) from Pick than selecting it from an 
RDBMS! But unless we can find out some way of displaying a 3- or 
4-dimensional array (a relational view) in a spreadsheet that isn't 
horribly confusing, I don't really know what to suggest other than to 
use a SQL query as the layer between the database and the spreadsheet. 
The database may be n-dimensional, but if your spreadsheet only has two, 
you might as well use a 2-dimensional query and let the database handle 
the complexity. It's not that hard.

(I have - personally - managed a ram-starved 32-user Pick system that 
provided perfectly acceptable user response times despite thrashing like 
mad...)

(Oh, and as for pushing calculation into the database, Pick has had 
virtual fields for ages. I can define a field as being a calculation eg 
VAT = PRICE * 20%, so I can "store price, retrieve vat" and it's just 
automagical :-) (I get the impression most users are unaware of similar 
capabilities in RDBMSs)

Cheers,
Wol

Cheers,
Aditya

On Mon, Dec 9, 2019 at 3:37 PM Wols Lists <antlists@xxxxxxxxxxxxxxx 
<mailto:antlists@xxxxxxxxxxxxxxx>> wrote:

    On 09/12/19 19:14, Aditya Parameswaran wrote:
     >            The idea of converting to SQL queries is an
    interesting one
     >     but I find
     >     it very hard to believe it would provide any performance
    advantage at
     >     the same memory footprint. Furthermore - I'd be interested to
    know how
     >     you do other spreadsheet operations: row & column insertion,
    addressing,
     >     and dependency work on top of a SQL database with any efficiency.
     >
     >
     > We started by having the relational database be a simple persistent
     > storage layer, when coupled with an index to retrieve data by
    position,
     > can allow us to scroll through large datasets of billions of rows at
     > ease. We developed a new positional index to handle insertions and
     > deletions in O(log(n)) -- https://arxiv.org/pdf/1708.06712.pdf. I
    agree
     > that pushing the computation to the relational database does have
     > overheads; but at the same time, it allows for scaling to arbitrarily
     > large datasets.

    "the quickest way to optimise database access is to ditch first normal
    form".

    A provocative statement I know, but I'm very much in the NoSQL camp. I
    can hunt up the details of a face-off between Oracle and Cache, where
    Oracle had to "cheat" to achieve 100K tpm (things like deferring index
    updates) whereas Cache blasted through 250K barely breaking a sweat ...
    (or it might well have been tps)

    The maths supports this ...

    That said, a spreadsheet is inherently first normal formal, so tying a
    spreadsheet and a relational database together MAY make sense.

    In general though, Einstein said "make things as simple as possible BUT
    NO SIMPLER". Relational oversimplifies the database side, which means
    the application side is over-complex in order to compensate. (Which is
    why Cache blew Oracle out of the water.)

    I'm quite happy to wax lyrical, but I'd rather not preach to an audience
    who aren't at least interested. Feel free to ask me to continue on list,
    or contact me privately, and I'll try to prove everything as
    mathematically as I can :-)

     > but at the same time, it allows for scaling to arbitrarily
     > large datasets.

    At the price of massive unnecessary complexity.

    Cheers,
    Wol

_______________________________________________
LibreOffice mailing list
LibreOffice@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/libreoffice

_______________________________________________
LibreOffice mailing list
LibreOffice@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/libreoffice