Re: Support for the Apache Parquet file format

> the benefits that this format brings

Its column-based layout means that a query can read just the columns it needs instead of loading the full file.

More can be found at [1]
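
To make the "without loading the full file" point concrete, here is a minimal toy sketch in Python. It is NOT the Parquet format and does not use any Parquet library; it only illustrates the general idea of a column-oriented layout, where each column is stored contiguously and a small index at the end of the file says where each column lives. The function names (write_columnar, read_column, demo) and the JSON encoding are assumptions made up for this illustration.

```python
import json
import os
import struct
import tempfile


def write_columnar(path, columns):
    """Toy writer: store each column as one contiguous chunk, then a footer index."""
    index = {}
    with open(path, "wb") as f:
        for name, values in columns.items():
            chunk = json.dumps(values).encode("utf-8")
            # Remember where this column starts and how many bytes it takes.
            index[name] = (f.tell(), len(chunk))
            f.write(chunk)
        footer = json.dumps(index).encode("utf-8")
        f.write(footer)
        # Last 4 bytes hold the footer length so a reader can locate the index.
        f.write(struct.pack("<I", len(footer)))


def read_column(path, name):
    """Toy reader: fetch a single column by seeking; other columns are never read."""
    with open(path, "rb") as f:
        f.seek(-4, os.SEEK_END)
        (footer_len,) = struct.unpack("<I", f.read(4))
        f.seek(-(4 + footer_len), os.SEEK_END)
        index = json.loads(f.read(footer_len))
        offset, length = index[name]
        f.seek(offset)
        return json.loads(f.read(length))


def demo():
    """Write three hypothetical columns, then read back only the 'cost' column."""
    columns = {
        "id": [1, 2, 3],
        "region": ["us-east-1", "eu-west-1", "us-west-2"],
        "cost": [0.5, 1.25, 2.0],
    }
    path = os.path.join(tempfile.mkdtemp(), "toy.col")
    write_columnar(path, columns)
    return read_column(path, "cost")


if __name__ == "__main__":
    print(demo())
```

With a row-based format like CSV, getting the "cost" values would mean scanning every line of the file. Here the reader touches only the footer and the bytes of the one column it was asked for.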

I see 2 distinct advantages:

1. Convenience: sometimes I build a programmatic process that spits out a bunch of Parquet files, then I query them with AWS Athena or Apache Drill. If I want to peek into a Parquet file, I have to either write a pandas script or open it with VisiData. If I find a problem with the file, I need to go back to the process that generated it, modify it, and regenerate the file (or write a script specifically for editing the file). If I could just double-click, edit, and save, it would be so much easier. That's a major advantage of CSV, despite its inefficiency in query/filtering performance.

2. Performance: On the other hand, a spreadsheet editor might not be designed to exploit this column-based format for better efficiency. It's expected to open the whole file anyway. Maybe filtering the worksheet with parquet would be faster than with CSV, but that depends on how the filtering is implemented. I have no idea how it's done in Calc or other editors. We all know the dread of opening a large file in a spreadsheet editor. But then again, maybe that's when the data should be moved into a database rather than stay in a heavy spreadsheet.


> https://github.com/apache/parquet-format

> The best place to learn about the specifics of this file format

Yes, that's it. I don't want to sound self-contradictory, but maybe it's NOT a good idea to support Parquet. I was just bringing it up, and maybe this needs some more thought about how useful it would be and whether people would actually use it. Chicken-and-egg problem?


Links:

[1] https://stackoverflow.com/questions/36822224/what-are-the-pros-and-cons-of-parquet-format-compared-to-other-formats

Shadi Akiki
Founder & CEO, AutofitCloud
https://autofitcloud.com/
+1 813 579 4935

On 11/22/19 2:42 PM, Kohei Yoshida wrote:
On 22.11.2019 02:37, Shadi Akiki wrote:

> I'm wondering why Parquet is not yet a supported format in LibreOffice
> Calc (and most desktop worksheet processing tools for that matter).

Well, one reason may be that nobody has asked for it yet! On that note, asking about it and raising awareness (which you did) is a necessary first step.

Also, it would be nice to know the benefits that this format brings that any other existing formats currently do not.  I use pandas occasionally and I do work with people who use it on a regular basis, but I had not heard this file format mentioned in our conversations to this day.

Is this page

https://github.com/apache/parquet-format

the best place to learn about the specifics of this file format, or is there any other page that provides more details?

One way we can add support for a new file format such as this one to Calc is to add it to the orcus library [1], which Calc uses internally to handle a subset of file formats.  That may potentially be a much easier route than adding it to the LibreOffice code base directly... Full disclosure: I do maintain this library.

Kohei

[1] https://gitlab.com/orcus/orcus

_______________________________________________
LibreOffice mailing list
LibreOffice@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/libreoffice



