Originally, S3 Select only supported CSV/JSON input, optionally compressed. Due to overwhelming customer demand, support for Parquet was added in very short order. The main projects I'm aware of that support S3 Select are the S3A filesystem client (used by many big data tools), Presto, and Spark:

https://issues.apache.org/jira/browse/HADOOP-15364
https://prestodb.github.io/docs/current/connector/hive.html
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-s3select.html
https://github.com/minio/spark-select

Even if a tool leverages S3A, the underlying engine still needs to know how to do projection and predicate pushdowns; Spark falls into this category. I could also see S3 Select being useful for lighter-weight applications, perhaps knative functions or similar?

The CSV/JSON/Parquet files are usually part of a larger database, often with the schema stored in a Hive metastore, at least for the data warehousing use case. Tables can be partitioned, and each partition can have any number of files. Database engines typically have cost-based optimizers that use statistics about tables and partitions to read only the files that are relevant. Perhaps you partition by date and timestamp and your query is only trying to determine the sales for last month (a form of predicate pushdown, via partition pruning).

With a columnar format like Parquet, the data is striped into row groups, and each row group stores the columns together; the Parquet metadata keeps information about offsets, the number of rows, etc. So the database engine can do further predicate pushdowns by eliminating unnecessary row groups, and projection pushdowns by eliminating unnecessary columns.

The way this works absent S3 Select is that the database engine does a ranged GET for the Parquet metadata, then range GET requests for the columns of the relevant row groups (first sketch below). We've seen that this is kind of annoying for RGW, because most engines rudely send an open-ended range request from the starting offset and then slam the connection closed once they've got what they want (perhaps because the metadata only contains starting offsets and not ending offsets). Basically, RGW is busily requesting chunks of the range-requested object from RADOS only to throw some of them away because the client closed the connection. It's not clear to me whether there is a way to push this down into RADOS with object classes, since the files we're acting on are likely going to cross a striping boundary. Especially since the metadata-to-data overhead is going to be obviously worse for small files.

Now, with S3 Select, instead of having each engine's Parquet reader do its own predicate/projection pushdowns by examining the file metadata and only reading the necessary ranges of a particular object, the engine can skip the Parquet reader and simply send a select statement expressing those pushdowns to the object store (second sketch below). That means our strategies for processing Parquet files should be informed by those used by the Parquet readers that have been developed in the different database engines. This blog post provides some inspiration on various optimizations:

https://eng.uber.com/presto/
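To make the existing access pattern concrete, here is a minimal sketch (Python with boto3, not any engine's actual reader) of the ranged-GET sequence a Parquet reader performs when S3 Select is not available. The bucket/key names and the column chunk offsets are made up for illustration.

    # Sketch of the ranged-GET pattern a Parquet reader uses against S3/RGW
    # when S3 Select is not available. Bucket/key are hypothetical.
    import struct
    import boto3

    s3 = boto3.client("s3")
    bucket, key = "datalake", "sales/part-0.parquet"

    size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]

    # A Parquet file ends with: <footer (Thrift FileMetaData)>, a 4-byte
    # little-endian footer length, and the magic bytes "PAR1".
    tail = s3.get_object(Bucket=bucket, Key=key,
                         Range=f"bytes={size - 8}-{size - 1}")["Body"].read()
    footer_len = struct.unpack("<I", tail[:4])[0]
    assert tail[4:] == b"PAR1"

    # Second ranged GET for the footer itself; a real reader would
    # Thrift-decode it to learn row-group and column-chunk offsets/sizes.
    footer = s3.get_object(
        Bucket=bucket, Key=key,
        Range=f"bytes={size - 8 - footer_len}-{size - 9}")["Body"].read()

    # With the metadata decoded, the reader issues one ranged GET per needed
    # column chunk in each surviving row group. Offsets here are placeholders
    # standing in for values taken from the decoded footer.
    col_start, col_end = 4, 1024
    chunk = s3.get_object(Bucket=bucket, Key=key,
                          Range=f"bytes={col_start}-{col_end}")["Body"].read()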
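And here is roughly what the same projection/predicate pushdown looks like when expressed as an S3 Select request against AWS S3, i.e. the API surface RGW would need to provide. Again the bucket, key, and column names are hypothetical.

    # Sketch of the same pushdown expressed as an S3 Select request.
    import boto3

    s3 = boto3.client("s3")

    resp = s3.select_object_content(
        Bucket="datalake",
        Key="sales/part-0.parquet",
        ExpressionType="SQL",
        # Projection (two columns) and predicate (one region) are applied by
        # the object store instead of client-side by the Parquet reader.
        Expression="SELECT s.customer_id, s.amount FROM S3Object s "
                   "WHERE s.region = 'EMEA'",
        InputSerialization={"Parquet": {}},
        OutputSerialization={"CSV": {}},
    )

    # The response is an event stream; Records events carry the filtered rows.
    for event in resp["Payload"]:
        if "Records" in event:
            print(event["Records"]["Payload"].decode())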
On Mon, Jun 3, 2019 at 3:05 PM Yehuda Sadeh-Weinraub <yehuda@xxxxxxxxxx> wrote:
>
> Hi Karan, Kyle, ceph-devel,
>
> I'm looking into a potential implementation of s3 select, and trying
> to gather some information about current use of this feature. Karan,
> is there any specific use case that you have in mind?
> Anyone else that has any experience with this feature and what users
> expect exactly from it please feel free to chime in. The different
> directions we can take implementing it vary a lot, and there are
> likely different trade offs that we need to consider. Any light shed
> into it could be really useful.
>
> Thanks,
> Yehuda