Originally, S3 Select only supported CSV/JSON input, optionally compressed. Due to overwhelming customer demand, support for Parquet was added in very short order. The main projects I'm aware of that support S3 Select are the S3A filesystem client (used by many big data tools), Presto, and Spark:

https://issues.apache.org/jira/browse/HADOOP-15364
https://prestodb.github.io/docs/current/connector/hive.html
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-s3select.html
https://github.com/minio/spark-select

Even if a tool leverages S3A, the underlying engine still needs to know how to do projection and predicate pushdowns; Spark falls into this category. I could also see S3 Select being useful for lighter-weight applications, perhaps knative functions or similar?

The CSV/JSON/Parquet files are usually part of a larger database, often with the schema stored in a Hive metastore, at least for the data warehousing use case. Tables can be partitioned, and each partition can have any number of files. Database engines typically have cost-based optimizers that use statistics about tables and partitions to read only the files that are relevant. Perhaps you partition by date and timestamp and your query is only trying to determine the sales for last month (a form of predicate pushdown, via partition pruning).

With a columnar format like Parquet, the data is striped into row groups, and each row group stores the columns together; the Parquet metadata keeps information about offsets, the number of rows, etc. So the database engine can do further predicate pushdowns by eliminating unnecessary row groups, and projection pushdowns by eliminating unnecessary columns.

The way this works absent S3 Select is that the database engine does a ranged GET for the Parquet metadata, then range GET requests for the columns of the relevant row groups (first sketch below). We've seen that this is kind of annoying for RGW, because most engines rudely send an open-ended range request from the starting offset and then slam the connection closed once they've got what they want (perhaps because the metadata only contains starting offsets and not ending offsets). Basically, RGW is busily requesting chunks of the range-requested object from RADOS only to throw some of them away because the client closed the connection. It's not clear to me whether there is a way to push this down into RADOS with object classes, since the files we're acting on are likely going to cross a striping boundary. Especially since the metadata-to-data overhead is going to be obviously worse for small files.

Now, with S3 Select, instead of having each engine's Parquet reader do its own predicate/projection pushdowns by examining the file metadata and only reading the necessary ranges of a particular object, the engine can skip the Parquet reader and simply send a select statement expressing those pushdowns to the object store (second sketch below). That means our strategies for processing Parquet files should be informed by those used by the Parquet readers that have been developed in the different database engines. This blog post provides some inspiration on various optimizations:

https://eng.uber.com/presto/
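To make the existing access pattern concrete, here is a minimal sketch (Python with boto3, not any engine's actual reader) of the ranged-GET sequence a Parquet reader performs when S3 Select is not available. The bucket/key names and the column chunk offsets are made up for illustration.

    # Sketch of the ranged-GET pattern a Parquet reader uses against S3/RGW
    # when S3 Select is not available. Bucket/key are hypothetical.
    import struct
    import boto3

    s3 = boto3.client("s3")
    bucket, key = "datalake", "sales/part-0.parquet"

    size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]

    # A Parquet file ends with: <footer (Thrift FileMetaData)>, a 4-byte
    # little-endian footer length, and the magic bytes "PAR1".
    tail = s3.get_object(Bucket=bucket, Key=key,
                         Range=f"bytes={size - 8}-{size - 1}")["Body"].read()
    footer_len = struct.unpack("<I", tail[:4])[0]
    assert tail[4:] == b"PAR1"

    # Second ranged GET for the footer itself; a real reader would
    # Thrift-decode it to learn row-group and column-chunk offsets/sizes.
    footer = s3.get_object(
        Bucket=bucket, Key=key,
        Range=f"bytes={size - 8 - footer_len}-{size - 9}")["Body"].read()

    # With the metadata decoded, the reader issues one ranged GET per needed
    # column chunk in each surviving row group. Offsets here are placeholders
    # standing in for values taken from the decoded footer.
    col_start, col_end = 4, 1024
    chunk = s3.get_object(Bucket=bucket, Key=key,
                          Range=f"bytes={col_start}-{col_end}")["Body"].read()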
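And here is roughly what the same projection/predicate pushdown looks like when expressed as an S3 Select request against AWS S3, i.e. the API surface RGW would need to provide. Again the bucket, key, and column names are hypothetical.

    # Sketch of the same pushdown expressed as an S3 Select request.
    import boto3

    s3 = boto3.client("s3")

    resp = s3.select_object_content(
        Bucket="datalake",
        Key="sales/part-0.parquet",
        ExpressionType="SQL",
        # Projection (two columns) and predicate (one region) are applied by
        # the object store instead of client-side by the Parquet reader.
        Expression="SELECT s.customer_id, s.amount FROM S3Object s "
                   "WHERE s.region = 'EMEA'",
        InputSerialization={"Parquet": {}},
        OutputSerialization={"CSV": {}},
    )

    # The response is an event stream; Records events carry the filtered rows.
    for event in resp["Payload"]:
        if "Records" in event:
            print(event["Records"]["Payload"].decode())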
On Mon, Jun 3, 2019 at 3:05 PM Yehuda Sadeh-Weinraub <yehuda@xxxxxxxxxx> wrote:
>
> Hi Karan, Kyle, ceph-devel,
>
> I'm looking into a potential implementation of s3 select, and trying
> to gather some information about current use of this feature. Karan,
> is there any specific use case that you have in mind?
> Anyone else that has any experience with this feature and what users
> expect exactly from it please feel free to chime in. The different
> directions we can take implementing it vary a lot, and there are
> likely different trade offs that we need to consider. Any light shed
> into it could be really useful.
>
> Thanks,
> Yehuda