Creating table over written parquet file in Athena

I am creating a parquet file from a CSV file using the following python script:

import pandas as pd import pyarrow as pa import pyarrow.parquet as pq csv_file = '~/Desktop/SWA_UK_Pickup_Forecast_HOURLY_M1_at_2017-11-28-04_20_21-UTC_from_28-Nov-2017_to_28-Nov-2017.csv' chunksize = 10 csv_stream = pd.read_csv(csv_file, sep=',', chunksize=chunksize, low_memory=False) for i, chunk in enumerate(csv_stream): if i == 0: parquet_schema = pa.Table.from_pandas(df=chunk).schema parquet_writer = pq.ParquetWriter('/Users/kohujwal/Desktop/parquet_forecast_file.parquet', parquet_schema, compression='snappy') table = pa.Table.from_pandas(chunk, schema=parquet_schema); parquet_writer.write_table(table) parquet_writer.close();

Then I am manually uploading it to some s3 location, and creating a table over it in Athena. The query used to create the table:

CREATE EXTERNAL TABLE IF NOT EXISTS forecast_report_lom_parquet ( `forecast_week` int, `for_date` string, `forecast_day_of_week` string, `merchant_id` string, `shipper_account_id` string, `shipper_name` string, `node_id` string, `routing_preference` string, `forecast_shipment_count` int, `forecast_pallet_count` int, `forecast_volume` double ) STORED AS PARQUET LOCATION 's3://prevoir-athena-input-prod/spike-computation/forecast-reports/parquet/live-order-model/' tblproperties ("parquet.compress"="SNAPPY")

The table is created successfully, however when I try to query the table, it doesn't display the contents. It just shows the indexes of the rows and the column names. Here is the CSV and the parquet file. The query on the parquet table :

SELECT * FROM prevoir_prod.forecast_report_lom_parquet

Results that I am seeing on the console.

Can anyone point out what's going wrong here?

I saw this behavior in case of partitioned table. After create table DDL you need to run: MSCK REPAIR TABLE forecast_report_lom_parquet ; I wonder if regular table can also be helped by "REPAIR TABLE" ? aws.amazon.com/premiumsupport/knowledge-center/…
– dovka
Dec 21 '17 at 13:40

1 Answer
1

Your results show row numbers. This means that Athena is finding the file(s) on S3, and is parsing them to the point of identifying rows.

It seems that Athena (or more precisely, the ParquetSerDe) isn't able to get columns from your file. This points to a mismatch between the CREATE EXTERNAL TABLE statement and the actual file. Some possibilities:

CREATE EXTERNAL TABLE

Whenever I have a file that I can't create a table across, I'll use the Glue crawler to inspect and build the create DDL. Works very well, it's straightforward, and saves a bunch of trial and error.

Thanks for contributing an answer to Stack Overflow!

But avoid …

To learn more, see our tips on writing great answers.

Some of your past answers have not been well-received, and you're in danger of being blocked from answering.

Please pay close attention to the following guidance:

But avoid …

To learn more, see our tips on writing great answers.

Required, but never shown

By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

搜尋此網誌

Dfyjkt