Creating table over written parquet file in Athena
Creating table over written parquet file in Athena
I am creating a parquet file from a CSV file using the following python script:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
csv_file = '~/Desktop/SWA_UK_Pickup_Forecast_HOURLY_M1_at_2017-11-28-04_20_21-UTC_from_28-Nov-2017_to_28-Nov-2017.csv'
chunksize = 10
csv_stream = pd.read_csv(csv_file, sep=',', chunksize=chunksize, low_memory=False)
for i, chunk in enumerate(csv_stream):
if i == 0:
parquet_schema = pa.Table.from_pandas(df=chunk).schema
parquet_writer = pq.ParquetWriter('/Users/kohujwal/Desktop/parquet_forecast_file.parquet', parquet_schema, compression='snappy')
table = pa.Table.from_pandas(chunk, schema=parquet_schema);
parquet_writer.write_table(table)
parquet_writer.close();
Then I am manually uploading it to some s3 location, and creating a table over it in Athena. The query used to create the table:
CREATE EXTERNAL TABLE IF NOT EXISTS forecast_report_lom_parquet (
`forecast_week` int,
`for_date` string,
`forecast_day_of_week` string,
`merchant_id` string,
`shipper_account_id` string,
`shipper_name` string,
`node_id` string,
`routing_preference` string,
`forecast_shipment_count` int,
`forecast_pallet_count` int,
`forecast_volume` double )
STORED AS PARQUET
LOCATION 's3://prevoir-athena-input-prod/spike-computation/forecast-reports/parquet/live-order-model/'
tblproperties ("parquet.compress"="SNAPPY")
The table is created successfully, however when I try to query the table, it doesn't display the contents. It just shows the indexes of the rows and the column names. Here is the CSV and the parquet file. The query on the parquet table :
SELECT * FROM prevoir_prod.forecast_report_lom_parquet
Results that I am seeing on the console.
Can anyone point out what's going wrong here?
1 Answer
1
Your results show row numbers. This means that Athena is finding the file(s) on S3, and is parsing them to the point of identifying rows.
It seems that Athena (or more precisely, the ParquetSerDe) isn't able to get columns from your file. This points to a mismatch between the CREATE EXTERNAL TABLE statement and the actual file. Some possibilities:
CREATE EXTERNAL TABLE
Whenever I have a file that I can't create a table across, I'll use the Glue crawler to inspect and build the create DDL. Works very well, it's straightforward, and saves a bunch of trial and error.
Thanks for contributing an answer to Stack Overflow!
But avoid …
To learn more, see our tips on writing great answers.
Some of your past answers have not been well-received, and you're in danger of being blocked from answering.
Please pay close attention to the following guidance:
But avoid …
To learn more, see our tips on writing great answers.
Required, but never shown
Required, but never shown
By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.
I saw this behavior in case of partitioned table. After create table DDL you need to run: MSCK REPAIR TABLE forecast_report_lom_parquet ; I wonder if regular table can also be helped by "REPAIR TABLE" ? aws.amazon.com/premiumsupport/knowledge-center/…
– dovka
Dec 21 '17 at 13:40