Optimize AWS Athena Costs - Partitioning, Parquet & Query Tuning

Did you know?

Athena analyzes massive datasets directly in Amazon S3, scaling automatically to handle petabytes of data. It's fast, capable of querying a terabyte of data in under 60 seconds!

‍

Amazon Athena is essential for businesses handling extensive datasets on Amazon S3. Its automatic scalability and processing of petabytes of data ensure efficient and dependable analysis. With over 5,303 customers leveraging Amazon Athena, commanding a market share of 19.92% and a global rank of 2, its significance in the industry is evident.^[1]‍

Optimizing Athena costs is pivotal for sustaining profitability and operational efficiency. In this article, we will delve into the Amazon Athena pricing structure and explore strategies to reduce Athena costs effectively.

‍

Global Utilization of Amazon Athena

Prominent companies with over 10,000 employees that use Amazon Athena for their data analytics needs include Tech Mahindra, Mphasis, PricewaterhouseCoopers, Cardinal Health, and PostNL.

Utilization of Amazon Athena spans across various domains, empowering organizations to extract valuable insights from their data stored in Amazon S3. Amazon Athena empowers various domains by enabling organizations to extract insights from data stored in Amazon S3. It facilitates querying and analysis of big data, supports machine learning workflows, aids AI initiatives, and enhances data analytics with ad hoc querying and exploratory analysis. In software development, Athena aids in log analysis and performance monitoring. It also supports complex SQL queries for diverse analytical tasks and fosters innovation and competitiveness in digital transformation.^[1]

‍

Understanding Amazon Athena Pricing

The table below outlines Amazon Athena's service pricing for data analysis. SQL queries are priced by data scanned, with an option for dedicated resources. Apache Spark execution is available, with additional costs covering standard S3 rates and federated queries.^[2]

Service	Price	Description
SQL queries	$5.00 per TB of data scanned	Pay-per-query billing for data scanned, with a minimum of 10 MB per query. No charges for DDL statements or failed queries.
SQL queries with Provisioned Capacity	$0.30 per DPU hour billed per minute	Use dedicated compute resources with flat-rate pricing. Scale up or down as needed, starting from 24 DPU.
Apache Spark code execution	$0.35 per DPU-hour billed per minute	Pay for Apache Spark application runtime based on DPU usage. Includes provisioned notebook and driver nodes.
Additional costs	Standard S3 rates apply	Athena queries S3 data with no extra storage charges. Standard S3 rates apply for storage, requests, and data transfer. Federated queries are billed per TB scanned, with a 10 MB minimum per query. AWS Glue Data Catalog usage incurs standard rates. Lambda functions invoked by federated queries are charged at standard rates.

‍

Pricing example

Consider a scenario where you have a table stored on Amazon S3 as an uncompressed text file, with a total size of 3 TB. Let's break down the costs and optimizations involved in running a query to retrieve data from a single column of this table:

a) Query Cost without Compression:

File Size: 3 TB
Data Scanned: 3 TB
Cost: 3 TB * $5/TB = $15

b) Columnar Format Optimization:

File Size: 3 TB (uncompressed)
Data Scanned (reading a single column): 3 TB / 4 = 0.75 TB
Cost: 0.75 TB * $5/TB = $3.75

This breakdown demonstrates the significant cost savings achieved through columnar format optimization.

‍

Strategies to Optimize Athena and reduce Costs

Image depicting AWS Athena Cost Opimization Strategies

‍

Given below are some strategies that can help you optimize performance and save costs without compromising efficiency.^[3]

‍

1. Data Partitioning

Partitioning your data in Amazon Athena can significantly reduce costs and improve query performance. By dividing your table into parts based on column values like date, country, or region, you can limit the amount of data scanned for each query. This reduces both query runtime and cost, as Athena charges based on the amount of data scanned from Amazon S3.Partitioning acts like adding virtual columns to your table. You define these partitions when you create the table. For example, if you partition a dataset by year, you only scan data from specific years when querying, which reduces the data volume and costs.

Here's an example of how a dataset might be partitioned by year in an S3 bucket:

‍

a) Creating a Partitioned Table:

When creating a table for this dataset, you include a PARTITIONED BY clause:

‍

After the table is created, you need to add each partition using an AWS Glue crawler, the ALTER TABLE ADD PARTITION command, or by running MSCK REPAIR TABLE.

b) Querying Partitioned Data:

When you query a partitioned table, specify the partition key in the WHERE clause to limit the data scanned.

‍

This query only scans data from s3://athena-examples/flight/parquet/year=1991/, thus reducing costs.

The table below compares query runtimes and costs between partitioned and non-partitioned tables. The dataset contains 74 GB of data, partitioned by the l_shipdate column.

Query	Non-Partitioned Table	Partitioned Table	Savings
SELECT COUNT(*) FROM lineitem WHERE l_shipdate = '1996-09-01'	Execution Time: 4.8s Data Scanned: 74.1 GB Total Price: $0.36	Execution Time: 0.7s Data Scanned: 29.96 MB Total Price: $0.0001	99% cheaper, 85% faster
SELECT COUNT(*) FROM lineitem WHERE l_shipdate >= '1996-09-01' AND l_shipdate < '1996-10-01'	Execution Time: 4.4s Data Scanned: 74.1 GB Total Price: $0.36	Execution Time: 2.0sData Scanned: 898.58 MB Total Price: $0.004	98% cheaper, 54% faster

‍

2. Use Compression

To reduce Athena costs, compress your data to decrease its size, which reduces the amount of data scanned from Amazon S3 and lowers query costs. Athena supports various compression formats like gzip, Snappy, and zstd. Note that many common compression formats are not splittable, so only one processing node can handle a single compressed file. To optimize performance, balance the number and size of your compressed files. Formats like Parquet and ORC are splittable and recommended for better performance and hence reducing Athena costs.

Example Code Snippet:

The code creates an external table for uncompressed CSV data in S3 and another for compressed Parquet data with Snappy compression. It then copies data from the uncompressed table to the compressed table, reducing future Athena query costs by minimizing data scanned.

‍

3. Bucket Your Data

Bucketing your data in Amazon Athena helps reduce costs by minimizing the amount of data that a query reads. Bucketing distributes records into separate files based on the value of a specific column, ensuring that all records with the same value are stored together. This is particularly useful for columns with high cardinality, like user IDs or device IDs, and for queries that frequently search for specific values in these columns.When you bucket data, you organize it into a set number of buckets based on the values in one column. This organization allows Athena to scan only the relevant bucket instead of the entire dataset when executing queries, reducing the data scanned and, therefore, the cost.

Let's consider a customer table of size 2.29 GB, with the c_custkey column used for bucketing into 32 buckets. Here's how we compare the costs:

Metric	Non-Bucketed Table	Bucketed Table	Savings
Query	SELECT COUNT(*) FROM customer WHERE c_custkey = 12677856;	SELECT COUNT(*) FROM customer_bucketed WHERE c_custkey = 12677856;	-
Runtime	1.3 seconds	0.82 seconds	0.48 seconds (37% faster)
Data Scanned	2.29 GB	72.94 MB	2.21706 GB (97% less data)
Cost Calculation	2.29 GB /1024 GB X 5 USD	72.94 MB /1024 MB X 5 USD	-
Cost	$0.01145	$0.0003645	$0.0110855 (97% cheaper)

‍

By bucketing the customer table on the c_custkey column into 32 buckets, the cost of the query was reduced from $0.01145 to $0.0003645, achieving a 97% reduction in cost and scanning significantly less data.

‍

4. Partition Projection

Optimizing partition processing using partition projection in Amazon Athena can significantly reduce costs by speeding up query execution for highly partitioned tables. Instead of retrieving partition metadata from AWS Glue or another metastore, partition projection calculates partition values and locations in-memory, which is faster and reduces query runtime. This approach eliminates the need for manual partition metadata updates, simplifies partition management, and is particularly beneficial for tables with high cardinality columns or finely granular date ranges. By enabling partition projection, queries become more efficient and cost-effective, ultimately lowering Athena processing costs. For implementation, you can configure partition projection in your table creation statement with appropriate settings for partition types and ranges.

To enable partition projection, you can modify your table creation statement to include the necessary configurations. Here’s an example of how to set it up:

‍

5. Query Result Reuse

Query Result Reuse in Amazon Athena helps reduce costs by avoiding unnecessary data scans and query reruns. By reusing the results of previous queries when the source data hasn't changed, Query Result Reuse minimizes data scan costs and query latency. This feature is particularly useful for frequently run queries on datasets that update infrequently, such as daily or less often. It enables users to specify a maximum age for result reuse, ensuring efficient query execution and cost savings. With Query Result Reuse, Athena automatically checks for existing results matching the query criteria, transparently rerunning the query only when necessary. This capability benefits not only individual users but also entire workgroups, improving productivity and lowering overall costs for shared datasets and dashboards.^[4]

Here's an example of how to enable Query Result Reuse when running a query using AWS SDK for Python:

‍

6. Optimizing Query Performance

Optimizing query performance in Athena directly correlates with reducing costs. When queries run more efficiently, they consume fewer computational resources, resulting in lower costs incurred by the user. Here's how each of the mentioned best practices contributes to cost reduction:

a) Optimize ORDER BY: Efficiently ordering query results reduces the amount of data processed and transmitted, ultimately lowering the cost associated with data scanning and network usage.

b) Optimize joins: Optimizing join operations minimizes the computational overhead involved in merging datasets, leading to reduced query execution time and, consequently, lower costs.

c) Optimize GROUP BY: Streamlining GROUP BY operations reduces the amount of data processed for aggregation, resulting in quicker query execution and reduced costs.

d) Use approximate functions: Approximate functions like approximate count distinct or approximate quantiles offer faster execution times compared to their exact counterparts. By leveraging these functions, users can achieve quicker query results and lower costs.

e) Only include the columns that you need: Selecting only the necessary columns reduces the volume of data processed, leading to faster query execution and decreased costs associated with data scanning and transmission.

Here's an example SQL query that incorporates all the mentioned query tuning best practices:

‍

We filter transactions by date, join transaction and customer data, group by product category, sort by total sales, and limit results to the top 10 categories.

‍

Conclusion

In summary, Amazon Athena is a vital tool for businesses managing large datasets on Amazon S3, offering scalability and efficient data analysis. Optimizing Athena costs is crucial for profitability and efficiency. Strategies like data partitioning, compression, bucketing, partition projection,Optimizing query performance and Query Result Reuse can effectively reduce costs and improve query performance, enhancing productivity and decision-making processes.

‍

References

1. https://6sense.com/tech/interactive-content-marketing/amazon-athena-market-share#free-plan-signup

2. https://aws.amazon.com/athena/pricing/

3. https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/

4. https://aws.amazon.com/blogs/big-data/reduce-cost-and-improve-query-performance-with-amazon-athena-query-result-reuse/

‍

FAQs

1. How much does Athena cost per terabyte?

‍Athena costs $5 per terabyte scanned. This pricing model emphasizes efficient data scanning and querying practices to manage costs effectively.‍

2. What is the optimal file size for Athena?

The optimal file size for Athena is generally between 1 to 2 GBs. This size strikes a balance between query performance and cost efficiency.‍

3. Is Amazon Athena fast?‍

Amazon Athena is designed for fast query execution, typically delivering results within seconds.

4. What is AWS Athena good for?

AWS Athena is particularly useful for querying large datasets stored in Amazon S3 using standard SQL. It's ideal for ad-hoc queries, log analysis, and extracting insights from structured and unstructured data.

5. How to save on Athena costs?

To optimize Athena costs, consider practices such as data partitioning, using compression formats like Parquet or ORC, defining optimal file sizes, and leveraging result set caching to minimize redundant queries. These strategies help reduce the amount of data scanned and improve query efficiency, thereby lowering overall costs.

‍

How to optimize AWS Athena Costs ?

Global Utilization of Amazon Athena

Understanding Amazon Athena Pricing

Pricing example

Strategies to Optimize Athena and reduce Costs

1. Data Partitioning

2. Use Compression

3. Bucket Your Data

4. Partition Projection

5. Query Result Reuse

6. Optimizing Query Performance

Conclusion

FAQs

FAQs