How to analyze and reduce S3 storage usage?

Andreas Wittig – 06 Mar 2020

S3 is an object store, not a file system. Object storage comes with significant advantages: unlimited storage capacity, high availability, and durability. However, there are some disadvantages too. For example, it is cumbersome to calculate the storage usage of a specific prefix (also known as a folder) in S3.


For example, I’m using an S3 bucket to store personal data.

|- backups/
|- photos/
|- videos/
|- documents/
|- songsheets/
|- ebooks/

The CloudWatch metric BucketSizeBytes shows the storage usage of each S3 bucket broken down by storage class. Currently, it takes about 460 GB to store my data. That’s $5.60 per month.

CloudWatch Metric showing S3 Storage Usage

To reduce costs, I’d like to delete unused data. But where should I start looking for data that is no longer valuable to me? My bucket contains more than 100,000 objects. Therefore, I want to find the prefixes (also known as folders) with the highest storage consumption.

Doing so isn’t easy. Neither the AWS Management Console nor the S3 API provides a way to fetch the storage usage per prefix. Remember, S3 is an object store, not a file system. In theory, you could list all objects in the bucket, retrieve each object’s size, and calculate the storage usage per prefix yourself. But for a bucket with more than 100,000 objects, doing so is cumbersome, slow, and costly.

Luckily, there is a better way to analyze S3 storage usage.

  • S3 Inventory provides CSV, ORC, or Parquet files listing all the objects stored within an S3 bucket on a daily or weekly basis.
  • Athena runs SQL queries against CSV, ORC, or Parquet files stored in S3 and analyzes the data on the fly.

S3 Inventory + Athena

Next, you will learn how to enable S3 Inventory, set up Athena, and analyze storage usage with Athena.


Enabling S3 Inventory

Go through the following steps to enable S3 Inventory for a bucket.

  1. Open the AWS Management Console.
  2. Go to Simple Storage Service (S3).
  3. Create a new bucket to store the inventory files.
  4. Open the bucket whose storage usage you want to analyze and reduce.
  5. Switch to the Management tab.
  6. Select Inventory.
  7. Press the Add new button.
  8. Fill out the details, as shown in the following screenshot.

Configure S3 Inventory

Please note, it can take up to 24 hours until the first inventory files show up in the inventory bucket.

Setting up Athena

To run SQL queries to analyze the inventory data, you need to create an Athena table first. Go to the Athena service in the AWS Management Console to do so.

Athena: Create table

Next, execute the following query to create the inventory table. Replace $InventoryBucket, $InventoryPrefix, $Bucket, and $InventoryName with the configuration from the previous section.

CREATE EXTERNAL TABLE inventory(
  bucket string,
  key string,
  size bigint,
  last_modified_date timestamp,
  e_tag string,
  storage_class string,
  is_multipart_uploaded boolean,
  replication_status string,
  encryption_status string
)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION 's3://$InventoryBucket/$InventoryPrefix/$Bucket/$InventoryName/hive';
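Please note, the statement above assumes you chose the ORC output format when enabling the inventory. If you picked CSV instead, swap the row format clauses. The following sketch is based on my reading of the Athena documentation, so double-check it against your inventory configuration.

```sql
-- Variant for inventories delivered in CSV format (assumption: CSV was
-- selected as the output format in the inventory configuration).
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
```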

Also, you need to make sure that Athena knows about the table’s partitions (one per inventory delivery, keyed by the dt column). Do so by executing the following command.

MSCK REPAIR TABLE inventory;
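To verify that at least one partition was registered, you can list the partitions Athena knows about. Each row corresponds to a dt value you can later filter on.

```sql
SHOW PARTITIONS inventory;
```

If the result is empty, the first inventory delivery has probably not arrived yet.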

That’s it; you have configured Athena. You are ready to analyze the inventory data.

Analyzing storage usage with Athena

Execute the following query to analyze the storage usage by the first part of the prefix. Replace $Yesterday with yesterday’s date in the format of the dt partition column (e.g., 2020-03-03-00-00).

SELECT prefix, SUM(size)/1000/1000/1000 AS total_size FROM (
  SELECT regexp_extract(i.key, '([^\/]*\/).*', 1) AS prefix, i.size
  FROM inventory AS i WHERE i.dt = '$Yesterday'
) GROUP BY prefix ORDER BY total_size DESC;

In my case, the objects with the prefix backups/ use 454 GB of storage. So I know where to start looking for data that is no longer valuable to me.

Athena: Query inventory
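The same approach works one level deeper. The following sketch follows the pattern of the query above and drills down into the backups/ prefix from my bucket; replace backups/ with whatever prefix dominates in yours.

```sql
-- Storage usage by second-level prefix within backups/ (example prefix).
SELECT prefix, SUM(size)/1000/1000/1000 AS total_size FROM (
  SELECT regexp_extract(i.key, '(backups\/[^\/]*\/).*', 1) AS prefix, i.size
  FROM inventory AS i
  WHERE i.dt = '$Yesterday' AND i.key LIKE 'backups/%'
) GROUP BY prefix ORDER BY total_size DESC;
```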

Please note, I have used a personal bucket with less than 500 GB of data as an example in this blog post. However, the concept shown here works for buckets with vast numbers of objects and amounts of data as well.
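Speaking of many objects: the inventory table also answers related questions, for example how many objects each prefix contains. The following query is a sketch following the same pattern as before.

```sql
-- Number of objects per top-level prefix.
SELECT regexp_extract(i.key, '([^\/]*\/).*', 1) AS prefix, COUNT(*) AS objects
FROM inventory AS i WHERE i.dt = '$Yesterday'
GROUP BY 1 ORDER BY objects DESC;
```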

Tags: aws s3 costs
Andreas Wittig


I’m an independent consultant, technical writer, and programming founder. All these activities have to do with AWS. I’m writing this blog and all other projects together with my brother Michael.

In 2009, we joined the same company as software developers. Three years later, we were looking for a way to deploy our software—an online banking platform—in an agile way. We got excited about the possibilities in the cloud and the DevOps movement. It’s no wonder we ended up migrating the whole infrastructure of Tullius Walden Bank to AWS. This was a first in the finance industry, at least in Germany! Since 2015, we have accelerated the cloud journeys of startups, mid-sized companies, and enterprises. We have penned books like Amazon Web Services in Action and Rapid Docker on AWS, we regularly update our blog, and we are contributing to the Open Source community. Besides running a 2-headed consultancy, we are entrepreneurs building Software-as-a-Service products.

We are available for projects.

You can contact me via Email, Twitter, and LinkedIn.
