How to analyze and reduce S3 storage usage?

Andreas Wittig – 06 Mar 2020

S3 is an object store, not a file system. Object storage comes with significant advantages: unlimited storage capacity, high availability, and durability. However, there are some disadvantages too. For example, it is cumbersome to calculate the storage usage of a specific prefix (also known as a folder) in S3.

For example, I’m using an S3 bucket to store personal data.

|- backups/
|- photos/
|- videos/
|- documents/
|- songsheets/
|- ebooks/

The CloudWatch metric BucketSizeBytes shows the storage usage of each S3 bucket broken down by storage class. Currently, my data takes up about 460 GB, which costs $5.60 per month.
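If you prefer a script over the console, the same metric can be fetched through the CloudWatch API. The following is a minimal sketch, assuming boto3 is installed and AWS credentials are configured. The function names and the bucket name used in the usage note are made up for illustration, and StandardStorage is only one of the storage types the metric is reported for.

```python
from datetime import datetime, timedelta, timezone


def bucket_size_request(bucket, storage_type="StandardStorage", days=2, now=None):
    """Build the parameters for CloudWatch get_metric_statistics."""
    now = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/S3",
        "MetricName": "BucketSizeBytes",
        "Dimensions": [
            {"Name": "BucketName", "Value": bucket},
            {"Name": "StorageType", "Value": storage_type},
        ],
        "StartTime": now - timedelta(days=days),
        "EndTime": now,
        "Period": 86400,  # one datapoint per day
        "Statistics": ["Average"],
    }


def latest_bucket_size_gb(bucket, storage_type="StandardStorage"):
    """Return the most recent bucket size in GB, or None without datapoints."""
    import boto3  # imported lazily so the helper above works without AWS access

    cloudwatch = boto3.client("cloudwatch")
    response = cloudwatch.get_metric_statistics(
        **bucket_size_request(bucket, storage_type)
    )
    datapoints = sorted(response["Datapoints"], key=lambda d: d["Timestamp"])
    if not datapoints:
        return None
    return datapoints[-1]["Average"] / 1000 ** 3
```

Calling latest_bucket_size_gb("my-bucket") would return the size for one storage class only; to get the full picture, repeat the call for each storage class in use.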

CloudWatch Metric showing S3 Storage Usage

To reduce costs, I’d like to delete unused data. But where should I start looking for data that is no longer valuable to me? My bucket contains more than 100,000 objects. Therefore, I want to find the prefixes (also known as folders) with the highest storage consumption.

Doing so isn’t easy. Neither the AWS Management Console nor the S3 API provides a way to fetch the storage usage per prefix. Remember, S3 is an object store, not a file system. In theory, you could list all objects in the bucket, retrieve the object size, and calculate the storage usage per prefix on your own. But doing so is cumbersome, slow, and costly.
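To illustrate the do-it-yourself approach mentioned above, here is a minimal sketch that sums object sizes per first-level prefix. The helper names are hypothetical, and list_objects assumes boto3 with configured credentials. Each ListObjectsV2 request returns at most 1,000 keys, so a bucket with 100,000 objects needs at least 100 requests, which is why this approach scales poorly.

```python
from collections import defaultdict


def first_level_prefix(key):
    """Return the key up to and including the first '/', or '' if there is none."""
    head, sep, _ = key.partition("/")
    return head + sep if sep else ""


def usage_by_prefix(objects):
    """Sum object sizes in bytes per first-level prefix.

    `objects` is any iterable of (key, size_in_bytes) pairs.
    """
    totals = defaultdict(int)
    for key, size in objects:
        totals[first_level_prefix(key)] += size
    return dict(totals)


def list_objects(bucket):
    """Yield (key, size) for every object in the bucket via ListObjectsV2."""
    import boto3  # imported lazily; requires AWS credentials

    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            yield obj["Key"], obj["Size"]
```

With credentials in place, usage_by_prefix(list_objects("my-bucket")) would return a dictionary such as {"backups/": ..., "photos/": ...}, where "my-bucket" is a placeholder.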

Luckily, there is a better way to analyze S3 storage usage.

  • S3 Inventory provides CSV, ORC, or Parquet files listing all the objects stored within an S3 bucket on a daily or weekly basis.
  • Athena queries CSV, ORC, or Parquet files and analyzes data on the fly.

S3 Inventory + Athena

Next, you will learn how to enable S3 Inventory, set up Athena, and analyze storage usage with Athena.

Enabling S3 Inventory

Go through the following steps to enable S3 Inventory for a bucket.

  1. Open the AWS Management Console.
  2. Go to Simple Storage Service (S3).
  3. Create a new bucket to store the inventory files.
  4. Open the bucket whose storage usage you want to analyze.
  5. Switch to the Management tab.
  6. Select Inventory.
  7. Press the Add new button.
  8. Fill out the details, as shown in the following screenshot.

Configure S3 Inventory

Please note that it can take up to 24 hours until the first inventory files show up in the inventory bucket.
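The same configuration can also be applied through the S3 API instead of the console. The sketch below is an assumption about what the console form contains: ORC as the output format (inferred from the ORC SerDe used in the Athena table), a daily schedule, and the optional fields that match the table columns. The function names are hypothetical. Also note that, as far as I know, the inventory bucket needs a bucket policy that allows S3 to write the inventory files.

```python
def inventory_configuration(inventory_bucket, inventory_prefix, inventory_name):
    """Build an InventoryConfiguration for put_bucket_inventory_configuration."""
    return {
        "Id": inventory_name,
        "IsEnabled": True,
        "IncludedObjectVersions": "Current",
        "Schedule": {"Frequency": "Daily"},
        "Destination": {
            "S3BucketDestination": {
                "Bucket": f"arn:aws:s3:::{inventory_bucket}",
                "Prefix": inventory_prefix,
                "Format": "ORC",  # assumed; matches the OrcSerde in the Athena table
            }
        },
        # Optional fields corresponding to the columns of the Athena table.
        "OptionalFields": [
            "Size",
            "LastModifiedDate",
            "ETag",
            "StorageClass",
            "IsMultipartUploaded",
            "ReplicationStatus",
            "EncryptionStatus",
        ],
    }


def enable_inventory(bucket, inventory_bucket, inventory_prefix, inventory_name):
    """Enable S3 Inventory on `bucket`, writing reports to `inventory_bucket`."""
    import boto3  # imported lazily; requires AWS credentials

    boto3.client("s3").put_bucket_inventory_configuration(
        Bucket=bucket,
        Id=inventory_name,
        InventoryConfiguration=inventory_configuration(
            inventory_bucket, inventory_prefix, inventory_name
        ),
    )
```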

Setting up Athena

To run SQL queries to analyze the inventory data, you need to create an Athena table first. Go to the Athena service in the AWS Management Console to do so.

Athena: Create table

Next, execute the following query to create the inventory table. Replace $InventoryBucket, $InventoryPrefix, $Bucket, and $InventoryName with the configuration from the previous section.

CREATE EXTERNAL TABLE inventory(
  bucket string,
  key string,
  size bigint,
  last_modified_date timestamp,
  e_tag string,
  storage_class string,
  is_multipart_uploaded boolean,
  replication_status string,
  encryption_status string
)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION 's3://$InventoryBucket/$InventoryPrefix/$Bucket/$InventoryName/hive';
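To make the placeholder substitution less error-prone, the LOCATION value can be assembled programmatically. This is only a small sketch; the function name and the example values are made up, and it assumes the inventory files follow the path layout used in the query above.

```python
def inventory_location(inventory_bucket, inventory_prefix, bucket, inventory_name):
    """Build the S3 path of the hive folder that S3 Inventory writes to."""
    parts = [
        inventory_bucket,
        inventory_prefix.strip("/"),  # tolerate leading or trailing slashes
        bucket,
        inventory_name,
        "hive",
    ]
    # Skip empty parts, e.g. when no inventory prefix was configured.
    return "s3://" + "/".join(part for part in parts if part)


print(inventory_location("my-inventory", "inventory/", "my-data", "daily"))
```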

Also, you need to make sure that the partitions of the table are up to date. Do so by executing the following command.

MSCK REPAIR TABLE inventory;

That’s it; you have configured Athena. You are ready to analyze the inventory data.

Analyzing storage usage with Athena

Execute the following query to analyze the storage usage by the first part of the prefix. Replace $Yesterday with yesterday’s timestamp (e.g., 2020-03-03-00-00).

SELECT prefix, SUM(size)/1000/1000/1000 AS total_size
FROM (
  SELECT regexp_extract(i.key, '([^\/]*\/).*', 1) AS prefix, i.size
  FROM inventory AS i
  WHERE i.dt = '$Yesterday'
)
GROUP BY prefix
ORDER BY total_size DESC;
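The two moving parts of this query can be sketched in Python: deriving the value for $Yesterday and extracting the first part of the prefix. This is an illustration under assumptions, not part of the Athena setup. The function names are hypothetical, the partition format is taken from the example timestamp above, and date.today() uses the local time zone, which may differ from the time zone of the inventory partitions.

```python
import re
from datetime import date, timedelta


def yesterday_partition(today=None):
    """Return yesterday's partition value in the form YYYY-MM-DD-00-00."""
    today = today or date.today()
    return (today - timedelta(days=1)).strftime("%Y-%m-%d-00-00")


# Same pattern as in the query: everything up to and including the first '/'.
PREFIX_PATTERN = re.compile(r"([^/]*/).*")


def extract_prefix(key):
    """Mimic regexp_extract(key, pattern, 1); return None when nothing matches."""
    match = PREFIX_PATTERN.search(key)
    return match.group(1) if match else None


print(yesterday_partition(date(2020, 3, 4)))  # 2020-03-03-00-00
print(extract_prefix("backups/2020/db.tar"))  # backups/
print(extract_prefix("readme.txt"))           # None
```

Objects without a slash in their key have no prefix, so they end up in a single group without a prefix name. Also, if I read the SQL semantics correctly, dividing the bigint sum by 1000 is an integer division, so prefixes holding less than 1 GB would show up with a total_size of 0.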

In my case, the objects with prefix backups/ use 454 GB of storage. So I know where to look for data that is no longer valuable to me to reduce storage costs.

Athena: Query inventory

Please note that I have used a personal bucket with less than 500 GB as an example in this blog post. However, the approach shown here works for buckets with vast amounts of objects and data as well.
