Anonymize CloudFront Access Logs

Michael Wittig – 22 Apr 2020

Amazon CloudFront can upload access log files to an S3 bucket. By default, CloudFront logs the IP address of the client. Optionally, cookies can be logged as well. If EU citizens access your CloudFront distribution, you have to process personally identifiable information (PII) in a General Data Protection Regulation (GDPR) compliant way. IP addresses are considered PII, and cookie data could also contain PII. If you want to process and store PII, you need a reason in the spirit of the GDPR.

Disclaimer: I’m not a lawyer! This is not legal advice.

Access logs are required to support operations, for example to debug issues. For that purpose, it is okay to keep the access logs for seven days. But you might need access logs for capacity planning as well. How can you keep the access logs for more than seven days without violating GDPR?

Anonymize Data

The question is: do you really need the IP address in your access logs? The answer is likely no. Unfortunately, CloudFront does not allow us to disable the IP address logging. We have to implement a workaround that anonymizes the access logs as soon as they are available on S3. The workaround works like this:

[Diagram: Anonymize CloudFront Access Logs]

The diagram was created with Cloudcraft - Visualize your cloud architecture like a pro.

We can use a mechanism similar to the one implemented by Google Analytics. An IPv4 address like 91.45.135.67 is turned into 91.45.135.0 (the last 8 bits are removed, 24 bits are kept). IPv6 addresses need different logic: Google removes the last 80 bits. I will go one step further and remove the last 96 bits, keeping only the first 32 bits.

The following steps are needed to anonymize an access log file:

  1. Download the object from S3
  2. Decompress the gzip data
  3. Parse the data (tab-separated values, log file format)
  4. Replace the IP addresses with anonymized values
  5. Compress the data with gzip
  6. Upload the anonymized data to S3
  7. Remove the original data from S3

There is no documented maximum size of an access log file. We should prepare for files that are larger than the available memory. Luckily, Lambda functions support Node.js, which has superb support for streaming data. If we stream data, we never load all data into memory at once.

First, we load some core Node.js dependencies and the AWS SDK:

const fs = require('fs');
const zlib = require('zlib');
const stream = require('stream');
const AWS = require('aws-sdk');
const s3 = new AWS.S3({apiVersion: '2006-03-01'});

It’s time to implement the anonymization:

// Keep the first 24 bits of an IPv4 address and set the last octet to 0.
function anonymizeIPv4Address(str) {
  const s = str.split('.');
  s[3] = '0';
  return s.join('.');
}

// Keep the first two groups (32 bits) of an IPv6 address and append '::'.
function anonymizeIPv6Address(str) {
  const s = str.split(':').slice(0, 2);
  s.push(':');
  return s.join(':');
}

function anonymizeIpAddress(str) {
  // The placeholder values '-' and 'unknown' are passed through unchanged.
  if (str === '-' || str === 'unknown') {
    return str;
  }
  if (str.includes('.')) {
    return anonymizeIPv4Address(str);
  } else if (str.includes(':')) {
    return anonymizeIPv6Address(str);
  } else {
    throw new Error('Neither IPv4 nor IPv6: ' + str);
  }
}
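
One edge case worth knowing: the split-based IPv6 function assumes that the first two groups of the address are written out. For a compressed address such as 2001::1, splitting on ':' yields an empty second group. The following is a minimal sketch of a variant that expands '::' before truncating. The function name anonymizeIPv6AddressExpanded is hypothetical and not part of the original code.

```javascript
// Hypothetical variant (not part of the original code): expands '::' into
// explicit zero groups first, so the first 32 bits are always written out.
function anonymizeIPv6AddressExpanded(str) {
  const [head, tail] = str.split('::');
  const headGroups = head ? head.split(':') : [];
  const tailGroups = tail ? tail.split(':') : [];
  let groups = headGroups;
  if (str.includes('::')) {
    // '::' stands for as many zero groups as are needed to reach 8 groups.
    const missing = Math.max(0, 8 - headGroups.length - tailGroups.length);
    groups = [...headGroups, ...new Array(missing).fill('0'), ...tailGroups];
  }
  // Keep the first two groups (32 bits) and compress the rest to '::'.
  return groups.slice(0, 2).join(':') + '::';
}
```

With this sketch, 2001:db8:85a3::8a2e:370:7334 becomes 2001:db8:: and 2001::1 becomes 2001:0::.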

We also have to deal with TSV (tab-separated values):

function transformLine(line) {
  // Keep header lines (starting with '#') and empty lines as they are.
  if (line.startsWith('#') || line.trim() === '') {
    return line;
  }
  const values = line.split('\t');
  values[4] = anonymizeIpAddress(values[4]); // c-ip
  values[19] = anonymizeIpAddress(values[19]); // x-forwarded-for
  return values.join('\t');
}

So far, we process only small amounts of data: a single access log file line. It’s time to deal with the whole file.

Each chunk of data is represented as a buffer in Node.js. A buffer represents binary data in the form of a sequence of bytes. In the buffer, we search for the line-end \n byte. We slice all bytes from the beginning to \n and convert them into a string to extract a line. We continue with this approach until the end of the file is reached. There is one edge case: a chunk of data can stop in the middle of a line. In that case, we have to prepend the remainder of the old chunk to the new chunk.

async function process(record) {
  // Bytes of an unfinished line, carried over to the next chunk.
  let chunk = Buffer.alloc(0);
  const transform = (currentChunk, encoding, callback) => {
    chunk = Buffer.concat([chunk, currentChunk]);
    const lines = [];
    while (chunk.length > 0) {
      const i = chunk.indexOf('\n', 'utf8');
      if (i === -1) {
        break;
      } else {
        lines.push(chunk.slice(0, i).toString('utf8'));
        chunk = chunk.slice(i + 1);
      }
    }
    // The empty element ensures that the joined output ends with '\n'.
    lines.push('');
    const transformed = lines
      .map(transformLine)
      .join('\n');
    callback(null, Buffer.from(transformed, 'utf8'));
  };
  const params = {
    Bucket: record.s3.bucket.name,
    Key: record.s3.object.key
  };
  if ('versionId' in record.s3.object) {
    params.VersionId = record.s3.object.versionId;
  }
  // Download, decompress, anonymize, and compress again as one stream.
  const body = s3.getObject(params).createReadStream()
    .pipe(zlib.createGunzip())
    .pipe(new stream.Transform({
      transform
    }))
    .pipe(zlib.createGzip());
  // Turns 'name.gz' into 'name.anonymized.gz'.
  await s3.upload({
    Bucket: record.s3.bucket.name,
    Key: record.s3.object.key.slice(0, -2) + 'anonymized.gz',
    Body: body
  }).promise();
  // Leftover bytes mean that the file did not end with '\n'.
  if (chunk.length > 0) {
    throw new Error('file was not read completely');
  }
  return s3.deleteObject(params).promise();
}
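
The chunk handling can also be tried locally without S3. The following is a minimal sketch under a few assumptions: the helpers createLineSplitter and createLineTransform are hypothetical names, and, unlike the code above, the sketch emits a final line without a trailing \n in flush instead of throwing an error.

```javascript
const { Transform } = require('stream');

// Hypothetical helper: collects complete lines from arbitrary chunks and
// keeps the bytes of an unfinished line until the next chunk arrives.
function createLineSplitter() {
  let rest = Buffer.alloc(0);
  return {
    push(chunk) {
      rest = Buffer.concat([rest, chunk]);
      const lines = [];
      let i = rest.indexOf('\n');
      while (i !== -1) {
        // Decode only complete lines, so a multi-byte character that is
        // split across two chunks is never decoded halfway.
        lines.push(rest.subarray(0, i).toString('utf8'));
        rest = rest.subarray(i + 1);
        i = rest.indexOf('\n');
      }
      return lines;
    },
    end() {
      const last = rest.length > 0 ? rest.toString('utf8') : null;
      rest = Buffer.alloc(0);
      return last;
    }
  };
}

// Hypothetical helper: wraps the splitter in a Transform stream that
// applies fn (for example transformLine) to every line.
function createLineTransform(fn) {
  const splitter = createLineSplitter();
  return new Transform({
    transform(chunk, encoding, callback) {
      const out = splitter.push(chunk).map((line) => fn(line) + '\n').join('');
      callback(null, out);
    },
    flush(callback) {
      const last = splitter.end();
      if (last !== null) {
        this.push(fn(last));
      }
      callback();
    }
  });
}
```

Separating the splitter from the stream keeps the carry-over logic testable with plain buffers. createLineTransform(transformLine) could then take the place of the stream.Transform in the pipeline.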

Finally, Lambda requires a thin interface that we have to implement. I also ensure that anonymized data is not processed again to avoid an expensive infinite loop.

exports.handler = async (event) => {
  console.log(JSON.stringify(event));
  for (const record of event.Records) {
    if (record.s3.object.key.endsWith('.anonymized.gz')) {
      continue;
    } else if (record.s3.object.key.endsWith('.gz')) {
      await process(record);
    }
  }
};
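
The key check in the handler can be isolated into a small function, which makes the loop protection easy to verify. This is a minimal sketch: the name anonymizedKeyFor is hypothetical. It also decodes the key, because object keys in S3 event notifications are URL-encoded, which matters as soon as a log prefix contains spaces or special characters.

```javascript
// Hypothetical helper: returns the target key for a raw key from an S3
// event, or null if the object must not be processed.
function anonymizedKeyFor(rawKey) {
  // Keys in S3 event notifications are URL-encoded; '+' stands for a space.
  const key = decodeURIComponent(rawKey.replace(/\+/g, ' '));
  if (key.endsWith('.anonymized.gz') || !key.endsWith('.gz')) {
    return null; // already anonymized, or not a gzip access log
  }
  // Turns 'name.gz' into 'name.anonymized.gz'.
  return key.slice(0, -2) + 'anonymized.gz';
}
```

For example, the key logs/example.gz maps to logs/example.anonymized.gz, while logs/example.anonymized.gz maps to null and is skipped.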

I integrated the workaround into our collection of aws-cf-templates. Check out the documentation or the code on GitHub. A similar approach can be used to anonymize access logs from ELB load balancers (ALB, CLB, NLB).

PS: You should also enable S3 lifecycle rules to delete access logs after 38 months.
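
Such a lifecycle rule could be set up with the AWS SDK. The following is a minimal sketch: the helper name buildExpirationParams, the rule ID, and the bucket name are hypothetical. Lifecycle rules count in days, so 38 months is approximated as 1156 days here, which is an assumption. Note that putBucketLifecycleConfiguration replaces any lifecycle configuration that already exists on the bucket.

```javascript
// Hypothetical helper: builds the parameters for the S3
// putBucketLifecycleConfiguration call. The default of 1156 days is an
// approximation of 38 months (assumption).
function buildExpirationParams(bucketName, days = 1156) {
  return {
    Bucket: bucketName,
    LifecycleConfiguration: {
      Rules: [{
        ID: 'expire-access-logs', // hypothetical rule name
        Status: 'Enabled',
        Filter: { Prefix: '' }, // empty prefix: applies to all objects
        Expiration: { Days: days }
      }]
    }
  };
}

// Usage with the s3 client created earlier (not executed here):
// await s3.putBucketLifecycleConfiguration(
//   buildExpirationParams('my-log-bucket')
// ).promise();
```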

Thanks to Thorsten Höger for reviewing this article.

Michael Wittig

I’m an independent consultant, technical writer, and programming founder. All these activities have to do with AWS. I’m writing this blog and all other projects together with my brother Andreas.

In 2009, we joined the same company as software developers. Three years later, we were looking for a way to deploy our software—an online banking platform—in an agile way. We got excited about the possibilities in the cloud and the DevOps movement. It’s no wonder we ended up migrating the whole infrastructure of Tullius Walden Bank to AWS. This was a first in the finance industry, at least in Germany! Since 2015, we have accelerated the cloud journeys of startups, mid-sized companies, and enterprises. We have penned books like Amazon Web Services in Action and Rapid Docker on AWS, we regularly update our blog, and we are contributing to the Open Source community. Besides running a 2-headed consultancy, we are entrepreneurs building Software-as-a-Service products.
