Anonymize CloudFront Access Logs

Michael Wittig – 22 Apr 2020

Amazon CloudFront can upload access log files to an S3 bucket. By default, CloudFront logs the IP address of the client. Optionally, cookies can be logged as well. If EU citizens access your CloudFront distribution, you have to process personally identifiable information (PII) in a way that complies with the General Data Protection Regulation (GDPR). IP addresses are considered PII, and cookie data could contain PII as well. If you want to process and store PII, you need a legal basis as defined by the GDPR.

Disclaimer: I’m not a lawyer! This is not legal advice.

Access logs are required for operations, for example, to debug issues. For that purpose, it is okay to keep the access logs for seven days [1]. But you might need access logs for capacity planning as well. How can you keep the access logs for more than seven days without violating the GDPR?

Anonymize Data

The question is: do you really need the IP address in your access logs? The answer is likely no. Unfortunately, CloudFront does not allow us to disable IP address logging. Therefore, we have to implement a workaround that anonymizes the access logs as soon as they are available on S3. The workaround works like this:

[Architecture diagram: Anonymize CloudFront Access Logs]

The diagram was created with Cloudcraft.

We can use a mechanism similar to the one implemented by Google Analytics. An IPv4 address like 91.45.135.67 is turned into 91.45.135.0: the last 8 bits are zeroed out, and the first 24 bits are kept. IPv6 addresses need a different logic: Google removes the last 80 bits. I will go one step further and remove the last 96 bits, keeping only the first 32 bits [2]. For example, 2a02:8109:8ac0:23b0:bd27:40d4:cd2f:1a34 becomes 2a02:8109::.

The following steps are needed to anonymize an access log file:

  1. Download the object from S3
  2. Decompress the gzip data
  3. Parse the data (tab-separated values, log file format)
  4. Replace the IP addresses with anonymized values
  5. Compress the data with gzip
  6. Upload the anonymized data to S3
  7. Remove the original data from S3

There is no documented maximum size of an access log file, so we should be prepared for files that are larger than the available memory. Luckily, Lambda functions support Node.js, which has superb support for dealing with streaming data. If we stream the data, we never load the whole file into memory at once.
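
As a minimal sketch of the idea (not the final function; the file names are made up), a Node.js stream pipeline chains decompression, transformation, and compression, so only one chunk lives in memory at a time:

const fs = require('fs');
const zlib = require('zlib');
const stream = require('stream');

fs.createReadStream('access-log.gz')          // hypothetical input file
  .pipe(zlib.createGunzip())                  // decompress chunk by chunk
  .pipe(new stream.Transform({
    transform(chunk, encoding, callback) {
      callback(null, chunk);                  // the real transform will rewrite IP addresses here
    }
  }))
  .pipe(zlib.createGzip())                    // compress chunk by chunk
  .pipe(fs.createWriteStream('access-log.anonymized.gz'));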

First, we load some core Node.js dependencies and the AWS SDK:

const fs = require('fs');
const zlib = require('zlib');
const stream = require('stream');
const AWS = require('aws-sdk');
const s3 = new AWS.S3({apiVersion: '2006-03-01'});

It’s time to implement the anonymization:

function anonymizeIPv4Address(str) {
  const s = str.split('.');
  s[3] = '0'; // zero out the last octet (the last 8 bits)
  return s.join('.');
}

function anonymizeIPv6Address(str) {
  const s = str.split(':').slice(0, 2); // keep the first two groups (32 bits)
  s.push(':');
  return s.join(':'); // e.g., 2a02:8109::
}

function anonymizeIpAddress(str) {
  if (str === '-' || str === 'unknown') { // CloudFront placeholder values
    return str;
  }
  if (str.includes('.')) {
    return anonymizeIPv4Address(str);
  } else if (str.includes(':')) {
    return anonymizeIPv6Address(str);
  } else {
    throw new Error('Neither IPv4 nor IPv6: ' + str);
  }
}
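
A quick sanity check of these helpers (the addresses are made-up examples):

anonymizeIpAddress('91.45.135.67');                            // => '91.45.135.0'
anonymizeIpAddress('2a02:8109:8ac0:23b0:bd27:40d4:cd2f:1a34'); // => '2a02:8109::'
anonymizeIpAddress('-');                                       // => '-'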

We also have to deal with the TSV (tab-separated values) log file format. Two fields contain IP addresses: c-ip (position 5) and x-forwarded-for (position 20):

function transformLine(line) {
  if (line.startsWith('#') || line.trim() === '') { // keep comments and empty lines as they are
    return line;
  }
  const values = line.split('\t');
  values[4] = anonymizeIpAddress(values[4]);   // c-ip
  values[19] = anonymizeIpAddress(values[19]); // x-forwarded-for
  return values.join('\t');
}

So far, we only process small amounts of data: a single line of an access log file. It’s time to deal with the whole file.

Each chunk of data is represented as a buffer in Node.js. A buffer holds binary data in the form of a sequence of bytes. In the buffer, we search for the line-end byte \n, slice all bytes from the beginning up to the \n, and convert them into a string to extract a line. We continue with this approach until the end of the buffer is reached. There is one edge case: a chunk of data can stop in the middle of a line. Therefore, we have to prepend the leftover bytes of the previous chunk to the next chunk.

async function process(record) {
  let chunk = Buffer.alloc(0); // leftover bytes from the previous chunk
  const transform = (currentChunk, encoding, callback) => {
    chunk = Buffer.concat([chunk, currentChunk]);
    const lines = [];
    while (chunk.length > 0) {
      const i = chunk.indexOf('\n', 'utf8');
      if (i === -1) { // no complete line in the buffer, wait for the next chunk
        break;
      } else {
        lines.push(chunk.slice(0, i).toString('utf8'));
        chunk = chunk.slice(i+1);
      }
    }
    lines.push(''); // restores the trailing line break when joining
    const transformed = lines
      .map(transformLine)
      .join('\n');
    callback(null, Buffer.from(transformed, 'utf8'));
  };
  const params = {
    Bucket: record.s3.bucket.name,
    Key: record.s3.object.key
  };
  if ('versionId' in record.s3.object) {
    params.VersionId = record.s3.object.versionId;
  }
  const body = s3.getObject(params).createReadStream() // 1. download
    .pipe(zlib.createGunzip())                         // 2. decompress
    .pipe(new stream.Transform({transform}))           // 3. + 4. parse and anonymize
    .pipe(zlib.createGzip());                          // 5. compress
  await s3.upload({                                    // 6. upload the anonymized data
    Bucket: record.s3.bucket.name,
    Key: record.s3.object.key.slice(0, -2) + 'anonymized.gz',
    Body: body
  }).promise();
  if (chunk.length > 0) {
    throw new Error('file was not read completely');
  }
  return s3.deleteObject(params).promise();            // 7. remove the original data
}

Finally, we have to implement the thin handler interface that Lambda requires. I also ensure that anonymized files are not processed again to avoid an expensive infinite loop.

exports.handler = async (event) => {
  console.log(JSON.stringify(event));
  for (const record of event.Records) {
    if (record.s3.object.key.endsWith('.anonymized.gz')) { // already anonymized: skip to avoid an infinite loop
      continue;
    } else if (record.s3.object.key.endsWith('.gz')) {
      await process(record);
    }
  }
};
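
The Lambda function is triggered by S3 event notifications whenever a new object is created in the log bucket. A minimal CloudFormation sketch of the wiring could look like this (resource names are placeholders; the complete template is linked below):

AnonymizerPermission:
  Type: 'AWS::Lambda::Permission'
  Properties:
    Action: 'lambda:InvokeFunction'
    FunctionName: !Ref AnonymizerFunction # the Lambda function implemented above
    Principal: 's3.amazonaws.com'
    SourceAccount: !Ref 'AWS::AccountId'
LogBucket:
  Type: 'AWS::S3::Bucket'
  DependsOn: AnonymizerPermission
  Properties:
    NotificationConfiguration:
      LambdaConfigurations:
      - Event: 's3:ObjectCreated:*' # fires for every new access log file
        Function: !GetAtt 'AnonymizerFunction.Arn'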

I integrated the workaround into our collection of aws-cf-templates. Check out the documentation or the code on GitHub. A similar approach can be used to anonymize access logs from ELB load balancers (ALB, CLB, NLB).

PS: You should also enable S3 lifecycle rules to delete access logs after 38 months.
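
A lifecycle rule could be added to the bucket definition from above. A sketch, assuming 38 months are approximated as 1156 days (lifecycle expiration is specified in days):

LogBucket:
  Type: 'AWS::S3::Bucket'
  Properties:
    LifecycleConfiguration:
      Rules:
      - Id: DeleteAccessLogs
        Status: Enabled
        ExpirationInDays: 1156 # roughly 38 months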

Thanks to Thorsten Höger for reviewing this article.


  1. German source: https://www.bsi.bund.de/SharedDocs/Downloads/DE/BSI/Veranstaltungen/ITSiKongress/14ter/Vortraege-19-05-2015/Heidrich_Wegener.pdf?__blob=publicationFile
  2. One official recommendation I found suggests dropping at least the last 88 bits of an IPv6 address (German source: https://www.datenschutz-bayern.de/dsbk-ent/DSK_84-IPv6.html)