Anonymize CloudFront Access Logs

Michael Wittig – 22 Apr 2020

Amazon CloudFront can upload access log files to an S3 bucket. By default, CloudFront logs the IP address of the client. Optionally, cookies can be logged as well. If EU citizens access your CloudFront distribution, you have to process personally identifiable information (PII) in a way that complies with the General Data Protection Regulation (GDPR). IP addresses are considered PII, and cookie data could contain PII as well. If you want to process and store PII, you need a legal basis as defined by the GDPR.

Disclaimer: I’m not a lawyer! This is not legal advice.

Access logs are required for operations, for example, to debug issues. For that purpose, it is okay to keep the access logs for seven days [1]. But you might need access logs for capacity planning as well. How can you keep the access logs for more than seven days without violating the GDPR?

Anonymize Data

The question is: do you really need the IP address in your access logs? The answer is likely no. Unfortunately, CloudFront does not allow us to disable IP address logging. Therefore, we have to implement a workaround that anonymizes the access logs as soon as they are available on S3. The workaround works like this:

[Architecture diagram: Anonymize CloudFront Access Logs]

The diagram was created with Cloudcraft.

We can use a mechanism similar to the one implemented by Google Analytics. An IPv4 address like 91.45.135.67 is turned into 91.45.135.0: the last 8 bits are zeroed out, and the first 24 bits are kept. IPv6 addresses need a different logic: Google removes the last 80 bits. I will go one step further and remove the last 96 bits, keeping only the first 32 bits [2]. For example, 2a02:8109:8ac0:23b0:bd27:40d4:cd2f:1a34 becomes 2a02:8109::.

The following steps are needed to anonymize an access log file:

  1. Download the object from S3
  2. Decompress the gzip data
  3. Parse the data (tab-separated values, log file format)
  4. Replace the IP addresses with anonymized values
  5. Compress the data with gzip
  6. Upload the anonymized data to S3
  7. Remove the original data from S3

There is no documented maximum size of an access log file, so we should be prepared for files that are larger than the available memory. Luckily, Lambda functions support Node.js, which has superb support for dealing with streaming data. If we stream the data, we never load the whole file into memory at once.
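
As a minimal sketch of the idea (not the final function; the file names are made up), a Node.js stream pipeline chains decompression, transformation, and compression, so only one chunk lives in memory at a time:

const fs = require('fs');
const zlib = require('zlib');
const stream = require('stream');

fs.createReadStream('access-log.gz')          // hypothetical input file
  .pipe(zlib.createGunzip())                  // decompress chunk by chunk
  .pipe(new stream.Transform({
    transform(chunk, encoding, callback) {
      callback(null, chunk);                  // the real transform will rewrite IP addresses here
    }
  }))
  .pipe(zlib.createGzip())                    // compress chunk by chunk
  .pipe(fs.createWriteStream('access-log.anonymized.gz'));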

First, we load some core Node.js dependencies and the AWS SDK:

const fs = require('fs');
const zlib = require('zlib');
const stream = require('stream');
const AWS = require('aws-sdk');
const s3 = new AWS.S3({apiVersion: '2006-03-01'});

It’s time to implement the anonymization:

function anonymizeIPv4Address(str) {
  const s = str.split('.');
  s[3] = '0'; // zero out the last octet (the last 8 bits)
  return s.join('.');
}

function anonymizeIPv6Address(str) {
  const s = str.split(':').slice(0, 2); // keep the first two groups (32 bits)
  s.push(':');
  return s.join(':'); // e.g., 2a02:8109::
}

function anonymizeIpAddress(str) {
  if (str === '-' || str === 'unknown') { // CloudFront placeholder values
    return str;
  }
  if (str.includes('.')) {
    return anonymizeIPv4Address(str);
  } else if (str.includes(':')) {
    return anonymizeIPv6Address(str);
  } else {
    throw new Error('Neither IPv4 nor IPv6: ' + str);
  }
}
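
A quick sanity check of these helpers (the addresses are made-up examples):

anonymizeIpAddress('91.45.135.67');                            // => '91.45.135.0'
anonymizeIpAddress('2a02:8109:8ac0:23b0:bd27:40d4:cd2f:1a34'); // => '2a02:8109::'
anonymizeIpAddress('-');                                       // => '-'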

We also have to deal with the TSV (tab-separated values) log file format. Two fields contain IP addresses: c-ip (position 5) and x-forwarded-for (position 20):

function transformLine(line) {
  if (line.startsWith('#') || line.trim() === '') { // keep comments and empty lines as they are
    return line;
  }
  const values = line.split('\t');
  values[4] = anonymizeIpAddress(values[4]);   // c-ip
  values[19] = anonymizeIpAddress(values[19]); // x-forwarded-for
  return values.join('\t');
}

So far, we only process small amounts of data: a single line of an access log file. It’s time to deal with the whole file.

Each chunk of data is represented as a buffer in Node.js. A buffer holds binary data in the form of a sequence of bytes. In the buffer, we search for the line-end byte \n, slice all bytes from the beginning up to the \n, and convert them into a string to extract a line. We continue with this approach until the end of the buffer is reached. There is one edge case: a chunk of data can stop in the middle of a line. Therefore, we have to prepend the leftover bytes of the previous chunk to the next chunk.

async function process(record) {
  let chunk = Buffer.alloc(0); // leftover bytes from the previous chunk
  const transform = (currentChunk, encoding, callback) => {
    chunk = Buffer.concat([chunk, currentChunk]);
    const lines = [];
    while (chunk.length > 0) {
      const i = chunk.indexOf('\n', 'utf8');
      if (i === -1) { // no complete line in the buffer, wait for the next chunk
        break;
      } else {
        lines.push(chunk.slice(0, i).toString('utf8'));
        chunk = chunk.slice(i+1);
      }
    }
    lines.push(''); // restores the trailing line break when joining
    const transformed = lines
      .map(transformLine)
      .join('\n');
    callback(null, Buffer.from(transformed, 'utf8'));
  };
  const params = {
    Bucket: record.s3.bucket.name,
    Key: record.s3.object.key
  };
  if ('versionId' in record.s3.object) {
    params.VersionId = record.s3.object.versionId;
  }
  const body = s3.getObject(params).createReadStream() // 1. download
    .pipe(zlib.createGunzip())                         // 2. decompress
    .pipe(new stream.Transform({transform}))           // 3. + 4. parse and anonymize
    .pipe(zlib.createGzip());                          // 5. compress
  await s3.upload({                                    // 6. upload the anonymized data
    Bucket: record.s3.bucket.name,
    Key: record.s3.object.key.slice(0, -2) + 'anonymized.gz',
    Body: body
  }).promise();
  if (chunk.length > 0) {
    throw new Error('file was not read completely');
  }
  return s3.deleteObject(params).promise();            // 7. remove the original data
}

Finally, we have to implement the thin handler interface that Lambda requires. I also ensure that anonymized files are not processed again to avoid an expensive infinite loop.

exports.handler = async (event) => {
  console.log(JSON.stringify(event));
  for (const record of event.Records) {
    if (record.s3.object.key.endsWith('.anonymized.gz')) { // already anonymized: skip to avoid an infinite loop
      continue;
    } else if (record.s3.object.key.endsWith('.gz')) {
      await process(record);
    }
  }
};
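
The Lambda function is triggered by S3 event notifications whenever a new object is created in the log bucket. A minimal CloudFormation sketch of the wiring could look like this (resource names are placeholders; the complete template is linked below):

AnonymizerPermission:
  Type: 'AWS::Lambda::Permission'
  Properties:
    Action: 'lambda:InvokeFunction'
    FunctionName: !Ref AnonymizerFunction # the Lambda function implemented above
    Principal: 's3.amazonaws.com'
    SourceAccount: !Ref 'AWS::AccountId'
LogBucket:
  Type: 'AWS::S3::Bucket'
  DependsOn: AnonymizerPermission
  Properties:
    NotificationConfiguration:
      LambdaConfigurations:
      - Event: 's3:ObjectCreated:*' # fires for every new access log file
        Function: !GetAtt 'AnonymizerFunction.Arn'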

I integrated the workaround into our collection of aws-cf-templates. Check out the documentation or the code on GitHub. A similar approach can be used to anonymize access logs from ELB load balancers (ALB, CLB, NLB).

PS: You should also enable S3 lifecycle rules to delete access logs after 38 months.
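
A lifecycle rule could be added to the bucket definition from above. A sketch, assuming 38 months are approximated as 1156 days (lifecycle expiration is specified in days):

LogBucket:
  Type: 'AWS::S3::Bucket'
  Properties:
    LifecycleConfiguration:
      Rules:
      - Id: DeleteAccessLogs
        Status: Enabled
        ExpirationInDays: 1156 # roughly 38 months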

Thanks to Thorsten Höger for reviewing this article.


  1. German source: https://www.bsi.bund.de/SharedDocs/Downloads/DE/BSI/Veranstaltungen/ITSiKongress/14ter/Vortraege-19-05-2015/Heidrich_Wegener.pdf?__blob=publicationFile
  2. One official recommendation I found suggests dropping at least the last 88 bits of an IPv6 address (German source: https://www.datenschutz-bayern.de/dsbk-ent/DSK_84-IPv6.html)