Cloud adaption strategy: event-based data synchronization

Andreas Wittig – 02 May 2018

Are you building an application for the cloud without the slightest dependency on an on-premises infrastructure? Lucky you! Most of us are struggling to unite an outdated on-premises infrastructure with the shiny cloud. I’d like to share a cloud adaption pattern that we have implemented for some of our clients: event-based data synchronization. The pattern focuses on integrating your legacy systems with your new environment without compromising the benefits of the cloud.

Read on, if your answer to one of the following questions is YES!

Are you building a new product or feature which needs to access some data stored on-premises?
Are you re-writing your application and moving to the cloud step by step (also known as Strangler Application)?

There are two party poopers for a successful journey to the cloud:

Outdated and old-fashioned database technology
A VPN or a dedicated network connection between your data center and AWS

Make sure a doorkeeper prevents both of them from crashing your party.

Overview

Before diving into the details, let us start with an overview of the cloud adaption pattern. The following figure illustrates the components of the event-based data synchronization pattern.

On-premises Relational Database: stores parts of the data your application needs to access.
On-premises Change Event Publisher: creates a change event whenever someone creates, updates, or deletes a relevant row within the relational database.
Internet connection: used to transport change events from your corporate data center to AWS.
DynamoDB: the NoSQL database that stores the data relevant to your application.
Lambda: runs your code which accesses the data stored on DynamoDB.
API Gateway: the entry point into your application authorizing and validating incoming requests.

There is only a loose coupling between the old world, your legacy applications and on-premises infrastructure, and the new world, the application you are building on top of AWS. The described approach comes with the following benefits:

High Availability: all parts of your application and infrastructure are distributed among multiple machines in multiple availability zones, which leads to a highly available system. A downtime of the legacy application or the corporate data center does not impact the availability of your application in the cloud.
Scalability: all parts of the cloud infrastructure can scale automatically. Neither the legacy application nor the corporate data center’s infrastructure adds a bottleneck to your architecture.
Pay-per-use pricing: all cloud resources are billed per usage. You will not have to pay for idling resources any longer.

Make sure you are introducing neither outdated and old-fashioned database technology nor a VPN or dedicated network connection between your data center and AWS. Both will undermine the described benefits.

Maybe you are not into a serverless architecture based on Lambda, API Gateway, and DynamoDB yet. Fine, but that is no reason to abandon the event-based data synchronization pattern, as shown in the following figure.

On-premises Relational Database: stores parts of the data your application needs to access.
On-premises Change Event Publisher: creates a change event whenever someone creates, updates, or deletes a relevant row within the relational database.
Internet connection: used to transport change events from your corporate data center to AWS.
Kinesis: scalable and managed stream provided by AWS.
Change Event Subscriber: a small service that subscribes to the Kinesis data stream and transforms each event into an INSERT, UPDATE, or DELETE statement.
RDS Aurora: the MySQL- or PostgreSQL-compatible relational database storing your application’s data.
EC2 instances: your application is running on a fleet of virtual machines.
Load Balancer: the entry point into your infrastructure distributing requests among your fleet of EC2 instances.
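
To make the Change Event Subscriber more tangible, here is a minimal sketch of the translation step. The event shape (table, operation, key, data), the operation codes I, U, and D, and the function name to_statement are assumptions for illustration and not part of the pattern itself. The %s placeholders assume a database driver that uses this parameter style, such as psycopg2 or PyMySQL.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical event shape (an assumption, not defined by the pattern):
# {"table": "users", "operation": "I" | "U" | "D",
#  "key": {"id": 42}, "data": {"name": "Ada"}}
ChangeEvent = Dict[str, Any]


def _ident(name: str) -> str:
    """Reject table or column names that are not plain identifiers."""
    if not name.isidentifier():
        raise ValueError(f"unsafe identifier: {name!r}")
    return name


def to_statement(event: ChangeEvent) -> Tuple[str, List[Any]]:
    """Translate one change event into a parameterized SQL statement."""
    table = _ident(event["table"])
    key = {_ident(col): value for col, value in event["key"].items()}
    data = {_ident(col): value for col, value in event.get("data", {}).items()}
    where = " AND ".join(f"{col} = %s" for col in key)
    operation = event["operation"]

    if operation == "I":
        row = {**key, **data}
        columns = ", ".join(row)
        marks = ", ".join(["%s"] * len(row))
        return f"INSERT INTO {table} ({columns}) VALUES ({marks})", list(row.values())
    if operation == "U":
        if not data:
            raise ValueError("update event without data")
        assignments = ", ".join(f"{col} = %s" for col in data)
        return (f"UPDATE {table} SET {assignments} WHERE {where}",
                [*data.values(), *key.values()])
    if operation == "D":
        return f"DELETE FROM {table} WHERE {where}", list(key.values())
    raise ValueError(f"unknown operation: {operation!r}")
```

Values always travel as parameters and never as part of the SQL string. Only table and column names are interpolated, which is why the sketch rejects anything that is not a plain identifier.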

After explaining the event-based data synchronization in general, I’d like to dive into the details next.

Change Data Capture (CDC)

To be able to synchronize data from your on-premises database to your database in the cloud, you need to know whenever data has been modified.

The change event publisher implements the following process:

Someone changes data in the MSSQL database.
The MSSQL database tracks the change.
The change event publisher polls for changes and fetches the information about changed data.
The change event publisher sends the data to DynamoDB or Kinesis.
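
The process above can be sketched as a single polling cycle. This is a rough outline under stated assumptions: current_version, fetch_changes, and publish are hypothetical callables that you would back with your database driver and the AWS SDK, and each change row is assumed to carry the SYS_CHANGE_VERSION column that MSSQL change tracking returns.

```python
from typing import Any, Callable, Iterable, List, Mapping

Row = Mapping[str, Any]


def poll_once(last_version: int,
              current_version: Callable[[], int],
              fetch_changes: Callable[[int], Iterable[Row]],
              publish: Callable[[List[Row]], None]) -> int:
    """Run one polling cycle and return the version to start from next time.

    All three callables are placeholders (assumptions) for real database
    and AWS calls.
    """
    # Read the current version first. Changes committed while we are
    # querying are then picked up by the next cycle instead of being lost.
    upper = current_version()
    rows = [row for row in fetch_changes(last_version)
            if row["SYS_CHANGE_VERSION"] <= upper]
    if rows:
        # If publish raises, the caller keeps the old version and retries,
        # so events are delivered at least once.
        publish(rows)
    return upper
```

A publisher would call poll_once in a loop, sleep for a few seconds between cycles, and persist the returned version so that a restart continues where the last cycle stopped. Because a failed cycle is retried as a whole, the receiving side should tolerate duplicate events.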

Most traditional relational database management systems (RDBMS) provide the ability to track changes with the help of a change data capture (CDC) or change tracking mechanism. For example, an MSSQL server can answer the question: which rows of the table users have changed? To use change tracking, you first need to activate it for the database.

ALTER DATABASE mydatabase SET CHANGE_TRACKING = ON (CHANGE_RETENTION = 14 DAYS, AUTO_CLEANUP = ON)

You also need to enable change tracking for each table you want to track.

ALTER TABLE [mydatabase].[users] ENABLE CHANGE_TRACKING WITH (TRACK_COLUMNS_UPDATED = ON)

Next, the change event publisher can ask for changed rows with the following query.

SELECT * FROM CHANGETABLE(CHANGES [mydatabase].[users], @last_sync_version) AS CT

Based on the change information provided by the relational database management system (RDBMS), the change event publisher creates an event for each changed row.
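
As an illustration, the following sketch turns one row returned by the change tracking query into an event. The columns SYS_CHANGE_VERSION and SYS_CHANGE_OPERATION (I, U, or D) are what MSSQL change tracking returns next to the primary key columns. The event layout and the names to_change_event and serialize are assumptions made for this example.

```python
import json
from typing import Any, Dict, Mapping

VALID_OPERATIONS = {"I", "U", "D"}  # insert, update, delete


def to_change_event(table: str, row: Mapping[str, Any]) -> Dict[str, Any]:
    """Build a change event (hypothetical layout) from one change row."""
    operation = row["SYS_CHANGE_OPERATION"]
    if operation not in VALID_OPERATIONS:
        raise ValueError(f"unexpected operation: {operation!r}")
    # Everything that is not change tracking metadata is a primary key column.
    key = {col: value for col, value in row.items()
           if not col.startswith("SYS_CHANGE_")}
    return {
        "table": table,
        "operation": operation,
        "version": row["SYS_CHANGE_VERSION"],
        "key": key,
    }


def serialize(event: Dict[str, Any]) -> bytes:
    """Encode an event as UTF-8 JSON, ready to be sent over the wire."""
    return json.dumps(event, sort_keys=True).encode("utf-8")
```

Note that the change table only contains the primary key and change metadata. To ship the current column values for inserts and updates, the publisher would typically join the change rows with the tracked table before building the event.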

Synchronize over the Internet

Establishing a VPN or a dedicated network connection between your data center and AWS adds complexity and should be avoided where possible. To transfer change events from on-premises to AWS, all you need is Internet connectivity for the change event publisher running in your corporate data center. Both DynamoDB and Kinesis are accessible via the Internet. Authentication and authorization are handled on layer 7 (the application layer) with the help of Identity and Access Management (IAM). Of course, data is encrypted in transit (HTTPS).
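
Here is a hedged sketch of how the publisher could hand events to Kinesis over the public endpoint. It assumes events are dictionaries with a table and a key field (a layout chosen for this example) and that client is a Kinesis client from the AWS SDK for Python, created with boto3.client("kinesis"). The put_records call and its 500 records per request limit are part of the Kinesis API. The function names are made up.

```python
import json
from typing import Any, Dict, Iterable, List

MAX_RECORDS_PER_CALL = 500  # upper limit of a single PutRecords request


def to_records(events: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Convert events (assumed layout) into Kinesis record entries."""
    return [
        {
            "Data": json.dumps(event).encode("utf-8"),
            # Same table and key -> same partition key -> same shard.
            "PartitionKey": f'{event["table"]}:{json.dumps(event["key"], sort_keys=True)}',
        }
        for event in events
    ]


def send(client: Any, stream_name: str,
         events: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Send events in batches and return the records Kinesis rejected.

    `client` is expected to be boto3.client("kinesis"), which talks HTTPS
    and signs requests with IAM credentials by default.
    """
    records = to_records(events)
    failed: List[Dict[str, Any]] = []
    for start in range(0, len(records), MAX_RECORDS_PER_CALL):
        batch = records[start:start + MAX_RECORDS_PER_CALL]
        response = client.put_records(StreamName=stream_name, Records=batch)
        if response.get("FailedRecordCount", 0):
            # Results are returned in the same order as the request entries.
            failed.extend(record
                          for record, result in zip(batch, response["Records"])
                          if "ErrorCode" in result)
    return failed
```

The caller should retry the returned records, for example with a short backoff, because a PutRecords request can succeed partially. The IAM user or role used by the publisher only needs permission to write to this one stream.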

Summary

Often when building applications for the cloud, it is necessary to access data that is stored in databases located in the corporate data center. However, it is not advisable to rely on outdated and old-fashioned database technology or on a VPN or dedicated network connection between your data center and AWS. Doing so ruins the main benefits offered by the cloud: high availability, scalability, and pay-per-use pricing. Use loose coupling instead by synchronizing changes from the on-premises database to AWS.

Andreas Wittig

I’ve been building on AWS since 2012 together with my brother Michael. We are sharing our insights into all things AWS on cloudonaut and have written the book AWS in Action. Besides that, we’re currently working on bucketAV, attachmentAV, HyperEnv, and marbot.
