An Overview of AWS Data Pipeline

Posted By : Yasir Zuberi | 31-Dec-2015

AWS

AWS Data Pipeline is a web service for moving and transforming data between AWS services. It helps you manage and streamline data-driven workflows. You can use AWS Data Pipeline to manipulate data and make it available to data analysis tools.

AWS provides four ways to access Data Pipeline. You may use any of the following interfaces to create, access and manage your pipelines:

AWS Management Console
AWS Command Line Interface (AWS CLI)
AWS SDKs
Query API
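
As a rough illustration of the SDK route, here is a minimal sketch in Python using boto3. It assumes boto3 is installed and AWS credentials are already configured. The function names, the region default and the sample object IDs (MyEc2, MyCopy) are hypothetical and chosen only for this example. The helper converts objects written in the pipeline definition file style into the key/value field list that the PutPipelineDefinition call expects.

```python
def to_api_objects(definition_objects):
    """Convert definition-file style objects into the pipelineObjects
    structure (id, name, fields) used by put_pipeline_definition."""
    api_objects = []
    for obj in definition_objects:
        fields = []
        for key, value in obj.items():
            if key in ("id", "name"):
                continue  # id and name are top-level attributes, not fields
            if isinstance(value, dict) and "ref" in value:
                # A reference to another pipeline object
                fields.append({"key": key, "refValue": value["ref"]})
            else:
                fields.append({"key": key, "stringValue": str(value)})
        api_objects.append({
            "id": obj["id"],
            "name": obj.get("name", obj["id"]),
            "fields": fields,
        })
    return api_objects


def create_and_activate(name, unique_id, definition_objects, region="us-east-1"):
    """Create, define and activate a pipeline (makes real AWS API calls)."""
    import boto3  # imported here so to_api_objects works without boto3

    client = boto3.client("datapipeline", region_name=region)
    pipeline_id = client.create_pipeline(name=name, uniqueId=unique_id)["pipelineId"]
    result = client.put_pipeline_definition(
        pipelineId=pipeline_id,
        pipelineObjects=to_api_objects(definition_objects),
    )
    if result["errored"]:
        raise RuntimeError(result["validationErrors"])
    client.activate_pipeline(pipelineId=pipeline_id)
    return pipeline_id


# Hypothetical sample objects, used only to show the conversion
sample_objects = [
    {"id": "MyEc2", "name": "MyEc2", "type": "Ec2Resource", "terminateAfter": "1 Hour"},
    {"id": "MyCopy", "type": "CopyActivity", "runsOn": {"ref": "MyEc2"}},
]
```

Calling to_api_objects(sample_objects) does not contact AWS, so you may use it to inspect the request payload before calling create_and_activate.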

Below is the list of pipeline object categories:

Data Nodes
Activities
Resources
Preconditions
Databases
Data Formats
Actions
Schedule
Utilities

AWS allows you to create a pipeline using console templates.
For example, "Full Copy of RDS MySQL Table to S3" is a template that copies an entire Amazon RDS MySQL table and stores the output in an Amazon S3 location.
Note: The output is stored as a CSV file under the specified Amazon S3 location.

The template uses the following pipeline objects:

CopyActivity: copies data from one location to another.
Ec2Resource: an EC2 instance that performs the work defined by a pipeline activity.
MySqlDataNode: a data node that uses MySQL.
S3DataNode: a data node that uses Amazon S3.

The pipeline created from this template has two data nodes, one for input (SourceRDSTable) and one for output (DestinationS3Location), and one copy activity (RDSToS3CopyActivity).
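
To make the relationship between these objects concrete, here is a minimal sketch in Python that builds a pipeline definition resembling the template. The names SourceRDSTable, DestinationS3Location and RDSToS3CopyActivity come from the template. The function name, the Ec2Instance ID, the terminateAfter value and all connection values are assumptions for illustration. The field names follow the pipeline definition syntax as I understand it, so please check them against the object reference before use. A real definition also needs a schedule and IAM roles, which this sketch leaves out.

```python
import json


def build_rds_to_s3_definition(connection_string, table, username, password, s3_directory):
    """Return a pipeline definition (as a dict) that copies a MySQL table to S3."""
    return {
        "objects": [
            {
                # EC2 instance that runs the copy (ID is hypothetical)
                "id": "Ec2Instance",
                "name": "Ec2Instance",
                "type": "Ec2Resource",
                "terminateAfter": "2 Hours",
            },
            {
                # Input data node: the RDS MySQL table
                "id": "SourceRDSTable",
                "name": "SourceRDSTable",
                "type": "MySqlDataNode",
                "connectionString": connection_string,
                "table": table,
                "username": username,
                "*password": password,
                "selectQuery": "select * from #{table}",
            },
            {
                # Output data node: the S3 location that receives the CSV
                "id": "DestinationS3Location",
                "name": "DestinationS3Location",
                "type": "S3DataNode",
                "directoryPath": s3_directory,
            },
            {
                # Activity that links input, output and the resource
                "id": "RDSToS3CopyActivity",
                "name": "RDSToS3CopyActivity",
                "type": "CopyActivity",
                "input": {"ref": "SourceRDSTable"},
                "output": {"ref": "DestinationS3Location"},
                "runsOn": {"ref": "Ec2Instance"},
            },
        ]
    }


# Placeholder values, not real endpoints or credentials
definition = build_rds_to_s3_definition(
    connection_string="jdbc:mysql://example-host:3306/exampledb",
    table="example_table",
    username="example_user",
    password="example_password",
    s3_directory="s3://example-bucket/output/",
)
definition_json = json.dumps(definition, indent=2)
```

The copy activity does not hold any data itself. It only refers to the two data nodes and to the resource by ID, which is why each object can be defined once and reused.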

For more information on AWS Data Pipeline, you may refer to the AWS Data Pipeline documentation.

Hope this helps.

Thanks