AWS ML Specialty Exam — Data Storage
This blog is part of a series giving a high-level overview of Machine Learning theory and the AWS services examined on the AWS Machine Learning Specialty exam. To view the whole series, click here.
When working with machine learning you will need access to data, and within AWS there are many different ways to store that data. Choosing the right storage method will depend on the type of data you are storing, its volume, and how quickly you need to access it!
AWS S3
S3 is an AWS service for scalable and secure object storage. As we get deeper into ML on AWS, you will see that many services expect S3 to be the source of their data, or at least interact with S3 at some point. Because it is object-based storage, it is suitable for storing files, but not for installing an operating system on. Total storage is unlimited, but individual objects can range from 0 bytes to 5 TB. Lastly, S3 has a universal namespace, meaning all bucket names must be globally unique.
For more information on S3, check out my very detailed S3 blog post here.
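To make that concrete, here is a minimal boto3 sketch of uploading and retrieving an object; the bucket name and file paths are placeholders you would swap for your own:

```python
import boto3

# Bucket names live in a universal namespace, so this name is a
# placeholder you would replace with a globally unique one of your own.
BUCKET = "my-ml-training-data-example"

s3 = boto3.client("s3")

# Upload a local training file as an object (objects can be 0 bytes to 5 TB).
s3.upload_file("train.csv", BUCKET, "datasets/train.csv")

# Download it back, e.g. onto a training instance.
s3.download_file(BUCKET, "datasets/train.csv", "/tmp/train.csv")
```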
Data Warehouse
A Data Warehouse collects data from many different sources in many different formats and joins it together in one central place. The data is typically pre-processed before it is stored (processing is done on import, i.e. schema on write). It is typically used with business intelligence tools by business analysts, data scientists and data engineers.
Redshift → AWS's fully managed, petabyte-scale data warehouse solution. It can work with structured or semi-structured data.
Redshift Spectrum → a powerful tool that allows you to efficiently retrieve and query data stored in Amazon S3 without loading it into Amazon Redshift tables. It leverages parallel processing to run fast queries on large datasets, with most of the processing happening in the Redshift Spectrum layer while the data stays in Amazon S3. This enables data analysts to perform fast, complex analysis on objects stored in the AWS cloud.
To use Redshift Spectrum, you need a Redshift cluster and a connected SQL client. Multiple Redshift clusters can query the same S3 dataset concurrently. In short, Redshift Spectrum bridges Amazon Redshift and Amazon S3, providing a high-performance, cost-effective way to analyse data in place.
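As a rough illustration, the sketch below uses the boto3 Redshift Data API to run a Spectrum-style query against an external table backed by S3. The cluster identifier, database, user and the spectrum_schema.clickstream table are all assumed names:

```python
import boto3

client = boto3.client("redshift-data")

# Query an external table that maps onto objects in S3. The cluster,
# database, user and table names below are all placeholders.
response = client.execute_statement(
    ClusterIdentifier="my-redshift-cluster",
    Database="dev",
    DbUser="analyst",
    Sql="SELECT category, COUNT(*) FROM spectrum_schema.clickstream GROUP BY category;",
)

# The statement runs asynchronously; this id is used to fetch results later.
print(response["Id"])
```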
Data Lakes
Data Lakes → A centralised repository that allows you to store massive amounts of unstructured data. The data typically has no pre-processing before storing (processing is done on export, i.e. schema on read). It can be used for real-time analytics, dashboard visualisations and machine learning.
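The schema-on-read idea is easy to demonstrate: the sketch below (with a hypothetical bucket and key) pulls raw JSON events straight out of a lake on S3 and only interprets their structure at read time:

```python
import json
import boto3

s3 = boto3.client("s3")

# Raw events were dumped into the lake as-is (JSON lines); the schema is
# only applied now, at read time. Bucket and key are placeholders.
obj = s3.get_object(Bucket="my-data-lake", Key="raw/events/2024-01-01.json")
for line in obj["Body"].read().decode("utf-8").splitlines():
    event = json.loads(line)  # interpret the structure on read
    print(event.get("user_id"), event.get("event_type"))
```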
Relational Database Service (RDS)
RDS is an AWS managed service for operating and scaling relational databases. AWS handles admin tasks for you like hardware provisioning, patching and backups. The engines supported by RDS (a short provisioning sketch follows the list) are:
- Aurora
- PostgreSQL
- MySQL
- MariaDB
- Oracle
- SQL Server
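As promised above, here is a minimal boto3 sketch for provisioning an RDS instance; every identifier and credential is a placeholder:

```python
import boto3

rds = boto3.client("rds")

# Identifiers and credentials are placeholders; AWS handles the underlying
# hardware, patching and backups once the instance exists.
rds.create_db_instance(
    DBInstanceIdentifier="ml-feature-store",
    Engine="postgres",
    DBInstanceClass="db.t3.micro",
    AllocatedStorage=20,  # GiB
    MasterUsername="ml_admin",
    MasterUserPassword="change-me-please",
)
```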
DynamoDB
DynamoDB is a flexible, serverless NoSQL database that allows for the storage of large text and binary data, although individual items are limited to 400 KB. DynamoDB can deliver single-digit millisecond latency at scale.
It is recommended to use DynamoDB when you are working with key-value pairs, when simple queries are needed, or when there is a high read/write or high durability requirement.
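A small boto3 sketch of that key-value pattern, using a hypothetical Predictions table keyed on model_id and customer_id:

```python
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Predictions")  # table name is a placeholder

# Simple key-value style write; the whole item must stay under 400 KB.
# The score is stored as a string to sidestep boto3's Decimal handling.
table.put_item(Item={"model_id": "churn-v2", "customer_id": "123", "score": "0.87"})

# Single-digit millisecond reads by primary key.
response = table.get_item(Key={"model_id": "churn-v2", "customer_id": "123"})
print(response.get("Item"))
```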
Timestream
Timestream is a fully managed, serverless time series database. It can store and analyse trillions of events per day and is ideal for identifying trends and patterns in IoT data. With Timestream, you can define policies to manage the lifecycle of data, keeping recent data in memory and moving historical data to a more cost-optimised storage tier. It also offers built-in time series analytics functions to assist in identifying trends and patterns in data in near real-time.
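As a rough sketch, writing one IoT measurement with boto3 might look like the following; the database, table and dimension names are assumptions:

```python
import time
import boto3

write_client = boto3.client("timestream-write")

# Database, table and dimension values are placeholders for an IoT setup.
write_client.write_records(
    DatabaseName="iot_db",
    TableName="sensor_readings",
    Records=[{
        "Dimensions": [{"Name": "device_id", "Value": "sensor-42"}],
        "MeasureName": "temperature",
        "MeasureValue": "21.5",
        "MeasureValueType": "DOUBLE",
        "Time": str(int(time.time() * 1000)),  # milliseconds since epoch
    }],
)
```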
DocumentDB
DocumentDB is a fully managed JSON database that has MongoDB compatibility. If you are migrating MongoDB data to AWS, this is the service to use.
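Because DocumentDB is MongoDB-compatible, you can connect with an ordinary MongoDB driver such as pymongo. The endpoint, credentials and CA bundle path below are placeholders:

```python
from pymongo import MongoClient

# The endpoint and credentials are placeholders; DocumentDB clusters use
# TLS, so you typically pass the Amazon-provided CA bundle as well.
client = MongoClient(
    "mongodb://ml_user:change-me@my-docdb-cluster.us-east-1.docdb.amazonaws.com:27017",
    tls=True,
    tlsCAFile="global-bundle.pem",
)

collection = client["ml_app"]["experiments"]
collection.insert_one({"experiment": "lr-baseline", "auc": 0.91})
print(collection.find_one({"experiment": "lr-baseline"}))
```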
Database Migration Tools
Data Pipeline
Data Pipeline allows you to automate the processing and movement of data between compute and storage services. It can also be used to transfer data from on-premises systems to AWS. Examples of transfers (a minimal sketch follows the list):
- DynamoDB → Data Pipeline → S3
- RDS → Data Pipeline → S3
- Redshift → Data Pipeline → S3
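Here is a minimal boto3 sketch of registering and activating a pipeline. A real definition would also declare source, destination and schedule objects; the names and field values here are assumptions:

```python
import boto3

dp = boto3.client("datapipeline")

# Register a pipeline shell; the definition is supplied separately as
# pipeline objects. Names and IDs are placeholders.
pipeline = dp.create_pipeline(name="dynamodb-to-s3", uniqueId="ddb-export-001")
pipeline_id = pipeline["pipelineId"]

# This object only sets the default schedule type to keep the sketch short;
# a full definition would add the DynamoDB source and S3 destination nodes.
dp.put_pipeline_definition(
    pipelineId=pipeline_id,
    pipelineObjects=[{
        "id": "Default",
        "name": "Default",
        "fields": [{"key": "scheduleType", "stringValue": "ondemand"}],
    }],
)

dp.activate_pipeline(pipelineId=pipeline_id)
```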
DMS (Database Migration Service)
DMS is used for transferring data between two different relational databases, but it can also output results to S3. It supports both homogeneous and heterogeneous migrations.
e.g. MySQL → MySQL = homogeneous
e.g. MySQL → S3 = heterogeneous
DMS can transfer data, but it does not really support transformations beyond basic changes such as renaming columns.
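To round things off, here is a hedged boto3 sketch of creating a DMS replication task. The endpoint and replication instance ARNs are placeholders and assumed to already exist:

```python
import json
import boto3

dms = boto3.client("dms")

# Table mappings select which schemas/tables to move; "%" matches all.
table_mappings = {
    "rules": [{
        "rule-type": "selection",
        "rule-id": "1",
        "rule-name": "include-all",
        "object-locator": {"schema-name": "%", "table-name": "%"},
        "rule-action": "include",
    }]
}

# All ARNs below are placeholders for pre-existing DMS resources.
dms.create_replication_task(
    ReplicationTaskIdentifier="mysql-to-s3-task",
    SourceEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:source",
    TargetEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:target",
    ReplicationInstanceArn="arn:aws:dms:us-east-1:123456789012:rep:instance",
    MigrationType="full-load",
    TableMappings=json.dumps(table_mappings),
)
```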