
Data pipelines

Both AI/ML and traditional data analytics need clean and accessible data in a format that's usable by analytics tools and AI algorithms.

Processing your data

AWS Pipeline Analytics ETL Services:

1. Data Ingestion Services

Service Name	Key Attributes	Use Cases
Amazon Kinesis Data Streams	- Serverless, real-time data streaming. - Massively scalable for terabytes of data. - Automatic provisioning and scaling in on-demand capacity mode.	- Real-time analytics (e.g., clickstreams). - Log and event data collection at scale. - IoT data ingestion from sensors.
Amazon Data Firehose	- Fully managed, near real-time data loading. - Built-in data transformation (ETL). - Delivers directly to S3, Redshift, and OpenSearch.	- Streaming ETL pipelines. - Simple delivery of logs to analytics tools. - Ingesting IoT data directly into a data lake.
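
As a rough illustration of the clickstream ingestion use case, the sketch below groups events into batches sized for the Kinesis Data Streams PutRecords API, which accepts at most 500 records per request. The event shape, the `user_id` field, and the helper name `to_kinesis_batches` are assumptions made for this example; nothing is sent to AWS.

```python
import json
from typing import Iterable, Iterator

MAX_RECORDS_PER_CALL = 500  # PutRecords accepts at most 500 records per request


def to_kinesis_batches(events: Iterable[dict], key_field: str) -> Iterator[list]:
    """Yield lists of PutRecords entries, each list holding at most 500 entries."""
    batch = []
    for event in events:
        batch.append({
            "Data": json.dumps(event).encode("utf-8"),  # payload as bytes
            "PartitionKey": str(event[key_field]),      # decides the target shard
        })
        if len(batch) == MAX_RECORDS_PER_CALL:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch


# Hypothetical clickstream events; "user_id" is an assumed field name.
clicks = [{"user_id": i % 7, "page": f"/item/{i}"} for i in range(1200)]
batches = list(to_kinesis_batches(clicks, key_field="user_id"))
print([len(b) for b in batches])  # [500, 500, 200]
```

With boto3 installed and credentials configured, each batch could then be passed to `boto3.client("kinesis").put_records(StreamName=..., Records=batch)`, where the stream name is whatever stream you have created. The response's `FailedRecordCount` is worth checking, since individual records in a batch can fail.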

2. Data Storage Services

Service Name	Key Attributes	Use Cases
Amazon S3	- Highly scalable, durable object storage. - The foundation for data lakes. - Stores any type of data (structured/unstructured).	- Central data lake for raw data. - Archiving and backup. - Source/destination for analytics and ML services.
Amazon Redshift	- Fully managed, petabyte-scale data warehouse. - High-performance with columnar storage. - Optimized for complex SQL queries.	- Business intelligence (BI) and reporting. - High-performance analytical workloads. - Storing structured, transformed data.
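
One common way to organize raw data in an S3 data lake is a Hive-style partition layout (`year=/month=/day=`), which Glue crawlers and Athena can recognize as partitions. The sketch below builds such an object key. The `raw/` prefix, the dataset name, and the helper name `lake_key` are assumptions for illustration, not a required convention.

```python
from datetime import datetime, timezone


def lake_key(dataset: str, event_time: datetime, filename: str) -> str:
    """Build an object key with Hive-style date partitions (year=/month=/day=)."""
    t = event_time.astimezone(timezone.utc)  # partition by UTC date
    return (
        f"raw/{dataset}/"
        f"year={t.year:04d}/month={t.month:02d}/day={t.day:02d}/"
        f"{filename}"
    )


key = lake_key(
    "clickstream",
    datetime(2024, 3, 9, 14, 30, tzinfo=timezone.utc),
    "part-0001.json",
)
print(key)  # raw/clickstream/year=2024/month=03/day=09/part-0001.json
```

Partitioning by date keeps related objects under a shared prefix, so query engines that understand the layout can skip partitions a query does not need.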

3. Data Cataloging Services

Service Name	Key Attributes	Use Cases
AWS Glue Data Catalog	- Centralized, managed metadata repository. - Automatic schema discovery with crawlers. - Integrates with Athena, EMR, and Redshift.	- Defining schemas for data in S3. - Enabling data discovery for a data lake. - Providing a unified data view for analytics.
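
To make "defining schemas for data in S3" more concrete, here is a minimal sketch that assembles a table definition in the shape the Glue `CreateTable` API expects for its `TableInput` parameter, assuming the data is stored as Parquet. The table name, bucket path, columns, and the helper name `parquet_table_input` are hypothetical; a crawler would normally discover this metadata automatically.

```python
def parquet_table_input(name: str, location: str, columns: dict, partitions: dict) -> dict:
    """Build a TableInput dict describing a Parquet dataset stored in S3."""

    def as_columns(spec: dict) -> list:
        # Glue expects each column as {"Name": ..., "Type": ...}
        return [{"Name": col, "Type": typ} for col, typ in spec.items()]

    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "parquet"},
        "PartitionKeys": as_columns(partitions),
        "StorageDescriptor": {
            "Columns": as_columns(columns),
            "Location": location,
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    }


# Hypothetical table; the bucket and column names are placeholders.
table_input = parquet_table_input(
    name="clickstream",
    location="s3://example-bucket/curated/clickstream/",
    columns={"user_id": "bigint", "page": "string"},
    partitions={"year": "int", "month": "int"},
)
print(table_input["StorageDescriptor"]["Columns"])
```

With boto3, a dict like this could be passed as `boto3.client("glue").create_table(DatabaseName=..., TableInput=table_input)`, after which services that read the catalog can refer to the table by name.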

4. Data Processing Services

Service Name	Key Attributes	Use Cases
AWS Glue	- Serverless, fully managed ETL service. - Automated schema discovery and code generation. - Pay only for the resources consumed while jobs run.	- Transforming raw data into structured formats. - Cleaning, enriching, and validating data. - Automating data preparation workflows.
Amazon EMR	- Managed big data platform for Spark, Hadoop, etc. - Handles infrastructure provisioning and scaling. - Cost-effective with Spot Instance integration.	- Large-scale, petabyte-level data processing. - Machine learning and ETL with big data frameworks. - Genomic and scientific data analysis.
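
The "cleaning, enriching, and validating" step can be pictured with a small record-level transform. The sketch below is plain Python standing in for logic that would typically run as a Spark job on Glue or EMR; it does not use either service's API. The field names and the 100.0 threshold are assumptions for illustration.

```python
from typing import Optional


def transform(raw: dict) -> Optional[dict]:
    """Return a cleaned, enriched record, or None if validation fails."""
    # Clean: trim whitespace and normalize case.
    email = str(raw.get("email", "")).strip().lower()
    country = str(raw.get("country", "")).strip().upper()

    # Validate: drop records with a malformed email or a non-numeric amount.
    if "@" not in email:
        return None
    try:
        amount = float(raw.get("amount", ""))
    except (TypeError, ValueError):
        return None

    # Enrich: derive a field that downstream queries can filter on.
    return {
        "email": email,
        "country": country or "UNKNOWN",
        "amount": round(amount, 2),
        "is_large_order": amount >= 100.0,
    }


raw_rows = [
    {"email": "  Ana@Example.com ", "country": "br", "amount": "120.50"},
    {"email": "not-an-email", "country": "US", "amount": "10"},
    {"email": "li@example.com", "country": "", "amount": "abc"},
]
clean_rows = [r for r in map(transform, raw_rows) if r is not None]
print(clean_rows)
```

Only the first row survives here: the second fails the email check and the third has a non-numeric amount. Returning `None` for rejected records keeps the transform a pure function, so it is easy to test before it is moved into a distributed job.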

5. Data Analysis and Visualization Services

Service Name	Key Attributes	Use Cases
Amazon Athena	- Serverless, interactive query service. - Uses standard SQL to query data in place (in S3). - Pay-per-query cost model.	- Ad-hoc data discovery on data lakes. - Quickly querying log files without loading them. - Serverless BI and reporting.
Amazon Redshift	- Fully managed, petabyte-scale data warehouse. - High-performance with columnar storage. - Optimized for complex SQL queries.	- Business intelligence (BI) and reporting. - High-performance analytical workloads. - Storing structured, transformed data.
Amazon QuickSight	- Serverless, cloud-native BI service. - Interactive dashboards and reports. - Natural language querying with Amazon Q.	- Creating executive and operational dashboards. - Data visualization for business users. - Embedding analytics into applications.
Amazon OpenSearch Service	- Managed service for OpenSearch clusters. - Real-time log analytics and application monitoring. - Full-text search capabilities.	- Interactive log analytics and troubleshooting. - Powering search functionality for applications. - Real-time monitoring dashboards.
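
To illustrate querying data in place with Athena, the sketch below builds the keyword arguments for Athena's `StartQueryExecution` call. The database, table, the integer partition columns `year` and `month`, the results bucket, and the helper name `athena_request` are all hypothetical. The table name is interpolated directly into the SQL, so it should only come from trusted input.

```python
def athena_request(database: str, table: str, year: int, month: int, output_s3: str) -> dict:
    """Build keyword arguments for Athena's StartQueryExecution call."""
    # The :d format spec rejects non-integer values for year and month.
    query = (
        f"SELECT page, COUNT(*) AS views "
        f"FROM {table} "
        f"WHERE year = {year:d} AND month = {month:d} "
        f"GROUP BY page ORDER BY views DESC LIMIT 10"
    )
    return {
        "QueryString": query,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_s3},
    }


# Hypothetical names; replace with your own database, table, and bucket.
request = athena_request(
    database="analytics",
    table="clickstream",
    year=2024,
    month=3,
    output_s3="s3://example-bucket/athena-results/",
)
print(request["QueryString"])
```

With boto3, the request could be submitted as `boto3.client("athena").start_query_execution(**request)`. Filtering on partition columns limits how much data Athena scans, which matters under a pay-per-query cost model.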
