
Data pipelines

Both AI/ML and traditional data analytics need clean and accessible data in a format that's usable by analytics tools and AI algorithms.

Processing your data

AWS Pipeline Analytics and ETL Services:

1. Data Ingestion Services

Amazon Kinesis Data Streams

Key attributes:
- Serverless, real-time data streaming.
- Massively scalable for terabytes of data.
- Automatic provisioning and scaling (in on-demand capacity mode).

Use cases:
- Real-time analytics (e.g., clickstreams).
- Log and event data collection at scale.
- IoT data ingestion from sensors.

Amazon Data Firehose

Key attributes:
- Fully managed, near real-time data loading.
- Built-in data transformation (ETL).
- Delivers directly to Amazon S3, Amazon Redshift, and Amazon OpenSearch Service.

Use cases:
- Streaming ETL pipelines.
- Simple delivery of logs to analytics tools.
- Ingesting IoT data directly into a data lake.
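
As a rough illustration of the clickstream ingestion use case, the sketch below builds the arguments for the Kinesis Data Streams `PutRecord` call. The stream name `clickstream-demo`, the event fields, and the helper names are hypothetical; only `build_kinesis_record` runs without AWS access, while `send` assumes `boto3` is installed and credentials are configured.

```python
import json


def build_kinesis_record(stream_name: str, event: dict, partition_field: str) -> dict:
    """Build keyword arguments for the Kinesis Data Streams PutRecord call."""
    return {
        "StreamName": stream_name,
        # Kinesis expects the payload as bytes.
        "Data": json.dumps(event, sort_keys=True).encode("utf-8"),
        # Records that share a partition key are routed to the same shard.
        "PartitionKey": str(event[partition_field]),
    }


def send(record_kwargs: dict) -> dict:
    """Send one record. Assumes boto3 is installed and AWS credentials exist."""
    import boto3  # imported lazily so the builder above works offline

    return boto3.client("kinesis").put_record(**record_kwargs)


if __name__ == "__main__":
    # Hypothetical clickstream event and stream name, for illustration only.
    click = {"user_id": "u-123", "page": "/pricing"}
    print(build_kinesis_record("clickstream-demo", click, "user_id"))
```

Using the user identifier as the partition key keeps one user's events in order on a single shard; a different key may suit workloads where a few users dominate traffic.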

2. Data Storage Services

Amazon S3

Key attributes:
- Highly scalable, durable object storage.
- The foundation for data lakes.
- Stores any type of data (structured/unstructured).

Use cases:
- Central data lake for raw data.
- Archiving and backup.
- Source/destination for analytics and ML services.

Amazon Redshift

Key attributes:
- Fully managed, petabyte-scale data warehouse.
- High-performance with columnar storage.
- Optimized for complex SQL queries.

Use cases:
- Business intelligence (BI) and reporting.
- High-performance analytical workloads.
- Storing structured, transformed data.
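
To make the "central data lake for raw data" idea more concrete, here is a minimal sketch of how raw objects might be laid out in S3. The Hive-style `year=/month=/day=` prefix is a common convention assumed here, not something the text above prescribes; the `raw/` prefix, dataset name, and helper names are hypothetical. `upload` assumes `boto3` and AWS credentials.

```python
from datetime import datetime, timezone


def raw_object_key(dataset: str, event_time: datetime, filename: str) -> str:
    """Return a date-partitioned object key for a raw data file."""
    t = event_time.astimezone(timezone.utc)  # normalize to UTC before partitioning
    return (
        f"raw/{dataset}/"
        f"year={t:%Y}/month={t:%m}/day={t:%d}/"
        f"{filename}"
    )


def upload(bucket: str, key: str, body: bytes) -> None:
    """Write one object. Assumes boto3 is installed and AWS credentials exist."""
    import boto3  # imported lazily so key generation works offline

    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=body)


if __name__ == "__main__":
    when = datetime(2024, 1, 5, 12, 0, tzinfo=timezone.utc)
    print(raw_object_key("clicks", when, "part-0000.json"))
```

Date-based prefixes like this let query engines skip whole partitions when a query filters on date.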

3. Data Cataloging Services

AWS Glue Data Catalog

Key attributes:
- Centralized, managed metadata repository.
- Automatic schema discovery with crawlers.
- Integrates with Athena, EMR, and Redshift.

Use cases:
- Defining schemas for data in S3.
- Enabling data discovery for a data lake.
- Providing a unified data view for analytics.
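
Schema discovery with a crawler could be set up along these lines. This is a sketch under assumptions: the crawler name, IAM role ARN, database name, and S3 path are placeholders, and `create_and_start` needs `boto3`, credentials, and a role that Glue is allowed to assume.

```python
def build_crawler_request(name: str, role_arn: str, database: str, s3_path: str) -> dict:
    """Build keyword arguments for the AWS Glue CreateCrawler call."""
    return {
        "Name": name,
        "Role": role_arn,          # IAM role the crawler runs as
        "DatabaseName": database,  # Data Catalog database that receives the tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_and_start(request: dict) -> None:
    """Create the crawler and start it. Assumes boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works offline

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


if __name__ == "__main__":
    # All identifiers below are placeholders for illustration.
    print(
        build_crawler_request(
            name="raw-clicks-crawler",
            role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
            database="demo_db",
            s3_path="s3://example-bucket/raw/clicks/",
        )
    )
```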

4. Data Processing Services

AWS Glue

Key attributes:
- Serverless, fully managed ETL service.
- Automated schema discovery and code generation.
- Pay only for the resources consumed while jobs run (billed per DPU-hour).

Use cases:
- Transforming raw data into structured formats.
- Cleaning, enriching, and validating data.
- Automating data preparation workflows.

Amazon EMR

Key attributes:
- Managed big data platform for frameworks such as Apache Spark and Apache Hadoop.
- Handles infrastructure provisioning and scaling.
- Cost-effective with Spot Instance integration.

Use cases:
- Large-scale, petabyte-level data processing.
- Machine learning and ETL with big data frameworks.
- Genomic and scientific data analysis.
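
The cleaning, enriching, and validating steps listed above can be sketched in plain Python. This is not the AWS Glue or Spark API; it is a stand-in for the per-record logic such a job might apply, and the field names (`user_id`, `amount`, `country`) and the large-order threshold are hypothetical.

```python
from typing import Optional


def clean_record(raw: dict) -> Optional[dict]:
    """Return a cleaned, enriched record, or None if validation fails."""
    # Validate: a non-empty user_id is required.
    user_id = str(raw.get("user_id") or "").strip()
    if not user_id:
        return None

    # Validate: amount must parse as a number.
    try:
        amount = float(raw.get("amount"))
    except (TypeError, ValueError):
        return None

    # Clean: normalize the country code, defaulting when it is missing.
    country = str(raw.get("country") or "unknown").strip().upper() or "UNKNOWN"

    return {
        "user_id": user_id,
        "amount": round(amount, 2),
        "country": country,
        # Enrich: derive a flag that downstream reports can filter on.
        "is_large_order": amount >= 100.0,
    }


if __name__ == "__main__":
    rows = [
        {"user_id": " u1 ", "amount": "120.5", "country": "de"},
        {"user_id": "", "amount": "5"},       # dropped: missing user_id
        {"user_id": "u2", "amount": "abc"},   # dropped: amount is not numeric
    ]
    print([r for r in map(clean_record, rows) if r is not None])
```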

5. Data Analysis and Visualization Services

Amazon Athena

Key attributes:
- Serverless, interactive query service.
- Uses standard SQL to query data in place (in S3).
- Pay-per-query pricing based on the amount of data scanned.

Use cases:
- Ad-hoc data discovery on data lakes.
- Quickly querying log files without loading them.
- Serverless BI and reporting.

Amazon Redshift

Key attributes:
- Fully managed, petabyte-scale data warehouse.
- High-performance with columnar storage.
- Optimized for complex SQL queries.

Use cases:
- Business intelligence (BI) and reporting.
- High-performance analytical workloads.
- Storing structured, transformed data.

Amazon QuickSight

Key attributes:
- Serverless, cloud-native BI service.
- Interactive dashboards and reports.
- Natural language querying with Amazon Q.

Use cases:
- Creating executive and operational dashboards.
- Data visualization for business users.
- Embedding analytics into applications.

Amazon OpenSearch Service

Key attributes:
- Managed service for OpenSearch clusters.
- Real-time log analytics and application monitoring.
- Full-text search capabilities.

Use cases:
- Interactive log analytics and troubleshooting.
- Powering search functionality for applications.
- Real-time monitoring dashboards.
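
Querying data in place with Athena might look like the following sketch. The database, table, and results location are placeholders, and the helper names are hypothetical; `run` assumes `boto3`, AWS credentials, and a table already registered in the Data Catalog.

```python
def build_athena_query(database: str, table: str, output_location: str, limit: int = 10) -> dict:
    """Build keyword arguments for the Athena StartQueryExecution call."""
    return {
        # A small ad-hoc query; LIMIT keeps the preview short.
        "QueryString": f'SELECT * FROM "{table}" LIMIT {int(limit)}',
        "QueryExecutionContext": {"Database": database},
        # Athena writes query results to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run(request: dict) -> str:
    """Start the query and return its execution ID. Assumes boto3 and credentials."""
    import boto3  # imported lazily so the builder above works offline

    response = boto3.client("athena").start_query_execution(**request)
    return response["QueryExecutionId"]


if __name__ == "__main__":
    # Placeholder names for illustration only.
    print(build_athena_query("demo_db", "clicks", "s3://example-bucket/athena-results/", limit=5))
```

Because Athena charges by data scanned, filtering on partition columns or selecting only the needed columns is usually cheaper than `SELECT *` on large tables.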