About AWS Services (Amazon S3, AWS Glue, Amazon EMR, Amazon Kinesis, Amazon Athena, AWS Lambda, Amazon CloudWatch, Amazon Redshift)
1. Amazon S3:
Amazon S3 (Simple Storage Service) is an object storage service where the data is stored. Data is kept as objects inside buckets, and an object can be any kind of file: documents, text, images, video, audio. S3 integrates with other AWS services.
e.g.: We can upload and store CSV files in an S3 bucket to be analyzed later.
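The upload step above could look roughly like this in Python. This is a minimal sketch, not a definitive implementation: the helper names `build_key` and `upload_csv`, the `raw` prefix, and the bucket name are assumptions made for illustration. The only AWS call used is the boto3 S3 client method `upload_file(Filename, Bucket, Key)`; the client is passed in so the function can be tried without AWS access.

```python
from pathlib import Path


def build_key(prefix: str, local_path: str) -> str:
    """Join a folder-like prefix and the file name into an S3 object key."""
    return f"{prefix.strip('/')}/{Path(local_path).name}"


def upload_csv(s3_client, local_path: str, bucket: str, prefix: str = "raw") -> str:
    """Upload one local CSV file and return the object key it was stored under."""
    key = build_key(prefix, local_path)
    # upload_file(Filename, Bucket, Key) is the boto3 S3 client call.
    s3_client.upload_file(local_path, bucket, key)
    return key


# Real use (needs boto3 and AWS credentials; the bucket name is hypothetical):
#   import boto3
#   upload_csv(boto3.client("s3"), "sales.csv", "my-example-bucket")
```

Keeping raw files under one prefix such as `raw/` makes it easy to point a crawler or an ETL job at that folder later.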
2. AWS Glue:
Glue is a service for running ETL (Extract, Transform, Load) processes on big data: it extracts data from sources such as S3, transforms it, and then loads the result into a target.
e.g.: If CSV data in S3 has to be analyzed, it goes through ETL before the analysis. Glue can work with S3 once its IAM role has policies that grant access to the bucket.
3. Glue crawler:
As the name suggests, a crawler crawls the data in the files and creates a table for it in the Glue Data Catalog. It is smart enough to identify the header row and an appropriate data type for each column.
e.g.: To analyze data with queries, it has to be described as a table, and the crawler creates that table for us.
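To make the idea of "identify the header and the data type of each column" concrete, here is a toy illustration in plain Python. It is not Glue's actual classifier logic, only a sketch of the same idea: the function names are made up, and the type names `bigint`, `double`, and `string` are chosen to resemble the ones that show up in catalog tables.

```python
import csv
import io


def infer_type(values):
    """Return the narrowest type name that fits every non-empty value."""
    values = [v for v in values if v != ""]
    for name, cast in (("bigint", int), ("double", float)):
        try:
            for v in values:
                cast(v)
            return name
        except ValueError:
            continue  # this type does not fit, try the next wider one
    return "string"


def infer_schema(csv_text: str):
    """Treat the first row as the header and infer a type for each column."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    columns = zip(*data)  # transpose rows into columns
    return {name: infer_type(col) for name, col in zip(header, columns)}


sample = "id,price,city\n1,9.5,Pune\n2,12,Mumbai\n"
print(infer_schema(sample))
# {'id': 'bigint', 'price': 'double', 'city': 'string'}
```

A real crawler does this over whole S3 folders and stores the resulting table definition in the Data Catalog, so it is worth checking the inferred types there before querying.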
4. Amazon EMR:
EMR (Elastic MapReduce) is a managed cluster service used to process and analyze big data with open-source frameworks such as Apache Hadoop and Apache Hive.
5. Amazon Kinesis:
Kinesis is an AWS service that processes and analyzes big data that is streamed in real time. Rather than waiting for all of the data to arrive before processing it, Kinesis processes and analyzes records almost immediately as they arrive.
e.g.: If we need to analyze continuous logs or any other streaming data, we can use Kinesis and integrate it with Glue and a crawler.
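Sending one log event to a stream could be sketched as below. The function name `send_log_event`, the stream name, and the `source` field used as partition key are assumptions for illustration; `put_record(StreamName, Data, PartitionKey)` is the boto3 Kinesis client call. The client is passed in as an argument so the code can be exercised with a stand-in object.

```python
import json


def send_log_event(kinesis_client, stream_name: str, event: dict) -> dict:
    """Serialize one event as JSON and put it on a Kinesis data stream."""
    return kinesis_client.put_record(
        StreamName=stream_name,
        Data=json.dumps(event).encode("utf-8"),
        # Records with the same partition key go to the same shard.
        PartitionKey=str(event.get("source", "default")),
    )


# Real use (needs boto3 and AWS credentials; the stream name is hypothetical):
#   import boto3
#   send_log_event(boto3.client("kinesis"), "app-logs", {"source": "web", "msg": "ok"})
```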
6. AWS Step Functions:
Step Functions is a serverless orchestration service for ETL processes and other data processing tasks. It runs the steps of a workflow in a defined order.
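As a rough sketch of what such an orchestration could look like, the snippet below builds a two-step state machine definition (run a Glue job, then an Athena query) as JSON in Amazon States Language. The state names, the function name, and all argument values are hypothetical; the two `Resource` strings are the Step Functions service integrations for Glue and Athena, and `.sync` means the step waits for the job or query to finish.

```python
import json


def build_pipeline_definition(glue_job_name: str, query: str, output_location: str) -> str:
    """Return a state machine definition: Glue ETL job first, Athena query second."""
    definition = {
        "Comment": "Run a Glue ETL job, then an Athena query",
        "StartAt": "RunGlueJob",
        "States": {
            "RunGlueJob": {
                "Type": "Task",
                "Resource": "arn:aws:states:::glue:startJobRun.sync",
                "Parameters": {"JobName": glue_job_name},
                "Next": "RunAthenaQuery",
            },
            "RunAthenaQuery": {
                "Type": "Task",
                "Resource": "arn:aws:states:::athena:startQueryExecution.sync",
                "Parameters": {
                    "QueryString": query,
                    "ResultConfiguration": {"OutputLocation": output_location},
                },
                "End": True,
            },
        },
    }
    return json.dumps(definition, indent=2)


# Hypothetical job name, query, and results location:
print(build_pipeline_definition(
    "csv-to-parquet", "SELECT COUNT(*) FROM sales", "s3://my-example-bucket/results/"
))
```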
- Data Analysis::> Athena, Redshift, QuickSight
7. Amazon Redshift:
Redshift is the data warehousing service from AWS. It stores and manages large volumes of data for analytics.
8. Amazon Athena:
Athena is a serverless query service for analyzing big data. It reads the data directly from an S3 bucket, and we analyze it by running SQL queries.
e.g.: If data in S3 has to be analyzed (assuming the data is valid), we use a crawler to create a table in a database. We can then query the table in the query editor and analyze the results.
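The same query can also be run from code instead of the query editor. The sketch below is one possible way to do it, assuming a table already exists in the catalog: the function name `run_athena_query` and all argument values are hypothetical, while `start_query_execution` and `get_query_execution` are the boto3 Athena client calls. Athena runs queries asynchronously, so the code polls until the query reaches a final state.

```python
import time


def run_athena_query(athena_client, sql, database, output_location, poll_seconds=1.0):
    """Start a query, wait until it finishes, and return (query_id, final_state)."""
    started = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        # Athena writes the result files to this S3 location.
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]
    while True:
        status = athena_client.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return query_id, state
        time.sleep(poll_seconds)  # still QUEUED or RUNNING


# Real use (needs boto3 and AWS credentials; all names are hypothetical):
#   import boto3
#   run_athena_query(boto3.client("athena"), "SELECT * FROM sales LIMIT 10",
#                    "demo_db", "s3://my-example-bucket/athena-results/")
```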
9. Amazon QuickSight:
QuickSight is the service for visualizing big data. It is a Business Intelligence (BI) service that lets you build visualizations and dashboards from your data.
e.g.: After running ETL with Glue and querying the data with Athena, we can visualize the results with QuickSight.
- Monitoring::> CloudWatch, Lambda (for event-driven automation and orchestration)
10. AWS Lambda:
Lambda is a serverless service, which means we don't need to set up or manage any infrastructure. It runs code in response to events (event-driven automation). We can add triggers such as S3, SNS, SQS, etc.
e.g.: To process data in S3 automatically, write Python code in a Lambda function and add an S3 trigger. Whenever a file is added to the bucket, Lambda runs the code, and we can view the logs in CloudWatch.
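A minimal sketch of such a function is shown below. It only reads the bucket name and object key out of the S3 event and prints them; the actual processing is left out. The handler name `lambda_handler` is the common default, and the return shape is an assumption. Object keys arrive URL-encoded in S3 events, which is why the key is decoded with `unquote_plus`.

```python
from urllib.parse import unquote_plus


def lambda_handler(event, context):
    """Collect the bucket and key of every object reported in an S3 event."""
    processed = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Keys are URL-encoded in the event, e.g. a space arrives as "+".
        key = unquote_plus(record["s3"]["object"]["key"])
        # Anything printed here ends up in the function's CloudWatch logs.
        print(f"New object: s3://{bucket}/{key}")
        processed.append({"bucket": bucket, "key": key})
    return {"processed": processed}
```

The handler can be tried locally by calling it with a hand-written event dictionary that has the same `Records` structure, before wiring up the real S3 trigger.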
11. Amazon CloudWatch:
CloudWatch is the service used to view the logs of a pipeline. Whenever a data pipeline runs, we can check its logs to see whether it executed successfully or hit any errors.
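Checking the logs for errors can also be done from code. The sketch below assumes the pipeline step is a Lambda function, whose logs go to a log group named `/aws/lambda/<function-name>`; the function name `find_error_messages` and the example log group are hypothetical. `filter_log_events(logGroupName, filterPattern)` is the boto3 CloudWatch Logs client call, and this sketch reads only the first page of results.

```python
def find_error_messages(logs_client, log_group: str, pattern: str = "ERROR"):
    """Return the log messages in one log group that match a filter pattern."""
    response = logs_client.filter_log_events(
        logGroupName=log_group,
        filterPattern=pattern,
    )
    # Only the first page of results is read here; longer logs need pagination.
    return [event["message"] for event in response.get("events", [])]


# Real use (needs boto3 and AWS credentials; the log group is hypothetical):
#   import boto3
#   find_error_messages(boto3.client("logs"), "/aws/lambda/process-csv")
```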