Building the AWS infrastructure
There are many AWS services that you can use to build the web crawling infrastructure. The Architecture section of this guide includes one proposed solution. We recommend that you consider using the following AWS services to build the supporting infrastructure for your web crawler:
- Use Amazon Virtual Private Cloud (Amazon VPC) to create the VPC and subnets (see the VPC sketch after this list).
- Initiate the crawling process by using Amazon EventBridge Scheduler (the scheduling sketch after this list shows one way to set this up).
- Manage the web crawler jobs by using AWS Batch jobs and job queues.
- Use one of the following solutions to run the web crawler jobs:
  - Amazon Elastic Container Service (Amazon ECS) containers on AWS Fargate
  - Amazon Elastic Compute Cloud (Amazon EC2) instances

    Note: If your application can handle disruptions, consider using Amazon EC2 Spot Instances through Spot Fleet. Fleets of Spot Instances can help you save significantly on compute costs.

  - AWS Lambda functions
- Store the retrieved data and raw files in an Amazon Simple Storage Service (Amazon S3) bucket (the crawler sketch after this list shows a job writing raw pages to Amazon S3).
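The following VPC sketch outlines how the VPC and a subnet might be created programmatically. It is a minimal sketch that assumes Python with boto3 and default credentials; the CIDR blocks and resource names are illustrative only, and a complete setup would also create additional subnets per Availability Zone, route tables, and gateways.

```python
"""Minimal sketch: create a VPC and one subnet for the crawler's compute.

Assumes boto3 with default credentials and Region configuration.
All CIDR blocks and names are illustrative, not prescriptive.
"""
import boto3

ec2 = boto3.client("ec2")

# Create the VPC that will contain the crawler's compute resources.
vpc = ec2.create_vpc(CidrBlock="10.0.0.0/16")
vpc_id = vpc["Vpc"]["VpcId"]
ec2.create_tags(
    Resources=[vpc_id],
    Tags=[{"Key": "Name", "Value": "web-crawler-vpc"}],
)

# A resilient design uses one subnet per Availability Zone;
# this sketch creates a single subnet for brevity.
subnet = ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.0.1.0/24")
print("VPC:", vpc_id, "Subnet:", subnet["Subnet"]["SubnetId"])
```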
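The scheduling sketch below shows one way to initiate crawls on a schedule: an EventBridge Scheduler schedule that calls the AWS Batch SubmitJob API through a universal target. It assumes Python with boto3; the schedule name, cron expression, IAM role ARN, job queue, and job definition are placeholders, and the universal-target ARN and Input payload shape are assumptions to verify against the EventBridge Scheduler and Batch SubmitJob documentation.

```python
"""Sketch: create an EventBridge Scheduler schedule that submits an AWS Batch
job for each crawl run. All names and ARNs are placeholders."""
import json

import boto3

scheduler = boto3.client("scheduler")

scheduler.create_schedule(
    Name="nightly-web-crawl",                # illustrative schedule name
    ScheduleExpression="cron(0 2 * * ? *)",  # run every day at 02:00 UTC
    FlexibleTimeWindow={"Mode": "OFF"},
    Target={
        # Universal target that calls the Batch SubmitJob API; the Input
        # payload is assumed to mirror the SubmitJob request syntax.
        "Arn": "arn:aws:scheduler:::aws-sdk:batch:submitJob",
        # Role that allows EventBridge Scheduler to call batch:SubmitJob.
        "RoleArn": "arn:aws:iam::111122223333:role/SchedulerSubmitJobRole",
        "Input": json.dumps(
            {
                "jobName": "web-crawler-run",
                "jobQueue": "web-crawler-queue",         # AWS Batch job queue
                "jobDefinition": "web-crawler-job-def",  # AWS Batch job definition
            }
        ),
    },
)
```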
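Finally, the crawler sketch below illustrates what the crawler job's entry point might do once AWS Batch starts it in an Amazon ECS container on Fargate, on an EC2 instance, or (with a handler wrapper) in a Lambda function: fetch a page and store the raw HTML in the S3 bucket. It assumes Python with boto3 and the standard library; the environment variable and bucket names are illustrative, and a production crawler would also honor robots.txt, throttle requests, retry failures, and follow discovered links.

```python
"""Minimal sketch of the crawler job's entry point: fetch one page and store
the raw HTML in Amazon S3. Bucket, key prefix, and environment variable
names are illustrative."""
import hashlib
import os
import urllib.request

import boto3

s3 = boto3.client("s3")


def crawl_one(url: str, bucket: str, prefix: str = "raw/") -> str:
    """Download one page and write the raw bytes to S3. Returns the object key."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        body = resp.read()
    # Key the object by a hash of the URL so repeated crawls overwrite in place.
    key = prefix + hashlib.sha256(url.encode()).hexdigest() + ".html"
    s3.put_object(Bucket=bucket, Key=key, Body=body, Metadata={"source-url": url})
    return key


if __name__ == "__main__":
    # AWS Batch passes configuration to the container as environment variables.
    seed_url = os.environ.get("SEED_URL", "https://example.com")
    bucket = os.environ["OUTPUT_BUCKET"]  # placeholder bucket name
    print("Stored", crawl_one(seed_url, bucket))
```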