At Excella, we wanted a Big Data sandbox area for our research and prototyping efforts. We frequently work in the Amazon Cloud and chose Amazon’s Elastic Compute Cloud (EC2) – a “use what you need when you need it” virtual computing environment that allows subscribers to launch computing instances with different configurations. Here are five takeaways from our experience.
1. It’s quick and easy to set up
Amazon’s Elastic MapReduce (EMR) provides pre-configured Hadoop environments you can select from to quickly spin up a Hadoop cluster. You choose the version of Hadoop and which tools you want installed (Hive, Pig, Mahout, Spark, Hue, etc.). You can also choose the hardware configuration that works for you from the following options:
- Compute optimized: the highest-performing processors and the lowest price per unit of compute performance in EC2
- Memory optimized: for memory-intensive applications, with the lowest cost per GiB of RAM among Amazon EC2 instance types
- Storage optimized: for applications with high storage requirements, with very fast SSD-backed instance storage tuned for very high random I/O performance and high IOPS at low cost
EMR’s step-by-step, GUI-driven installation process spins up a working Hadoop cluster in a few minutes. Logging and debugging options can be turned on or off with checkboxes, and additional setup scripts can be run by pointing to existing scripts or by specifying commands as setup steps.
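The same cluster setup can also be scripted with the AWS CLI. The sketch below is illustrative only: the cluster name, release label, instance types, key pair, and bucket names are placeholders you would replace with your own.

```shell
# Sketch of launching an EMR cluster from the command line instead of the GUI.
# All names, the release label, and instance sizes below are placeholders.
aws emr create-cluster \
  --name "sandbox-cluster" \
  --release-label emr-5.36.0 \
  --applications Name=Hadoop Name=Hive Name=Pig Name=Spark Name=Hue \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --ec2-attributes KeyName=my-key-pair \
  --log-uri s3://my-bucket/emr-logs/ \
  --use-default-roles
```

The `--log-uri` flag corresponds to the logging checkbox in the GUI, and setup steps can be added with the `--steps` or `--bootstrap-actions` options.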
2. There are multiple ways of ingesting data
Text files: Data in text files can be ingested into the Hadoop cluster in a variety of ways. Data can be stored in Amazon’s Simple Storage Service (S3) or in the Hadoop Distributed File System (HDFS). Loading data into S3 is as easy as dragging files from your computer and dropping them into S3 buckets (containers for objects stored in S3). Note that a single source file has to be 5 terabytes or less. Files can be loaded into HDFS by running DistCp (distributed copy) or S3DistCp commands as part of the cluster setup job or at the command line interface.
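As a rough illustration, copying files from S3 into HDFS from the master node’s command line might look like the following (bucket and path names are placeholders):

```shell
# Copy text files from an S3 bucket into HDFS with S3DistCp
# (run on the cluster's master node; paths are placeholders).
s3-dist-cp --src s3://my-bucket/raw-data/ --dest hdfs:///data/raw/

# Plain Hadoop DistCp works as well:
hadoop distcp s3://my-bucket/raw-data/ hdfs:///data/raw/
```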
Relational (Structured) Data: To load data stored in a relational database, we used AWS RDS (Relational Database Service). You install Sqoop (available from sqoop.apache.org) on your Hadoop cluster, then connect to the appropriate DBMS in RDS and load the data into HDFS or S3. Amazon RDS supports relational database management systems such as Oracle, PostgreSQL, MySQL, MS SQL Server and others.
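A Sqoop import from an RDS MySQL instance might be sketched like this. The hostname, database, table, and credentials are all hypothetical placeholders:

```shell
# Sketch of importing a table from a MySQL database on RDS into HDFS with Sqoop.
# Hostname, database, table name, and user are placeholders; -P prompts for the password.
sqoop import \
  --connect jdbc:mysql://mydb.example.us-east-1.rds.amazonaws.com:3306/sales \
  --username dbuser -P \
  --table orders \
  --target-dir /data/orders \
  --num-mappers 4
```

Pointing `--target-dir` at an `s3://` path instead would land the data directly in S3.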
Streaming Data: We tried to use Flume to load streaming data into our Hadoop cluster but didn’t have much success (AWS EMR doesn’t support Flume). After further research, we found that the recommended way to load streaming data into an AWS EMR Hadoop cluster is Amazon Kinesis, a managed streaming data service. We plan to use Amazon Kinesis in the next phase of our project.
3. The management console gives you options
The AWS Management Console is an intuitive interface that lets you manage and monitor your running EMR (Hadoop) clusters, spin up new clusters (either by cloning previously created clusters or by setting up new cluster configurations), monitor EC2 instances, monitor your DBMS in RDS, manage security and user access, manage your files in S3 storage, and so on.
If you prefer the command line, you can SSH to the master node of your running EMR cluster; AWS provides easy-to-follow steps for setting up the SSH connection. You can also use the open source tool Hue to manage your Hadoop jobs and execute commands.
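The connection itself is a single command once your key pair is in place. The key file name and public DNS name below are placeholders for your own cluster’s values:

```shell
# SSH to the EMR master node (the default login user on EMR is "hadoop").
# Key file and master public DNS name are placeholders from your cluster.
ssh -i ~/my-key-pair.pem hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com
```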
4. You can be proactive with billing and cost management
AWS provides a billing and cost management dashboard that shows the month to date balance, the forecast for the remainder of the month based on current usage and the previous months’ totals for historical comparisons. It provides the month-to-date breakdown by AWS service (EMR, EC2, RDS, S3 …). You can also set alerts to be notified when the cost for the month reaches a certain threshold, which we found to be a really handy feature.
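One way to script such a threshold alert is a CloudWatch alarm on the `EstimatedCharges` billing metric. This is a hedged sketch: the alarm name, threshold, and SNS topic ARN are placeholders, and billing metrics must first be enabled in your account (they are published in us-east-1).

```shell
# Sketch of a billing alert: a CloudWatch alarm on EstimatedCharges that
# notifies an SNS topic when month-to-date charges exceed $100.
# The topic ARN and account ID are placeholders.
aws cloudwatch put-metric-alarm \
  --region us-east-1 \
  --alarm-name monthly-billing-alert \
  --namespace "AWS/Billing" \
  --metric-name EstimatedCharges \
  --dimensions Name=Currency,Value=USD \
  --statistic Maximum \
  --period 21600 \
  --evaluation-periods 1 \
  --threshold 100 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:billing-alerts
```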
5. You can use what you need when you need it
Hosting our Hadoop cluster on AWS gave us the ability to use only what we needed, when we needed it. We were able to quickly add computing power to our cluster when required and remove it when we didn’t need it. Being able to do this in a matter of minutes and with a few clicks saved us both fees and time. AWS’s elasticity and flexibility, coupled with its robust billing and cost management tools, allowed us to keep tight control over cost.
Share your experiences with us and tell us what Cloud environments you’ve used for Big Data and which ones you prefer.