Wednesday, July 1, 2015

Why do we need Hadoop and What can we do with it? [Hadoop for SQL Developer]

If someone comes and says:

We have a large dataset; let's use Hadoop to process it.

What would be your answer? If you are an IT enthusiast, the most common and expected answer would be: Yes, why don't we? But is that the right way to use Hadoop? We love technology, but that does not mean we should always reach for the latest or most popular tool when it is not the right one for the problem at hand. You may be able to solve the problem with a traditional database management system without implementing Hadoop, or you may implement Hadoop because there are no other options and it is the right way to address the problem. How do you decide?

Let's try to understand Hadoop first:

Apache Hadoop is a scalable, fault-tolerant, open-source framework that is highly optimized for distributed processing and storage, and that runs on inexpensive (commodity) hardware.

Okay, why do we need it?

The simplest answer is that Hadoop lets us process a massive amount of data quickly and efficiently in order to extract insight from it.

What else? Is it all about processing data? Generally, we can use Hadoop for the following:
  • Hadoop as an ETL platform
    A common requirement in ETL with big data is processing a large amount of unstructured data and producing a structured result. This is not an easy operation with traditional DBMSs and ETL tools, so Hadoop can be used to process the unstructured dataset and produce a structured one (see the first sketch after this list).
  • Hadoop as an exploration engine
    Analysis requires applying complex logic to structured, semi-structured, and unstructured data. Hadoop offers many tools for analyzing data efficiently, and analysis performs well because it runs where the data is stored, on the Hadoop cluster itself (see the second sketch after this list).
  • Hadoop as data storage
    Since scalability is built into Hadoop, it is well suited to storing massive amounts of data, and it provides fault tolerance and high availability automatically. Because data blocks are replicated (three copies by default), read operations can be served from the best available node, even when one node is down due to a fault. Not only that, when more space is required, more nodes can be added to expand the cluster without any practical limit (see the third sketch after this list).
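As a concrete example of the ETL use case, here is a minimal map-only MapReduce job, written in Java (Hadoop's native language), that parses raw Apache-style access-log lines into tab-separated structured records. This is only a sketch under assumptions: the class name LogToTsv, the log pattern, and the command-line paths are illustrative, not part of any standard distribution.

```java
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LogToTsv {

    // Extracts (ip, timestamp, request, status) from Apache-style access-log lines.
    public static class ParseMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {

        private static final Pattern LOG =
            Pattern.compile("^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3})");

        private final Text out = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            Matcher m = LOG.matcher(value.toString());
            if (m.find()) {
                // Emit one tab-separated, structured record; skip unparsable lines.
                out.set(m.group(1) + "\t" + m.group(2) + "\t"
                        + m.group(3) + "\t" + m.group(4));
                context.write(out, NullWritable.get());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "log-to-tsv");
        job.setJarByClass(LogToTsv.class);
        job.setMapperClass(ParseMapper.class);
        job.setNumReduceTasks(0);                 // map-only: a pure transformation
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The job reads raw text from one HDFS directory and writes the structured result to another, which downstream tools can then load as a table.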
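For the exploration use case, SQL developers will feel most at home with Hive, one of the analysis tools in the Hadoop ecosystem that exposes cluster data through SQL. The snippet below is a sketch that runs an aggregate query against a hypothetical access_logs table over Hive's JDBC interface; the connection URL, empty credentials, and table name are assumptions for illustration, and the hive-jdbc driver must be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveExplore {
    public static void main(String[] args) throws Exception {
        // Connect to an (assumed) unsecured HiveServer2 on its default port.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = conn.createStatement();
             // Familiar SQL: count hits per HTTP status over the structured logs.
             ResultSet rs = stmt.executeQuery(
                 "SELECT status, COUNT(*) AS hits "
               + "FROM access_logs GROUP BY status")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```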
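Finally, for the storage use case, the replication discussed above can be inspected programmatically through the standard HDFS FileSystem API. This short sketch just prints the replication factor of a given path; the class name ReplicationCheck is made up for this example.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationCheck {
    public static void main(String[] args) throws Exception {
        // Picks up cluster settings (e.g. dfs.replication) from the local config.
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path(args[0]));
        // Reports how many copies of this file's blocks the cluster keeps.
        System.out.println(args[0] + " replication factor: "
                + status.getReplication());
    }
}
```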
