Big Data is the buzzword in IT Circle nowadays. The major reason for this is the exploding “Netizen” base. Today Everything is happening Online and Online Data is estimated in zettabytes. The wealth of information one can carve from Online data is undeniably attractive for several organizations for marketing and sales. Organizations like Google, Yahoo, Facebook, Amazon etc process several Petabytes of data on a daily basis. Many more organizations are moving towards being able to collect, store and make sense of data in the Internet to further their interests.That is where “Big Data” has caught the imagination of people around the world. But What is Big Data and How can I jump into this bandwagon. Fret not, for in the blog post, you are going to find all about it. The structure of this blog will be typical of a What you need to know? series posted at Infosecnirvana.com. So lets get started!!!
What is Data?
Data is anything that provides value in a structured or unstructured format. It is the lowest level of abstraction in Computing terms because after this, it is binary digits only. Data is typically stored in File Systems
Introducing File Systems
File Systems are the basis of storing and accessing data from a hardware device. It is nothing but an abstraction layer of software/firmware that gives you the capability to store data in a structured format, remember the structure and when queried, help retrieve it as quickly as possible. There are 2 major and common types of File Systems – Disk Based (local access) and Network Based (remote access). To give a simple example, FAT is a Windows Disk based File System wheres NFS is a Network based File System.
Even though both the file systems continue to dominate IT space, more and more relevance is given to Network based File Systems for obvious reasons like Distributed Data storage, redundancy, fault tolerance capabilities etc. This is the basis of “Big Data Tools and Technologies”.
Distributed File Systems are Network based File Systems that allow data to be shared across multiple machines across multiple networks. This makes it possible for multiple users on multiple machines to share files and storage resources. The client machines don’t have direct access to the Storage disk itself (as in a Disk based file system), but are able to interact with the Data using a File System protocol. One classic example of DFS is Microsoft SMB where All Windows machines are SMB Clients and access a common SMB Share on the File Server. But SMB suffers from issues pertaining to scalability and fault tolerance. This is where systems like Google File System – GFS (Google uses this in their search engine) and Hadoop Distributed File System – HDFS (Yahoo and others) come into prominence. What these File Systems do is provide a mechanism to effectively manage big data collection, storage and processing across multiple machine nodes.
Hadoop Distributed File Systems or shortly HDFS is similar to the other DFS file systems talked above, however it is significantly different as well. HDFS can be deployed on Commodity Hardware, is Highly Fault Tolerant and is very capable of handling large data sets. Originally HDFS was developed as part of the Apache NUTCH Project for an alternate Search Engine akin to Google. Some of the most prominent software players for HDFS are “Apache Hadoop”, “Greenplum”, Cloudera etc.
In this post, we will be looking at Log Collection and Management using the Hadoop Platform.
APACHE Hadoop: The Apache Hadoop architecture in a Nutshell consists of the following components:
- HDFS is a Master Slave Architecture
- Master Server is called a NameNode
- Slave Servers are called DataNodes
- Underlying Data Replication across Nodes
- Interface Language – Java
Installing Hadoop: Installation of Apache Hadoop is not a very easy task, but at the same time it is not too complex either. Understanding of the Hardware Requirements, Operating System Requirements and Java Programming Language can help you install Apache Hadoop without any issues. Installing Hadoop can be either a Single Node Installation or a Cluster Installation. For this post, we will look at only Single Node Installation steps:
- Install Oracle Java on your machine – Ubuntu
- Install OpenSSH Server
- Create a Hadoop Group and Hadoop User and set Key Based Login for SSH
- Download the Latest Distribution of Hadoop from http://www.apache.org/dyn/closer.cgi
- Installation is just extracting the Hadoop files into a folder and editing some property files
- Provide the location for the JAVA home in the following file location- hadoop/conf/hadoop-env.sh
- Create a working folder in Hadoop User Home Directory /home//tmp
- Add the relevant details about the host and the home directory following configuration elements in /hadoop/conf/core-site.xml
conf/core-site.xml —> hadoop.tmp.dir /home//tmp A base for other temporary directories. fs.default.name hdfs://localhost:54310 The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri’s scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri’s authority is used to determine the host, port, etc. for a filesystem.
- Then we need to edit the hadoop/conf/mapred-site.xml using a text editor and add the following configuration values (like core-site.xml)
conf/mapred-site.xml —> mapred.job.tracker localhost:54311 The host and port that the MapReduce job tracker runs at. If “local”, then jobs are run in-process as a single map and reduce task.
- Open hadoop/conf/hdfs-site.xml using a text editor and add the following configurations:
conf/hdfs-site.xml —> dfs.replication 1 Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.
- Before running the Hadoop Installation, the most important step is to format the NameNode or the Master Server. This is critical because, Without the NameNode, the DataNodes will not be setup. In a Single Node Installation, NameNode and DataNodes will reside on the same host, where as in Cluster Installation, NameNodes and DataNodes will reside on different hosts. In order to format the NameNode using Hadoop commands, Run the following command - /hadoop/bin/hadoop namenode -format
- In order to start the Hadoop Instance, from hadoop/bin run ./start-dfs.sh and Running the commands will start up Hadoop and when you query the Java Process, you should be able to see the following components of Hadoop Running:
NameNode DataNode SecondaryNameNode JobTracker TaskTracker
- If you have successfully completed till this, then you now have a Hadoop Single Node Instance running on your machine.
Getting Data in/out of Hadoop:
Once the installation is completed, the next thing we need to worry about is getting data in and out of Hadoop File System. Typically in order to get the data into the system, we need a API interface into HDFS. This typically is a JAVA or HTTP API. Tools like FluentD, Flume etc help in getting data in and out of Hadoop. Both the tools have plugins for receiving HTTP data, Streaming data and Syslog Data as well.
MapReduce: Hadoop and Big data discussions are incomplete without talking about MapReduce. MapReduce is a software policy framework that maps Input data based on a map file and outputs data in key value pairs. These are two different jobs when it comes to actual processing. One is the Map Task that splits the data into smaller chunks and there is the Reduce Job that generates a Key Value combination for each of the smaller data chunks. This framework is the powerhouse for Hadoop because, this is built with parallelism in mind. Map Tasks and Reduce Tasks can both be run parallel on several machines without compromising on speed, cpu and memory resources. The NameNode is the central master that tracks the Maps and the Jobs where as the DataNodes are just providing processing resource.
Finally, Using Hadoop: Now that we know what drives Hadoop and how to get Hadoop installed, the easiest thing would be to start using them. Several examples for MapReduce jobs using Java are available to aid in learning. There are several related projects running to make the Hadoop Ecosystem more scalable and mature. Some of them are:
- HBase, a Bigtable-like structured storage system for Hadoop HDFS
- Apache Pig is a high-level data-flow language and execution framework for parallel computation. It is built on top of Hadoop Core.
- Hive a data warehouse infrastructure which allows sql-like adhoc querying of data (in any format) stored in Hadoop
- ZooKeeper is a high-performance coordination service for distributed applications.
- Hama, a Google’s Pregel-like distributed computing framework based on BSP (Bulk Synchronous Parallel) computing techniques for massive scientific computations.
- Mahout, scalable Machine Learning algorithms using Hadoop
Conclusion: Hope this post helped you in understanding the basic concepts of Big Data and also to setup a Hadoop Single Node Installation to play with. Please do post your thoughts on how Big Data is playing a major role in your organisations.