The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
The project includes these modules:
- Hadoop Common: the common utilities that support the other Hadoop modules.
- Hadoop Distributed File System (HDFS): a distributed file system that provides high-throughput access to application data.
- Hadoop YARN: a framework for job scheduling and cluster resource management.
- Hadoop MapReduce: a system for parallel processing of large data sets.
Other Hadoop-related projects at Apache include HBase, Hive, Pig, ZooKeeper, Avro, Cassandra and Mahout.
Below are the steps for installing Hadoop, with brief explanations in places. These are more or less reference notes that I made when I was installing Hadoop on my system for the very first time.
Please let me know if you need any specific details.
Installing HDFS (Hadoop Distributed File System)
OS : Linux Mint (Ubuntu)
Installing Sun Java on Linux (Mint/Ubuntu)
$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java7-installer
$ sudo update-java-alternatives -s java-7-oracle
Create hadoop user
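A dedicated group and user for Hadoop can be created as sketched below; the names hadoop and hduser are the ones assumed throughout the rest of these notes.

```shell
# Create a dedicated group and a user in that group for running Hadoop.
# The names (hadoop, hduser) match those used in the remaining steps.
sudo addgroup hadoop
sudo adduser --ingroup hadoop hduser
```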
Install an SSH server if one is not already present. Hadoop uses SSH to manage its nodes, so the hduser account needs passwordless SSH access to localhost.
$ sudo apt-get install openssh-server
$ su - hduser
$ ssh-keygen -t rsa -P ""
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
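To confirm that the key-based setup above works, hduser should be able to SSH into localhost without being asked for a password (the very first connection will only ask to confirm the host key):

```shell
# Verify passwordless SSH for hduser; this should log in
# without a password prompt if the key setup succeeded.
ssh localhost
exit
```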
$ sudo gedit /etc/sysctl.conf
This command opens sysctl.conf in a text editor; copy the following lines to the end of the file:
#disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
$ sudo sysctl -p
To make sure that IPv6 is disabled, you can run the following command, which should print 1:
$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6
Download hadoop from Apache Downloads.
$ wget http://www.eng.lsu.edu/mirrors/apache/hadoop/core/hadoop-0.22.0/hadoop-0.22.0.tar.gz
$ cd /home/hduser
$ tar xzf hadoop-0.22.0.tar.gz
$ mv hadoop-0.22.0 hadoop
Add the following to the ~/.bashrc of hduser (run source ~/.bashrc, or log in again, for the changes to take effect):
# Set Hadoop-related environment variables
export HADOOP_HOME=/home/hduser/hadoop
# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
export JAVA_HOME=/usr/lib/jvm/java-7-oracle
# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin
We need to update only the JAVA_HOME variable in hadoop/conf/hadoop-env.sh. Open this file in a text editor:
$ gedit /home/hduser/hadoop/conf/hadoop-env.sh
Add/update the following line:
export JAVA_HOME=/usr/lib/jvm/java-7-oracle
Create a temp directory for hadoop (used as hadoop.tmp.dir below):
$ mkdir -p /home/hduser/tmp
Open hadoop/conf/core-site.xml in a text editor and add the following configuration between the <configuration> .. </configuration> XML elements:
<!-- In: conf/core-site.xml -->
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/hduser/tmp</value>
  <description>A base for other temporary directories.</description>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
</property>
Open hadoop/conf/mapred-site.xml in a text editor and add the following configuration values (as in core-site.xml):
<!-- In: conf/mapred-site.xml -->
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
</property>
Open hadoop/conf/hdfs-site.xml using a text editor and add the following configurations:
<!-- In: conf/hdfs-site.xml -->
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.</description>
</property>
Before starting the cluster for the first time, you should format the NameNode in your HDFS. Do not do this while the system is running; it is usually done only once, at installation time. Run the following command:
$ /home/hduser/hadoop/bin/hadoop namenode -format
Starting Hadoop Cluster
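Assuming the install layout above, the HDFS and MapReduce daemons can be started with the scripts that Hadoop 0.22 ships in its bin/ directory:

```shell
# Start the HDFS daemons (NameNode, DataNode, SecondaryNameNode)
/home/hduser/hadoop/bin/start-dfs.sh
# Start the MapReduce daemons (JobTracker, TaskTracker)
/home/hduser/hadoop/bin/start-mapred.sh
```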
Stopping Hadoop Cluster
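Stopping is the reverse of starting, using the matching stop scripts from the same bin/ directory:

```shell
# Stop the MapReduce daemons first, then the HDFS daemons
/home/hduser/hadoop/bin/stop-mapred.sh
/home/hduser/hadoop/bin/stop-dfs.sh
```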
To check for processes running use:
$ ps -eaf | grep "java"
Tasks running should be as follows:
NameNode
DataNode
SecondaryNameNode
JobTracker
TaskTracker
Example application to test that Hadoop was installed successfully:
$ hadoop jar ../hadoop-mapred-examples-0.22.0.jar pi 3 10
This should complete successfully, printing various details and an estimated value of pi.
I am a software developer who loves to learn and build new things. To share my learning I blog here, and I have also built Hadoop Screencasts (www.hadoopscreencasts.com), a site that hosts screencasts on Apache Hadoop and its components.