Hadoop is one of the most popular tools for dealing with big data. I set up the Hadoop environment on Linux Mint 17; since Mint 17 is based on Ubuntu 14.04, the following steps also work on Ubuntu 14.04.
1. install java jdk 8
sudo apt-get install python-software-properties
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer
javac -version   # check the version of java

# remove openjdk java
sudo apt-get purge openjdk-\*
# java version vs hadoop version
# refer: http://wiki.apache.org/hadoop/HadoopJavaVersions
2. install ssh
sudo apt-get install ssh rsync openssh-server
ssh-keygen -t rsa -P ""   # generate SSH key
# enable the SSH key
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
# test whether it works; it should not ask for a password
ssh localhost
exit
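For the multi-node case the master must also reach every slave over SSH without a password. Assuming the same user name (celest in this guide) exists on each slave, copying the key with ssh-copy-id is the usual approach; slave01 is the slave hostname used later on:

# run on the master, once per slave
ssh-copy-id celest@slave01
# verify that no password is asked for
ssh slave01
exit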
3. download hadoop
wget http://apache.stu.edu.tw/hadoop/common/stable2/hadoop-2.7.2.tar.gz
tar zxvf hadoop-2.7.2.tar.gz
sudo mv hadoop-2.7.2 /usr/local/hadoop
cd /usr/local
sudo chown -R celest hadoop   # celest is my user name; replace it with yours
4-1. network environment (for multi-node hadoop)
If you use VMware, you need to add another host-only network adapter. You can finish installing Hadoop on one machine first and then clone it to create the slaves. In the following, master stands for the primary node and slaveXX for the other nodes.
a. setting the network
Use ifconfig to check whether the network card has been added.
sudo subl /etc/network/interfaces
The content of the file looks like this (there must be an eth1 entry; assign an address to it on the master and on every slave):
# The loopback network interface for master
auto lo
iface lo inet loopback

# The primary network interface
auto eth0
iface eth0 inet dhcp

# Host-Only
auto eth1
iface eth1:0 inet static
address 192.168.29.130   # (192.168.29.131 for slave01)
netmask 255.255.0.0
Restart the network with sudo /etc/init.d/networking restart and check it again with ifconfig.
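For the hostnames master and slave01 used later (in core-site.xml and in the slaves file) to resolve, every machine usually also needs matching entries in /etc/hosts. A minimal sketch, assuming the host-only addresses above:

# /etc/hosts on every node
192.168.29.130   master
192.168.29.131   slave01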
b. core-site.xml
Command: subl etc/hadoop/core-site.xml. We point Hadoop at its temporary directory and the default file system:

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/hadoop/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
    <!-- <value>hdfs://master:9000</value> for the multi-node case -->
    <description>
      The name of the default file system. A URI whose scheme and authority
      determine the FileSystem implementation. The uri's scheme determines
      the config property (fs.SCHEME.impl) naming the FileSystem
      implementation class. The uri's authority is used to determine the
      host, port, etc. for a filesystem.
    </description>
  </property>
</configuration>
c. mapred-site.xml
There is no mapred-site.xml by default; create it by copying mapred-site.xml.template. This command is helpful: cp etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml && subl etc/hadoop/mapred-site.xml. We tell MapReduce which framework runs the jobs, as sketched below.
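A minimal sketch of mapred-site.xml for Hadoop 2.7 running on YARN; the single property below is the usual choice, and memory or history-server settings can be added later if needed:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
    <description>Run MapReduce jobs on the YARN framework.</description>
  </property>
</configuration>

If jobs later hang during the shuffle phase, yarn-site.xml usually also needs yarn.nodemanager.aux-services set to mapreduce_shuffle.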
d. hdfs-site.xml
Command: subl etc/hadoop/hdfs-site.xml. We set the replication factor and the storage directories:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
    <description>
      Default block replication. The actual number of replications can be
      specified when the file is created. The default is used if replication
      is not specified at create time.
    </description>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/usr/local/hadoop/tmp/data</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/usr/local/hadoop/tmp/name</value>
  </property>
</configuration>
e. slaves
Put the names of your machines in hadoop/etc/hadoop/slaves. Command: subl etc/hadoop/slaves. The file looks like this:
master
slave01
Starting hadoop
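The hdfs and start-*.sh commands used below are found through the PATH, so the file you source must export the Hadoop and Java locations. A minimal sketch of what to add to /etc/bash.bashrc (or ~/.bashrc on Ubuntu), assuming the install paths used above and the default location of the oracle-java8-installer package:

# Hadoop environment variables
export JAVA_HOME=/usr/lib/jvm/java-8-oracle    # default path of the oracle-java8-installer package
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin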
# import environment variable
# ubuntu: source ~/.bashrc
source /etc/bash.bashrc
# format hadoop space
hdfs namenode -format
# start the hadoop
start-dfs.sh && start-yarn.sh
# or start-all.sh
The output looks like this: (standalone)
localhost: starting namenode, logging to /usr/local/hadoop/logs/hadoop-master-namenode-master-virtual-machine.out
localhost: starting datanode, logging to /usr/local/hadoop/logs/hadoop-master-datanode-master-virtual-machine.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop/logs/hadoop-master-secondarynamenode-master-virtual-machine.out
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop/logs/yarn-master-resourcemanager-master-virtual-machine.out
localhost: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-master-nodemanager-master-virtual-machine.out
The output looks like this: (multi-node)
master: starting namenode, logging to /usr/local/hadoop/logs/hadoop-master-namenode-master.out
slave01: starting datanode, logging to /usr/local/hadoop/logs/hadoop-master-datanode-slave01.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop/logs/hadoop-master-secondarynamenode-master.out
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop/logs/yarn-master-resourcemanager-master.out
slave01: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-master-nodemanager-slave01.out
Check whether the servers have started by connecting to the local web interface (standalone) or by asking for a cluster report (multi-node).
# for standalone
firefox http://localhost:50070
firefox http://localhost:50090
# for multi-node hadoop
hdfs dfsadmin -report
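Another quick check is jps from the JDK, which lists the running Java daemons; on a healthy single node you should see NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager (plus Jps itself).

# list the running Hadoop daemons
jps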
Run an example from the examples jar
hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar pi 10 100
The last two lines will show the following information:
Job Finished in 167.153 seconds
Estimated value of Pi is 3.14800000000000000000
Run a second example
run wordcount
cd ~/Downloads && mkdir testData && cd testData
# download data for test
wget http://www.gutenberg.org/ebooks/5000.txt.utf-8
cd ..
# upload to the hadoop server
# (if /user/celest does not exist yet, create it first with: hdfs dfs -mkdir -p /user/celest)
hdfs dfs -copyFromLocal testData/ /user/celest/
# check that the file is in the hadoop server
hdfs dfs -ls /user/celest/testData/
# run wordcount on hadoop
hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar wordcount /user/celest/testData /user/celest/testData-output
# check whether it succeeded
hdfs dfs -ls /user/celest/testData-output/
# view the result
hdfs dfs -cat /user/celest/testData-output/part-r-00000

# clean the test file (optional)
hdfs dfs -rm -r /user/celest/testData
hdfs dfs -rm -r /user/celest/testData-output
Stopping hadoop
Stop the daemons with stop-all.sh, or with stop-dfs.sh && stop-yarn.sh.
compiling hadoop library by yourself (optional)
To avoid the warning WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable, you can build the 64-bit native Hadoop library yourself.
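A rough sketch of the usual procedure, assuming the Hadoop 2.7.2 source tarball from the same mirror and the build dependencies packaged in Ubuntu 14.04 (Hadoop 2.7 expects protobuf 2.5.0, which is what 14.04 ships):

# install the build dependencies
sudo apt-get install build-essential maven cmake zlib1g-dev libssl-dev protobuf-compiler libprotobuf-dev
# fetch the matching source and build the native part
wget http://apache.stu.edu.tw/hadoop/common/stable2/hadoop-2.7.2-src.tar.gz
tar zxvf hadoop-2.7.2-src.tar.gz
cd hadoop-2.7.2-src
mvn package -Pdist,native -DskipTests -Dtar
# replace the bundled libraries with the freshly built 64-bit ones
cp hadoop-dist/target/hadoop-2.7.2/lib/native/* /usr/local/hadoop/lib/native/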