Trying Out Hadoop with Python, Part 1 - もちおのWEBアプリ開発日記

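I'm pretty late to the party, but I finally gave Hadoop a try.

The environment is ubuntu-9.10 running on a VM.

Instead of using the bundled samples, I wrote the mapper and reducer in Python.

First, the preparation.

Check Java. I installed it at some point for something else, so I'll skip the installation.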



mochi@ubuntu-vm:~$ java -version

java version "1.6.0_0"

OpenJDK Runtime Environment (IcedTea6 1.6.1) (6b16-1.6.1-3ubuntu1)

OpenJDK Client VM (build 14.0-b16, mixed mode, sharing)

Next, create a user. Make its group hadoop as well, then log in.



mochi@ubuntu-vm:~$ sudo adduser hadoop

・

・

・

mochi@ubuntu-vm:~$ su - hadoop

While still logged in as hadoop, set up the public key. Create it without a passphrase.



hadoop@ubuntu-vm:~$ ssh-keygen -t rsa -P ""

・

・

hadoop@ubuntu-vm:~$ cat .ssh/id_rsa.pub >> .ssh/authorized_keys

hadoop@ubuntu-vm:~$ chmod 600 .ssh/authorized_keys 

hadoop@ubuntu-vm:~$ ssh localhost

・

・

hadoop@ubuntu-vm:~$ exit

hadoop@ubuntu-vm:~$ exit

If the ssh login above works without asking for a password, you're good.
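The same check can be scripted. This is a minimal sketch, assuming Python 3 and the OpenSSH client (neither is part of the original setup); the function names are hypothetical. `BatchMode=yes` makes ssh fail instead of prompting, so a non-zero exit code means the key setup is not working yet.

```python
import subprocess

def ssh_check_command(host="localhost"):
    """Build an ssh command that fails rather than prompting for a password."""
    # BatchMode=yes disables password and passphrase prompts in OpenSSH.
    return ["ssh", "-o", "BatchMode=yes", host, "true"]

def can_login_without_password(host="localhost"):
    """Return True if `ssh host true` succeeds without any prompt."""
    result = subprocess.run(ssh_check_command(host))
    return result.returncode == 0
```

Calling `can_login_without_password()` as the hadoop user should return True once authorized_keys is in place. If localhost has not been accepted into known_hosts yet, the check may also fail, so run a manual `ssh localhost` once first.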

Next, download hadoop from here and try it out.
http://ftp.riken.jp/net/apache/hadoop/core/stable/



mochi@ubuntu-vm:~$ mkdir work

mochi@ubuntu-vm:~$ cd work

mochi@ubuntu-vm:~$ wget http://ftp.riken.jp/net/apache/hadoop/core/stable/hadoop-0.20.1.tar.gz

mochi@ubuntu-vm:~$ tar zxvf hadoop-0.20.1.tar.gz

mochi@ubuntu-vm:~$ sudo mv hadoop-0.20.1 /usr/local/hadoop

mochi@ubuntu-vm:~$ sudo chown -R hadoop:hadoop /usr/local/hadoop

At this point I edited /etc/passwd and changed the hadoop user's home to
/usr/local/hadoop.
The shell is bash, so add JAVA_HOME to $HOME/.bashrc.
Likewise, extend the command path with something like export PATH=$PATH:$HOME/bin.

Switch users again.



mochi@ubuntu-vm:~$ su - hadoop

・

hadoop@ubuntu-vm:~$

Set the JAVA_HOME environment variable in the config file.
Adjust the value to match your own environment.

hadoop@ubuntu-vm:~$ vim conf/hadoop-env.sh

# Set Hadoop-specific environment variables here.

# The only required environment variable is JAVA_HOME. All others are
# optional. When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.

# The java implementation to use. Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun
export JAVA_HOME=/usr/lib/jvm/****
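If you are not sure which directory to use for JAVA_HOME, one way to find a candidate is to resolve the `java` command through its symlinks and strip the trailing bin/java. This is only a sketch, assuming Python 3; the function names are hypothetical and the result is a guess that you should check by hand.

```python
import os
import shutil

def java_home_from_binary(java_path):
    """Guess JAVA_HOME from the resolved path of the java executable."""
    # Strip the trailing "bin/java".
    home = os.path.dirname(os.path.dirname(java_path))
    # Some layouts keep the binary under "jre/bin/java"; step out of "jre" too.
    if os.path.basename(home) == "jre":
        home = os.path.dirname(home)
    return home

def detect_java_home():
    """Resolve the `java` found on PATH through its symlinks, or return None."""
    java = shutil.which("java")
    if java is None:
        return None
    return java_home_from_binary(os.path.realpath(java))
```

Printing `detect_java_home()` gives a path you can compare against the directories under /usr/lib/jvm before writing it into conf/hadoop-env.sh.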

That's roughly it for the configuration.

Next, prepare a file to process.



hadoop@ubuntu-vm:~$ mkdir input

hadoop@ubuntu-vm:~$ vim input/example.tsv

1  test

2  mochi

3  aaaa

4  aaaa

5  test

6  bbbbb

7  test

8  mochi

9  hagaeru3sei

10 hagaeru3sei

11 test

I prepared a tsv file like this.
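The two columns have to be separated by a real tab character, which is easy to lose when copying the listing above. As a rough sketch, assuming Python 3 (the function names are hypothetical), the same file can be generated like this:

```python
import os

# The sample rows from the listing above: (id, word).
ROWS = [
    (1, "test"), (2, "mochi"), (3, "aaaa"), (4, "aaaa"), (5, "test"),
    (6, "bbbbb"), (7, "test"), (8, "mochi"), (9, "hagaeru3sei"),
    (10, "hagaeru3sei"), (11, "test"),
]

def example_tsv_text():
    """Return the sample rows as "id<TAB>word" lines."""
    return "".join("%d\t%s\n" % (row_id, word) for row_id, word in ROWS)

def write_example_tsv(path):
    """Write the sample data to path, creating the directory if needed."""
    directory = os.path.dirname(path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    with open(path, "w") as f:
        f.write(example_tsv_text())
```

Calling `write_example_tsv("input/example.tsv")` from the hadoop user's home should produce the same file as the vim step.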

Next, prepare the [mapper] and [reducer] that will process it.

This time I wrote them in Python.

[mapper]

hadoop@ubuntu-vm:~$ mkdir work
hadoop@ubuntu-vm:~$ mkdir work/python
hadoop@ubuntu-vm:~$ vim work/python/map.py

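#!/usr/bin/env python
# coding:utf-8

import sys

def main():
    # Read "key<TAB>value" records from stdin and emit them unchanged.
    line = sys.stdin.readline()
    try:
        while line:
            line = line.rstrip("\n")  # drop the trailing newline
            fields = line.split("\t")
            print("%s\t%s" % (fields[0], fields[1]))
            line = sys.stdin.readline()
    except Exception as e:
        # Report errors on stderr so they do not end up in the mapper output.
        sys.stderr.write("%s\n" % e)

if __name__ == "__main__":
    main()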

[reducer]

hadoop@ubuntu-vm:~$ vim work/python/reduce.py

#!/usr/bin/env python
# coding: utf-8

import sys

cnt = {}  # value -> number of occurrences

def main():
    line = sys.stdin.readline()
    try:
        while line:
            line = line.rstrip("\n")  # drop the trailing newline
            key, value = line.split("\t")
            if value not in cnt:
                cnt[value] = 0
            cnt[value] += 1
            line = sys.stdin.readline()
    except Exception as e:
        # Report errors on stderr so they do not end up in the reducer output.
        sys.stderr.write("%s\n" % e)

if __name__ == "__main__":
    main()
    # Print one "[ value ] : count" line per distinct value.
    for k, v in cnt.items():
        print("[ " + str(k) + " ]\t:\t" + str(v))

Test each script on its own.



hadoop@ubuntu-vm:~$ chmod 755 work/python/map.py work/python/reduce.py

hadoop@ubuntu-vm:~$ work/python/map.py < input/example.tsv

1  test

2  mochi

3  aaaa

4  aaaa

5  test

6  bbbbb

7  test

8  mochi

9  hagaeru3sei

10 hagaeru3sei

11 test
hadoop@ubuntu-vm:~$ work/python/reduce.py < input/example.tsv

[ test ]        :       4

[ aaaa ]        :       2

[ bbbbb ]       :       1

[ hagaeru3sei ] :       2

[ mochi ]       :       2

OK.
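The two tests above run each script separately. Under Hadoop, the mapper output is sorted by key before it reaches the reducer, so it can be worth approximating the whole flow in one process too. This is a rough sketch, assuming Python 3; the function names are hypothetical and `sorted` is only a stand-in for Hadoop's sort step, not what Hadoop actually runs.

```python
from collections import Counter

def map_lines(lines):
    """Mimic map.py: turn "key<TAB>value" lines into (key, value) pairs."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        yield fields[0], fields[1]

def shuffle(pairs):
    """Stand-in for the sort between map and reduce (keys compared as text)."""
    return sorted(pairs, key=lambda kv: kv[0])

def reduce_pairs(pairs):
    """Mimic reduce.py: count how often each value occurs."""
    return Counter(value for _key, value in pairs)

def run_local(lines):
    """Run map, sort and reduce in sequence on an iterable of lines."""
    return reduce_pairs(shuffle(map_lines(lines)))
```

For example, `run_local(open("input/example.tsv"))` should give the same counts as the reduce.py run above. Because the reducer only counts values, the order in which the records arrive does not change the result.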

This has gotten long, so I'll split the article here.

Continued in the next part.