WordCount Analysis

1. Create a new Java project, then copy the examples folder from /home/hadoop/hadoop-1.0.4/src.

Create a new folder named src in the project and paste the examples folder into it.

If you hit "Error: Could not find or load main class", right-click the src folder --> Build Path --> Use as Source Folder.

2. Copy hadoop-1.0.4-eclipse-plugin.jar into the eclipse/plugins directory, then restart Eclipse.

3. Set the Hadoop installation directory in Eclipse and configure the Hadoop location.


4. Attach the Hadoop source code to the project so you can browse the Hadoop source freely.

5. Java heap space error when running the job:

java.lang.OutOfMemoryError: Java heap space
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:949)
    at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:674)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:756)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)

The allocation that fails is in MapTask$MapOutputBuffer (MapTask.java:949 in the trace), where the in-memory sort buffer is sized from io.sort.mb:

int maxMemUsage = sortmb << 20;
int recordCapacity = (int)(maxMemUsage * recper);
recordCapacity -= recordCapacity % RECSIZE;
kvbuffer = new byte[maxMemUsage - recordCapacity];

So we should tune the value of io.sort.mb to avoid this.

The machines I run on are quite low-spec: three nodes, each with 512 MB of memory.

I did not set this parameter in core-site.xml; for this job I set it directly in the job's driver code:

conf.set("io.sort.mb","10");
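For a rough sense of the numbers, here is a minimal standalone sketch of the same arithmetic (not Hadoop code; 100 is the usual io.sort.mb default, and 0.05 is an assumed value for the record fraction recper):

// Reproduces the MapOutputBuffer sizing above for two io.sort.mb values.
public class SortBufferMath {
    public static void main(String[] args) {
        for (int sortmb : new int[] {100, 10}) {
            int maxMemUsage = sortmb << 20;                    // sortmb megabytes, in bytes
            int recordCapacity = (int) (maxMemUsage * 0.05f);  // assumed recper = 0.05
            byte[] kvbuffer = new byte[maxMemUsage - recordCapacity];
            System.out.println("io.sort.mb=" + sortmb + " -> kvbuffer of "
                    + kvbuffer.length / (1024 * 1024) + " MB");
        }
    }
}

With the default of 100, the map task tries to allocate a buffer of roughly 95 MB up front, which easily exceeds the heap available when the job runs in-process on a 512 MB node; with io.sort.mb set to 10 the allocation drops to under 10 MB.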

6. Sample test data for WordCount:

10
9
8
7
6
5
4
3
2
1
line1
line3
line2
line5
Line4
 
The output file after running is:
 
1        1
10        1
2        1
3        1
4        1
5        1
6        1
7        1
8        1
9        1
line1        2
line2        2
line3        2
line4        2
line5        2
line6        1

There is also a file named _SUCCESS, which indicates that the job completed successfully.

You can see that the output file is sorted. The sort order depends on the type of the key; in our WordCount example the key is a Text (string) key, so keys are compared lexicographically, which is why "10" appears between "1" and "2".
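A quick standalone check of that ordering (not part of the job; Text compares its bytes lexicographically):

import org.apache.hadoop.io.Text;

public class TextOrderCheck {
    public static void main(String[] args) {
        // Both results are negative, i.e. "1" < "10" < "2" in Text's byte-wise order.
        System.out.println(new Text("1").compareTo(new Text("10")));
        System.out.println(new Text("10").compareTo(new Text("2")));
    }
}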

7. The WordCount example does not handle the case where the output directory already exists. To make testing easier, we add the following code to handle the directory:

Path outPath = new Path(args[1]);
FileSystem dfs = FileSystem.get(outPath.toUri(), conf);
if (dfs.exists(outPath)) {
    dfs.delete(outPath, true);
}

8. Why are the WordCount demo's mapper and reducer classes both static? Do they have to be?

If we remove the static keyword from the mapper class and run the job, we get an exception like the following:

java.lang.RuntimeException: java.lang.NoSuchMethodException: org.apache.hadoop.examples.WordCount$TokenizerMapper.<init>()
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:115)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:719)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Caused by: java.lang.NoSuchMethodException: org.apache.hadoop.examples.WordCount$TokenizerMapper.<init>()
    at java.lang.Class.getConstructor0(Class.java:2730)

At this point the mapper has become a non-static inner class of the WordCount class. The reflection helper (ReflectionUtils) cannot find a no-argument constructor for it, so it cannot instantiate it.

One fix is to turn the mapper from an inner class into a top-level class: move it out of WordCount, either into the same file above it or into a separate file. The job then runs fine.

As you can see, the example is kept as simple as possible by putting everything in one class, and marking the nested classes static is enough to make that work. If your mapper and reducer are not especially complex, this design is perfectly reasonable; if they are complex, it is better to pull each one out into its own class.
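A minimal standalone sketch of the underlying Java behaviour (not Hadoop code; the class names here are made up for illustration):

import java.lang.reflect.Constructor;

public class InnerClassReflectionDemo {
    class NonStaticMapper {}        // inner class: its only constructor takes the enclosing instance
    static class StaticMapper {}    // static nested class: has a real no-argument constructor

    public static void main(String[] args) throws Exception {
        // This is essentially what ReflectionUtils.newInstance does, and it succeeds:
        StaticMapper ok = StaticMapper.class.getDeclaredConstructor().newInstance();
        System.out.println("instantiated " + ok.getClass().getSimpleName());

        // This throws NoSuchMethodException, matching the exception above:
        Constructor<NonStaticMapper> c = NonStaticMapper.class.getDeclaredConstructor();
    }
}

A static nested class has a genuine no-argument constructor, so the framework can create it reflectively; a non-static inner class's constructor implicitly requires the enclosing WordCount instance, which is exactly the NoSuchMethodException shown above.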

9. By default, when we debug or run directly inside Eclipse, the job does not execute on the Hadoop cluster; it is simulated inside the local process, which makes debugging convenient. You can see output mentioning LocalJobRunner in the console, rather than the JobTracker executing the job.

This is also why, even when we set the number of reduce tasks to more than 1, we still see only a single part-0000-style file in the output directory: LocalJobRunner supports only one reducer.

It is very handy to be able to finish the code here and then run it on the cluster as if it had been submitted normally; some mistakes in a distributed program simply do not show up until the code actually runs on the cluster (there are many things to watch out for when writing distributed applications).

Submitting to the cluster essentially means packaging your code into a jar file and submitting that jar, so that is what needs to be done here.

I use spork's EJob class to do this; if you prefer, you can write your own by following http://www.cnblogs.com/spork/archive/2010/04/21/1717592.html.

Follow that article and then make a few adjustments in the driver code, as in the sketch below and in the full source in section 11.
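For reference, a hedged sketch of those driver-side adjustments (adapted from the full source in section 11; EJob comes from spork's article, and namenode:9001 stands in for your own JobTracker address):

import java.io.File;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapreduce.Job;

public class ClusterSubmitSketch {
    public static void main(String[] args) throws Exception {
        File jarFile = EJob.createTempJar("bin");             // package the compiled classes in bin/ into a temp jar
        EJob.addClasspath("/home/hadoop/hadoop-1.0.4/conf");  // pick up the cluster configuration
        Thread.currentThread().setContextClassLoader(EJob.getClassLoader());

        Configuration conf = new Configuration();
        conf.set("mapred.job.tracker", "namenode:9001");      // submit to the JobTracker; leave unset to stay on LocalJobRunner

        Job job = new Job(conf, "word count");
        ((JobConf) job.getConfiguration()).setJar(jarFile.toString()); // ship the temp jar with the job
        // ... set mapper, reducer, input and output paths as in section 11, then job.waitForCompletion(true)
    }
}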

10. What if I want words whose first letter is below 'N' to be handled by the first reduce task, and all other words to be output by the second reduce task?

Write your own partitioner class. The default is HashPartitioner; we implement a simple one of our own (the MyPartitioner class in the source below) and set it on the job with setPartitionerClass, together with setNumReduceTasks(2).

11. The complete, modified WordCount source code:

package org.apache.hadoop.examples;
 
import java.io.File;
import java.io.IOException;
import java.util.StringTokenizer;
 
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
  
class TokenizerMapper
        extends Mapper<Object, Text, Text, IntWritable> {

    private final IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @SuppressWarnings("unused")
    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
        if (false) {
            // the original StringTokenizer version, kept here but disabled
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        } else {
            // split on whitespace, strip non-word characters, and upper-case the token
            String s = value.toString();
            String[] words = s.split("\\s+");
            for (int i = 0; i < words.length; i++) {
                words[i] = words[i].replaceAll("[^\\w]", "");
                // System.out.println(words[i]);
                word.set(words[i].toUpperCase());
                if (words[i].length() > 0)
                    context.write(word, one);
            }
        }
    }
}
 
public class WordCount {
 
  // Words whose first (upper-cased) letter is below 'N' go to reducer 0; all others go to reducer 1.
  public static class MyPartitioner<K, V> extends Partitioner<K, V> {

      public int getPartition(K key, V value, int numReduceTasks) {
          if (key.toString().toUpperCase().charAt(0) < 'N') return 0;
          else return 1;
      }
  }
 
  public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();
 
    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
 
  public static void main(String[] args) throws Exception {
     
    File jarFile = EJob.createTempJar("bin");
    EJob.addClasspath("/home/hadoop/hadoop-1.0.4/conf");
    //conf.set("mapred.job.tracker","namenode:9001");
    ClassLoader classLoader = EJob.getClassLoader();
    Thread.currentThread().setContextClassLoader(classLoader);
     
     
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: wordcount <in> <out>");
      System.exit(2);
    }
    //drop output directory if exists
    Path outPath = new Path(args[1]);
    FileSystem dfs = FileSystem.get(outPath.toUri(), conf);
    if (dfs.exists(outPath)) {
        dfs.delete(outPath, true);
    }
     
    conf.set("io.sort.mb","10");
    Job job = new Job(conf, "word count");
     
    ((JobConf) job.getConfiguration()).setJar(jarFile.toString());
    job.setNumReduceTasks(2); // use two reduce tasks so MyPartitioner has two targets
    job.setPartitionerClass(MyPartitioner.class);
     
     
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
