Loading Nutch Configuration Files

Nutch has three main kinds of configuration files:

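Configuration files for Nutch plugins. Each plugin loads these itself when it is loaded; they are mostly the configuration files of the filter and normalizer plugins.
Nutch's own configuration files, nutch-default.xml and nutch-site.xml.
Hadoop's configuration files, hadoop-default.xml and hadoop-site.xml.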

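The order in which these files are loaded determines their priority: a setting in a lower-priority file is overridden by the same setting in a higher-priority file. To configure Nutch well, you therefore need to understand the load order. Below I walk through the Nutch source code to see how the configuration files are actually loaded.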

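Nutch's main command is "./nutch crawl", and the main class behind the crawl command is org.apache.nutch.crawl.Crawl (source file org/apache/nutch/crawl/Crawl.java), so we start from the main method of Crawl.java.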

The configuration files are loaded mainly by the following code:

  /* Perform complete crawling and indexing given a set of root urls. */
  public static void main(String args[]) throws Exception {
    if (args.length < 1)
    {
      System.out.println("Usage: Crawl <urlDir> [-dir d] [-threads n] [-depth i] [-topN N] [-r]");
      System.out.println("-r\tremove css and javascript, default is do not remove");
      return;
    }
 
    Configuration conf = NutchConfiguration.create();
    conf.addResource("crawl-tool.xml");
    JobConf job = new NutchJob(conf);

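In the code above, "Configuration conf = NutchConfiguration.create();" creates the configuration object. NutchConfiguration is the Nutch class that manages Nutch's own configuration files, while Configuration is the Hadoop class that manages the Hadoop configuration files. Note that create returns a Configuration, not an instance of NutchConfiguration. Let's step into the create method: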

  /** Create a {@link Configuration} for Nutch. */
  public static Configuration create() {
    Configuration conf = new Configuration();
    addNutchResources(conf);
    return conf;
  }

The create method first constructs a Configuration object. The Configuration constructors are as follows:

  /** A new configuration. */
  public Configuration() {
    this(true);
  }
 
  /** A new configuration where the behavior of reading from the default 
   * resources can be turned off.
   * 
   * If the parameter {@code loadDefaults} is false, the new instance
   * will not load resources from the default files. 
   * @param loadDefaults specifies whether to load from the default files
   */
  public Configuration(boolean loadDefaults) {
    if (LOG.isDebugEnabled()) {
      LOG.debug(StringUtils.stringifyException(new IOException("config()")));
    }
    if (loadDefaults) {
      resources.add("hadoop-default.xml");
      resources.add("hadoop-site.xml");
    }
  }

As this shows, when a Configuration object is constructed, hadoop-default.xml is added first and hadoop-site.xml second, so settings in hadoop-site.xml override those in hadoop-default.xml.
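
The override behaviour can be tried out without Hadoop on the classpath. The following is a minimal sketch, not Hadoop's actual implementation: the class XmlOverrideDemo, its load method, and the demo.* property names are all made up for illustration. It assumes only the usual name/value property layout of these XML files and the rule that a resource read later overwrites a value read earlier.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical stand-in for Hadoop's Configuration; it only mimics the override order.
class XmlOverrideDemo {

  /** Parses each XML string in order; a later resource overwrites earlier values. */
  static Map<String, String> load(String... xmlResources) throws Exception {
    Map<String, String> props = new LinkedHashMap<>();
    for (String xml : xmlResources) {
      Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
      NodeList nodes = doc.getElementsByTagName("property");
      for (int i = 0; i < nodes.getLength(); i++) {
        Element p = (Element) nodes.item(i);
        String name = p.getElementsByTagName("name").item(0).getTextContent().trim();
        String value = p.getElementsByTagName("value").item(0).getTextContent().trim();
        props.put(name, value); // same key again: the later resource wins
      }
    }
    return props;
  }

  public static void main(String[] args) throws Exception {
    // Made-up property names and values, used only for illustration.
    String defaults = "<configuration>"
        + "<property><name>demo.threads</name><value>1</value></property>"
        + "<property><name>demo.tmp.dir</name><value>/tmp/demo</value></property>"
        + "</configuration>";
    String site = "<configuration>"
        + "<property><name>demo.threads</name><value>8</value></property>"
        + "</configuration>";

    Map<String, String> conf = load(defaults, site);
    System.out.println(conf.get("demo.threads")); // prints 8 (overridden by "site")
    System.out.println(conf.get("demo.tmp.dir")); // prints /tmp/demo (only in "defaults")
  }
}
```

A property that appears only in the first resource keeps its value; only keys repeated in the later resource are replaced.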
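Now that we understand how the Hadoop configuration files are loaded, let's return to the create method.
The next call is "addNutchResources(conf);", which is defined as follows: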

  /** Add the standard Nutch resources to {@link Configuration}. */
  public static Configuration addNutchResources(Configuration conf) {
    conf.addResource("nutch-default.xml");
    conf.addResource("nutch-site.xml");
    return conf;
  }

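Here it is clear that nutch-default.xml is loaded first and nutch-site.xml second.
Continuing down the main method, the next call is "conf.addResource("crawl-tool.xml");", so crawl-tool.xml is loaded last. This file is mainly used to configure crawling of a corporate intranet.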

From the brief source analysis above, we conclude that the Nutch configuration files have the following priority:

hadoop-site.xml takes priority over hadoop-default.xml
crawl-tool.xml takes priority over nutch-site.xml, and nutch-site.xml takes priority over nutch-default.xml
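
The full order traced through main, create, and addNutchResources can be summarised in a small model. This is a sketch under stated assumptions rather than Nutch or Hadoop code: LoadOrderDemo, winner, and the demo.depth property are hypothetical. It places the two Hadoop files before the Nutch files because create constructs the Configuration before calling addNutchResources, and it deliberately ignores any other mechanism the real Configuration class may have.

```java
import java.util.List;
import java.util.Map;

// Hypothetical model of the load order traced above; not Nutch or Hadoop code.
class LoadOrderDemo {

  /** Resource names in the order the Crawl main method ends up adding them. */
  static final List<String> LOAD_ORDER = List.of(
      "hadoop-default.xml", "hadoop-site.xml",
      "nutch-default.xml", "nutch-site.xml", "crawl-tool.xml");

  /** Returns the last resource in LOAD_ORDER that defines the key, or null if none does. */
  static String winner(String key, Map<String, Map<String, String>> contents) {
    String source = null;
    for (String resource : LOAD_ORDER) {
      Map<String, String> props = contents.get(resource);
      if (props != null && props.containsKey(key)) {
        source = resource; // a later resource replaces the earlier answer
      }
    }
    return source;
  }

  public static void main(String[] args) {
    // Made-up file contents: the same illustrative key defined in three files.
    Map<String, Map<String, String>> contents = Map.of(
        "nutch-default.xml", Map.of("demo.depth", "5"),
        "nutch-site.xml", Map.of("demo.depth", "3"),
        "crawl-tool.xml", Map.of("demo.depth", "10"));

    System.out.println(winner("demo.depth", contents)); // prints crawl-tool.xml
  }
}
```

Asking which file "wins" for a key is a quick way to check an override before editing nutch-site.xml or crawl-tool.xml.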
