Heritrix 3.x Main Configuration File (crawler-beans.cxml) Explained - 白红宇

Published: 2019-06-07

<?xml version="1.0" encoding="UTF-8"?>

<!--

HERITRIX 3 CRAWL JOB CONFIGURATION FILE

This is a relatively minimal configuration suitable for many crawls.

Commented-out beans and properties are provided as an example; values
shown in comments reflect the actual defaults which are in effect
if not otherwise specified. (To change from the default
behavior, uncomment AND alter the shown values.)

-->

<beans xmlns="http://www.springframework.org/schema/beans"

xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

xmlns:context="http://www.springframework.org/schema/context"

xmlns:aop="http://www.springframework.org/schema/aop"

xmlns:tx="http://www.springframework.org/schema/tx"

xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans-3.0.xsd

http://www.springframework.org/schema/aop http://www.springframework.org/schema/aop/spring-aop-3.0.xsd

http://www.springframework.org/schema/tx http://www.springframework.org/schema/tx/spring-tx-3.0.xsd

http://www.springframework.org/schema/context http://www.springframework.org/schema/context/spring-context-3.0.xsd">

<context:annotation-config/>

<!--

OVERRIDES

Values elsewhere in the configuration may be replaced ('overridden')

by a Properties map declared in a PropertyOverrideConfigurer,

using a dotted-bean-path to address individual bean properties.

This allows us to collect a few of the most-often changed values

in an easy-to-edit format here at the beginning of the model

configuration.

-->
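<!--
EXAMPLE (an illustrative sketch, not part of the original profile):
each override line has the form beanId.propertyName=value, where beanId
is the id of a bean declared in this file. Assuming the crawlController
bean declared further down and its maxToeThreads and pauseAtStart
properties, lines such as the following could be added to the Properties
map that follows. The values shown are placeholders, not recommendations.

crawlController.maxToeThreads=25
crawlController.pauseAtStart=true
-->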

<bean id="simpleOverrides" class="org.springframework.beans.factory.config.PropertyOverrideConfigurer">
 <property name="properties">
  <value>
# This Properties map is specified in the Java 'property list' text format
# http://java.sun.com/javase/6/docs/api/java/util/Properties.html#load%28java.io.Reader%29

metadata.operatorContactUrl=http://127.0.0.1/
metadata.jobName=basic
metadata.description=Basic crawl starting with useful defaults

##..more?..##
  </value>
 </property>
</bean>

<!-- overrides from declared <prop> elements, more easily allowing

multiline values or even declared beans -->

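<bean id="longerOverrides" class="org.springframework.beans.factory.config.PropertyOverrideConfigurer">
 <property name="properties">
  <props>
   <prop key="seeds.textSource.value">
# URLS HERE
# Enter the seed URLs to crawl here, one per line.
http://example.example
   </prop>
  </props>
 </property>
</bean>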

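<!-- CRAWL METADATA: including identification of crawler/operator -->
<bean id="metadata" class="org.archive.modules.CrawlMetadata" autowire="byName">
 <property name="operatorContactUrl" value="[see override above]"/>
 <property name="jobName" value="[see override above]"/>
 <property name="description" value="[see override above]"/>
 <!-- <property name="userAgentTemplate"
       value="Mozilla/5.0 (compatible; heritrix/@VERSION@ +@OPERATOR_CONTACT_URL@)"/> -->
</bean>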

<!-- SEEDS: crawl starting points

ConfigString allows simple, inline specification of a moderate

number of seeds; see below comment for example of using an

arbitrarily-large external file. -->

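<bean id="seeds" class="org.archive.modules.seeds.TextSeedModule">
 <property name="textSource">
  <bean class="org.archive.spring.ConfigString">
   <property name="value">
    <value>
# [see override above]
    </value>
   </property>
  </bean>
 </property>
</bean>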

<!-- SEEDS ALTERNATE APPROACH: specifying external seeds.txt file in

the job directory, similar to the H1 approach.

Use either the above, or this, but not both. -->

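<!--
<bean id="seeds" class="org.archive.modules.seeds.TextSeedModule">
 <property name="textSource">
  <bean class="org.archive.spring.ConfigFile">
   <property name="path" value="seeds.txt" />
  </bean>
 </property>
</bean>
-->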

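<bean id="acceptSurts" class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule">
 <!-- <property name="surtsSource">
       <bean class="org.archive.spring.ConfigString">
        <property name="value">
         <value>
          # example.com
          # http://www.example.edu/path1/
          # +http://(org,example,
         </value>
        </property>
       </bean>
      </property> -->
</bean>

<!-- SCOPE: rules for which discovered URIs to crawl; order is very
     important because last decision returned other than 'NONE' wins. -->
<bean id="scope" class="org.archive.modules.deciderules.DecideRuleSequence">
 <property name="rules">
  <list>
   <!-- Begin by REJECTing all... -->
   <bean class="org.archive.modules.deciderules.RejectDecideRule"/>
   <!-- ...then ACCEPT those within configured/seed-implied SURT prefixes... -->
   <ref bean="acceptSurts"/>
   <!-- ...but REJECT those more than a configured link-hop-count from start... -->
   <bean class="org.archive.modules.deciderules.TooManyHopsDecideRule">
   </bean>
   <!-- ...but ACCEPT embedded (transcluded) resources of accepted pages... -->
   <bean class="org.archive.modules.deciderules.TransclusionDecideRule">
   </bean>
   <!-- ...but REJECT those from a configurable (initially empty) set of REJECT SURTs... -->
   <bean class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule">
    <property name="decision" value="REJECT"/>
    <property name="seedsAsSurtPrefixes" value="false"/>
    <property name="surtsDumpFile" value="${launchId}/negative-surts.dump"/>
    <!-- <property name="surtsSource">
          <bean class="org.archive.spring.ConfigFile">
           <property name="path" value="negative-surts.txt" />
          </bean>
         </property> -->
   </bean>
   <!-- ...and REJECT those from a configurable (initially empty) set of URI regexes... -->
   <bean class="org.archive.modules.deciderules.MatchesListRegexDecideRule">
    <property name="decision" value="REJECT"/>
    <!-- <property name="regexList">
          <list>
          </list>
         </property> -->
   </bean>
   <!-- ...and REJECT those with suspicious repeating path-segments... -->
   <bean class="org.archive.modules.deciderules.PathologicalPathDecideRule">
   </bean>
   <!-- ...and REJECT those with more than threshold number of path-segments... -->
   <bean class="org.archive.modules.deciderules.TooManyPathSegmentsDecideRule">
   </bean>
   <!-- ...but always ACCEPT those marked as prerequisite for another URI... -->
   <bean class="org.archive.modules.deciderules.PrerequisiteAcceptDecideRule">
   </bean>
   <!-- ...but always REJECT those with unsupported URI schemes -->
   <bean class="org.archive.modules.deciderules.SchemeNotInSetDecideRule">
   </bean>
  </list>
 </property>
</bean>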
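<!--
EXAMPLE (an illustrative sketch, not part of the original profile):
because the last decision other than 'NONE' wins, a rule placed at the
END of the scope rules list overrides every ACCEPT made earlier. Assuming
the stock MatchesRegexDecideRule class with its 'decision' and 'regex'
properties, a hypothetical rule that drops every URI ending in .pdf could
be added as the final entry of the list. The regex is only an example and
must match the whole URI.

<bean class="org.archive.modules.deciderules.MatchesRegexDecideRule">
 <property name="decision" value="REJECT"/>
 <property name="regex" value=".*\.pdf$"/>
</bean>

Placed at the START of the list instead, the same rule would have no
lasting effect on any URI that a later rule ACCEPTs.
-->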

<!--

PROCESSING CHAINS

Much of the crawler's work is specified by the sequential

application of swappable Processor modules. These Processors

are collected into three 'chains'. The CandidateChain is applied

to URIs being considered for inclusion, before a URI is enqueued

for collection. The FetchChain is applied to URIs when their

turn for collection comes up. The DispositionChain is applied

after a URI is fetched and analyzed/link-extracted.

-->

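<!-- CANDIDATE CHAIN -->
<!-- first, processors are declared as top-level named beans -->
<bean id="candidateScoper" class="org.archive.crawler.prefetch.CandidateScoper">
</bean>
<bean id="preparer" class="org.archive.crawler.prefetch.FrontierPreparer">
 <!-- <property name="canonicalizationPolicy">
       <ref bean="canonicalizationPolicy" />
      </property> -->
 <!-- <property name="queueAssignmentPolicy">
       <ref bean="queueAssignmentPolicy" />
      </property> -->
 <!-- <property name="uriPrecedencePolicy">
       <ref bean="uriPrecedencePolicy" />
      </property> -->
 <!-- <property name="costAssignmentPolicy">
       <ref bean="costAssignmentPolicy" />
      </property> -->
</bean>
<!-- now, processors are assembled into ordered CandidateChain bean -->
<bean id="candidateProcessors" class="org.archive.modules.CandidateChain">
 <property name="processors">
  <list>
   <!-- apply scoping rules to each individual candidate URI... -->
   <ref bean="candidateScoper"/>
   <!-- ...then prepare those ACCEPTed to be enqueued to frontier. -->
   <ref bean="preparer"/>
  </list>
 </property>
</bean>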

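<!-- FETCH CHAIN -->
<!-- first, processors are declared as top-level named beans -->
<bean id="preselector" class="org.archive.crawler.prefetch.Preselector">
</bean>
<bean id="preconditions" class="org.archive.crawler.prefetch.PreconditionEnforcer">
</bean>
<bean id="fetchDns" class="org.archive.modules.fetcher.FetchDNS">
</bean>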

<!-- <bean id="fetchWhois" class="org.archive.modules.fetcher.FetchWhois">

<map>

</map>

</property>

</bean> -->

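<bean id="fetchHttp" class="org.archive.modules.fetcher.FetchHTTP">
 <!-- <property name="shouldFetchBodyRule">
       <bean class="org.archive.modules.deciderules.AcceptDecideRule"/>
      </property> -->
 <!-- <property name="acceptHeaders">
       <list>
        <value>Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8</value>
       </list>
      </property>
 -->
</bean>
<bean id="extractorHttp" class="org.archive.modules.extractor.ExtractorHTTP">
</bean>
<bean id="extractorHtml" class="org.archive.modules.extractor.ExtractorHTML">
</bean>
<bean id="extractorCss" class="org.archive.modules.extractor.ExtractorCSS">
</bean>
<bean id="extractorJs" class="org.archive.modules.extractor.ExtractorJS">
</bean>
<bean id="extractorSwf" class="org.archive.modules.extractor.ExtractorSWF">
</bean>
<!-- now, processors are assembled into ordered FetchChain bean -->
<bean id="fetchProcessors" class="org.archive.modules.FetchChain">
 <property name="processors">
  <list>
   <ref bean="preselector"/>
   <ref bean="preconditions"/>
   <ref bean="fetchDns"/>
   <ref bean="fetchHttp"/>
   <ref bean="extractorHttp"/>
   <ref bean="extractorHtml"/>
   <ref bean="extractorCss"/>
   <ref bean="extractorJs"/>
   <ref bean="extractorSwf"/>
  </list>
 </property>
</bean>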

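<!-- DISPOSITION CHAIN -->
<!-- first, processors are declared as top-level named beans -->
<bean id="warcWriter" class="org.archive.modules.writer.WARCWriterProcessor">
 <!-- <property name="storePaths">
       <list>
        <value>warcs</value>
       </list>
      </property> -->
</bean>
<bean id="candidates" class="org.archive.crawler.postprocessor.CandidatesProcessor">
</bean>
<bean id="disposition" class="org.archive.crawler.postprocessor.DispositionProcessor">
</bean>
<!-- <bean id="rescheduler" class="org.archive.crawler.postprocessor.ReschedulingProcessor">
     </bean> -->
<!-- now, processors are assembled into ordered DispositionChain bean -->
<bean id="dispositionProcessors" class="org.archive.modules.DispositionChain">
 <property name="processors">
  <list>
   <!-- write to aggregate archival files... -->
   <ref bean="warcWriter"/>
   <!-- ...send each outlink candidate URI to CandidateChain,
        and enqueue those ACCEPTed to the frontier... -->
   <ref bean="candidates"/>
   <!-- ...then update stats, shared-structures, frontier decisions -->
   <ref bean="disposition"/>
  </list>
 </property>
</bean>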

<bean id="crawlController"

class="org.archive.crawler.framework.CrawlController">

</bean>

<bean id="frontier"

class="org.archive.crawler.frontier.BdbFrontier">

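<!-- <property name="queuePrecedencePolicy">
      <bean class="org.archive.crawler.frontier.precedence.BaseQueuePrecedencePolicy" />
     </property> -->
<!-- <property name="outbound">
      <bean class="java.util.concurrent.ArrayBlockingQueue">
       <constructor-arg value="200"/>
       <constructor-arg value="true"/>
      </bean>
     </property> -->
<!-- <property name="inbound">
      <bean class="java.util.concurrent.ArrayBlockingQueue">
       <constructor-arg value="40000"/>
       <constructor-arg value="true"/>
      </bean>
     </property> -->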

</bean>

<bean id="uriUniqFilter"

class="org.archive.crawler.util.BdbUriUniqFilter">

</bean>

<!--

EXAMPLE SETTINGS OVERLAY SHEETS

Sheets allow some settings to vary by context - usually by URI context,

so that different sites or sections of sites can be treated differently.

Here are some example Sheets for common purposes. The SheetOverlaysManager

(below) automatically collects all Sheet instances declared among the

original beans, but others can be added during the crawl via the scripting

interface.

-->

<!-- forceRetire: any URI to which this sheet's settings are applied

will force its containing queue to 'retired' status. -->

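<bean id='forceRetire' class='org.archive.spring.Sheet'>
 <property name='map'>
  <map>
   <entry key='disposition.forceRetire' value='true'/>
  </map>
 </property>
</bean>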

<!-- smallBudget: any URI to which this sheet's settings are applied

will give its containing queue small values for balanceReplenishAmount

(causing it to have shorter 'active' periods while other queues are

waiting) and queueTotalBudget (causing the queue to enter 'retired'

status once that expenditure is reached by URI attempts and errors) -->

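<bean id='smallBudget' class='org.archive.spring.Sheet'>
 <property name='map'>
  <map>
   <entry key='frontier.balanceReplenishAmount' value='20'/>
   <entry key='frontier.queueTotalBudget' value='100'/>
  </map>
 </property>
</bean>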

<!-- veryPolite: any URI to which this sheet's settings are applied

will cause its queue to take extra-long politeness snoozes -->

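<bean id='veryPolite' class='org.archive.spring.Sheet'>
 <property name='map'>
  <map>
   <entry key='disposition.delayFactor' value='10'/>
   <entry key='disposition.minDelayMs' value='10000'/>
   <entry key='disposition.maxDelayMs' value='1000000'/>
   <entry key='disposition.respectCrawlDelayUpToSeconds' value='3600'/>
  </map>
 </property>
</bean>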

<!-- highPrecedence: any URI to which this sheet's settings are applied

will give its containing queue a slightly-higher than default

queue precedence value. That queue will then be preferred over

other queues for active crawling, never waiting behind lower-

precedence queues. -->

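<bean id='highPrecedence' class='org.archive.spring.Sheet'>
 <property name='map'>
  <map>
   <entry key='frontier.queuePrecedencePolicy.basePrecedence' value='1'/>
  </map>
 </property>
</bean>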

<!--

EXAMPLE SETTINGS OVERLAY SHEET-ASSOCIATION

A SheetAssociation says certain URIs should have certain overlay Sheets

applied. This example applies two sheets to URIs matching two SURT-prefixes.

New associations may also be added mid-crawl using the scripting facility.

-->

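<!--
<bean class='org.archive.crawler.spring.SurtPrefixesSheetAssociation'>
 <property name='surtPrefixes'>
  <list>
   <value>http://(org,example,</value>
   <value>http://(com,example,www,)/</value>
  </list>
 </property>
 <property name='targetSheetNames'>
  <list>
   <value>veryPolite</value>
   <value>smallBudget</value>
  </list>
 </property>
</bean>
-->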

<!--

OPTIONAL BUT RECOMMENDED BEANS

-->

<!-- ACTIONDIRECTORY: disk directory for mid-crawl operations

Running job will watch directory for new files with URIs,

scripts, and other data to be processed during a crawl. -->

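<bean id="actionDirectory" class="org.archive.crawler.framework.ActionDirectory">
</bean>

<!-- CRAWLLIMITENFORCER: stops crawl when it reaches configured limits -->
<bean id="crawlLimiter" class="org.archive.crawler.framework.CrawlLimitEnforcer">
</bean>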

<bean id="checkpointService"

class="org.archive.crawler.framework.CheckpointService">

</bean>

<!--

OPTIONAL BEANS

Uncomment and expand as needed, or if non-default alternate

implementations are preferred.

-->

<!--

<bean id="canonicalizationPolicy"

class="org.archive.modules.canonicalize.RulesCanonicalizationPolicy">

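<property name="rules">
 <list>
  <bean class="org.archive.modules.canonicalize.LowercaseRule" />
  <bean class="org.archive.modules.canonicalize.StripUserinfoRule" />
  <bean class="org.archive.modules.canonicalize.StripWWWNRule" />
  <bean class="org.archive.modules.canonicalize.StripSessionIDs" />
  <bean class="org.archive.modules.canonicalize.StripSessionCFIDs" />
  <bean class="org.archive.modules.canonicalize.FixupQueryString" />
 </list>
</property>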

</bean>

-->

<!--

<bean id="queueAssignmentPolicy"

class="org.archive.crawler.frontier.SurtAuthorityQueueAssignmentPolicy">

</bean>

-->

<!--

<bean id="uriPrecedencePolicy"

class="org.archive.crawler.frontier.precedence.CostUriPrecedencePolicy">

</bean>

-->

<!--

<bean id="costAssignmentPolicy"

class="org.archive.crawler.frontier.UnitCostAssignmentPolicy">

</bean>

-->

<!--

<bean id="credentialStore"

class="org.archive.modules.credential.CredentialStore">

</bean>

-->

<!-- DISK SPACE MONITOR:

Pauses the crawl if disk space at monitored paths falls below minimum threshold -->

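<!--
<bean id="diskSpaceMonitor" class="org.archive.crawler.monitor.DiskSpaceMonitor">
 <property name="monitorPaths">
  <list>
  </list>
 </property>
</bean>
-->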

<!--

REQUIRED STANDARD BEANS

It will be very rare to replace or reconfigure the following beans.

-->

<bean id="statisticsTracker"

class="org.archive.crawler.reporting.StatisticsTracker" autowire="byName">

<!-- <property name="reports">

<list>

</list>

</property> -->

</bean>

<bean id="loggerModule"

class="org.archive.crawler.reporting.CrawlerLoggerModule">

</bean>

<!-- SHEETOVERLAYMANAGER: manager of sheets of contextual overlays

Autowired to include any SheetForSurtPrefix or

SheetForDecideRuled beans -->

<bean id="sheetOverlaysManager" autowire="byType"

class="org.archive.crawler.spring.SheetOverlaysManager">

</bean>

<bean id="bdb"

class="org.archive.bdb.BdbModule">

<!-- if neither cachePercent nor cacheSize is specified (the default), bdb
     uses its own default of 60% -->

</bean>

<bean id="cookieStorage"

class="org.archive.modules.fetcher.BdbCookieStorage">

<!-- <property name="bdb">
      <ref bean="bdb"/>
     </property> -->

</bean>

<bean id="serverCache"

class="org.archive.modules.net.BdbServerCache">

<!-- <property name="bdb">
      <ref bean="bdb"/>
     </property> -->

</bean>

<!-- CONFIG PATH CONFIGURER: required helper making crawl paths relative

to crawler-beans.cxml file, and tracking crawl files for web UI -->

<bean id="configPathConfigurer"

class="org.archive.spring.ConfigPathConfigurer">

</bean>

</beans>

Reposted from: https://www.cnblogs.com/potato1896/archive/2013/04/19/3029815.html
