DataparkSearch Engine 4.54
Reference manual
Copyright © 2003-2012 DataPark Ltd.
Copyright © 2001-2003 Lavtech.com corp.
Table of Contents
1. Introduction
1.1. DataparkSearch Features
1.2. Where to get DataparkSearch
1.3. Disclaimer
1.4. Authors
1.4.1. Contributors
2. Installation
2.1. SQL database requirements
2.2. Supported operating systems
2.3. Tools required for installation
2.4. Installing DataparkSearch
2.5. Possible installation problems
2.6. Quick usage tour
3. Indexing
3.1. Indexing in general
3.1.1. Configuration
3.1.2. Running indexer
3.1.3. How to create SQL table structure
3.1.4. How to drop SQL table structure
3.1.5. Subsection control
3.1.6. How to clear database
3.1.7. Database Statistics
3.1.8. Link validation
3.1.9. Parallel indexing
3.2. Supported HTTP response codes
3.3. Content-Encoding support
3.4. Stopwords
3.4.1. StopwordFile command
3.4.2. Format of stopword file
3.4.3. FillDictionary command
3.4.4. StopwordsLoose command
3.5. Clones
3.5.1. DetectClones command
3.6. Specifying WEB space to be indexed
3.6.1. Server command
3.6.2. Realm command
3.6.3. Subnet command
3.6.4. Using different parameters for server and its subsections
3.6.5. Default indexer behavior
3.6.6. Using indexer -f <filename>
3.6.7. URL command
3.6.8. ServerDB, RealmDB, SubnetDB and URLDB commands
3.6.9. ServerFile, RealmFile, SubnetFile and URLFile commands
3.6.10. Robots exclusion standard
3.7. Aliases
3.7.1. Alias indexer.conf command
3.7.2. Different aliases for server parts
3.7.3. Using aliases in Server commands
3.7.4. Using aliases in Realm commands
3.7.5. AliasProg command
3.7.6. ReverseAlias command
3.7.7. ReverseAliasProg command
3.7.8. Alias command in search.htm search template
3.8. Servers Table
3.8.1. Loading servers table
3.8.2. Servers table structure
3.8.3. Flushing Servers Table
3.9. External parsers
3.9.1. Supported parser types
3.9.2. Setting up parsers
3.9.3. Avoid indexer hang on parser execution
3.9.4. Pipes in parser's command line
3.9.5. Charsets and parsers
3.9.6. DPS_URL environment variable
3.9.7. Some third-party parsers
3.9.8. libextractor library
3.10. Other commands used in indexer.conf
3.10.1. Include command
3.10.2. DBAddr command
3.10.3. VarDir command
3.10.4. NewsExtensions command
3.10.5. SyslogFacility command
3.10.6. Word length commands
3.10.7. MaxDocSize command
3.10.8. MinDocSize command
3.10.9. IndexDocSizeLimit command
3.10.10. URLSelectCacheSize command
3.10.11. URLDumpCacheSize command
3.10.12. UseCRC32URLId command
3.10.13. HTTPHeader command
3.10.14. Allow command
3.10.15. Disallow command
3.10.16. CheckOnly command
3.10.17. HrefOnly command
3.10.18. CheckMp3 command
3.10.19. CheckMp3Only command
3.10.20. IndexIf command
3.10.21. NoIndexIf command
3.10.22. HoldBadHrefs command
3.10.23. DeleteOlder command
3.10.24. UseRemoteContentType command
3.10.25. AddType command
3.10.26. Period command
3.10.27. PeriodByHops command
3.10.28. ExpireAt command
3.10.29. UseDateHeader command
3.10.30. MaxHops command
3.10.31. TrackHops command
3.10.32. MaxDepth command
3.10.33. MaxDocsPerServer command
3.10.34. MaxHrefsPerServer command
3.10.35. MaxNetErrors command
3.10.36. ReadTimeOut command
3.10.37. DocTimeOut command
3.10.38. NetErrorDelayTime command
3.10.39. Cookies command
3.10.40. Section command
3.10.41. HrefSection command
3.10.42. FastHrefCheck command
3.10.43. Index command
3.10.44. ProxyAuthBasic command
3.10.45. Proxy command
3.10.46. AuthBasic command
3.10.47. ServerWeight command
3.10.48. OptimizeAtUpdate command
3.10.49. SkipUnreferred command
3.10.50. Bind command
3.10.51. ProvideReferer command
3.10.52. LongestTextItems command
3.10.53. MakePrefixes command
3.11. Extended indexing features
3.11.1. Indexing SQL database tables (htdb: virtual URL scheme)
3.11.2. Indexing binaries output (exec: and cgi: virtual URL schemes)
3.11.3. Mirroring
3.11.4. Data acquisition
3.12. Using syslog
3.13. Storing compressed document copies
3.13.1. Configure stored
3.13.2. How stored works
3.13.3. Using stored during search
3.13.4. Document excerpts
4. DataparkSearch HTML parser
4.1. Tag parser
4.2. Special characters
4.3. META tags
4.4. Links
4.5. Comments
4.6. Body patterns
4.7. Sub-documents
5. Storing data
5.1. SQL storage types
5.1.1. General storage information
5.1.2. Various modes of words storage
5.1.3. Storage mode - single
5.1.4. Storage mode - multi
5.1.5. Storage mode - crc
5.1.6. Storage mode - crc-multi
5.1.7. SQL structure notes
5.1.8. Additional features of non-CRC storage modes
5.2. Cache mode storage
5.2.1. Introduction
5.2.2. Cache mode word indexes structure
5.2.3. Cache mode tools
5.2.4. Starting cache mode
5.2.5. Optional usage of several splitters
5.2.6. Using run-splitter script
5.2.7. Doing search
5.2.8. Using search limits
5.3. DataparkSearch performance issues
5.3.1. searchd usage recommendation
5.3.2. Search results caching
5.3.3. Memory based filesystem (mfs) usage recommendation
5.3.4. URLInfoSQL command
5.3.5. SRVInfoSQL command
5.3.6. MarkForIndex command
5.3.7. CheckInsertSQL command
5.3.8. MySQL performance
5.3.9. Post-indexing optimization
5.3.10. Asynchronous resolver library
5.4. SearchD support
5.4.1. Why use searchd
5.4.2. Starting searchd
5.5. Oracle notes
5.5.1. Introduction
5.5.2. Compilation, Installation and Configuration
6. Subsections
6.1. Tags
6.1.1. Tag command
6.1.2. TagIf command
6.1.3. Tags in SQL version
6.2. Categories
6.2.1. Category command
6.2.2. CategoryIf command
6.2.3. Loading categories table
6.2.4. FlushCategoryTable command
7. Languages support
7.1. Character sets
7.1.1. Supported character sets
7.1.2. Character sets aliases
7.1.3. Recoding
7.1.4. Recoding at search time
7.1.5. Document charset detection
7.1.6. Automatic charset guesser
7.1.7. Default charset
7.1.8. Default Language
7.1.9. LocalCharset command
7.1.10. ForceIISCharset1251 command
7.1.11. RemoteCharset command
7.1.12. URLCharset command
7.1.13. CharsToEscape command
7.2. Making multi-language search pages
7.2.1. How does it work?
7.2.2. Possible troubles
7.3. Segmenters for Chinese, Japanese, Korean and Thai languages
7.3.1. Japanese language phrase segmenter
7.3.2. Chinese language phrase segmenter
7.3.3. Thai language phrase segmenter
7.3.4. Korean language phrase segmenter
7.4. Multilingual servers support
8. Searching documents
8.1. Using search front-ends
8.1.1. Performing search
8.1.2. Search parameters
8.1.3. Changing different document parts weights at search time
8.1.4. Using front-end with an shtml page
8.1.5. Using several templates
8.1.6. Search operators
8.1.7. Advanced boolean search
8.1.8. The Verity Query Language, VQL
8.1.9. How search handles expired documents
8.2. mod_dpsearch module for Apache httpd
8.2.1. Why use mod_dpsearch
8.2.2. Configuring mod_dpsearch
8.3. How to write search result templates
8.3.1. Template sections
8.3.2. Variables section
8.3.3. Includes in templates
8.3.4. Conditional template operators
8.3.5. Security issues
8.4. Designing search.html
8.4.1. How the results page is created
8.4.2. Your HTML
8.4.3. Forms considerations
8.4.4. Relative links in search.htm
8.4.5. Adding Search form to other pages
8.5. Relevance
8.5.1. Ordering documents
8.5.2. Relevance calculation
8.5.3. Popularity rank
8.5.4. Boolean search
8.5.5. Crosswords
8.5.6. The Summary Extraction Algorithm (SEA)
8.6. Search queries tracking
8.7. Search results cache
8.8. Fuzzy search
8.8.1. Ispell
8.8.2. Aspell
8.8.3. Synonyms
8.8.4. Accent insensitive search
8.8.5. Acronyms and abbreviations
9. Miscellaneous
9.1. Reporting bugs
9.1.1. Currently known bugs
9.1.2. Core dump reports
9.2. Using libdpsearch library
9.2.1. dps-config script
9.2.2. DataparkSearch API
9.3. Database schema
A. Donations
Index
List of Tables
3-1. Relationship between libextractor's keyword types and DataparkSearch section names
3-2. Verbose levels
5-1. Cache mode predefined limit types
5-2. SQL-based cache mode limit types
7-1. Language groups
7-2. Charsets aliases
8-1. Available search parameters
8-2. VQL operators supported by DataparkSearch
8-3. Configure-time parameters to tune relevance calculation (switches for configure)
9-1. server table schema
9-2. Several server's parameters values in srvinfo table