JCheckLinks
a multi-threaded Java hyperlink validator
http://web.purplefrog.com/~thoth/jchecklinks/
V0.4  23-May-1999
Copyright 1999 Robert Forsman
GNU General Public License and GNU Lesser General Public License

This is a prototype release and will expire on June 27.  I command you right now to report bugs and request features or kill me!  I intend to LGPL the code as I enter beta unless somebody drives a dump truck full of money up to my door (which will be immediately preceded by a troop of monkeys exiting my butt).

The report generator is very primitive.  Fortunately, it's very easy to read the raw results and generate your own reports.

To use:

  java CheckLinks [ -nthreads n ] [ -scanners n ]
      [ -checkpoint nmin ] [ -progress nmin ]
      [ -proxy host:port ]
      [ -exact ] [ -loose ] [ -noautoinclude ]
      [ -rewriteurl from to ]
      [ -include URLprefix ]* [ -includeexact URLprefix ]*
      [ -exclude URLprefix ]* [ -excludeexact URLprefix ]*
      URL1 [ URL2 ... ]

Output is the two files ./references and ./statuses

Example:

  java CheckLinks -nthreads 3 -exact http://web.ortge.ufl.edu/contact.html

JCheckLinks probes all the URLs on the command line to make sure they're valid.  It then follows hyperlinks to scan the tree specified by -include, -exclude, -includeexact, -excludeexact, and the URLs on the command line as affected by -exact, -loose, and -noautoinclude.  For the URLs that are valid, in the include list, and have Content-Type text/html, the link checker downloads the document and scans it for more links.  The new URLs are then checked, and if they are -included and not -excluded they are harvested for more URLs, and the process continues.

To determine whether a document is -included or -excluded, the URL is first compared against all -includeexact and -excludeexact arguments for an exact match.  Otherwise the longest -include or -exclude prefix that matches the beginning of the URL applies.  (The case of the hostname and protocol is unimportant.)

-nthreads determines how many link-checking threads there will be in total (default 1).  -scanners determines how many of those will be harvesting threads (default 2); the rest are probers only.  Too many threads will cause network timeouts.  Too many harvesting threads will burden the CPU with forced thread context switches.  n+1 harvesters is a reasonable number for a machine with n CPUs.  n+4 total threads is a reasonable number to keep your network connection somewhat occupied.

-checkpoint creates a checkpointing thread which dumps a set of 4 checkpoint files that can be used by the -resume form of the link checker in case of a crash or user interruption.  The current code is inefficient, and writing a checkpoint can take an unreasonable amount of time (>1 minute) on large trees.

-progress governs the interval at which a progress report is printed.  The progress report is simply a list of the URLs currently assigned to link-checker threads.  The default is 30 seconds (an interval which cannot be expressed on the command line, since the option takes whole minutes).  To disable it, select 0 minutes.

-proxy specifies the location of a proxy server.  I have not tested this.

-rewriteurl is most useful for checking web trees where the files are local.  Whenever a work thread goes to check a URL, the URL is rewritten without the work thread's knowledge (so if you've rewritten ftp://ftp.ortge.ufl.edu/ to file:/home/ftp/, it will still be reported as ftp://ftp.ortge.ufl.edu/).  The first -rewriteurl rule that applies (based on a prefix match) is used, and no other rewrite rules are checked.  There is a sketch of both the include/exclude matching and the rewrite matching below.
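Here is a rough Java sketch of the two matching rules just described: exact include/exclude matches are consulted first, then the longest matching -include or -exclude prefix wins, and the first matching -rewriteurl rule is the only one applied.  This is illustrative only; the class, field, and method names are made up, and it is not the code JCheckLinks actually runs (for one thing, it skips the case-folding of the protocol and hostname).

  import java.util.*;

  /* Illustrative only -- not the JCheckLinks implementation. */
  class UrlRules {
      // -includeexact / -excludeexact URLs, consulted first for an exact match.
      Set<String> includeExact = new HashSet<>(), excludeExact = new HashSet<>();
      // -include / -exclude prefixes; the longest matching prefix wins.
      List<String> includes = new ArrayList<>(), excludes = new ArrayList<>();
      // -rewriteurl from/to pairs, applied first-match-wins.
      List<String[]> rewrites = new ArrayList<>();

      /* Should this URL be scanned at all? */
      boolean included(String url) {
          if (includeExact.contains(url)) return true;
          if (excludeExact.contains(url)) return false;
          // Otherwise the longest -include or -exclude prefix that matches
          // the beginning of the URL decides.
          return longestPrefix(includes, url) > longestPrefix(excludes, url);
      }

      private static int longestPrefix(List<String> prefixes, String url) {
          int best = -1;
          for (String p : prefixes)
              if (url.startsWith(p) && p.length() > best)
                  best = p.length();
          return best;
      }

      /* Apply the first -rewriteurl rule whose "from" prefix matches;
         later rules are never consulted. */
      String rewrite(String url) {
          for (String[] rule : rewrites)
              if (url.startsWith(rule[0]))
                  return rule[1] + url.substring(rule[0].length());
          return url;
      }
  }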
This way you can say

  -rewriteurl http://web.ortge.ufl.edu/cgi http://web.ortge.ufl.edu/cgi
  -rewriteurl http://web.ortge.ufl.edu/ file:/home/httpd/data/

and use direct file access for everything except the cgi-bin and other cgi* web directories (the first rule rewrites those URLs to themselves, so the second rule never touches them).  Of course, if you use server-parsed HTML, the contents of the file will be different from the contents of the document as served by the httpd, and you may miss some hyperlinks.

The HTML is harvested for hyperlinks from the %URI attributes of the HTML tags.  The only %URI attribute from the HTML 4.0 spec that I've ignored is
ACTION (on the FORM tag).  The last thing I need to do is make random form submissions.
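For what it's worth, here is a rough sketch of the kind of harvesting described above.  It pulls a handful of the HTML 4.0 %URI attributes out of a page with a regular expression, skipping ACTION.  JCheckLinks' own parser is more thorough than this; the sketch only handles double-quoted attribute values, and the class name, the attribute subset, and the regex approach are all assumptions made for illustration, not JCheckLinks code.

  import java.util.*;
  import java.util.regex.*;

  /* Illustrative only -- not the JCheckLinks harvester. */
  class HarvestSketch {
      // A few of the HTML 4.0 %URI attributes; FORM's ACTION is deliberately omitted.
      static final Pattern URI_ATTR = Pattern.compile(
          "\\b(?:href|src|background|cite|longdesc|usemap|codebase|data)\\s*=\\s*\"([^\"]*)\"",
          Pattern.CASE_INSENSITIVE);

      /* Return every URL found in a double-quoted %URI attribute. */
      static List<String> harvest(String html) {
          List<String> urls = new ArrayList<>();
          Matcher m = URI_ATTR.matcher(html);
          while (m.find())
              urls.add(m.group(1));
          return urls;
      }

      public static void main(String[] args) {
          String doc = "<a href=\"http://web.purplefrog.com/~thoth/\">home</a>"
                     + "<img src=\"logo.gif\">"
                     + "<form action=\"/cgi-bin/search\">";  // ACTION is ignored
          System.out.println(harvest(doc));
          // prints: [http://web.purplefrog.com/~thoth/, logo.gif]
      }
  }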