JCheckLinks
a multi-threaded Java hyperlink validator
http://web.purplefrog.com/~thoth/jchecklinks/
V0.4  23-May-1999
Copyright 1999 Robert Forsman
GNU General Public License and GNU Lesser General Public License

This is a prototype release and will expire on June 27.  I command you right now to report bugs and request features or kill me!  I intend to LGPL the code as I enter beta unless somebody drives a dump truck full of money up to my door (which will be immediately preceded by a troop of monkeys exiting my butt).

The report generator is very primitive.  Fortunately, it's very easy to read the raw results and generate your own reports.

To use:

  java CheckLinks [ -nthreads n ] [ -scanners n ]
      [ -checkpoint nmin ] [ -progress nmin ]
      [ -proxy host:port ]
      [ -exact ] [ -loose ] [ -noautoinclude ]
      [ -rewriteurl from to ]
      [ -include URLprefix ]* [ -includeexact URLprefix ]*
      [ -exclude URLprefix ]* [ -excludeexact URLprefix ]*
      URL1 [ URL2 ... ]

Output is the two files ./references and ./statuses

Example:

  java CheckLinks -nthreads 3 -exact http://web.ortge.ufl.edu/contact.html

JCheckLinks probes all the URLs on the command line to make sure they're valid.  It then follows hyperlinks to scan the tree specified by -include, -exclude, -includeexact, -excludeexact, and the URLs on the command line as affected by -exact, -loose, and -noautoinclude.  For the URLs that are valid, in the include list, and have Content-Type text/html, the link checker downloads the document and scans it for more links.  The new URLs are then checked, and if they are -included and not -excluded they are harvested for more URLs, and the process continues.

To determine whether a document is -included or -excluded, the URL is first compared against all -includeexact and -excludeexact arguments for an exact match.  Otherwise the longest -include or -exclude prefix that matches the beginning of the URL applies.  (The case of the hostname and protocol is unimportant.)

-nthreads determines how many link-checking threads there will be in total (default 1).  -scanners determines how many of those will be harvesting threads (default 2); the rest are probers only.  Too many threads will cause network timeouts.  Too many harvesting threads will burden the CPU with forced thread context switches.  n+1 harvesters is a reasonable number for a machine with n CPUs.  n+4 total threads is a reasonable number to keep your network connection somewhat occupied.

-checkpoint creates a checkpointing thread which dumps a set of 4 checkpoint files that can be used by the -resume form of the link checker in case of a crash or user interruption.  The current code is inefficient, and writing a checkpoint can take an unreasonable amount of time (>1 minute) on large trees.

-progress governs the interval at which a progress report is printed.  The progress report is simply a list of the URLs currently assigned to link-checker threads.  The default is 30 seconds (an interval which cannot be expressed on the command line, since the option takes whole minutes).  To disable it, select 0 minutes.

-proxy specifies the location of a proxy server.  I have not tested this.

-rewriteurl is most useful for checking web trees where the files are local.  Whenever a work thread goes to check a URL, the URL is rewritten without the work thread's knowledge (so if you've rewritten ftp://ftp.ortge.ufl.edu/ to file:/home/ftp/, it will still be reported as ftp://ftp.ortge.ufl.edu/).  The first -rewriteurl rule that applies (based on a prefix match) is used, and no other rewrite rules are checked.  There is a sketch of both the include/exclude matching and the rewrite matching below.
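Here is a rough Java sketch of the two matching rules just described: exact include/exclude matches are consulted first, then the longest matching -include or -exclude prefix wins, and the first matching -rewriteurl rule is the only one applied.  This is illustrative only; the class, field, and method names are made up, and it is not the code JCheckLinks actually runs (for one thing, it skips the case-folding of the protocol and hostname).

  import java.util.*;

  /* Illustrative only -- not the JCheckLinks implementation. */
  class UrlRules {
      // -includeexact / -excludeexact URLs, consulted first for an exact match.
      Set<String> includeExact = new HashSet<>(), excludeExact = new HashSet<>();
      // -include / -exclude prefixes; the longest matching prefix wins.
      List<String> includes = new ArrayList<>(), excludes = new ArrayList<>();
      // -rewriteurl from/to pairs, applied first-match-wins.
      List<String[]> rewrites = new ArrayList<>();

      /* Should this URL be scanned at all? */
      boolean included(String url) {
          if (includeExact.contains(url)) return true;
          if (excludeExact.contains(url)) return false;
          // Otherwise the longest -include or -exclude prefix that matches
          // the beginning of the URL decides.
          return longestPrefix(includes, url) > longestPrefix(excludes, url);
      }

      private static int longestPrefix(List<String> prefixes, String url) {
          int best = -1;
          for (String p : prefixes)
              if (url.startsWith(p) && p.length() > best)
                  best = p.length();
          return best;
      }

      /* Apply the first -rewriteurl rule whose "from" prefix matches;
         later rules are never consulted. */
      String rewrite(String url) {
          for (String[] rule : rewrites)
              if (url.startsWith(rule[0]))
                  return rule[1] + url.substring(rule[0].length());
          return url;
      }
  }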
This way you can say

  -rewriteurl http://web.ortge.ufl.edu/cgi http://web.ortge.ufl.edu/cgi
  -rewriteurl http://web.ortge.ufl.edu/ file:/home/httpd/data/

and use direct file access for everything except the cgi-bin and other cgi* web directories (the first rule rewrites those URLs to themselves, so the second rule never touches them).  Of course, if you use server-parsed HTML, the contents of the file will be different from the contents of the document as served by the httpd, and you may miss some hyperlinks.

The HTML is harvested for hyperlinks from the %URI attributes of the HTML tags.  The only %URI attribute from the HTML 4.0 spec that I've ignored is
ACTION (on the FORM tag).  The last thing I need to do is make random form submissions.
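For what it's worth, here is a rough sketch of the kind of harvesting described above.  It pulls a handful of the HTML 4.0 %URI attributes out of a page with a regular expression, skipping ACTION.  JCheckLinks' own parser is more thorough than this; the sketch only handles double-quoted attribute values, and the class name, the attribute subset, and the regex approach are all assumptions made for illustration, not JCheckLinks code.

  import java.util.*;
  import java.util.regex.*;

  /* Illustrative only -- not the JCheckLinks harvester. */
  class HarvestSketch {
      // A few of the HTML 4.0 %URI attributes; FORM's ACTION is deliberately omitted.
      static final Pattern URI_ATTR = Pattern.compile(
          "\\b(?:href|src|background|cite|longdesc|usemap|codebase|data)\\s*=\\s*\"([^\"]*)\"",
          Pattern.CASE_INSENSITIVE);

      /* Return every URL found in a double-quoted %URI attribute. */
      static List<String> harvest(String html) {
          List<String> urls = new ArrayList<>();
          Matcher m = URI_ATTR.matcher(html);
          while (m.find())
              urls.add(m.group(1));
          return urls;
      }

      public static void main(String[] args) {
          String doc = "<a href=\"http://web.purplefrog.com/~thoth/\">home</a>"
                     + "<img src=\"logo.gif\">"
                     + "<form action=\"/cgi-bin/search\">";  // ACTION is ignored
          System.out.println(harvest(doc));
          // prints: [http://web.purplefrog.com/~thoth/, logo.gif]
      }
  }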