The HotSAX project
HotSAX is a small, fast SAX2 parser for HTML, XHTML and XML.
This is version 0.1 of HotSAX. It is in pre-Alpha release state.
The home site of this project is http://hotsax.sourceforge.net
The project page is at http://sourceforge.net/projects/hotsax/
SAX is the 'Simple API for XML', which was developed by David Megginson and others on the xml-dev mailing list (http://www.megginson.com/SAX). SAX parsers parse XML by generating events for start tags, text, and end tags, which trigger event handlers in your code. They are meant to be faster and use less memory than an equivalent DOM parser. SAX2 adds lexical handling extensions for things like comments and CDATA blocks.
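To make the event model concrete, here is a minimal sketch of a SAX handler. It uses the SAX parser bundled with the JDK, not HotSAX, because this file does not name the HotSAX classes. The class and method names (SaxEventDemo, RecordingHandler, parse) are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative only: uses the JDK's built-in SAX parser, not HotSAX.
public class SaxEventDemo {

    /** Records every SAX callback it receives as a short string. */
    static class RecordingHandler extends DefaultHandler {
        final List<String> events = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            events.add("start:" + qName);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // A parser may deliver text in several chunks; skip blank ones.
            String text = new String(ch, start, length).trim();
            if (!text.isEmpty()) {
                events.add("text:" + text);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            events.add("end:" + qName);
        }
    }

    /** Parses a well-formed XML string and returns the recorded events. */
    static List<String> parse(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        RecordingHandler handler = new RecordingHandler();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.events;
    }

    public static void main(String[] args) throws Exception {
        for (String event : parse("<p>Hello <b>world</b></p>")) {
            System.out.println(event);
        }
    }
}
```

The handler never sees a document tree. It only receives callbacks in document order, which is why a SAX parser can get by with less memory than a DOM parser.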
Until now, you needed at least well-formed XML as input to a SAX parser. With the introduction of HotSAX, you can parse HTML (even badly formed HTML) and still generate SAX events.
Why would you want to do this? This tool is designed to help build other useful things like link spiders, page scrapers, converters from HTML to other formats, and scripted web browsers. A quick example would be a simple text-only browser like 'lynx'.
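As a rough idea of what the core of a link spider looks like on top of SAX events, the sketch below collects the href of every anchor tag. It again uses the JDK's built-in parser, so the input here must be well-formed. The point of HotSAX is to deliver the same kind of events for ordinary HTML. LinkCollector and collect are hypothetical names, not part of HotSAX.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative only: a SAX handler that gathers link targets.
public class LinkCollector extends DefaultHandler {

    private final List<String> links = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        // HTML tag names are case-insensitive, so accept <a> and <A>.
        if ("a".equalsIgnoreCase(qName)) {
            String href = atts.getValue("href");
            if (href != null) {
                links.add(href);   // named anchors without href are skipped
            }
        }
    }

    public List<String> getLinks() {
        return links;
    }

    /** Parses well-formed markup and returns every href found, in order. */
    static List<String> collect(String markup) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        LinkCollector handler = new LinkCollector();
        parser.parse(new InputSource(new StringReader(markup)), handler);
        return handler.getLinks();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body>"
                + "<a href='http://hotsax.sourceforge.net'>Home</a>"
                + "<p><a href='faq.html'>FAQ</a></p>"
                + "</body></html>";
        for (String link : collect(page)) {
            System.out.println(link);
        }
    }
}
```

A spider would queue each collected link for fetching. A page scraper or format converter would follow the same pattern with different element names and would also handle the text events.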
You can embed HotSAX in larger projects, such as a headline grabber for a content management system, similar to what My Yahoo does when it displays the top stories from CNET, the NY Times, etc. See the README and the FAQ files for more information.
Contributors: Shamant Ayyar - Developer
Thanks go out to Pat Niemeyer for his invaluable suggestion to use BeanShell as the scripting language for the test cases. http://www.beanshell.org
Thanks also to the SourceForge people and VA Linux for hosting this project.