03 Sep 2011

Using TagSoup to extract text from HTML

These days, I am experimenting with Apache Lucene. I need a way to extract text from HTML source and feed it to Lucene. I first came up with a solution using a regex in Clojure:

;; assumes (require '[clojure.string :as str])
(defn extract [html]
  (when html
    (str/replace html #"(?m)<[^<>]+>|\n" "")))

Most of the time it works, and it is very fast. But it can't ignore JavaScript and CSS, which I need. So I came up with another solution, using Enlive.
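To see the problem concretely, here is a Java transliteration of the regex (the class name RegexExtract is mine, just for illustration): the pattern strips the script tags themselves but keeps the JavaScript between them.

```java
public class RegexExtract {
    // Same pattern as the Clojure version: remove tags and newlines.
    public static String extract(String html) {
        if (html == null) {
            return null;
        }
        return html.replaceAll("(?m)<[^<>]+>|\n", "");
    }

    public static void main(String[] args) {
        String html = "<p>Hello</p><script>var x = 1;</script>";
        // The <script> tags are removed, but the script body leaks through.
        System.out.println(extract(html)); // prints "Hellovar x = 1;"
    }
}
```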

Here is the Clojure code.

;; assumes (require '[net.cgrand.enlive-html :as html]
;;                  '[clojure.string :as str])
(defn- emit-str [node]
  (cond (string? node) node
        (and (:tag node)
             (not= :script (:tag node))) (emit-str (:content node))
        (seq? node) (map emit-str node)
        :else ""))

(defn extract-text [html]
  (when html
    (let [r (html/html-resource (java.io.StringReader. html))]
      (str/trim (apply str (flatten (emit-str r)))))))

It works: JavaScript is ignored. But it is a little slow. On my machine, extracting text from a given HTML file takes 0.21 ms with the regex version, but 2.76 ms with extract-text.

Enlive is built on top of TagSoup, a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short.

By calling

(html/html-resource (java.io.StringReader. html))

Enlive builds a tree for the HTML, which is a little overkill when all I want is the text. By using TagSoup directly, I can bypass this overhead. Here is the Java code:

import java.io.IOException;
import java.io.StringReader;

import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class Utils {
    public static String extractText(String html) throws IOException,
            SAXException {
        Parser p = new Parser();
        Handler h = new Handler();
        p.setContentHandler(h);
        p.parse(new InputSource(new StringReader(html)));
        return h.getText();
    }
}

class Handler extends DefaultHandler {
    private StringBuilder sb = new StringBuilder();
    // Collect character data everywhere except inside <script> elements.
    private boolean keep = true;

    @Override
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        if (keep) {
            sb.append(ch, start, length);
        }
    }

    public String getText() {
        return sb.toString();
    }

    @Override
    public void startElement(String uri, String localName, String qName,
            Attributes atts) throws SAXException {
        if (localName.equalsIgnoreCase("script")) {
            keep = false;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        keep = true;
    }
}
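The keep-flag handler pattern itself does not depend on TagSoup; it can be exercised with the JDK's built-in SAX parser, as long as the input is well-formed — which is exactly the limitation TagSoup removes. A minimal sketch (the class name SaxTextDemo is mine; note the default non-namespace-aware parser reports the tag name in qName):

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxTextDemo {
    // Same logic as Handler above, but run against the JDK's SAX parser
    // on well-formed XHTML input.
    public static String extractText(String xhtml) throws Exception {
        final StringBuilder sb = new StringBuilder();
        DefaultHandler h = new DefaultHandler() {
            private boolean keep = true;

            @Override
            public void startElement(String uri, String localName,
                    String qName, Attributes atts) {
                if (qName.equalsIgnoreCase("script")) {
                    keep = false;
                }
            }

            @Override
            public void endElement(String uri, String localName,
                    String qName) {
                keep = true;
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (keep) {
                    sb.append(ch, start, length);
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), h);
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        String xhtml =
            "<html><body><script>var x;</script><p>Hi</p></body></html>";
        System.out.println(extractText(xhtml)); // prints "Hi"
    }
}
```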

After some experimentation, I found that

Parser p = new Parser();

takes a lot of CPU time. By caching the parser in a ThreadLocal, so each thread constructs it only once:

private static final ThreadLocal<Parser> parser = new ThreadLocal<Parser>() {
    @Override
    protected Parser initialValue() {
        return new Parser();
    }
};

extraction takes 0.38 ms on the same HTML file. I am happy with the result.
