Using TagSoup to extract text from HTML
These days I am experimenting with Apache Lucene, and I need a way to extract text from HTML source and feed it to Lucene. I first came up with a solution using a regex in Clojure:
(defn extract [html]
  ;; strip tags and newlines with a regex; assumes (:require [clojure.string :as str])
  (when html
    (str/replace html #"(?m)<[^<>]+>|\n" "")))
Most of the time it works, and it is very fast. But it cannot ignore JavaScript and CSS, which I need. So I came up with another solution using Enlive. Here is the Clojure code:
;; Walk the Enlive node tree: keep text nodes, descend into elements
;; other than <script>, and ignore everything else.
(defn- emit-str [node]
  (cond (string? node) node
        (and (:tag node)
             (not= :script (:tag node))) (emit-str (:content node))
        (seq? node) (map emit-str node)
        :else ""))
(defn extract-text [html]
  (when html
    ;; html/ is assumed to be an alias for net.cgrand.enlive-html
    (let [r (html/html-resource (java.io.StringReader. html))]
      (str/trim (apply str (flatten (emit-str r)))))))
It works, and JavaScript is ignored. But it is a little slow: on my machine, extracting text from a given HTML file takes 0.21ms with the regex, but 2.76ms with extract-text.
Enlive is built on top of TagSoup, which is a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short.
By calling
(html/html-resource (java.io.StringReader. html))
Enlive builds a tree for the whole HTML document, which is overkill when all I want is the text. By using TagSoup directly, I can skip this overhead. Here is the Java code:
import java.io.IOException;
import java.io.StringReader;

import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class Utils {
    public static String extractText(String html) throws IOException,
            SAXException {
        Parser p = new Parser();
        Handler h = new Handler();
        p.setContentHandler(h);
        p.parse(new InputSource(new StringReader(html)));
        return h.getText();
    }
}

// Collects character data from the SAX events, skipping everything
// inside <script> elements.
class Handler extends DefaultHandler {
    private StringBuilder sb = new StringBuilder();
    private boolean keep = true;

    @Override
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        if (keep) {
            sb.append(ch, start, length);
        }
    }

    public String getText() {
        return sb.toString();
    }

    @Override
    public void startElement(String uri, String localName, String qName,
            Attributes atts) throws SAXException {
        if (localName.equalsIgnoreCase("script")) {
            keep = false;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        keep = true;
    }
}
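For reference, a small usage example (just a sketch: UtilsDemo is an arbitrary name, and it assumes Utils and TagSoup are on the classpath):

public class UtilsDemo {
    public static void main(String[] args) throws Exception {
        String html = "<html><head><script>var x = 1;</script></head>"
                + "<body><p>Hello, <b>Lucene</b>!</p></body></html>";
        // The script body is dropped; the remaining character data is printed.
        System.out.println(Utils.extractText(html));
    }
}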
After some experimenting, I found that
Parser p = new Parser();
takes a lot of CPU time. So I keep one Parser per thread with a ThreadLocal:
private static final ThreadLocal<Parser> parser = new ThreadLocal<Parser>() {
    protected Parser initialValue() {
        return new Parser();
    }
};
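extractText then grabs the per-thread parser instead of creating a new one on every call. A minimal sketch, assuming a TagSoup Parser instance can safely be reused for sequential parses on the same thread:

public static String extractText(String html) throws IOException,
        SAXException {
    // Reuse the per-thread Parser instead of constructing a new one each call.
    Parser p = parser.get();
    Handler h = new Handler();
    p.setContentHandler(h);
    p.parse(new InputSource(new StringReader(html)));
    return h.getText();
}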
With this change, extracting text from the same HTML file takes 0.38ms. I am happy with the result.