Jaunt (Web Scraping Tool)

Jaunt is a new, free Java library for web scraping and web automation, including JSON querying. The library provides an ultra-light headless browser (i.e., one with no GUI). With Jaunt, your Java programs can easily perform browser-level, document-level, and DOM-level operations. Jaunt is an ideal tool when JavaScript support is not required, for tasks including:

  • filling out and submitting web forms
  • creating web-bots or web-scraping programs
  • writing HTTP clients for REST APIs or web-apps (JSON, HTML, XHTML, or XML)

The Jaunt Beta API presents a lightweight, headless browser for interfacing with websites, web-apps, and web services. It makes it easy to parse, traverse, search, extract, and filter HTML and XML data, and it provides three levels of abstraction: DOM-level, component-level, and browser-level. Beyond the tasks listed above, it is also suited to creating REST clients for XML services, interfacing with web-based APIs, and automated testing.

You can download the library from the link below:

http://jaunt-api.com/download.htm

The library is free, but each download works for only 30 days. After that, you have to download it again to keep using it.

Here is an example program that extracts the list of user agent strings from useragentstring.com. It also filters the data with a regex, keeping only the entries that start with "Mozilla" or "Opera".

import com.jaunt.*;
import java.io.FileNotFoundException;
import java.io.PrintWriter;
import java.io.UnsupportedEncodingException;
import java.util.regex.Pattern;

public class UserAgentStringScrapper {
    public static void main(String[] args) {
        UserAgent userAgent = new UserAgent();
        // Keep only the lines that start with "Mozilla" or "Opera".
        Pattern pattern = Pattern.compile("(^Mozilla|^Opera).*");

        try {
            userAgent.visit("http://www.useragentstring.com/pages/useragentstring.php?name=All");
            // Collect every <a> found inside an <li> inside a <ul>.
            Elements links = userAgent.doc.findEvery("<ul>").findEvery("<li>").findEvery("<a>");

            // try-with-resources closes the writer even if the loop throws.
            try (PrintWriter writer = new PrintWriter("UserAgentString.txt", "UTF-8")) {
                for (Element link : links) {
                    String line = link.getText();
                    if (pattern.matcher(line).matches()) {
                        writer.println(line);
                    }
                }
            }
        } catch (ResponseException e) {
            // The page could not be fetched.
            e.printStackTrace();
        } catch (FileNotFoundException e) {
            // The output file could not be created or opened.
            e.printStackTrace();
        } catch (UnsupportedEncodingException e) {
            // The requested charset name is not supported.
            e.printStackTrace();
        }
    }
}
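
The regex filtering step of the program above can also be tried on its own, without Jaunt or a network connection. The following is only a minimal sketch: the class name UserAgentFilterSketch, the method keepBrowserAgents, and the sample strings are made up for illustration and are not part of Jaunt or of the real page content. It uses the same pattern and the same full-line matches() check as the scraper.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration; not part of the Jaunt library.
class UserAgentFilterSketch {
    // Same pattern as the scraper: the line must start with "Mozilla" or "Opera".
    private static final Pattern BROWSER_PATTERN = Pattern.compile("(^Mozilla|^Opera).*");

    // Returns only the lines that the pattern matches in full, in their original order.
    public static List<String> keepBrowserAgents(List<String> lines) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            if (BROWSER_PATTERN.matcher(line).matches()) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up sample link texts; the real page may contain different entries.
        List<String> sample = Arrays.asList(
                "Mozilla/5.0 (X11; Linux x86_64)",
                "Opera/9.80 (Windows NT 6.1)",
                "Googlebot/2.1 (+http://www.google.com/bot.html)",
                "Firefox");
        for (String line : keepBrowserAgents(sample)) {
            System.out.println(line);
        }
    }
}
```

Because matches() has to cover the whole line, an entry that merely contains "Mozilla" somewhere in the middle, or that starts with a space, is dropped.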

 
