Jaunt (Web Scraping Tool)

Jaunt is a free Java library for web scraping and web automation, including JSON querying. The library provides an ultra-light headless browser (i.e., one with no GUI). With Jaunt, your Java programs can perform browser-level, document-level, and DOM-level operations. It is a good fit when JavaScript support is not required, for tasks including:

  • filling out and submitting web forms
  • creating web-bots or web-scraping programs
  • writing HTTP clients for REST APIs or web apps (JSON, HTML, XHTML, or XML)

Jaunt Beta is a free Java web-scraping and automation library. Its API presents a lightweight, headless browser for interfacing with websites, web apps, and web services. Jaunt makes it easy to parse, traverse, search, extract, and filter HTML and XML data. It provides three levels of abstraction: DOM-level, component-level, and browser-level. It suits web automation where JavaScript is not required, including filling out and submitting forms, creating web-bots or web-scraping programs, creating REST clients for XML services, interfacing with web-based APIs or web apps, and automated testing.

You can download the library from the link below:

http://jaunt-api.com/download.htm

The library is free, but each download works for only 30 days. After that, you have to download it again to keep using it.

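Here is an example program that extracts user agent strings from useragentstring.com and writes them to a file named UserAgentString.txt.

It also filters the data with a regex, keeping only the strings that start with Mozilla or Opera.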

import com.jaunt.*;
import java.io.FileNotFoundException;
import java.io.PrintWriter;
import java.io.UnsupportedEncodingException;
import java.util.regex.Pattern;

public class UserAgentStringScrapper {
    public static void main(String[] args) {
        UserAgent userAgent = new UserAgent();
        // Keep only lines that start with "Mozilla" or "Opera".
        Pattern pattern = Pattern.compile("(^Mozilla|^Opera).*");

        try {
            userAgent.visit("http://www.useragentstring.com/pages/useragentstring.php?name=All");
            // Collect every <a> found inside an <li> inside a <ul>.
            Elements links = userAgent.doc.findEvery("<ul>").findEvery("<li>").findEvery("<a>");
            // try-with-resources closes the writer even if writing fails.
            try (PrintWriter writer = new PrintWriter("UserAgentString.txt", "UTF-8")) {
                for (Element link : links) {
                    String line = link.getText();
                    if (pattern.matcher(line).matches()) {
                        writer.println(line);
                    }
                }
            }
        } catch (ResponseException e) {
            e.printStackTrace();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        }
    }
}
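
The filtering step in the example above can be tried on its own, without Jaunt or a network connection. The following is a minimal sketch using only java.util.regex; the class name UserAgentFilter, the method keepBrowsers, and the sample strings are made up for illustration and do not come from the real page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class UserAgentFilter {
    // Same pattern as in the scraper: the line must start with "Mozilla" or "Opera".
    private static final Pattern PATTERN = Pattern.compile("(^Mozilla|^Opera).*");

    // Returns only the lines that the pattern matches in full.
    public static List<String> keepBrowsers(List<String> lines) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            if (PATTERN.matcher(line).matches()) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical sample input, not taken from the real page.
        List<String> sample = Arrays.asList(
                "Mozilla/5.0 (X11; Linux x86_64)",
                "Opera/9.80 (Windows NT 6.1)",
                "Googlebot/2.1 (+http://www.google.com/bot.html)");
        // Prints the first two entries; the Googlebot line is filtered out.
        System.out.println(keepBrowsers(sample));
    }
}
```

Because matches() requires the whole line to match, the trailing .* is what lets the rest of the user agent string through.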

 

Using Selenium with ChromeDriver (Automation Tool)

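Selenium WebDriver is a tool for writing automated UI tests for web applications. It has bindings for several programming languages, including Java, Python, C#, Ruby, and JavaScript, and it can run tests against any HTTP website. Selenium has the support of some of the largest browser vendors, who have taken (or are taking) steps to make Selenium a native part of their browser. It is also the core technology in countless other browser automation tools, APIs, and frameworks.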

Here are some of the most common commands used in Selenium.

Command                                     Description
driver.get("URL")                           Navigate to the given URL.
element.sendKeys("inputtext")               Type text into an input box.
element.clear()                             Clear the contents of the input box.
select.deselectAll()                        Deselect all OPTIONs of the SELECT element wrapped by select.
select.selectByVisibleText("some text")     Select the OPTION whose visible text matches the given text.
driver.switchTo().window("windowName")      Move the focus from one window to another.
driver.switchTo().frame("frameName")        Move the focus from one frame to another.
driver.switchTo().alert()                   Switch to the active alert so that it can be handled.
driver.navigate().to("URL")                 Navigate to the URL.
driver.navigate().forward()                 Navigate forward in the browser history.
driver.navigate().back()                    Navigate back in the browser history.
driver.navigate().refresh()                 Refresh the current page.
driver.close()                              Close the current window associated with the driver.
driver.quit()                               Quit the driver and close all windows associated with it.

To use Selenium with ChromeDriver, you need to download both the Selenium API and ChromeDriver. You can get them from the following links:

Selenium: http://www.seleniumhq.org/download/

ChromeDriver: https://sites.google.com/a/chromium.org/chromedriver/downloads

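Here is an example that opens Google in Chrome, types a query into the search box, and submits it:

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;

public class SeleniumDemo {
    public static void main(String[] args) {
        // Tell Selenium where the ChromeDriver executable lives.
        System.setProperty("webdriver.chrome.driver", "C:\\chromedriver.exe");
        WebDriver driver = new ChromeDriver();
        driver.get("https://www.google.com.np");
        // "q" is the name attribute of the search box, so locate it by name, not by tag.
        WebElement element = driver.findElement(By.name("q"));
        element.sendKeys("watchnaruto.tv");
        element.submit();
        driver.quit();
    }
}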

Selenium is a very useful tool for automation projects and for testing. It can also be used for data mining (web scraping).

Regex (Regular Expression)

A regular expression, regex or regexp (sometimes called a rational expression) is, in theoretical computer science and formal language theory, a sequence of characters that defines a search pattern. Usually this pattern is then used by string-searching algorithms for “find” or “find and replace” operations on strings.
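
To make the two operations concrete, here is a minimal Java sketch of "find" and "find and replace" with java.util.regex. The class name FindAndReplaceDemo, the method names, and the sample sentence are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FindAndReplaceDemo {
    // One or more digits.
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // "Find": collect every substring that matches the pattern.
    public static List<String> findAll(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = DIGITS.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    // "Find and replace": replace every match with a fixed string.
    public static String mask(String text) {
        return DIGITS.matcher(text).replaceAll("#");
    }

    public static void main(String[] args) {
        String text = "Order 66 shipped in 3 boxes";
        System.out.println(findAll(text)); // [66, 3]
        System.out.println(mask(text));    // Order # shipped in # boxes
    }
}
```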

Regular expressions are extremely useful for extracting information from text such as code, log files, spreadsheets, or even documents. The first thing to recognize when using regular expressions is that everything is essentially a character, and we are writing patterns to match a specific sequence of characters (also known as a string). Most patterns use normal ASCII, which includes letters, digits, punctuation and other symbols on your keyboard like %#$@!, but Unicode characters can also be used to match any type of international text.
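
As a small illustration of matching international text, the sketch below uses the Unicode property class \p{L} (any letter in any script) in Java. The class name UnicodeLetters and the sample string are assumptions made for this example; the sample is written with \u escapes so that it does not depend on the source file encoding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeLetters {
    // \p{L}+ matches a run of letters from any script, not just A-Z.
    private static final Pattern WORD = Pattern.compile("\\p{L}+");

    // Collects every run of letters found in the text.
    public static List<String> words(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // "naive" with a diaeresis, "cafe" with an acute accent, two kanji, then digits.
        String text = "na\u00efve caf\u00e9 \u6771\u4eac 123";
        // Three words are found; the digits are skipped.
        System.out.println(words(text).size()); // 3
    }
}
```

An ASCII-only class such as [A-Za-z]+ would split the first word at the accented letter, which is why the Unicode property class is used here.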

The table below lists the regular expression metacharacter syntax available in Java:

Subexpression   Matches
^               Matches the beginning of the line (in MULTILINE mode; otherwise the beginning of the input).
$               Matches the end of the line (in MULTILINE mode; otherwise the end of the input).
.               Matches any single character except a line terminator. In DOTALL mode, (?s), it matches line terminators as well.
[…]             Matches any single character in brackets.
[^…]            Matches any single character not in brackets.
\A              Matches the beginning of the entire string.
\z              Matches the end of the entire string.
\Z              Matches the end of the entire string, except for an allowable final line terminator.
re*             Matches 0 or more occurrences of the preceding expression.
re+             Matches 1 or more occurrences of the preceding expression.
re?             Matches 0 or 1 occurrence of the preceding expression.
re{n}           Matches exactly n occurrences of the preceding expression.
re{n,}          Matches n or more occurrences of the preceding expression.
re{n,m}         Matches at least n and at most m occurrences of the preceding expression.
a|b             Matches either a or b.
(re)            Groups regular expressions and remembers the matched text.
(?:re)          Groups regular expressions without remembering the matched text.
(?>re)          Matches the independent pattern without backtracking.
\w              Matches the word characters.
\W              Matches the nonword characters.
\s              Matches the whitespace. Equivalent to [ \t\n\x0B\f\r].
\S              Matches the nonwhitespace.
\d              Matches the digits. Equivalent to [0-9].
\D              Matches the nondigits.
\G              Matches the point where the last match finished.
\n              Back-reference to capture group number n, where n is a digit (\1, \2, and so on).
\b              Matches a word boundary.
\B              Matches a nonword boundary.
\n, \r, \t, etc.  Matches newlines, carriage returns, tabs, etc.
\Q              Escapes (quotes) all characters up to \E.
\E              Ends quoting begun with \Q.
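
A few rows of the table can be exercised in a short Java program. This is only a sketch; the class name MetacharacterDemo, the method names, and the sample inputs are invented for illustration. Note that every regex backslash is doubled inside a Java string literal.

```java
import java.util.regex.Pattern;

public class MetacharacterDemo {
    // ^ and $ anchor the match; \d{3} means exactly three digits.
    public static boolean isShortPhone(String s) {
        return Pattern.compile("^\\d{3}-\\d{4}$").matcher(s).matches();
    }

    // ? makes the preceding "u" optional; (?:...) groups without capturing; | is alternation.
    public static boolean isColorWord(String s) {
        return Pattern.compile("(?:colou?r|gr[ae]y)").matcher(s).matches();
    }

    // (\w+) captures a word and \1 requires the same text again; \b keeps it to whole words.
    public static boolean hasRepeatedWord(String s) {
        return Pattern.compile("\\b(\\w+) \\1\\b").matcher(s).find();
    }

    // \Q...\E quotes its contents, so "+" is a literal plus sign, not a quantifier.
    public static boolean isLiteralSum(String s) {
        return Pattern.compile("\\Q1+1=2\\E").matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isShortPhone("555-1234"));             // true
        System.out.println(isColorWord("colour"));                // true
        System.out.println(hasRepeatedWord("this is is a test")); // true
        System.out.println(isLiteralSum("1+1=2"));                // true
        // Without quoting, "+" is a quantifier, so the same text does not match.
        System.out.println(Pattern.matches("1+1=2", "1+1=2"));    // false
    }
}
```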

Demo: [Image: demo program] [Image: output]

A regex (regular expression) is a way of extracting or verifying data in text.
Sometimes a good regex can replace a hundred lines of code!
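
As one hedged illustration of that claim, the sketch below checks that a string has the shape yyyy-mm-dd and pulls out its three numbers with a single pattern, work that would otherwise need manual splitting and character checks. The class name DateParts and the method parse are hypothetical, and the pattern checks only the format, not whether the date exists on the calendar.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DateParts {
    // Four digits, a hyphen, two digits, a hyphen, two digits; each part is captured.
    private static final Pattern ISO_DATE = Pattern.compile("(\\d{4})-(\\d{2})-(\\d{2})");

    // Returns {year, month, day}, or null if the text does not have the expected shape.
    public static int[] parse(String text) {
        Matcher m = ISO_DATE.matcher(text);
        if (!m.matches()) {
            return null;
        }
        return new int[] {
            Integer.parseInt(m.group(1)),
            Integer.parseInt(m.group(2)),
            Integer.parseInt(m.group(3))
        };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parse("2016-03-09"))); // [2016, 3, 9]
        System.out.println(parse("09/03/2016") == null);          // true
    }
}
```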

There are websites that allow us to check regex online such as:

  1. https://regex101.com/
  2. http://regexr.com/
  3. http://www.regextester.com/
  4. http://pythex.org/
  5. http://www.gethifi.com/tools/regex

Some websites provide regex testers for several languages, while others target a specific language. Regex syntax may differ from language to language.
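
Named capture groups are one small example of such a difference: Java writes them as (?<name>…), while Python writes (?P<name>…). The Java sketch below assumes a made-up key=value input format; the class name NamedGroupDemo and the method valueOf are invented for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NamedGroupDemo {
    // In Java source each regex backslash is doubled: "\\w" is the regex \w.
    // (?<key>...) and (?<value>...) are named groups in Java's syntax.
    private static final Pattern PAIR = Pattern.compile("(?<key>\\w+)=(?<value>\\w+)");

    // Returns the value of the first pair whose key equals wantedKey, or null if none.
    public static String valueOf(String text, String wantedKey) {
        Matcher m = PAIR.matcher(text);
        while (m.find()) {
            if (m.group("key").equals(wantedKey)) {
                return m.group("value");
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(valueOf("lang=java level=beginner", "level")); // beginner
    }
}
```

A pattern copied from a tester set to another language may therefore need small adjustments before it compiles in Java.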

If you are going to be a programmer, regex is a very important thing to learn.