Java - HTTPClient - How to simulate a web form ?


About

HttpClient is a Java library that acts as a headless HTTP client: it can programmatically simulate a user's web navigation.

It simulates a user's navigation on the Internet by sending HTTP requests. We can:

  • retrieve a page from a web link,
  • search in this page (for instance for a link, an email address, …),
  • and, importantly, simulate the submission of a form (with the GET or POST method). It then becomes possible to retrieve information that is only reachable through a form.
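
As an illustration of the "search in this page" step, here is a minimal sketch that pulls the href values out of an HTML string with a regular expression. It uses only the JDK and works on text you have already downloaded. The class name LinkExtractor and the sample page are assumptions made for this example, and a regular expression is only a rough tool: a real HTML parser is more robust on messy pages.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for this article, not part of HttpClient.
public class LinkExtractor {

    // Matches href="..." or href='...' (case-insensitive); group 2 is the URL.
    private static final Pattern HREF =
        Pattern.compile("href\\s*=\\s*([\"'])(.*?)\\1", Pattern.CASE_INSENSITIVE);

    /** Returns every href value found in the given HTML, in document order. */
    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<String>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(2));
        }
        return links;
    }

    public static void main(String[] args) {
        // Sample page content (an assumption, stands in for a downloaded page).
        String page = "<p><a href=\"/wp-login.php\">Login</a> "
                    + "<a href='mailto:someone@example.com'>Mail</a></p>";
        for (String link : extractLinks(page)) {
            System.out.println(link);
        }
    }
}
```

In practice, the string passed to extractLinks would be the response body returned by an HTTP request.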

Be careful: on the web, many explanations are based on an old version of HttpClient. For instance, the Innovation website talks about version 0.3-3. I will talk about version 3.1.

I recommend reading the great article Client HTTP Programming Primer (ForAbsoluteBeginners). It gives a lot of good basic information about the web, explains the difference between HttpClient and a browser (see below: HttpClient in black and the rest of the browser in blue), and describes how to connect through a login form.

It only provides text, not a real example. As the article itself says: “So this document is all bla-bla, and you will have to work out the details - all the details - yourself. Such is life.”

That is why I wrote the simple form-login example below. It connects to my WordPress admin page (version 2.6.3) and retrieves the theme page.

Steps

To get HttpClient and the code below working, you must first download these two JARs:

and add them to your Java classpath.

package com.gerardnico.httpclient;

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.commons.httpclient.Cookie;
import org.apache.commons.httpclient.Header;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.NameValuePair;
import org.apache.commons.httpclient.cookie.CookiePolicy;
import org.apache.commons.httpclient.cookie.CookieSpec;
import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.commons.httpclient.methods.PostMethod;

public class FormLoginDemo
{
    static final String LOGON_SITE = "gerardnico.com";
    static final int    LOGON_PORT = 80;

    public FormLoginDemo() {
        super();
    }

    public static void main(String[] args) throws Exception {

        HttpClient client = new HttpClient();
        client.getHostConfiguration().setHost(LOGON_SITE, LOGON_PORT, "http");
        client.getParams().setCookiePolicy(CookiePolicy.BROWSER_COMPATIBILITY);
        // 'developer.java.sun.com' has cookie compliance problems
        // Their session cookie's domain attribute is in violation of the RFC2109
        // We have to resort to using compatibility cookie policy

        GetMethod authget = new GetMethod("/wp-login.php");

        client.executeMethod(authget);
        System.out.println("Login form get: " + authget.getStatusLine().toString()); 
        // release any connection resources used by the method
        authget.releaseConnection();
        // See if we got any cookies
        CookieSpec cookiespec = CookiePolicy.getDefaultSpec();
        Cookie[] initcookies = cookiespec.match(
            LOGON_SITE, LOGON_PORT, "/", false, client.getState().getCookies());
        System.out.println("Initial set of cookies:");    
        if (initcookies.length == 0) {
            System.out.println("None");    
        } else {
            for (int i = 0; i < initcookies.length; i++) {
                System.out.println("- " + initcookies[i].toString());    
            }
        }
        
        PostMethod authpost = new PostMethod("/wp-login.php");
        // Prepare login parameters
        NameValuePair submit     = new NameValuePair("wp-submit", "Log In");
        NameValuePair url        = new NameValuePair("redirect_to", "http://gerardnico.com/wp-admin/themes.php");
        NameValuePair userid     = new NameValuePair("log", "User_Login"); // <- You have to change this.
        NameValuePair password   = new NameValuePair("pwd", "User_Pwd");   // <- and that.
        NameValuePair rememberme = new NameValuePair("rememberme", "forever");
        NameValuePair cookie     = new NameValuePair("testcookie", "1");
        authpost.setRequestBody(
            new NameValuePair[] {submit, url, userid, password, rememberme, cookie});
        
        client.executeMethod(authpost);
        System.out.println("Login form post: " + authpost.getStatusLine().toString()); 
        // release any connection resources used by the method
        authpost.releaseConnection();
        // See if we got any cookies
        // The only way of telling whether logon succeeded is 
        // by finding a session cookie
        Cookie[] logoncookies = cookiespec.match(
            LOGON_SITE, LOGON_PORT, "/", false, client.getState().getCookies());
        System.out.println("Logon cookies:");    
        if (logoncookies.length == 0) {
            System.out.println("None");    
        } else {
            for (int i = 0; i < logoncookies.length; i++) {
                System.out.println("- " + logoncookies[i].toString());    
            }
        }
        // Usually a successful form-based login results in a redirect to
        // another url
        int statuscode = authpost.getStatusCode();
        if ((statuscode == HttpStatus.SC_MOVED_TEMPORARILY) ||
            (statuscode == HttpStatus.SC_MOVED_PERMANENTLY) ||
            (statuscode == HttpStatus.SC_SEE_OTHER) ||
            (statuscode == HttpStatus.SC_TEMPORARY_REDIRECT)) {
            Header header = authpost.getResponseHeader("location");
            if (header != null) {
                String newuri = header.getValue();
                if ((newuri == null) || (newuri.equals(""))) {
                    newuri = "/";
                }
                System.out.println("Redirect target: " + newuri); 
                GetMethod redirect = new GetMethod(newuri);

                client.executeMethod(redirect);
                System.out.println("Redirect: " + redirect.getStatusLine().toString()); 
                // Print the body of the page we were redirected to
                BufferedReader br = new BufferedReader(
                    new InputStreamReader(redirect.getResponseBodyAsStream()));
                String readLine;
                while ((readLine = br.readLine()) != null) {
                    System.out.println(" 1 - " + readLine);
                }
                // release any connection resources used by the method
                redirect.releaseConnection();
            } else {
                System.out.println("Invalid redirect");
                System.exit(1);
            }
        }
    }
}

Pretty great, no? Well, I spent a lot of time trying to work through this code before realizing that the HttpClient archive ships with very good examples in the directory “commons-httpclient-3.1\src\examples”. The code above comes from the FormLoginDemo.java file.
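
To make clearer what the login POST sends, here is a minimal sketch of how a list of name/value pairs becomes an application/x-www-form-urlencoded request body. It uses only the JDK. The class name FormEncoder is an assumption made for this example, and so is the UTF-8 charset: HttpClient does this encoding itself and may use a different charset depending on its configuration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for this article, not part of HttpClient.
public class FormEncoder {

    /** Encodes name/value pairs as an application/x-www-form-urlencoded body. */
    public static String encode(Map<String, String> fields) {
        StringBuilder body = new StringBuilder();
        try {
            for (Map.Entry<String, String> field : fields.entrySet()) {
                if (body.length() > 0) {
                    body.append('&'); // pairs are separated by '&'
                }
                body.append(URLEncoder.encode(field.getKey(), "UTF-8"))
                    .append('=')
                    .append(URLEncoder.encode(field.getValue(), "UTF-8"));
            }
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a Java platform
            throw new IllegalStateException(e);
        }
        return body.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the fields in insertion order
        Map<String, String> fields = new LinkedHashMap<String, String>();
        fields.put("log", "User_Login");
        fields.put("pwd", "User_Pwd");
        fields.put("wp-submit", "Log In");
        System.out.println(encode(fields));
        // prints: log=User_Login&pwd=User_Pwd&wp-submit=Log+In
    }
}
```

Comparing such a string with the body shown in the wire log is a quick way to check that the field names match those of the real login form.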

Support

You will run into a connection problem because the WordPress cookies are rejected with an “Illegal path attribute” error.

Nov 17, 2008 11:47:15 AM org.apache.commons.httpclient.HttpMethodBase processCookieHeaders
WARNING: Cookie rejected: "$Version=0; wordpress_55ddaa0a24a40c041e4b5cb342cec90a=Nico%7C1228128437%7C679c8500a0a977d85955d370cd2e32f5; $Path=/wp-content/plugins". Illegal path attribute "/wp-content/plugins". Path of origin: "/wp-login.php"
Nov 17, 2008 11:47:15 AM org.apache.commons.httpclient.HttpMethodBase processCookieHeaders
WARNING: Cookie rejected: "$Version=0; wordpress_55ddaa0a24a40c041e4b5cb342cec90a=Nico%7C1228128437%7C679c8500a0a977d85955d370cd2e32f5; $Path=/wp-admin". Illegal path attribute "/wp-admin". Path of origin: "/wp-login.php"

HttpClient does this because RFC 2109 requires it, but many browsers accept cookies with such a path. You may be able to work around this issue with the IGNORE_COOKIES policy (although that policy ignores all cookies, so the session cookie would be lost as well). What I did was download the source and comment out this check in the CookieSpecBase.java file:

        // another security check... we musn't allow the server to give us a
        // cookie that doesn't match this path

//        if (!path.startsWith(cookie.getPath())) {
//            throw new MalformedCookieException(
//                "Illegal path attribute \"" + cookie.getPath() 
//                + "\". Path of origin: \"" + path + "\"");
//        }

See this very good thread for more explanation.
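
The check that rejects the WordPress cookies comes down to a prefix test between the request path and the cookie's Path attribute. The sketch below reproduces only that test, with the paths taken from the log above, to show why "/wp-admin" is refused when the request goes to "/wp-login.php". The class and method names are assumptions made for this example, and the rest of HttpClient's cookie validation is left out.

```java
// Hypothetical illustration for this article, not part of HttpClient.
public class CookiePathCheck {

    /** Strict rule: the request path must start with the cookie's Path attribute. */
    public static boolean strictPathMatch(String requestPath, String cookiePath) {
        return requestPath.startsWith(cookiePath);
    }

    public static void main(String[] args) {
        String origin = "/wp-login.php"; // path of the login request
        String[] cookiePaths = { "/", "/wp-admin", "/wp-content/plugins" };
        for (String cookiePath : cookiePaths) {
            System.out.println(cookiePath + " -> "
                + (strictPathMatch(origin, cookiePath) ? "accepted" : "rejected"));
        }
        // prints:
        // / -> accepted
        // /wp-admin -> rejected
        // /wp-content/plugins -> rejected
    }
}
```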

Log

To enable header logging, just add these lines at the start of your program.

System.setProperty("org.apache.commons.logging.Log", "org.apache.commons.logging.impl.SimpleLog");
System.setProperty("org.apache.commons.logging.simplelog.showdatetime", "true");
System.setProperty("org.apache.commons.logging.simplelog.log.httpclient.wire.header", "debug");
System.setProperty("org.apache.commons.logging.simplelog.log.org.apache.commons.httpclient", "debug");
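
If you switch often between header-only and full wire logging, the same properties can be grouped in a small helper. This is a minimal sketch: the class name WireLogConfig and its fullWire flag are assumptions made for this example, while the property names are the ones shown above plus the "httpclient.wire" logger, which also logs request and response bodies.

```java
// Hypothetical helper for this article, not part of HttpClient.
public class WireLogConfig {

    private static final String PREFIX = "org.apache.commons.logging.simplelog.";

    /**
     * Sets the SimpleLog system properties.
     * Call this before any HttpClient class is used.
     */
    public static void enable(boolean fullWire) {
        System.setProperty("org.apache.commons.logging.Log",
                           "org.apache.commons.logging.impl.SimpleLog");
        System.setProperty(PREFIX + "showdatetime", "true");
        // "httpclient.wire" logs headers and bodies,
        // "httpclient.wire.header" logs headers only.
        String wireLogger = fullWire ? "httpclient.wire" : "httpclient.wire.header";
        System.setProperty(PREFIX + "log." + wireLogger, "debug");
        System.setProperty(PREFIX + "log.org.apache.commons.httpclient", "debug");
    }

    public static void main(String[] args) {
        enable(false); // header logging only
        System.out.println(System.getProperty(PREFIX + "log.httpclient.wire.header"));
        // prints: debug
    }
}
```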

HttpClient in OC4J

Oracle Application Server and OC4J 10g (10.1.3) provide the HTTPClient Java package as a complete HTTP client library. It currently implements most of the relevant parts of the HTTP/1.0 and HTTP/1.1 protocols, including the request methods HEAD, GET, POST and PUT, and automatic handling of authorization, redirection requests, and cookies. Furthermore the included Codecs class contains coders and decoders for the base64, quoted-printable, URL-encoding, chunked and the multipart/form-data encodings.

This how-to illustrates a few basic features of the HTTPClient package with different JSPs, like the GET method and cookies.

Examples




