Selenium WebDriver Java Framework Course Limited Time Offer for $20

Selenium WebDriver Java Framework Course Limited Time Offer for $20

 

Print

Parse HTML With Jsoup

jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jquery-like methods. This article explains how to parse a html file with Jsoup library. We have a file "testfile.html" with the content below. 

TestIdTestNameTestResult
001 Login Passed
002 Logout Passed

We will parse the html file and obtain text in each table cell.  

Step 1: create a maven based java project and add the following dependency in the pom.xml file

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.7.2</version>
</dependency>

Write the following code in the class "ParseHtmlWithJsoup.java" file.

package com.example.html;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.File;
import java.io.IOException;

public class ParseHtmlWithJsoup {

    public static void main(String[] args)
    {
       //get the html file to parse its content
        File htmlFile=new File("testfile.html");
        try {
            //parse the html file
            Document doc= Jsoup.parse(htmlFile,"UTF-8");
            //get all table cell elements
            Elements tableRows=doc.getElementsByTag("tr");
            //print headers
            for(Element row: tableRows)
            {
                Elements tableHeaders=row.getElementsByTag("th");
                for (Element element : tableHeaders) {
                    //get text content of each cell element
                    String elementText = element.text();
                    //print out the content
                    System.out.print(elementText +" ");
                }
            }
            //get each row to content without header
            for(Element row: tableRows)
            {
                Elements tableCells=row.getElementsByTag("td");
            //iterate over each element
                for (Element element : tableCells) {
                    //get text content of each cell element
                    String elementText = element.text();
                    //print out the content
                    System.out.print(elementText +" ");
                }
                System.out.println();
            }

        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Run the code above and see the result below. Html file content is displayed. 

TestId TestName TestResult 
001 Login Passed 
002 Logout Passed