Monday, January 23, 2012

How to Parse HTML using C# code?

There are situations where data is not directly available in a form a program can use. The data sits on an HTML page, but in a well-formatted structure (tables, divs, spans, paragraphs). Many tools can access a site directly and scrape its HTML as required, but processing the scraped data further still takes some manual effort.

So I searched for an HTML parser in .NET but did not find a quick way to do it. Then I came across the WebBrowser control. There are different ways of using this control, but here is one way, with a sample, to parse simple HTML that comes from a stream or text.


using System;
using System.Windows.Forms;

public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }

        private void Form1_Load(object sender, EventArgs e)
        {
            // Subscribe before navigating so the completed event cannot be missed.
            webBrowser1.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(webBrowser1_DocumentCompleted);
            webBrowser1.Navigate("http://kalpish.blogspot.com");
        }

        void webBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
        {
            //Let's say the following HTML came from a database, a stream, or any other source and was converted to a string

            string strHTML = @"<div id='testDiv'>
                                <table>
                                    <tr>
                                        <td>
                                            Header1</td>
                                        <td>
                                            Header2</td>
                                        <td>
                                            Header3</td>
                                    </tr>
                                    <tr>
                                        <td>
                                            Data1</td>
                                        <td>
                                            Data2</td>
                                        <td>
                                            Data3</td>
                                    </tr>
                                    <tr>
                                        <td>
                                            Data4</td>
                                        <td>
                                            Data7</td>
                                        <td>
                                            &nbsp;</td>
                                    </tr>
                                    <tr>
                                        <td>
                                            Data5</td>
                                        <td>
                                            &nbsp;</td>
                                        <td>
                                            Data10</td>
                                    </tr>
                                    <tr>
                                        <td>
                                            Data6</td>
                                        <td>
                                            Data8</td>
                                        <td>
                                            &nbsp;</td>
                                    </tr>
                                </table>
                                </div>
                                ";

            //Now load the HTML into the browser's HTML document
            webBrowser1.Document.Body.InnerHtml = strHTML;

            //Find Top Div with id='testDiv'
            HtmlElement testDiv = webBrowser1.Document.GetElementById("testDiv");

            //Find all rows in this div
            HtmlElementCollection rows = testDiv.GetElementsByTagName("tr");
            string currentMantisData = string.Empty;
            foreach (HtmlElement row in rows)
            {

                currentMantisData += "\r\n";
                HtmlElementCollection cols = row.GetElementsByTagName("td");
                foreach (HtmlElement col in cols)
                {
                    currentMantisData += (!string.IsNullOrEmpty(col.InnerText) ? col.InnerText.Trim() : string.Empty) + ",";
                }
            }
            MessageBox.Show(currentMantisData);
        }        
    }
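
The nested loops in the sample build the result with repeated string concatenation and leave a trailing comma at the end of every row. Here is a minimal sketch of one way to tidy that up, assuming the same WebBrowser setup as above. The class name HtmlTableCsv and the method ToCsv are hypothetical names made up for this illustration; they are not part of the original sample or of the .NET Framework.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Windows.Forms;

// Hypothetical helper for illustration only; not part of the original sample.
public static class HtmlTableCsv
{
    // Turns every <tr> found under the given element into one comma-separated line.
    public static string ToCsv(HtmlElement container)
    {
        if (container == null)
        {
            throw new ArgumentNullException("container");
        }

        StringBuilder csv = new StringBuilder();
        foreach (HtmlElement row in container.GetElementsByTagName("tr"))
        {
            List<string> cells = new List<string>();
            foreach (HtmlElement col in row.GetElementsByTagName("td"))
            {
                // InnerText can be null for an empty cell, so fall back to an empty string.
                cells.Add((col.InnerText ?? string.Empty).Trim());
            }

            // string.Join places commas only between cells, so no trailing comma is produced.
            csv.AppendLine(string.Join(",", cells.ToArray()));
        }

        return csv.ToString();
    }
}
```

With this in place, the two loops in webBrowser1_DocumentCompleted could probably be replaced by a single call such as MessageBox.Show(HtmlTableCsv.ToCsv(testDiv)). Note that this sketch ends each row with a line break instead of starting each row with one, so the output differs slightly from the original.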