A client asked me the other day whether there was an easy way to check a page of his that lists over one hundred web sites. Obviously it's very tedious to click each link one by one, so I offered to try to come up with some code that would do it for him. While I know it's fairly trivial to write a Windows app that does this, I wanted to make it web-based so it would be easy to distribute and re-use within our organization.
Well, as it turns out, there's no "quick and easy" way to parse HTML with server-side ASP.NET code! In a Windows app it's easy to use a WebBrowser control and get an HtmlDocument object, but there's no built-in server-side equivalent. After much searching I ran across a library called the HtmlAgilityPack, which made it super easy to get the list of A elements in a web page and pull out each one's text and href.
Here's some simple code if you want to try something like this yourself ...
<form id="form1" runat="server">
    URL to check: <asp:TextBox ID="TextBoxURL" runat="server" Width="550px" /> <asp:Button ID="ButtonGo" runat="server" Text="Go" />
    <asp:Literal ID="LinkTable" runat="server" />
</form>
Imports System.Net
Imports HtmlAgilityPack

Partial Class linkChecker
    Inherits System.Web.UI.Page

    Protected Sub ButtonGo_Click(ByVal sender As Object, ByVal e As System.EventArgs) Handles ButtonGo.Click
        Dim request As WebRequest = WebRequest.Create(TextBoxURL.Text)
        'use this line if you need to authenticate
        request.Credentials = New NetworkCredential("username", "password", "domain")

        'download the page and load it into an HtmlAgilityPack document
        Dim doc As HtmlDocument = New HtmlDocument
        Using response As HttpWebResponse = CType(request.GetResponse(), HttpWebResponse)
            doc.Load(response.GetResponseStream())
        End Using

        'select all the Anchor elements
        Dim hrefs As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//a[@href]")
        Dim links As String = "<table style='width:100%;font-family:verdana;font-size:10pt;'>" & vbCrLf

        'SelectNodes returns Nothing when the page has no matching elements
        If hrefs IsNot Nothing Then
            For Each href As HtmlNode In hrefs
                Dim linkColor As String = "blue"
                Dim uri As String = href.Attributes("href").Value
                'only absolute links are tested; relative links are skipped
                If InStr(uri, "http", CompareMethod.Text) > 0 Then
                    links += "<tr><td>"
                    Try
                        Dim testReq As WebRequest = WebRequest.Create(uri)
                        Dim myProxy As New WebProxy()
                        myProxy.Address = New System.Uri("http://proxy.mydomain.com:8000")
                        'if you go through a proxy for external sites but not for internal sites, put your domain here
                        If InStr(uri, "mydomain.com/", CompareMethod.Text) = 0 Then
                            testReq.Proxy = myProxy
                        End If
                        'request the link and report the status the server returns
                        Using testRes As HttpWebResponse = CType(testReq.GetResponse(), HttpWebResponse)
                            links += testRes.StatusDescription
                        End Using
                    Catch ex As Exception
                        'any failure (timeout, DNS error, 404, ...) marks the link red
                        links += "<span style='color:red;'>Err</span>"
                        linkColor = "red"
                    End Try
                    links += "</td><td>" & href.InnerHtml & "<br><a style='color:" & linkColor & ";' href='" & uri & "'>" & uri & "</a></td></tr>" & vbCrLf
                End If
            Next
        End If

        links += "</table>" & vbCrLf
        LinkTable.Text = links
    End Sub
End Class
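
One thing to be aware of: the code only tests hrefs that contain "http", so relative links such as /about.aspx are never checked. If you want to include them, here is a minimal sketch of one way to do it, assuming you resolve each href against the address of the page being checked using System.Uri. The ResolveLink helper, the LinkResolver module and the example.com addresses are hypothetical names for illustration only; they are not part of the code above.

```vb
Imports System

Module LinkResolver
    ' Hypothetical helper: turns an href into an absolute http(s) URL, or
    ' returns Nothing when the link cannot be checked (mailto:, javascript:, ...).
    ' Assumes pageUrl is itself a valid absolute URL; New Uri throws otherwise.
    Function ResolveLink(ByVal pageUrl As String, ByVal href As String) As String
        Dim resolved As Uri = Nothing
        If Not Uri.TryCreate(New Uri(pageUrl), href, resolved) Then Return Nothing
        If resolved.Scheme <> Uri.UriSchemeHttp AndAlso resolved.Scheme <> Uri.UriSchemeHttps Then
            Return Nothing
        End If
        Return resolved.AbsoluteUri
    End Function

    Sub Main()
        ' A relative link is resolved against the page it was found on
        Console.WriteLine(ResolveLink("http://www.example.com/links/page.aspx", "../about.aspx"))
        ' Non-http links come back as Nothing, so the caller can skip them
        Console.WriteLine(ResolveLink("http://www.example.com/links/page.aspx", "mailto:someone@example.com") Is Nothing)
    End Sub
End Module
```

In the click handler you could then call ResolveLink(TextBoxURL.Text, href.Attributes("href").Value) in place of the InStr test and skip the link whenever it returns Nothing.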