Simple C# Screen Scraping Proxy with jQuery

I was asked today how to screen scrape an external site using jQuery. The short version is that you can’t do it with jQuery alone: browsers enforce the same-origin policy, which stops ajax requests going out to other domains/points of origin.

You can achieve the effect in a number of ways. The most old school of these is using an iframe, but in most cases this just won’t cut it as you’ll need to be able to manipulate the returned HTML.

A better way is to code up a simple server side proxy that does the scrape, and then do your ajax postback to there instead. Here’s an example in C#…

            
//Needs the System.Net (WebClient) and System.Text (UTF8Encoding) namespaces
using (WebClient client = new WebClient())
{
   string url = "http://www.google.com/";
   byte[] requestedHTML = client.DownloadData(url);
   UTF8Encoding objUTF8 = new UTF8Encoding();

   //This line just writes the string straight back to the response, but you
   //could just as easily stick it in a string variable and manipulate it to
   //your heart's content!
   Response.Write(objUTF8.GetString(requestedHTML));
}

Let’s say you saved that as the code behind of an otherwise blank page called “Ajax/scrape.aspx”. You’d then just need to use the jQuery “.load” command

$("#myDiv").load("Ajax/scrape.aspx");

and you’re there! Note that the browser may cache the response to a “load” call (it’s a plain GET request), so if you need more control, look up the “.ajax” command.

Implementing many-to-many relationships in MVC3

Just going to do a quick little example of how to implement many-to-many relationships in MVC3, basically because the information available out there is pretty rubbish. We’re going to do the classic scenario: products or product types possessing categories. I’m going to skip over what I consider to be the “bits that you probably don’t need help with” and focus directly on the problem. Below is the tidiest solution I can think of without having to include any third party or non-core assemblies etc. Please: If you can think of a better way, let me know!

So what we’re attempting to do is to implement a page which will allow the user to basically manage their categories. So, when they select a product type, they are going to be presented with a list of checkboxes (one for each category). They tick/untick these and then hit “submit”, and the product gets assigned those categories. This gives each product many categories, and each category many products. The fact that the info out there at time of writing is pretty poor staggers me given that this will be something that almost EVERY developer will need to do!

Technologies used: MVC3, Razor View Engine, C#, Linq to SQL / entity framework, SQL.
Patterns used: ViewModel, MVC, Repository.

The intention of this post isn’t to go through each line of code line by line; I’m just going to post how I did it and hopefully you can follow the code yourself. Having said that, I would be happy to answer any questions anyone has.

Database schema

Is pretty standard. Just the two tables we’re trying to relate and a link table containing the Primary Keys from each to represent a relationship. The rest of the code in this example assumes you’ve used the Entity framework to infer classes from these tables.

The View Model

So first things first, we need to create a view model (the model that will back up our view). We begin thinking about this by thinking about what is going to be on our page; basically a list of categories appropriate to a product type. So we’re only going to need two properties (ProductType and a dictionary of Categories, which will be the checkboxes when rendered). The View Model is therefore as follows;

    public class AssignProductTypeCategoriesViewModel
    {
        //The two properties
        public ProductType ProductType { get; private set; }       
        public Dictionary<Category, bool> CategoriesChecklist  { get; private set; }

        //a constructor to create the above two properties upon instantiation
        public AssignProductTypeCategoriesViewModel(ProductType productType, IEnumerable<Category> categories) 
        {
            bool found;
            ProductType = productType;
            CategoriesChecklist = new Dictionary<Category, bool>();

            foreach (var category in categories)
            {
                found = false;
                foreach (var producttypecategory in productType.ProductTypesCategories)
                {
                    if (category.ID == producttypecategory.CategoryId)
                    {
                        CategoriesChecklist.Add(category, true);
                        found = true;
                        break;
                    }                        
                }
                if(!found)
                    CategoriesChecklist.Add(category, false);
            }
        }
    }

Both property classes (Category and ProductType) as well as the link table class (ProductTypesCategories) were generated by the Linq to SQL classes generator / Entity Framework, which if you’ve named your tables differently to me will also be different.
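For what it’s worth, the nested loops in that constructor can be collapsed with LINQ. The following is only a sketch under a few assumptions: the Category and ProductTypesCategory classes here are cut-down stand-ins for the generated ones (just the members the constructor touches), and ChecklistBuilder is a made-up name.

```csharp
using System.Collections.Generic;
using System.Linq;

// Cut-down stand-ins for the generated classes (assumption: only these members matter here).
public class Category
{
    public int ID { get; set; }
    public string Name { get; set; }
}

public class ProductTypesCategory
{
    public int CategoryId { get; set; }
}

public static class ChecklistBuilder
{
    // Maps every category to true/false depending on whether a link row already exists for it.
    public static Dictionary<Category, bool> Build(
        IEnumerable<Category> categories,
        IEnumerable<ProductTypesCategory> existingLinks)
    {
        // Collect the linked category IDs once, so each lookup is cheap.
        var assigned = new HashSet<int>(existingLinks.Select(link => link.CategoryId));
        return categories.ToDictionary(c => c, c => assigned.Contains(c.ID));
    }
}
```

The result has the same shape as CategoriesChecklist, so the view would not need to change.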

The controller

Two actions: one to display the current product type-to-category setup, the other to handle the form once it’s posted back. Note that in the second one we are taking a form collection and the id of the product type we were editing. This is because there doesn’t appear to be a tidy way to bind the form straight to the model without sacrificing web accessibility/standards in the process. There are calls to repository methods in this code; I’ll cover them in the next section.

        public ActionResult AssignProductTypeCategories(int id)
        {
            var productType = productsRepository.GetProductTypeById(id);
            var categories = productsRepository.GetAllCategories();
            var model = productsRepository.GetAssignProductTypeToCategoryModel(productType, categories);
            return View(model);
        }

        [HttpPost]
        public ActionResult AssignProductTypeCategories(int id, FormCollection postedForm)
        {
            List<Category> categoriesToAdd = new List<Category>();
            ProductType productType = productsRepository.GetProductTypeById(id);
            foreach (var category in productsRepository.GetAllCategories())
            {
                if (postedForm[category.Name].ToString().Contains("true"))
                {
                    categoriesToAdd.Add(category);
                }
            }

            productsRepository.ReassignCategories(productType, categoriesToAdd);
            productsRepository.Save();
            return RedirectToAction("ProductTypes");
        }
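One detail of the POST action worth spelling out is the Contains("true") check. The CheckBox helper renders a hidden “false” input next to each checkbox, so a ticked box comes back as the two values “true” and “false” (joined as “true,false” when read through the indexer) and an unticked one as just “false”. Here is a small sketch of a stricter check; CheckboxForm and IsChecked are made-up names, and it takes a plain NameValueCollection (which FormCollection derives from) so it can be tried outside a controller.

```csharp
using System;
using System.Collections.Specialized;

public static class CheckboxForm
{
    // Returns true when the values posted under `name` include "true".
    // A missing key is treated as unticked rather than throwing.
    public static bool IsChecked(NameValueCollection form, string name)
    {
        string[] values = form.GetValues(name); // null when the key is absent
        if (values == null)
        {
            return false;
        }
        return Array.IndexOf(values, "true") >= 0;
    }
}
```

In the action above this would be called as CheckboxForm.IsChecked(postedForm, category.Name).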

The Repository Methods

Here are the methods that were called in the last example. I haven’t bothered including “Save”, or any of the other query type methods as I imagine you’ve pretty much got those covered.

        //This method just returns the view model (abstracted here for cleaner Controller code)
        public AssignProductTypeCategoriesViewModel GetAssignProductTypeToCategoryModel(ProductType productType, IEnumerable<Category> categories)
        {
            return new AssignProductTypeCategoriesViewModel(productType, categories);
        }

        //This method creates the new category to product configuration
        public void ReassignCategories(ProductType productType, List<Category> categoriesToAdd)
        {
            //Remove every existing link for this product type...
            db.ProductTypesCategories.DeleteAllOnSubmit(productType.ProductTypesCategories);

            //...then queue up one new link per selected category
            List<ProductTypesCategory> relationshipsToAdd = new List<ProductTypesCategory>();

            foreach (var category in categoriesToAdd)
            {
                relationshipsToAdd.Add(new ProductTypesCategory(productType, category));
            }

            db.ProductTypesCategories.InsertAllOnSubmit(relationshipsToAdd);
        }

One thing to note here is that the reassign method deletes the current configuration and replaces it with the new one. This is absolutely fine for this demonstration and probably for most cases (as every category will be considered every time they’re edited). The only risk you’d run in larger applications would be users “stomping on each other’s configurations” if they both happened to be changing the category setup at the same time.
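If the delete-everything approach bothers you, one alternative (not what the repository code above does) is to work out only the differences between the current links and the ticked categories, and insert or delete just those rows. A rough sketch working on category IDs; CategoryDiff and Compute are made-up names. It doesn’t make concurrent edits safe, it just avoids rewriting rows nobody changed.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class CategoryDiff
{
    // Given the category IDs currently linked and the IDs ticked on the form,
    // works out which links need inserting and which need deleting.
    public static void Compute(
        IEnumerable<int> currentIds,
        IEnumerable<int> selectedIds,
        out List<int> toAdd,
        out List<int> toRemove)
    {
        var current = new HashSet<int>(currentIds);
        var selected = new HashSet<int>(selectedIds);

        toAdd = selected.Except(current).ToList();    // ticked, but no link yet
        toRemove = current.Except(selected).ToList(); // linked, but no longer ticked
    }
}
```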

And finally, the View

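Strongly typed to the viewmodel we set up at the beginning. The only issue I’ve had is that the view engine adds some extra markup (a hidden input alongside each checkbox) to the page, and these come back in the form collection that the controller handles. The CheckBox HTML helper does this everywhere, not just in this example: browsers don’t post unticked checkboxes at all, so the helper adds a hidden “false” field to make sure every key comes back. A ticked box therefore posts as “true,false”, which is why the controller checks whether the value contains “true”.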

@model MirrorExpert.Models.ViewModels.AssignProductTypeCategoriesViewModel

@{
ViewBag.Title = "Assign Product Type Categories";
}

<h2>Assign Product Type Categories</h2>

@using (Html.BeginForm(new { id=Model.ProductType.ID }))
{
<fieldset>
<legend>@Model.ProductType.Name</legend>

@foreach (var category in @Model.CategoriesChecklist)
{
@Html.CheckBox(category.Key.Name, category.Value)
@Html.Label(category.Key.Name)
<br />
}

<p>
<input type="submit" value="Save" />
</p>
</fieldset>
}
<div>
@Html.ActionLink("Back to List", "ProductTypes")
</div>

Which essentially renders like this with the categories and product types in my database;

This is a pretty tight solution for achieving this sort of functionality, but I would welcome any criticism or comments that people might have.

Managing State in ASP.net

Managing state is possibly the most important thing any web developer has to do. A basic request to any web server has no concept of state – and if the developer has not explicitly specified this then each request will be treated as if it was the first time the user had ever made a request. It might be that this is absolutely fine: Small, static projects may have no need to explicitly manage state. Most projects or applications, however, will.

Now: This article is by no means going to go into the topic in great detail. There would be absolutely no need for me to do that because there is already a wealth of information out there. There is an absolutely fantastic article over on MSDN and if you want a more in-depth explanation of state management, I would recommend you stop reading this and go and read that instead. Although the concept of state management is not limited purely to sites and applications developed with Microsoft technology, their guide is the best I’ve seen.

What I will instead do here is give a very brief overview of some of the methods of state management, show an example of using each, bullet out some pros and cons of each and give some real-world examples of when you might use each one. I do expect this article to evolve over time, so if you’re reading this close to post date expect it to go through several revisions (and suggestions are most welcome).

This is also a fairly basic topic so apologies in advance: I’ve been writing applications for years but am relatively new to writing articles so am practicing on the basic topics first.

Session

You can think of a Session beginning when the user “first visits”, and ending after they’re “done”. The concept of a user being “done” is a decision that the developer needs to make themselves: it is determined by a period of inactivity from the user since their last request. At time of writing, the default period of time is 20 minutes. Session actually works by sticking a small cookie on the client’s machine (more on cookies later) containing a session ID. This ID corresponds to an ID server-side which again you can choose to either store “In Process”, in a SQL database or on a state server. Whichever of these choices you make, the basic idea remains the same.

Example

It’s just a collection, accessible through properties of the Page and HttpContext objects among others. You use it just like any other collection;

Put something in…

Session["Foo"] = "Bar";

And take something out…

string bar = (string)Session["Foo"];

Pros

  • Put something in Session, and it’s accessible at any point during the user’s visit to your site.
  • It can store complex objects (although you need to cast on retrieval as above)

Cons

  • Not Scalable: Data is stored server side.
  • State is lost if the application pool is rebooted for any reason (when stored “in process”).

Real world Application

  • Shopping trolley for the duration of a user’s visit.
  • Authentication / Remembering if a user is logged in or not.

Other considerations

While the concept of session will remain the same, you will additionally have to consider where you’ll store the session data itself. As previously mentioned, ASP.net allows you to store this in three different ways and the one you choose will basically be determined by how much traffic you’re expecting to get and how big your session objects are going to be. See the MSDN article for more info about these considerations.

Application

Not to be confused with Session, although very similar. You can access it through a property of the page object, and it survives until the application terminates. If you are storing your session data “in process”, the two live in the same place (the memory of the worker process), although application state cannot be moved out to a SQL database or a state server.

Example

Like session, it’s just a collection, accessible through properties of the Page and HttpContext objects. You use it just like any other collection;

Put something in…

Application["Foo"] = "Bar";

And take something out…

string bar = (string)Application["Foo"];

Pros

  • Accessible from any request, by any user, for as long as the application is running.
  • It can store complex objects (although you need to cast on retrieval as above)

Cons

  • Not Scalable: Data is stored server side.
  • State is lost if the application pool is rebooted for any reason.

Real world Application

You should use application state to store things that aren’t specific to the user or their visit. So, whilst a “shopping cart” is appropriate for session, it’s not appropriate for application. Things that *are* suitable are;

  • A database connection string
  • A third party web service or widget’s username or password

Commonly people will fish these things out of their web.config file on the “application start” event and put them into application object.

Cookies

Cookies are small key-value pairs that you can deposit onto the client’s machine through the web response. Any given cookie can have an “expiry date” that you set when you serve it to the client. You can’t just store these things forever on a client’s PC: if you don’t set an expiry date at all, the browser treats it as a session cookie and throws it away when it is closed. Commonly though, with cookies you’ll want to extend this period of time, the reason for which will become obvious when you see the real-world uses for cookies.

One other thing: Many developers seek to create cookies that never expire. This is not actually possible: the date of cookie expiration is sent back through the response and stored on the client’s PC. You can sort of achieve it by setting a ridiculously distant expiration date (say, 50 years) but this is generally bad practice because if someone isn’t going to visit your site for 49 years then you shouldn’t be giving them a cookie in the first place. A better approach is to “top up” the expiration date of the cookie every time a user visits your site, and limit this period of time to no more than 3 years.
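The “top up” idea boils down to a small date calculation. As a sketch (CookieExpiry, NextExpiry and MaxPeriod are names I’ve made up, and the 3-year cap is simply the figure suggested above):

```csharp
using System;

public static class CookieExpiry
{
    // Upper limit on how far ahead a cookie may be set to expire (assumed: roughly 3 years).
    public static readonly TimeSpan MaxPeriod = TimeSpan.FromDays(3 * 365);

    // Works out the new expiry date for a visit happening at `now`.
    // The requested period is clamped so it never exceeds MaxPeriod.
    public static DateTime NextExpiry(DateTime now, TimeSpan requested)
    {
        TimeSpan period = requested < MaxPeriod ? requested : MaxPeriod;
        return now + period;
    }
}
```

On each visit you would assign the result to the cookie’s Expires property before sending the cookie back in the response.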

Example

There’s a property of the Response object that you can use to send cookies back to the user;

Response.Cookies["foo"].Value = "bar";

And you can retrieve the cookie from a request…

string bar = Request.Cookies["foo"].Value;

Pros

  • Scalable (data stored on client machine)
  • Long expiration dates are allowed

Cons

  • Limited to 4KB in current browsers
  • Users can block them AND change them
  • Not secure (don’t use for passwords)

Real world Application

  • Tracking user conversions: If a user first visited your site on Monday, but didn’t convert until Friday, you might want to know that they *originally* visited through one of your PPC campaigns. You wouldn’t be able to tell this unless your application remembered their first visit.
  • Multivariate / AB testing (so your user doesn’t see all tests, you would need to assign them a ‘toss of the coin’ on the first visit and retain that throughout their experience)
  • Remembering login credentials (remember the original point about security though)
  • Cookies are used to power the aforementioned session state.

ViewState

Viewstate is a method by which you can maintain information during a postback (where a page is its own referrer). This is done by putting an encoded hidden field in the HTML response, which the client then sends back to you on their next request. For this reason, it should be obvious that large view state data will mean large requests and therefore will increase traffic size.

Example

Again, view state is accessible through a property of the Page object (and of any control on it), and is just a collection;

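if (!this.IsPostBack)
{
    //First request: stash the value in view state
    ViewState["foo"] = "bar";
}
else
{
    //Postback: read it back (values come out as object, so cast)
    string bar = (string)ViewState["foo"];
}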

Pros

  • Scope limited to postbacks of a single page, then it’s gone, so no lingering memory usage.

Cons

  • Increases traffic so data size should be kept small.
  • Can only be used in post-backs, then it’s gone (I know this is a pro as well, but there you go).

Real world Application

  • Step-by-step wizards, such as checkout pages
  • “Contact us” forms, which handle procurement as well as usage of information from the user

QueryString

This is the method by which you include (possibly) state related information within the URL of the request. Query strings take the following form;

http://www.mytestsite.com/apage.html?akey=avalue&anotherkey=anothervalue

You will then be able to access each key/value pair stored in the request.

Example

Given the above example, the values would be accessible from within the request using the following;

string aValue = Request.QueryString["akey"];

Pros

  • State can exist within links to your site.
  • User cannot “disable” query string values in their browser settings (they’d have to physically delete the details from the URL, which they can’t do by accident)

Cons

  • Size restrictions in some browsers
  • Limited to simple strings (certain characters, such as ? or &, have a special meaning in URLs and must be encoded)

Real world Application

  • Achieving simple dynamic content by passing an ID through the query string and accessing a database to retrieve content relevant to that ID (be VERY wary of SQL injection when anything from the query string ends up near a database)
  • Tracking where a user has come from (e.g. a banner ad may link through with a bunch of query string information that you want to record)
  • External links that may want to pre-fill or pre-prepare the resultant page (such as a link with an activation code that you have emailed to someone, may automatically “activate” something when the user clicks the link)
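Two of the points above (characters that need encoding, and not trusting an ID from the URL) can be sketched in a few lines. QueryStringHelpers, Pair and TryGetId are made-up names; the checks shown are a first line of defence only, and anything that reaches the database should still go through a parameterised query.

```csharp
using System;
using System.Globalization;

public static class QueryStringHelpers
{
    // Builds "key=value" with both parts percent-encoded, so characters
    // like ? and & in the data cannot break the URL.
    public static string Pair(string key, string value)
    {
        return Uri.EscapeDataString(key) + "=" + Uri.EscapeDataString(value);
    }

    // Accepts only a positive whole number made of digits (no sign, no spaces).
    // Anything else, including null, is rejected.
    public static bool TryGetId(string raw, out int id)
    {
        return int.TryParse(raw, NumberStyles.None, CultureInfo.InvariantCulture, out id)
            && id > 0;
    }
}
```

In a page you would pass Request.QueryString["id"] (or whatever key you use) to TryGetId and bail out early if it returns false.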

Forms

The HTML form tag provides a facility for you to make details available in the request that follows the submission of that form. Unlike view state, information stored in a form is available from any page that the “action” attribute of that form sends the subsequent request to, and is not just limited to post backs. HTML forms can also have hidden fields that are handled in the exact same way on the resulting page as non-hidden fields. This should go without saying, but hidden fields on a form are only hidden from the user’s ‘view’, and are in fact perfectly un-hidden if the user views the source of the page, so no sensitive information should ever go in them unless it is sufficiently encrypted.

Unlike other state management techniques that are described here, this one is interwoven with the HTML structure of your site. As a result it can be styled and used semantically as a “form” but this does not mean that you must use it in this way. A common practice is to have one form on a page and many actions that would submit that form. For example, if you had a page of products that a user could potentially add to their basket, it’s far simpler to have one form and allow many buttons to submit that form through javascript than it is to have many forms.

Example

A form containing an input with the name ‘foo’ would be accessed through the request as follows;

string bar = Request.Form["foo"];

Pros

  • Achieves (internally to your site) same result as query string without causing complex URLs.
  • Resultant request can be interrogated regardless of which page was the referrer.

Cons

  • Limited to simple strings (anything more complex has to be serialised into a string first)

Real world Application

  • “Add to cart” buttons
  • “Contact us” forms

The ASP.net Request object

The ASP.net Request object is a collection itself, and can be used in the following manner;

string foo = Request["bar"];

This will actually interrogate several parts of the request (the QueryString, Form, Cookies and ServerVariables collections, searched in that order, but not Session) for a value with a key corresponding to the one asked for. The order should not be a consideration you need to be making, though: it may be that you want to look through the entire request for your key, but in 99% of cases you will want to specify where you are looking, and any decision to use the above functionality should be a conscious one due to the inherent security vulnerabilities caused by the ambiguity.
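To see why that ambiguity matters, here is a toy model of a “first match wins” lookup across several collections. It is not the framework’s code; RequestLookup and Find are made-up names and the collections are plain NameValueCollection instances standing in for the real ones.

```csharp
using System.Collections.Specialized;

public static class RequestLookup
{
    // Returns the first non-null value for `key`, searching the
    // collections in the order they were passed in.
    public static string Find(string key, params NameValueCollection[] sources)
    {
        foreach (NameValueCollection source in sources)
        {
            string value = source[key];
            if (value != null)
            {
                return value;
            }
        }
        return null; // not found anywhere
    }
}
```

If a form posts role=user but somebody tacks ?role=admin onto the URL, a lookup that checks the query string first hands back “admin”, which is exactly the sort of surprise you avoid by asking the specific collection.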

Conclusion

Managing state should be at the forefront of your mind during web application development. There are numerous techniques at your disposal to achieve this, each with appropriate uses. Make every decision a conscious one to achieve the most scalable, secure and bug free application you can.

Extending the Page Object in ASP.net

I was once asked the following question during an interview for an asp.net developer role;

“How would you go about achieving common functionality and members across multiple pages?”

I was offered the job, and later found out that my answer to this question had been instrumental in the decision to hire me. Around 4 months later, I was interviewing developers myself and decided to throw the same question into the mix. Every single person gave the same answer, which was;

“Master Pages”

Nooooo! A quick google of this reveals that it’s a point that many asp.net programmers seem to get stuck on, so I’ve chosen it as the topic of my first post. In this article I’m going to explain why that wasn’t the correct answer, and then move onto some neat patterns for extending the page object as well as some ideas and examples.

Why shouldn’t I use a Master Page?

Master pages are incredibly useful when used for the correct purpose, but this purpose is largely limited to the presentation layer of your website. You can kind of see why so many people think that Master Pages are a good way of achieving common functionality across pages. The name MasterPage itself heavily implies there is a relationship and that the Master Page is higher up the hierarchy. To the unaccustomed eye, the master page is a container which surrounds content stored on other pages, so the logical conclusion to jump to is that the master page is a parent of the page object. In actual fact, the complete opposite is true.

The first common ancestor that the master page and the page objects have is the TemplateControl class, where the page object is a direct extension of it and the MasterPage object is an extension of one of its children (the UserControl object). As they’ve gone down two separate lineages, there is no way they can be related in the traditional sense.

You might then jump to the conclusion that the MasterPage is related to a page through possessing a reference to the page as a member. Again, this is not correct. The master page, in fact, has no architectural relationship to the page object at all, and accessing elements on a page from a master page is a colossal pain in the neck. You can only really do it using the FindControl method on your ContentPlaceHolders, which is not only incredibly inefficient and messy but also means you suddenly have to be very careful about the name you give each control’s ID across your entire site. The compiler will not be able to spot errors of this kind, and you could be setting yourself up for a massive headache later on.

In actual fact, the master page is accessed through a property of the Page object, which makes accessing the properties of a master page incredibly simple from within a page. For this reason, you should think of a master page as actually being a child of a page, not vice versa. Given that this is the case, it should be fairly clear why Master Pages are therefore not the place to put shared functionality.

Extending the page object

In order to achieve common functionality across multiple pages, you can simply inherit from the Page object.

public class ExtendedPage : System.Web.UI.Page
{
    protected string _aMember;

    public ExtendedPage()
    {
        _aMember = "something we need on every page";
    }
}

And then just inherit from that class in your actual pages;

public partial class _Default : ExtendedPage
{
    protected void Page_Load(object sender, EventArgs e)
    {
        string s = _aMember;
    }
}

Remember also that you have access to the full page life cycle; you can and should use this to do most of your initialization within your new base class. Code-behind files within asp.net websites, particularly the page_load method, should really only contain code relevant to building the page you are looking at (for example, the page_load event above should only contain code for building the _Default page, and should not contain initialization code for an ExtendedPage). You can achieve this very easily using event handlers;

public class ExtendedPage : System.Web.UI.Page
{
    protected string _aMember;

    public ExtendedPage()
    {
        //Hook into the page life cycle rather than relying on each page's Page_Load
        Init += AFunction;
    }

    private void AFunction(Object sender, EventArgs e)
    {
        //any functionality
    }
}

This provides a tidy method of getting functionality across multiple pages and keeps all of the initialization out of the code-behind files.

Examples of Usage

I would strongly advocate that you don’t just extend the page object once and use that everywhere. You can use the exact same techniques described here to extend the page multiple times. Probably the best real world example I can think of is the idea that your website might have some pages which require a user to be “logged in” before he or she can view them. You could achieve this very easily by inheriting from your “ExtendedPage” object above, perhaps calling it a “LoggedInPage” and putting functionality in its initialization routines to check for the ‘logged-in-ness’ of your user whichever way you want to, redirecting them to a different page if they’re not. After you’ve done this you can simply inherit from “LoggedInPage” for all those pages that require a login.
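To make the “LoggedInPage” idea concrete, here is a rough sketch. So that it compiles outside ASP.net, the Page class below is a tiny stand-in for System.Web.UI.Page that only models the Init event; IsLoggedIn, RedirectTo, AccountPage and the “/login.aspx” address are all made up for illustration. In a real site the two hooks would typically wrap whatever login check you use and a redirect on the response.

```csharp
using System;

// Stand-in for System.Web.UI.Page (assumption: only the Init event matters here).
public class Page
{
    public event EventHandler Init;

    // The real framework raises Init itself; this lets the sketch do it by hand.
    public void RaiseInit()
    {
        if (Init != null) Init(this, EventArgs.Empty);
    }
}

public class ExtendedPage : Page
{
}

public abstract class LoggedInPage : ExtendedPage
{
    protected LoggedInPage()
    {
        // Every page inheriting from this gets the check for free.
        Init += RequireLogin;
    }

    // Hypothetical hooks: how you detect a login, and how you redirect,
    // are up to your application.
    protected abstract bool IsLoggedIn();
    protected abstract void RedirectTo(string url);

    private void RequireLogin(object sender, EventArgs e)
    {
        if (!IsLoggedIn())
        {
            RedirectTo("/login.aspx");
        }
    }
}

// Example page that just records where it was sent.
public class AccountPage : LoggedInPage
{
    private readonly bool _loggedIn;
    public string RedirectedTo;

    public AccountPage(bool loggedIn) { _loggedIn = loggedIn; }

    protected override bool IsLoggedIn() { return _loggedIn; }
    protected override void RedirectTo(string url) { RedirectedTo = url; }
}
```

The point of the shape is that AccountPage contains no login code of its own; it inherits the check by choosing its base class.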

Another good use of this would be for managing database connections for the duration of a page’s lifetime. You may not want a database connection on every page, so again you could create a new page class that inherits from your ExtendedPage, and use the event handlers to open the connection on the Load event and close it on the Unload event. As the unload event will always fire, you can be sure that you are always closing your connections, which all but removes the chance of a leaked DB connection. You might not want to have your database connection on the page itself; for example, you may want to abstract this concept into a separate layer so you can easily change your data source type should you need to. The general idea described here can be applied to that situation too; I’ve kept it as a database connection in the example for simplicity.

And back on to master pages for a second: Maybe you do want access to some of the page’s properties from within a master page? For example, the aforementioned database connection might be pretty useful on the master page. To achieve this you could simply put a bunch of properties on your master page object (through inheritance or the use of an interface) and set them within an event handler of a page (of which, remember, the master page is a member).

public class ExtendedPage : System.Web.UI.Page
{
    SqlConnection _connection;

    public ExtendedPage()
    {
        Init += StartConnection;
        Init += MasterProperties;
        Unload += EndConnection;
    }

    private void MasterProperties(Object sender, EventArgs e)
    {
        //MyMaster is your own master page class, exposing a Connection property
        var master = (MyMaster)this.Master;
        master.Connection = _connection;
    }

    //... other event handlers (StartConnection, EndConnection) to open/close connection
}

In terms of members, there may be information that you want to use on all pages, such as the site root and the site’s home page. For code readability it might even be nice to get some of the things that you commonly access in the request object and stick them into member variables. Without naming any names, it might also be that you want to integrate a third party tool or widget into your code, and that this has to appear on every page of your site, and you could do this in your page object without the need to put any code into your master page at all.

Conclusion

The answer to the question right at the top of this article is therefore;

“the best way to achieve common functionality and members across multiple pages is to extend the page object”

Hopefully the advantages of doing things this way are pretty clear now. The sky is pretty much the limit once you get your page object architecture together, which is something that it’s worth spending a bit of time on to get right.