Take Two Programming: July 2009

Friday, July 31, 2009

SC Engine: The Trouble With Testing

This is sort of an ad-hoc post, so I'm not really considering it part of my SC Engine series.

The Mercurial repository shows SC Engine at just under two months old. I guess it's time to look back at some of the things that I've tried and maybe think about stuff that worked vs. stuff that didn't work.

The big change in this project from other things I've done is the architecture. Although I'm sure my implementation was not the best, I really don't know how much I approve of a messaging system for everything. On the one hand, it did great for separating out different parts of the system, and on very large systems I could imagine it being useful when you have multiple systems with different platforms. Also, horizontally scaling would be extremely easy, although this wasn't really a concern for me. Really, the big problem was that testing applications was not as easy as I had hoped.

The goal was simple: make testing apps as easy as passing in data and collecting the data back. Make sure that the data returned was expected. Rinse and repeat...


def test_some_app(self):
    msg = IncomingMsg(1, 2)
    results = self.app.on_msg(msg)
    expected = OutgoingMsg(1,3)
    self.assertContains(results, expected)

The problem is with messages that had a TON of attributes and needed a few earlier messages to set them up. Suddenly, a test could look like...


def test_some_app(self):
    # Two setup messages that the app needs to have seen first.
    incoming_data_1 = {
        'data_1' : 1,
        'data_2' : 1,
        'data_3' : 1,
        'data_4' : 1,
        'data_5' : 1,
        'data_6' : 1,
        'data_7' : 1,
    }

    incoming_data_2 = {
        'data_1' : 1,
        'data_2' : 1,
        'data_3' : 1,
        'data_4' : 1,
        'data_5' : 1,
    }

    incoming_msg_1 = IncomingMsg(**incoming_data_1)
    incoming_msg_2 = IncomingMsg(**incoming_data_2)

    self.app.on_msg(incoming_msg_1)
    self.app.on_msg(incoming_msg_2)

    # The message whose handling is actually under test.
    incoming_data_3 = {
        'data_1' : 1,
        'data_2' : 1,
        'data_3' : 1,
        'data_4' : 1,
        'data_5' : 1,
    }

    incoming_msg_3 = IncomingMsg(**incoming_data_3)
    results = self.app.on_msg(incoming_msg_3)

    expected_outgoing_data = {
        'data_1' : 1,
        'data_2' : 1,
        'data_3' : 1,
        'data_4' : 1,
        'data_5' : 1,
    }

    expected_outgoing_msg = OutgoingMsg(**expected_outgoing_data)
    self.assertContains(results, expected_outgoing_msg)

Now, sometimes I only needed data_1 and data_2, and I couldn't care less what the rest of the incoming data was. However, I often needed to make sure that, say, data_3, data_4, and data_5 on the outgoing message were the same. Either way, there seemed to be a theme of code being less logic and more naming attributes. Basically, every app needed to get all the attributes and specify what they were in its own way. This wasn't just a problem in tests; the actual applications themselves were, at times, reeking of a bunch of attribute dictionaries and no real logic. It made the code pretty ugly looking.

In the end, I started writing fewer unit tests. At first, I thought that this was ok. I didn't necessarily want to write unit tests for every class as I was writing it, since sometimes you really need to write the objects and see how they will work together before you have a clear idea of the best way to implement them. Later, I realized it was really because the tests were a pain. For some objects it was understandable, but apps were pretty straightforward in terms of their interface (it's a bunch of methods with a single argument: the msg).

One potential solution would be to split messages up. In fact, I did this in some cases. However, in terms of testing, this would mean going from large messages to more messages, which means the test would be just as long.

Another potential solution was to build helper objects to build the messages, but in the case of tests where all the attributes were indeed important, this wouldn't really help. Plus, often the attributes might not be important for the specific test. This probably would just end up being more work anyway.

At this point, I'm starting to see my problem in the following manner: even the simplest applications are trying to do three things:

1.) Take the data that was passed to the app via messages and convert it into some locally-defined data structure.
2.) Run operations on the data structures to arrive at results.
3.) Convert results into messages to send out.

In trying to create tests for my app to make sure that #2 was being done correctly, I would inevitably end up writing extra code for EACH test to ensure that #1 and #3 were being done as well.

Perhaps I should start thinking of these as separate stages altogether. I can keep much of the infrastructure, but place objects on top that split the handling of a single message into the three distinct parts.
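A minimal sketch of what that split might look like. The message types and the three function names here are made up for illustration and are not the engine's real classes. The point is only that stage 2 can be tested with plain values, without building any messages.

```python
from collections import namedtuple

# Hypothetical message types, standing in for the engine's real ones.
IncomingMsg = namedtuple("IncomingMsg", "data_1 data_2 data_3")
OutgoingMsg = namedtuple("OutgoingMsg", "data_1 total")


def parse(msg):
    """Stage 1: pull out only the fields the logic needs."""
    return msg.data_1, msg.data_2


def compute(a, b):
    """Stage 2: pure logic, no messages involved."""
    return a + b


def build(msg, total):
    """Stage 3: wrap the result in a list of outgoing messages."""
    return [OutgoingMsg(data_1=msg.data_1, total=total)]


def on_msg(msg):
    """The handler is just the three stages glued together."""
    a, b = parse(msg)
    return build(msg, compute(a, b))
```

With this shape, most tests can call compute(1, 2) directly, and only a handful need to go through on_msg with a full message.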

Edit:

Alternatively, I could move to a situation where each message carries only enough data to identify the important info, and the rest of the data is stored in a database somewhere outside of the application. Basically, the message would only contain ids.

Wednesday, July 22, 2009

SC Engine: Part 8 - The Site


    # CUSTOM ROUTES HERE
    map.connect('/', controller='main', action='index')
    map.connect('/uploader_info', controller='main', action='uploader_info')
    map.connect('/contact', controller='main', action='contact')
    map.connect('/contact_submit', controller='main', action='contact_submit')
    map.connect('/contact_submitted', controller='main', action='contact_submitted')

    # Uploads
    map.connect('/update/{action}', controller='update')

    # Ajax retrieval of commentary
    map.connect('/commentary/{game_id}/{author}', controller='commentary', action='view')

    # Matches
    map.connect('/{league}/{stage_0}/{stage_1}/{stage_2}/{team_one}-v-{team_two}',
            controller='league', action='match')
    map.connect('/{league}/{stage_0}/{stage_1}/{team_one}-v-{team_two}',
            controller='league', action='match')
    map.connect('/{league}/{stage_0}/{team_one}-v-{team_two}',
            controller='league', action='match')
    map.connect('/{league}/{team_one}-v-{team_two}',
            controller='league', action='match')

    # Match set
    map.connect('/{league}/{stage_0}/{stage_1}/{stage_2}/{team_one}-v-{team_two}/Set{game_number}',
            controller='league', action='view')
    map.connect('/{league}/{stage_0}/{stage_1}/{team_one}-v-{team_two}/Set{game_number}',
            controller='league', action='view')
    map.connect('/{league}/{stage_0}/{team_one}-v-{team_two}/Set{game_number}',
            controller='league', action='view')
    map.connect('/{league}/{team_one}-v-{team_two}/Set{game_number}',
            controller='league', action='view')

    # Stages
    map.connect('/{league}/{stage_0}/{stage_1}/{stage_2}', controller='league', action='stage')
    map.connect('/{league}/{stage_0}/{stage_1}', controller='league', action='stage')
    map.connect('/{league}/{stage_0}', controller='league', action='stage')
    map.connect('/{league}', controller='league', action='stage')

Eventually, I'm going to need to do matches other than team_one vs. team_two (some leagues have matches that contain four players in a group, and the url would need to have something like "GroupA"), and I figure that when that time comes I'll rewrite how the rule matching works. But, it works for now, and I can see it in one page, so I'm not too concerned with it (a year ago I would've spent another day getting these routes down to five lines of code, one example of how I think I've matured as a programmer recently).

Right now, the SQLAlchemy code that actually retrieves the data is very simple. It suffers from N+1 Select issues, and at some point will need to be rewritten for performance reasons. I won't show any of it because it really is just a matter of getting some data by the id and then fetching child data over and over. It ends up making six or more database calls for each page, and at some point I'll want to reduce that number to one. One thought was to use an application to denormalize the data and upload it already denormalized, but that seems like a bunch of work for what could probably be solved with a few minutes of tweaking the SQLAlchemy queries.
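To make the N+1 problem concrete, here is a small sketch using the standard library's sqlite3 module rather than SQLAlchemy. The table and column names are invented for the example and are not the site's real schema. The first function issues one query per match, the second gets the same data with a single JOIN, which is roughly what an eager-loading option in an ORM would be expected to do.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE matches (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE games (id INTEGER PRIMARY KEY, match_id INTEGER, number INTEGER);
    INSERT INTO matches VALUES (1, 'A-v-B'), (2, 'C-v-D');
    INSERT INTO games VALUES (1, 1, 1), (2, 1, 2), (3, 2, 1);
""")


def games_n_plus_one(conn):
    """One query for the matches, then one more query per match."""
    queries = 1
    result = {}
    matches = conn.execute("SELECT id, name FROM matches ORDER BY id").fetchall()
    for match_id, name in matches:
        rows = conn.execute(
            "SELECT number FROM games WHERE match_id = ? ORDER BY number",
            (match_id,),
        ).fetchall()
        queries += 1
        result[name] = [number for (number,) in rows]
    return result, queries


def games_joined(conn):
    """The same data in a single query, grouped in Python."""
    result = {}
    rows = conn.execute(
        "SELECT m.name, g.number FROM matches m "
        "JOIN games g ON g.match_id = m.id ORDER BY m.id, g.number"
    )
    for name, number in rows:
        result.setdefault(name, []).append(number)
    return result, 1
```

Both functions return the same dictionary; only the number of round trips to the database differs.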

Ignoring specific files or directories with Mercurial

I switched from svn to Mercurial about a year ago, and for the most part it has been going well. One gotcha is that when trying to ignore files, things may not work as you expect (instead, they work as you SHOULD have expected).

Imagine a directory structure like this...


project
|-- development.ini
|-- run.py
`-- data

Here, data is a directory that stores the data for your project while it is running, so you want to ignore it. Opening up the .hgignore file you've had for a while, you see something like this...


syntax:glob

*.pyc
*.log

...and you decide to do this...


syntax:glob

*.pyc
*.log
data

This is a bad move. Sooner or later, your project is going to look like this...


project
|-- development.ini
|-- run.py
|-- data
`-- src
    |-- main.py
    `-- parts
        |-- graphics
        |   |-- stuff.py
        |   `-- stuff2.py
        `-- data                 <-- !!!
            |-- loader.py
            `-- retrieval.py

And suddenly, when you make that src/parts/data directory, it's going to be ignored by Mercurial. If you want to ignore a specific directory, such as one that is used for data (common when using pylons), it's better to use:


syntax:regexp
^data$

This will ignore only the file or directory named "data" at the root of the repository. Note that the pattern is matched against the path relative to the project's root, so our data directory under src would not be matched, since its name is 'src/parts/data'. The ^ and $ are used in regular expressions to signify the start and end of the string.
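The difference between the rooted and unrooted pattern can be checked with Python's re module. This is only a rough model of the behaviour described above, not Mercurial's actual implementation: the hypothetical is_ignored helper treats a path as ignored if the path itself or any of its parent directories matches the pattern.

```python
import re


def is_ignored(path, pattern):
    """Return True if the path or any of its parent directories matches."""
    regex = re.compile(pattern)
    parts = path.split("/")
    # Check "src", "src/parts", "src/parts/data", ... in turn.
    candidates = ["/".join(parts[:i]) for i in range(1, len(parts) + 1)]
    return any(regex.search(candidate) for candidate in candidates)


# Rooted pattern: only the top-level data directory is ignored.
print(is_ignored("data/cache.db", r"^data$"))
print(is_ignored("src/parts/data/loader.py", r"^data$"))

# Unrooted pattern: the nested directory gets caught as well.
print(is_ignored("src/parts/data/loader.py", r"data"))
```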

Friday, July 17, 2009

SC Engine: Part 7 - The Console

The two main things are error reports and youtube partial parses:

Error Reports

The error report lists all of the errors that have been encountered and still need to be dealt with. It gives some quick data for each error (the screenshot has just one), such as the time the error occurred, the name of the application that the error occurred in, the name of the message that was being handled when the error occurred, and the error message itself (constructed by converting the raised error into a string). There's also the ability to retry an error (that is, the message will be sent out again) or delete it.

It's important to realize that retrying will send the message out again, so if a message is handled by two applications and one raises an error that we later retry, BOTH applications will be asked to handle the message. Luckily, it typically isn't too difficult to write applications to gracefully handle the same message more than once, but it is something to be aware of. An application that keeps track of how many games a player has ever played, for example, would need to keep track of the actual games, not just increment a counter every time a message arrives stating that a player is in a game.
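A small sketch of the games-played example, with hypothetical class and method names. Storing the game ids in a set means that handling the same message twice leaves the count unchanged, whereas a bare counter would be off by one after every retry.

```python
class GamesPlayedTracker(object):
    """Counts games per player in a way that survives repeated messages."""

    def __init__(self):
        self.games_by_player = {}

    def on_player_in_game(self, player_id, game_id):
        # Adding an id that is already in the set is a no-op.
        self.games_by_player.setdefault(player_id, set()).add(game_id)

    def games_played(self, player_id):
        return len(self.games_by_player.get(player_id, ()))


tracker = GamesPlayedTracker()
tracker.on_player_in_game(1305, "game-1")
tracker.on_player_in_game(1305, "game-1")  # the retried message
tracker.on_player_in_game(1305, "game-2")
```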

Anyway, clicking on "View" gives you more info on the error:

In addition to the data seen in the list screen, we also get to see the actual data that was contained in the message (which is typically a dictionary of values), the traceback of the error, as well as any log messages that were written during the application's execution of handling the message.

Youtube Partial Parses

This screen shows me all the cases where a youtube video came to my attention, had enough data to possibly be a proleague game, but I couldn't determine what game it was for. It shows the date & time of the attempted link, the video id (with a link that I can use to easily open the video), and the title of the video for quick reference. Additionally, the dates, players, teams and game numbers parsed from the videos are shown, as well as the number of possible results found. Of course, the same retry and delete buttons are here too.

Clicking on view gives me more info:

The date played links to the kespa website, specifically to the page for that schedule, which is a handy feature. The player ids also have a link to the kespa page.

Down below, information on the games that are seen as possible candidates is shown. This helps me figure out if the engine is anywhere in the ballpark in terms of guessing the game, and also gives me all the info I need if I want to add to the script that tells the engine to manually mark videos for games. However, for most situations (including this one), the solution is to add aliases for players. In this case, players "Baxter" and "[Oops]Cloud" were not recognized. Some sleuthing reveals that "Baxter" is another name for a player typically known as "Killer" (id: 1305), and "[Oops]Cloud" is typically just known as "Cloud" (id: 348). For something like this I would add these two new aliases to the engine, and then hit "retry". This time around, the link would be made, because the one other game would no longer be reported as a candidate. This is a better result than manually linking, because now future games that use these aliases will also be immediately recognized, causing less work for me.

To quickly go over the rest of the menu:

Proleague Partial Fetch PP:

This is a feature I still haven't implemented, but I put it in the menu anyway. It will be similar to YouTube partial parse, except instead of showing ambiguous YouTube videos, it'll show ambiguous game schedules.

Utils:

This contains a bunch of forms that can be filled out to send specific messages, such as a schedule fetch for a certain day. It was very useful when the only messaging system was a single application storing the messages locally in a list, where the only way I could add new messages was to shut down the application and change the code. It still works, so I keep it around.

Status:

Will eventually show a "heartbeat" status of all running apps: an indication of how long an application's message queue is, obtained by sending out a heartbeat and seeing how long it takes before the heartbeat is responded to. Just a quick way to check the current load, in addition to viewing the actual messaging system's queues.

SC Engine: Part 5 - Linking the Video to the Game


store = PickleFileStore()
repo = Repository(store)

with repo.use() as data:
   data['a'] = 1

When repo.use() is called, it creates a context manager that, on closing, saves off the data. The use decorator that I've utilized on my app just wraps this functionality and adds the resulting data as arguments to the function. The proleague_match_id_announcement method, without the decorator, would look like this...


def proleague_match_id_announcement(self, msg):
    with self.repo.use('games') as games:
        games.add_proleague_match(msg.match_id, msg.team_one, msg.team_two)

...so really, all it helps in doing is making the function use one less tab, which is always a plus in my book.

Also, the "add_repository" method just allows for custom types to be used as the object you're working with, where without this your data would just be a dictionary. By allowing for a custom type (that takes in the dictionary as the argument to the constructor), I can easily wrap complex logic into other objects while still utilizing the implicit saving of the store. I probably did a horrible job explaining this, but I think the important part is what the application is doing anyway.
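For readers who want something more concrete, here is a rough sketch of how such a repository and decorator could be put together. It is not the engine's real code: DictStore is a hypothetical in-memory stand-in for PickleFileStore, the load and save method names are invented, and the choice to save only when the block finishes without an error is an assumption.

```python
from contextlib import contextmanager
from functools import wraps


class DictStore(object):
    """Hypothetical in-memory stand-in for PickleFileStore."""

    def __init__(self):
        self.saved = {}

    def load(self, name):
        return dict(self.saved.get(name, {}))

    def save(self, name, data):
        self.saved[name] = dict(data)


class Repository(object):
    def __init__(self, store):
        self.store = store

    @contextmanager
    def use(self, name="default"):
        data = self.store.load(name)
        yield data
        # Only reached if the with-block finished without raising.
        self.store.save(name, data)


def use(name):
    """Decorator form: passes the loaded data as an extra argument."""
    def decorator(method):
        @wraps(method)
        def wrapper(self, msg):
            with self.repo.use(name) as data:
                return method(self, msg, data)
        return wrapper
    return decorator


class ExampleApp(object):
    def __init__(self, repo):
        self.repo = repo

    @use("games")
    def on_match(self, msg, games):
        games[msg] = True
```

The decorated on_match behaves like the with-block version shown earlier, just with one less level of indentation in the handler body.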

The typical workflow for this app is as follows:

Somewhere else, a new match is found for the schedule, and an id is given to it. The ProleagueMatchIdAnnouncement is sent out, which gives info regarding the match and the id given to it. The app saves the information that it needs.

A little while after, the GameDataAnnouncement message arrives. It has the match id, and the game number to determine what game in the match it is. The rest of the data is game specifics (id of players, id of map, winner, etc.). Once again, the important data is recorded, joined with the match data.

A few hours to days later, someone posts up a youtube clip of the game. We receive the YouTubeParseAnnouncement message, with data such as ids of the possible players and teams, dates found and game numbers found. We go through our collection of games and find any that match. If more than one does, we send out a "Partial Parse" message (which will eventually allow me to look at these videos manually), but if we only find one, we're (mostly) sure it's it, so we send out a VideoGameLinkAnnouncement to signify this.
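The decision at the end of that workflow is small enough to sketch. The function name and the tuple return values are placeholders for the real message classes, and what happens when no game matches at all is not described above, so returning None in that case is an assumption.

```python
def decide_link(video_id, candidates):
    """Pick the outgoing message for a parsed video, given matching games."""
    if len(candidates) == 1:
        # Exactly one game fits, so we are (mostly) sure it is the one.
        return ("VideoGameLinkAnnouncement", video_id, candidates[0])
    if len(candidates) > 1:
        # Ambiguous: report it so it can be looked at manually.
        return ("PartialParse", video_id, list(candidates))
    # Assumption: with no candidates there is nothing to send.
    return None
```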

You'll also notice that there is this idea of a "manual link". Sometimes I stumble across a video that has something different or wrong with it that makes it very difficult for the engine to find. For example...

Siz)KaL vs Midas [30 November, 2008] 32set @ Proleague

This video's title reads "Siz)KaL vs Midas [30 November, 2008] 32set @ Proleague". I can gather from the title that Siz)KaL and Midas are the players involved, and that it takes place on November 30, 2008. The game in the video is the 3rd game of the match, but a typo (32set instead of 3set) means the parser sees it as the 32nd game. A future goal is to have the search algorithm better handle such problems, but in the meantime, and for very extraordinary situations, I can manually choose the game.

As for the game search repository itself...

Game Search Repository

As you can see, the GameSearchRepository takes the store, which is simply a dictionary object, in its constructor. It saves info about matches and games, and when trying to find games, will run the spec through all the known games and return the results. The repository needs to save the match data so that when the game data comes, it can combine the two together. The GameData object looks like this:

Game Data

Really, there's not much to say here. The game search consists of finding all the game data that return True from matches_spec, meaning that all the info in the spec correlates to the data it has on the game. Since the search is O(n), there is room for performance improvement, but with an entire season's worth of games in memory I haven't noticed enough of a slowdown to start messing with it now.
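As a rough illustration of that linear search, here is a simplified sketch. The real GameData and spec are richer than this (the spec carries lists of possible players, teams and dates), so the plain equality check and the field names below are assumptions made to keep the example short.

```python
class GameData(object):
    """Simplified stand-in: a game is just a bag of known fields."""

    def __init__(self, **fields):
        self.fields = fields

    def matches_spec(self, spec):
        """True if every key in the spec agrees with what we know."""
        return all(self.fields.get(key) == value for key, value in spec.items())


def find_games(games, spec):
    """Linear scan over every known game: O(n) in the number of games."""
    return [game for game in games if game.matches_spec(spec)]


games = [
    GameData(date="2008-11-30", game_number=3),
    GameData(date="2008-11-30", game_number=1),
    GameData(date="2008-12-01", game_number=3),
]
```

A spec with only the date finds two candidates here, while adding the game number narrows it down to one.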

Sunday, July 12, 2009

SC Engine: Part 6 - Messaging Middleware


import my_app

name = "my_app"
root_config, config = {}, {}
subscriptions = my_app.make_my_app(name, root_config, config) # Run the app's builder function

bus = ApplicationMessageBus(name, subscriptions)

Once again, the ApplicationMessageBus is similar to applications in that it's not tied down to any specific messaging library. Its purpose is to receive a message and, knowing the subscriptions for the app, pass the message to the correct handler. It also takes the returned messages and properly creates a list of outgoing messages, which it returns to its caller. This makes things a bit easier for the caller, since it can always expect a list.

A few other notes about this object. First, in addition to a handle method, there is an idle method. This gets called every once in a while and allows me to write apps that, in addition to having handlers for messages, can also have handlers for elapsed time. This means I can write an app like this...
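The dispatch-and-normalize behaviour can be sketched in a few lines. SimpleMessageBus is a made-up name and this is not the real ApplicationMessageBus. In particular, returning an empty list for a message type nobody subscribed to is an assumption.

```python
class SimpleMessageBus(object):
    """Routes a message to its handler and always returns a list."""

    def __init__(self, name, subscriptions):
        self.name = name
        self.subscriptions = subscriptions

    def handle(self, msg):
        callback = self.subscriptions.get(type(msg))
        if callback is None:
            # Assumption: unsubscribed message types are silently skipped.
            return []
        results = callback(msg)
        if results is None:
            return []
        if isinstance(results, (list, tuple)):
            return list(results)
        # A handler returned a single message: wrap it.
        return [results]
```

Handlers are then free to return nothing, one message, or several, and the caller never has to check which.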


from datetime import timedelta

def make_my_app(name, root_config, config):
    app = MyApp()

    # Message classes map to message handlers; timedeltas map to timed handlers.
    return {
        msgs.AMessage : app.handle_a_message,
        timedelta(minutes=5) : app.run_every_five_minutes,
    }

class MyApp(object):
    def handle_a_message(self, msg):
        ...

    def run_every_five_minutes(self):
        ...
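One way an idle method could service the timedelta subscriptions is sketched below. The IdleScheduler class is hypothetical and the post does not show how the real bus does this. The sketch remembers when each timed handler last ran and fires it once the interval has elapsed.

```python
from datetime import datetime, timedelta


class IdleScheduler(object):
    """Runs the callbacks that were subscribed with a timedelta key."""

    def __init__(self, subscriptions, now=None):
        now = now or datetime.now()
        # Keep only the timedelta keys and remember when each last ran.
        self.timers = dict(
            (interval, (callback, now))
            for interval, callback in subscriptions.items()
            if isinstance(interval, timedelta)
        )

    def idle(self, now=None):
        """Call every timed handler whose interval has elapsed."""
        now = now or datetime.now()
        ran = []
        for interval, (callback, last_run) in list(self.timers.items()):
            if now - last_run >= interval:
                callback()
                self.timers[interval] = (callback, now)
                ran.append(interval)
        return ran
```

Passing the current time in explicitly is only there to make the behaviour easy to test; a real bus would presumably just use the clock.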

Also, let's take a closer look at the part that actually calls the application's callback:


    def _run_callback(self, callback, msg = None):
        log_msg = "Running callback on %s for message %s" % (callback, msg)
        self.logger.info( log_msg )

        root_logger = logging.getLogger()
        logging_handler = CollectionLogHandler()
        root_logger.addHandler(logging_handler)
        try:
            if msg:
                results = callback(msg)
            else:
                results = callback()

        except Exception as e:
            exceptionTraceback = sys.exc_info()[2]
            tb = traceback.extract_tb(exceptionTraceback)
            log = logging_handler.log

            results = AppErrorOccurredMessage(self.app_name, msg, unicode(e), tb, log)
        finally:
            root_logger.removeHandler(logging_handler)

        log_msg = "Results of callback %s: %s" % (callback, results)
        self.logger.info( log_msg )

        return results

First off, before the application's callback is run, a handler is added to the root logger. This will catch all log entries that happen inside the application's handler when it's run. Next, the callback is called in a try block. If the application doesn't throw an error, the function just returns the results after cleaning up the log interceptor. If it does throw an error, all the logs that were being intercepted, along with the error and traceback information, are collected and placed into an AppErrorOccurredMessage. This message is just like any other message, and as such can be returned and expected to be sent out over the wire.

In my closing words about the ApplicationMessageBus, you'll notice that it can also respond to HeartbeatMessage messages. I'm still debating how exactly I want these to work, as I've discovered some problems with the way that I've currently implemented them. Basically, the idea is that I should be able to send out heartbeats, and the applications should send responses, so that I can get an idea of the status of applications. The problem is with an application that is dutifully reacting to messages but is a bit backlogged: it won't respond to a heartbeat for a while, because the heartbeat message is pretty far down in its FIFO queue. In the meantime, I'll think that the application is somehow down. The original goal of the heartbeat was to determine what applications are up RIGHT NOW, so this doesn't really work too well. However, it may prove useful for determining loads on an application. Anyway, like I said, it's still a work in progress.
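The CollectionLogHandler itself is not shown, but a handler that simply collects records is easy to sketch with the standard logging module. This is a guess at its shape, based only on the .log attribute used above. Storing formatted strings rather than raw records is an assumption.

```python
import logging


class CollectionLogHandler(logging.Handler):
    """Keeps every log entry it sees in a plain list."""

    def __init__(self):
        logging.Handler.__init__(self)
        self.log = []

    def emit(self, record):
        # With no formatter set, format() yields just the message text.
        self.log.append(self.format(record))


# Same add/remove dance as in _run_callback above.
root_logger = logging.getLogger()
handler = CollectionLogHandler()
root_logger.addHandler(handler)
try:
    logging.getLogger("my_app").warning("something odd: %s", 42)
finally:
    root_logger.removeHandler(handler)
```

Because the handler sits on the root logger, it sees entries from any logger the application uses, as long as they propagate and pass the configured levels.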

Speaking of works in progress, I've arrived at the fun part, the MainMessageBus...

Main Message Bus

This is an object that is potential main-method material. Anyway, here's how you might use it...


import app1
import app2

transport = ...
main = MainMessageBus('main', transport)

root_config = ...
app1_config = ...
app2_config = ...

# Build app1
subscriptions = app1.make_app1('app1', root_config, app1_config)
amb = ApplicationMessageBus('app1', subscriptions)
main.add_app('app1', amb, subscriptions)

# Build app2
subscriptions = app2.make_app2('app2', root_config, app2_config)
amb = ApplicationMessageBus('app2', subscriptions)
main.add_app('app2', amb, subscriptions)

try:
    main.run()
finally:
    main.close()

ITransport looks like this...

The interesting part is the retrieve method, which doesn't just return a message, but instead returns a context manager for the message. This is so that we can easily write code to handle the message, and correctly set up a way for the transport to finish anything it needs to finish once we're sure that the message handling is done. This is shown in the MainMessageBus...


with self.transport.retrieve(name) as msg:
    if not msg:
        continue

    data = (msg.__class__.__name__, name, pformat(msg.to_dict(), indent=5))
    log_msg = 'Handling %s in %s\n%s' % data
    self.messages_logger.info(log_msg)

    results = app_bus.handle(msg)
    handled_count += 1
    outgoing_messages = results

    map(self.send, outgoing_messages)

If for any reason the system were to go down or an exception were to occur during this context, the context manager would not exit in a way to tell the transport to mark the message as done. This is important to make sure that the message will still be there when the system starts up again to be rerun. You'll see how this works for the AMQPTransport...
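Here is a minimal in-memory sketch of that contract, using contextlib. InMemoryTransport is a hypothetical name and not the LocalTransport or AMQPTransport from the engine. The message is only removed from the queue when the with-block exits cleanly, so a failure leaves it in place to be handled again.

```python
from collections import deque
from contextlib import contextmanager


class InMemoryTransport(object):
    """Toy transport: one FIFO queue per application name."""

    def __init__(self):
        self.queues = {}

    def send(self, name, msg):
        self.queues.setdefault(name, deque()).append(msg)

    @contextmanager
    def retrieve(self, name):
        queue = self.queues.get(name)
        if not queue:
            # Nothing waiting: hand back None, nothing to mark as done.
            yield None
            return
        msg = queue[0]  # peek, but do not remove yet
        yield msg
        # Only reached when the with-block finished without raising.
        queue.popleft()


transport = InMemoryTransport()
transport.send("app1", "first message")

try:
    with transport.retrieve("app1") as msg:
        raise RuntimeError("handler blew up")
except RuntimeError:
    pass
still_queued = list(transport.queues["app1"])

with transport.retrieve("app1") as msg:
    handled = msg
```

After the failed attempt the message is still queued, and the second, successful pass removes it.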

AMQPTransport

First off, you may notice that I'm constantly opening and closing channels. This is because I've experienced problems with sending messages that had no exchanges set up for them. An exchange is set up when the transport is told to set it up, and this happens when the main message bus sees that an application has subscribed to it. It is possible, however, that an application sends out a message that no one has subscribed to (I just haven't written the application to deal with that message yet). Because of the asynchronous nature of AMQP, unless I closed the channel, I might not get the error message until the next time I tried to use the channel. So, instead, I've decided to use a fresh channel in each method. Closing the channel forces the error message from AMQP to be raised in the same method as the code that caused the problem (in my "sending a message that no one subscribed to" problem, this would happen in "send"), where I can then trap and handle it.

You see in the retrieve method the context managers being created. What's nice about this ITransport is that it came about as the refactoring point when I wanted to switch from my contained-in-memory prototype transport to AMQP. I still have the "LocalTransport" hanging around:

Local Transport

SC Engine: Part 4 - YouTube Parsing

- Introduction
- Part One: System Overview
- Part Two: System Overview: Messages and Applications
- Part Three: Screen Scraping
- Part Four: YouTube Parsing
- Part Five: Linking the Video to the Game
- Part Six: Messaging Middleware
- Part Seven: The Console
- Part Eight: The Site

I wasn't the first person to try to do a project like this. There have been others...

Starcraft Gaming, sort of a blog-style list of games that probably uses the youtube api for specific youtube accounts.
lart.no/sc, most likely does the same. They seem to have the ability to determine what league the videos are from. I hear the person who wrote this did it in four hours, just for themselves.
sc.rpls.info, which is also just a list of videos. They seem to be able to sometimes get the races of the games being played.
SC2GG Vod Tracker, hosted at sc2gg, a site whose community is largely based on English commentaries of pro games. Has a (sometimes buggy) ability to determine the players and maps and, like some of the other sites, supports some semblance of a search engine and voting on the quality of games. They also have a nice "recently best voted" section, and seem to be one of the few automated solutions that can link together multi-part commentaries (one game spread over multiple YouTube videos that are meant to be watched in order).

These solutions don't seem to go to the scope of what I was planning, but even with what they were doing, you can see that there were problems getting information from the youtube videos.

First off, they will only look at videos in known youtube accounts. That means they could miss videos by uploaders not on their list.

Secondly, the automated solutions all showed the inability to correctly identify much of the game using just the title and description of the video. To be honest, this is probably an impossible task to do, as most videos don't provide all the details necessary. Many would put the date played, player names, team names (if in a league with teams). The automated solutions that did try to tease out some of that data seemed to have a list of player, team or map names that they would search the title and description for. This resulted in situations where a player named "Great" would show up for a lot of games if the word "great", used in any context, happened to be in the title or description.

To be sure, there is no standard format for a title and description of Starcraft videos for all uploaders to use. Even a single commentator might use a different format in some of their own videos.

Luckily, I didn't need to parse all the data directly from the videos. So long as I had a database of the games played, I needed only to parse just enough of it to be able to link it to the correct game. Then I could use all the data from the schedule database.

So, after viewing a bunch of different "formats" used by the commentators, I came to a few realizations:

Trying to find data in the description is dangerous, because there could be a ton of stuff in the description that doesn't have anything to do with the game.

Finding the date that the game was played goes a LONG way toward making the game easier to find in the database. On the busiest days in Starcraft, you can see around 15 total games played, although it's more likely to be in the 3-8 range. If I can find the date, I might only need one other piece of information (a player name, a game number) to make a link. As such, I wanted to concentrate heavily on making sure that if there's a date in there, I can find it.

Using the published date of the video to try to determine the date played of the match is probably a dead end. Although in most cases, videos are posted within a few days of the match, some videos are posted for epic games that happened years ago.

The video's title typically had data that was more likely to be useful than the video's description, even if there were fewer characters there. This is kind of obvious, because for a commentator who wants their audience to easily find their video, it would behoove them to put that necessary information there.

Although every commentator had their own "formats", most of them put something along the lines of X vs Y, or X v Y, or X versus Y in the title. X and Y could be teams, players, or both.

Knowing this, I set out to make my parser. First off, parsing dates. The parser currently can parse dates in any of these formats, with the ability to add more formats available...


date_formats = [ 
    make_format('%(day)s/%(month)s/%(year)s'),
    make_format('%(month)s/%(day)s/%(year)s'),
    make_format('%(year)s-%(month)s-%(day)s'),
    make_format('%(month)s-%(day)s-%(year)s'),
    make_format('%(day)s %(month)s %(year)s'),
    make_format('%(month)s %(day)s %(year)s'),
    make_format('%(day)s %(month)s, %(year)s'),
    make_format('%(month)s %(day)s, %(year)s'),
    make_format('%(day)s %(month)s , %(year)s'),
    make_format('%(month)s %(day)s , %(year)s'),
    make_format('%(day)s, %(month)s, %(year)s'),
    ]

Where "day" can be any of 1, 01, or 1st, "month" can be any of 1, 01, Jan or January, and "year" can be any of 09 or 2009. The algorithm will look in the title and description, but if it finds results in the title, it uses that. In fact, as of now, the parsing algorithm will not look for any data other than dates in the description, and rely on the title.
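The make_format helper is not shown, so the following is only one plausible way to build it: each template is turned into a regular expression with named groups, and a match is accepted only if it forms a real calendar date. The sub-patterns, the to_date and parse_date helpers, and the rule that two-digit years mean 20xx are all assumptions for the sake of the example. Only a few of the formats are included.

```python
import datetime
import re

MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]

# Sub-patterns for each placeholder (day accepts 1, 01 or 1st).
PARTS = {
    "day": r"(?P<day>\d{1,2})(?:st|nd|rd|th)?",
    "month": r"(?P<month>\d{1,2}|[a-z]{3,9})",
    "year": r"(?P<year>\d{4}|\d{2})",
}


def make_format(template):
    """Turn '%(day)s/%(month)s/%(year)s' into a compiled regex."""
    # re.split with one group alternates: literal, name, literal, name, ...
    pieces = re.split(r"%\((day|month|year)\)s", template)
    pattern = "".join(
        PARTS[piece] if index % 2 else re.escape(piece)
        for index, piece in enumerate(pieces)
    )
    return re.compile(r"\b" + pattern + r"\b", re.IGNORECASE)


def to_date(match):
    """Convert a regex match to a date, or None if it is not a real date."""
    month = match.group("month").lower()
    if month.isdigit():
        month_number = int(month)
    elif month[:3] in MONTHS:
        month_number = MONTHS.index(month[:3]) + 1
    else:
        return None
    year = int(match.group("year"))
    if year < 100:
        year += 2000  # assumption: two-digit years are 20xx
    try:
        return datetime.date(year, month_number, int(match.group("day")))
    except ValueError:
        return None


def parse_date(text, formats):
    """Return the first valid date found by trying each format in order."""
    for fmt in formats:
        for match in fmt.finditer(text):
            date = to_date(match)
            if date:
                return date
    return None


date_formats = [
    make_format('%(day)s/%(month)s/%(year)s'),
    make_format('%(month)s/%(day)s/%(year)s'),
    make_format('%(year)s-%(month)s-%(day)s'),
    make_format('%(day)s %(month)s, %(year)s'),
    make_format('%(month)s %(day)s, %(year)s'),
]
```

Note that the order of the formats decides ambiguous cases: with this list, "1/2/09" is read as the 1st of February, while "5/17/09" falls through to the month-first format because 17 is not a valid month.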

Next is "versus", which is so commonplace I basically rely on it to determine the participants in the match, rather than try to do a full text search for all known players and teams. The regex looks like this:


([a-z0-9\[\]\.\-_\)]+)\s+(?:v|v\.|vs|vs\.|versus)\s+([a-z0-9\[\]\.\-_\)]+)

As you can see, I can support...

X v Y

X v. Y

X vs Y

X vs. Y

X versus Y

...with whitespace between the versus phrase and the participants (as written, the regex requires at least one space on each side). One problem with this method is that if there is a space in the name (typical for teams) you lose some of the data on the team name. However, just one word of the name is often enough to identify a team.
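The regex above can be tried out directly. The sketch below assumes the title is lowercased before matching, since the character classes only contain a-z. The participants helper is a made-up name.

```python
import re

VERSUS = re.compile(
    r"([a-z0-9\[\]\.\-_\)]+)\s+(?:v|v\.|vs|vs\.|versus)\s+([a-z0-9\[\]\.\-_\)]+)"
)


def participants(title):
    """Return every name found on either side of a versus phrase."""
    found = []
    # Assumption: matching is done on the lowercased title.
    for left, right in VERSUS.findall(title.lower()):
        found.extend([left, right])
    return found


print(participants("Bacchus OSL Ro36: Gogo v Luxury Set 1"))
print(participants("MBC v Woonjin: Light v Zero (P2/2)[Single] 5/17/09"))
print(participants("SK Telecom v KT Rolster"))
```

The last call shows the limitation with multi-word team names: only the words directly next to the versus phrase are captured.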

Much of the rest is just regular expressions as well, including the need to look for the video's part number for multi-part videos. Typically, if a game takes longer than 10 minutes (the approximate limit for most youtube videos), the uploader splits it up into two or more videos, and puts in some way of saying that this is part x of the entire series...

Part 1

Part 1 of 2

(1/2)

P1 of 2
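The four styles listed above can be covered with two patterns. This is a sketch rather than the parser's real expressions, and the part_number helper is a hypothetical name. It returns the part and, when it is given, the total number of parts.

```python
import re

PART_PATTERNS = [
    # "Part 1", "Part 1 of 2", "P1 of 2", "P2/2"
    re.compile(r"\b(?:part|p)\s*(\d+)(?:\s*(?:of|/)\s*(\d+))?", re.IGNORECASE),
    # "(1/2)"
    re.compile(r"\((\d+)\s*/\s*(\d+)\)"),
]


def part_number(title):
    """Return (part, total or None), or None if no part marker is found."""
    for pattern in PART_PATTERNS:
        match = pattern.search(title)
        if match:
            total = match.group(2)
            return int(match.group(1)), (int(total) if total else None)
    return None
```

Requiring the parentheses around the bare "1/2" form keeps dates such as 5/17/09 from being mistaken for a part number.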

Tests for this parser typically use actual titles and descriptions that I've found online...


    def test_participants_from_title(self):
        r1 = test_data(title="Bacchus OSL Ro36: Gogo v Luxury Set 1")
        self.assertTrue('gogo' in r1['participants'])
        self.assertTrue('luxury' in r1['participants'])

        r2 = test_data(title="Bacchus OSL Ro36: Gogo vs Luxury Set 1")
        self.assertTrue('gogo' in r2['participants'])
        self.assertTrue('luxury' in r2['participants'])

        r3 = test_data(title="Bacchus OSL Ro36: Gogo v. Luxury Set 1")
        self.assertTrue('gogo' in r3['participants'])
        self.assertTrue('luxury' in r3['participants'])

        r4 = test_data(title="Bacchus OSL Ro36: Gogo vs. Luxury Set 1")
        self.assertTrue('gogo' in r4['participants'])
        self.assertTrue('luxury' in r4['participants'])

        r5 = test_data(title="Bacchus OSL Ro36: Gogo versus Luxury Set 1")
        self.assertTrue('gogo' in r5['participants'])
        self.assertTrue('luxury' in r5['participants'])

        r6 = test_data(title='MBC v Woonjin: Light v Zero (P2/2)[Single] 5/17/09')
        self.assertTrue('mbc' in r6['participants'])
        self.assertTrue('woonjin' in r6['participants'])
        self.assertTrue('light' in r6['participants'])
        self.assertTrue('zero' in r6['participants'])

        r7 = test_data(title='InteR.Mind vs Siz)KaL [15 Apil, 2009] 1set')
        self.assertTrue('inter.mind' in r7['participants'])
        self.assertTrue('siz)kal' in r7['participants'])

        r8 = test_data(title='type-b vs Saint[z-zone]')
        self.assertTrue('type-b' in r8['participants'])
        self.assertTrue('saint[z-zone]' in r8['participants'])

Right now, it is not able to distinguish teams from players. Just having the participant list is fine though, as determining what participants are teams or players will happen somewhere else. No need to have to include a ton of data to make the youtube parser any more complex than it needs to be.

So, in the grand scheme of things, the YouTube parser gets in a message saying that a YouTubeVideo was found, which includes the video id, title and description, and tries to get as much info as possible from it. It then sends out a message with the info it found for someone else to chew on for a while.

Saturday, July 11, 2009

SC Engine: Part 3 - Screen Scraping

In case I haven't said it before: No, I do not speak Korean. Luckily, I didn't really need to. Some stuff could be determined from this page using only intuition (and the google translate add-on :)

In the page's html, the players and maps are anchor elements that link to a page with more info on the player and map, respectively. That url has a unique id for each player and map, so I decided to use these ids throughout my engine. The league name and teams are just plain text, so I grab them as-is when I'm scraping.

After fetching the page using urllib2, I cut out some of the cruft of the page, and load the rest into BeautifulSoup. Testing these objects is done by saving actual examples I get from the site to a file, and using those files as the test. When I find a new style, I save that data to another file and write a test for it. Here's an example...


class ProleagueMatchScraperTests(BaseScraperTestCase):
    """ 
    Original =
    http://www.e-sports.or.kr/schedule/daily01_sche.kea?m_code=sche_12&gDate=20090603&gDvs=T&miniCal=2009-06-01
    """
    TEST_FILE = "proleague/match.html"
    SCRAPER = ProleagueMatchScraper

    def test_teams(self):
        self.assertEquals(self.results['team_one'], u'KTF')
        self.assertEquals(self.results['team_two'], u'MBC게임')
        self.assertEquals(self.results['winner'], None)
        self.assertEquals(self.results['winner_score'], None)
        self.assertEquals(self.results['loser_score'], None)

    def test_game(self):
        self.assertEquals(len(self.results['games']), 5)

        game = self.results['games'][0]
        self.assertEquals(game['player_one'], 988)
        self.assertEquals(game['player_two'], 851) 
        self.assertEquals(game['map'], 1193)
        self.assertEquals(game['winner'], None)

    def test_ace_match(self):
        game = self.results['games'][4]

        self.assertEquals(game['player_one'], None)
        self.assertEquals(game['player_two'], None)
        self.assertEquals(game['map'], 1207)
        self.assertEquals(game['winner'], None)

    def test_stage_info(self):
        self.assertEquals(self.results['stage_path'], ['Week 1', 'Day 5'])

Each test class has TEST_FILE and SCRAPER attributes that are used by BaseScraperTestCase to run the entire scrape in setUp. TEST_FILE is the name of the file that has the html I pulled from the web site for that test, while SCRAPER is the class that will actually do the scraping. Thus, I can add new tests very easily.
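For readers wondering what such a base class might involve, here is a minimal sketch. Everything in it is an assumption made for illustration: the DATA_DIR attribute, the make_soup hook, and UTF-8 decoding are stand-ins, and the real base class presumably builds a BeautifulSoup object before handing it to the scraper.

```python
import os
import unittest


class BaseScraperTestCase(unittest.TestCase):
    # Hypothetical directory holding the saved pages.
    DATA_DIR = os.path.join("tests", "data")
    TEST_FILE = None   # file name of the saved page, relative to DATA_DIR
    SCRAPER = None     # class taking a soup and exposing a scrape() method

    def make_soup(self, html):
        # Assumption: the real code would return a BeautifulSoup object here.
        # The sketch passes the raw html through so it has no dependencies.
        return html

    def setUp(self):
        if self.TEST_FILE is None or self.SCRAPER is None:
            self.skipTest("base class, nothing to scrape")
        path = os.path.join(self.DATA_DIR, self.TEST_FILE)
        with open(path, "rb") as f:
            # Decoding as UTF-8 is an assumption about how pages were saved.
            html = f.read().decode("utf-8")
        # Run the whole scrape once; test methods only inspect self.results.
        self.results = self.SCRAPER(self.make_soup(html)).scrape()
```

With this shape, a new test class only needs to set the two attributes and assert against self.results.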

In addition to scraping schedules, I also need to scrape the player page for the players to find out stuff like their names and races. I'll use that as an example for what the actual scraper object looks like, because it's a bit simpler than the schedule scraper. The page looks as follows:

http://www.e-sports.or.kr/teams/player1.kea?m_code=team_24&pGame=1&pCode=1248

Yeah, the guy's name is "Great". :P


class PlayerScraper(object):
    NAME_PATH = [0, 1, 3, 1, 3, 1, 0, 15, 1, 0, 9, 7]
    RACE_PATH = [0, 1, 3, 1, 3, 1, 0, 15, 1, 0, 7, 7]

    def __init__(self, soup):
        self.soup = soup

    def scrape(self):
        # If the name is not there, then we have a blank page, so it's not a
        # legit player.
        pre_name_elem = utils.dive_into_soup(self.soup, self.NAME_PATH)
        if len(pre_name_elem.contents) == 0:
            return None

        name = unicode(pre_name_elem.contents[0]).strip()

        pre_race_elem = utils.dive_into_soup(self.soup, self.RACE_PATH)
        if len(pre_race_elem.contents) == 0:
            race = None
        else:
            race = unicode(pre_race_elem.contents[0]).lower()

        return {
            'name' : name,
            'race' : race,
            'aliases' : []
        }

The "soup" constructor argument is the BeautifulSoup object. I've found the easiest way to get at the data I want is to construct a "path" to the html element. The utils.dive_into_soup function looks like this:


def dive_into_soup(soup, content_navigation):
    s = soup

    for index, content in enumerate(content_navigation):
        try:
            s = s.contents[content]
        except IndexError as e:
            raise DiveError(e, content_navigation, index)

    return s

So, basically, if the "path" is [0,3,2,4], then starting from the root element, I look at the 0th child element, then at that element's 3rd child element, etc., until I hit the bottom, then return the element I'm at. To make the creation of these "dive codes" easier, I've written a hacky little function to create them for me based on text that I specify.
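To make the idea concrete without needing BeautifulSoup installed, the sketch below runs the same dive over a tiny stand-in tree. The Node class, the DiveError definition and the find_path helper are all assumptions written for this example. In particular, find_path is only a guess at what a function that builds dive paths from a piece of text could look like.

```python
class DiveError(Exception):
    """Raised when a dive path runs off the tree (hypothetical definition)."""

    def __init__(self, cause, path, failed_at):
        Exception.__init__(
            self, "path %r failed at step %d: %s" % (path, failed_at, cause))
        self.path = path
        self.failed_at = failed_at


class Node(object):
    """Stand-in for a BeautifulSoup element: the dive only needs .contents."""

    def __init__(self, *contents):
        self.contents = list(contents)


def dive_into_soup(soup, content_navigation):
    s = soup
    for index, content in enumerate(content_navigation):
        try:
            s = s.contents[content]
        except IndexError as e:
            raise DiveError(e, content_navigation, index)
    return s


def find_path(node, text, path=()):
    """Depth-first search for a string child equal to `text`.

    Returns the dive path to the element that holds the string, or None.
    """
    for i, child in enumerate(getattr(node, "contents", [])):
        if child == text:
            return list(path)
        found = find_path(child, text, path + (i,))
        if found is not None:
            return found
    return None


tree = Node(
    Node("ignored"),
    Node(Node("Name:"), Node("Great")),
)
path = find_path(tree, "Great")
print(path)
print(dive_into_soup(tree, path).contents[0])
```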

Honestly, the idea of a "dive path" is kind of a hack. I'd much rather have the page, and be able to just say, "give me the element next to #player_name". Unfortunately, the entire kespa site uses table layouts, and all the ids and classes are pretty much for layout purposes as well, so this 'dive' approach seems to be a better option.

SC Engine: Part 1 - System Overview

Every once in a while, a ScheduleFetchRequested message is sent out.

The schedule fetcher application is listening for the ScheduleFetchRequested message, and upon receiving it, goes out and fetches the schedule for the date detailed in the message. This involves running code to go to the website and do some scraping. It then sends out a ScheduleFetchAnnouncement message with the results of what it's found. Some stuff it can determine immediately (player ids, map ids), whereas other stuff it might not be able to (team names, and the league and stage that the game is played in).

Another application gets this ScheduleFetchAnnouncement, and will do lookups on the information we couldn't find directly from the web page (such as the team ids or league name). It gathers the results and spits out a ScheduleParseAnnouncement.

Another application that actually stores the schedules onto disk receives the ScheduleParseAnnouncement message, and goes ahead and saves them. It doesn't need to send out any message.

So why three apps? Why not have all this done in one app? Or the fetch and parsing in one app, and the storing in another?

Although the merits are debatable, each application would be more complex if we collapsed them down to two or one. The trade-off is that with three apps, the system as a whole is more complex.

More importantly, by separating the steps by messages, you can later make other applications that can get at the data in the middle of the flow. For example, we can create an UnknownData application that also listens for the ScheduleParseAnnouncement messages, and if it sees any players or maps that we've never heard of before, can send out appropriate messages to try to fetch information on those players and maps. Adding this new functionality can be done without touching existing components.

Of course, you need to use your own discretion over how much you separate your apps. There really can't be a hard rule, just experience and preference to help here.

Now, let's look at what happens regarding youtube videos:

Every once in a while, a YouTubeFetchRequested message is sent out.

An application hears this message and runs a query on youtube. Each resulting video's id, title, description, and author are sent in a YouTubeFetchAnnouncement message.

Another application is listening for these YouTubeFetchAnnouncement messages. For each one, it runs the title and description through a parser to try to grab as much information as possible. Possible info might be team names, player names, game number (game X of a set of 5 games), date played, etc. Sometimes, the parse might not find enough data to even assume that it's actually a Starcraft commentary. In that case, the application doesn't send out a message about it. However, let's assume it gets at least SOME data from the parse. It puts whatever data it finds into a YouTubeParseAnnouncement message.

The YouTubeParseAnnouncement message is sent to the application we talked about earlier that stored the scheduling data. That application will run the youtube information against the schedules stored and look for possible links. If it can find one, it sends out a VideoGameLinkAnnouncement. If it can't find any results, or it could be one of multiple results, the app will instead send out a YouTubePartialParseAnnouncement.

From here, the VideoGameLinkAnnouncement messages eventually go to the web site, whereas the "partial parses", as I call them, are collected into another application to be sorted through manually by me to see what's up. By manually going through these results, I can come up with changes to the youtube parsing code, or add data to help the search engine in future tries. I then have the option to retry the parse again, or just manually send out the VideoGameLinkAnnouncement myself with the correct data.

There are other apps and messages that I haven't spoken of, such as fetching and updating data on players and maps, the applications that trigger periodic fetches, and what happens to all those "partial parse" messages. But I think this gives a good idea of how things would tend to work.

So, that's the overview of the system. For the most part, it's a bunch of small apps that share data through passing messages. There is no real central database, as each app is responsible for saving the data in the way they feel suitable.

It's a completely different style than the typical system where all the data is in a central database and all parts of the app work on the same data. I've read about this style of programming on blogs like those of Udi Dahan and Ayende. It's my first time using such a system, and I'm interested in learning about its strengths/pitfalls.

SC Engine: Part 2 - System Overview: Messages and Applications

Messages

A message is a simple entity meant to store data that is being sent from one part of the system to another. Here is what a message might look like...


class PlayerFetchRequested(BaseMessage):
     def __init__(self, player_id):
         # int
         self.player_id = player_id

For those interested, you can see all the messages in the source.

As you can see, there's not much meat here. It's just a class that inherits from BaseMessage, with a constructor that takes the data. It could've just been implemented as a tuple with a string for the message name and a dictionary for the data, but it was nice to be able to have an error thrown if I don't include all the arguments. BaseMessage also gives the messages a few helpful methods, and is used by the messaging middleware I wrote (which will be discussed in another part).

See all the messages for SC Engine

Applications

Messages are sent between applications. An "application" is a python object that defines a number of methods which respond to incoming messages and optionally send out resulting messages. The point is that each application does one job, and sends out messages with the results of what it's done. Where the messages go or who uses them is of no concern to the application. Nor does it care where the messages it receives have come from. In this sense, the entire system is implicitly "pub/sub", or "publish and subscribe".

Typically, an application consists of two parts: a builder, and the application itself. Let's look at an application first:

Player Fetch App

This application takes into its constructor a callable (in the end, this callable turns out to be a function that launches the actual fetching and scraping of a web page containing the player data). It is using constructor dependency injection so that I can replace lookup with a mock and easily test the app.

There is one method here that handles incoming messages, and that is player_fetch_requested. This method is called when the application receives a PlayerFetchRequested message. An application can handle more than one type of message; just add more methods, one for each handling operation. The fact that the name of the method is similar to the message name is just a convention, and doesn't have any significance other than to be self-documenting.

The method takes in one argument: the message to handle. A method that handles messages like this can then return a message or list of messages of its own, which is how it sends out messages. The point is that this application is run on top of some messaging middleware that passes the message to the correct method, gets the results of the method (which are always messages) and then sends out the resulting messages.

In this case, the lookup is done, and depending on the results the method returns one of two messages: either a message with the info on the player, or a message saying that the player doesn't exist.
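Going only by the description above and the tests further down, the application might look roughly like this. It is a reconstruction for illustration, not the project's source: the namedtuple message stand-ins and the assumption that lookup is called with the player id are guesses (the real messages inherit from BaseMessage).

```python
from collections import namedtuple

# Stand-ins for the real message classes.
PlayerFetchRequested = namedtuple("PlayerFetchRequested", "player_id")
PlayerFetchAnnouncement = namedtuple(
    "PlayerFetchAnnouncement", "player_id name race")
PlayerFetchNonExistantPlayer = namedtuple(
    "PlayerFetchNonExistantPlayer", "player_id")


class PlayerFetchApp(object):
    def __init__(self, lookup):
        # lookup: callable taking a player id and returning a dict with
        # 'name' and 'race', or None when there is no such player.
        self.lookup = lookup

    def player_fetch_requested(self, msg):
        info = self.lookup(msg.player_id)
        if info is None:
            return PlayerFetchNonExistantPlayer(msg.player_id)
        return PlayerFetchAnnouncement(
            msg.player_id, info['name'], info['race'])


# Because the dependency is injected, a plain function can stand in for
# the real scraper.
app = PlayerFetchApp(lambda player_id: {'name': 'test', 'race': 'zerg'})
print(app.player_fetch_requested(PlayerFetchRequested(1)))
```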

Now, the application itself is useless, since there's no way of telling the middleware that will be using it what messages to map to what methods. I experimented with naming conventions on the methods and decorators, but eventually settled on a simpler solution: a builder function that does this mapping, as well as giving the application object any dependencies...


def make_app(app_name, root_config, config):
    lookup = get_player_info
    app = PlayerFetchApp(lookup)

    return {
        msgs.PlayerFetchRequested: app.player_fetch_requested,
    }

Originally, this method was parameterless, but over time has grown to three parameters. app_name is the name as determined by the message bus for the application. The application shouldn't hard-code this or figure it out by itself because it might have parts prepended or appended to it by other parts of the system. The main reason an application would need this is to use it as the name of the logger or data filenames.

The root_config and config parameters are dictionaries containing items that might be run-time options (typically stored in ini files and parsed by a lower part of the engine). root_config typically contains configuration information for all applications (such as the directory to store any data files), while config stores information specific to that application (an example might be the exact url to use when fetching, although I've decided to hardcode this right now).

The application builder uses the arguments to create the application object. It then returns a dictionary that maps the message type to the method on the app that should handle that message. Typically, this method is simple enough that it doesn't need testing.

Notice that the application is a POPO (Plain Ol' Python Object). It does not have any dependencies on a messaging system. It just takes in messages, and returns resulting messages. The actual job of sending and receiving those messages on any sort of message bus is up to the object which calls the application's methods.

This allowed me to easily set up prototypes in the early stages by using a hacked together message bus that would just repeatedly send a message to an app's handler, get the resulting messages, then send those out as well. In fact, I pretty much used this system through most of development, not setting up RabbitMQ or anything like that until much later.
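A throwaway in-process bus of that kind can be very small. The sketch below is an assumption about the general shape, not the project's code: run_bus, the max_steps guard, and the toy message classes and handlers are all made up for the example. It takes the dictionaries returned by the builders and keeps delivering messages until the queue is empty.

```python
from collections import deque


def run_bus(handler_maps, initial_messages, max_steps=1000):
    """Deliver messages in-process until none are left.

    handler_maps: one dict per application, mapping a message class to the
    callable that handles it (the shape a builder function returns).
    Returns every message that was delivered, in order.
    """
    queue = deque(initial_messages)
    delivered = []
    while queue and len(delivered) < max_steps:
        msg = queue.popleft()
        delivered.append(msg)
        for handlers in handler_maps:
            handler = handlers.get(type(msg))
            if handler is None:
                continue
            results = handler(msg)
            if results is None:
                continue  # the handler had nothing to send
            if not isinstance(results, list):
                results = [results]  # a single message was returned
            queue.extend(results)
    return delivered


# Toy messages and handlers, only here to exercise the bus.
class ScheduleFetchRequested(object):
    def __init__(self, date):
        self.date = date


class ScheduleFetchAnnouncement(object):
    def __init__(self, date, games):
        self.date = date
        self.games = games


def fetch(msg):
    return ScheduleFetchAnnouncement(msg.date, games=[])


stored = []


def store(msg):
    stored.append(msg.date)  # sends nothing back out


delivered = run_bus(
    [{ScheduleFetchRequested: fetch}, {ScheduleFetchAnnouncement: store}],
    [ScheduleFetchRequested("2009-06-03")],
)
print([type(m).__name__ for m in delivered])
```

The max_steps guard is there so that two handlers that keep answering each other cannot loop forever during prototyping.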

Also, because it's a POPO, this application is extremely simple to test. Here is what the test looks like:


class TestPlayerFetchApp(AppTestBase):
    def setUp(self):
        self.lookup = Mock()
        self.sut = PlayerFetchApp(self.lookup)

    def test_fetches_player(self):
        self.lookup.return_value = {'name' : 'test', 'race' : 'zerg'}

        input = msgs.PlayerFetchRequested(1)
        expected = msgs.PlayerFetchAnnouncement(1, 'test', 'zerg')

        result = self.sut.player_fetch_requested(input)

        self.assertContainsMsg(result, expected)

    def test_fetches_non_existant_player(self):
        self.lookup.return_value = None

        input = msgs.PlayerFetchRequested(1)
        expected = msgs.PlayerFetchNonExistantPlayer(1)

        result = self.sut.player_fetch_requested(input)

        self.assertContainsMsg(result, expected)

assertContainsMsg is a helper function that covers all the cases of an application returning a message (returning the message itself, or returning a list of messages with that message in the list). Also, the messages are easy to spot because the BaseMessage object implements value equality. This means...


>>> a = PlayerFetchAnnouncement(1, 'test', 'zerg')
>>> b = PlayerFetchAnnouncement(1, 'test', 'zerg')
>>> c = PlayerFetchAnnouncement(2, 'test2', 'zerg')
>>> a == b
True
>>> a == c
False

Even though objects a and b are different objects, BaseMessage overrides the equality operators to ensure that messages with the same data are equal. The main point is to make testing much easier: just make the message you're expecting, rather than having to check all the individual attributes yourself.
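One way value equality like this can be implemented is by comparing the class and the instance attributes. The sketch below is an assumption about how BaseMessage and the assertContainsMsg check could work, not the actual code. contains_msg is a hypothetical plain-function version of the helper.

```python
class BaseMessage(object):
    def __eq__(self, other):
        # Equal when both are the same message type carrying the same data.
        return type(self) is type(other) and vars(self) == vars(other)

    def __ne__(self, other):
        return not self == other

    def __repr__(self):
        fields = ", ".join(
            "%s=%r" % item for item in sorted(vars(self).items()))
        return "%s(%s)" % (type(self).__name__, fields)


class PlayerFetchAnnouncement(BaseMessage):
    def __init__(self, player_id, name, race):
        self.player_id = player_id
        self.name = name
        self.race = race


def contains_msg(result, expected):
    """True if result is the expected message or a list containing it."""
    if isinstance(result, list):
        return expected in result
    return result == expected


a = PlayerFetchAnnouncement(1, 'test', 'zerg')
b = PlayerFetchAnnouncement(1, 'test', 'zerg')
c = PlayerFetchAnnouncement(2, 'test2', 'zerg')
print(contains_msg([c, a], b))
```

Comparing the type as well as the data keeps two different message classes that happen to carry the same fields from being treated as equal.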

That's a pretty simple overview of what an individual application might look like. Typically, your application is either a simple application in itself, just storing and sending data, or is a front-end to more advanced functionality such as this fetch application. It's easy to test, and while it is built with messaging in mind, it has no dependencies on any messaging framework.

Friday, July 10, 2009

Starcraft Professional VOD Search Engine: Introduction

This is an introduction to a series of posts designed to give you a tour of the project I've been working on for the past month or so, "SC Engine".

The entire series is linked below:

- Introduction
- Part One: System Overview
- Part Two: System Overview: Messages and Applications
- Part Three: Screen Scraping
- Part Four: YouTube Parsing
- Part Five: Linking the Video to the Game
- Part Six: Messaging Middleware
- Part Seven: The Console
- Part Eight: The Site

SC Engine is a series of programs and tools that indexes Professional Starcraft videos found on the net into one easy-to-use web site. This first post gives an idea of what the project is expected to do, with the rest of the posts going into the implementation.

For those not aware, Starcraft is a PC game that was created by Blizzard and originally released in 1998. It has withstood the test of time, and in South Korea even has a following that is as popular as Soccer or Pro Wrestling here in the US, if not greater. Starcraft in South Korea comes complete with its own professional leagues. E-Sports as a whole in South Korea has really taken off, gaining great sponsors and recognition. Starcraft is the game at the forefront of this new form of entertainment, drawing the largest crowds and best ratings.

Games are often cast on TV with commentators just like you'd see in any televised sporting event. Games of the more popular leagues are often uploaded to sites like YouTube or the league's own website for viewing, and often referred to as "VODS" ("Videos On Demand").

Some fans, wishing to view the games with commentators speaking their own language, began dubbing their own commentary over the Korean ones. Here's an example of a game in its original version, and here's an example with an English commentator, Klazart.

SC Engine started coming together for a number of different reasons:

I had been out of work since January (although I took some time until April just to relax, living off of savings) and was unsure of how well I could express my ability as a python/web developer (much of what I did at my work was in .NET, and even the later python stuff I wrote before being laid off never saw its way into production). Thus, I was looking for a project to show the world in code what I probably could not easily express in words.

I wanted to work on SOMETHING. My job had me starting to burn out from coding, working at some pretty tedious objectives. After my break, I wanted something new that was interesting, not the same old "django blog" application. Asynchronous messaging via AMQP seemed very interesting.

I enjoyed watching Professional Starcraft VODs, especially with the English commentators, but found it a pain trying to find games, then keep track of how all the games I was watching related to each other. A person who only watched videos from one or two commentators could only watch a portion of the games, leaving them with gaps in their knowledge of how the league is progressing. Since all uploaders and commentators do this in their spare time, they can only cover so many videos.

Watching games on YouTube, there was a chance of "spoiling" the results of the game. Games can take anywhere from a few minutes to an hour or more to play. Knowing beforehand the length of the game gives clues as to what will happen in the game as you're watching it. Also, because many games are in a "best of x" series, a commentator might upload only the games played, letting a viewer easily determine the winner of a match through logic.

Many uploaders will upload "anti-spoiler" videos, which are just small videos that make it seem like the next game is played, but when you view one it just contains a quick message that the match was already decided. I feel like this isn't exactly a great solution: it's more work for the uploaders, and not all uploaders do it anyway.

The result was to create a website that tried to solve these problems.

In its first iteration, the home page would show you all of the leagues, and you would select matches in the league to watch. Then, you could watch each game in that match. Because the entire navigation of the site treats games as first-class citizens (as opposed to watching on YouTube, where the videos are the first-class citizens), you would have a visual guide as to where this game stands in terms of the league (knowing a game took place just before the end of the season, or in the playoffs, or what part of the playoffs, etc.). When actually choosing the game, you would get to pick which uploader's version of the game you want to watch. This allowed you to view more games than you might otherwise have had you just watched the videos of one or a few commentators. Also, since the Starcraft community (a few members, specifically) is pretty amazing at uploading almost every game played in large events, even if you can't find a commentary in your language you'll probably still have the option to view the original game, albeit with Korean commentary.

Also, I felt it'd be nice to see some other information on the game's page, such as:

The names of the players in the game.

The race that each player plays the game with. In Starcraft, there are three possible races to select from, each with its own unique traits.

The kespa site occasionally puts the "starting position" of the players on their site, which lets you know where the player will be when you are watching the game.

An image of the "map". Each game takes place on a different map, each with different layouts and base locations to add variety to the game. Being able to see a picture of the entire map as you watch a game helps immensely with understanding how the game is developing, especially if you're not familiar with the particular map.

Right now, work is complete on the first iteration. The site still needs a complete redesign (I just hacked some css) and will probably change a ton. You can see what it looks like here:

http://sc.markhildreth.webfactional.com/

There isn't much in terms of data, as I'm still working on the hardware that the engine will run on, but it gives you an idea of the type of things that you might be able to expect.

In any case, that's the background on the project. My remaining posts will dive into the aspects of software implementation.

Tuesday, July 7, 2009

Python virtual console race condition

Here's a fun drinking game you can play. Every time I introduce a bug because I forget to close() something, take a drink.

My last entry talked about how paramiko was giving me fits because I wasn't closing the connection. Today, I had another problem, this time using amqp-lib. I had two virtual consoles open, one with the process that would read from the queue in a tight loop and print out what it was doing. Another console would run a script to send messages to the queue. I started noticing a problem where the script that would send 9 messages would complete without error, but if I switched over to the first terminal, I saw that only some of them would actually be acted on. Sometimes it was one, sometimes it was four or five, sometimes it was all of them. I opened up a third console to look at the queues in RabbitMQ and saw that there were no more messages left. Either the reader was picking the messages off the queue and not acting on them correctly, or they were never being sent to the queue correctly.

Sometimes I could run the script a few times in a row, and they would all pass. Finally, I saw a way to reproduce it... before the script sending the messages ended, switch to the other terminal.

Here's what I imagine was happening...

When I switched from one console to the other on my (slow, old, reserved for use as a server) laptop, the slow ass virtual console didn't have to print out the printlines in the script I was debugging with. Because the script ran faster, it was able to finish before all the data was effectively sent from the socket to RabbitMQ. When the script finished, the garbage collector forced the connection closed, even if it was still working. By adding in the code to manually close the connection, I provided the connection those extra split seconds necessary to really close with everything complete.
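The general pattern is easy to show with a toy stand-in. The sketch below does not use amqp-lib at all: BufferedConnection is a made-up class whose send() only buffers, so nothing reaches the "wire" until close() is called. Wrapping it in contextlib.closing guarantees the close happens even if sending raises.

```python
import contextlib


class BufferedConnection(object):
    """Toy connection: send() buffers, close() flushes to the wire."""

    def __init__(self, wire):
        self.wire = wire      # list standing in for the remote end
        self.pending = []
        self.closed = False

    def send(self, body):
        if self.closed:
            raise ValueError("connection is closed")
        self.pending.append(body)

    def close(self):
        if not self.closed:
            self.wire.extend(self.pending)  # flush whatever is still buffered
            self.pending = []
            self.closed = True


def send_all(wire, bodies):
    # closing() calls conn.close() on the way out of the with block.
    with contextlib.closing(BufferedConnection(wire)) as conn:
        for body in bodies:
            conn.send(body)


wire = []
send_all(wire, ["msg %d" % i for i in range(9)])
print(len(wire))
```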

Sunday, July 5, 2009

Paramiko hangs.

I was looking at paramiko for use in scripts and an application I was writing where I needed to do some secure ftp. Seems like I should've recognized this, but I was caught by surprise by how it handles closing the connection when python exits. I assumed the connection would close just like an open file (so that upon exiting the python shell, it would close). However, I found that the shell hung. At first I thought it was a bug in the library, but really it's just that paramiko needs you to explicitly close the connection.


import tempfile

import paramiko


def upload_file(host, user, key, remote_filename, data):
    t = paramiko.Transport(host)
    try:
        t.start_client()
        t.auth_publickey(user, key)
        t.open_session()

        sftp = paramiko.SFTPClient.from_transport(t)

        with tempfile.NamedTemporaryFile() as f:
            f.write(data)
            f.flush()
            sftp.put(f.name, remote_filename)
    finally:
        t.close()

Another one of those situations where the answer should be obvious...