Thursday, November 12, 2009

Joining multiple mp4 files

I'm not going to claim to know anything about video codecs and such, so I'm pretty much at the mercy of google when it comes to modifying video files. Having found part of the answer but not all of it, I figured I'd post the exact command line here for the lazy like me...

A forum pointed me to the software MP4Box, which I was able to install via the gpac bundle. From there, it was simply a matter of...


MP4Box a.mp4 -cat b.mp4 -out result.mp4


a.mp4 is the input; -cat tells MP4Box to concatenate b.mp4 onto the end of it.
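
MP4Box accepts multiple -cat options, so joining more than two files should just be a matter of chaining them (I've only needed the two-file case myself, so treat this as untested):


MP4Box a.mp4 -cat b.mp4 -cat c.mp4 -out result.mp4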

This worked well enough for me, but since the video codec world is strange and mysterious, your results may vary.

Monday, September 7, 2009

Running SQLAlchemy scripts off of Pylons' paste configurations

I've come across a situation where I want to be able to use cron to launch scripts that automate some tasks (fetch info from 3rd party, scramble their data into my database format, write to my database). I want to be able to use the models I've already created in SQLAlchemy, along with the configuration of SQLAlchemy in my paste configuration files (development.ini, production.ini) in these scripts.

I don't need all of the bells and whistles of pylons, just the SQLAlchemy stuff. Luckily, there are some convenience functions that seem to have been made to deal with a situation just like this one. Here's the code:


import sys
from paste.deploy import appconfig
from sqlalchemy import engine_from_config
from myapp.model import init_model

def setup_environment():
    if len(sys.argv) != 2:
        print 'Usage: Need to specify config file.'
        sys.exit(1)

    config_filename = sys.argv[1]
    config = appconfig('config:%s' % config_filename, relative_to='.')
    engine = engine_from_config(config)
    init_model(engine)


Of course, replace "myapp" with the name of your application.

Getting the configuration filename from the command-line arguments could of course be put somewhere else, so I'll just skip talking about that. The two main functions are appconfig and engine_from_config.

appconfig is from paste.deploy, which reads the configuration file and returns a dictionary-like object. engine_from_config is from SQLAlchemy and takes a dictionary-like object, retrieves SQLAlchemy-specific configurations, and uses the configurations to create an engine object.
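
With setup_environment() defined as above, a cron-launched script just calls it before touching the models. A minimal sketch of such a script (Session and NewsItem are stand-ins for whatever your own model module actually exports):


# Hypothetical cron entry point; Session and NewsItem stand in for
# your own model objects.
from myapp.model import Session, NewsItem

def main():
    setup_environment()  # as defined above
    item = NewsItem()
    item.title = u'Fetched from a third party'
    Session.add(item)
    Session.commit()

if __name__ == '__main__':
    main()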

Friday, August 28, 2009

Praise for IBM

Note: I am not in ANY way affiliated with IBM or Lenovo. This is purely the tale of a satisfied customer.

Pretty much everyone has written about their frustration with customer service X or company Y. I'm here to tell a different tale, about how a company did things RIGHT.

It started off a few months ago when I diagnosed that the ATI video card in my Thinkpad T400 was having problems and needed to be replaced. Since this computer has two video cards that you can switch between (one for battery life, one for performance), I was able to just use the other video card. I didn't feel like being without my computer, so rather than get it fixed, I held off and only used the integrated card. This meant I lost dual-monitor support, but I wasn't dead in the water.

I finally decided the time was right to send it in, so I called up the technical support number. I was on hold for less than five minutes. After giving the typical info, I described in a few sentences what was wrong, including what I did to troubleshoot and diagnose the problem. At this point I was ready for the guy on the other end to start his laundry list of things that they need to check.

"Disconnect all external devices..."
"Let's try restarting the computer..."
"Let's reinstall Windows..."

I've had screens burn out and would need to go through this list of steps before they would allow me to send it in for hardware repairs. Especially since I'm not a business user, it just seemed like other companies paid no special attention to keeping this process simple.

IBM did it right. The person listened to me, instantly recognized that I knew what I was talking about, and first checked to see if he could just send me the replacement card that I could reinstall (with no harm to my existing warranty). Turns out that wasn't an option, but he did immediately determine that the computer needed hardware repair, and started the process of getting my info for where to ship the box. No fuss about checking for software issues, I already did that, I know what needs to be done; let's just do this. It was wonderful.

That was on Sunday night. The box arrived on Tuesday afternoon, and I promptly sent the laptop out in it. It's now Friday, 10am, and my computer is back with the problem fixed.

No doubt, my next PC will be a Thinkpad.

Monday, August 10, 2009

Reusable SQLAlchemy Models

Recently looking at django, I took a liking to their idea of reusable apps. One thing common to most of these apps was some model they would include, plus a function you would call to automatically set up a relation between it and one of your models, giving you a "reusable model". An example would be any object that you'd like to have comments on automatically getting a Comment model, with the relationships to your custom-built object set up for you.

Out of mere curiosity, I wondered how difficult it would be to create a "reusable model" in SQLAlchemy. My end goal was to be able to do something like this...


@commentable
class Post(Base):
    __tablename__ = 'posts'
    id = sa.Column(sa.Integer, primary_key=True)
    text = sa.Column(sa.String)


A class that was decorated as commentable would automatically have a relation defined to contain multiple comments. After trying out a few ideas, I wrote a test for what I would want the end result to look like...


class TestModels(SATestCase):
    def test_make_comment(self):
        p = Post()
        p.text = 'SQLAlchemy is amazing!'

        text = 'I agree!'
        name = 'Mark'
        url = 'http://www.sqlalchemy.org/'

        c = Post.Comment()
        c.text = text
        c.name = name
        c.url = url
        p.add_comment(c)
        Session.add(p)

        p = self.reload(p)

        self.assertEquals(len(p.comments), 1)
        c = p.comments[0]
        self.assertEquals(c.text, text)
        self.assertEquals(c.name, name)
        self.assertEquals(c.url, url)



(As a little note, self.reload() is a helper function that forces a complete reload of the session and returns the objects you passed in, after the session is opened again.)

First off, a little bit about how the table structure would work. AFAIK, django stores comments in one large table and has what you might call a discriminator field to determine the object type that the row relates to. In my case, every type of commentable object (Post, NewsItem, etc.) would instead get its own comment class (PostComment, NewsItemComment) as well as its own table (post_comments, news_items_comments). No real reason to do it this way, I just thought it'd be easier.

In the end, it was actually pretty easy. Here is what the initial results look like...


# Imports this snippet relies on.
import sqlalchemy as sa
from sqlalchemy.orm import class_mapper, mapper, relation

class BaseComment(object):
    pass

def build_comment_model(clazz):
    class_table_name = str(class_mapper(clazz).local_table)
    metadata = clazz.metadata

    comment_class_name = clazz.__name__ + 'Comment'
    comment_class = type(comment_class_name, (BaseComment,), {})

    comment_table_name = class_table_name + '_comments'
    comment_table = sa.Table(comment_table_name, metadata,
        sa.Column('id', sa.Integer, primary_key=True),
        sa.Column(class_table_name + '_id',
                  sa.Integer,
                  sa.ForeignKey(class_table_name + '.id')),
        sa.Column('text', sa.String),
        sa.Column('name', sa.String(100)),
        sa.Column('url', sa.String(255)),
    )

    mapper(comment_class, comment_table)

    return comment_class, comment_table

def commentable(clazz):
    comment_class, comment_table = build_comment_model(clazz)

    clazz.Comment = comment_class
    setattr(clazz, 'comments', relation(comment_class))

    def add_comment(self, comment):
        self.comments.append(comment)

    setattr(clazz, 'add_comment', add_comment)

    return clazz


First off, we have the BaseComment class. This is just an empty class, but you could easily see there being logic on here.

In the build_comment_model() function, you see where the creation of the table's metadata takes place. It'll first create a new class that will be used for the new comment model (in the case of a Post model being commentable, it's creating a PostComment class that inherits from BaseComment). Using SQLAlchemy information that we find on the mapper of the class we're decorating, we can determine things such as the name of the table so that we can create our own table (the Post model's table name is "posts", so we'll create a "posts_comments" table).

Finally, the commentable() function finishes by setting the Comment attribute on the model class we're decorating, as well as adding the relation and a class method. The Comment attribute allows us to easily get at the new class that we created, the relation allows us to easily work with the items, and the add_comment is an example of how we can also add extra methods to the model.

Currently, there are some pitfalls:

1.) It assumes that the class has been mapped before running. Because mapping of a class must happen after the class is created, you couldn't use commentable() as a decorator for a non-declarative style mapping, since the decorator would run before the mapper exists. The solution would be to just call the commentable function after the class is mapped (see the sketch after this list). I guess it's not really a pitfall per se, although I haven't actually tried it, so I'm just guessing that it works :)

2.) Right now, it assumes that the foreign key for the Comment object is a single column corresponding to a column named "id" on whatever model it's decorating. I still need to work on this, since you might have a different name, a different primary key type, or even a multi-column primary key. Basically, when the new comment table's metadata is being laid out, it needs to look at the decorated model's primary keys to determine how to construct its foreign keys, rather than jumping to the conclusions that it does now.
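
For the first pitfall, a sketch of what calling commentable() after a classical mapping might look like (untested, and posts_table is assumed to be an ordinary Table for the Post class):


# Hypothetical non-declarative usage: apply commentable() only after
# the class has been mapped, so class_mapper() can find the mapper.
class Post(object):
    pass

mapper(Post, posts_table)
Post = commentable(Post)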

Sunday, August 9, 2009

A hint for those new to Django.

After you're done with the django tutorial, you'll probably want to start on your own project. Here's a hint that helped me, coming from a Pylons background, with writing that first project and not getting fed up with it (especially if you're biased and are ready to throw it all away and go back to your old framework at the slightest imperfection)...

In your urls.py, you have the following lines...

# Uncomment the next two lines to enable the admin:
# from django.contrib import admin
# admin.autodiscover()

To enable the admin, you uncomment those two lines, among a few other steps.

Delete those lines. Do it, right now. And get the admin out of your mind.

In the last week, I kept finding myself making my models, opening the admin, messing around with the model's admin interface, getting annoyed over how silly the relationships work in the admin and how it's not how I would want users to specify them, and giving up.

The problem was that I was equating the django admin with django. In reality, what I was doing in pylons, writing the "admin" pages manually but perfectly honed to what I wanted, could've been done in django as well.

Now, perhaps eventually, once you've gotten fairly proficient with django, specifically after you've wrapped your head around the forms, you can take a look at the admin and see how it might help you. But starting off using the admin will just distract you.

Monday, August 3, 2009

Shortening bash prompts inside a virtualenv

This post will describe a way to shorten your bash prompt in Linux so that inside the inevitable multiple directories caused by using virtualenv, you can still have a small prompt.

I wanted to take an existing library and split it out into its own project. On my computer, I was going to work with this library at ~/projects/sc/kespa_scrape.

Of course, that kespa_scrape directory is actually the directory holding the virtual_env...


kespa_scrape/
|-- bin
| |-- activate
| |-- activate_this.py
| |-- easy_install
| |-- easy_install-2.6
| |-- nosetests
| |-- nosetests-2.6
| |-- python
| |-- testall.sh
| `-- to3.sh
|-- include
| `-- python2.6 -> /usr/include/python2.6
|-- kespa_scrape
| `-- kespa_scrape
`-- lib
`-- python2.6


This is what the kespa_scrape directory looks like. You can see it has a kespa_scrape directory inside it as well; this directory is the root of the actual project. That directory will probably also have a kespa_scrape directory of its own, which would be the actual library package directory. Obviously there is redundancy, but it's necessary redundancy to keep all the parts where I want them.

Some people complain about virtualenv because then you need to cd through a whole bunch of directories to start working. I make this easier with a script. This is how it is used...


gobo@gobo:~$ cd projects/sc/
gobo@gobo:~/projects/sc$ activate_project kespa_scrape
(kespa_scrape)gobo@gobo:~/projects/sc/kespa_scrape/kespa_scrape$


As you can see, I change to the directory where the project is, enter "activate_project", and it will put me into the virtual environment as well as change to the root directory of the project.

The problem I still had with this is that now my prompt is huge. I would typically change directories once more into the next kespa_scrape package, and with a large enough library name, have a prompt that easily took up half the screen.

At first, I made sure to choose short library names, but now I've come up with a better solution. It rewrites the bash prompt to replace the virtual environment directory with two tildes, and looks like this...


gobo@gobo:~$ cd projects/sc
gobo@gobo:~/projects/sc$ activate_project kespa_scrape
(kespa_scrape)gobo@gobo:~~$


Check out the difference for yourself...


(kespa_scrape)gobo@gobo:~/projects/sc/kespa_scrape/kespa_scrape$
(kespa_scrape)gobo@gobo:~~$


Here's the full script...


#!/bin/bash

# Remove possible slash at end.
DIR=${1%/}

ACTIVATE_FILE=$DIR/bin/activate

# Note: this script is meant to be sourced (see the edit below),
# so we use return rather than exit.
if [ -z "$DIR" ]; then
    echo "usage: $0 directory"
    return 1
fi

if [ ! -d "$DIR" ]; then
    echo "Directory not found: $DIR"
    return 1
fi

if [ ! -d "$DIR/$DIR" ]; then
    echo "Child directory not found: $DIR/$DIR"
    return 1
fi

if [ ! -e "$ACTIVATE_FILE" ]; then
    echo "Activate file not found. Are you sure this is a virtualenv directory?"
    return 1
fi

# Enter the virtualenv directory, activate it, then drop into the
# project root nested inside it.
cd $DIR
source bin/activate
cd $DIR

CUR_DIR_CMD='pwd | sed -e '\''s!'$VIRTUAL_ENV/$DIR'!~~!g'\'' | sed -e '\''s!'$HOME'!~!g'\'
PS1=${PS1//\\w/$\($CUR_DIR_CMD\)}

alias cdp='cd '$VIRTUAL_ENV/$DIR


After checking to make sure all the files exist, it sets a variable called CUR_DIR_CMD. This is basically a variable that contains a bash command to take the current directory and replace any instance of the virtualenv directory with ~~. Since virtualenv sets the VIRTUAL_ENV variable, it basically looks for "/projects/sc/kespa_scrape/kespa_scrape" and replaces it with ~~. It will also take the result of that and replace instances of the home directory with ~, so that if I change directories to somewhere outside of the virtualenv, I still get '~' instead of '/home/gobo'.

Next, I need to replace all instances of \w in the prompt string with this expression. \w is the special string you'd use when setting your bash prompt to give you the full directory that you're in. Replacing it with my custom command allows me to replace the area where the directory should go with what I want, but leave the rest of the prompt (including virtualenv's addition to the beginning) intact.
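
To see that ${PS1//...} parameter expansion at work on its own, assuming a simple starting prompt:


$ PS1='\u@\h:\w\$ '
$ echo "${PS1//\\w/REPLACED}"
\u@\h:REPLACED\$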

The entire process also does pretty well when you change directories to outside the virtualenv, since you're just left with the directory...


(kespa_scrape)gobo@gobo:~~$ cd
(kespa_scrape)gobo@gobo:~$ cd projects/
(kespa_scrape)gobo@gobo:~/projects$


The only thing I want to fix up on it is to allow activate_project to work from ANY directory. Currently, it assumes that the project directory is right in front of you. This is good enough for now though.

----

Edit: Feb 18, 2010

Also, one thing to keep in mind, for those who don't really deal much with bash scripts, is that typically when you launch a script, it runs inside its own shell, and when it completes, that shell closes. This means you lose things such as the new alias or the change of directory. So, you can't run the script in any of these ways:


sh /home/mark/bin/activate_project project_name
/home/mark/bin/activate_project project_name


If you did this, the script would run, but you would not see anything change, since all the changes would be made in the shell created for the script, which exits without affecting the shell you ran it from. To make the script run in your current shell, you need to use the source or dot operator.


. /home/mark/bin/activate_project project_name
source /home/mark/bin/activate_project project_name


As a side note, now you really know what that source bin/activate line does! Anyway, you don't want to deal with this hassle, so set up an alias...


alias activate_project="source /home/mark/bin/activate_project"

Friday, July 31, 2009

SC Engine: The Trouble With Testing

This is sort of an ad-hoc post, so I'm not really considering it part of my SC Engine series.

The Mercurial repository shows SC Engine at just under two months old. I guess it's time to look back at some of the things that I've tried and maybe think about stuff that worked vs. stuff that didn't work.

The big change in this project from other things I've done is the architecture. Although I'm sure my implementation was not the best, I really don't know how much I approve of a messaging system for everything. On the one hand, it did great at separating out different parts of the system, and on very large systems I could imagine it being useful when you have multiple systems on different platforms. Also, horizontal scaling would be extremely easy, although this wasn't really a concern for me. Really, the big problem was that testing applications was not as easy as I had hoped.

The goal was simple: make testing apps as easy as passing in data and collecting the data back. Make sure that the data returned was expected. Rinse and repeat...


def test_some_app(self):
    msg = IncomingMsg(1, 2)
    results = self.app.on_msg(msg)
    expected = OutgoingMsg(1, 3)
    self.assertContains(results, expected)


The problem is with the messages that had a TON of attributes and needed a few earlier messages to set it up. Suddenly, a test could look like...


def test_some_app(self):
    incoming_data_1 = {
        'data_1' : 1,
        'data_2' : 1,
        'data_3' : 1,
        'data_4' : 1,
        'data_5' : 1,
        'data_6' : 1,
        'data_7' : 1,
    }

    incoming_data_2 = {
        'data_1' : 1,
        'data_2' : 1,
        'data_3' : 1,
        'data_4' : 1,
        'data_5' : 1,
    }

    incoming_msg_1 = IncomingMsg(**incoming_data_1)
    incoming_msg_2 = IncomingMsg(**incoming_data_2)

    self.app.on_msg(incoming_msg_1)
    self.app.on_msg(incoming_msg_2)

    incoming_data_3 = {
        'data_1' : 1,
        'data_2' : 1,
        'data_3' : 1,
        'data_4' : 1,
        'data_5' : 1,
    }

    incoming_msg_3 = IncomingMsg(**incoming_data_3)
    results = self.app.on_msg(incoming_msg_3)

    expected_outgoing_data = {
        'data_1' : 1,
        'data_2' : 1,
        'data_3' : 1,
        'data_4' : 1,
        'data_5' : 1,
    }

    expected_outgoing_msg = OutgoingMsg(**expected_outgoing_data)
    self.assertContains(results, expected_outgoing_msg)


Now, sometimes I only needed data_1 and data_2, and I couldn't care less what the rest of the incoming data was. However, I often needed to make sure that, say, data_3, data_4, and data_5 on the outgoing message matched the incoming ones. There just seemed to be a theme of code being less logic and more naming of attributes. Basically, every app needed to take all the attributes and specify what they were in its own way. This wasn't just a problem in tests; the actual applications themselves were, at times, reeking of attribute dictionaries and no real logic. It made the code pretty ugly looking.

In the end, I started writing fewer unit tests. At first, I thought this was ok. I didn't necessarily want to write unit tests for every class as I was writing it, since sometimes you really need to write the objects and see how they work together before you have a clear idea of the best way to implement them. Later, I realized it was really because the tests were a pain. For some objects that was understandable, but apps were pretty straightforward in terms of their interface (it's a bunch of methods with a single argument: the msg).

One potential solution would be to split messages up. In all actuality, I did this in some cases. However, in terms of testing, this would mean going from large messages to having more messages, which means the test would be just as long.

Another potential solution was to build helper objects to build the messages, but in the case of tests where all the attributes were indeed important, this wouldn't really help. Plus, often the attributes might not be important for the specific test. This probably would just end up being more work anyway.

At this point, I'm starting to see my problem in the following manner: even the simplest of applications is trying to do three things:

1.) Take the data that was passed to the app via messages and convert it into some locally-defined data structure.
2.) Run operations on the data structures to arrive at results.
3.) Convert results into messages to send out.

In trying to create tests for my app to make sure that #2 was being done correctly, I would inevitably end up writing extra code for EACH test to ensure that #1 and #3 were being done as well.

Perhaps I should start thinking of these as separate stages altogether. I can keep much of the infrastructure, but place objects on top that split the functionality of handling a single message into the three distinct parts.
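
As a rough sketch of what that separation might look like (all names here are hypothetical, not code from the engine):


from collections import namedtuple

# Hypothetical local structures and messages, for illustration only.
GameRecord = namedtuple('GameRecord', ['id', 'score'])
Result = namedtuple('Result', ['id', 'score'])
OutgoingMsg = namedtuple('OutgoingMsg', ['id', 'score'])

class Translator(object):
    """Stage 1: convert an incoming message to a local structure."""
    def to_local(self, msg):
        return GameRecord(id=msg.data_1, score=msg.data_2)

class Logic(object):
    """Stage 2: pure logic on local structures; trivially testable."""
    def process(self, record):
        return Result(id=record.id, score=record.score + 1)

class Publisher(object):
    """Stage 3: convert results back into outgoing messages."""
    def to_messages(self, result):
        return [OutgoingMsg(result.id, result.score)]


With a split like that, only the middle stage needs heavy testing; the two translation stages can each get one or two dumb tests of their own.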

Edit:

Alternatively, I could move to a design where any message carries only enough data to identify the important info, and the rest of the data is stored in a database somewhere outside of the application. Basically, the message would only contain ids.

Wednesday, July 22, 2009

SC Engine: Part 8 - The Site

- Introduction
- Part One: System Overview
- Part Two: System Overview: Messages and Applications
- Part Three: Screen Scraping
- Part Four: YouTube Parsing
- Part Five: Linking the Video to the Game
- Part Six: Messaging Middleware
- Part Seven: The Console
- Part Eight: The Site

In my final planned post for SC Engine, I'll talk about how the data, after moving around through all these apps, ends up on the web site.

One application that is really no different than any other in terms of how it is made is called 'web.site_interface'. Here's what it looks like...

Web Site App

You'll notice that the app is simply named "App". I recently did a refactoring that allows me to create apps based on the module name, which makes bringing together all the apps very simple. Anyway, in the application's builder method, we see a good example of the application getting a configuration (the password to the site). The 'poster' method is a function that's used to send data using an HTTP POST to the web site at a specific location. The entire body of the request is json, and consists of the data and a password. This password, combined with SSL, should give me enough protection to allow updates to the site without worrying about a third party easily swooping in and running whatever commands they want.
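
A minimal sketch of what such a 'poster' helper might look like (the URL handling, field names, and signature here are my guesses, not the engine's actual code):


import json
import urllib2

def make_poster(base_url, password):
    """Return a function that POSTs JSON to the site at a given path."""
    def poster(path, data):
        body = json.dumps({'data': data, 'password': password})
        request = urllib2.Request(base_url + path, body,
                                  {'Content-Type': 'application/json'})
        return urllib2.urlopen(request)
    return poster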

On the other side, pylons is waiting in its update controller...

Update Controller


In case you were wondering, this is similar to how the SC Console works as well. Really, the controller just converts the json into a dictionary of values and passes it on to the updater, which does the heavy lifting...

Updater

In terms of updates, that's all there is to it.

One of the more interesting points is the url generation, such as the following:

http://sc.markhildreth.webfactional.com/ShinhanBankProleague0809/Round1/Week1/Day1/HiteSPARKYZ-v-HwaseungOz/Set1

Currently, the url routes are a bit hackish...


# CUSTOM ROUTES HERE
map.connect('/', controller='main', action='index')
map.connect('/uploader_info', controller='main', action='uploader_info')
map.connect('/contact', controller='main', action='contact')
map.connect('/contact_submit', controller='main', action='contact_submit')
map.connect('/contact_submitted', controller='main', action='contact_submitted')

# Uploads
map.connect('/update/{action}', controller='update')

# Ajax retrieval of commentary
map.connect('/commentary/{game_id}/{author}', controller='commentary', action='view')

# Matches
map.connect('/{league}/{stage_0}/{stage_1}/{stage_2}/{team_one}-v-{team_two}',
            controller='league', action='match')
map.connect('/{league}/{stage_0}/{stage_1}/{team_one}-v-{team_two}',
            controller='league', action='match')
map.connect('/{league}/{stage_0}/{team_one}-v-{team_two}',
            controller='league', action='match')
map.connect('/{league}/{team_one}-v-{team_two}',
            controller='league', action='match')

# Match set
map.connect('/{league}/{stage_0}/{stage_1}/{stage_2}/{team_one}-v-{team_two}/Set{game_number}',
            controller='league', action='view')
map.connect('/{league}/{stage_0}/{stage_1}/{team_one}-v-{team_two}/Set{game_number}',
            controller='league', action='view')
map.connect('/{league}/{stage_0}/{team_one}-v-{team_two}/Set{game_number}',
            controller='league', action='view')
map.connect('/{league}/{team_one}-v-{team_two}/Set{game_number}',
            controller='league', action='view')

# Stages
map.connect('/{league}/{stage_0}/{stage_1}/{stage_2}', controller='league', action='stage')
map.connect('/{league}/{stage_0}/{stage_1}', controller='league', action='stage')
map.connect('/{league}/{stage_0}', controller='league', action='stage')
map.connect('/{league}', controller='league', action='stage')


Eventually, I'm going to need to do matches other than team_one vs. team_two (some leagues have matches that contain four players in a group, and the url would need something like "GroupA"), and I figure that when that time comes I'll rewrite how the rule matching works. But it works for now, and I can see it all on one page, so I'm not too concerned with it (a year ago I would've spent another day getting these routes down to five lines of code; one example of how I think I've matured as a programmer recently).

Right now, the SQLAlchemy code that actually retrieves the data is very simple. It suffers from N+1 select issues, and at some point will need to be rewritten for performance reasons. I won't show any of it, because it really is just a matter of getting some data by id and then repeatedly fetching child data. It ends up making six or more database calls per page, and at some point I'll want to reduce that number to one. One thought was to use an application to denormalize the data and upload it already denormalized, but that seems like a bunch of work for what could probably be solved with a few minutes of tweaking the SQLAlchemy queries.

Ignoring specific files or directories with Mercurial

I switched from svn to Mercurial about a year ago, and for the most part it has been going well. One gotcha is that when trying to ignore files, things may not work as you expect (instead, they work as you SHOULD have expected).

Imagine a directory structure like this...


project
|-- development.ini
|-- run.py
`-- data


Data is a directory that stores the data for your project while running, so you want to ignore it. Opening up your .hgignore file you've had for awhile, you see something like this...


syntax:glob

*.pyc
*.log


...and you decide to do this...


syntax:glob

*.pyc
*.log
data


This is a bad move. Sooner or later, your project is going to look like this...


project
|-- development.ini
|-- run.py
|-- data
`-- src
|-- main.py
`-- parts
|-- graphics
| |-- stuff.py
| `-- stuff2.py
`-- data <-- !!!
|-- loader.py
`-- retrieval.py


And suddenly, when you make that src/parts/data directory, it's going to be ignored by Mercurial. If you want to ignore a specific directory, such as one that is used for data (common when using pylons), it's better to use:


syntax:regexp
^data$


This will ignore only a file or directory named "data" at the root of your repository. The pattern is matched against the name relative to the repository's root, so the data directory inside src would not be matched, since its name is 'src/parts/data'. The ^ and $ are used in regular expressions to signify the start and end of the string.
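
If you ever do want to ignore a nested directory by its full path, you can spell the root-relative path out explicitly:


syntax:regexp
^src/parts/data$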

Friday, July 17, 2009

SC Engine: Part 7 - The Console

- Introduction
- Part One: System Overview
- Part Two: System Overview: Messages and Applications
- Part Three: Screen Scraping
- Part Four: YouTube Parsing
- Part Five: Linking the Video to the Game
- Part Six: Messaging Middleware
- Part Seven: The Console
- Part Eight: The Site

For the most part, SC Engine is designed to be automatic: Read the schedules, get the video data, put them together, upload to a site. However, there are plenty of things that need to be manually done:

The engine might not be advanced enough to parse enough data from the title and description of a youtube video to determine the game. If I detect one of these situations, I want to be able to see the video and what the engine DID manage to get from it, so I can either fix the engine and retry the video, manually link the video to a game, or just ignore the video. Also, while the messaging middleware should do well to keep the engine alive when applications raise errors, I do want some way to see what those errors were, so I can figure out what needs fixing.

For this, I've written a web site designed to work as a frontend to the workings of the engine, called "SC Console". The console is a simple pylons web site that looks like this:



The two main things are error reports and youtube partial parses:

Error Reports





The error report lists all of the errors that have been encountered that still need to be dealt with. It gives some quick data for each error (in the screenshot, we have just one), such as the time the error occurred, the name of the application the error occurred in, the name of the message that was being handled when the error occurred, and the error message itself (constructed by converting the raised error into a string). There's also the ability to retry (that is, send the message out again) or delete.

It's important to realize that retrying will send the message out again, so if a message is handled by two applications and one raises an error that we later retry, BOTH applications will be asked to handle the message. Luckily, it typically isn't too difficult to write applications to gracefully handle the same message more than once, but it is something to be aware of. An application that keeps track of how many games a player has ever played, for example, would need to keep track of the actual games, not just increment a counter every time a message arrives stating that a player is in a game.

Anyway, clicking on "View" gives you more info on the error:



In addition to the data seen in the list screen, we also get to see the actual data that was contained in the message (which is typically a dictionary of values), the traceback of the error, as well as any log messages that were written during the application's execution of handling the message.

Youtube Partial Parses





This screen shows me all the cases where a youtube video came to my attention, had enough data to possibly be a proleague game, but couldn't be matched to a specific game. It shows the date & time of the attempted link, the video id (with a link that I can use to easily open the video), and the title of the video for quick reference. Additionally, the dates, players, teams, and game numbers parsed from the videos are shown, as well as the number of possible results found. And, of course, the same retry and delete buttons.

Clicking on view gives me more info:



The date played links to the kespa website, specifically to the page for that schedule, which is a handy feature. The player ids also have a link to the kespa page.

Down below, information on the games that are seen as possible candidates is shown. These help me figure out whether the engine is anywhere in the ballpark in terms of guessing the game, and also give me all the info I need to, if I wanted, add to the script that tells the engine to manually mark videos for games. However, for most situations (including this one), the solution is to add aliases for players. In this case, players "Baxter" and "[Oops]Cloud" were not recognized. Some sleuthwork reveals that "Baxter" is another name for a player typically known as "Killer" (id: 1305), and "[Oops]Cloud" is typically just known as "Cloud" (id: 348). For something like this I would add these two new aliases to the engine, and then hit "retry". This time around, the link would be made, because the one other game would no longer be reported as a candidate. This is a better result than manually linking, because now future games that use these aliases will also be immediately recognized, causing less work for me.

To quickly go over the rest of the menu:

Proleague Partial Fetch PP:



This is a feature I still haven't implemented, but I put it in the menu anyway. It will be similar to YouTube partial parse, except instead of showing ambiguous YouTube videos, it'll show ambiguous game schedules.

Utils:



This contains a bunch of forms that can be filled out to send specific messages, such as a schedule fetch for a certain day. It was very useful back when the only messaging system just stored the messages locally in a list in a single application, and the only way I could add new messages was to shut down the application and change the code. It still works, though, so I keep it around.

Status:



Will eventually show a "heartbeat" status of all running apps, which is information on how long an application's message queue is, gathered by sending out a heartbeat and seeing how long it takes before the heartbeat is responded to. Just a quick way to check the current load, in addition to viewing the actual messaging system's queues.

SC Engine: Part 5 - Linking the Video to the Game

- Introduction
- Part One: System Overview
- Part Two: System Overview: Messages and Applications
- Part Three: Screen Scraping
- Part Four: YouTube Parsing
- Part Five: Linking the Video to the Game
- Part Six: Messaging Middleware
- Part Seven: The Console
- Part Eight: The Site

The main part of SC Engine is the ability to take the scheduling data, including the names of players and teams and what games they've played, and link it with the youtube data. The app I use for this is called "Video Game Linking". It allows me to specify what youtube id corresponds to what game in the schedule.

Let's just jump right into the code:

Video Game Linking Module

First off, some notes on strange stuff. "repo" stands for "repository", and it's basically my generic dictionary wrapper. The dictionary is wrapped in order to ensure that after I'm done using it, it can be saved to a file. It's not wrapped in a proxy-object sense; rather, it is stored so that the proper way to access it is within a context. On its own, a repository could work like this...


store = PickleFileStore()
repo = Repository(store)

with repo.use() as data:
    data['a'] = 1


When repo.use() is called, it creates a context manager that, on closing, saves off the data. The use decorator that I've utilized on my app just wraps this functionality and adds the resulting data as arguments to the function. The proleague_match_id_announcement method, without the decorator, would look like this...


def proleague_match_id_announcement(self, msg):
    with self.repo.use('games') as games:
        games.add_proleague_match(msg.match_id, msg.team_one, msg.team_two)



...so really, all it helps in doing is making the function use one less tab, which is always a plus in my book.
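
The decorator itself isn't shown here, but a minimal sketch of one that behaves as described might look like this (the names are assumptions):


from functools import wraps

def use(repo_name):
    """Hypothetical decorator: open the named repository around the
    handler and pass its data in as an extra argument."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(self, msg):
            with self.repo.use(repo_name) as data:
                return fn(self, msg, data)
        return wrapper
    return decorator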

Also, the "add_repository" method just allows for custom types to be used as the object you're working with, where without this your data would just be a dictionary. By allowing for a custom type (that takes in the dictionary as the argument to the constructor), I can easily wrap complex logic into other objects while still utilizing the implicit saving of the store. I probably did a horrible job explaining this, but I think the important part is what the application is doing anyway.

The typical workflow for this app is as follows:


  1. Somewhere else, a new match is found for the schedule, and an id is given to it. The ProleagueMatchIdAnnouncement is sent out, which gives info regarding the match and the id given to it. The app saves the information that it needs.

  2. A little while after, the GameDataAnnouncement message arrives. It has the match id, and the game number to determine what game in the match it is. The rest of the data is game specifics (id of players, id of map, winner, etc.). Once again, the important data is recorded, joined with the match data.

  3. A few hours to days later, someone posts up a youtube clip of the game. We receive the YouTubeParseAnnouncement message, with data such as ids of the possible players and teams, dates found and game numbers found. We go through our collection of games and find any that match. If more than one does, we send out a "Partial Parse" message (which will eventually allow me to look at these videos manually), but if we only find one, we're (mostly) sure it's it, so we send out a VideoGameLinkAnnouncement to signify this.



You'll also notice that there is this idea of a "manual link". Sometimes I stumble across a video that has something different or wrong with it that makes it very difficult for the engine to find. For example...

Siz)KaL vs Midas [30 November, 2008] 32set @ Proleague

This video's title reads "Siz)KaL vs Midas [30 November, 2008] 32set @ Proleague". I can gather from the title that Siz)KaL and Midas are the players involved, and that it takes place on November 30, 2008. The game in the video is the 3rd game of the match, but a typo (32Set instead of 3Set) means the parser sees it as the 32nd game. A future goal is to have the search algorithm better handle such problems, but in the meantime, and for very extraordinary situations, I can manually choose the game.

As for the game search repository itself...

Game Search Repository



As you can see, the GameSearchRepository takes the store, which is simply a dictionary object, into its constructor. It saves info about matches and games, and when trying to find games, will run the spec through all the known games and return the results. The repository needs to save the match data so that when the game data comes, it can combine the two together. The GameData object looks like this:

Game Data

Really, there's not much to say here. The game search consists of finding all the game data that return True from matches_spec, meaning that all the info in the spec correlates to the data it has on the game. Being O(n), this has potential performance issues, but with an entire season's worth of games in memory I haven't noticed enough of a slowdown to start messing with it now.
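
As a sketch of that search idea, under the assumption that a spec is just a dictionary of known fields (this is not the engine's actual code):


class GameData(object):
    def __init__(self, **data):
        self.data = data

    def matches_spec(self, spec):
        # A game matches if no field in the spec contradicts its data.
        return all(self.data.get(key) == value
                   for key, value in spec.items())

def find_games(games, spec):
    """Linear scan: return every game consistent with the spec."""
    return [game for game in games if game.matches_spec(spec)]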

Sunday, July 12, 2009

SC Engine: Part 6 - Messaging Middleware

- Introduction
- Part One: System Overview
- Part Two: System Overview: Messages and Applications
- Part Three: Screen Scraping
- Part Four: YouTube Parsing
- Part Five: Linking the Video to the Game
- Part Six: Messaging Middleware
- Part Seven: The Console
- Part Eight: The Site

In the previous posts, I've talked about the various applications that deal with the domain logic, and how they communicate through message passing. They are designed to run on top of a messaging bus. This post will talk about the implementation of the message bus software on top of RabbitMQ using amqplib.

First, the ApplicationMessageBus...

Application Message Bus

One ApplicationMessageBus object is created for each "application" (such as YouTubeFetch, or ScheduleFetch). To get an idea of where this is used in everything, here's some code that could potentially be written:


import my_app

name = "my_app"
root_config, config = {}, {}
subscriptions = my_app.make_my_app(name, root_config, config) # Run the app's builder function

bus = ApplicationMessageBus(name, subscriptions)


Once again, the ApplicationMessageBus is similar to applications in that it's not tied down to any specific messaging library. Its purpose is to receive a message and, knowing the subscriptions for the app, pass the message to the correct handler. It also takes the returned messages and properly creates a list of outgoing messages, which it returns to its caller. This is to make things a bit easier for the caller, since it'll always expect a list.

A few other notes about this object. First, in addition to a handle method, there is an idle method. This gets called every once in a while and allows me to write apps that, in addition to having handlers for messages, can also have handlers for elapsed time. This means I can write an app like this...


def make_my_app(name, root_config, config):
    app = MyApp()

    return {
        msgs.AMessage : app.handle_a_message,
        timedelta(minutes=5) : app.run_every_five_minutes,
    }

class MyApp(object):
    def handle_a_message(self, msg):
        ...

    def run_every_five_minutes(self):
        ...


Also, let's take a closer look at the part that actually calls the application's callback:


def _run_callback(self, callback, msg=None):
    log_msg = "Running callback on %s for message %s" % (callback, msg)
    self.logger.info(log_msg)

    root_logger = logging.getLogger()
    logging_handler = CollectionLogHandler()
    root_logger.addHandler(logging_handler)
    try:
        if msg:
            results = callback(msg)
        else:
            results = callback()

    except Exception as e:
        exceptionTraceback = sys.exc_info()[2]
        tb = traceback.extract_tb(exceptionTraceback)
        log = logging_handler.log

        results = AppErrorOccurredMessage(self.app_name, msg, unicode(e), tb, log)
    finally:
        root_logger.removeHandler(logging_handler)

    log_msg = "Results of callback %s: %s" % (callback, results)
    self.logger.info(log_msg)

    return results


First off, before the application's callback runs, it adds a handler to the root logger. This will catch all log entries that happen inside the application's handler while it runs. Next, the callback is invoked inside a try block. If the application doesn't throw an error, the function just returns the results after cleaning up the log interceptor. If it does throw an error, all the logs that were being intercepted, along with the error and traceback information, are collected and placed into an AppErrorOccurredMessage. This message is just like any other message, and as such can be returned and expected to be sent out over the wire.

In my closing words about the ApplicationMessageBus, you'll notice that it can also respond to HeartbeatMessage messages. I'm still debating how exactly I want these to work, as I've discovered some problems with the way I've currently implemented them. Basically, the idea is that I should be able to send out heartbeats, and the applications should send responses, so that I can get an idea of the status of applications. The problem is with applications that are dutifully reacting to messages but are a bit backlogged: they won't respond to a heartbeat for a while, because the heartbeat message is pretty far down in their FIFO queue. In the meantime, I'll think that the application is somehow down. The original goal of the heartbeat was to determine what applications are up RIGHT NOW, so this doesn't really work too well. However, it may prove useful for determining loads on an application. Anyway, like I said, it's still a work in progress.

Speaking of works in progress, I've arrived at the fun part, the MainMessageBus...

Main Message Bus

This is an object that is potential main-method material. Anyway, here's how you might use it...


import app1
import app2

transport = ...
main = MainMessageBus('main', transport)

root_config = ...
app1_config = ...
app2_config = ...

# Build app1
subscriptions = app1.make_app1('app1', root_config, app1_config)
amb = ApplicationMessageBus('app1', subscriptions)
main.add_app('app1', amb, subscriptions)

# Build app2
subscriptions = app2.make_app2('app2', root_config, app2_config)
amb = ApplicationMessageBus('app2', subscriptions)
main.add_app('app2', amb, subscriptions)

try:
    main.run()
finally:
    main.close()



ITransport looks like this...
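
Roughly, it has this shape (reconstructed from the description below; method names other than send and retrieve are my guesses):


class ITransport(object):
    def setup_exchange(self, msg_class):
        """Set up the exchange/queue wiring for a subscription."""
        raise NotImplementedError

    def send(self, msg):
        """Publish a message to whoever subscribed to its type."""
        raise NotImplementedError

    def retrieve(self, name):
        """Return a context manager yielding the next message for
        `name` (or None); the message is only marked done if the
        block exits cleanly."""
        raise NotImplementedError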



The interesting part is the retrieve method, which doesn't just return a message, but instead returns a context manager for the message. This is so that we can easily write code to handle the message, and correctly set up a way for the transport to finish anything it needs to once we're sure that the message handling is done. This is shown in the MainMessageBus...


with self.transport.retrieve(name) as msg:
    if not msg:
        continue

    data = (msg.__class__.__name__, name, pformat(msg.to_dict(), indent=5))
    log_msg = 'Handling %s in %s\n%s' % data
    self.messages_logger.info(log_msg)

    results = app_bus.handle(msg)
    handled_count += 1
    outgoing_messages = results

map(self.send, outgoing_messages)


If for any reason the system were to go down or an exception were to occur during this context, the context manager would not exit in a way that tells the transport to mark the message as done. This is important to make sure that the message will still be there to be rerun when the system starts up again. You'll see how this works in the AMQPTransport...

AMQPTransport

First off, you may notice that I'm constantly opening and closing channels. This is because I've experienced problems with sending messages that had no exchanges set up for them. An exchange is set up when the transport is told to set it up, and this happens when the main message bus sees that an application has subscribed to it. It is possible, however, for an application to send out a message that no one has subscribed to (I just haven't written the application to deal with that message yet). Because of the asynchronous nature of amqp, unless I closed the channel, I might not get the error message until the next time I tried to use the channel. So, instead, I've decided to use a fresh channel in each method. Closing the channel forces the error message from amqp to be raised in the same method as the code that caused the problem (in my "sending a message that no one subscribed to" problem, this would happen in "send"), where I can then trap and handle it.

You see in the retrieve method the context managers being created. What's nice about this ITransport is that it came about as the refactoring point when I wanted to switch from my contained-in-memory prototype transport to AMQP. I still have the "LocalTransport" hanging around:

Local Transport

SC Engine: Part 4 - YouTube Parsing

- Introduction
- Part One: System Overview
- Part Two: System Overview: Messages and Applications
- Part Three: Screen Scraping
- Part Four: YouTube Parsing
- Part Five: Linking the Video to the Game
- Part Six: Messaging Middleware
- Part Seven: The Console
- Part Eight: The Site

I wasn't the first person to try to do a project like this. There have been others...

Starcraft Gaming, sort of a blog-style list of games that probably uses the youtube api for specific youtube accounts.
lart.no/sc, which most likely does the same. They seem to have the ability to determine what league the videos are from. I hear the person who wrote this did it in four hours, just for themselves.
sc.rpls.info, that is also just a list of videos. They seem to be able to sometimes get the races of the games being played.
SC2GG Vod Tracker, hosted at sc2gg, a site whose community is largely based on english commentaries of pro games. It has a (sometimes buggy) ability to determine the players and maps, and like some of the other sites it supports some semblance of a search engine and voting on the quality of games. They also have a nice "recently best voted" section, and seem to be one of the few automated solutions that can link together multi-part commentaries (one game spread over multiple YouTube videos that are meant to be watched in order).

These solutions don't seem to go to the scope of what I was planning, but even with what they were doing, you can see that there were problems getting information from the youtube videos.

First off, they will only look at videos from known youtube accounts. That means they could miss videos by uploaders not on their list.

Secondly, the automated solutions all showed an inability to correctly identify much of the game using just the title and description of the video. To be honest, this is probably an impossible task, as most videos don't provide all the details necessary. Many would include the date played, player names, and team names (if in a league with teams). The automated solutions that did try to tease out some of that data seemed to have a list of player, team, or map names that they would search the title and description for. This resulted in situations where a player named "Great" would show up for a lot of games if the word "great", used in any context, happened to be in the title or description.

To be sure, there is no standard format for a title and description of Starcraft videos for all uploaders to use. Even a single commentator might use a different format in some of their own videos.

Luckily, I didn't need to parse all the data directly from the videos. So long as I had a database of the games played, I needed only to parse just enough of it to be able to link it to the correct game. Then I could use all the data from the schedule database.

So, after viewing a bunch of different "formats" used by the commentators, I came to a few realizations:


  • Trying to find data in the description is dangerous, because there could be a ton of stuff in the description that doesn't have anything to do with the game.

  • Finding a date that the game was played goes a LONG way toward making the game easier to find in the database. On the busiest days in Starcraft, you can see around 15 total games played, although 3-8 is more typical. If I can find the date, I might only need one other piece of information (a player name, a game number) to make a link. As such, I wanted to concentrate heavily on making sure that if there's a date in there, I can find it.

  • Using the published date of the video to try to determine the date played of the match is probably a dead end. Although in most cases, videos are posted within a few days of the match, some videos are posted for epic games that happened years ago.

  • The video's title typically had data that was more likely to be useful than the video's description, even if there were fewer characters there. This is kind of obvious, because for a commentator who wants their audience to easily find their video, it would behoove them to put that necessary information there.

  • Although every commentator had their own "formats", most of them put something along the lines of X vs Y, or X v Y, or X versus Y in the title. X and Y could be teams, players, or both.



Knowing this, I set out to make my parser. First off, parsing dates. The parser currently can parse dates in any of these formats, with the ability to add more formats available...


date_formats = [
    make_format('%(day)s/%(month)s/%(year)s'),
    make_format('%(month)s/%(day)s/%(year)s'),
    make_format('%(year)s-%(month)s-%(day)s'),
    make_format('%(month)s-%(day)s-%(year)s'),
    make_format('%(day)s %(month)s %(year)s'),
    make_format('%(month)s %(day)s %(year)s'),
    make_format('%(day)s %(month)s, %(year)s'),
    make_format('%(month)s %(day)s, %(year)s'),
    make_format('%(day)s %(month)s , %(year)s'),
    make_format('%(month)s %(day)s , %(year)s'),
    make_format('%(day)s, %(month)s, %(year)s'),
]


Where "day" can be any of 1, 01, or 1st, "month" can be any of 1, 01, Jan or January, and "year" can be any of 09 or 2009. The algorithm will look in the title and description, but if it finds results in the title, it uses that. In fact, as of now, the parsing algorithm will not look for any data other than dates in the description, and rely on the title.

Next is "versus", which is so commonplace I basically rely on it to determine the participants in the match, rather than try to do a full text search for all known players and teams. The regex looks like this:


([a-z0-9\[\]\.\-_\)]+)\s+(?:v|v\.|vs|vs\.|versus)\s+([a-z0-9\[\]\.\-_\)]+)


As you can see, I can support...


  • X v Y

  • X v. Y

  • X vs Y

  • X vs. Y

  • X versus Y



...with or without spaces between the versus phrase and the participants. One problem with this method is that if there is a space in the name (typical for teams), you lose some of the data on the team name. However, just one word of the name is often enough to identify a team.

Much of the rest is just regular expressions as well, including the need to look for the video's part number for multi-part videos. Typically, if a game takes longer than 10 minutes (the approximate limit for most youtube videos), the uploader splits it up into two or more videos and puts in some way of saying that this is part x of the entire series...


  • Part 1

  • Part 1 of 2

  • (1/2)

  • P1 of 2

  • P1



Tests for this parser typically use actual titles and descriptions that I've found online...


def test_participants_from_title(self):
    r1 = test_data(title="Bacchus OSL Ro36: Gogo v Luxury Set 1")
    self.assertTrue('gogo' in r1['participants'])
    self.assertTrue('luxury' in r1['participants'])

    r2 = test_data(title="Bacchus OSL Ro36: Gogo vs Luxury Set 1")
    self.assertTrue('gogo' in r2['participants'])
    self.assertTrue('luxury' in r2['participants'])

    r3 = test_data(title="Bacchus OSL Ro36: Gogo v. Luxury Set 1")
    self.assertTrue('gogo' in r3['participants'])
    self.assertTrue('luxury' in r3['participants'])

    r4 = test_data(title="Bacchus OSL Ro36: Gogo vs. Luxury Set 1")
    self.assertTrue('gogo' in r4['participants'])
    self.assertTrue('luxury' in r4['participants'])

    r5 = test_data(title="Bacchus OSL Ro36: Gogo versus Luxury Set 1")
    self.assertTrue('gogo' in r5['participants'])
    self.assertTrue('luxury' in r5['participants'])

    r6 = test_data(title='MBC v Woonjin: Light v Zero (P2/2)[Single] 5/17/09')
    self.assertTrue('mbc' in r6['participants'])
    self.assertTrue('woonjin' in r6['participants'])
    self.assertTrue('light' in r6['participants'])
    self.assertTrue('zero' in r6['participants'])

    r7 = test_data(title='InteR.Mind vs Siz)KaL [15 Apil, 2009] 1set')
    self.assertTrue('inter.mind' in r7['participants'])
    self.assertTrue('siz)kal' in r7['participants'])

    r8 = test_data(title='type-b vs Saint[z-zone]')
    self.assertTrue('type-b' in r8['participants'])
    self.assertTrue('saint[z-zone]' in r8['participants'])



Right now, it is not able to distinguish teams from players. Just having the participant list is fine, though, as determining which participants are teams or players will happen somewhere else. No need to include a ton of data to make the youtube parser any more complex than it needs to be.

So, in the grand scheme of things, the YouTube parser gets in a message saying that a YouTube video was found, which includes the video id, title, and description, and tries to get as much info as possible from it. It then sends out a message with the info it found for someone else to chew on for awhile.

Saturday, July 11, 2009

SC Engine: Part 3 - Screen Scraping

- Introduction
- Part One: System Overview
- Part Two: System Overview: Messages and Applications
- Part Three: Screen Scraping
- Part Four: YouTube Parsing
- Part Five: Linking the Video to the Game
- Part Six: Messaging Middleware
- Part Seven: The Console
- Part Eight: The Site

As I said in my last few posts, one of the things that needed to be done for this project was to gather the data on all schedules. I decided the best way to go about doing this was to scrape info from the KeSPA website. Here is a sample schedule page:

http://www.e-sports.or.kr/teams/player1.kea?m_code=team_24&pGame=1&pCode=1248



In case I haven't said it before: No, I do not speak Korean. Luckily, I didn't really need to. Some stuff could be determined from this page using only intuition (and the google translate add-on :)

In the page's html, the players and maps are anchor elements linking to pages with more info on the player and map, respectively. Those urls have a unique id for each player and map, so I decided to use these ids throughout my engine. The league name and teams are just plain characters, so I grab them as text while I'm scraping.

After fetching the page using urllib2, I cut out some of the cruft of the page, and load the rest into BeautifulSoup. Testing these objects is done by saving actual examples I get from the site to a file, and using those files as the test. When I find a new style, I save that data to another file and write a test for it. Here's an example...


class ProleagueMatchScraperTests(BaseScraperTestCase):
    """
    Original =
    http://www.e-sports.or.kr/schedule/daily01_sche.kea?m_code=sche_12&gDate=20090603&gDvs=T&miniCal=2009-06-01
    """
    TEST_FILE = "proleague/match.html"
    SCRAPER = ProleagueMatchScraper

    def test_teams(self):
        self.assertEquals(self.results['team_one'], u'KTF')
        self.assertEquals(self.results['team_two'], u'MBC게임')
        self.assertEquals(self.results['winner'], None)
        self.assertEquals(self.results['winner_score'], None)
        self.assertEquals(self.results['loser_score'], None)

    def test_game(self):
        self.assertEquals(len(self.results['games']), 5)

        game = self.results['games'][0]
        self.assertEquals(game['player_one'], 988)
        self.assertEquals(game['player_two'], 851)
        self.assertEquals(game['map'], 1193)
        self.assertEquals(game['winner'], None)

    def test_ace_match(self):
        game = self.results['games'][4]

        self.assertEquals(game['player_one'], None)
        self.assertEquals(game['player_two'], None)
        self.assertEquals(game['map'], 1207)
        self.assertEquals(game['winner'], None)

    def test_stage_info(self):
        self.assertEquals(self.results['stage_path'], ['Week 1', 'Day 5'])



Each test class has TEST_FILE and SCRAPER attributes that are used by BaseScraperTestCase to run the entire scrape in setUp. TEST_FILE is the name of the file containing the html I pulled from the web site for that test, while SCRAPER is the class that will actually do the scraping. Thus, I can add new tests very easily.
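
Here's my guess at what BaseScraperTestCase amounts to, based on the description above (the test-data directory name is an assumption, and any cruft-cutting before souping is omitted):


import os
import unittest
from BeautifulSoup import BeautifulSoup

class BaseScraperTestCase(unittest.TestCase):
    TEST_FILE = None
    SCRAPER = None

    def setUp(self):
        # Load the saved html for this test and run the entire scrape.
        f = open(os.path.join('test_data', self.TEST_FILE))
        try:
            soup = BeautifulSoup(f.read())
        finally:
            f.close()
        self.results = self.SCRAPER(soup).scrape()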

In addition to scraping schedules, I also need to scrape the player pages to find out things like each player's name and race. I'll use that as the example of what an actual scraper object looks like, because it's a bit simpler than the schedule scraper. The page looks as follows:

http://www.e-sports.or.kr/teams/player1.kea?m_code=team_24&pGame=1&pCode=1248



Yeah, the guy's name is "Great". :P




class PlayerScraper(object):
    NAME_PATH = [0, 1, 3, 1, 3, 1, 0, 15, 1, 0, 9, 7]
    RACE_PATH = [0, 1, 3, 1, 3, 1, 0, 15, 1, 0, 7, 7]

    def __init__(self, soup):
        self.soup = soup

    def scrape(self):
        # If the name is not there, then we have a blank page, so it's not a
        # legit player.
        pre_name_elem = utils.dive_into_soup(self.soup, self.NAME_PATH)
        if len(pre_name_elem.contents) == 0:
            return None

        name = unicode(pre_name_elem.contents[0]).strip()

        pre_race_elem = utils.dive_into_soup(self.soup, self.RACE_PATH)
        if len(pre_race_elem.contents) == 0:
            race = None
        else:
            race = unicode(pre_race_elem.contents[0]).lower()

        return {
            'name' : name,
            'race' : race,
            'aliases' : []
        }


The "soup" constructor argument is the BeautifulSoup object. I've found the easiest way to get at the data I want is to construct a "path" to the html element. The utils.dive_into_soup function looks like this:


def dive_into_soup(soup, content_navigation):
    s = soup

    for index, content in enumerate(content_navigation):
        try:
            s = s.contents[content]
        except IndexError as e:
            raise DiveError(e, content_navigation, index)

    return s



So, basically, if the "path" is [0, 3, 2, 4], then starting from the root element, I look at the 0th child element, then at that element's 3rd child element, and so on until I hit the bottom, at which point I return the element I'm at. To make the creation of these "dive codes" easier, I've written a hacky little function to create them for me based on text that I specify (a rough sketch follows).
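
Here's a rough sketch of how such a path-builder might work; the name find_dive_path and the traversal details are my own guesses, not the engine's actual helper:


def find_dive_path(soup, text):
    # Depth-first search for the element that directly contains `text`,
    # returning the list of child indices that dive_into_soup expects.
    for index, child in enumerate(getattr(soup, 'contents', [])):
        if isinstance(child, basestring):
            # NavigableString subclasses unicode, so this catches text nodes.
            if text in child:
                return []
        else:
            path = find_dive_path(child, text)
            if path is not None:
                return [index] + path
    return None


Run against the player page, something like find_dive_path(soup, u'Great') would produce a path like NAME_PATH above.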

Honestly, the idea of a "dive path" is kind of a hack. I'd much rather be able to just say, "give me the element next to #player_name". Unfortunately, the entire kespa site uses table layouts, and the ids and classes are pretty much all for layout purposes as well, so this "dive" approach seems to be the better option.

SC Engine: Part 1 - System Overview

- Introduction
- Part One: System Overview
- Part Two: System Overview: Messages and Applications
- Part Three: Screen Scraping
- Part Four: YouTube Parsing
- Part Five: Linking the Video to the Game
- Part Six: Messaging Middleware
- Part Seven: The Console
- Part Eight: The Site

This first part will give you an overview of the architecture of SC Engine.

First off, for those interested, a snapshot of the code is available online on google code. The point is to let people browse through it, as I have no plans to update that repository regularly (if at all). Also, please note that this is by no means a finished product, and it probably has plenty of bugs...

Google Code Repository

Although the main goal is to gather youtube videos, sorting and organizing them by game means I need a database of the games. I found this database on the website of the Korea e-Sports Association, or KeSPA. It has data on games played, as well as player and map info. While this site is being scraped for data, YouTube is being searched, and if a video can be found that looks like it's about a game we know of, a "link" is made between the two. All data needed by the final web site is eventually uploaded there as it comes rolling in.

There's a lot that can be going on, and I felt this app was a nice opportunity to try out a Message Bus. The idea is that, rather than writing a monolithic application where one part of the app directly makes a call to the next, you write a bunch of small apps that each do their simple job and send out a message stating the results. Any application can listen for these messages and do its own computation, also sending out results. The small applications don't know about each other, only about the messages themselves.

So, using a message bus, a workflow for fetching and storing the schedule might look like this:


  • Every once in a while, a ScheduleFetchRequested message is sent out.

  • The schedule fetcher application is listening for the ScheduleFetchRequested message, and upon receiving it, goes out and fetches the schedule for the date detailed in the message. This involves running code to go to the website and do some scraping. It then sends out a ScheduleFetchAnnouncement message with the results of what it found. Some things it can determine immediately (player ids, map ids), while others it might not be able to (team names, and the league and stage the game is played in).

  • Another application gets this ScheduleFetchAnnouncement and does lookups on the information we couldn't find directly on the web page (such as the team ids or league name). It gathers the results and spits out a ScheduleParseAnnouncement.

  • Another application, which actually stores the schedules on disk, receives the ScheduleParseAnnouncement message and goes ahead and saves them. It doesn't need to send out any message.



So why three apps? Why not have all this done in one app? Or the fetch and parsing in one app, and the storing in another?


  1. Although its merits are debatable, the applications would each be more complex if we shrank them down to two or one. The trade-off is that with three apps, the system as a whole is more complex.

  2. More importantly, by separating the steps with messages, you can later add other applications that get at the data in the middle of the flow. For example, we could create an UnknownData application that also listens for ScheduleParseAnnouncement messages, and if it sees any players or maps that we've never heard of before, sends out the appropriate messages to fetch information on them (see the sketch after this list). Adding this new functionality can be done without touching existing components.
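
Here's a sketch of that hypothetical UnknownData application. The structure of ScheduleParseAnnouncement's data and the known-id set are my assumptions; PlayerFetchRequested is a real message from this engine, discussed in Part Two:


class UnknownDataApp(object):
    def __init__(self, known_player_ids):
        self.known_player_ids = known_player_ids

    def schedule_parse_announcement(self, msg):
        # For any player id we haven't seen before, request a fetch of its
        # info. `msgs` stands for the engine's messages module.
        requests = []
        for game in msg.games:
            for player_id in (game['player_one'], game['player_two']):
                if player_id is not None and player_id not in self.known_player_ids:
                    self.known_player_ids.add(player_id)
                    requests.append(msgs.PlayerFetchRequested(player_id))
        return requests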



Of course, you need to use your own discretion about how far to split up your apps. There really isn't a hard rule; only experience and preference can help here.

Now, let's look at what happens regarding youtube videos:


  • Every once in a while, a YouTubeFetchRequested message is sent out.

  • An application hears this message and runs a query on youtube. Each resulting video's id, title, description, and author are sent in a YouTubeFetchAnnouncement message.

  • Another application is listening for these YouTubeFetchAnnouncement messages. For each one, it runs the title and description through a parser to grab as much information as possible: team names, player names, game number (game X of a set of 5 games), date played, etc. Sometimes the parse doesn't find enough data to even assume that the video is actually a starcraft commentary; in that case, the application doesn't send out a message about it. But assuming it gets at least SOME data from the parse, it puts whatever it found into a YouTubeParseAnnouncement message.

  • The YouTubeParseAnnouncement message goes to the application we talked about earlier that stored the scheduling data. That application runs the youtube information against the stored schedules and looks for possible links. If it finds one, it sends out a VideoGameLinkAnnouncement. If it can't find any match, or the video could match multiple games, the app instead sends out a YouTubePartialParseAnnouncement.



From here, the VideoGameLinkAnnouncement messages eventually go to the web site, whereas the "partial parses", as I call them, are collected by another application to be sorted through manually by me to see what's up. By going through these results by hand, I can come up with changes to the youtube parsing code, or add data to help the search in future tries. I then have the option to retry the parse, or just send out the VideoGameLinkAnnouncement myself with the correct data.

There are other apps and messages that I haven't spoken of, such as fetching and updating data on players and maps, the applications that trigger periodic fetches, and what happens to all those "partial parse" messages. But I think this gives a good idea of how things would tend to work.

So, that's the overview of the system. For the most part, it's a bunch of small apps that share data through passing messages. There is no real central database, as each app is responsible for saving the data in the way they feel suitable.

It's a completely different style than the typical system where all the data is in a central database and all parts of the app work on the same data. I've read about this style of programming on blogs like Udi Dahan's and Ayende's. It's my first time using such a system, and I'm interested in learning about its strengths and pitfalls.

SC Engine: Part 2 - System Overview: Messages and Applications

- Introduction
- Part One: System Overview
- Part Two: System Overview: Messages and Applications
- Part Three: Screen Scraping
- Part Four: YouTube Parsing
- Part Five: Linking the Video to the Game
- Part Six: Messaging Middleware
- Part Seven: The Console
- Part Eight: The Site

This post will go over some of the general implementation for SC Engine.

The system is written in python, and as stated in Part 1, uses a Message Bus to pass messages with information between "applications" inside the engine.

Messages



A message is a simple entity meant to store data that is being sent from one part of the system to another. Here is what a message might look like...


class PlayerFetchRequested(BaseMessage):
    def __init__(self, player_id):
        # int
        self.player_id = player_id


For those interested, all of the messages are viewable in the source (linked below).

As you can see, there's not much meat here. It's just a class that inherits from BaseMessage, with a constructor that takes the data. It could've been implemented as a tuple with a string for the message name and a dictionary for the data, but it's nice to have an error thrown if I don't include all the arguments. BaseMessage also provides a few helpful methods, and is used by the messaging middleware I wrote (which will be discussed in another part).

See all the messages for SC Engine

Applications


Messages are sent between applications. An "application" is a python object that defines a number of methods which respond to incoming messages and optionally send out resulting messages. The point is that each application does one job and sends out messages with the results of what it's done. Where the messages go or who uses them is of no concern to the application. Nor does it care where the messages it receives come from. In this sense, the entire system is implicitly "pub/sub", or "publish and subscribe".

Typically, an application consists of two parts: a builder, and the application itself. Let's look at an application first:

Player Fetch App

This application takes a callable into its constructor (in the end, this callable turns out to be a function that launches the actual fetching and scraping of a web page containing the player data). It uses constructor dependency injection so that I can replace the lookup with a mock and easily test the app.

There is one method here that handles incoming messages, and that is player_fetch_requested. This method is called when the application receives a PlayerFetchRequested message. An application can handle more than one type of message; just add a method for each handling operation. The fact that the method's name is similar to the message name is just a convention, and has no significance other than to be self-documenting.

The method takes in one argument: the message to handle. A method that handles messages like this can then return a message or list of messages of its own, which is how it sends out messages. The point is that this application runs on top of some messaging middleware that passes each message to the correct method, gets the results (which are always messages), and then sends out the resulting messages.

In this case, the lookup is done and, depending on the result, either a message with the info on the player or a message saying that the player doesn't exist is sent out.
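
Since the post only links to the real code, here's my reconstruction of the shape described above (and implied by the tests further down); treat it as a sketch rather than the actual source:


class PlayerFetchApp(object):
    def __init__(self, lookup):
        # Injected so tests can swap in a mock.
        self.lookup = lookup

    def player_fetch_requested(self, msg):
        # `msgs` stands for the engine's messages module.
        info = self.lookup(msg.player_id)
        if info is None:
            return msgs.PlayerFetchNonExistantPlayer(msg.player_id)
        return msgs.PlayerFetchAnnouncement(msg.player_id, info['name'],
                                            info['race'])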

Now, the application by itself is useless, since there's no way of telling the middleware that will be using it which messages map to which methods. I experimented with naming conventions on the methods and with decorators, but eventually settled on a simpler solution: a builder function that does this mapping, as well as giving the application object any dependencies...


def make_app(app_name, root_config, config):
    lookup = get_player_info
    app = PlayerFetchApp(lookup)

    return {
        msgs.PlayerFetchRequested: app.player_fetch_requested,
    }



Originally, this builder was parameterless, but over time it has grown to three parameters. app_name is the name the message bus has determined for the application. The application shouldn't hard-code this or figure it out by itself, because other parts of the system might prepend or append pieces to it. The main reason an application needs this is to use it as the name of its logger or data filenames.

The root_config and config parameters are dictionaries containing items that might be run-time options (typically stored in ini files and parsed by a lower part of the engine). root_config typically contains configuration information for all applications (such as the directory to store any data files), while config stores information specific to this application (an example might be the exact url to use when fetching, although I've decided to hardcode this for now).

The application builder uses the arguments to create the application object. It then returns a dictionary that maps the message type to the method on the app that should handle that message. Typically, this method is simple enough that it doesn't need testing.

Notice that the application is a POPO (Plain Ole' Python Object). It has no dependencies on any messaging system. It just takes in messages and returns resulting messages. The actual job of sending and receiving those messages on any sort of message bus is up to the object that calls the application's methods.

This allowed me to easily set up prototypes in the early stages by using a hacked-together message bus that would just repeatedly send a message to an app's handler, get the resulting messages, then send those out as well. In fact, I pretty much used this system through most of development, not setting up RabbitMQ or anything like that until much later.
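
That throwaway bus might have looked something like this (my sketch; the real middleware, covered in Part Six, is more involved, and unlike this toy version it would allow multiple handlers per message type):


def run_bus(handlers, initial_messages):
    # handlers: {MessageType: handler_method}, merged from the
    # dictionaries returned by each application builder.
    queue = list(initial_messages)
    while queue:
        msg = queue.pop(0)
        handler = handlers.get(type(msg))
        if handler is None:
            continue
        results = handler(msg)
        if results is None:
            continue
        if not isinstance(results, list):
            results = [results]
        queue.extend(results)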

Also, because it's a POPO, this application is extremely simple to test. Here is what the test looks like:


class TestPlayerFetchApp(AppTestBase):
    def setUp(self):
        self.lookup = Mock()
        self.sut = PlayerFetchApp(self.lookup)

    def test_fetches_player(self):
        self.lookup.return_value = {'name' : 'test', 'race' : 'zerg'}

        input = msgs.PlayerFetchRequested(1)
        expected = msgs.PlayerFetchAnnouncement(1, 'test', 'zerg')

        result = self.sut.player_fetch_requested(input)

        self.assertContainsMsg(result, expected)

    def test_fetches_non_existant_player(self):
        self.lookup.return_value = None

        input = msgs.PlayerFetchRequested(1)
        expected = msgs.PlayerFetchNonExistantPlayer(1)

        result = self.sut.player_fetch_requested(input)

        self.assertContainsMsg(result, expected)


assertContainsMsg is a helper method that covers all the ways an application can return a message (returning the message itself, or returning a list of messages that contains it). Also, expected messages are easy to check for because BaseMessage implements value equality. This means...


>>> a = PlayerFetchAnnouncement(1, 'test', 'zerg')
>>> b = PlayerFetchAnnouncement(1, 'test', 'zerg')
>>> c = PlayerFetchAnnouncement(2, 'test2', 'zerg')
>>> a == b
True
>>> a == c
False


Even though a and b are different objects, BaseMessage overrides the equality operators to ensure that messages with the same data compare equal. The main point is to make testing much easier: just construct the message you're expecting, rather than checking all the individual attributes yourself.
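
For completeness, here are minimal sketches of what those two helpers might look like; the real BaseMessage and AppTestBase surely have more to them:


import unittest

class BaseMessage(object):
    def __eq__(self, other):
        # Value equality: same message type and same attribute values.
        return type(self) == type(other) and self.__dict__ == other.__dict__

    def __ne__(self, other):
        return not self.__eq__(other)


class AppTestBase(unittest.TestCase):
    def assertContainsMsg(self, result, expected):
        # A handler may return a single message or a list of messages.
        if isinstance(result, list):
            self.assertTrue(expected in result,
                            '%r not found in %r' % (expected, result))
        else:
            self.assertEquals(result, expected)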

That's a pretty simple overview of what an individual application might look like. Typically, an application is either simple in itself, just storing and sending data, or is a front-end to more advanced functionality, such as this fetch application. It's easy to test, and while it's built with messaging in mind, it has no dependencies on any messaging framework.

Friday, July 10, 2009

Starcraft Professional VOD Search Engine: Introduction

This is an introduction to a series of posts designed to give you a tour of the project I've been working on for the past month or so, "SC Engine".

The entire series is linked below:

- Introduction
- Part One: System Overview
- Part Two: System Overview: Messages and Applications
- Part Three: Screen Scraping
- Part Four: YouTube Parsing
- Part Five: Linking the Video to the Game
- Part Six: Messaging Middleware
- Part Seven: The Console
- Part Eight: The Site


SC Engine is a series of programs and tools that indexes Professional Starcraft videos found on the net into one easy-to-use web site. This first post gives an idea of what the project is expected to do, with the rest of the posts going into the implementation.

For those not aware, Starcraft is a PC game that was created by Blizzard and originally released in 1998. It has withstood the test of time, and in South Korea it has a following as popular as soccer or pro wrestling here in the US, if not greater. Starcraft in South Korea comes complete with its own professional leagues. E-Sports as a whole in South Korea has really taken off, gaining great sponsors and recognition. Starcraft is the game at the forefront of this new form of entertainment, drawing the largest crowds and the best ratings.

Games are often broadcast on TV with commentators, just like you'd see in any televised sporting event. Games from the more popular leagues are often uploaded to sites like YouTube or the league's own website for viewing, and are often referred to as "VODs" ("Videos On Demand").

Some fans, wishing to view the games with commentators speaking their own language, began dubbing their own commentary over the Korean broadcasts. Here's an example of a game in its original version, and here's an example with an English commentator, Klazart.

SC Engine started coming together for a number of different reasons:


  1. I had been out of work since January (although I took until April just to relax, living off of savings) and was unsure of how well I could express my ability as a python/web developer (much of what I did at work was in .NET, and even the python stuff I wrote before being laid off never saw its way into production). Thus, I was looking for a project to show the world in code what I probably could not easily express in words.


  2. I wanted to work on SOMETHING. My job had me starting to burn out on coding, working on some pretty tedious objectives. After my break, I wanted something new and interesting, not the same old "django blog" application. Asynchronous messaging via AMQP seemed very interesting.

  3. I enjoyed watching Professional Starcraft VODs, especially with the English commentators, but found it a pain to find games and then keep track of how all the games I was watching related to each other. A person who only watched videos from one or two commentators could only see a portion of the games, leaving gaps in their knowledge of how the league was progressing. Since all uploaders and commentators do this in their spare time, they can only cover so many videos.

  4. Watching games on Youtube, there was a chance of "spoiling" the results of the game. Games can take anywhere from a few minutes to an hour or more to play, and knowing the length of a game beforehand gives clues as to what will happen as you're watching it. Also, because many games are in a "best of X" series, a commentator might upload only the games played, letting a viewer easily deduce the winner of a match.

    Many uploaders post "anti-spoiler" videos: small videos that make it seem like the next game was played, but that just contain a quick message that the match was already decided. I feel this isn't a great solution: it's more work for the uploaders, and not all uploaders did it anyway.




The result was to create a website that tried to solve these problems.

In its first iteration, the home page shows you all of the leagues, and you select matches in the league to watch. Then, you can watch each game in that match. Because the entire navigation of the site treats games as first-class citizens (as opposed to YouTube, where the videos are the first-class citizens), you have a visual guide to where a game stands in terms of the league (knowing a game took place just before the end of the season, or in the playoffs, or in what part of the playoffs, etc.). When actually choosing the game, you get to pick which uploader's version of the game you want to watch. This lets you view more games than you might otherwise have, had you just watched the videos of one or a few commentators. Also, since the Starcraft community (a few members, specifically) is pretty amazing in its ability to upload nearly every game played in large events, even if you can't find a commentary in your language, you'll probably still have the option to view the original game, albeit with Korean commentary.

Also, I felt it'd be nice to see some other information on the game's page, such as:


  • The names of the players in the game.

  • The race each player plays in the game. In Starcraft, players select one of three possible races, each with its own unique traits.

  • The kespa site occasionally puts the "starting position" of the players on their site, which lets you know where each player will be when you are watching the game.

  • An image of the "map". Each game takes place on a different map, each with different layouts and base locations to add variety to the game. Being able to see a picture of the entire map as you watch a game helps immensely with understanding how the game is developing, especially if you're not familiar with the particular map.



Right now, work is complete on the first iteration. The site still needs a complete redesign (I just hacked some css) and will probably change a ton. You can see what it looks like here:

http://sc.markhildreth.webfactional.com/

There isn't much in terms of data, as I'm still working on the hardware that the engine will run on, but it gives you an idea of the type of things that you might be able to expect.

In any case, that's the background on the project. My remaining posts will dive into the aspects of software implementation.