Saturday, July 11, 2009

SC Engine: Part 1 - System Overview

- Introduction
- Part One: System Overview
- Part Two: System Overview: Messages and Applications
- Part Three: Screen Scraping
- Part Four: YouTube Parsing
- Part Five: Linking the Video to the Game
- Part Six: Messaging Middleware
- Part Seven: The Console
- Part Eight: The Site

This first part will give you an overview of the architecture of SC Engine.

First off, for those interested, a snapshot of the code is available online on Google Code. The point is to let people browse through it; I have no plans to update that repository regularly (if at all). Also, please note that this is by no means a finished product, and it probably has plenty of bugs...

Google Code Repository

Although the main goal is to get YouTube videos, sorting and organizing them by game means I need a database of the games. I found one on the website of the Korea e-Sports Association, or KeSPA. It has data on games played, as well as player and map info. While this site is being scraped for data, YouTube is being searched, and if a video is found that looks like it covers a game we know about, a "link" is made between the two. All the data needed by the final web site is uploaded there as it comes rolling in.

There's a lot going on, and I felt this app was a good opportunity to try out a message bus. The idea is that, rather than writing a monolithic application where one part directly calls the next, you write a bunch of small apps that each do one simple job and send out a message stating the results. Any application can listen for these messages and do its own computation, sending out its own results in turn. The small applications don't know about each other, only about the messages themselves.
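As a rough illustration of the idea, here is a minimal in-process sketch in Python. It is not SC Engine's actual code or middleware: `MessageBus`, `Ping`, `Pong`, and the two "apps" are hypothetical names made up for this example, and a real bus would carry messages between separate processes rather than call handlers directly.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, DefaultDict, List


class MessageBus:
    """Toy in-process bus (an assumption for illustration, not SC Engine's middleware)."""

    def __init__(self) -> None:
        # Maps a message class to the handlers interested in it.
        self._handlers: DefaultDict[type, List[Callable]] = defaultdict(list)

    def subscribe(self, message_type: type, handler: Callable) -> None:
        self._handlers[message_type].append(handler)

    def publish(self, message: object) -> None:
        # Deliver to every handler registered for this exact message type.
        for handler in list(self._handlers[type(message)]):
            handler(message)


@dataclass(frozen=True)
class Ping:  # hypothetical message
    text: str


@dataclass(frozen=True)
class Pong:  # hypothetical message
    text: str


def start_responder(bus: MessageBus) -> None:
    """A small 'app': listens for Ping and announces its result as Pong."""

    def on_ping(message: Ping) -> None:
        bus.publish(Pong(text=message.text.upper()))

    bus.subscribe(Ping, on_ping)


def start_logger(bus: MessageBus, log: List[str]) -> None:
    """Another 'app': records Pong results; it knows nothing about the responder."""
    bus.subscribe(Pong, lambda message: log.append(message.text))


bus = MessageBus()
log: List[str] = []
start_responder(bus)
start_logger(bus, log)
bus.publish(Ping(text="hello"))
print(log)
```

Neither function refers to the other; the only things they share are the bus and the message classes.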

So, using a message bus, a workflow for fetching and storing the schedule might look like this:


  • Every once in a while, a ScheduleFetchRequested message is sent out.

  • The schedule fetcher application listens for the ScheduleFetchRequested message and, upon receiving it, fetches the schedule for the date given in the message. This involves running code to go to the website and do some scraping. It then sends out a ScheduleFetchAnnouncement message with what it found. Some things it can determine immediately (player ids, map ids), whereas others it might not be able to (team names, and the league and stage the game is played in).

  • Another application gets this ScheduleFetchAnnouncement, and will do lookups on the information we couldn't find directly from the web page (such as the team ids or league name). It gathers the results and spits out a ScheduleParseAnnouncement.

  • Another application that actually stores the schedules onto disk receives the ScheduleParseAnnouncement message, and goes ahead and saves them. It doesn't need to send out any message.
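The workflow above can be sketched roughly as follows. The message names come from the steps above, but everything else is an assumption made for illustration: the message fields, the toy `MessageBus`, the `start_*` functions, and the `fake_scrape` stub with its made-up sample data (the real fetcher scrapes the KeSPA site, and the real store writes to disk).

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


class MessageBus:
    """Toy in-process stand-in for the real messaging middleware."""

    def __init__(self) -> None:
        self._handlers = defaultdict(list)

    def subscribe(self, message_type: type, handler: Callable) -> None:
        self._handlers[message_type].append(handler)

    def publish(self, message: object) -> None:
        for handler in list(self._handlers[type(message)]):
            handler(message)


# Message names are from the post; their fields are assumptions.
@dataclass
class ScheduleFetchRequested:
    date: str


@dataclass
class ScheduleFetchAnnouncement:
    date: str
    games: List[dict]  # ids already known, team names still raw text


@dataclass
class ScheduleParseAnnouncement:
    date: str
    games: List[dict]  # team names resolved to ids


def start_fetcher(bus: MessageBus, scrape: Callable[[str], List[dict]]) -> None:
    def on_request(message: ScheduleFetchRequested) -> None:
        # `scrape` stands in for the real screen-scraping code.
        bus.publish(ScheduleFetchAnnouncement(message.date, scrape(message.date)))

    bus.subscribe(ScheduleFetchRequested, on_request)


def start_parser(bus: MessageBus, team_ids: Dict[str, int]) -> None:
    def on_fetch(message: ScheduleFetchAnnouncement) -> None:
        # Look up what the web page could not tell us directly.
        parsed = [
            {**game, "team_ids": [team_ids.get(name) for name in game["team_names"]]}
            for game in message.games
        ]
        bus.publish(ScheduleParseAnnouncement(message.date, parsed))

    bus.subscribe(ScheduleFetchAnnouncement, on_fetch)


def start_store(bus: MessageBus, saved: Dict[str, List[dict]]) -> None:
    def on_parse(message: ScheduleParseAnnouncement) -> None:
        # A dict stands in for the on-disk store; no further message is sent.
        saved[message.date] = message.games

    bus.subscribe(ScheduleParseAnnouncement, on_parse)


def fake_scrape(date: str) -> List[dict]:
    # Made-up sample data, not real KeSPA results.
    return [{"player_ids": [1, 2], "map_id": 7, "team_names": ["Team A", "Team B"]}]


bus = MessageBus()
saved: Dict[str, List[dict]] = {}
start_fetcher(bus, fake_scrape)
start_parser(bus, {"Team A": 10, "Team B": 11})
start_store(bus, saved)
bus.publish(ScheduleFetchRequested(date="2009-07-11"))
print(saved["2009-07-11"])
```

Publishing the single request message is enough to drive all three steps, even though none of the three functions calls another.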



So why three apps? Why not have all this done in one app? Or the fetch and parsing in one app, and the storing in another?


  1. Although the merits are debatable, each application would be more complex if we collapsed the three into two or one. The trade-off is that with three apps, the system as a whole is more complex.

  2. More importantly, by separating the steps by messages, you can later make other applications that can get at the data in the middle of the flow. For example, we can create an UnknownData application that also listens for the ScheduleParseAnnouncement messages, and if it sees any players or maps that we've never heard of before, can send out appropriate messages to try to fetch information on those players and maps. Adding this new functionality can be done without touching existing components.
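A minimal sketch of such an UnknownData listener might look like the following. Only ScheduleParseAnnouncement is named in the post; `PlayerFetchRequested`, `MapFetchRequested`, the message fields, and the toy `MessageBus` are hypothetical stand-ins, and the sample ids are made up.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, List, Set


class MessageBus:
    """Toy in-process stand-in for the real messaging middleware."""

    def __init__(self) -> None:
        self._handlers = defaultdict(list)

    def subscribe(self, message_type: type, handler: Callable) -> None:
        self._handlers[message_type].append(handler)

    def publish(self, message: object) -> None:
        for handler in list(self._handlers[type(message)]):
            handler(message)


@dataclass
class ScheduleParseAnnouncement:
    date: str
    games: List[dict]  # assumed fields: "player_ids" and "map_id"


@dataclass(frozen=True)
class PlayerFetchRequested:  # hypothetical message name
    player_id: int


@dataclass(frozen=True)
class MapFetchRequested:  # hypothetical message name
    map_id: int


def start_unknown_data(bus: MessageBus, known_players: Set[int], known_maps: Set[int]) -> None:
    # Remember what was already asked for, so each id is requested only once.
    requested_players: Set[int] = set()
    requested_maps: Set[int] = set()

    def on_parse(message: ScheduleParseAnnouncement) -> None:
        for game in message.games:
            for player_id in game["player_ids"]:
                if player_id not in known_players and player_id not in requested_players:
                    requested_players.add(player_id)
                    bus.publish(PlayerFetchRequested(player_id))
            map_id = game["map_id"]
            if map_id not in known_maps and map_id not in requested_maps:
                requested_maps.add(map_id)
                bus.publish(MapFetchRequested(map_id))

    # Subscribing is the only coupling to the existing schedule apps.
    bus.subscribe(ScheduleParseAnnouncement, on_parse)


bus = MessageBus()
requests: list = []
bus.subscribe(PlayerFetchRequested, requests.append)
bus.subscribe(MapFetchRequested, requests.append)
start_unknown_data(bus, known_players={1}, known_maps={7})
bus.publish(ScheduleParseAnnouncement("2009-07-11", [
    {"player_ids": [1, 2], "map_id": 7},
    {"player_ids": [2, 3], "map_id": 8},
]))
print(requests)
```

The fetcher, parser, and store never appear in this code, which is the point: the new behavior hangs off a message that was already being sent.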



Of course, you need to use your own discretion about how finely you separate your apps. There really can't be a hard rule; only experience and preference help here.

Now, let's look at what happens regarding YouTube videos:


  • Every once in a while, a YouTubeFetchRequested message is sent out.

  • An application hears this message and runs a query on YouTube. Each resulting video's id, title, description, and author are sent in a YouTubeFetchAnnouncement message.

  • Another application listens for these YouTubeFetchAnnouncement messages. For each one, it runs the title and description through a parser to grab as much information as possible: team names, player names, game number (game X of a set of 5 games), date played, etc. Sometimes the parse doesn't find enough data to even assume that the video is a StarCraft commentary. In that case, the application doesn't send out a message about it. However, let's assume it gets at least SOME data from the parse. Whatever data it does find goes into a YouTubeParseAnnouncement message.

  • The YouTubeParseAnnouncement message is sent to the application we talked about earlier that stores the scheduling data. That application runs the YouTube information against the stored schedules and looks for possible links. If it finds exactly one match, it sends out a VideoGameLinkAnnouncement. If it finds no match, or more than one possible match, it instead sends out a YouTubePartialParseAnnouncement.
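The last step's decision (one match means a link, anything else means a partial parse) could be sketched like this. The matching rule shown here, comparing player names and an optional date, is only an assumption for illustration and not the engine's actual linking logic. The `Game` shape, the message fields, and the sample data are likewise made up, and the partial-parse message is spelled `YouTubePartialParseAnnouncement` here. In the real system the returned message would be published on the bus.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Tuple, Union


@dataclass(frozen=True)
class Game:  # hypothetical shape of a stored schedule entry
    game_id: int
    date: str
    players: FrozenSet[str]


@dataclass(frozen=True)
class YouTubeParseAnnouncement:  # fields are assumptions
    video_id: str
    players: FrozenSet[str]
    date: Optional[str] = None


@dataclass(frozen=True)
class VideoGameLinkAnnouncement:
    video_id: str
    game_id: int


@dataclass(frozen=True)
class YouTubePartialParseAnnouncement:
    video_id: str
    candidate_game_ids: Tuple[int, ...]


def link_video(
    parse: YouTubeParseAnnouncement, schedule: List[Game]
) -> Union[VideoGameLinkAnnouncement, YouTubePartialParseAnnouncement]:
    # A game is a candidate if every parsed player took part in it and,
    # when the parse found a date, the dates agree.
    candidates = [
        game
        for game in schedule
        if parse.players <= game.players
        and (parse.date is None or parse.date == game.date)
    ]
    if len(candidates) == 1:
        return VideoGameLinkAnnouncement(parse.video_id, candidates[0].game_id)
    # Zero or several candidates: leave it for manual review.
    return YouTubePartialParseAnnouncement(
        parse.video_id, tuple(game.game_id for game in candidates)
    )


# Made-up sample schedule.
schedule = [
    Game(1, "2009-07-10", frozenset({"Player X", "Player Y"})),
    Game(2, "2009-07-11", frozenset({"Player X", "Player Y"})),
    Game(3, "2009-07-11", frozenset({"Player X", "Player Z"})),
]

exact = link_video(
    YouTubeParseAnnouncement("vid-a", frozenset({"Player X", "Player Y"}), "2009-07-11"),
    schedule,
)
ambiguous = link_video(
    YouTubeParseAnnouncement("vid-b", frozenset({"Player X"}), "2009-07-11"),
    schedule,
)
print(exact)
print(ambiguous)
```

Carrying the candidate ids in the partial-parse message is a design guess: it would give whoever reviews the message a short list to choose from instead of starting over.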



From here, the VideoGameLinkAnnouncement messages eventually go to the web site, whereas the "partial parses", as I call them, are collected by another application so I can sort through them manually and see what's up. Going through these results by hand, I can come up with changes to the YouTube parsing code, or add data to help the search engine in future tries. I then have the option to retry the parse, or to send out the VideoGameLinkAnnouncement myself with the correct data.

There are other apps and messages that I haven't spoken of, such as fetching and updating data on players and maps, the applications that trigger periodic fetches, and what happens to all those "partial parse" messages. But I think this gives a good idea of how things would tend to work.

So, that's the overview of the system. For the most part, it's a bunch of small apps that share data by passing messages. There is no real central database, as each app is responsible for saving its data in whatever way suits it.

It's a completely different style from the typical system where all the data is in a central database and all parts of the app work on the same data. I've read about this style of programming on blogs like those of Udi Dahan and Ayende. It's my first time using such a system, and I'm interested in learning about its strengths and pitfalls.
