Sunday, July 12, 2009

SC Engine: Part 4 - YouTube Parsing

- Introduction
- Part One: System Overview
- Part Two: System Overview: Messages and Applications
- Part Three: Screen Scraping
- Part Four: YouTube Parsing
- Part Five: Linking the Video to the Game
- Part Six: Messaging Middleware
- Part Seven: The Console
- Part Eight: The Site

I wasn't the first person to try to do a project like this. There have been others...

Starcraft Gaming, a blog-style list of games that probably uses the YouTube API to pull videos from specific YouTube accounts.
lart.no/sc, which most likely does the same. It seems to be able to determine which league the videos are from. I hear the person who wrote it did so in four hours, just for themselves.
sc.rpls.info, which is also just a list of videos. It can sometimes determine the races being played in each game.
SC2GG Vod Tracker, hosted at sc2gg, a site whose community is largely based on English commentaries of pro games. It has a (sometimes buggy) ability to determine the players and maps and, like some of the other sites, supports some semblance of a search engine and voting on the quality of games. It also has a nice "recently best voted" section, and seems to be one of the few automated solutions that can link together multi-part commentaries (one game spread over multiple YouTube videos that are meant to be watched in order).

None of these solutions has the scope of what I was planning, but even within their narrower scope you can see that they had problems getting information out of the YouTube videos.

First off, they only look at videos from known YouTube accounts. That means they could miss videos posted by anyone not on their list.

Secondly, none of the automated solutions could reliably identify much about the game using just the title and description of the video. To be honest, this is probably an impossible task, as most videos don't provide all the necessary details. Many give the date played, player names, and team names (if in a league with teams). The automated solutions that did try to tease out some of that data seemed to have a list of player, team or map names that they would search the title and description for. This resulted in situations where a player named "Great" would show up for a lot of games if the word "great", used in any context, happened to be in the title or description.
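To illustrate that last failure mode, here is a minimal sketch of a name-list search. It is my guess at the general approach, not the code of any of the sites above; the player list and the video title are made up for the example.

```python
# Hypothetical list of known player names ("great" is the example from the text).
KNOWN_PLAYERS = ["great", "jaedong", "flash"]

def naive_player_search(text):
    """Return every known player name that appears anywhere in the text."""
    lowered = text.lower()
    return [name for name in KNOWN_PLAYERS if name in lowered]

# A made-up title in which "great" is an adjective, not a player.
title = "Jaedong vs Flash - a great game on Destination"
found = naive_player_search(title)
print(found)  # ['great', 'jaedong', 'flash']
```

The search has no way to tell that "great" is just a word here, so the player Great gets attached to a game he never played.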

To be sure, there is no standard format for a title and description of Starcraft videos for all uploaders to use. Even a single commentator might use a different format in some of their own videos.

Luckily, I didn't need to parse all the data directly from the videos. So long as I had a database of the games played, I needed only to parse just enough of it to be able to link it to the correct game. Then I could use all the data from the schedule database.

So, after viewing a bunch of different "formats" used by the commentators, I came to a few realizations:


  • Trying to find data in the description is dangerous, because there could be a ton of stuff in the description that doesn't have anything to do with the game.

  • Finding the date that the game was played goes a LONG way toward making the game easier to find in the database. On the busiest days in Starcraft, you can see around 15 total games played, although 3-8 is more typical. If I can find the date, I might only need one other piece of information (a player name, a game number) to make a link. As such, I wanted to concentrate heavily on making sure that if there's a date in there, I can find it.

  • Using the published date of the video to try to determine the date played of the match is probably a dead end. Although in most cases, videos are posted within a few days of the match, some videos are posted for epic games that happened years ago.

  • The video's title typically had data that was more likely to be useful than the video's description, even if there were fewer characters there. This is kind of obvious: a commentator who wants their audience to find their video easily would do well to put the necessary information in the title.

  • Although every commentator had their own "formats", most of them put something along the lines of X vs Y, or X v Y, or X versus Y in the title. X and Y could be teams, players, or both.



Knowing this, I set out to make my parser. First off, parsing dates. The parser can currently parse dates in any of these formats, and more formats can be added as needed...


date_formats = [
    make_format('%(day)s/%(month)s/%(year)s'),
    make_format('%(month)s/%(day)s/%(year)s'),
    make_format('%(year)s-%(month)s-%(day)s'),
    make_format('%(month)s-%(day)s-%(year)s'),
    make_format('%(day)s %(month)s %(year)s'),
    make_format('%(month)s %(day)s %(year)s'),
    make_format('%(day)s %(month)s, %(year)s'),
    make_format('%(month)s %(day)s, %(year)s'),
    make_format('%(day)s %(month)s , %(year)s'),
    make_format('%(month)s %(day)s , %(year)s'),
    make_format('%(day)s, %(month)s, %(year)s'),
]


Here "day" can be any of 1, 01, or 1st; "month" can be any of 1, 01, Jan, or January; and "year" can be either 09 or 2009. The algorithm looks in both the title and the description, but if it finds a date in the title, it uses that one. In fact, as of now, dates are the only data the parsing algorithm looks for in the description; for everything else it relies on the title.
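For illustration, here is a minimal sketch of how a make_format helper like this could work. It is not the engine's actual code: the field patterns, the find_date and find_date_in_video helpers, and the rule that two-digit years mean 20xx are all assumptions of mine. The idea is to turn each template into a regex with named groups, then keep the first match that is a valid calendar date.

```python
import datetime
import re

# Assumed month names: both "january" and "jan" map to 1, and so on.
MONTHS = ["january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"]
MONTH_LOOKUP = {}
for number, name in enumerate(MONTHS, start=1):
    MONTH_LOOKUP[name] = number
    MONTH_LOOKUP[name[:3]] = number
MONTH_WORDS = "|".join(sorted(MONTH_LOOKUP, key=len, reverse=True))

# Assumed regex fragment for each template field.
FIELDS = {
    "day": r"(?P<day>\d{1,2})(?:st|nd|rd|th)?",
    "month": r"(?P<month>\d{1,2}|" + MONTH_WORDS + r")",
    "year": r"(?P<year>\d{4}|\d{2})",
}

def make_format(template):
    """Compile a template like '%(month)s/%(day)s/%(year)s' into a regex."""
    # re.split with a capturing group alternates literal text and field names.
    pieces = re.split(r"%\((day|month|year)\)s", template)
    pattern = ""
    for index, piece in enumerate(pieces):
        pattern += FIELDS[piece] if index % 2 == 1 else re.escape(piece)
    # Refuse to start or end in the middle of a word or a longer number.
    return re.compile(r"(?<!\w)" + pattern + r"(?!\w)", re.IGNORECASE)

def find_date(text, formats):
    """Return the first valid date matched by any format, or None."""
    for fmt in formats:
        for match in fmt.finditer(text):
            month = match.group("month").lower()
            month = int(month) if month.isdigit() else MONTH_LOOKUP[month]
            year = int(match.group("year"))
            if year < 100:
                year += 2000  # assumption: two-digit years are 20xx
            try:
                return datetime.date(year, month, int(match.group("day")))
            except ValueError:
                continue  # e.g. month 17: not a real date, keep looking
    return None

def find_date_in_video(title, description, formats):
    """Prefer a date from the title; fall back to the description."""
    return find_date(title, formats) or find_date(description, formats)

formats = [
    make_format('%(day)s/%(month)s/%(year)s'),
    make_format('%(month)s/%(day)s/%(year)s'),
    make_format('%(year)s-%(month)s-%(day)s'),
    make_format('%(day)s %(month)s, %(year)s'),
]
print(find_date('MBC v Woonjin: Light v Zero (P2/2)[Single] 5/17/09', formats))
# 2009-05-17
```

In this sketch the order of the list decides ambiguous cases: 5/17/09 fails as day/month (there is no month 17) and falls through to month/day, but 5/6/09 would be read as 5 June 2009 because the day-first format is tried first.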

Next is "versus", which is so commonplace I basically rely on it to determine the participants in the match, rather than try to do a full text search for all known players and teams. The regex looks like this:


([a-z0-9\[\]\.\-_\)]+)\s+(?:v|v\.|vs|vs\.|versus)\s+([a-z0-9\[\]\.\-_\)]+)


As you can see, I can support...


  • X v Y

  • X v. Y

  • X vs Y

  • X vs. Y

  • X versus Y



...as long as there is whitespace between the versus phrase and the participants, since the regex requires \s+ on both sides. One problem with this method is that if there is a space in a name (typical for teams), you lose part of the name. However, just one word of the name is often enough to identify a team.
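To make the behavior concrete, here is a minimal sketch that applies the regex above with re.findall. The function name find_participants and the lowercasing step are my assumptions (the pattern only lists lowercase letters, and the tests expect lowercase names); the "Team Alpha vs Team Beta" title is made up.

```python
import re

# The "versus" pattern from the text, unchanged.
VERSUS = re.compile(
    r"([a-z0-9\[\]\.\-_\)]+)\s+(?:v|v\.|vs|vs\.|versus)\s+([a-z0-9\[\]\.\-_\)]+)"
)

def find_participants(title):
    """Return the names on either side of every 'versus' phrase in the title."""
    participants = []
    # Lowercase first, because the character classes only cover a-z.
    for left, right in VERSUS.findall(title.lower()):
        participants.extend([left, right])
    return participants

print(find_participants('MBC v Woonjin: Light v Zero (P2/2)[Single] 5/17/09'))
# ['mbc', 'woonjin', 'light', 'zero']

# Names with spaces keep only the word next to the versus phrase.
print(find_participants('Team Alpha vs Team Beta'))
# ['alpha', 'team']
```

The second call shows the limitation with multi-word names: only the word touching the versus phrase on each side survives.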

Much of the rest is just regular expressions as well, including the one that looks for the video's part number in multi-part videos. Typically, if a game takes longer than 10 minutes (the approximate limit for most YouTube videos), the uploader splits it into two or more videos and indicates in some way that this is part x of the series...


  • Part 1

  • Part 1 of 2

  • (1/2)

  • P1 of 2

  • P1
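As a rough sketch, one regex could cover all of the part formats listed above. The pattern and the find_part helper are assumptions of mine, not the engine's actual code; it returns the part number and, when one is given, the total number of parts.

```python
import re

PART = re.compile(
    r"""
    \b(?:part|p)\s*(?P<part>\d+)                  # Part 1, P1
    (?:\s*(?:of|/)\s*(?P<total>\d+))?             # optional: of 2, /2
    |
    \((?P<part2>\d+)\s*/\s*(?P<total2>\d+)\)      # (1/2)
    """,
    re.IGNORECASE | re.VERBOSE,
)

def find_part(title):
    """Return (part, total) for a multi-part title; total may be None."""
    match = PART.search(title)
    if match is None:
        return None
    part = match.group("part") or match.group("part2")
    total = match.group("total") or match.group("total2")
    return (int(part), int(total) if total else None)

print(find_part("Part 1 of 2"))  # (1, 2)
print(find_part("P1"))           # (1, None)
print(find_part('MBC v Woonjin: Light v Zero (P2/2)[Single] 5/17/09'))  # (2, 2)
```

Requiring parentheses around the bare "1/2" form is a deliberate choice in this sketch, so that a date such as 5/17/09 is not mistaken for a part number.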



Tests for this parser typically use actual titles and descriptions that I've found online...


def test_participants_from_title(self):
    r1 = test_data(title="Bacchus OSL Ro36: Gogo v Luxury Set 1")
    self.assertTrue('gogo' in r1['participants'])
    self.assertTrue('luxury' in r1['participants'])

    r2 = test_data(title="Bacchus OSL Ro36: Gogo vs Luxury Set 1")
    self.assertTrue('gogo' in r2['participants'])
    self.assertTrue('luxury' in r2['participants'])

    r3 = test_data(title="Bacchus OSL Ro36: Gogo v. Luxury Set 1")
    self.assertTrue('gogo' in r3['participants'])
    self.assertTrue('luxury' in r3['participants'])

    r4 = test_data(title="Bacchus OSL Ro36: Gogo vs. Luxury Set 1")
    self.assertTrue('gogo' in r4['participants'])
    self.assertTrue('luxury' in r4['participants'])

    r5 = test_data(title="Bacchus OSL Ro36: Gogo versus Luxury Set 1")
    self.assertTrue('gogo' in r5['participants'])
    self.assertTrue('luxury' in r5['participants'])

    r6 = test_data(title='MBC v Woonjin: Light v Zero (P2/2)[Single] 5/17/09')
    self.assertTrue('mbc' in r6['participants'])
    self.assertTrue('woonjin' in r6['participants'])
    self.assertTrue('light' in r6['participants'])
    self.assertTrue('zero' in r6['participants'])

    r7 = test_data(title='InteR.Mind vs Siz)KaL [15 Apil, 2009] 1set')
    self.assertTrue('inter.mind' in r7['participants'])
    self.assertTrue('siz)kal' in r7['participants'])

    r8 = test_data(title='type-b vs Saint[z-zone]')
    self.assertTrue('type-b' in r8['participants'])
    self.assertTrue('saint[z-zone]' in r8['participants'])



Right now, the parser is not able to distinguish teams from players. Just having the participant list is fine though, as determining which participants are teams and which are players will happen somewhere else. There's no need to pull in a ton of data and make the YouTube parser any more complex than it needs to be.

So, in the grand scheme of things, the YouTube parser receives a message saying that a YouTubeVideo was found, which includes the video id, title and description, and tries to get as much info as possible from it. It then sends out a message with the info it found for someone else to chew on for a while.
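The shape of that exchange might look something like the sketch below. Everything in it is hypothetical: the message field names, the "YouTubeVideoParsed" type, the made-up video id, and the stub extractors that stand in for the real date and participant parsers.

```python
def handle_video_found(message, extractors):
    """Build an outgoing message from an incoming 'video found' message."""
    info = {}
    for field, extract in extractors.items():
        value = extract(message["title"], message["description"])
        if value:  # leave out anything the parser could not find
            info[field] = value
    return {
        "type": "YouTubeVideoParsed",  # hypothetical outgoing message type
        "video_id": message["video_id"],
        "info": info,
    }

# Stub extractors standing in for the real parsers.
def stub_participants(title, description):
    return ["gogo", "luxury"]

def stub_date(title, description):
    return None  # pretend no date was found

incoming = {
    "type": "YouTubeVideo",
    "video_id": "abc123",  # made-up id
    "title": "Bacchus OSL Ro36: Gogo v Luxury Set 1",
    "description": "",
}
outgoing = handle_video_found(
    incoming, {"participants": stub_participants, "date": stub_date}
)
print(outgoing["info"])  # {'participants': ['gogo', 'luxury']}
```

Fields the parser could not find are simply left out, so whoever receives the message only sees what was actually extracted.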
