A little while ago, I released ReChat for Twitch, a browser extension that adds a nifty feature to Twitch, the video game streaming platform that Amazon acquired last year for $970 million.
Twitch not only lets users watch live video streams but also records them so that they can be watched later. The one thing you miss out on when you watch a recorded stream is the chat. This is where ReChat comes into play: it lets you see the recorded chat messages as if the stream were live.
Indexing chat messages
Since Twitch either does not record chat messages or at least does not make them available through its API, I had to build my own indexing system.
Luckily, Twitch’s custom-built chat server not only comes with an HTTP front-end but is also reachable via IRC. In contrast to Twitch’s proprietary HTTP chat interface, IRC is a well-documented and quite simple standard (RFC 1459), which made it the natural choice for connecting to Twitch chat.
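To give an idea of how simple the protocol is, here is a minimal sketch of parsing a single chat line as it arrives over IRC. It is an illustration, not ReChat's actual code: the names ChatLine and parse_privmsg are made up for this example, and the parser only handles the common ":nick!user@host PRIVMSG #channel :text" form from RFC 1459, ignoring everything else.

```python
from typing import NamedTuple, Optional


class ChatLine(NamedTuple):
    sender: str   # nickname taken from the message prefix
    channel: str  # IRC channel, including the leading '#'
    text: str     # the chat message itself


def parse_privmsg(raw: str) -> Optional[ChatLine]:
    """Parse one raw IRC line; return None unless it is a channel PRIVMSG.

    Simplification (assumption): the message text must be sent as a
    trailing parameter introduced by " :", which is the usual case.
    """
    line = raw.rstrip("\r\n")
    if not line.startswith(":"):
        return None  # no prefix, e.g. a server "PING :..." line
    prefix, _, rest = line[1:].partition(" ")
    command, _, params = rest.partition(" ")
    if command != "PRIVMSG":
        return None
    target, sep, text = params.partition(" :")
    if not sep or not target.startswith("#"):
        return None  # not addressed to a channel, or no trailing text
    sender = prefix.split("!", 1)[0]
    return ChatLine(sender, target, text)
```

Anything that is not a chat message (server pings, joins, parts) comes back as None, so an indexer can simply skip it.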
For the storage backend, I chose Elasticsearch. Not only does Elasticsearch do a good job of indexing hundreds of documents per second, it is also easy to scale and blazingly fast at finding documents. Every chat message is represented by a JSON document consisting of the actual message, the sender, the chat room, and a timestamp.
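Such a document could look roughly like the one built below. The four fields are the ones listed above, but the exact field names ("message", "sender", "room", "timestamp") and the ISO 8601 UTC timestamp format are assumptions made for this sketch, as is the helper name build_document.

```python
import json
from datetime import datetime, timezone


def build_document(sender: str, room: str, message: str,
                   received_at: datetime) -> dict:
    """Build the JSON-serializable document for one chat message.

    Field names are hypothetical; the timestamp is normalized to UTC
    so that messages can later be matched against video times.
    """
    return {
        "message": message,
        "sender": sender,
        "room": room,
        "timestamp": received_at.astimezone(timezone.utc).isoformat(),
    }


if __name__ == "__main__":
    doc = build_document("alice", "#somechannel", "hello world",
                         datetime.now(timezone.utc))
    # This JSON string is what would be sent to the index.
    print(json.dumps(doc, indent=2))
```

Storing the time of receipt in UTC keeps later range queries independent of the indexer's local time zone.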
All of the browser extension’s source code is available on GitHub.
Based on the unique Twitch video ID, the web application fetches the video's time information via the Twitch API, searches the Elasticsearch index for the matching chat messages, and serves them in paginated chunks.
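The lookup step can be sketched as a function that turns a video's recording time and length into an Elasticsearch search body. This is only an illustration under several assumptions: the field names "room" and "timestamp" are hypothetical, "room" is assumed to be indexed as an exact-match (not analyzed) field, and the body uses the bool/filter form of the query DSL, which may differ from what the web application actually sends.

```python
from datetime import datetime, timedelta, timezone


def build_search_body(room: str, recorded_at: datetime, length_seconds: int,
                      page: int = 0, page_size: int = 100) -> dict:
    """Build a search body for one page of a video's chat messages."""
    start = recorded_at.astimezone(timezone.utc)
    end = start + timedelta(seconds=length_seconds)
    return {
        "query": {
            "bool": {
                "filter": [
                    # Only messages from the channel the video belongs to.
                    {"term": {"room": room}},
                    # Only messages sent while the video was being recorded.
                    {"range": {"timestamp": {
                        "gte": start.isoformat(),
                        "lt": end.isoformat(),
                    }}},
                ]
            }
        },
        # Chronological order, so pages line up with playback.
        "sort": [{"timestamp": {"order": "asc"}}],
        # Offset-based pagination: skip the previous pages.
        "from": page * page_size,
        "size": page_size,
    }
```

For example, build_search_body("#somechannel", recorded_at, 3600, page=2, page_size=50) asks for messages 100 to 149 of a one-hour recording. Offset pagination like this is simple, but Elasticsearch limits how deep "from" plus "size" can reach by default, so very long chats might be better served by paging on the timestamp itself.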
Elasticsearch has some nifty aggregation features for analyzing the available data. Maybe some sort of statistics page would be a nice addition? It could, for example, feature a visual representation of the most frequent words (with some common English words filtered out):