Scrape Chess Games with R

2021-05-25

Wanting to stay abreast of recent developments in chess opening theory, I was looking for a database of the most recent high-level chess games. ChessBase GmbH offers the ‘Mega Database 2021’, but the price is too high for me. The Week in Chess (TWIC) is a free weekly publication in two parts: a text section and a games section with a round-up of the most important chess games of the previous week. An archive of all editions is made available by the TWIC curator, Mark Crowther.

Downloading, unzipping and combining the 400+ files (going back to 2012) by hand would take a long time. Fortunately, R can automate the downloading, while the unzipping and merging can be done with a few command-line instructions.

Norway Chess 2020 - Firouzja vs. Carlsen

The R code below uses the rvest package to list all the zipped files linked from the archive page, then downloads them.

library(tidyverse)
library(rvest)
library(stringr)

pgn_list <- read_html("https://theweekinchess.com/twic") %>%
  html_nodes("a") %>%  # find all links in the page
  html_attr("href") %>% # get the url of those links
  str_subset("g\\.zip$")  # keep only links ending in "g.zip" (the zipped PGN files)

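# download every file into the working directory under its original name
# (mode = "wb" writes in binary mode so the zips are not corrupted on Windows)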
walk2(pgn_list, 
      basename(pgn_list), 
      download.file, 
      mode = "wb")

In the macOS Terminal, the following commands unzip all the files and then merge them into a single file.

# replace "/Users/olivier/Documents/Vrac/Echecs" 
# with your working directory
cd /Users/olivier/Documents/Vrac/Echecs
find ./ -name \*.zip -exec unzip {} \; 
rm -f *.zip
cat *.pgn > twic_merged.pgn

The resulting file, which combines all the games, can be opened with any chess software (I use HIARCS Chess Explorer). Please note that the final file is hefty, between 1 and 2 GB, and might be too large to handle on your laptop. You can always remove some of the older files before performing the final merge. Enjoy the games!
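If the full merge is too large, one way to keep only the recent issues is sketched below. It is a rough sketch, not a tested recipe: it assumes the unzipped files are named twic<issue number>.pgn (for example twic1300.pgn), and the function name merge_recent_twic and the issue number 1300 are made up for illustration.

```shell
# Merge only the TWIC issues numbered MIN_ISSUE or later into one PGN file.
# Assumption: unzipped files are named twic<issue>.pgn, e.g. twic1300.pgn.
merge_recent_twic() {
  min_issue="$1"   # oldest issue number to keep
  out_file="$2"    # merged output file
  : > "$out_file"  # create the output file, or empty it if it exists
  for f in twic*.pgn; do
    [ -e "$f" ] || continue        # glob matched nothing
    n="${f#twic}"                  # drop the "twic" prefix
    n="${n%.pgn}"                  # drop the ".pgn" extension
    case "$n" in
      ''|*[!0-9]*) continue ;;     # skip names that are not twic<digits>.pgn
    esac
    if [ "$n" -ge "$min_issue" ]; then
      cat "$f" >> "$out_file"
    fi
  done
}

# Hypothetical usage, run from the folder holding the unzipped files:
#   merge_recent_twic 1300 twic_recent.pgn
#   grep -c '^\[Event ' twic_recent.pgn   # rough count of games in the result
```

The game count relies on every PGN game starting with an [Event "..."] tag, so it gives a quick sanity check on the size of the merged file before opening it in a chess program. Files whose names do not follow the assumed pattern, such as twic_merged.pgn, are skipped.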