Scrape Chess Games with R
Wanting to stay abreast of recent developments in chess opening theory, I was looking for a database of the most recent high-level chess games. ChessBase GmbH offers the ‘Mega Database 2021’, but its price is too high for me. The Week in Chess (TWIC) is a free weekly publication in two parts: a text section and a games section with a round-up of the most important games of the previous week. An archive of all editions is made available by the TWIC curator, Mark Crowther.
Downloading, unzipping and combining the 400+ files (going back to 2012) by hand would take a long time. Fortunately, R can automate the downloading, while the unzipping and merging can be done with a few command-line instructions.
The R code below uses the rvest package to list all the zipped files found on the archive page, then downloads them.
library(tidyverse)
library(rvest)
library(stringr)
pgn_list <- read_html("https://theweekinchess.com/twic") %>%
  html_nodes("a") %>%       # find all links in the page
  html_attr("href") %>%     # get the URL of each link
  str_subset("g\\.zip$")    # keep only links ending in "g.zip" (zipped PGN files)
# download every file into the working directory,
# naming each one after the last part of its URL
walk2(pgn_list,
      basename(pgn_list),
      download.file,
      mode = "wb")
In the macOS Terminal, the following commands unzip all the files and then merge them into a single file.
# replace "/Users/olivier/Documents/Vrac/Echecs"
# with your working directory
cd /Users/olivier/Documents/Vrac/Echecs
find ./ -name \*.zip -exec unzip {} \;
rm -f *.zip
cat *.pgn > twic_merged.pgn
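One detail of the merge step: `cat *.pgn` concatenates files in the shell's alphabetical order, so a name like twic1000.pgn would be merged before twic920.pgn. If you would rather keep the games in issue order, here is a minimal sketch. It assumes the unzipped files are named twic<number>.pgn (an assumption about the archive, so adjust the pattern if yours differ), and `merge_twic_in_order` is a hypothetical helper name, not part of any tool.

```shell
# Sketch: merge TWIC PGN files in numeric issue order.
# Assumption: files are named twic<number>.pgn, e.g. twic920.pgn.
merge_twic_in_order() {
    out="$1"
    # Start from an empty output file so a re-run does not duplicate games.
    : > "$out"
    # With "c" as the field separator, field 2 of "twic920.pgn" is "920.pgn";
    # the numeric sort compares the leading number of that field.
    printf '%s\n' twic*.pgn | sort -t c -k 2,2n | while IFS= read -r f; do
        [ -f "$f" ] || continue           # skip an unmatched glob pattern
        [ "$f" = "$out" ] && continue     # never merge the output into itself
        cat "$f" >> "$out"
    done
}
```

After the unzip step, running `merge_twic_in_order twic_merged.pgn` in the working directory would take the place of the `cat *.pgn > twic_merged.pgn` line.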
The resulting file, which combines all the games, can be opened with any chess software (I use HIARCS Chess Explorer). Please note that the final file is hefty, between 1 and 2 GB, and might be too large for your laptop to handle. You can always remove some of the older files before performing the final merge. Enjoy the games!
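If the merged file does turn out to be too large, one way to leave out older issues without deleting anything is to merge only files at or above a chosen issue number. This is a rough sketch under the same naming assumption (twic<number>.pgn). `merge_twic_since` is a hypothetical helper, and the cutoff 1300 in the usage line is only an illustrative value, not a recommendation.

```shell
# Sketch: merge only the issues numbered at or above a cutoff.
# Assumption: files are named twic<number>.pgn, e.g. twic1300.pgn.
merge_twic_since() {
    min_issue="$1"
    out="$2"
    : > "$out"                            # start with an empty output file
    for f in twic*.pgn; do
        [ -f "$f" ] || continue           # skip an unmatched glob pattern
        # Strip the "twic" prefix and ".pgn" suffix to get the issue number.
        n="${f#twic}"
        n="${n%.pgn}"
        # Ignore names whose middle part is not purely digits
        # (this also skips merged files such as twic_merged.pgn).
        case "$n" in
            ''|*[!0-9]*) continue ;;
        esac
        if [ "$n" -ge "$min_issue" ]; then
            cat "$f" >> "$out"
        fi
    done
}
```

For example, `merge_twic_since 1300 twic_recent.pgn` would write only issues 1300 and later to twic_recent.pgn, leaving the individual files in place so you can try a different cutoff later.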