I have recently started learning R with the help of the Data Science Program by HarvardX. I think it is a great program, and it has been very helpful to me.
While looking for exercises to practice on my own, I searched for chess-related R packages that could be of use. I found bigchess, have been playing with it since yesterday, and really like it.
In this post I am going to describe how to create a data frame of all your chess games in R, and then use it for analysis.
Install bigchess
Installing bigchess is as easy as installing any other R package. You need to run the following two commands:
install.packages("bigchess")
library("bigchess")
bigchess has the ability to read through a PGN of all your games, and create a data frame out of it. This is really impressive, and I was quite surprised to find that such a package existed.
If you play online on Lichess then you can retrieve a PGN that contains all your games by using the export option on your profile.
I currently have a database of 900 or so games, and I exported all of them. I saved this file in my user (home) folder, which is where R looked for it on my machine. I am sure this is not the only way and that you can specify a path instead, but since I am still learning, this is the only way I know. Rather annoyingly, the home folder is hidden in Finder on a Mac, and you have to press Cmd-Shift-H to see it.
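For what it is worth, the file does not have to live in the home folder. R resolves a bare file name against its working directory, which in RStudio often starts out as the home folder. Here is a minimal sketch in base R of two ways around that; the Downloads location is only an assumption about where the export was saved.

```r
# Where does R look for bare file names? This prints the working directory.
getwd()

# Option 1: build a full path to the export.
# The Downloads folder is an assumption; adjust it to wherever the file is.
pgn_path <- file.path(path.expand("~"), "Downloads",
                      "lichess_manshuv_2019-04-14.pgn")

# Check that the file is really there before trying to read it.
file.exists(pgn_path)

# Option 2: change the working directory instead, then use the bare file name.
# setwd(file.path(path.expand("~"), "Downloads"))
```

The resulting pgn_path can then be passed to read.pgn() in place of the bare file name.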
Create a data frame from your PGN
Now that you have a database of all your games, the next step is to create a data frame of these games in R.
By default, bigchess only parses the seven standard PGN tags (Event, Site, Date, Round, White, Black and Result), so when you import the games with the default command it skips over the Elo ratings of both players. These are quite important to me, and in order to retain them you have to add them to the command while importing the games.
Here is the code for that:
tal <- read.pgn("lichess_manshuv_2019-04-14.pgn", add.tags = c("WhiteElo", "BlackElo"))
In the statement above, tal is the name of my data frame, and lichess_manshuv_2019-04-14.pgn is the name of the file with all my games. Since the default call without add.tags only captures the standard tags, I added WhiteElo and BlackElo to also record the players' Elo ratings. This command gives a warning, but imports all the games. To be honest, I am not sure what the coercion warning means here, because I expected every game to have Elo ratings for both players.
This is how the statement executed in RStudio:
> tal <- read.pgn("lichess_manshuv_2019-04-14.pgn", add.tags = c("WhiteElo", "BlackElo"))
2019-04-14 18:15:04, successfully imported 901 games
2019-04-14 18:15:04, N moves computed
2019-04-14 18:15:04, extract moves done
2019-04-14 18:15:05, stat moves computed
Warning messages:
1: In read.pgn("lichess_manshuv_2019-04-14.pgn", add.tags = c("WhiteElo", :
  NAs introduced by coercion
2: In read.pgn("lichess_manshuv_2019-04-14.pgn", add.tags = c("WhiteElo", :
  NAs introduced by coercion
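A likely explanation for the warning, which I have not confirmed against the bigchess source: the Elo tags are read as text and then converted to numbers, and Lichess writes "?" instead of a rating when a player has none (an anonymous opponent, for example). Anything that is not a number becomes NA during that conversion. The sketch below reproduces the effect in base R with made-up values.

```r
# Made-up Elo tag values; "?" stands in for a player without a rating.
elo_text <- c("1500", "1623", "?")

# as.numeric() turns "?" into NA and would warn "NAs introduced by coercion".
# suppressWarnings() hides that warning so the example runs quietly.
elo <- suppressWarnings(as.numeric(elo_text))
elo

# Count how many ratings ended up missing.
sum(is.na(elo))
```

On the real data, sum(is.na(tal$WhiteElo)) and sum(is.na(tal$BlackElo)) should show how many games are affected.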
Now I want to check that these games were loaded correctly. For that you can use the very helpful View() command in RStudio, which shows the data in a tabular format.
The command is:
View(tal)
Personally, I am only interested in the analysis of my classical games on Lichess, so I filtered my data frame on that, but you don't need to do that. Here is the command; filter() comes from the dplyr package, so load that first:
library(dplyr)
tal <- filter(tal, Event == "Rated Classical game")
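Since the filter only matches the exact Event string, it can help to list the values that actually occur before filtering. Here is a small sketch with a made-up data frame standing in for tal; the Blitz label and the opponent names are assumptions for illustration.

```r
# Toy stand-in for the imported data frame; the rows are made up.
games <- data.frame(
  Event = c("Rated Classical game", "Rated Blitz game", "Rated Classical game"),
  White = c("manshuv", "opponent1", "opponent2"),
  stringsAsFactors = FALSE
)

# Count games per Event value to find the exact label to filter on.
table(games$Event)

# Base R equivalent of the dplyr filter: keep only the classical games.
classical <- games[games$Event == "Rated Classical game", ]
nrow(classical)
```

The base R subsetting gives the same rows as the dplyr call, so either works if you prefer not to load another package.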
Information about your bigchess R data frame
I was quite confused by this data frame when I first saw it, and I was rather amazed by it when I understood it!
bigchess creates a data frame with 51 variables! What are these 51 columns?!
bigchess extracts everything out of your PGN (excluding the tags you didn't add), and then creates a few extra columns from this data that you might not have thought of yourself. It stores the total number of moves, the number of moves made by each piece, the first ten moves of White and Black in individual columns, and of course the entire move text of each and every game. This is very helpful for opening analysis, and for other types of analysis as well.
Once you have the data frame you can use other R goodness to conduct analysis, create graphs and so on. The strength of the bigchess package is in extracting all your games and providing them in a data frame that is extremely powerful and amenable to analysis. There are some other features of this package as well, but I have not explored them yet. I will do future posts on them when I use them.
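A few base R helpers make a wide data frame like this easier to get to know. The sketch below uses a tiny made-up stand-in rather than the real thing. W1 is the column for White's first move; B1 and NMoves are names I am assuming for Black's first move and the move count.

```r
# Toy stand-in with a handful of columns; B1 and NMoves are assumed names.
games <- data.frame(
  White  = c("a", "b"),
  Black  = c("c", "d"),
  NMoves = c(34L, 51L),
  W1     = c("e4", "d4"),
  B1     = c("c5", "Nf6"),
  stringsAsFactors = FALSE
)

dim(games)    # number of rows and number of columns
names(games)  # every column name

# Pick out the columns that hold individual opening moves (W1, B1, ...).
move_cols <- grep("^[WB][0-9]+$", names(games), value = TRUE)
move_cols
```

Running dim(tal) and names(tal) on the real data frame is a quick way to see all 51 columns at once.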
Analysis with R
Time for some analysis now! This is a simple chart that I created to see how my opponents were rated when I was black.

Code:
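library(dplyr)    # provides filter()
library(ggplot2)  # provides ggplot() and the geom_* layers
# Keep only the games where I played Black.
data <- filter(tal, Black == "manshuv")
# Histogram of my opponents' ratings (they had White in these games).
ggplot(data, aes(x = WhiteElo)) +
  geom_histogram(fill = "skyblue", alpha = 0.5) +
  ggtitle("Opponent Ratings When I Played Black") +
  theme_minimal()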
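To put numbers next to the histogram, a quick summary of the ratings helps. Because some ratings may be missing (NA), mean() and median() need na.rm = TRUE. The values below are made up; on the real data the vector would be data$WhiteElo.

```r
# Made-up opponent ratings; NA stands for a game without a rating.
white_elo <- c(1480, 1552, NA, 1610, 1498)

# Five-number summary plus the mean; it also reports how many NAs there are.
summary(white_elo)

# Mean and median, ignoring the missing rating.
avg_elo <- mean(white_elo, na.rm = TRUE)
mid_elo <- median(white_elo, na.rm = TRUE)
avg_elo
mid_elo
```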
This was fairly simple, and now I'd like to see how often my opponents played each first move against me. I know e4 must be the most popular, but I don't know what the distribution is. This is slightly trickier because the column W1, which holds White's first move, contains text values such as e4 and d4, so you can't use the code given above straightaway. You may think you need to aggregate this data first, but you don't. You just need to use geom_bar instead of geom_histogram, because geom_bar counts the rows for each value itself, which is quite fantastic!
This yields the following chart:
This can be achieved with the following code:
ggplot(data, aes(x = W1)) +
  geom_bar(fill = "skyblue", alpha = 0.5) +
  ggtitle("White's Opening Move") +
  theme_minimal()
As you can see, and as you probably expected, e4 is played much more frequently than anything else. I didn't think that it was played twice as many times as d4 though, so this is interesting to me.
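The same counts can be read off without a chart. This is a small base R sketch using made-up first moves; on the real data the vector would be data$W1.

```r
# Made-up first moves standing in for the W1 column.
w1 <- c("e4", "e4", "d4", "e4", "c4", "d4", "e4")

# Count each move, most frequent first.
counts <- sort(table(w1), decreasing = TRUE)
counts

# Share of games per move, rounded to two decimals.
round(prop.table(counts), 2)

# How many e4 games there are for every d4 game.
e4_to_d4 <- counts[["e4"]] / counts[["d4"]]
e4_to_d4
```

This is the aggregation that geom_bar does behind the scenes, and the ratio at the end is a direct way to check a claim like "twice as many times as d4".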
Finally, I am very happy to have found the bigchess package, and to be able to combine two of my hobbies to learn something new in both of them at the same time! I am certainly hopeful that this will be the first of many more posts on R, and that R may even help my chess!