Fun Feature Engineering with Gen 1 Pokémon

I had the idea behind this blog post for a while, but tidytuesday was doing Pokemon data this week I had to do it. I have been thinking about Feature Engineering for a while now, and the idea of using data about something I like and know about seems like a perfect match.

The general idea in this post is to apply different transformations to the data and see how reveals structure and similarity in the data. Since we don’t have a specific target in mind, I will be using UMAP on the different data sets. Then we can look at the placements to see if we can recognize what happened in the transformation.

Note

I’m fully aware of the pros and cons of using UMAP. This is a toy example and you will find that it serves its purpose fine for this post.

For ease of visualization, I will only be using Generation 1 (first 151) Pokemon. Partly to avoid overplotting and because those are the most well-known Pokemon for the readers.

Packages and data

We load tidyverse for wrangling and plotting, ggimage to add pokemon as sprites, recipes, and embed to perform feature engineering including UMAP, pokemon for the data sources, fs for file system operations since I will be working with some images as well, patchwork to combine some charts, png to read png files.

library(tidyverse)
library(ggimage)
library(recipes)
library(embed)
library(pokemon)
library(fs)
library(patchwork)
library(png)

Then we filter the data down to only include generation 1, The generation_id column comes in handy for this.

pokemon <- pokemon |>
  filter(generation_id == 1)
glimpse(pokemon)

Rows: 151
Columns: 22
$ id              <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
$ pokemon         <chr> "bulbasaur", "ivysaur", "venusaur", "charmander", "cha…
$ species_id      <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
$ height          <dbl> 0.7, 1.0, 2.0, 0.6, 1.1, 1.7, 0.5, 1.0, 1.6, 0.3, 0.7,…
$ weight          <dbl> 6.9, 13.0, 100.0, 8.5, 19.0, 90.5, 9.0, 22.5, 85.5, 2.…
$ base_experience <dbl> 64, 142, 236, 62, 142, 240, 63, 142, 239, 39, 72, 178,…
$ type_1          <chr> "grass", "grass", "grass", "fire", "fire", "fire", "wa…
$ type_2          <chr> "poison", "poison", "poison", NA, NA, "flying", NA, NA…
$ hp              <dbl> 45, 60, 80, 39, 58, 78, 44, 59, 79, 45, 50, 60, 40, 45…
$ attack          <dbl> 49, 62, 82, 52, 64, 84, 48, 63, 83, 30, 20, 45, 35, 25…
$ defense         <dbl> 49, 63, 83, 43, 58, 78, 65, 80, 100, 35, 55, 50, 30, 5…
$ special_attack  <dbl> 65, 80, 100, 60, 80, 109, 50, 65, 85, 20, 25, 90, 20, …
$ special_defense <dbl> 65, 80, 100, 50, 65, 85, 64, 80, 105, 20, 25, 80, 20, …
$ speed           <dbl> 45, 60, 80, 65, 80, 100, 43, 58, 78, 45, 30, 70, 50, 3…
$ color_1         <chr> "#78C850", "#78C850", "#78C850", "#F08030", "#F08030",…
$ color_2         <chr> "#A040A0", "#A040A0", "#A040A0", NA, NA, "#A890F0", NA…
$ color_f         <chr> "#81A763", "#81A763", "#81A763", NA, NA, "#DE835E", NA…
$ egg_group_1     <chr> "monster", "monster", "monster", "monster", "monster",…
$ egg_group_2     <chr> "plant", "plant", "plant", "dragon", "dragon", "dragon…
$ url_icon        <chr> "//archives.bulbagarden.net/media/upload/7/7b/001MS6.p…
$ generation_id   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ url_image       <chr> "https://raw.githubusercontent.com/HybridShivam/Pokemo…

To make the visualizations easy to read I’m going to be using small sprites or each Pokemon as their plotting. I’ll download them for easy use using the below code.

sprites_urls <- paste0("https:", pokemon$url_icon)

sprites_path <- fs::path(
  "sprites", 
 pokemon$pokemon, 
 ext = "png"
)

walk2(sprites_urls, sprites_path, slowly(download.file))

Then we create a tibble of all the sprites we just downloaded with matching names.

sprites_tbl <- tibble(
 pokemon = pokemon$pokemon,
 sprite = sprites_path
)
sprites_tbl

# A tibble: 151 × 2
   pokemon    sprite                
   <chr>      <fs::path>            
 1 bulbasaur  sprites/bulbasaur.png 
 2 ivysaur    sprites/ivysaur.png   
 3 venusaur   sprites/venusaur.png  
 4 charmander sprites/charmander.png
 5 charmeleon sprites/charmeleon.png
 6 charizard  sprites/charizard.png 
 7 squirtle   sprites/squirtle.png  
 8 wartortle  sprites/wartortle.png 
 9 blastoise  sprites/blastoise.png 
10 caterpie   sprites/caterpie.png  
# ℹ 141 more rows

Now we are ready to get to plotting.

Random - how to plot Pokemon sprites

I’m using the ggimage package to add the Pokemon as sprites. With this, we can use geom_image() which takes an aesthetic image which should be a path to an image. Lastly, we need to change the size used in geom_image() as the default is way too large for what we are trying to do.

set.seed(1234)

sprites_tbl |>
  mutate(
 x = rnorm(151),
 y = rnorm(151),
 ) |>
  ggplot(aes(x, y)) + 
  geom_image(aes(image = sprite), size = .1)

Since the axes won’t have any meaning because we are using UMAP, We will be using theme_void() to remove everything, along wit theme() and labs() to add a simple title.

set.seed(1234)
sprites_tbl |>
  mutate(
 x = rnorm(151),
 y = rnorm(151),
 ) |>
  ggplot(aes(x, y)) + 
  geom_image(aes(image = sprite), size = .1) +
  theme_void() +
  theme(plot.title = element_text(hjust = 0.5)) +
  labs(title = "Random Position")

The only difference moving forward will be the data we pass in, so I’ll create a little helper function to do our plotting. Using aes(UMAP1, UMAP2) instead of aes(x, y) since I know that is the name that will come out of recipes by default.

pokemon_umap_plot <- function(data, title) {
 data |>
    ggplot(aes(UMAP1, UMAP2)) + 
    geom_image(aes(image = sprite), size = .1) +
    theme_void() +
    theme(plot.title = element_text(hjust = 0.5)) +
    labs(title = title)
}

Stats

First, we will see what happens when we just look at the 6 main stats.

Code

set.seed(1)
pokemon |>
  select(pokemon, hp:speed) |>
  recipe() |>
  step_umap(all_numeric()) |>
  prep() |>
  bake(NULL) |>
  left_join(sprites_tbl, by = join_by(pokemon)) |>
  pokemon_umap_plot("Stats")

Remember that UMAP mostly cares about local distances. So left-right up-down doesn’t mean much, instead, we look at if neighboring Pokemon have similar data which for this plot is stats.

We can verify this by seeing that all legendary Pokemon, Articuno, Zapdos, Moltres, and Mewtwo are hanging out on the right side near a lot high stated pokemon. And the left-hand side appears to have some low stated Pokemon with Weedle, Caterpie, and Magikarp.

Stats - normalized

Before we use the stats by themselves, let us see if something happens if we normalize them. Normalizing here is specifically meant as a part of the total stat distribution. This should hopefully mean that a Pokemon that is very fast would be near other fast Pokemon, regardless of whether its total stats are high or low.

Code

set.seed(1)
pokemon |>
  select(pokemon, hp:speed) |>
   mutate(total = rowSums(across(where(is.numeric)))) |>
  mutate(across(hp:speed, \(x) x / total)) |>
  select(-total) |>
  recipe() |>
  step_umap(all_numeric()) |>
  prep() |>
  bake(NULL) |>
  left_join(sprites_tbl, by = join_by(pokemon)) |>
  pokemon_umap_plot("Stats normalized by total")

The main difference I see is that Pokemon from their evolutionary family stay together. Krabby and Kingler, Ponyta and Rapidash, Omanyte and Omastar. This feels right as you would expect many Pokemon to preserve their stat spread when evolving.

Moves

I went ahead a scraped from The RESTful Pokémon API some more data than what came in the package. I wanted to look at their moves and see what happens with them. The data set contains one row for each Pokemon and one column for each move. A 1 indicates whether the Pokemon can learn that move.

moves <- read_csv(
  "moves.csv", 
 show_col_types = FALSE
)

moves

# A tibble: 151 × 593
   name  razor_wind swords_dance   cut  bind vine_whip headbutt tackle body_slam
   <chr>      <dbl>        <dbl> <dbl> <dbl>     <dbl>    <dbl>  <dbl>     <dbl>
 1 bulb…          1            1     1     1         1        1      1         1
 2 cate…          0            0     0     0         0        0      1         0
 3 volt…          0            0     0     0         0        1      1         0
 4 elec…          0            0     0     0         0        1      1         0
 5 exeg…          0            1     0     0         0        1      0         0
 6 exeg…          0            1     0     0         0        1      0         1
 7 cubo…          0            1     0     0         0        1      0         1
 8 maro…          0            1     0     0         0        1      0         1
 9 hitm…          0            1     0     0         0        1      1         1
10 hitm…          0            1     0     0         0        1      1         1
# ℹ 141 more rows
# ℹ 584 more variables: take_down <dbl>, double_edge <dbl>, growl <dbl>,
#   strength <dbl>, mega_drain <dbl>, leech_seed <dbl>, growth <dbl>,
#   razor_leaf <dbl>, solar_beam <dbl>, poison_powder <dbl>,
#   sleep_powder <dbl>, petal_dance <dbl>, string_shot <dbl>, toxic <dbl>,
#   rage <dbl>, mimic <dbl>, double_team <dbl>, defense_curl <dbl>,
#   light_screen <dbl>, reflect <dbl>, bide <dbl>, sludge <dbl>, …

We use the same UMAP on these predictors now.

Code

set.seed(1)
moves |>
  recipe() |>
  step_umap(all_numeric()) |>
  prep() |>
  bake(NULL) |>
  left_join(sprites_tbl, by = c("name" = "pokemon")) |>
  pokemon_umap_plot("Moves")

And it appears that it is working as intended. We again see evolution likes close to each other which makes sense, we also see a little bit of a color clustering which could be explained a little by the fact that a lot of Pokemon of the same types have similar colors. The last thing I wanted to point out is that it perfectly captures the Pokemon that doens’t learn many moves including; Ditto, Kakuna, and Magikarp.

Moves types

Each move has metadata associated with it. Let’s see what we can pull out of this.

moves_meta <- read_csv(
  "moves-meta.csv", 
 show_col_types = FALSE
)

moves_meta

# A tibble: 592 × 4
   moves        type   priority category      
   <chr>        <chr>     <dbl> <chr>         
 1 razor_wind   normal        0 damage        
 2 swords_dance normal        0 net-good-stats
 3 cut          normal        0 damage        
 4 bind         normal        0 damage+ailment
 5 vine_whip    grass         0 damage        
 6 headbutt     normal        0 damage        
 7 tackle       normal        0 damage        
 8 body_slam    normal        0 damage+ailment
 9 take_down    normal        0 damage        
10 double_edge  normal        0 damage        
# ℹ 582 more rows

We look at how many of each type a Pokemon can learn. E.i. how many grass type moves does it learn, how many fire time moves does it learn, etc etc.

set.seed(5)
moves |>
  pivot_longer(cols = -name, names_to = "moves") |>
  filter(value == 1) |>
  left_join(moves_meta, by = join_by(moves)) |>
  count(name, type) |>
  pivot_wider(names_from = type, values_from = n, values_fill = 0) |>
  recipe() |>
  step_umap(all_numeric()) |>
  prep() |>
  bake(NULL) |>
  left_join(sprites_tbl, by = c("name" = "pokemon")) |>
  pokemon_umap_plot("Number of Moves by Type")

This one is very skewed by the handful of Pokemon that don’t know any moves.

Move types normalized

We do the same normalization we talked about earlier.

Code

set.seed(2)
moves |>
  pivot_longer(cols = -name, names_to = "moves") |>
  filter(value == 1) |>
  left_join(moves_meta, by = join_by(moves)) |>
  count(name, type) |>
  pivot_wider(names_from = type, values_from = n, values_fill = 0) |>
  mutate(total = rowSums(across(where(is.numeric)))) |>
  mutate(across(where(is.numeric), \(x) x / total)) |>
  select(-total) |>
  recipe() |>
  step_umap(all_numeric()) |>
  prep() |>
  bake(NULL) |>
  left_join(sprites_tbl, by = c("name" = "pokemon")) |>
  pokemon_umap_plot("Number of Moves by Type - Normalized")

There might be a bit more separation. One thing to point out here is that it looks very similar to the moves embedding, but this is done on 18 columns (1 for each type) which is a lot less than moves alone which contains 592 columns.

Move Category

I extracted a category field as well. This one lets us see what type of move it is in broad categories.

moves_meta |>
  count(category)

# A tibble: 15 × 2
   category               n
   <chr>              <int>
 1 ailment               29
 2 damage               245
 3 damage+ailment        62
 4 damage+heal            7
 5 damage+lower          39
 6 damage+raise          16
 7 field-effect          12
 8 force-switch           2
 9 heal                  10
10 net-good-stats        52
11 ohko                   4
12 swagger                2
13 unique                79
14 whole-field-effect    15
15 <NA>                  18

We will use it the same as before.

Code

set.seed(5)
moves |>
  pivot_longer(cols = -name, names_to = "moves") |>
  filter(value == 1) |>
  left_join(moves_meta, by = join_by(moves)) |>
  count(name, category) |>
  pivot_wider(names_from = category, values_from = n, values_fill = 0) |>
  recipe() |>
  step_umap(all_numeric()) |>
  prep() |>
  bake(NULL) |>
  left_join(sprites_tbl, by = c("name" = "pokemon")) |>
  pokemon_umap_plot("Number of Moves by Category")

I find this one harder to analyze, I’m not able to many trends in this data set beyond our useless friends.

Move Category normalized

If we normalize the above we get the following.

Code

set.seed(5)
moves |>
  pivot_longer(cols = -name, names_to = "moves") |>
  filter(value == 1) |>
  left_join(moves_meta, by = join_by(moves)) |>
  count(name, category) |>
  pivot_wider(names_from = category, values_from = n, values_fill = 0) |>
  mutate(total = rowSums(across(where(is.numeric)))) |>
  mutate(across(where(is.numeric), \(x) x / total)) |>
  select(-total) |>
  recipe() |>
  step_umap(all_numeric()) |>
  prep() |>
  bake(NULL) |>
  left_join(sprites_tbl, by = c("name" = "pokemon")) |>
  pokemon_umap_plot("Number of Moves by Category - normalized")

What I like about this one is that you see Pokemon from the same evolution lines near each other. As well as the legendary birds together in the middle.

Sprites

Lastly, let’s see what we can do with the sprites. This will be a very rough approach. I’m going to count the number of colors for each pixel in the images from the Pokemon data set and use those counts as the input to UMAP.

Below is the code used to download the images.

image_urls <- pokemon$url_image

image_path <- fs::path(
  "images", 
 pokemon$pokemon, 
 ext = "png"
)

walk2(image_urls, image_path, slowly(download.file))

I wrote a little helper function to get pixel color counts.

count_colors <- function(path, name) {
 png <- readPNG(path)
 color <- rgb(png[,,1], png[,,2], png[,,3])
 res <- tibble(color) |>
    count(color)
    
  bind_cols(
 name = name,
 res
 )
}

Now we are ready to use the data.

Code

set.seed(1)
map2(
 image_path, 
 pokemon$pokemon, 
 count_colors
) |>
  list_rbind() |>
  filter(color != "#000000") |>
  pivot_wider(names_from = color, values_from = n, values_fill = 0) |>
  recipe() |>
  step_umap(all_numeric()) |>
  prep() |>
  bake(NULL) |>
  left_join(sprites_tbl, by = c("name" = "pokemon")) |>
  pokemon_umap_plot("Number of Moves by Type")

What we see right away is that something didn’t work right. A likely explanation is that the precise colors were too specific, and we thus didn’t see enough overlap between the Pokemon for UMAP to pick up on.

Let’s try to round the colors a bit after there are 16,777,216 (16 ^ 6) colors. Lets instead round them, so instead of using two hex values per color, we just use the more important one. This will leave us with 4096 (16 * 16 * 16) colors.

Code

set.seed(1)
map2(
 image_path, 
 pokemon$pokemon, 
 count_colors) |>
  list_rbind() |>
  mutate(color = paste0(
    str_sub(color, 1, 2), 
    str_sub(color, 3, 3),
    str_sub(color, 5, 5)
 )
 ) |>
    mutate(color = str_replace_all(
 color, 
      c(
        "0" = "0", 
        "1" = "0",
        "2" = "0",
        "3" = "0",
        "4" = "1", 
        "5" = "1",
        "6" = "1",
        "7" = "1",
        "8" = "2", 
        "9" = "2",
        "A" = "2",
        "B" = "2",
        "C" = "3", 
        "D" = "3",
        "E" = "3",
        "F" = "3"
 )
 )
 ) |>
  count(name, color, wt = n) |>
  filter(color != "#000") |>
  pivot_wider(names_from = color, values_from = n, values_fill = 0) |>
  recipe() |>
  step_umap(all_numeric()) |>
  prep() |>
  bake(NULL) |>
  left_join(sprites_tbl, by = c("name" = "pokemon")) |>
  pokemon_umap_plot("Number of Moves by Type")

The results are better as there now is a color separation between the Pokemon. Let’s see what would happen if we discretize the colors even more. Below each color channel (red, green, blue) is made to only contain 4 different values, leaving us with 64 unique colors to be counted.

Code

set.seed(1)
map2(
 image_path, 
 pokemon$pokemon, 
 count_colors) |>
  list_rbind() |>
  mutate(color = paste0(
    str_sub(color, 1, 2), 
    str_sub(color, 3, 3),
    str_sub(color, 5, 5)
 )
 ) |>
    mutate(color = str_replace_all(
 color, 
      c(
        "0" = "0", 
        "1" = "0",
        "2" = "0",
        "3" = "0",
        "4" = "1", 
        "5" = "1",
        "6" = "1",
        "7" = "1",
        "8" = "2", 
        "9" = "2",
        "A" = "2",
        "B" = "2",
        "C" = "3", 
        "D" = "3",
        "E" = "3",
        "F" = "3"
 )
 )
 ) |>
  count(name, color, wt = n) |>
  filter(color != "#000") |>
  pivot_wider(names_from = color, values_from = n, values_fill = 0) |>
    mutate(total = rowSums(across(where(is.numeric)))) |>
  mutate(across(where(is.numeric), \(x) x / total)) |>
  select(-total) |>
  recipe() |>
  step_umap(all_numeric()) |>
  prep() |>
  bake(NULL) |>
  left_join(sprites_tbl, by = c("name" = "pokemon")) |>
  pokemon_umap_plot("Number of Moves by Type")

This worked pretty well considering how limited RGB color space is for these types of analysis.

Further ideas

In no particular order

co-occurrence in moves
Resistances
Body style, exp rates, potential (what they can be envolve into)
types
- Full resistances
- combinations
text embedding
image embeddings
- so many things to do here
- actual embeddings
- different color spaces
what types can this Pokemon hit with the moves it can learn
- can it hit or not for each type
- how many moves can hit count vs percentage
- highest dmg value for the attack that hits each type

Other similar work

https://gibsramen.github.io/gUMAP/docs/pokemon.html
https://minimaxir.com/2024/06/pokemon-embeddings/