library(tidyverse)
Hanukkah of Data is a data/coding challenge, with one puzzle released daily throughout the eight days of Hanukkah. It has a lot in common with the much better known Advent of Code, telling a story over a series of days. Hanukkah of Data uses the same data set for each puzzle, representing a database, and our task is to find a certain piece of information in it each day. It takes exploring the data and also a certain amount of investigation beyond the data, as we’ll see throughout the puzzles.
In this post, I’ll record my solutions for each puzzle, and how the code links to the puzzle text. However, getting to the solution for each day can be pretty involved, exploring several avenues then needing to think again, to find another clue in the puzzle text. The whole process is a lot of fun, but I won’t record in this post everything I did. My full workings are documented in commented R scripts. I’m using the tidyverse to solve the puzzles. This, for me, is a nice complement to Advent of Code, which I’m primarily attempting to solve in base R this year, and keeping a record of my solutions on a dedicated site.
Also of interest with Hanukkah of Data is the looming challenge of the speed-run: after the eight nights, we get to go through all the puzzles again. We’re told that the puzzles will be the same, but that the underlying data set is more “challenging”. However, according to the website, we “should be able to complete the puzzles handily and in record time–if the solutions you crafted the first time weren’t just lucky guesses!” So, in anticipation of the speed-run, I wrapped all my code for each day inside a function, to make it easy to rerun. It turned out that the text for the speedrun wasn’t exactly the same. I had to refactor my functions slightly as I went through again. The functions in the post below are my final solutions that give the right answers for both the initial game and the speedrun.
The shamash
We’re in 2023, and the Hebrew year is 5784, so it might be tempting to think we need to find six years earlier than that, 5778. However, Rosh Hashanah (Jewish New Year), always falls in September or October of the Gregorian calendar (in 2017 Rosh Hashanah began on September 20th). Since we need the Hebrew year at the start of 2017, the answer is 5777.
The shamash is lit. Happy Hanukkah!
The data
The answer above, 5777, also decrypts the USB drive which gives us access to the database. It contains four .csv files. Since they’re the same for all puzzles, let’s read them in and take a glimpse at them now.
<- read_csv("data/noahs-customers.csv")
customers <- read_csv("data/noahs-orders.csv")
orders <- read_csv("data/noahs-orders_items.csv")
orders_items <- read_csv("data/noahs-products.csv") products
glimpse(customers)
Rows: 8,260
Columns: 9
$ customerid <dbl> 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 101…
$ name <chr> "Jacqueline Alvarez", "Julie Howell", "Christopher Ali", …
$ address <chr> "105N Elizabeth St", "185-1 Linden St", "174-28 Baisley B…
$ citystatezip <chr> "Manhattan, NY 10013", "Brooklyn, NY 11221", "Jamaica, NY…
$ birthdate <date> 1958-01-23, 1956-12-03, 2001-09-20, 1959-07-10, 1988-09-…
$ phone <chr> "315-377-5031", "680-537-8725", "315-846-6054", "516-275-…
$ timezone <chr> "America/New_York", "America/New_York", "America/New_York…
$ lat <dbl> 40.71817, 40.69426, 40.68902, 40.84939, 40.72804, 40.6809…
$ long <dbl> -73.99747, -73.92167, -73.77347, -73.90916, -73.98712, -7…
glimpse(orders)
Rows: 213,232
Columns: 6
$ orderid <dbl> 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010,…
$ customerid <dbl> 6878, 6375, 8045, 5385, 9235, 8414, 4285, 5725, 5294, 1801,…
$ ordered <dttm> 2017-01-31 02:56:45, 2017-01-31 04:13:35, 2017-01-31 04:45…
$ shipped <dttm> 2017-01-31 09:00:00, 2017-01-31 12:15:00, 2017-01-31 10:45…
$ items <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ total <dbl> 0.99, 13.59, 1.23, 2.10, 13.23, 2.03, 3.58, 1.32, 23.63, 10…
glimpse(orders_items)
Rows: 426,541
Columns: 4
$ orderid <dbl> 1001, 1002, 1002, 1003, 1004, 1005, 1005, 1005, 1005, 1005,…
$ sku <chr> "PET4571", "PET4491", "TOY7498", "PET5509", "PET3929", "PET…
$ qty <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1,…
$ unit_price <dbl> 0.99, 1.08, 12.51, 1.23, 2.10, 1.67, 1.84, 1.16, 1.23, 1.64…
glimpse(products)
Rows: 1,278
Columns: 4
$ sku <chr> "PET0002", "PET0006", "PET0019", "PET0023", "PET0027", …
$ desc <chr> "Wet Cat Food, Tuna & Tuna", "Vegan Adult Cat Food, Chi…
$ wholesale_cost <dbl> 0.89, 1.41, 0.89, 1.47, 1.03, 28.32, 2487.35, 0.97, 33.…
$ dims_cm <chr> "16.1|5.5|3.2", "19.9|16.9|0.3", "18.6|8.7|7.2", "17.4|…
Candle 1
Toggle the code
<- function(customers) {
candle1 |>
customers select(name, phone) |>
separate_wider_delim(name, delim = " ",
names = c("first", "last"),
too_many = "merge") |>
mutate(phone_numbers = str_remove_all(phone, "-")) |>
mutate(last_phone = str_replace_all(last, "[a-cA-C]", "2")) |>
mutate(last_phone = str_replace_all(last_phone, "[d-fD-F]", "3")) |>
mutate(last_phone = str_replace_all(last_phone, "[g-iG-I]", "4")) |>
mutate(last_phone = str_replace_all(last_phone, "[j-lJ-L]", "5")) |>
mutate(last_phone = str_replace_all(last_phone, "[m-oM-O]", "6")) |>
mutate(last_phone = str_replace_all(last_phone, "[p-sP-S]", "7")) |>
mutate(last_phone = str_replace_all(last_phone, "[t-vT-V]", "8")) |>
mutate(last_phone = str_replace_all(last_phone, "[w-zW-Z]", "9")) |>
filter(phone_numbers == last_phone) |>
pull(phone)
}
<- candle1(customers)
investigator investigator
[1] "826-636-2286"
Candle 2
Toggle the code
<- function(customers, orders, orders_items, products,
candle2
first_initial_regex, second_initial_regex) {|>
customers separate_wider_delim(name, delim = " ",
names = c("first", "last"),
too_many = "merge") |>
filter(str_detect(first, first_initial_regex)) |>
filter(str_detect(last, second_initial_regex)) |>
inner_join(orders, by = "customerid") |>
filter(str_detect(ordered, "2017")) |>
inner_join(orders_items, by = "orderid") |>
inner_join(products, by = "sku") |>
select(phone, orderid, desc) |>
group_by(phone, orderid) |>
filter(any(str_detect(desc, "Coffee")) &
any(str_detect(desc, "Bagel"))) |>
distinct(phone) |>
pull(phone)
}
<- candle2(customers, orders, orders_items, products, "^J", "^P")
contractor contractor
[1] "332-274-4185"
The ability to filter on any pair of initials was introduced for the speedrun.
I noticed the clue about ‘coffee and bagels’ before absorbing the one about the cleaning supplies. When I read some other solutions after completing my own, I realised that there is also “Rug Cleaner” amongst the products that Noah sells, and that would have been a slightly easier way to the same solution. I’m not adjusting my solution, though, because I like coffee and bagels more than I like cleaning!
Candle 3
Toggle the code
<- function(customers, contractor, years, first_month) {
candle3
<- customers |>
neighborhood filter(phone == contractor) |>
pull(citystatezip)
|>
customers filter(year(birthdate) %in% years) |>
filter(month(birthdate) == first_month & day(birthdate) >= 21 |
month(birthdate) == (first_month + 1) & day(birthdate) <= 22) |>
filter(citystatezip == neighborhood) |>
pull(phone)
}
<- c(1939, 1951, 1963, 1975, 1987, 1999, 2011)
years_rabbit <- candle3(customers, contractor, years_rabbit, 6)
neighbor neighbor
[1] "917-288-9635"
The ability to filter on any of the 12-year cycle of animals that appear in the Chinese Zodiac, as well as for any astrological sign, was introduced for the speedrun.
I like being able to use the year()
and month()
functions from the lubridate package to be able to filter on particular parts of a Date
without needing to separate them out into their own columns first. This was new to me.
Candle 4
Toggle the code
<- function(customers, orders, orders_items) {
candle4 |>
orders filter(hour(shipped) == 4) |> # received between 4am and 5am
inner_join(orders_items, by = "orderid") |>
filter(str_detect(sku, "BKY")) |> # bakery items
summarise(n_pastries = sum(qty), .by = "orderid") |>
filter(n_pastries > 1) |> # has bought more than one pastry in the order
inner_join(orders, by = "orderid") |>
count(customerid) |>
slice_max(n) |> # the person who has done this most often
inner_join(customers, by = "customerid") |>
pull(phone)
}
<- candle4(customers, orders, orders_items)
early_bird early_bird
[1] "607-231-3605"
I had fun today learning more about traditional Jewish pastries. I was already a fan of rugelach and babka and that was a jumping off point for deeper exploration (hamentash, sufganiah, mandelbrot). Yum! Got to love a coding challenge that gives credit for deli consulting.
Candle 5
Toggle the code
<- function(customers, orders, orders_items, products) {
candle5 |>
customers inner_join(orders, by = "customerid") |>
inner_join(orders_items, by = "orderid") |>
inner_join(products, by = "sku") |>
filter(str_detect(desc, "Senior Cat")) |>
summarise(qty_cat_food = sum(qty), .by = c(phone, orderid))|>
select(-orderid) |>
distinct() |>
slice_max(qty_cat_food) |>
pull(phone)
}
<- candle5(customers, orders, orders_items, products)
cat_lady cat_lady
[1] "631-507-6048"
We’re told she’s from Staten Island, and I originally filtered the customers for that, which got the right answer first time round, but there’s no mention of where the cat lady is from in the speedrun, so I took it out (leaving it in gives the wrong answer on the speedrun).
There’s also a big red herring in the text today, the fact that she’s wearing a sweatshirt from Noah’s Market. I originally looked for someone who had bought both a “Noah’s Jersey” and cat food, but that didn’t give any correct answers. In fact, it turns out the Cat Lady hasn’t purchased a Noah’s sweatshirt (perhaps she’d been gifted one). This was the day I realised that I needed to pay attention to the title of the puzzle: the cats are the important thing today.
In retrospect, I suspect that the mention of Noah’s sweatshirt was to encourage us to see in the product list an item that comes in many colours. That would prove useful for Candle 7.
Candle 6
Toggle the code
<- function(customers, orders, orders_items, products) {
candle6 inner_join(orders_items, products, by = "sku") |>
mutate(shop_price = qty * unit_price,
wholesale_price = qty * wholesale_cost) |>
summarise(order_shop_price = sum(shop_price),
order_wholesale_price = sum(wholesale_price),
.by = orderid) |>
filter(order_wholesale_price > order_shop_price) |>
inner_join(orders, by = "orderid") |>
count(customerid) |>
slice_max(n) |>
inner_join(customers, by = "customerid") |>
pull(phone)
}
<- candle6(customers, orders, orders_items, products)
bargain_hunter bargain_hunter
[1] "585-838-9161"
Candle 7
Toggle the code
<- function(customers, orders, orders_items, products, bargain_hunter) {
candle7
# get info about the bargain hunter
<- customers |>
bargain_hunter_id filter(phone == bargain_hunter) |>
pull(customerid)
<- customers |>
bargain_hunter_orders filter(phone == bargain_hunter) |>
inner_join(orders, by = "customerid")
<- bargain_hunter_orders |>
bargain_hunter_orders_dates mutate(date = date(shipped)) |>
distinct(date) |>
pull(date)
# Make regex for colours
<- products |>
colours filter(str_detect(desc, "Jersey")) |> # 12 colours, all with COL in sku
arrange(desc) |>
separate_wider_regex(desc, c(item = ".*", " ", colour = ".*")) |>
pull(colour) |>
paste(collapse = "|") |>
str_remove_all("\\(|\\)")
# "amber|azure|blue|green|magenta|mauve|orange|puce|purple|red|white|yellow"
# All orders with colours on the same date as the bargain hunter purchases
# (including the bargain hunter's purchases)
<- orders |>
same_date_orders_colours mutate(shipped_date = date(shipped)) |>
filter(shipped_date %in% bargain_hunter_orders_dates) |>
select(orderid, customerid, shipped, shipped_date) |>
left_join(orders_items, by = "orderid") |>
left_join(products, by = "sku") |>
select(-qty, -unit_price, -wholesale_cost, -dims_cm) |>
filter(str_detect(desc, colours)) |>
separate_wider_regex(desc, c(item = ".*", " ", colour = ".*"))
# the last line seperates the `desc` column at the final deliminator of " ".
# separate into bargain hunter and possible meet cute
<- same_date_orders_colours |>
bh filter(customerid == bargain_hunter_id) |>
select(customerid, shipped_date, shipped, item, colour)
<- same_date_orders_colours |>
possible_meet_cute filter(customerid != bargain_hunter_id) |>
select(customerid, shipped_date, shipped, item, colour)
# Now join these together by item and date
# .x is bargain hunter, .y is possible meet cute
# find for row where colour is different and time is closest
# that's the meet cute, so join with customers and pull phone
inner_join(bh, possible_meet_cute, by = c("shipped_date", "item")) |>
filter(colour.x != colour.y) |>
mutate(time_diff = abs(shipped.x - shipped.y)) |>
slice_min(time_diff) |>
left_join(customers, join_by(customerid.y == customerid)) |>
pull(phone)
}
<- candle7(customers, orders, orders_items, products, bargain_hunter)
meet_cute meet_cute
[1] "838-335-7157"
This was the trickiest day, and my longest solution. It was also the day with the biggest ‘gotcha’ for the speedrun, which turned out to be a delicious twist from the puzzle authors, and which it retrospect I’m kicking myself for not having accounted for.
When I was investigating where colours appear in the products
table, I knew from Candle 5 that the ‘Jersey’ comes in multiple colours. Narrowing in on one of those colours, red
, I found there were seven types of item, then that all of those seven had the same twelve colours. Their descriptions all have the structure "Noah's Item (colour)"
(for each combination of item and colour). Moreover, all 84 of those items have a sku
that starts with "COL"
; a shorthand for ‘colour’, it was easy to presume. I double checked this by looking at all products that have a sku
starting with "COL"
and found 85 products. The extra one is "Noah's Ark Model (HO Scale)"
, no colour, though with a similar extra descriptor in ()
after the final deliminator of " "
. Still, I took a shortcut and instead of checking for items with a colour by regex, I initially filtered on having "COL"
in the sku
instead, knowing the one additional item wasn’t going to cause a problem, since even if both the bargain hunter and her meet cute bought this item, we would filter it out as having the same “colour” (the regex to separate out items and colours puts the (HO Scale)
in the colour column). I made a note at the time that I should come back and be more robust about filtering items with colours. I wish I’d taken my own advice! Running my original function on the speedrun gave the wrong answer. In the more “challening” dataset, there are additional items with colours that don’t have "COL"
in their sku
, and it’s one of these that is the common item in this case. At least I knew where the problem in my code was likely to be, and how to fix it.
The thing that makes it all the more frustrating that I didn’t fix this before the speedrun is that there’s a massive clue later, as we’ll soon see, that the assumption that "COL"
stands for colour was wrong.
Candle 8
Toggle the code
<- function(customers, orders, orders_items, products) {
candle8 |>
customers inner_join(orders, by = "customerid") |>
inner_join(orders_items, by = "orderid") |>
inner_join(products, by = "sku") |>
filter(str_detect(desc, "Noah")) |>
count(phone, sort = TRUE) |> # unique phones, so equivalent to customer id
slice_max(n) |>
pull(phone)
}
<- candle8(customers, orders, orders_items, products)
collector collector
[1] "212-547-3518"
We saw when working on the Candle 7 puzzle that there are items that have the name “Noah” in their description. It seemed reasonable to assume that these are the collectibles and the code above got me the correct answers both first time and in the speedrun. I now realise, though, that I may have got doubly lucky and that my assumption was probably incorrect. But let’s look at the rug/epilogue first, to realise how tantilisingly close I came to figuring it out properly, and to revel in just how fine the storytelling and clue-dropping was.
The rug
The epilogue wraps up the story, and ties together a few clues. We’re hearing from the collector. There’s a mention of the Noah’s Ark collectible—that 85th item found in the product list during our hunt for the meet cute for candle 7. At this point, it was there to see: "COL"
in the sku
is for collectible, not colour. So, instead of incorrectly filtering on this in my original candle 7 solution, I probably should have filtered on this in my candle 8 solution, not on the description containing the word "Noah"
. There are several items that have "Noah"
in their description that might not be collectibles, hence now considering myself lucky that my code for candle 8 worked.
I expect that the COL/collectible/colour confusion was a deliberate “gotcha” on the part of Saul Pwanson, the Hannukah of Data creator, Although it was initally frustrating that my speedrun wasn’t entirely smooth, I’m also kind-of pleased I feel into the trap, because it’s made me appreciate just how expertly the whole challenge has been crafted.
On the subject of expert crafting, one of the delights of Hanukkah of Data is the animated ASCII art. When we land on the homepage we see a menorah, the nine-pronged candelabra that holds one candle for each night of Hanukkah plus the shamash which is used as the kindling light. At first, there’s just the shamash, then on each night of Hanukkah, another candle appears. Solving a day’s puzzle lights the associated candle. The picture I’ve used to accompany this post shows the state of the menorah at the beginning of the second night. As each candle gets lit, it illuminates some of an image above. It’s such a treat to see the picture reveal itself, and link to the text. I didn’t want to spoil the surprise for anyone by making the final artwork immediately visible, but for those who want to see it now, they can do so here.
The speedrun
<- read_csv("data/speedrun/noahs-customers.csv")
customers <- read_csv("data/speedrun/noahs-orders.csv")
orders <- read_csv("data/speedrun/noahs-orders_items.csv")
orders_items <- read_csv("data/speedrun/noahs-products.csv")
products
<- function(customers, orders, orders_items, products) {
speedrun <- candle1(customers)
investigator <- candle2(customers, orders, orders_items, products, "^D", "^S")
contractor <- c(1931, 1943, 1955, 1967, 1979, 1991, 2003, 2015)
years_goat <- candle3(customers, contractor, years_goat, 9)
neighbor <- candle4(customers, orders, orders_items)
early_bird <- candle5(customers, orders, orders_items, products)
cat_lady <- candle6(customers, orders, orders_items, products)
bargain_hunter <- candle7(customers, orders, orders_items, products, bargain_hunter)
meet_cute <- candle8(customers, orders, orders_items, products)
collector
cat("Investigator:", investigator, "\n")
cat("Contractor:", contractor, "\n")
cat("Neighbor:", neighbor, "\n")
cat("Early Bird:", early_bird, "\n")
cat("Cat Lady:", cat_lady, "\n")
cat("Bargain Hunter:", bargain_hunter, "\n")
cat("Meet Cute:", meet_cute, "\n")
cat("Collector:", collector, "\n")
}speedrun(customers, orders, orders_items, products)
Investigator: 767-365-7269
Contractor: 838-351-0370
Neighbor: 914-594-5535
Early Bird: 716-789-4433
Cat Lady: 347-835-2358
Bargain Hunter: 838-295-7143
Meet Cute: 516-544-4187
Collector: 516-638-9966
Conclusion
I thoroughly enjoyed participating in Hanukkah of Data. I found the puzzles to be a lot of fun and enjoyed the storytelling that linked them. I loved the twist of the speedrun as a motivator for thinking about how to make my solutions robust. I learnt how to use new tidyverse functions, such as the separate_wider_*()
family and join_by()
, and I’ve been inspired to do some baking too! There was a small group in the R4DS (R for Data Science) Slack working through the puzzles daily, and being part of that chat added an enjoyable sense of community and camaraderie. I very much hope the talented team run it again next year. I’m already looking forward to it.
Last updated
2023-12-21 08:39:44.38971
Details
Session info
Toggle
─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.3.2 (2023-10-31)
os macOS Sonoma 14.1
system aarch64, darwin20
ui X11
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz Europe/London
date 2023-12-21
pandoc 3.1.1 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)
quarto 1.4.526 @ /usr/local/bin/quarto
─ Packages ───────────────────────────────────────────────────────────────────
! package * version date (UTC) lib source
P dplyr * 1.1.4 2023-11-17 [?] CRAN (R 4.3.1)
P forcats * 1.0.0 2023-01-29 [?] CRAN (R 4.3.0)
P ggplot2 * 3.4.4 2023-10-12 [?] CRAN (R 4.3.1)
lubridate * 1.9.2 2023-02-10 [1] CRAN (R 4.3.0)
purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.3.0)
P readr * 2.1.4 2023-02-10 [?] CRAN (R 4.3.0)
P sessioninfo * 1.2.2 2021-12-06 [?] CRAN (R 4.3.0)
P stringr * 1.5.0 2022-12-02 [?] CRAN (R 4.3.0)
P tibble * 3.2.1 2023-03-20 [?] CRAN (R 4.3.0)
P tidyr * 1.3.0 2023-01-24 [?] CRAN (R 4.3.0)
P tidyverse * 2.0.0 2023-02-22 [?] CRAN (R 4.3.0)
[1] /Users/ellakaye/Rprojs/mine/ellakaye-quarto/renv/library/R-4.3/aarch64-apple-darwin20
[2] /Users/ellakaye/Library/Caches/org.R-project.R/R/renv/sandbox/R-4.3/aarch64-apple-darwin20/ac5c2659
P ── Loaded and on-disk path mismatch.
──────────────────────────────────────────────────────────────────────────────
Reuse
Citation
@online{kaye2023,
author = {Kaye, Ella},
title = {Hanukkah of {Data,} 5784},
date = {2023-12-07},
url = {https://ellakaye.co.uk/posts/2023-12-07_hanukkah-of-data-5784/},
langid = {en}
}