recommended
Stuff I find interesting
Statistics
-
Improving Your Statistical Inferences
Before reading this set of notes, I knew how to compute p-values and confidence intervals, and I understood how to set up a Neyman-Pearson hypothesis test. However, I had only vague ideas of what these concepts actually meant, and even worse, some of those ideas turned out to be flat-out wrong. Carefully reading the first seven sections of these notes helped me clear up a lot of confusion and ill-formed notions about statistical concepts. The "test yourself" questions at the end of each section are difficult enough to check whether you grasped the main messages, yet simple enough that you don't really have an excuse for skipping them.
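To make the vocabulary concrete, here is a minimal scipy sketch (mine, not from the notes) of the objects the notes dissect: a significance level fixed in advance, a p-value, and a confidence interval, all for a one-sample t-test on made-up data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=0.3, scale=1.0, size=50)  # made-up data with a true effect

alpha = 0.05  # Neyman-Pearson style: fix the error rate before seeing the data
t_stat, p_value = stats.ttest_1samp(sample, popmean=0.0)

# 95% confidence interval for the mean
ci = stats.t.interval(0.95, df=len(sample) - 1,
                      loc=sample.mean(), scale=stats.sem(sample))

print(f"t = {t_stat:.2f}, p = {p_value:.3f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
print("reject H0" if p_value < alpha else "fail to reject H0")
```

The notes are precisely about what you may and may not conclude from these numbers.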
Physics
Scientific Integrity
-
Leakage and the Reproducibility Crisis in ML-based Science
-
Why Hypothesis Testers Should Spend Less Time Testing Hypotheses
Deep Learning
-
Diffusion Is Spectral Autoregression
Insightful blog post drawing parallels between autoregression and diffusion models, particularly in image generation. I found the "radially averaged power spectral density" figures especially interesting: they show that the frequency power spectra of natural images typically follow a power law (the author also provides a heuristic explanation for why). This spectral perspective is then used to argue that diffusion models start by constructing large-scale features and gradually fill in perceptual details in subsequent diffusion steps, which leads to the analogy with autoregressive models.
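If you want to reproduce those figures for your own images, a radially averaged power spectral density is only a few lines of numpy. This is my own sketch, not the author's code:

```python
import numpy as np

def radial_psd(image: np.ndarray) -> np.ndarray:
    """Radially averaged power spectrum of a 2D grayscale image."""
    f = np.fft.fftshift(np.fft.fft2(image))
    power = np.abs(f) ** 2

    h, w = image.shape
    y, x = np.indices((h, w))
    r = np.hypot(x - w // 2, y - h // 2).astype(int)  # distance to the DC component

    # average the power over annuli of (integer) constant radius
    sums = np.bincount(r.ravel(), weights=power.ravel())
    counts = np.bincount(r.ravel())
    return sums / counts

image = np.random.rand(256, 256)  # stand-in; load a real photo to see the power law
psd = radial_psd(image)           # plot against radius on log-log axes
```

On log-log axes, a natural photo should give you the roughly straight (power-law) line the post builds its argument on.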
-
Intro to Large Language Models
If you want to get a rough understanding of the way ChatGPT works, this video is a great starting point. An especially useful notion to internalize is that language models are essentially trained to 'dream' text, and that our prompts nudge those dreams into useful answers. It may also provide some intuition for why ChatGPT's basic reasoning capabilities are sometimes so surprisingly bad compared to how well it works for other tasks.
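The 'dreaming' is just repeated sampling of a next token. Here's a toy illustration of that loop (mine, not from the video), where the 'model' is nothing but bigram counts over characters; a real LLM replaces the count table with a neural network but keeps the same loop:

```python
import numpy as np

text = "a language model dreams text one token at a time. "
chars = sorted(set(text))
idx = {c: i for i, c in enumerate(chars)}

# bigram counts -> conditional distribution over the next character
counts = np.ones((len(chars), len(chars)))  # add-one smoothing
for a, b in zip(text, text[1:]):
    counts[idx[a], idx[b]] += 1
probs = counts / counts.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
out = "a"  # the 'prompt': sampling is conditioned on what is already there
for _ in range(80):  # dream 80 characters, one at a time
    nxt = rng.choice(len(chars), p=probs[idx[out[-1]]])
    out += chars[nxt]
print(out)
```

Note that the prompt is nothing special: it just seeds the conditioning, which is exactly how prompts nudge an LLM's dreams.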
-
Let's Build GPT: From Scratch, in Code, Spelled Out
The highly technical companion to the video above. I used to have a hard time understanding the Transformer architecture, even after reading some nice expository blog posts. In this video, you get to build a simple decoder-only Transformer from the ground up (that is, if you consider PyTorch as sea level), all the while seeing how the different elements in the setup contribute to progressively better performance. I suggest watching the entire thing from start to finish at 1.5x speed first, to get a general impression of where you're going. Then restart from the beginning, pausing every now and then to type out (and run) the code yourself. Experiment with different hyperparameter values (e.g. block size, batch size, embedding dimension) to get a feel for the tradeoffs you have to deal with here. Another nice exercise is to change the tokenization so that the tokens are bigrams instead of single characters; a sketch of one way to start is below.
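This is one possible reading of that exercise, not Karpathy's solution: chop the training text into character pairs, so sequences get half as long while the vocabulary grows roughly quadratically.

```python
# assumes input.txt, the tiny Shakespeare file used in the video
with open("input.txt") as f:
    text = f.read()

if len(text) % 2:  # pad to even length so the text splits cleanly into bigrams
    text += " "
bigrams = [text[i:i + 2] for i in range(0, len(text), 2)]

vocab = sorted(set(bigrams))
stoi = {bg: i for i, bg in enumerate(vocab)}
itos = {i: bg for bg, i in stoi.items()}

# encode assumes even-length strings whose bigrams were all seen in training
encode = lambda s: [stoi[s[i:i + 2]] for i in range(0, len(s), 2)]
decode = lambda ids: "".join(itos[i] for i in ids)

print(f"vocab size: {len(vocab)}")   # compare to ~65 for single characters
assert decode(encode(text)) == text  # the round trip should be lossless
```

Swapping this in for the video's character-level encode/decode is enough to see the vocabulary-size versus sequence-length tradeoff in action.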
-
tinygrad
A simple and lightweight autograd/tensor library for constructing deep learning models. Its syntax resembles PyTorch's, although it's not completely identical. I found it instructive to redo the GPT From Scratch tutorial, swapping out PyTorch for tinygrad. One downside: it's under rapid development, so some features are not that stable; for example, the MPS backend was broken for some time on my macOS version. The upside: it's under rapid development, so this issue was fixed within a few weeks. They also offer cash bounties for certain code contributions, so fixing open issues can put a little cash in your pocket in addition to being somewhat educational.
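To give a flavor of the PyTorch-like feel (based on the API as I last used it; since the library moves fast, details may have shifted):

```python
from tinygrad import Tensor

x = Tensor([[1.0, 2.0], [3.0, 4.0]], requires_grad=True)
w = Tensor([[0.5], [-0.5]], requires_grad=True)

loss = x.matmul(w).relu().sum()  # forward pass, chained PyTorch-style
loss.backward()                  # autograd, also PyTorch-style

print(loss.numpy(), w.grad.numpy())
```

If you've written the equivalent in torch, the only change you'd notice here is the import line.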
-
"Attention", "Transformers", in Neural Network "Large Language Models"
A no-nonsense investigation into the transformer architecture. The author is noticeably annoyed with the lack of precise and carefully worded discussions of transformers and self-attention, and he does a great job of partially fixing this without resorting to handwavy explanations. Also, he's brutally honest about the parts he still doesn't understand (like: why layer norm?). Something that especially stuck with me: the dot-product attention matrix essentially performs a kind of kernel smoothing in the embedding space.
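That observation fits in a few lines of numpy (my paraphrase of the post's point, not the author's code): each row of softmax(Q K^T / sqrt(d)) is a set of nonnegative weights summing to one, so the attention output is a kernel-weighted average of the value vectors.

```python
import numpy as np

def attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)  # similarity kernel in embedding space
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V  # weighted average of values: a kernel smoother

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 8)) for _ in range(3))
out = attention(Q, K, V)  # each output row is a smoothed combination of V's rows
```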