My summaries and notes from some research papers I found interesting.
This paper introduces Generative Adversarial Networks (GANs), a new way to generate samples from a learned distribution. Fundamentally it learns a mapping from a simple noise distribution (e.g., a Gaussian) to the real data distribution.
The main idea is to set up a game between two neural networks. The first is a "Generator" that takes random noise as input and tries to produce realistic-looking data, like an art forger. The second is a "Discriminator" that's trained to tell the difference between real data from the training set and the fake data from the Generator, like an art critic. They are trained at the same time: the Generator gets better by fooling the Discriminator, and the Discriminator gets better by catching the Generator's fakes. The main contribution is this adversarial framework itself. It's a clever idea because it turns the hard problem of generative modeling into a more straightforward supervised learning problem that can be trained with standard backpropagation, getting rid of the slow and complex methods like Markov chains that older models needed.
The competition between the discriminator ($D$) and the generator ($G$) is formalized by a value function $V(D, G)$ that $D$ tries to maximize and $G$ tries to minimize:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$

The rest of the math is well covered in the paper.
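To make the adversarial setup concrete, here is a minimal sketch of the alternating training loop in PyTorch. The networks `G` and `D`, the data loader, and the optimizers are assumptions for illustration; the generator update uses the non-saturating loss (maximize $\log D(G(z))$) that the paper recommends in practice to avoid vanishing gradients early in training.

```python
import torch
import torch.nn.functional as F

# Assumed to exist: generator G(z) -> image, discriminator D(x) -> logit,
# a real-image dataloader, and optimizers opt_g / opt_d.
def train_step(G, D, real_images, opt_g, opt_d, z_dim=100):
    batch = real_images.size(0)
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

    # --- Discriminator step: maximize log D(x) + log(1 - D(G(z))) ---
    z = torch.randn(batch, z_dim)
    fake = G(z).detach()                       # don't backprop into G here
    d_loss = F.binary_cross_entropy_with_logits(D(real_images), ones) + \
             F.binary_cross_entropy_with_logits(D(fake), zeros)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # --- Generator step: non-saturating trick, maximize log D(G(z)) ---
    z = torch.randn(batch, z_dim)
    g_loss = F.binary_cross_entropy_with_logits(D(G(z)), ones)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```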
The really neat idea is how this game provides a mechanism for transforming a simple noise distribution into the complex data distribution.
Unfortunately, training it is very unstable. You have to carefully balance the Generator and Discriminator. If the Discriminator gets too good too quickly, the Generator gets no useful feedback (vanishing gradients) and stops learning. This instability often leads to "mode collapse," where the Generator finds a few easy-to-make samples that can fool the Discriminator and just produces those over and over, failing to learn the full variety of the training data. Another significant issue is that the model doesn't give you a way to calculate the probability of a given sample, which makes it hard to quantitatively evaluate how good the model actually is.
This paper introduces DCGAN, a set of architectural guidelines to make deep convolutional GANs stable to train, which was a major problem at the time. Their approach gets rid of pooling layers, using strided convolutions in the discriminator and fractional-strided convolutions (or "deconvolutions") in the generator to handle downsampling and upsampling. They also remove fully connected layers and add Batch Normalization to most layers, which they found was critical for training deep models. The main contribution is providing a reliable blueprint for building GANs that can generate high-resolution, realistic images.
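A minimal sketch of a DCGAN-style generator following these guidelines: fractional-strided (transposed) convolutions for upsampling, BatchNorm on most layers, no pooling or fully connected hidden layers. The specific channel counts and the 64 × 64 output size are my assumptions, not values copied from the paper.

```python
import torch.nn as nn

# DCGAN-style generator: take z as a (batch, z_dim, 1, 1) tensor and repeatedly
# double the spatial resolution with transposed convolutions. BatchNorm on every
# layer except the output; ReLU inside, Tanh on the output.
def dcgan_generator(z_dim=100, ngf=64, out_channels=3):
    return nn.Sequential(
        nn.ConvTranspose2d(z_dim, ngf * 8, 4, 1, 0, bias=False),    # 1x1  -> 4x4
        nn.BatchNorm2d(ngf * 8), nn.ReLU(True),
        nn.ConvTranspose2d(ngf * 8, ngf * 4, 4, 2, 1, bias=False),  # 4x4  -> 8x8
        nn.BatchNorm2d(ngf * 4), nn.ReLU(True),
        nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, 2, 1, bias=False),  # 8x8  -> 16x16
        nn.BatchNorm2d(ngf * 2), nn.ReLU(True),
        nn.ConvTranspose2d(ngf * 2, ngf, 4, 2, 1, bias=False),      # 16x16 -> 32x32
        nn.BatchNorm2d(ngf), nn.ReLU(True),
        nn.ConvTranspose2d(ngf, out_channels, 4, 2, 1, bias=False), # 32x32 -> 64x64
        nn.Tanh(),
    )
```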
Beyond just generation, they demonstrate that the network learns meaningful, hierarchical representations from unlabeled data. They prove this by using the discriminator's learned features to achieve competitive results on classification benchmarks like CIFAR-10 and SVHN, and by showing that vector arithmetic on the generator's input noise vector (e.g., 'smiling woman' - 'neutral woman' + 'neutral man' results in a 'smiling man' image) produces semantically meaningful changes in the output.
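A tiny sketch of the vector-arithmetic experiment. As the note below mentions, the authors average the latent vectors of several exemplar images per concept rather than using single samples; the generator `G` and the exemplar latent arrays are assumptions.

```python
import numpy as np

# Assumed: z_smiling_woman, z_neutral_woman, z_neutral_man are arrays of shape
# (k, z_dim), each holding the latent codes of k exemplar images per concept.
def smiling_man_vector(z_smiling_woman, z_neutral_woman, z_neutral_man):
    # Average each concept first (single examples were too unstable), then
    # do the arithmetic in latent space: smiling - neutral + man.
    return (z_smiling_woman.mean(axis=0)
            - z_neutral_woman.mean(axis=0)
            + z_neutral_man.mean(axis=0))

# image = G(smiling_man_vector(...))  # decode the result with the generator
```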
This doesn't completely solve training instability. The primary failure mode is a partial or full "mode collapse," where the generator learns to produce a very limited variety of samples, essentially getting stuck in a rut and ignoring the full diversity of the training data. This can happen as training progresses, with the model suddenly collapsing to producing nonsensical or repetitive images.
The generated images themselves, while impressive for their time, still show artifacts and signs of under-fitting, such as repeating noise textures across different samples. Furthermore, while the vector arithmetic on the latent space is a powerful demonstration, the authors note it was unstable when using single examples and required averaging the latent vectors from multiple samples to work reliably, indicating the learned manifold isn't perfectly smooth or linear.
This paper tackles the problem that GANs are notoriously unstable and difficult to train. They introduce a collection of five practical techniques to make the training process more stable and prevent common failure modes. The key ideas are feature matching, which changes the generator's objective to match the statistics of real data inside the discriminator's network instead of just trying to fool its final output, and minibatch discrimination, which allows the discriminator to look at a whole batch of samples at once to prevent the generator from collapsing and producing the same image over and over. They combine these with smaller tweaks like historical averaging, one-sided label smoothing, and virtual batch normalization (VBN), a variant of batch norm that normalizes each example against a fixed reference batch.
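A rough sketch of the feature-matching objective: the generator tries to match the batch mean of an intermediate discriminator activation on real versus generated data, rather than directly maximizing the discriminator's confusion. The helper `d_features` (the discriminator truncated at some intermediate layer) is an assumption.

```python
import torch

def feature_matching_loss(d_features, real_images, fake_images):
    """Generator loss: match the first-moment statistics of an intermediate
    discriminator layer on real vs. generated batches (feature matching)."""
    f_real = d_features(real_images).mean(dim=0)   # average over the batch
    f_fake = d_features(fake_images).mean(dim=0)
    return torch.norm(f_real.detach() - f_fake, p=2) ** 2
```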
This works, but it's a bag of tricks rather than a single, theoretically-grounded solution. The authors admit their contributions are practical and lack a rigorous theoretical understanding of why this combination of heuristics leads to convergence. It's engineering, not a fundamental insight into the underlying game theory.
This paper presents InfoGAN, a clever extension to Generative Adversarial Networks (GANs) designed to learn interpretable features from data without any labels. The core problem with a standard GAN is that its latent space is a tangled mess; the generator can use the input noise vector in any way it wants, so individual dimensions of $z$ rarely correspond to meaningful features. InfoGAN's main approach is to fix this by enforcing a clear structure. It splits the input to the generator into two parts: the standard incompressible noise $z$, and a new vector of "latent codes" $c$. The goal is to force $c$ to represent the salient, semantic features of the data (like digit type, rotation, etc.). To do this, they add a regularization term to the GAN objective that maximizes the mutual information between the latent codes $c$ and the generated images $G(z, c)$.
Mutual information, $I(X; Y)$, measures the reduction in uncertainty about a variable $X$ after observing $Y$. It's defined as $I(X; Y) = H(X) - H(X|Y)$, where $H(X)$ is the entropy (uncertainty) of $X$. Maximizing $I(c; G(z, c))$ means that if you see a generated image, you should have very little uncertainty about the code $c$ that produced it. The problem is that calculating the conditional entropy $H(c|x)$ requires knowing the posterior probability $P(c|x)$, which is intractable. To get around this, they use a technique called Variational Information Maximization. They introduce an auxiliary network, $Q(c|x)$, to approximate the true posterior. This allows them to derive a tractable lower bound on the mutual information: $L_I(G, Q) = \mathbb{E}_{c \sim P(c),\, x \sim G(z, c)}[\log Q(c|x)] + H(c) \le I(c; G(z, c))$. Maximizing this lower bound pushes the generator to produce images where the code $c$ is easily recoverable by the $Q$ network. In practice, $Q$ is implemented efficiently by sharing all the convolutional layers with the discriminator $D$, adding only a final, separate fully-connected layer that outputs the predicted parameters for the distribution of $c$ (e.g., softmax probabilities for a categorical code, or mean and standard deviation for a continuous Gaussian code).
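A minimal sketch of how the lower bound becomes a loss in practice for a categorical code: since $H(c)$ is constant with respect to the networks, maximizing the bound reduces to minimizing a cross-entropy between the code fed to the generator and the $Q$ head's prediction, weighted by the hyperparameter $\lambda$. The function and variable names are assumptions.

```python
import torch
import torch.nn.functional as F

def info_loss(q_logits, c_true, lam=1.0):
    """Variational MI lower bound for a categorical latent code.
    q_logits: output of the Q head for a batch of generated images G(z, c).
    c_true:   the integer codes c that were fed to the generator.
    H(c) is constant, so maximizing the bound = minimizing -E[log Q(c|x)]."""
    return lam * F.cross_entropy(q_logits, c_true)

# Training loop sketch: sample c ~ Cat(K) and z ~ N(0, I), generate x = G(z, c),
# run x through the shared D trunk plus the Q head, and add info_loss to the
# usual generator loss; both G and Q are updated to minimize it.
```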
This produces a simple, elegant, and fully unsupervised method for learning disentangled representations. By just maximizing this information-theoretic term, InfoGAN successfully discovers semantic features like digit identity versus writing style on MNIST, and pose, lighting, and even the presence of glasses on more complex face datasets, all without a single label. This was a significant step beyond previous methods that required some form of weak supervision.
However, the approach has significant problems from an engineering perspective. First, its reliability is questionable. The paper admits that for the 3D faces and chairs datasets, they presented the best results from 5 random runs, which implies that in a typical run, the model may fail to disentangle the desired factors or may focus on irrelevant features. This makes the process feel more like a lucky discovery than a robust engineering tool. Second, the model's success is demonstrated on datasets with a single, relatively centered object. It's not clear how the concept of a few latent codes controlling global semantics would scale to complex, multi-object scenes where "disentanglement" is a much more ambiguous and difficult problem. Finally, like all GANs, it can be unstable to train, and the hyperparameter $\lambda$ that balances the information loss adds another tuning knob that can be sensitive, especially when mixing continuous and discrete codes.
Coupled Generative Adversarial Networks (CoGANs) generate corresponding image pairs from two different domains, like color and depth images, without ever seeing paired examples during training. The main approach is to use two separate GANs, one for each domain, but the two are "coupled" by sharing weights. The first few layers of the two generator networks share the same weights, and the last few layers of the discriminator networks also share weights. The idea is that the shared generator layers learn a common high-level representation (e.g., "a face looking left") from a shared random input vector, while the later, unshared layers learn to render that concept into a specific domain (e.g., a color photo or a depth map). They demonstrate that this simple architectural constraint is enough to force the model to learn the joint distribution from two separate, unpaired datasets. This is a big deal because collecting paired data is often expensive or impossible.
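A sketch of the weight-sharing idea: one shared trunk decodes the noise vector into a domain-agnostic concept, and each domain gets its own unshared rendering head. The layer sizes, fully connected layout, and 28 × 28 output are assumptions for illustration, not the paper's exact architecture.

```python
import torch.nn as nn

class CoupledGenerators(nn.Module):
    """Two generators that share their early (high-level) layers and differ
    only in the final, domain-specific layers, as in CoGAN."""
    def __init__(self, z_dim=100, hidden=1024, out_dim=28 * 28):
        super().__init__()
        # Shared layers: decode the noise into a shared high-level representation.
        self.shared = nn.Sequential(
            nn.Linear(z_dim, hidden), nn.ReLU(True),
            nn.Linear(hidden, hidden), nn.ReLU(True),
        )
        # Unshared heads: render that concept into each specific domain.
        self.head_a = nn.Sequential(nn.Linear(hidden, out_dim), nn.Tanh())
        self.head_b = nn.Sequential(nn.Linear(hidden, out_dim), nn.Tanh())

    def forward(self, z):
        h = self.shared(z)
        return self.head_a(h), self.head_b(h)  # a corresponding image pair
```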
This paper introduces a new way to train GANs, called progressive growing. Instead of training a massive network to generate 1024 × 1024 images from scratch, they begin with tiny 4 × 4 images. Once the network gets good at that, they add new layers to both the generator and discriminator to double the resolution to 8 × 8, smoothly fading in the new layers to not disrupt the already learned features. They repeat this process - stabilize, add layers, fade in - until they reach the final high resolution. By building in stages, from coarse structure to fine details, they were able to generate high-resolution faces that were state of the art at the time.
Beyond the core progressive growing method, the authors introduce a few other key tricks that are crucial for stability. To increase variation in the generated images, they add a "minibatch standard deviation" layer, which calculates the standard deviation of features across a batch and feeds it as an extra channel to the discriminator. This gives the discriminator a clue about the variety of the whole batch. They also use two novel normalization techniques to stop the generator and discriminator from getting into an escalating fight: an "equalized learning rate" that dynamically scales weights to ensure all parts of the network learn at the same speed, and a "pixelwise feature vector normalization" in the generator to prevent signal magnitudes from exploding. The most significant problem is that these components are not optional. The progressive growing idea alone is not enough; it fails badly with the small minibatches required for high-resolution training. You need the whole package of tricks for the method to work, making it a bit of a complex, multi-part solution rather than a single elegant fix.
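A rough sketch of the minibatch standard deviation layer: compute the per-feature standard deviation across the batch, average it down to a single scalar, and append it as a constant extra feature map so the discriminator can see how varied the minibatch is. This is a simplified version (the paper's layer can also operate over groups); NCHW shapes are assumed.

```python
import torch

def minibatch_stddev(x, eps=1e-8):
    """x: (N, C, H, W) discriminator features. Returns x with one extra channel
    holding the average across-batch standard deviation."""
    std = torch.sqrt(x.var(dim=0, unbiased=False) + eps)   # (C, H, W)
    mean_std = std.mean()                                   # one scalar for the batch
    extra = mean_std.view(1, 1, 1, 1).expand(x.size(0), 1, x.size(2), x.size(3))
    return torch.cat([x, extra], dim=1)                     # (N, C+1, H, W)
```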
The smooth fade-in of new layers is critical to avoid destabilizing the network. When transitioning from one resolution to the next (e.g., 16 × 16 to 32 × 32), the output is a convex combination of the old, upscaled layer and the new layer, controlled by a parameter $\alpha$ that increases linearly from 0 to 1. The generator's output image is formed as:

$$G_{\text{out}} = (1 - \alpha) \cdot \mathrm{up}(x_{\text{old}}) + \alpha \cdot x_{\text{new}}$$
where $x_{\text{old}}$ is the feature map from the final layer of the previous resolution (upsampled and projected to RGB) and $x_{\text{new}}$ is the output of the newly added layer. A similar blending is applied to the real images fed to the discriminator. To prevent signal escalation in the generator, they apply pixelwise normalization after each convolutional layer. For a feature vector $a_{x,y}$ at pixel $(x, y)$ with $N$ channels, the normalized feature vector $b_{x,y}$ is calculated as:

$$b_{x,y} = \frac{a_{x,y}}{\sqrt{\frac{1}{N} \sum_{j=0}^{N-1} \left(a_{x,y}^{j}\right)^{2} + \epsilon}}$$
where $\epsilon$ is a small constant (the paper uses $10^{-8}$) to ensure numerical stability. This forces the feature vectors at each pixel to have unit length.
The paper introduces StyleGAN, a new generator architecture that borrows ideas from style transfer to gain more control over the image synthesis process. Instead of feeding a latent code $z$ directly into the generator network, they first pass it through a mapping network to produce an intermediate latent code $w$. This code is then used to control the "style" (mean and variance) of the feature maps at each resolution level of the generator using a mechanism called Adaptive Instance Normalization (AdaIN). The generator starts from a learned constant input, not the latent code, and also receives explicit noise at each layer. The main contribution is that this architecture automatically separates high-level attributes like pose and identity (controlled by $w$) from stochastic features like hair placement and freckles (controlled by the noise inputs). This allows for intuitive, scale-specific editing by mixing styles from different $w$ codes.
To prove their architecture is better, the authors introduce two new metrics: Perceptual Path Length (PPL) and linear separability. These metrics show that their intermediate latent space $\mathcal{W}$ is more disentangled and less "curved" than the traditional input latent space $\mathcal{Z}$. However, the approach isn't perfect. The paper shows that the mixing regularization they use to improve style localization actually makes the latent space slightly more entangled (higher PPL). More significantly, they observe that as training progresses and image quality (FID) improves, the latent space entanglement (PPL) tends to get worse, indicating a fundamental trade-off between image fidelity and the linearity of the latent space.
AdaIN is the core mechanism that allows the intermediate latent code $w$ to control the synthesis network. For each feature map in a given convolutional layer, AdaIN first normalizes it, wiping out its mean and variance. Then, it applies new scale ($y_s$) and bias ($y_b$) parameters that are derived from the style code $w$.
The operation is defined as:

$$\mathrm{AdaIN}(x_i, y) = y_{s,i}\,\frac{x_i - \mu(x_i)}{\sigma(x_i)} + y_{b,i}$$

where $x_i$ is the $i$-th feature map, $\mu(x_i)$ and $\sigma(x_i)$ are its mean and standard deviation, and $y_{s,i}$, $y_{b,i}$ are the scale and bias for that feature map, produced from $w$ by a learned affine transformation.
In short, AdaIN resets the style of each feature map and then imposes a new style based on $w$. This is done at each resolution level, giving $w$ control over the entire synthesis process.
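A minimal sketch of AdaIN as used here: instance-normalize each feature map, then apply a per-channel scale and bias produced from $w$ by a learned affine layer. The module name, shapes, and epsilon are assumptions.

```python
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    """Adaptive Instance Normalization driven by a style code w."""
    def __init__(self, w_dim, num_channels, eps=1e-8):
        super().__init__()
        self.affine = nn.Linear(w_dim, 2 * num_channels)  # learned A: w -> (y_s, y_b)
        self.eps = eps

    def forward(self, x, w):
        # x: (N, C, H, W) feature maps, w: (N, w_dim) style code
        y_s, y_b = self.affine(w).chunk(2, dim=1)          # each (N, C)
        y_s = y_s[:, :, None, None]
        y_b = y_b[:, :, None, None]
        mu = x.mean(dim=(2, 3), keepdim=True)              # per-map mean
        sigma = x.std(dim=(2, 3), keepdim=True) + self.eps # per-map std
        return y_s * (x - mu) / sigma + y_b                # wipe the old style, impose the new one
```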
The key idea behind StyleGAN's improved disentanglement is the separation of the input latent space $\mathcal{Z}$ from the intermediate latent space $\mathcal{W}$. Because $\mathcal{W}$ is not forced to follow a fixed prior distribution, the learned mapping network can "unwarp" it so that the factors of variation become more linear.
PPL is a metric designed to measure the "curvedness" of the latent space. The intuition is that if you interpolate between two latent codes, a small step in the latent space should correspond to a small, perceptually linear change in the output image. If the image changes dramatically or features pop in and out unexpectedly, the space is entangled.
PPL measures the average perceptual distance between images generated from close latent codes along an interpolation path. For the intermediate space it's calculated as:

$$l_{\mathcal{W}} = \mathbb{E}\left[\frac{1}{\epsilon^{2}}\, d\!\left(g(\mathrm{lerp}(w_1, w_2; t)),\, g(\mathrm{lerp}(w_1, w_2; t + \epsilon))\right)\right]$$

where $g$ is the synthesis network, $d(\cdot, \cdot)$ is a perceptual (VGG16-based) distance between the two images, $w_1, w_2$ are latent codes from the mapping network, $t \sim U(0, 1)$ is a random position along the path, and $\epsilon$ is a small step ($10^{-4}$ in the paper). The corresponding measurement in $\mathcal{Z}$ uses spherical interpolation (slerp) instead of lerp.
A lower PPL score means the latent space is perceptually smoother and less entangled.
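A sketch of how PPL could be estimated by sampling: interpolate between pairs of latent codes, take a tiny $\epsilon$ step along the path, and measure the perceptual distance between the two generated images, scaled by $1/\epsilon^2$. The synthesis network `g` and `perceptual_distance` (a VGG/LPIPS-style metric) are assumed to exist.

```python
import torch

def ppl_estimate(g, perceptual_distance, w_pairs, eps=1e-4):
    """w_pairs: list of (w1, w2) latent-code tensors from the mapping network.
    Returns the mean scaled perceptual distance at random points on the
    linear interpolation path (full-path PPL, t ~ U(0, 1))."""
    scores = []
    for w1, w2 in w_pairs:
        t = torch.rand(()).item()                     # random position on the path
        wa = torch.lerp(w1, w2, t)                    # lerp(w1, w2; t)
        wb = torch.lerp(w1, w2, min(t + eps, 1.0))    # lerp(w1, w2; t + eps)
        d = perceptual_distance(g(wa), g(wb))
        scores.append(d / eps ** 2)                   # scale by 1/eps^2
    return torch.stack([torch.as_tensor(s) for s in scores]).mean()
```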
This metric quantifies how well a single factor of variation (e.g., "has glasses") corresponds to a linear direction in the latent space. If the space is well-disentangled, you should be able to find a simple hyperplane that separates latent codes based on a specific semantic attribute.
The process is:
1. Generate a large number of images and label a binary attribute (e.g., "smiling") on each one using a pre-trained auxiliary classifier, keeping only the half of the samples the classifier is most confident about.
2. Fit a linear SVM to predict that same attribute directly from the latent code ($z$ or $w$).
3. Compute the conditional entropy $H(Y|X)$, where $X$ is the class predicted by the SVM and $Y$ is the label from the auxiliary classifier; it measures how much extra information is needed to pin down the true class once the latent code's side of the hyperplane is known.
The final score is calculated as:

$$\exp\left(\sum_i H(Y_i \mid X_i)\right)$$

where the sum runs over the attributes considered (40 CelebA attributes in the paper).
A lower score indicates less uncertainty, meaning the linear SVM is a good separator. This implies the latent space has more consistent linear directions for different attributes, and is therefore more disentangled.
The GraphRAG paper proposes a graph-based RAG approach that builds a knowledge graph from documents using an LLM to extract entities and relationships, then uses community detection to partition the graph into hierarchical clusters. The system pre-generates summaries for each community of related entities, and at query time uses map-reduce to generate partial answers from relevant community summaries before combining them into a final global answer. The main contribution is enabling "global sensemaking" queries that require understanding an entire corpus rather than just retrieving specific facts, addressing a key limitation of traditional vector RAG systems. They also introduce an adaptive benchmarking approach using LLM-generated personas and tasks to create corpus-specific evaluation questions.
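A rough sketch of the query-time map-reduce flow described above. The helpers (`llm`, `relevant_communities`) and the prompt strings are assumptions; the real system also scores, filters, and ranks the partial answers before the reduce step.

```python
def global_query(question, community_summaries, llm, top_k=20):
    """Map: answer the question against each relevant community summary.
    Reduce: combine the partial answers into one global answer."""
    # Pick which community summaries to consult (e.g., by hierarchy level / relevance).
    selected = relevant_communities(question, community_summaries, top_k)

    # Map step: one partial answer per community summary.
    partials = [
        llm(f"Using only this community summary:\n{summary}\n\n"
            f"Answer the question: {question}")
        for summary in selected
    ]

    # Reduce step: merge the partial answers into the final global answer.
    return llm("Combine these partial answers into a single, coherent answer "
               f"to the question '{question}':\n\n" + "\n\n".join(partials))
```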
The most significant problem is the high computational cost - graph indexing took 281 minutes for the podcast dataset using GPT-4-turbo, making it expensive for large corpora. Worse, this cost must be paid again whenever new documents are added to the corpus, because the community structure may change and the whole index has to be rebuilt.
HippoRAG basically chunks the documents (the paper calls these "passages"), extracts simple triples from each passage, puts them into one big KG, and stores embeddings of the node names along with which chunk each node came from. At query time, entities are extracted from the query, their names are embedded and vector-compared against the KG node names, the closest matches are selected as the starting nodes for Personalized PageRank (PPR), and the passages linked to the highest-ranked PPR nodes are the passages handed to the LLM.
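A sketch of that retrieval step using networkx-style Personalized PageRank. The entity extraction, the embedding function, and the node-to-passage mapping are assumed helpers, and the real system additionally weights seed nodes (e.g., by node specificity).

```python
import numpy as np
import networkx as nx

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def hipporag_retrieve(query, kg, node_embeddings, node_to_passages,
                      embed, extract_entities, top_k=5):
    """kg: networkx graph of KG nodes. node_to_passages: node -> chunk ids."""
    # 1. Extract query entities and match them to KG nodes by embedding similarity.
    seeds = {}
    for ent in extract_entities(query):
        q_vec = embed(ent)
        best = max(kg.nodes, key=lambda n: cosine(q_vec, node_embeddings[n]))
        seeds[best] = 1.0                          # personalization mass on seed nodes

    # 2. Personalized PageRank starting from the seed nodes.
    scores = nx.pagerank(kg, personalization=seeds)

    # 3. Score passages by the PPR mass of the nodes linked to them.
    passage_scores = {}
    for node, score in scores.items():
        for pid in node_to_passages.get(node, ()):
            passage_scores[pid] = passage_scores.get(pid, 0.0) + score
    return sorted(passage_scores, key=passage_scores.get, reverse=True)[:top_k]
```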
The main contribution is enabling single-step multi-hop retrieval that integrates information across passage boundaries, solving complex questions that require connecting disparate facts. The most significant problem is that their entity-centric design creates a concept-context tradeoff - the system focuses heavily on named entities and concepts while ignoring contextual cues, which accounts for 48% of errors in their analysis. Actually I wrote a whole blog post about this. Also, the PPR part is weak: it just diffuses probability around the graph, so if the graph is too densely connected or has too many useless nodes, the walk gets diluted and retrieval quality drops.
This paper introduces HippoRAG 2, a RAG framework that aims to better mimic human long-term memory for large language models. The main approach builds on the original HippoRAG by enhancing its PPR-based retrieval with three key improvements: passages are added to the knowledge graph as nodes alongside the phrase nodes (integrating dense and sparse retrieval), queries are matched against whole triples rather than just entity names for deeper contextualization, and an LLM filters the candidate seed triples before PPR (a "recognition memory" step).
The system constructs an open knowledge graph offline by extracting triples from passages, then during online retrieval it uses embedding models to find seed nodes, filters them with an LLM, and runs PPR to retrieve the most relevant passages for question answering.
The main contribution is achieving comprehensive performance across three types of memory tasks - factual (simple QA), sense-making (discourse understanding), and associative (multi-hop QA) - something previous structure-augmented RAG methods failed to do.
This paper argues that current graph-based RAG systems are too noisy because they retrieve entire communities or all immediate neighbors of relevant nodes, flooding the LLM with redundant or irrelevant information. Their solution, PathRAG, takes a more focused approach. It first identifies a handful of key nodes in the knowledge graph based on the query. Then, instead of grabbing everything around those nodes, it specifically finds the most important paths that connect them. The main contribution is a "flow-based pruning" algorithm that efficiently discovers these paths and assigns them a reliability score. This allows the system to feed the LLM a much cleaner, more focused context that explicitly shows the relationships between key entities, which they claim leads to more logical and coherent answers.
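The paper's flow-based pruning has its own specifics that aren't spelled out here, so the following is only a rough illustration of the general idea: find short connecting paths between the query's key nodes, give each path a reliability score that decays with every hop, and keep only the best ones. The decay constant, the hop cap, and the use of `all_simple_paths` are my assumptions, not the authors' algorithm.

```python
import networkx as nx
from itertools import combinations

def pruned_paths(kg, key_nodes, decay=0.8, max_hops=4, keep=10):
    """Collect short paths connecting pairs of query-relevant nodes and keep the
    most 'reliable' ones, where reliability decays with path length."""
    scored = []
    for a, b in combinations(key_nodes, 2):
        try:
            for path in nx.all_simple_paths(kg, a, b, cutoff=max_hops):
                reliability = decay ** (len(path) - 1)   # longer path, lower score
                scored.append((reliability, path))
        except nx.NodeNotFound:
            continue
    # Only the top-scoring paths (rendered as relation chains) are fed to the LLM.
    return [p for _, p in sorted(scored, key=lambda x: x[0], reverse=True)[:keep]]
```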
Of course this relies heavily on the quality of the initial indexing graph. The paper doesn't cover how this graph is built, and if the entity and relationship extraction is poor, the whole system fails. Garbage in, garbage out. Secondly, their path-finding algorithm is a simple, non-learning heuristic. While fast, it might miss more complex or semantically relevant paths that a trainable model could identify.