Vibe Blending creates coherent hybrids that merge the relevant shared attributes between images.
Our method identifies and blends the relevant visual attributes—the "vibe"—that connect distinct concepts.
Example: When blending a map and a chicken:
We illustrate our method using a 2D latent space example. Our approach curves around the non-linear manifold.
(1) Example Data Manifold
(2) Path Finding Process
(3) How Gradient of Ncut Vectors Guides Path Finding
We construct a graph diffusion space that captures the manifold's intrinsic geometry. Path finding proceeds in two steps: (1) linearly interpolate Ncut vectors \(\mathbf{\Psi}_t(\mathbf{x}_\alpha) = (1-\alpha)\mathbf{\Psi}_t(\mathbf{x}_A) + \alpha\mathbf{\Psi}_t(\mathbf{x}_B)\), then (2) use inverse mapping to recover the corresponding point in the original space: $$ \gamma(\alpha) = \arg\min_{\mathbf{x^*}} \big\| \mathbf{\Psi}_t(\mathbf{x^*}) - \mathbf{\Psi}_t(\mathbf{x}_\alpha) \big\|_2^2 $$
Ncut vectors (eigenvectors \(\mathbf{\Psi}\)) are computed before encoder training. In this 2D example, step (2) uses gradient descent (visualized in (3)) because the gradient \(\nabla_{\mathbf{x}^*}\mathbf{\Psi}_t(\mathbf{x}^*)\) is directly computable. For real images, we instead use the learned Vibe Space encoder-decoder to obtain the optimal point.
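The two-step procedure for the 2D toy setting can be sketched as follows. The data, the RBF kernel bandwidth, and the kernel-smoothing extension of \(\mathbf{\Psi}\) to off-sample points are illustrative assumptions, not the paper's exact construction:

```python
import torch

torch.manual_seed(0)
X = torch.randn(200, 2)          # toy 2D point cloud (placeholder data)
sigma = 0.5                      # RBF kernel bandwidth (assumed)

def affinity(A, B):
    return torch.exp(-torch.cdist(A, B) ** 2 / (2 * sigma ** 2))

# Ncut eigenvectors: smallest nontrivial modes of the normalized Laplacian
W = affinity(X, X)
d = W.sum(1)
L_sym = torch.eye(len(X)) - W / torch.sqrt(d[:, None] * d[None, :])
_, evecs = torch.linalg.eigh(L_sym)
Psi = evecs[:, 1:6]              # drop the trivial eigenvector, keep m = 5

def psi(x):
    """Extend the sample eigenvectors to an arbitrary point x by
    kernel-weighted averaging (a simplified Nystrom-style extension)."""
    w = affinity(x[None, :], X)
    return (w / w.sum() @ Psi).squeeze(0)

# Step (1): linear interpolation of Ncut vectors
xA, xB, alpha = X[0], X[1], 0.5
psi_target = (1 - alpha) * psi(xA) + alpha * psi(xB)

# Step (2): gradient descent for the inverse mapping gamma(alpha),
# since psi is differentiable with respect to x in this 2D example
x_star = ((1 - alpha) * xA + alpha * xB).clone().requires_grad_(True)
opt = torch.optim.Adam([x_star], lr=0.05)
for _ in range(300):
    opt.zero_grad()
    loss = ((psi(x_star) - psi_target) ** 2).sum()
    loss.backward()
    opt.step()
```

The recovered `x_star` is the point on the manifold whose Ncut embedding best matches the interpolated embedding, i.e. \(\gamma(\alpha)\).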
Forward Mapping · Inverse Mapping · Application to Real Images
Graph Laplacian eigenvectors capture geometry at different scales: leading eigenvectors describe global structure, while higher-order eigenvectors encode local variations. Truncating to a fixed number \(m\) of eigenvectors selects one scale, leading to paths that focus on too many or too few attributes.
We use a flag space, a hierarchy of nested embeddings \(\mathbf{\Psi}^{1:m_1} \subset \mathbf{\Psi}^{1:m_2} \dots \subset \mathbf{\Psi}^{1:m_M}\), to encapsulate both coarse and fine manifold structures. The multi-scale path is recovered by minimizing the average reconstruction error across scales: $$ \gamma(\alpha) = \arg\min_{\mathbf{x^*}} \frac{1}{|\mathcal{M}|} \sum_{m_k \in \mathcal{M}} \big\| \mathbf{\Psi}_t^{1:m_k}(\mathbf{x^*}) - \mathbf{\Psi}_t^{1:m_k}(\mathbf{x}_\alpha) \big\|_2^2 $$ This enforces consistency across global and local geometry when finding paths between points.
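A minimal sketch of the multi-scale objective, assuming the Ncut vectors are given as plain tensors and taking \(\mathcal{M} = \{2, 4, 8\}\) as an example set of nested scales:

```python
import torch

def flag_loss(psi_x, psi_target, scales=(2, 4, 8)):
    """Average reconstruction error over nested truncations Psi^{1:m_k}.
    Small scales match global structure; large scales match local detail."""
    errs = [((psi_x[:m] - psi_target[:m]) ** 2).sum() for m in scales]
    return torch.stack(errs).mean()

# Hypothetical 8-dimensional Ncut embeddings of x* and x_alpha
loss = flag_loss(torch.zeros(8), torch.ones(8))  # averages 2.0, 4.0, 8.0 -> 14/3
```

Minimizing this loss over `psi_x` (via the candidate point \(\mathbf{x}^*\)) enforces agreement at every scale simultaneously, rather than at one arbitrary truncation.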
Unlike discrete shortest-path algorithms such as Dijkstra's, which operate on a fixed graph and can only route paths through existing nodes, our graph-diffusion approach is robust to leaks and outliers in the data manifold.
Dijkstra's algorithm fails when data contains noise and leaks because it relies on direct graph connectivity. A single noisy edge or outlier can create spurious shortcuts that lead the algorithm astray. In contrast, graph diffusion captures the global structure of the manifold through spectral decomposition, effectively filtering out local noise and outliers.
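This robustness argument can be made concrete with diffusion-map coordinates; the sketch below is a generic diffusion-maps construction, not the paper's exact pipeline:

```python
import numpy as np

def diffusion_coords(W, t=3, m=4):
    """Diffusion-map coordinates from an affinity matrix W. Euclidean
    distance between rows approximates the t-step diffusion distance,
    which averages over all paths instead of trusting any single edge."""
    d = W.sum(1)
    S = W / np.sqrt(np.outer(d, d))        # symmetrized transition matrix
    evals, evecs = np.linalg.eigh(S)
    order = np.argsort(evals)[::-1]        # largest eigenvalues first
    evals, evecs = evals[order], evecs[:, order]
    psi = evecs / np.sqrt(d)[:, None]      # right eigenvectors of D^-1 W
    return (evals[1:m + 1] ** t) * psi[:, 1:m + 1]  # drop stationary mode
```

Raising the eigenvalues to the power \(t\) shrinks the low-eigenvalue (noisy, local) modes toward zero, which is why a single spurious shortcut edge barely perturbs the embedding, whereas it can completely reroute Dijkstra's algorithm.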
For real images, gradient descent on the full graph is computationally expensive. Instead, we train two lightweight MLP networks to approximate the forward and inverse mappings:
Training takes under 30 seconds and ensures that Euclidean distances and linear paths in Vibe Space correspond to geodesic distances and semantic paths on the underlying manifold. This enables efficient path finding without gradient descent at inference time.
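The encoder-decoder training can be sketched as below, with placeholder features and dimensions; the actual network architectures, feature extractor, and losses may differ:

```python
import torch
import torch.nn as nn

# Placeholder data: N image feature vectors with precomputed Ncut vectors
N, d_feat, m = 64, 768, 8
feats = torch.randn(N, d_feat)   # e.g. backbone features (assumed)
psi = torch.randn(N, m)          # precomputed Ncut eigenvectors (assumed)

enc = nn.Sequential(nn.Linear(d_feat, 256), nn.ReLU(), nn.Linear(256, m))
dec = nn.Sequential(nn.Linear(m, 256), nn.ReLU(), nn.Linear(256, d_feat))
opt = torch.optim.Adam([*enc.parameters(), *dec.parameters()], lr=1e-3)

for _ in range(200):             # a few seconds on CPU at this scale
    opt.zero_grad()
    fwd_loss = ((enc(feats) - psi) ** 2).mean()   # forward mapping
    inv_loss = ((dec(psi) - feats) ** 2).mean()   # inverse mapping
    (fwd_loss + inv_loss).backward()
    opt.step()

# Inference: blend by linear interpolation in Vibe Space, no gradient descent
z = 0.5 * enc(feats[:1]) + 0.5 * enc(feats[1:2])
x_blend = dec(z)
```

Once trained, a blend is a single encoder pass, a linear interpolation, and a decoder pass, replacing the per-point gradient-descent optimization used in the 2D example.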
We compute the latent space from only 2-4 image examples rather than the entire training dataset. This context-specific manifold captures the dominant shared attributes—the "vibe"—between the specific images being blended.
Our approach: By constructing the graph Laplacian from only the input images, the leading eigenvectors naturally capture the most relevant attributes for that specific pair.
Alternative approaches (e.g., Yu et al.): Methods that compute the latent space from the entire training dataset create a global manifold encoding all possible relationships. This global perspective cannot identify which attributes are most relevant for blending a specific pair, as it mixes irrelevant relationships from the full dataset.
Left: Local Vibe Space (Few Examples) — focused path capturing dominant attributes
Right: Global Latent Space (All Data) — dense manifold with multiple paths obscuring specific relationships
Vibe Space enables a range of creative applications beyond simple interpolation. Below we demonstrate key capabilities that showcase the flexibility and power of discovering semantic manifolds between visual concepts.
With the discovered vibe, we can extrapolate to nontrivial but related concepts, enabling creative analogies that go beyond simple interpolation. For example, we can morph Leonardo DiCaprio's face into a playing card.
Vibe attributes are implicitly extracted by Vibe Space. The blending pair defines desired vibes, while negative pairs define vibes to suppress. By subtracting the negative vibe, we can control which attributes are blended.
The blending pair defines the desired vibes (rotation + style); the negative pair defines the vibes to suppress (style). Blending without negative examples transfers both attributes; after subtracting the negative vibe, only rotation is blended.
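As an illustration (not the paper's exact formulation), negative-vibe subtraction can be sketched as projecting the undesired direction out of the blend direction in Vibe Space:

```python
import torch

def remove_negative_vibe(v_pos, v_neg):
    """Project the negative-vibe direction out of the blend direction."""
    n = v_neg / v_neg.norm()
    return v_pos - (v_pos @ n) * n

# Hypothetical embeddings: the blend direction mixes rotation and style;
# the negative pair isolates the style direction to suppress.
v_blend = torch.tensor([1.0, 1.0])   # rotation + style (assumed axes)
v_style = torch.tensor([0.0, 2.0])   # style only
v_rotation_only = remove_negative_vibe(v_blend, v_style)  # -> tensor([1., 0.])
```

After the projection, moving along `v_rotation_only` changes rotation while leaving the style component untouched, which matches the behavior described above.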
Vibe Space can extrapolate beyond the input images to generate related concepts. By extending the vibe path, we can create novel visual concepts that maintain semantic coherence.
Although two images suffice to train the Vibe Space and identify the dominant attributes, adding related exemplars can enhance the dominant attributes and suppress spurious ones. This allows for more robust blending when dealing with complex scenes or when certain attributes need to be emphasized.
In this example, adding extra images helps the model learn the dominant attributes (glass texture vs. sand texture) and suppress spurious ones (triangular vs. round shape), so that the blend can create a pyramid with a glass texture.
Vibe Space can blend multiple images simultaneously, discovering shared attributes across multiple concepts.
Cognitive psychology suggests that more creative individuals can connect more distant or weakly linked concepts.
Rather than jumping directly between remote ideas (e.g., apple → house), they move through intermediate associations (e.g., apple → tree → wood → house), following nonlinear paths across clusters of related concepts.
To gauge human perceptions of creativity, we ask raters to compare different image pairs along two related but distinct axes:
We hypothesize that blending conceptually distant pairs involves traversing longer, curved paths in pretrained feature spaces, compared to blending nearby pairs.
To estimate the difficulty of blending two images, we define a Path Nonlinearity Score (PNS):
This PNS metric serves as a computational proxy for conceptual distance: higher PNS means greater blend difficulty.
Negative vibe blending fails when desired and undesired attributes are entangled in feature space. The method assumes vibes can be separated into distinct subspaces, but entangled attributes cannot be cleanly filtered.
Example: Positive inputs capture both style (car type) and color changes. Negative inputs target only color, but style and color are entangled, making separation difficult.
Extrapolating to α > 1 does not always produce meaningful attribute exaggeration; the transformation may not continue as expected, limiting reliable extrapolation.
Vibe Space relies on unsupervised region-level correspondence matching between DINO token clusters. This determines which semantic regions should be merged, but the matching is not always reliable.
Impact: When the correspondence is incorrect (it can vary across random seeds), blends degrade significantly, producing incoherent or mismatched object-part combinations.
Our method depends on IP-Adapter (Stable Diffusion) to generate images from dense CLIP features. However, IP-Adapter reliability varies:
@article{yang2025vibespace,
title={Vibe Spaces for Creatively Connecting and Expressing Visual Concepts},
author={Yang, Huzheng and Xu, Katherine and Lu, Andrew and Grossberg, Michael D. and Bai, Yutong and Shi, Jianbo},
year={2025}
}