Vibe Spaces for Creatively Connecting and Expressing Visual Concepts

1UPenn 2CUNY 3UC Berkeley
Teaser figure

We introduce Vibe Space, a hierarchical graph manifold that learns low-dimensional geodesics in feature spaces like CLIP, enabling smooth and semantically consistent transitions between concepts.

Consider blending a musician playing a violin with one playing a guitar. Different approaches identify different relevant attributes:

  • LLMs (Gemini, GPT) might focus on object parts or style transfer
  • Musicians would attend to the instrument and how it is played

The intuitive process of identifying and fusing meaningful attributes—the "vibe"1—reveals creative connections between distinct concepts.

1 The term "vibe," short for "vibration," originated in 1960s jazz slang to describe the mood or feeling conveyed by music, a person, or space. ↩

Vibe Blending Results

Vibe Blending creates coherent hybrids that merge the relevant shared attributes between images.

Vibe Blending Baseline Comparison

Our method identifies and blends the relevant visual attributes—the "vibe"—that connect distinct concepts.

Example: When blending a map and a chicken:

  • Our method recognizes that the shared "vibe" is shape (not color or texture)
  • Baselines (CLIP Avg, Gemini, GPT) often fail to identify key attributes or fuse them effectively
Blending qualitative results

More Baseline Results

  • AID fails to identify key attributes or fuse them effectively
  • DiffMorpher fails to produce coherent blends, producing blurry images

Vibe Space Method

Intuition

We illustrate our method using a 2D latent space example. Our approach finds paths that curve along the non-linear data manifold instead of cutting straight across it.

Latent Space Holes

(1) Example Data Manifold

(2) Path Finding Process

Path Finding Illustration

(3) How Gradient of Ncut Vectors Guides Path Finding

We construct a graph diffusion space that captures the manifold's intrinsic geometry. Path finding proceeds in two steps: (1) linearly interpolate Ncut vectors \(\mathbf{\Psi}_t(\mathbf{x}_\alpha) = (1-\alpha)\mathbf{\Psi}_t(\mathbf{x}_A) + \alpha\mathbf{\Psi}_t(\mathbf{x}_B)\), then (2) use inverse mapping to recover the corresponding point in the original space: $$ \gamma(\alpha) = \arg\min_{\mathbf{x^*}} \big\| \mathbf{\Psi}_t(\mathbf{x^*}) - \mathbf{\Psi}_t(\mathbf{x}_\alpha) \big\|_2^2 $$

Ncut vectors (eigenvectors \(\mathbf{\Psi}\)) are computed before encoder training. In this 2D example, step (2) uses gradient descent (visualized in (3)) because the gradient \(\nabla_{\mathbf{x}^*}\mathbf{\Psi}_t(\mathbf{x}^*)\) is directly computable. For real images, we instead use the learned Vibe Space encoder-decoder to obtain the optimal point.
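To make steps (1)–(2) concrete, here is a minimal NumPy sketch on a toy 2D arc. The dataset, kernel bandwidth, number of eigenvectors, and diffusion time are illustrative assumptions, and the gradient-descent inversion is replaced by a nearest-neighbour argmin over the data:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
# Toy data manifold: noisy points on a half-circle arc.
theta = rng.uniform(0, np.pi, 300)
X = np.c_[np.cos(theta), np.sin(theta)] + 0.02 * rng.normal(size=(300, 2))

# Affinity graph W and symmetrically normalized graph Laplacian L.
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.exp(-d2 / 0.05)                       # bandwidth 0.05 is an assumption
d = W.sum(1)
L = np.eye(len(X)) - W / np.sqrt(d)[:, None] / np.sqrt(d)[None, :]

# Ncut vectors: Laplacian eigenvectors, weighted by diffusion time t.
vals, vecs = eigh(L)
t, m = 50, 8
Psi = vecs[:, 1:m + 1] * (1 - vals[1:m + 1]) ** t   # skip trivial eigenvector

# (1) Linearly interpolate Psi between endpoints A and B; (2) invert with a
# nearest-neighbour argmin over the data, a discrete stand-in for the
# gradient-descent inversion used in the 2D example.
a, b = np.argmin(theta), np.argmax(theta)
path = []
for alpha in np.linspace(0, 1, 7):
    target = (1 - alpha) * Psi[a] + alpha * Psi[b]
    path.append(X[np.argmin(((Psi - target) ** 2).sum(1))])
path = np.array(path)   # intermediate points hug the arc, not the chord
```

Because the diffusion weighting suppresses high-frequency eigenvectors, the recovered path stays on the arc rather than cutting through the hole in the middle of the latent space.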

Overview

Forward Mapping:

  • (a.) Compute affinity graph W and graph Laplacian L
  • (b.) Compute Ncut eigenvectors Ψ(x) of L as manifold coordinates

Inverse Mapping:

  • (c.) Interpolate in manifold space and invert to recover the path in feature space
Vibe Space Method

Application to Real Images:

  • (d.) Extract DINO patch tokens as graph nodes, compute affinity W
  • (e.) Top m eigenvectors yield co-salient segments and manifold coordinates
  • (f.) Apply inverse mapping to CLIP space, generate images via Stable Diffusion IP-Adapter

Why Flag Space Multi-Scale Loss

Graph Laplacian eigenvectors capture geometry at different scales: leading eigenvectors describe global structure, while higher-order eigenvectors encode local variations. Truncating to a fixed number \(m\) of eigenvectors selects one scale, leading to paths that focus on too many or too few attributes.

We use a flag space, a hierarchy of nested embeddings \(\mathbf{\Psi}^{1:m_1} \subset \mathbf{\Psi}^{1:m_2} \dots \subset \mathbf{\Psi}^{1:m_M}\), to encapsulate both coarse and fine manifold structures. The multi-scale path is recovered by minimizing the average reconstruction error across scales: $$ \gamma(\alpha) = \arg\min_{\mathbf{x^*}} \frac{1}{|\mathcal{M}|} \sum_{m_k \in \mathcal{M}} \big\| \mathbf{\Psi}_t^{1:m_k}(\mathbf{x^*}) - \mathbf{\Psi}_t^{1:m_k}(\mathbf{x}_\alpha) \big\|_2^2 $$ This enforces consistency across global and local geometry when finding paths between points.
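A minimal sketch of the multi-scale objective (the scale set \(\mathcal{M}\) below is an illustrative assumption):

```python
import numpy as np

def multiscale_loss(psi_star, psi_alpha, scales=(2, 8, 32)):
    """Average squared reconstruction error over the nested flag of
    truncations Psi^{1:m_k}: small m_k enforce agreement with coarse
    global geometry, large m_k with fine local variations."""
    return np.mean([np.sum((psi_star[:m] - psi_alpha[:m]) ** 2)
                    for m in scales])

# gamma(alpha) is then the candidate x* minimizing this loss, e.g. the
# argmin of multiscale_loss(Psi[c], psi_alpha) over candidate points c.
```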

Why Path Finding by Graph Diffusion

Unlike discrete shortest-path algorithms such as Dijkstra's, which operate on a fixed graph and route only through existing nodes, our graph-diffusion approach is robust to leaks and outliers in the data manifold.

Dijkstra's algorithm fails when data contains noise and leaks because it relies on direct graph connectivity. A single noisy edge or outlier can create spurious shortcuts that lead the algorithm astray. In contrast, graph diffusion captures the global structure of the manifold through spectral decomposition, effectively filtering out local noise and outliers.
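The contrast can be demonstrated on a small synthetic graph: two dense clusters joined by a chain, plus one spurious shortcut edge (graph sizes and the diffusion time are illustrative assumptions). Dijkstra's distance collapses from 8 hops to 1, while the spectral diffusion distance changes much less:

```python
import numpy as np
from scipy.linalg import eigh
from scipy.sparse.csgraph import shortest_path

def make_graph(spurious=False):
    n = 15
    W = np.zeros((n, n))
    for c in (range(0, 5), range(5, 10)):        # two dense 5-node cliques
        for i in c:
            for j in c:
                if i != j:
                    W[i, j] = 1.0
    chain = [4, 10, 11, 12, 13, 14, 5]           # bridge between clusters
    for a, b in zip(chain, chain[1:]):
        W[a, b] = W[b, a] = 1.0
    if spurious:
        W[0, 9] = W[9, 0] = 1.0                  # one noisy shortcut edge
    return W

def diffusion_dist(W, i, j, t=3):
    """Distance in diffusion coordinates Psi_t built from the normalized
    graph Laplacian; spectral averaging over all paths damps the effect
    of any single noisy edge."""
    d = W.sum(1)
    L = np.eye(len(W)) - W / np.sqrt(d)[:, None] / np.sqrt(d)[None, :]
    vals, vecs = eigh(L)
    Psi = vecs[:, 1:] * (1 - vals[1:]) ** t      # drop trivial eigenvector
    return np.linalg.norm(Psi[i] - Psi[j])

W_clean, W_noisy = make_graph(False), make_graph(True)
dij_clean = shortest_path(W_clean, method="D")[0, 9]   # 8 hops
dij_noisy = shortest_path(W_noisy, method="D")[0, 9]   # collapses to 1
```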

How Vibe Space is Trained to Approximate Inverse Mapping

For real images, gradient descent on the full graph is computationally expensive. Instead, we train two lightweight MLP networks to approximate the forward and inverse mappings:

  • Encoder: Maps input features \(\mathbf{x}\) to diffusion coordinates \(\mathbf{\Psi}_t(\mathbf{x})\), simulating the forward mapping
  • Decoder: Maps diffusion coordinates back to feature space, approximating the inverse mapping

Training takes under 30 seconds and ensures that Euclidean distances and linear paths in Vibe Space correspond to geodesic distances and semantic paths on the underlying manifold. This enables efficient path finding without gradient descent at inference time.
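A hedged sketch of this encoder–decoder setup using scikit-learn MLPs; the layer sizes, iteration counts, and synthetic stand-in data are assumptions, not the paper's configuration:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
# Stand-ins for real data: X = input features, Psi = their Ncut/diffusion
# coordinates (here a random smooth map, purely for illustration).
X = rng.normal(size=(200, 16))
A = rng.normal(size=(16, 4))
Psi = np.tanh(X @ A)

# Two lightweight MLPs approximating the forward and inverse mappings.
encoder = MLPRegressor(hidden_layer_sizes=(64,), max_iter=2000, random_state=0)
decoder = MLPRegressor(hidden_layer_sizes=(64,), max_iter=2000, random_state=0)
encoder.fit(X, Psi)        # x -> Psi_t(x)        (forward mapping)
decoder.fit(Psi, X)        # Psi_t(x) -> x        (inverse mapping)

# Inference-time path finding: interpolate linearly in Vibe Space, decode.
psi_a, psi_b = encoder.predict(X[:1]), encoder.predict(X[1:2])
path = [decoder.predict((1 - a) * psi_a + a * psi_b)
        for a in np.linspace(0, 1, 5)]
```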

Why Computing Latent Space from Few Examples Enables Vibe Identification

We compute the latent space from only 2-4 image examples rather than the entire training dataset. This context-specific manifold captures the dominant shared attributes—the "vibe"—between the specific images being blended.

Our approach: By constructing the graph Laplacian from only the input images, the leading eigenvectors naturally capture the most relevant attributes for that specific pair.

Alternative approaches (e.g., Yu et al.): Methods that compute the latent space from the entire training dataset create a global manifold encoding all possible relationships. This global perspective cannot identify which attributes are most relevant for blending a specific pair, as it mixes irrelevant relationships from the full dataset.

Local Vibe Space vs Global Latent Space Comparison

Left: Local Vibe Space (Few Examples) — focused path capturing dominant attributes
Right: Global Latent Space (All Data) — dense manifold with multiple paths obscuring specific relationships

Vibe Space Capabilities

Vibe Space enables a range of creative applications beyond simple interpolation. Below we demonstrate key capabilities that showcase the flexibility and power of discovering semantic manifolds between visual concepts.

Vibe Analogy

With the discovered vibe, we can extrapolate to nontrivial but related concepts, enabling creative analogies that go beyond simple interpolation. For example, we can morph Leonardo DiCaprio's face into a playing card.

Vibe Analogy examples

Negative Vibe Control

Vibe attributes are implicitly extracted by Vibe Space. The blending pair defines desired vibes, while negative pairs define vibes to suppress. By subtracting the negative vibe, we can control which attributes are blended.

Negative vibe control

The blending pair defines desired vibes (rotation + style). The negative pair defines vibes to suppress (style). Blending without negative examples transfers both attributes; after subtracting the negative vibe, only rotation is blended.

Negative vibe control
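The page does not spell out the exact suppression formula; one plausible mechanism, shown purely as a sketch, is vector rejection of the negative pair's direction from the blend direction:

```python
import numpy as np

def remove_negative_vibe(v_blend, v_neg):
    """Project out the direction defined by the negative pair (a sketch;
    the paper's actual suppression mechanism may differ).

    v_blend: blend direction from the positive pair.
    v_neg:   direction from the negative pair.
    Returns v_blend with its component along v_neg removed."""
    u = v_neg / np.linalg.norm(v_neg)
    return v_blend - (v_blend @ u) * u
```

For example, if the blend direction mixes rotation and style and the negative direction is pure style, the rejection keeps only the rotation component.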

Extrapolation

Vibe Space can extrapolate beyond the input images to generate related concepts. By extending the vibe path, we can create novel visual concepts that maintain semantic coherence.

Extrapolation examples

Training with Extra Images

Although two images suffice to train the Vibe Space and identify the dominant attributes, adding related exemplars can enhance the dominant attributes and suppress spurious ones. This allows for more robust blending when dealing with complex scenes or when certain attributes need to be emphasized.

Training with extra images

In this example, adding extra images helps the model learn the dominant attributes (glass texture vs. sand texture) and suppress spurious ones (triangular vs. round shape), so that the blend creates a pyramid with a glass texture.

N-Image Blending

Vibe Space can blend multiple images simultaneously, discovering shared attributes across multiple concepts.

N-image blending
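Pairwise linear interpolation extends naturally to N inputs as a convex combination of Vibe Space coordinates; the uniform weighting below is an assumption for illustration:

```python
import numpy as np

def blend_coordinates(Psis, weights):
    """Convex combination of N images' Vibe Space coordinates (a sketch of
    how pairwise interpolation generalizes to N inputs; the decoder then
    maps the result back to feature space)."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                       # normalize to a convex combination
    return np.tensordot(w, np.asarray(Psis), axes=1)
```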

Evaluation

What Makes a Blend Creative?

Cognitive psychology suggests that more creative individuals can connect more distant or weakly linked concepts.

Creativity diagram: more creative vs less creative paths

Rather than jumping directly between remote ideas (e.g., apple → house), they move through intermediate associations (e.g., apple → tree → wood → house), following nonlinear paths across clusters of related concepts.

How do humans rate the creativity of a blend?

To gauge human perceptions of creativity, we ask raters to compare different image pairs along two related but distinct axes:

  • Creative Potential: How much the content of the image pair allows for an interesting or compelling blend.
  • Blend Difficulty: How challenging it would be to create a coherent hybrid that fuses the shared attributes.
Human creativity evaluation

How to estimate blend difficulty?

We hypothesize that blending conceptually distant pairs involves traversing longer, curved paths in pretrained feature spaces, compared to blending nearby pairs.

To estimate the difficulty of blending two images, we define a Path Nonlinearity Score (PNS):

  • PNS measures how much a path decoded from Vibe Space deviates from linear interpolation in CLIP space
  • Specifically, we quantify excess path length and directional changes
  • Higher PNS indicates the path traverses multiple intermediate regions of feature space

This PNS metric serves as a computational proxy for conceptual distance: higher PNS means greater blend difficulty.

PNS curve
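The two PNS ingredients above can be sketched directly; how the paper weights excess length against directional change is not specified, so the sketch simply sums them (an assumption):

```python
import numpy as np

def path_nonlinearity_score(path):
    """PNS sketch: excess path length plus mean turning angle.

    path: (T, d) array of decoded CLIP features along the Vibe Space path.
    Returns 0 for a straight line; grows as the path lengthens and bends."""
    segs = np.diff(path, axis=0)
    lengths = np.linalg.norm(segs, axis=1)
    straight = np.linalg.norm(path[-1] - path[0])
    excess = lengths.sum() / max(straight, 1e-8) - 1.0   # excess path length
    # Directional change: mean angle between consecutive segments.
    unit = segs / np.maximum(lengths[:, None], 1e-8)
    cos = np.clip((unit[:-1] * unit[1:]).sum(1), -1.0, 1.0)
    turning = np.mean(np.arccos(cos)) if len(cos) else 0.0
    return excess + turning
```

A straight path scores 0; a path that detours through intermediate regions of feature space scores higher, matching the use of PNS as a proxy for blend difficulty.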

Failure Cases

Entangled Vibes

Negative vibe blending fails when desired and undesired attributes are entangled in feature space. The method assumes vibes can be separated into distinct subspaces, but entangled attributes cannot be cleanly filtered.

Negative vibe failure

Example: Positive inputs capture both style (car type) and color changes. Negative inputs target only color, but style and color are entangled, making separation difficult.

Extrapolation Uncertainty

Extrapolating to α > 1 does not always produce meaningful attribute exaggeration; the transformation may not continue as expected, limiting reliable extrapolation.

Extrapolation failure

Correspondence Failure

Vibe Space relies on unsupervised region-level correspondence matching between DINO token clusters. This determines which semantic regions should be merged, but the matching is not always reliable.

Correspondence failure

Impact: When correspondence is incorrect (it varies with the random seed), blends degrade significantly, producing incoherent or mismatched object-part combinations.

Reconstruction Failure

Our method depends on IP-Adapter (Stable Diffusion) to generate images from dense CLIP features. However, IP-Adapter reliability varies:

  • In-distribution: consistently reconstructs inputs (e.g., human faces) across random seeds
  • Out-of-distribution: produces unstable and inconsistent reconstructions

BibTeX

@article{yang2025vibespace,
title={Vibe Spaces for Creatively Connecting and Expressing Visual Concepts},
author={Yang, Huzheng and Xu, Katherine and Lu, Andrew and Grossberg, Michael D. and Bai, Yutong and Shi, Jianbo},
year={2025}
}