"I Know It When I See It":
Mood Spaces for Connecting and Expressing Visual Concepts

1UPenn 2CUNY 3UC Berkeley


Can you prompt ChatGPT to generate creative images?




What's the missing creativity in ChatGPT?

ChatGPT can copy-paste existing image parts, e.g. a dog head on a fish body.
But it struggles to blend existing parts into a new part that looks like a hybrid of the two.

What makes creativity difficult?

Visual concepts (e.g. duck, toilet paper) are disconnected in the embedding space.
Creative blending requires finding a path to connect the disconnected concepts.

How to improve creativity? Mood Space

  • Mood Space connects disconnected visual concepts, e.g. duck -> toilet paper
  • Mood Space picks up only the most relevant concepts from the Mood Board

Mood Space Introduction
A low-dimensional dense space that connects disconnected visual concepts.

Mood Space Interpolation
Linear interpolation in Mood Space maps to a curved path in CLIP space, avoiding holes.
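A minimal sketch of this idea, assuming a learned encoder/decoder pair (hypothetical interfaces) that maps CLIP tokens into the Mood Space and back: the interpolation path is a straight line in Mood Space, and decoding each point traces the corresponding curved path in CLIP space.

import torch

def mood_space_interpolation(clip_a, clip_b, encoder, decoder, steps=8):
    # clip_a, clip_b: CLIP token embeddings of the two endpoint images.
    # encoder/decoder: the learned point-wise MLPs (hypothetical interfaces).
    z_a, z_b = encoder(clip_a), encoder(clip_b)    # compress into Mood Space
    path = []
    for t in torch.linspace(0.0, 1.0, steps):
        z = (1 - t) * z_a + t * z_b                # straight line in Mood Space
        path.append(decoder(z))                    # curved path back in CLIP space
    return torch.stack(path)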

Blending Creativity: connecting two distinct concepts

Combinational Creativity: path lifting

Given Image A1 → Image B1, what is Image A2 → Image B2?

Mood Space Implementation
CLIP features are transformed by a learned point-wise MLP, trained with a spectral clustering loss and a reconstruction loss.
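A sketch of what such an implementation could look like; the layer sizes and loss weighting are illustrative assumptions, and spectral_loss_fn stands in for the eigenvector-based loss sketched in the details below.

import torch
import torch.nn as nn

class MoodSpaceMLP(nn.Module):
    # Point-wise MLP that compresses CLIP tokens into a low-dimensional
    # Mood Space and decodes them back.
    def __init__(self, clip_dim=1024, mood_dim=16, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(clip_dim, hidden), nn.GELU(), nn.Linear(hidden, mood_dim))
        self.decoder = nn.Sequential(
            nn.Linear(mood_dim, hidden), nn.GELU(), nn.Linear(hidden, clip_dim))

    def forward(self, tokens):                  # tokens: (num_tokens, clip_dim)
        z = self.encoder(tokens)                # Mood Space codes
        return z, self.decoder(z)               # codes and reconstructed CLIP tokens

def training_loss(tokens, z, recon, spectral_loss_fn, w=1.0):
    # Reconstruction loss plus a spectral-clustering loss on the Mood Space codes.
    return nn.functional.mse_loss(recon, tokens) + w * spectral_loss_fn(z, tokens)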

Spectral Clustering Details

NCut is differentiable.
The SVD eigenvectors can be shifted or rotated, so we use an invariant loss (the outer product of the eigenvectors).
We use partially shifted eigenvectors to prevent large window shifts.
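A rough sketch of an eigenvector-space loss along these lines; the affinity normalization and the omission of the partial eigenvector shifting are simplifying assumptions.

import torch

def top_eigvecs(features, k=8):
    # Top-k eigenvectors of a symmetrically normalized cosine-affinity
    # matrix (NCut-style); SVD of the symmetric matrix gives the
    # eigenvectors up to sign/rotation.
    f = torch.nn.functional.normalize(features, dim=-1)
    affinity = (f @ f.t()).clamp(min=0)
    d = affinity.sum(-1)
    A = affinity / torch.sqrt(d[:, None] * d[None, :] + 1e-8)
    U, S, Vh = torch.linalg.svd(A)
    return U[:, :k]

def invariant_spectral_loss(mood_codes, clip_tokens, k=8):
    # Compare outer products of the eigenvectors; the outer product is
    # invariant to sign flips and rotations of individual eigenvectors.
    V_m = top_eigvecs(mood_codes, k)
    V_c = top_eigvecs(clip_tokens, k)
    return torch.nn.functional.mse_loss(V_m @ V_m.t(), V_c @ V_c.t())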

How to control creativity? Mood Board

A Mood Board is a set of 2-20 images used to train the Mood Space.
The Mood Space picks up only the most relevant concepts from the Mood Board.

  • e.g. a Mood Board of 20 face images -> face is the most relevant concept
  • e.g. a Mood Board of 20 first-person images -> hand is the most relevant concept

Mood Board Interface
Control the Mood Space through context images to compute affinity and pick up relevant features.
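One way to sketch this relevance selection, assuming plain cosine affinity between the CLIP tokens of the exemplars; the threshold and the mean-affinity scoring are illustrative assumptions, not the paper's exact mechanism.

import torch

def relevant_token_masks(board_tokens, threshold=0.6):
    # Score each token of each Mood Board image by its mean cosine affinity
    # to the tokens of the *other* exemplars; tokens with consistently high
    # affinity carry the concept shared across the board.
    normed = [torch.nn.functional.normalize(t, dim=-1) for t in board_tokens]
    masks = []
    for i, ti in enumerate(normed):
        others = torch.cat([t for j, t in enumerate(normed) if j != i])
        score = (ti @ others.t()).mean(dim=-1)    # mean affinity per token
        masks.append(score > threshold)           # keep only the shared concept
    return masks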

How to generate the created visual concept?

Image-conditioned diffusion model: using IP-Adapter and interpolated CLIP embeddings

How to interpolate?
We use an image-conditioned diffusion model (IP-Adapter); the CLIP image embeddings are interpolated.
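A minimal sketch of the embedding interpolation; the CLIP backbone shown here is an assumption (IP-Adapter variants may use a different vision encoder), and the IP-Adapter generation call itself is omitted.

import torch
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")
clip_vision = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-large-patch14")

@torch.no_grad()
def clip_image_embed(image):
    inputs = processor(images=image, return_tensors="pt")
    return clip_vision(**inputs).image_embeds      # projected CLIP image embedding

def interpolated_conditions(image_a, image_b, steps=8):
    # Linearly interpolate between the two CLIP image embeddings; each
    # interpolated embedding conditions an IP-Adapter diffusion model
    # (the generation call itself is omitted in this sketch).
    e_a, e_b = clip_image_embed(image_a), clip_image_embed(image_b)
    return [(1 - t) * e_a + t * e_b for t in torch.linspace(0, 1, steps)]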

Why VQ-VAE interpolation doesn't work well

Why does VQ-VAE interpolation work poorly?
The VQ-VAE embedding 1) does not separate image semantics and 2) cannot repair errors.
Why does diffusion with CLIP interpolation work better?
The CLIP embedding separates image semantics, and the diffusion model can repair errors.

Limitations of CLIP Space
There are holes in the CLIP embedding space, especially between unrelated objects (e.g. duck and toilet paper).

Spatial Interpolation Details

How to interpolate pixel location?
Step 1. Cluster into super-pixels (DINO feature clustering).
Step 2. Match super-pixels (Hungarian matching).
Step 3. Interpolate super-pixels (CLIP embeddings).
How to move an object spatially?
Because the CLIP embedding in the diffusion cross-attention is permutation invariant,
we can simply interpolate the super-pixels so that the object moves spatially.
The diffusion model relies on the intrinsic spatial features inside the CLIP embedding.
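A sketch of steps 2 and 3, assuming super-pixel features and centers have already been extracted; the cosine-distance matching cost is an assumption.

import torch
from scipy.optimize import linear_sum_assignment

def match_superpixels(feats_a, feats_b):
    # Hungarian matching between super-pixel features of two images
    # (feats: (num_superpixels, dim), e.g. pooled DINO features).
    fa = torch.nn.functional.normalize(feats_a, dim=-1)
    fb = torch.nn.functional.normalize(feats_b, dim=-1)
    cost = 1.0 - fa @ fb.t()                       # cosine distance as matching cost
    rows, cols = linear_sum_assignment(cost.detach().cpu().numpy())
    return torch.as_tensor(rows), torch.as_tensor(cols)

def interpolate_superpixels(dino_a, dino_b, clip_a, clip_b, centers_a, centers_b, t):
    # Match super-pixels on DINO features, then interpolate the CLIP embeddings
    # and spatial centers of matched pairs, so an object can move across the image.
    rows, cols = match_superpixels(dino_a, dino_b)
    e = (1 - t) * clip_a[rows] + t * clip_b[cols]
    c = (1 - t) * centers_a[rows] + t * centers_b[cols]
    return e, c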

Results

Interpolation

Given Image A1 and Image B1, interpolate between them.


Visual Analogy by Path Lifting

Given Image A1 → Image B1, what is Image A2 → Image B2?
A1 → B1 defines a path in the Mood Space.
A2 → B2 follows the same path as A1 → B1.
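A minimal sketch of path lifting, assuming z_a1, z_b1, z_a2 are the Mood Space codes of A1, B1, A2.

import torch

def lift_path(z_a1, z_b1, z_a2, steps=8):
    # The displacement from A1 to B1 in Mood Space is re-applied starting at A2.
    delta = z_b1 - z_a1                            # path defined by A1 -> B1
    return [z_a2 + t * delta for t in torch.linspace(0.0, 1.0, steps)]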

Analysis

When Baseline CLIP Space Fails

Baseline CLIP space interpolation fails when dealing with semantically disconnected concepts:

  • e.g. Duck → Toilet paper
  • e.g. Duck → Pixel art duck

When Baseline CLIP Space Works

CLIP space interpolation works well for semantically connected concepts:

  • e.g. Duck → Dinosaur (same pixel art style)

Takeaways

  • Mood Space finds a dense low-dimensional space to compress CLIP space
  • Mood Space connects disconnected concepts (e.g. duck -> toilet paper)
  • Mood Board is an interpolation interface to control the Mood Space
  • Mood Space picks up only the most relevant concepts from the Mood Board

Abstract

Expressing complex concepts is easy when they can be labeled or quantified, but many ideas are hard to define yet instantly recognizable. We propose a Mood Board, where users convey abstract concepts with examples that hint at the intended direction of attribute changes.

We compute an underlying Mood Space that 1) factors out irrelevant features and 2) finds the connections between images, thus bringing relevant concepts closer. We invent a fibration computation to compress/decompress pre-trained features into/from a compact space, 50-100x smaller. The main innovation is learning to mimic the pairwise affinity relationship of the image tokens across exemplars. To focus on the coarse-to-fine hierarchical structures in the Mood Space, we compute the top eigenvector structure from the affinity matrix and define a loss in the eigenvector space.

The resulting Mood Space is locally linear and compact, allowing image-level operations, such as object averaging, visual analogy, and pose transfer, to be performed as a simple vector operation in Mood Space. Our learning is efficient in computation without any fine-tuning, needs only a few (2-20) exemplars, and takes less than a minute to learn.

BibTeX

@misc{yang2025iknowiit,
      title={"I Know It When I See It": Mood Spaces for Connecting and Expressing Visual Concepts}, 
      author={Huzheng Yang and Katherine Xu and Michael D. Grossberg and Yutong Bai and Jianbo Shi},
      year={2025},
      eprint={2504.15145},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2504.15145}, 
}