Not aligned: Even when models represent similar concepts, their channels are not directly comparable. Distances between features from different models (DINO, CLIP) are not meaningful.
No descriptor: What do a model's hidden units mean? Plotting a 768-dimensional feature vector is not informative. We need a feature space in which each hidden unit has a meaningful descriptor.
To enable joint analysis across models, we align feature channels into a shared representation space.
We use the human visual cortex as a universal reference frame. Given an image:
Once trained, features from different models can be expressed in the same brain-referenced space.
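As a minimal sketch of this idea, assuming a ridge-regression alignment from model features to voxel responses (the data below are random stand-ins, and the actual training procedure may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: N images, 768-d model features, V brain voxels.
N, D, V = 200, 768, 1000
feats_dino = rng.normal(size=(N, D))   # stand-in for DINO features
brain = rng.normal(size=(N, V))        # stand-in for fMRI voxel responses

# Fit a ridge-regularized linear map W: feature space -> voxel space.
lam = 1.0
W = np.linalg.solve(feats_dino.T @ feats_dino + lam * np.eye(D),
                    feats_dino.T @ brain)   # shape (D, V)

# Any model's features can now be expressed in the shared voxel space.
brain_referenced = feats_dino @ W           # shape (N, V)
print(brain_referenced.shape)               # (200, 1000)
```

Fitting one such map per model sends every model's channels into the same voxel coordinate system.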
After transforming into the brain-referenced space, features from different models can be compared directly, e.g., by computing cosine similarity between features from different models (DINO, CLIP) in that shared space.
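A small illustration of such a comparison, using synthetic brain-referenced features (the second matrix is a noisy copy of the first, so high similarity is built in by construction):

```python
import numpy as np

def cosine_sim(a, b):
    """Row-wise cosine similarity between two matched feature matrices."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return np.sum(a * b, axis=1)

rng = np.random.default_rng(0)
# Hypothetical brain-referenced features for the same 50 images from two models.
dino_brain = rng.normal(size=(50, 1000))
clip_brain = dino_brain + 0.1 * rng.normal(size=(50, 1000))  # correlated copy

sims = cosine_sim(dino_brain, clip_brain)
print(sims.mean())  # close to 1: the two models agree in the shared space
```

The same computation on raw, unaligned channels would not be meaningful, which is the point of the shared space.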
Each hidden unit in the brain-referenced space has a meaningful descriptor, e.g., there are low-level regions (V1), body-selective regions (EBA), and face-selective regions (FFA).
Once channels are aligned:
One of the most consistent patterns discovered by AlignedCut is figure–ground separation. In CLIP, DINO, and MAE, foreground and background pixels cluster into distinct spectral groups.
Across models, the same visual concept (figure–ground) corresponds to similar brain activation patterns in brain space.
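The spectral grouping behind such figure–ground splits can be sketched with a normalized-cut-style eigenvector computation (random two-cluster patch features here stand in for real foreground and background tokens; this is an illustration of the technique, not the paper's exact pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical patch features: two clusters standing in for figure vs. ground.
fg = rng.normal(loc=1.0, size=(30, 8))
bg = rng.normal(loc=-1.0, size=(30, 8))
feats = np.vstack([fg, bg])

# Affinity from cosine similarity, then the symmetric normalized Laplacian.
X = feats / np.linalg.norm(feats, axis=1, keepdims=True)
A = np.exp(X @ X.T)                              # positive affinities
d = A.sum(axis=1)
L = np.eye(len(A)) - A / np.sqrt(np.outer(d, d))

# The Fiedler vector (2nd-smallest eigenvector) splits figure from ground.
vals, vecs = np.linalg.eigh(L)
labels = (vecs[:, 1] > 0).astype(int)
print(labels[:30].mean(), labels[30:].mean())  # one group near 0, the other near 1
```

Thresholding the Fiedler vector at zero recovers the two-way cut, which is why foreground and background pixels fall into distinct spectral groups.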
Beyond figure–ground separation, later layers exhibit category-specific clusters.
Across models, the same visual concept (category) corresponds to similar brain activation patterns in brain space.
AlignedCut also allows us to visualize how representations evolve through layers in the network. By embedding tokens from all layers into the same brain-referenced space, we can track trajectories of the tokens across layers.
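The trajectory idea can be sketched as follows: stack token features from every layer, embed them jointly into one low-dimensional space, and read off one point per layer per token. PCA is used here purely as a stand-in for the brain-referenced embedding, and the drifting random features are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical token features from 12 layers (same 5 tokens at every layer),
# with a small drift per layer so trajectories actually move.
n_layers, n_tokens, dim = 12, 5, 64
per_layer = [rng.normal(size=(n_tokens, dim)) + i * 0.3 for i in range(n_layers)]

# Embed tokens from *all* layers into one shared 2-D space via PCA.
all_feats = np.vstack(per_layer)            # shape (n_layers * n_tokens, dim)
centered = all_feats - all_feats.mean(axis=0)
_, _, Vt = np.linalg.svd(centered, full_matrices=False)
coords = centered @ Vt[:2].T                # shape (n_layers * n_tokens, 2)

# Trajectory of token 0 across layers: one 2-D point per layer.
traj = coords.reshape(n_layers, n_tokens, 2)[:, 0, :]
print(traj.shape)  # (12, 2)
```

Because every layer is embedded in the same space, connecting a token's points across layers gives its trajectory through the network.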