How Does UMAP Work? | From Neighbors To Maps

UMAP turns high-dimensional data into a lower-dimensional map by preserving nearby relationships and laying them out with minimal distortion.

UMAP stands for Uniform Manifold Approximation and Projection. That name sounds heavy, but the core idea is friendly: start with a crowded data space, find which points sit near each other, then build a smaller map that keeps those near points close. The result is often a 2D or 3D view you can inspect with your eyes, or a compact set of features you can pass into later modeling steps.

People use UMAP for gene-expression data, text embeddings, image features, customer behavior, and any job where each row has lots of columns. It is not just a charting trick. It is a non-linear dimensionality reduction method, which means it can keep curved local structure that linear tools may flatten away.

What UMAP Is Trying To Do

Think of a dataset as a cloud of points in a space with many dimensions. Each dimension is one feature. With two or three features, plotting is easy. With 300 or 3,000, it is not. UMAP tries to create a smaller map where the local shape of that cloud still makes sense.

That local focus matters. UMAP cares most about who each point lives near. If two points are neighbors in the original data, the lower-dimensional map should place them near each other too. Faraway points matter less, though they still influence the final layout.

The Three Ideas Behind The Method

UMAP rests on a small set of ideas described in the official documentation and software paper. It treats the data as if it lies on a manifold, assumes the local geometry is stable enough to estimate from nearby points, and then builds a graph that captures those neighborhood ties.

  • Local neighborhoods: each point gets a set of nearest neighbors.
  • Weighted graph: neighbor links receive strengths, not just yes-or-no labels.
  • Low-dimensional layout: the algorithm places points so the new map matches those link strengths as closely as possible.

That is why UMAP often groups similar items well. It is not trying to preserve every global distance exactly. It is trying to keep local relations believable while still producing a readable layout.

How Does UMAP Work In Real Data?

The workflow has three big stages. First, UMAP finds each point's nearest neighbors in the original space. Next, it turns those neighborhoods into a weighted graph. Finally, it searches for a lower-dimensional arrangement that reflects that graph.

Stage 1: Find The Nearest Neighbors

UMAP starts from a distance metric that you choose to suit the data, such as Euclidean or cosine. Then it finds the nearest neighbors for each point. The neighborhood size is controlled by n_neighbors, one of the main settings in UMAP’s official parameter reference.

A smaller neighborhood size makes the map pay more attention to fine local detail. A larger neighborhood size pushes the result toward broader structure. Neither is always right. The better choice depends on whether you care more about tiny pockets or overall shape.
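The neighbor search can be sketched with a brute-force loop. This is only an illustration on made-up toy points: the function name k_nearest_neighbors is hypothetical, and a real run on large data would rely on a faster approximate search rather than comparing every pair.

```python
import math

def k_nearest_neighbors(points, k):
    """For each point, return (neighbor_index, distance) pairs for its k closest other points."""
    result = []
    for i, p in enumerate(points):
        # Euclidean distance from point i to every other point.
        dists = [(math.dist(p, q), j) for j, q in enumerate(points) if j != i]
        dists.sort()
        result.append([(j, d) for d, j in dists[:k]])
    return result

# Toy data (an assumption for illustration): two tight pairs that sit far apart.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]

local = k_nearest_neighbors(points, k=1)   # small neighborhood: each point sees only its partner
broad = k_nearest_neighbors(points, k=3)   # large neighborhood: each point also sees the far pair
```

With k=1 the two pairs never reference each other, which is the fragmenting effect of a small neighborhood. With k=3 every point is tied to the distant pair as well, so the later layout has to account for the broader arrangement.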

Stage 2: Turn Neighborhoods Into A Weighted Graph

After neighbors are found, UMAP gives each connection a strength. Close neighbors get stronger links. Weaker links still count, but less. That creates a fuzzy graph, which is just a way of saying each pairwise tie can be partial rather than all-or-nothing.

This matters because real data is messy. Boundaries are not always sharp. A point may sit close to several groups. Weighted links let UMAP express that uncertainty instead of forcing every point into a rigid local pattern too early.
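One way to picture the link strengths is the sketch below, which loosely follows the formulation in the UMAP paper: distances are measured relative to the closest neighbor, a per-point scale is tuned so the weights sum to log2(k), and the two directed weights for a pair are merged with a fuzzy union. The function names are hypothetical and the bisection search is a simplification, not the library's code.

```python
import math

def membership_weights(neighbor_dists, n_iter=64):
    """Turn one point's sorted neighbor distances into link weights in [0, 1]."""
    k = len(neighbor_dists)
    rho = neighbor_dists[0]      # distance to the closest neighbor
    target = math.log2(k)        # desired sum of the weights

    def total(sigma):
        return sum(math.exp(-max(0.0, d - rho) / sigma) for d in neighbor_dists)

    lo, hi = 0.0, 1.0
    while total(hi) < target:    # widen the bracket until it contains the target
        hi *= 2.0
    for _ in range(n_iter):      # bisection on the per-point scale sigma
        mid = (lo + hi) / 2.0
        if total(mid) < target:
            lo = mid
        else:
            hi = mid
    sigma = hi
    return [math.exp(-max(0.0, d - rho) / sigma) for d in neighbor_dists]

def fuzzy_union(w_ij, w_ji):
    """Merge the two directed weights of a pair into one symmetric weight."""
    return w_ij + w_ji - w_ij * w_ji

weights = membership_weights([1.0, 2.0, 4.0])   # closest neighbor gets weight 1.0
combined = fuzzy_union(0.8, 0.3)                # stronger than either directed weight alone
```

Because the scale is fitted per point, a point in a sparse region and a point in a dense region both end up with a comparable total link strength.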

Stage 3: Build The Smaller Map

Once the graph is ready, UMAP places the points in 2D, 3D, or another chosen dimension. It then adjusts those positions again and again so points with strong links stay close and points with weak links do not pile together. The official page on how UMAP works walks through this process in more detail.

The end result is an embedding: a new coordinate set with fewer dimensions. That embedding is easier to plot, faster to work with, and often keeps enough structure to be useful for downstream work.
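The pull-and-push adjustment can be sketched as plain gradient descent on a cross-entropy between the graph weights and a low-dimensional similarity. Several things here are simplifying assumptions: the similarity curve 1 / (1 + d^2), the full-batch update over every pair, the hand-written weights, and the fixed starting positions. The library itself uses a sampled, stochastic version of this idea.

```python
import math

def layout_step(pos, weights, lr=0.1, eps=1e-3, clip=4.0):
    """One gradient step: strong links attract, weak links repel."""
    n, dim = len(pos), len(pos[0])
    grads = [[0.0] * dim for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            w = weights[i][j]
            diff = [pos[i][c] - pos[j][c] for c in range(dim)]
            d2 = sum(x * x for x in diff)
            attract = 2.0 * w / (1.0 + d2)                        # pulls i toward j
            repel = 2.0 * (1.0 - w) / ((d2 + eps) * (1.0 + d2))   # pushes i away from j
            for c in range(dim):
                grads[i][c] += (attract - repel) * diff[c]
    # Clip each gradient coordinate so a very close pair cannot cause a huge jump.
    return [
        [p - lr * max(-clip, min(clip, g)) for p, g in zip(pos[i], grads[i])]
        for i in range(n)
    ]

# Hypothetical graph: points 0-1 and 2-3 are strongly linked, all other pairs weakly.
weights = [
    [0.0, 0.9, 0.1, 0.1],
    [0.9, 0.0, 0.1, 0.1],
    [0.1, 0.1, 0.0, 0.9],
    [0.1, 0.1, 0.9, 0.0],
]
# Starting layout that deliberately puts weakly linked points closest together.
pos = [[0.0, 0.0], [2.0, 0.0], [0.0, 0.5], [2.0, 0.5]]
for _ in range(500):
    pos = layout_step(pos, weights)

linked_gap = math.dist(pos[0], pos[1])     # strongly linked pair
unlinked_gap = math.dist(pos[0], pos[2])   # weakly linked pair
```

Even though the starting layout is misleading, repeated steps draw the strongly linked pairs together and move the weakly linked pairs apart, so linked_gap ends up smaller than unlinked_gap.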

What Each Step Is Doing

It helps to see the moving parts side by side.

Step | What Happens | Why It Matters
Pick A Metric | The chosen metric defines how distance or similarity is measured. | The wrong metric can make neighbor choices misleading.
Find Neighbors | Each point gets a local neighborhood. | These neighbor sets define the local shape of the data.
Scale Local Distances | Distances are adjusted per point. | This helps dense and sparse regions coexist on one map.
Build Weighted Links | Neighbor ties get strengths. | Partial ties capture soft boundaries between groups.
Merge Local Views | All local graphs are combined into one fuzzy graph. | The full graph becomes the target structure to preserve.
Initialize Layout | Points get starting positions in lower dimensions. | A decent start helps the later layout settle well.
Adjust Positions | The algorithm pulls linked points together and pushes weakly linked points apart. | This creates the final embedding people plot or model with.
Return Embedding | New coordinates are produced. | You can visualize them or use them in later steps.

Why UMAP Often Looks Better Than A Plain PCA Plot

PCA is linear. It projects the data onto directions with high variance. That works well for many tasks, but it may miss curved or folded structure. UMAP is non-linear, so it can separate local groups that PCA leaves tangled.

That does not mean UMAP is always “better.” PCA is faster, simpler, and easier to explain. UMAP shines when neighborhood structure matters more than preserving one global straight-line view of the data.

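scikit-learn’s manifold learning documentation covers the wider family of methods built to preserve structure that lives on lower-dimensional manifolds inside high-dimensional spaces. UMAP belongs to that family, though it is distributed as a separate library rather than as part of scikit-learn.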

Settings That Change The Shape Of The Map

A UMAP plot can look calm and broad, or tight and broken into clusters, depending on a few settings. These settings do not just tweak appearance. They change what structure the embedding favors.

n_neighbors

This controls how many nearby points each point pays attention to. Small values often reveal small local pockets. Larger values usually smooth things out and bring in more overall structure.

min_dist

This controls how tightly points can pack together in the embedding. Lower values let clusters compress more. Higher values spread the points out. A very tight map may look clean, but it can also make separations feel stronger than they really are.
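A rough way to see what min_dist does is to look at the target similarity curve the low-dimensional layout aims for. The sketch below assumes a curve that is flat at 1.0 up to min_dist and decays exponentially after that, with a spread parameter setting the decay scale; the function name target_similarity is hypothetical.

```python
import math

def target_similarity(distance, min_dist=0.1, spread=1.0):
    """Desired similarity for two embedded points that sit `distance` apart."""
    if distance <= min_dist:
        return 1.0    # already counted as fully similar: no reason to move closer
    return math.exp(-(distance - min_dist) / spread)

tight = target_similarity(0.5, min_dist=0.1)   # still below 1.0, so linked points keep pulling in
loose = target_similarity(0.5, min_dist=0.5)   # already 1.0, so there is no further pull
```

With a small min_dist, linked points gain similarity all the way down to very short distances, which packs clusters tightly. With a larger min_dist, the curve saturates earlier and points settle farther apart.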

Metric

The metric tells UMAP what “close” means. Euclidean distance fits many numeric datasets. Cosine distance is common for text embeddings and other vector spaces where direction matters more than magnitude. A poor metric can ruin the map before layout even begins.
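A small sketch shows how the metric alone can change who counts as a neighbor. The three vectors are made up for illustration: doc_long points in the same direction as doc_short but is ten times larger, while doc_other is nearby in raw coordinates but points elsewhere.

```python
import math

def euclidean(u, v):
    return math.dist(u, v)

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

doc_short = (1.0, 2.0, 0.0)
doc_long = (10.0, 20.0, 0.0)    # same direction as doc_short, larger magnitude
doc_other = (2.0, 1.0, 0.5)     # similar magnitude, different direction

# Under Euclidean distance, doc_other is the closer neighbor of doc_short.
euclid_prefers_other = euclidean(doc_short, doc_other) < euclidean(doc_short, doc_long)
# Under cosine distance, doc_long is the closer neighbor of doc_short.
cosine_prefers_long = cosine_distance(doc_short, doc_long) < cosine_distance(doc_short, doc_other)
```

Both comparisons come out true, so the same point gets a different nearest neighbor depending on the metric, and everything UMAP builds afterward inherits that choice.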

Setting | Low Or Small Value | High Or Large Value
n_neighbors | More local detail, more fragmentation | More broad structure, smoother groups
min_dist | Tighter clusters, denser patches | More spacing, softer cluster edges
n_components | 2 is easy to plot | 3 or more may keep extra structure
metric (not a numeric scale) | Well matched: can fit local geometry well | Mismatched: can distort neighbor choices

What UMAP Does Well

UMAP is popular for good reasons. It is often fast, works on large datasets, and can preserve local structure in a way that makes cluster-like regions easy to inspect. It also supports supervised variants and can transform new data after fitting, which makes it handy in applied workflows.

  • Good at showing neighborhood structure
  • Often faster than t-SNE on larger datasets
  • Flexible with distance metrics
  • Useful for both plotting and feature reduction

Where People Get Tripped Up

The neatest-looking UMAP plot is not always the truest one. Clear gaps do not always prove real classes. Tight blobs do not always mean clean separability in the original space. The embedding is a reduced view shaped by settings, metric choice, and the random seed.

That is why UMAP should be read with care. Use labels when you have them. Compare runs. Check whether the structure stays stable when you change settings a bit. Pair the plot with domain knowledge instead of treating it as a final verdict.
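One simple way to compare runs is to ask whether each point keeps the same nearest neighbors from one embedding to the next. The sketch below uses a hypothetical helper, neighbor_overlap, and hand-written coordinates standing in for embeddings from separate runs; it is one possible stability check, not a standard from the UMAP documentation.

```python
import math

def knn_sets(points, k):
    """Set of the k nearest other points for every point."""
    sets = []
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: math.dist(p, points[j]))
        sets.append(set(others[:k]))
    return sets

def neighbor_overlap(emb_a, emb_b, k):
    """Mean Jaccard overlap of k-nearest-neighbor sets across two embeddings of the same rows."""
    a, b = knn_sets(emb_a, k), knn_sets(emb_b, k)
    return sum(len(x & y) / len(x | y) for x, y in zip(a, b)) / len(a)

# Stand-ins for three runs over the same four rows (made-up coordinates).
run_1 = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
run_2 = [(3.0, 3.0), (3.0, 2.0), (-2.0, 3.0), (-2.0, 2.0)]   # mirrored and shifted, same neighbors
run_3 = [(0.0, 0.0), (5.0, 0.0), (0.0, 1.0), (5.0, 1.0)]     # rows paired up differently

stable = neighbor_overlap(run_1, run_2, k=1)
unstable = neighbor_overlap(run_1, run_3, k=1)
```

Here stable is 1.0 even though the two layouts look different on screen, because flips and shifts do not change neighborhoods. The value for unstable is 0.0, the kind of result that would suggest the visible groups depend on the run rather than on the data.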

Common Mistakes

  • Reading global distances too literally
  • Treating every visible gap as a real class boundary
  • Using the default metric when the data needs another one
  • Skipping preprocessing such as scaling or clean feature selection
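The scaling point is easy to demonstrate. In the made-up rows below (age in years, income in dollars), the income column has a much larger numeric range, so it dominates Euclidean distance until each column is standardized. The helper names are hypothetical.

```python
import math
import statistics

def standardize(rows):
    """Rescale each column to zero mean and unit standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(c) for c in columns]
    stdevs = [statistics.pstdev(c) or 1.0 for c in columns]   # guard against constant columns
    return [tuple((x - m) / s for x, m, s in zip(row, means, stdevs)) for row in rows]

def nearest(rows, i):
    """Index of the row closest to row i under Euclidean distance."""
    others = [j for j in range(len(rows)) if j != i]
    return min(others, key=lambda j: math.dist(rows[i], rows[j]))

# Hypothetical customers: (age in years, income in dollars).
rows = [(25.0, 50_000.0), (60.0, 50_400.0), (26.0, 51_000.0), (58.0, 90_000.0)]

raw_neighbor = nearest(rows, 0)                   # picks the 60-year-old with a similar income
scaled_neighbor = nearest(standardize(rows), 0)   # picks the 26-year-old with a similar income
```

Before scaling, row 0's nearest neighbor is row 1, because a 35-year age gap is numerically tiny next to a few hundred dollars. After scaling, the nearest neighbor is row 2, which is close on both features. UMAP would build its graph from whichever of these neighbor choices it is given.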

When UMAP Makes Sense

UMAP fits best when you need a readable lower-dimensional map of complex data and you care about local relationships. It is a strong pick for embedding vectors, image features, single-cell data, and any task where local neighborhoods carry the story.

If you need a straight, stable, easily interpretable linear reduction, PCA may fit better. If you need a compact map that keeps local structure and you are willing to tune a few settings, UMAP is often a smart pick.

So, how does UMAP work? It builds a weighted neighbor graph in the original space, then arranges a lower-dimensional map that mirrors that graph as closely as it can. That simple frame explains most of what you see on a UMAP plot: points stay near their nearest neighbors, broader geometry is only partly preserved, and the final shape depends on the settings you choose.

References & Sources

  • UMAP Documentation. “Basic UMAP Parameters.” Lists core settings such as n_neighbors and min_dist, plus how those settings change the embedding.
  • UMAP Documentation. “How UMAP Works.” Explains the method’s neighborhood graph, fuzzy set ideas, and low-dimensional layout process.
  • scikit-learn. “Manifold Learning.” Describes the broader family of non-linear dimensionality reduction methods and clarifies the manifold-learning context in which UMAP sits.