Comment by nurettin - Hacker Neue

nurettin Oct 26, 2025 parent

> And so that raises the question of what “nearest” means

To me, the definition of "nearest" is just a technicality.

The real question is: what is K?

Spivak Oct 26, 2025

The total number of clusters. Determining this algorithmically is a fun open problem https://en.wikipedia.org/wiki/Determining_the_number_of_clus....

For the data I work with at $dayjob I've found the Silhouette algorithm to perform best but I assume it will be extremely field specific. Clustering your data and taking a representative sample of each cluster is such a powerful trick to make big data small but finding an appropriate K is an art more than a science.

dcl Oct 26, 2025

At a previous $dayjob at a very large financial institution, it's however many clusters are present in the strategy that was agreed to by the exec team and their highly paid consultants.

You find that many clusters and shoehorn the consultant provided categories on to the k clusters you obtain.

3abiton Oct 27, 2025

To be fair finding K is highly domain dependent and I would argue should not be for the analyst (solely) to decide, but with a feedback from domain experts.

seanhunter Oct 26, 2025

K is whatever you want it to be. You want 5 clusters k=5. If you don’t know the right number of clusters try a few different values of k and see which partitions your sample in a way that’s good for your problem

mentalgear Oct 26, 2025

Have you tried HDBSCAN (DBSCAN variant) or Hierarchical Clustering (HAC) ?

nurettin OP Oct 26, 2025

Me? I probably tried every classification algorithm and their H variants. I still think "What is K?" is a profound question.

keeeba Oct 28, 2025

I agree it is a profound question. My thesis is fairly boring.

For any given clustering task of interest, there is no single value of K.

Clustering & unsupervised machine learning is as much about creating meaning and structure as it is about discovering or revealing it.

Take the case of biological taxonomy, what K will best segment the animal kingdom?

There is no true value of K. If your answer is for a child, maybe it’ 7 corresponding to what we’re taught in school - mammals, birds, reptiles, amphibians, fish, and invertebrates.

If your answer is for a zoologist, obviously this won’t do.

Every clustering task of interest is like this. And I say of interest because clustering things like digits in the classic MNIST dataset is better posed as a classification problem - the categories are defined analytically.

This item has no comments currently.

Preferences

Keyboard Shortcuts

Story Lists

Navigation

Miscellaneous