Full Code of haoyuhu/cluster-analysis for AI

Repository: haoyuhu/cluster-analysis
Branch: master
Commit: 58f3bed7d269
Files: 14
Total size: 91.4 KB

Directory structure:
cluster-analysis/

├── .github/
│   └── workflows/
│       └── ci.yml
├── .gitignore
├── README.ja.md
├── README.md
├── README.zh-CN.md
├── agglomerative_hierarchical.py
├── clustering_utils.py
├── dbscan.py
├── fuzzy_c_means.py
├── k_means_plus_plus.py
├── optics.py
├── requirements.txt
├── spectral_clustering.py
└── tests/
    └── test_repository.py

================================================
FILE CONTENTS
================================================

================================================
FILE: .github/workflows/ci.yml
================================================
name: CI

on:
  push:
  pull_request:

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        python-version: ["3.9", "3.x"]

    steps:
      - name: Check out repository
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt

      - name: Run smoke tests
        run: pytest


================================================
FILE: .gitignore
================================================
__pycache__/
.pytest_cache/
.venv/
*.pyc


================================================
FILE: README.ja.md
================================================
# Clustering Playbook: A Trilingual Clustering Tutorial

```bash
echo "                                                                                ";
echo " ██████╗██╗     ██╗   ██╗███████╗████████╗███████╗██████╗ ██╗███╗   ██╗ ██████╗ ";
echo "██╔════╝██║     ██║   ██║██╔════╝╚══██╔══╝██╔════╝██╔══██╗██║████╗  ██║██╔════╝ ";
echo "██║     ██║     ██║   ██║███████╗   ██║   █████╗  ██████╔╝██║██╔██╗ ██║██║  ███╗";
echo "██║     ██║     ██║   ██║╚════██║   ██║   ██╔══╝  ██╔══██╗██║██║╚██╗██║██║   ██║";
echo "╚██████╗███████╗╚██████╔╝███████║   ██║   ███████╗██║  ██║██║██║ ╚████║╚██████╔╝";
echo " ╚═════╝╚══════╝ ╚═════╝ ╚══════╝   ╚═╝   ╚══════╝╚═╝  ╚═╝╚═╝╚═╝  ╚═══╝ ╚═════╝ ";
echo "                                                                                ";
echo "      ██████╗ ██╗      █████╗ ██╗   ██╗██████╗  ██████╗  ██████╗ ██╗  ██╗       ";
echo "      ██╔══██╗██║     ██╔══██╗╚██╗ ██╔╝██╔══██╗██╔═══██╗██╔═══██╗██║ ██╔╝       ";
echo "      ██████╔╝██║     ███████║ ╚████╔╝ ██████╔╝██║   ██║██║   ██║█████╔╝        ";
echo "      ██╔═══╝ ██║     ██╔══██║  ╚██╔╝  ██╔══██╗██║   ██║██║   ██║██╔═██╗        ";
echo "      ██║     ███████╗██║  ██║   ██║   ██████╔╝╚██████╔╝╚██████╔╝██║  ██╗       ";
echo "      ╚═╝     ╚══════╝╚═╝  ╚═╝   ╚═╝   ╚═════╝  ╚═════╝  ╚═════╝ ╚═╝  ╚═╝       ";
echo "                                                                                ";
```

Language: [English](README.md) | [简体中文](README.zh-CN.md) | [日本語](README.ja.md)

This repository is a teaching project that was reorganized from the ground up for learning clustering methods. The first four representative algorithms remain as readable hand-written implementations, and two more practically oriented methods have been added. The old blog-style notes have been rebuilt into trilingual documentation that reads well on GitHub and runs as-is.

## What This Repository Offers

- Sample scripts that run as-is on Python 3.9 and newer.
- Four hand-written methods: K-Means++, Fuzzy C-Means, agglomerative hierarchical clustering, and DBSCAN.
- Two additional methods: OPTICS and spectral clustering.
- A common CLI for every script: `--seed`, `--samples`, `--output`, and `--show`.
- Shared utilities: [clustering_utils.py](clustering_utils.py).
- Smoke tests: [tests/test_repository.py](tests/test_repository.py).
- CI: [.github/workflows/ci.yml](.github/workflows/ci.yml).

## Clustering at a Glance

Clustering is the task of grouping unlabeled data so that similar items end up together. What counts as "similar", however, differs from method to method: some emphasize closeness to a center, while others emphasize density, merge history, or the structure of a neighborhood graph.

```mermaid
flowchart TD
    A["Raw unlabeled data"] --> B{"By what criterion do points share a group?"}
    B --> C["Closeness to a center<br/>K-Means++, Fuzzy C-Means"]
    B --> D["Repeatedly merge the closest clusters<br/>Agglomerative Hierarchical"]
    B --> E["Local density and reachability<br/>DBSCAN, OPTICS"]
    B --> F["Neighborhood graph structure<br/>Spectral Clustering"]
```

## Quick Start

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

Run any demo directly:

```bash
python3 k_means_plus_plus.py
python3 fuzzy_c_means.py
python3 agglomerative_hierarchical.py
python3 dbscan.py
python3 optics.py
python3 spectral_clustering.py
```

Shared options:

- `--seed`: random seed.
- `--samples`: number of samples.
- `--output`: path where the output image is saved.
- `--show`: display the figure after saving.

Run the tests:

```bash
pytest
```

## Repository Layout

| File | Role |
| --- | --- |
| [clustering_utils.py](clustering_utils.py) | Shared logic such as data generation, plotting, argument validation, and k-means++ initialization. |
| [k_means_plus_plus.py](k_means_plus_plus.py) | Hand-written K-Means++ demo. |
| [fuzzy_c_means.py](fuzzy_c_means.py) | Hand-written Fuzzy C-Means demo. |
| [agglomerative_hierarchical.py](agglomerative_hierarchical.py) | Hand-written agglomerative hierarchical clustering demo. |
| [dbscan.py](dbscan.py) | Hand-written DBSCAN demo. |
| [optics.py](optics.py) | OPTICS demo with a reachability plot, using scikit-learn. |
| [spectral_clustering.py](spectral_clustering.py) | Spectral clustering demo using scikit-learn. |
| [requirements.txt](requirements.txt) | Minimal dependencies for running and testing. |

## Algorithm Comparison

| Algorithm | Family | Needs `k` in advance? | Robust to noise? | Handles non-convex shapes? | Typical complexity | Suited for |
| --- | --- | --- | --- | --- | --- | --- |
| K-Means++ | Prototype / partitioning | Yes | No | Generally weak | Time `O(nkt)`, space `O(n + k)` | Fast baseline for roughly spherical clusters |
| Fuzzy C-Means | Soft partitioning | Yes | No | Generally weak | Time `O(nkt)`, space `O(nk)` | Keeping overlap as memberships |
| Agglomerative Hierarchical | Hierarchical | Not while merging; you pick the final cut level | Limited | Sometimes | Teaching version: time `O(n^3)`, space `O(n^2)` | Viewing hierarchy on small datasets |
| DBSCAN | Density-based | No | Yes | Yes | Teaching version: time `O(n^2)`, space `O(n^2)` | Irregularly shaped clusters with outliers |
| OPTICS | Density ordering | No | Yes | Yes | Often near `O(n log n)` on average, worst `O(n^2)` | Data whose density varies by region |
| Spectral Clustering | Graph / spectral | Yes | No explicit noise model | Yes | Time often `O(n^3)`, space `O(n^2)` | Separating graph structure and non-convex manifolds |

## A Rough Guide to Choosing

- K-Means++ when the cluster count is known and clusters are fairly compact.
- Fuzzy C-Means when boundaries are vague and you also want degrees of membership.
- Agglomerative hierarchical clustering when you want to see how clusters merge step by step.
- DBSCAN when noise removal and arbitrarily shaped clusters matter.
- OPTICS when DBSCAN is too sensitive to a single density threshold.
- Spectral clustering when neighborhood-graph structure matters more than centers.

## How to Read the Demos

- The scripts in this repository are deliberately kept small and readable so that the core of each algorithm stays visible.
- Each figure shows the final cluster assignments, and the "Core source snippet" below it shows the minimal computational step that characterizes the method.
- When comparing methods, three questions make the differences easiest to grasp: can noise be separated, do curved shapes survive, and must the cluster count be fixed in advance?

## Decision Table

| If your problem looks like this | Try first | Why it is a good first candidate |
| --- | --- | --- |
| Roughly spherical clusters with a known count | K-Means++ | Fast, stable, and easy to explain |
| Points overlap and a single label is not enough | Fuzzy C-Means | Keeps ambiguity as memberships |
| You want to see coarse through fine granularity | Agglomerative Hierarchical | The merge history itself is meaningful |
| Shapes are irregular and outliers matter | DBSCAN | Separates noise naturally, based on density |
| Density differs considerably by region | OPTICS | Less tied to a single threshold |
| Graph structure such as rings or moons dominates | Spectral Clustering | Emphasizes neighborhood relations over centers |

## Parameter Cheat Sheet

| Algorithm | Key parameters | Increasing them tends to | Decreasing them tends to |
| --- | --- | --- | --- |
| K-Means++ | `--clusters`, `--tolerance` | More clusters split the data more finely; a larger tolerance stops earlier | Fewer clusters merge groups; a smaller tolerance iterates longer |
| Fuzzy C-Means | `--clusters`, `--fuzziness` | Higher fuzziness makes memberships softer | Lower fuzziness approaches hard assignments |
| Agglomerative | `--clusters`, `--linkage` | More target clusters preserve finer structure | Fewer target clusters force more merging |
| DBSCAN | `--eps`, `--min-samples` | Larger `eps` or smaller `min-samples` merges points more easily | Smaller `eps` or larger `min-samples` splits clusters and produces more noise |
| OPTICS | `--xi`, `--min-cluster-size` | A larger minimum cluster size is more conservative and favors larger clusters | A smaller minimum cluster size picks up finer clusters but fragments more easily |
| Spectral | `--clusters`, `--neighbors` | More neighbors smooth the graph | Fewer neighbors emphasize local structure but can disconnect the graph |

## K-Means++

**Core idea**

K-Means++ leaves the k-means objective itself unchanged and only makes the choice of initial centers smarter. Placing the initial centers far apart makes the algorithm less likely to fall into bad local optima.

**Steps**

1. Pick the first center at random.
2. Pick each next center with probability that favors points far from the existing centers.
3. Assign every point to its nearest center.
4. Update each center as the mean of its assigned points.
5. Repeat until the centers barely move.

**Process map**

```mermaid
flowchart TD
    A["Initialize centers with k-means++"] --> B["Assign each point to the nearest center"]
    B --> C["Update centers with cluster means"]
    C --> D{"Are the center moves small enough?"}
    D -- "No" --> B
    D -- "Yes" --> E["Return labels, centers, and the trace"]
```

**Key parameters**

- `--clusters`: number of clusters.
- `--max-iter`: maximum number of iterations.
- `--tolerance`: convergence threshold.

**Strengths**

- Fast and easy to understand.
- Strong on compact clusters.
- More stable than random initialization.

**Limitations**

- The cluster count must be fixed in advance.
- Weak against noise and outliers.
- Not suited to curved or nested shapes.

**Complexity**

- Time: `O(nkt)`
- Space: `O(n + k)`

**Run**

```bash
python3 k_means_plus_plus.py --samples 320 --clusters 4 --output k_means_plus_plus.png
```

**Core source snippet**

```python
def run_kmeans(points, n_clusters, *, seed, max_iter, tolerance):
    rng = np.random.default_rng(seed)
    centers = initialize_kmeans_plus_plus(points, n_clusters, rng)
    center_trace = [centers.copy()]

    for iteration in range(1, max_iter + 1):
        labels, inertia = assign_points(points, centers)
        updated_centers = update_centers(points, labels, centers, rng)
        center_trace.append(updated_centers.copy())

        # Stop once every center move becomes tiny.
        center_shift = np.linalg.norm(updated_centers - centers, axis=1).max()
        centers = updated_centers
        if center_shift <= tolerance:
            break

    final_labels, inertia = assign_points(points, centers)
    return KMeansResult(final_labels, centers, center_trace, inertia, iteration)
```

Source: [k_means_plus_plus.py](k_means_plus_plus.py)

![K-Means++ output](k_means_plus_plus.png)

## Fuzzy C-Means

**Core idea**

Fuzzy C-Means is a soft clustering method in which each point carries a membership that expresses how strongly it belongs to each cluster. It suits data whose boundaries are not clear-cut.
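
To see how the fuzziness exponent controls this softness, the repository's own `update_memberships` can be called directly on a toy example. This is a sketch that assumes [fuzzy_c_means.py](fuzzy_c_means.py) imports without side effects; the trend, not the exact numbers, is the point:

```python
import numpy as np
from fuzzy_c_means import update_memberships

# Two points on a line between two centers at x=0 and x=1.
points = np.array([[0.2, 0.0], [0.9, 0.0]])
centers = np.array([[0.0, 0.0], [1.0, 0.0]])

for fuzziness in (1.2, 2.0, 4.0):
    memberships = update_memberships(points, centers, fuzziness)
    print(fuzziness, np.round(memberships, 3))
# Near 1.0 the memberships approach hard 0/1 labels;
# larger values spread weight across both centers.
```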

**Steps**

1. Initialize centers in k-means++ style.
2. Compute each point's memberships.
3. Update each center as a membership-weighted average.
4. Repeat until the centers converge.

**Process map**

```mermaid
flowchart TD
    A["Initialize centers"] --> B["Compute memberships for every point and center"]
    B --> C["Update centers with membership weights"]
    C --> D{"Have the centers converged?"}
    D -- "No" --> B
    D -- "Yes" --> E["Convert to display labels"]
```

**Key parameters**

- `--clusters`: number of centers.
- `--fuzziness`: strength of the softness.
- `--max-iter`: maximum number of iterations.
- `--tolerance`: convergence check.

**Strengths**

- Preserves uncertainty near boundaries.
- Quantifies closeness to multiple clusters.

**Limitations**

- Needs the cluster count.
- Does not handle noise explicitly.
- Heavier than K-Means because it keeps the membership matrix.

**Complexity**

- Time: `O(nkt)`
- Space: `O(nk)`

**Run**

```bash
python3 fuzzy_c_means.py --samples 320 --clusters 4 --fuzziness 2.0 --output fuzzy_c_means.png
```

**Core source snippet**

```python
def update_memberships(points, centers, fuzziness):
    distances = euclidean_distance_matrix(points, centers)
    memberships = np.zeros_like(distances)
    exponent = 2.0 / (fuzziness - 1.0)

    for sample_index, distance_row in enumerate(distances):
        zero_mask = distance_row == 0.0
        if np.any(zero_mask):
            # If a sample lands exactly on a center, give that center full weight.
            memberships[sample_index, zero_mask] = 1.0 / float(zero_mask.sum())
            continue

        ratios = (distance_row[:, None] / distance_row[None, :]) ** exponent
        memberships[sample_index] = 1.0 / ratios.sum(axis=1)
    return memberships
```

Source: [fuzzy_c_means.py](fuzzy_c_means.py)

![Fuzzy C-Means output](fuzzy_c_means.png)

## Agglomerative Hierarchical Clustering

**Core idea**

This method starts with each point as its own cluster and gradually merges the closest clusters. You can follow the structure from fine granularity up to coarse granularity.

**Steps**

1. Treat every point as an independent cluster.
2. Measure inter-cluster distance with a linkage rule.
3. Merge the two closest clusters.
4. Repeat until the target number of clusters remains.

**Process map**

```mermaid
flowchart TD
    A["Start with each point as its own cluster"] --> B["Measure distances between clusters"]
    B --> C["Merge the two closest clusters"]
    C --> D{"Reached the target number of clusters?"}
    D -- "No" --> B
    D -- "Yes" --> E["Output labels at that level"]
```

**Key parameters**

- `--clusters`: number of clusters to keep at the end.
- `--linkage`: `single`, `average`, or `complete`.

**Strengths**

- Lets you inspect the hierarchical structure.
- Needs no initial centers.

**Limitations**

- The teaching implementation is computationally expensive.
- Intermediate merges cannot be undone later.

**Complexity**

- Time: `O(n^3)` in the teaching version
- Space: `O(n^2)`

**Run**

```bash
python3 agglomerative_hierarchical.py --samples 140 --clusters 4 --linkage average --output agglomerative_hierarchical.png
```

**Core source snippet**

```python
while len(clusters) > n_clusters:
    best_pair = None
    best_distance = float("inf")

    for left_id, right_id in combinations(sorted(clusters), 2):
        candidate_distance = linkage_distance(
            distances,
            clusters[left_id],
            clusters[right_id],
            linkage,
        )
        if candidate_distance < best_distance:
            best_distance = candidate_distance
            best_pair = (left_id, right_id)

    # Merge the closest two clusters and keep going upward.
    left_id, right_id = best_pair
    clusters[next_cluster_id] = clusters.pop(left_id) + clusters.pop(right_id)
    next_cluster_id += 1
```

Source: [agglomerative_hierarchical.py](agglomerative_hierarchical.py)

![Agglomerative hierarchical clustering output](agglomerative_hierarchical.png)

## DBSCAN

**Core idea**

DBSCAN defines clusters through local density. It grows clusters out of dense regions, and its key distinguishing feature is that isolated points can be treated as noise.

**Steps**

1. For each point, collect the neighbors within radius `eps`.
2. Mark points with at least `min_samples` neighbors as core points.
3. Expand clusters outward from core points.
4. Add reachable border points to their clusters.
5. Leave points that fit nowhere as noise.

**Process map**

```mermaid
flowchart TD
    A["Find each point's eps-neighborhood"] --> B["Identify core points via min_samples"]
    B --> C["Expand density-reachable regions from core points"]
    C --> D["Attach border points to their clusters"]
    D --> E["Treat the rest as noise"]
```

**Key parameters**

- `--eps`: neighborhood radius.
- `--min-samples`: minimum number of neighbors for a core point.

**Strengths**

- Separates noise naturally.
- Strong on arbitrarily shaped clusters.
- No need to fix the cluster count in advance.

**Limitations**

- Struggles when densities differ widely.
- Parameters tend to be data dependent.
- The teaching version uses the full distance matrix, favoring readability.

**Complexity**

- Time: `O(n^2)` in the teaching version
- Space: `O(n^2)`

**Run**

```bash
python3 dbscan.py --samples 240 --eps 0.45 --min-samples 5 --output dbscan.png
```

**Core source snippet**

```python
for point_index in range(len(points)):
    if labels[point_index] != UNVISITED_LABEL:
        continue
    if not core_mask[point_index]:
        labels[point_index] = NOISE_LABEL
        continue

    labels[point_index] = cluster_id
    queue = deque(neighborhoods[point_index])

    # Expand the cluster through density-reachable core points.
    while queue:
        neighbor_index = queue.popleft()
        if labels[neighbor_index] == NOISE_LABEL:
            labels[neighbor_index] = cluster_id
        if labels[neighbor_index] != UNVISITED_LABEL:
            continue
        labels[neighbor_index] = cluster_id
        if core_mask[neighbor_index]:
            queue.extend(neighborhoods[neighbor_index])
```

Source: [dbscan.py](dbscan.py)

![DBSCAN output](dbscan.png)

## OPTICS

**Core idea**

OPTICS orders points along density reachability and extracts clusters from that ordering. It is more flexible than DBSCAN in cases where regions of different density are mixed together.

**Steps**

1. Order the points based on density reachability.
2. Record each point's reachability distance.
3. Build a reachability plot.
4. Extract clusters with a rule such as `xi`.

**Process map**

```mermaid
flowchart TD
    A["Order points along density reachability"] --> B["Record reachability distances"]
    B --> C["Build the reachability plot"]
    C --> D["Cluster the valleys with xi and a minimum size"]
    D --> E["Return labels and diagnostics"]
```

**Key parameters**

- `--min-samples`: neighborhood size used for density estimation.
- `--xi`: steepness threshold for splitting valleys.
- `--min-cluster-size`: minimum cluster size.

**Strengths**

- Handles varying densities better than DBSCAN.
- The reachability plot is useful for diagnostics.

**Limitations**

- Somewhat harder to explain than DBSCAN.
- More parameters to manage.

**Complexity**

- Often close to `O(n log n)` on average.
- `O(n^2)` in the worst case.

**Run**

```bash
python3 optics.py --samples 240 --min-samples 6 --xi 0.08 --min-cluster-size 24 --output optics.png
```

**Core source snippet**

```python
model = OPTICS(
    min_samples=args.min_samples,
    xi=args.xi,
    min_cluster_size=args.min_cluster_size,
    cluster_method="xi",
)
labels = model.fit_predict(points)

# OPTICS exposes an ordering plus reachability values, so we can
# inspect the density landscape instead of only the final labels.
ordering = model.ordering_
reachability = model.reachability_[ordering].copy()
```

Source: [optics.py](optics.py)

![OPTICS output](optics.png)

## Spectral Clustering

**Core idea**

This method converts the data into a neighborhood graph and partitions it using the graph's spectral information. It is strong on shapes that are hard to express through centers.

**Steps**

1. Build the neighborhood graph.
2. Compute the spectrum of the graph Laplacian.
3. Embed into the eigenvector space.
4. Partition the data in that space.

**Process map**

```mermaid
flowchart TD
    A["Build the neighborhood graph"] --> B["Compute the Laplacian spectrum"]
    B --> C["Embed with the leading eigenvectors"]
    C --> D["Partition in the embedding space"]
    D --> E["Map labels back to the original samples"]
```

**Key parameters**

- `--clusters`: desired number of clusters.
- `--neighbors`: number of neighbors used to build the graph.

**Strengths**

- Strong on non-convex structure such as rings and moons.
- Handles graph structure naturally.

**Limitations**

- Needs the cluster count.
- High memory consumption.
- Does not handle noise explicitly.

**Complexity**

- Time: dominated by the eigendecomposition, sometimes approaching `O(n^3)`.
- Space: `O(n^2)`

**Run**

```bash
python3 spectral_clustering.py --samples 320 --clusters 2 --neighbors 12 --output spectral_clustering.png
```

**Core source snippet**

```python
model = SpectralClustering(
    n_clusters=args.clusters,
    affinity="nearest_neighbors",
    n_neighbors=args.neighbors,
    assign_labels="kmeans",
    random_state=args.seed,
)

# The graph Laplacian embedding happens inside fit_predict.
labels = model.fit_predict(points)
```

Source: [spectral_clustering.py](spectral_clustering.py)

![Spectral clustering output](spectral_clustering.png)

## Compatibility, Testing, and Implementation Notes

- All scripts target Python 3.9 and newer and avoid syntax that exists only in Python 3.10 or later.
- The first four methods are hand-written implementations that prioritize ease of understanding.
- DBSCAN uses the full distance matrix for readability.
- OPTICS and spectral clustering use the scikit-learn implementations as practical demos.
- CI runs `pytest` on Python `3.9` and the latest `3.x`.

If you want to compare the six methods side by side, the clearest approach is to run each script once with its defaults and compare the images generated in the repository root.


================================================
FILE: README.md
================================================
# Clustering Playbook: Trilingual Clustering Tutorial

```bash
echo "                                                                                ";
echo " ██████╗██╗     ██╗   ██╗███████╗████████╗███████╗██████╗ ██╗███╗   ██╗ ██████╗ ";
echo "██╔════╝██║     ██║   ██║██╔════╝╚══██╔══╝██╔════╝██╔══██╗██║████╗  ██║██╔════╝ ";
echo "██║     ██║     ██║   ██║███████╗   ██║   █████╗  ██████╔╝██║██╔██╗ ██║██║  ███╗";
echo "██║     ██║     ██║   ██║╚════██║   ██║   ██╔══╝  ██╔══██╗██║██║╚██╗██║██║   ██║";
echo "╚██████╗███████╗╚██████╔╝███████║   ██║   ███████╗██║  ██║██║██║ ╚████║╚██████╔╝";
echo " ╚═════╝╚══════╝ ╚═════╝ ╚══════╝   ╚═╝   ╚══════╝╚═╝  ╚═╝╚═╝╚═╝  ╚═══╝ ╚═════╝ ";
echo "                                                                                ";
echo "      ██████╗ ██╗      █████╗ ██╗   ██╗██████╗  ██████╗  ██████╗ ██╗  ██╗       ";
echo "      ██╔══██╗██║     ██╔══██╗╚██╗ ██╔╝██╔══██╗██╔═══██╗██╔═══██╗██║ ██╔╝       ";
echo "      ██████╔╝██║     ███████║ ╚████╔╝ ██████╔╝██║   ██║██║   ██║█████╔╝        ";
echo "      ██╔═══╝ ██║     ██╔══██║  ╚██╔╝  ██╔══██╗██║   ██║██║   ██║██╔═██╗        ";
echo "      ██║     ███████╗██║  ██║   ██║   ██████╔╝╚██████╔╝╚██████╔╝██║  ██╗       ";
echo "      ╚═╝     ╚══════╝╚═╝  ╚═╝   ╚═╝   ╚═════╝  ╚═════╝  ╚═════╝ ╚═╝  ╚═╝       ";
echo "                                                                                ";
```

Language: [English](README.md) | [简体中文](README.zh-CN.md) | [日本語](README.ja.md)

This repository is a refreshed clustering-algorithm learning project. It keeps the first four classic algorithms as readable hand-written Python demos, adds two widely used modern extensions, and turns the old blog-style notes into a GitHub-friendly, runnable reference.

## What this repository offers

- Python 3.9+ compatible scripts that run directly from the repository root.
- Four hand-written teaching demos: K-Means++, Fuzzy C-Means, Agglomerative Hierarchical Clustering, and DBSCAN.
- Two additional practical demos: OPTICS and Spectral Clustering.
- Consistent CLI for every script: `--seed`, `--samples`, `--output`, and `--show`.
- Shared plotting and dataset utilities in [clustering_utils.py](clustering_utils.py).
- Smoke tests in [tests/test_repository.py](tests/test_repository.py) and CI in [.github/workflows/ci.yml](.github/workflows/ci.yml).

## Clustering at a Glance

Clustering is the task of grouping unlabeled samples by similarity. The key difference between clustering algorithms is not just speed, but the very definition of what a "cluster" means: a tight group around a center, a connected dense region, a merge path in a hierarchy, or a partition in a similarity graph.

```mermaid
flowchart TD
    A["Raw unlabeled samples"] --> B{"What makes points belong together?"}
    B --> C["Nearest prototype or center<br/>K-Means++, Fuzzy C-Means"]
    B --> D["Repeatedly merge closest groups<br/>Agglomerative Hierarchical"]
    B --> E["Dense reachable neighborhoods<br/>DBSCAN, OPTICS"]
    B --> F["Graph neighborhood structure<br/>Spectral Clustering"]
```

## Quick Start

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

Run any demo directly:

```bash
python3 k_means_plus_plus.py
python3 fuzzy_c_means.py
python3 agglomerative_hierarchical.py
python3 dbscan.py
python3 optics.py
python3 spectral_clustering.py
```

Useful options shared by all scripts (a hedged sketch of the parser behind them follows the list):

- `--seed`: controls reproducibility.
- `--samples`: controls the dataset size.
- `--output`: saves the figure to a custom path.
- `--show`: displays the figure after saving it.
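
For orientation, a shared parser along these lines would produce the options above. This is a minimal sketch with assumed defaults, not the actual `build_common_parser` in [clustering_utils.py](clustering_utils.py):

```python
import argparse

def build_common_parser(description):
    """Sketch of the shared CLI; the defaults here are assumptions, not the repository's."""
    parser = argparse.ArgumentParser(description=description)
    parser.add_argument("--seed", type=int, default=42, help="random seed for reproducibility")
    parser.add_argument("--samples", type=int, default=320, help="number of generated samples")
    parser.add_argument("--output", default=None, help="path where the figure is saved")
    parser.add_argument("--show", action="store_true", help="display the figure after saving")
    return parser
```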

Run the smoke tests:

```bash
pytest
```

## Repository Layout

| File | Purpose |
| --- | --- |
| [clustering_utils.py](clustering_utils.py) | Shared dataset generation, plotting helpers, validation helpers, and k-means++ initialization. |
| [k_means_plus_plus.py](k_means_plus_plus.py) | Hand-written K-Means++ demo for convex blob-shaped clusters. |
| [fuzzy_c_means.py](fuzzy_c_means.py) | Hand-written fuzzy c-means demo for soft memberships. |
| [agglomerative_hierarchical.py](agglomerative_hierarchical.py) | Hand-written agglomerative hierarchical clustering demo. |
| [dbscan.py](dbscan.py) | Hand-written DBSCAN demo for noise-aware density clustering. |
| [optics.py](optics.py) | OPTICS demo with a reachability plot, using scikit-learn. |
| [spectral_clustering.py](spectral_clustering.py) | Spectral clustering demo for non-convex shapes, using scikit-learn. |
| [requirements.txt](requirements.txt) | Minimal dependencies for scripts and tests. |

## Algorithm Comparison

| Algorithm | Family | Need `k` in advance? | Handles noise? | Works well on non-convex shapes? | Typical complexity | Best fit |
| --- | --- | --- | --- | --- | --- | --- |
| K-Means++ | Prototype / partitioning | Yes | No | Usually no | Time `O(nkt)`, space `O(n + k)` | Fast baseline when clusters are compact and roughly spherical |
| Fuzzy C-Means | Soft partitioning | Yes | No | Usually no | Time `O(nkt)`, space `O(nk)` | When soft memberships are more informative than hard assignments |
| Agglomerative Hierarchical | Hierarchical | No fixed `k` during merging, but you choose a cut level | Limited | Sometimes | Demo time `O(n^3)`, space `O(n^2)` | Small datasets where merge history matters |
| DBSCAN | Density-based | No | Yes | Yes | Demo time `O(n^2)`, space `O(n^2)` | Irregular clusters with outliers and unknown cluster count |
| OPTICS | Density-based ordering | No | Yes | Yes | Often near `O(n log n)` with indexing, worst `O(n^2)` | Datasets with different local densities |
| Spectral Clustering | Graph / partitioning | Yes | No explicit noise model | Yes | Time often `O(n^3)`, space `O(n^2)` | Non-convex manifolds and graph-like similarity structure |

## Choosing an Algorithm Quickly

- Start with K-Means++ when you know the cluster count and expect compact groups.
- Choose Fuzzy C-Means when overlap matters and you want degrees of membership.
- Choose Agglomerative Hierarchical Clustering when you care about multi-level structure or cluster merge history.
- Choose DBSCAN when you need outlier resistance and arbitrary cluster shapes.
- Choose OPTICS when DBSCAN feels too sensitive to a single global density threshold.
- Choose Spectral Clustering when clusters are better described by neighborhood graphs than by geometric centers.

## How to Read the Demos

- The scripts in this repository are intentionally small, so each one highlights the core training loop or clustering expansion step instead of hiding everything behind a large framework.
- The generated figures show the final assignments, while the code snippets below focus on the one operation that makes each algorithm different from the others.
- When comparing algorithms, look at three things together: whether noise is isolated, whether curved shapes stay intact, and whether the method needs the cluster count in advance.

## Decision Table

| If your data looks like this | Best first algorithm | Why it is a good starting point |
| --- | --- | --- |
| Compact blob-like groups and a known cluster count | K-Means++ | Fast, stable, and easy to explain |
| Samples overlap and hard labels hide useful uncertainty | Fuzzy C-Means | Keeps soft memberships instead of a single forced label |
| You want to inspect coarse-to-fine grouping structure | Agglomerative Hierarchical | The merge path itself is part of the result |
| Clusters are irregular and outliers matter | DBSCAN | Density expansion naturally separates noise |
| Density changes a lot across regions | OPTICS | More flexible than DBSCAN under one global threshold |
| Rings, moons, or graph-shaped neighborhoods dominate | Spectral Clustering | Uses graph structure instead of center geometry |

## Parameter Cheat Sheet

| Algorithm | Most important knobs | If you increase it | If you decrease it |
| --- | --- | --- | --- |
| K-Means++ | `--clusters`, `--tolerance` | More clusters can split groups; larger tolerance stops earlier | Fewer clusters merge groups; smaller tolerance runs longer |
| Fuzzy C-Means | `--clusters`, `--fuzziness` | Larger fuzziness makes memberships softer | Smaller fuzziness makes assignments harder |
| Agglomerative | `--clusters`, `--linkage` | More target clusters keeps finer structure | Fewer target clusters force more merging |
| DBSCAN | `--eps`, `--min-samples` | Larger `eps` or smaller `min-samples` tends to merge more points | Smaller `eps` or larger `min-samples` tends to split clusters and create more noise |
| OPTICS | `--xi`, `--min-cluster-size` | Larger minimum size favors broader, more conservative clusters | Smaller minimum size allows finer clusters but can fragment the result |
| Spectral | `--clusters`, `--neighbors` | More neighbors smooth the graph more strongly | Fewer neighbors keep local structure sharper but can disconnect the graph |

## K-Means++

**Core idea**

K-Means++ keeps the classic k-means objective but replaces naive random seeding with a smarter initialization strategy. By spreading the initial centers apart, it usually reaches better local optima and converges in fewer iterations than random starts.
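
The seeding step is the whole trick. Below is a minimal sketch of squared-distance-weighted sampling; the repository's `initialize_kmeans_plus_plus` in [clustering_utils.py](clustering_utils.py) may differ in implementation details:

```python
import numpy as np

def kmeans_pp_seeds(points, n_clusters, rng):
    """Pick the first center uniformly, then sample the rest with D^2 weighting."""
    centers = [points[rng.integers(len(points))]]
    for _ in range(n_clusters - 1):
        # Squared distance from every point to its closest already-chosen center.
        diffs = points[:, None, :] - np.asarray(centers)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        centers.append(points[rng.choice(len(points), p=d2 / d2.sum())])
    return np.asarray(centers)

rng = np.random.default_rng(0)
print(kmeans_pp_seeds(rng.normal(size=(200, 2)), 4, rng).shape)  # (4, 2)
```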

**Algorithm steps**

1. Pick the first center randomly.
2. Pick each later center with probability proportional to its squared distance from the closest existing center.
3. Assign every sample to the nearest center.
4. Recompute each center as the mean of its assigned samples.
5. Repeat the assign-and-update cycle until the centers barely move.

**Process map**

```mermaid
flowchart TD
    A["Initialize centers with k-means++"] --> B["Assign every point to the nearest center"]
    B --> C["Recompute each center as the cluster mean"]
    C --> D{"Centers moved less than tolerance?"}
    D -- "No" --> B
    D -- "Yes" --> E["Return labels, centers, and center trace"]
```

**Key parameters**

- `--clusters`: number of centers to fit.
- `--max-iter`: optimization steps before forcing a stop.
- `--tolerance`: convergence threshold for center movement.

**Strengths**

- Fast and easy to interpret.
- Excellent baseline for compact, balanced clusters.
- Initialization is much more stable than plain random k-means.

**Limitations**

- Requires the number of clusters in advance.
- Sensitive to strong outliers because every point must belong somewhere.
- Assumes mean-like centers represent the cluster shape well.

**Use when**

- You need a simple, strong baseline.
- The data contains roughly spherical or blob-like clusters.
- Speed matters more than modeling complex geometry.

**Avoid when**

- Cluster shapes are curved, nested, or strongly non-convex.
- The dataset contains obvious noise points.
- You need soft assignments instead of one label per point.

**Complexity**

- Time: `O(nkt)` where `n` is the sample count, `k` is the cluster count, and `t` is the number of iterations.
- Space: `O(n + k)` for the main state in this demo.

**Run**

```bash
python3 k_means_plus_plus.py --samples 320 --clusters 4 --output k_means_plus_plus.png
```

**Core source snippet**

```python
def run_kmeans(points, n_clusters, *, seed, max_iter, tolerance):
    rng = np.random.default_rng(seed)
    centers = initialize_kmeans_plus_plus(points, n_clusters, rng)
    center_trace = [centers.copy()]

    for iteration in range(1, max_iter + 1):
        labels, inertia = assign_points(points, centers)
        updated_centers = update_centers(points, labels, centers, rng)
        center_trace.append(updated_centers.copy())

        # Stop once every center move becomes tiny.
        center_shift = np.linalg.norm(updated_centers - centers, axis=1).max()
        centers = updated_centers
        if center_shift <= tolerance:
            break

    final_labels, inertia = assign_points(points, centers)
    return KMeansResult(final_labels, centers, center_trace, inertia, iteration)
```

Source: [k_means_plus_plus.py](k_means_plus_plus.py)

![K-Means++ output](k_means_plus_plus.png)

## Fuzzy C-Means

**Core idea**

Fuzzy C-Means extends centroid-based clustering by letting each point belong to every cluster with a membership score. That makes it useful when boundaries are soft and a hard yes-or-no assignment hides information.
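
The other half of the update loop, the center step, is a membership-weighted mean. The sketch below shows the standard rule; it is an illustration, not a copy of the repository's own update function:

```python
import numpy as np

def fcm_update_centers(points, memberships, fuzziness):
    """Standard FCM update: centers are means weighted by membership**fuzziness."""
    weights = memberships ** fuzziness            # shape (n_samples, n_clusters)
    return (weights.T @ points) / weights.sum(axis=0)[:, None]
```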

**Algorithm steps**

1. Initialize centers with the same k-means++ style seeding used in this repository.
2. Compute each point's membership to every center.
3. Update each center as a weighted average using membership scores.
4. Repeat until the centers stop moving.
5. Convert soft memberships to hard labels only for visualization.

**Process map**

```mermaid
flowchart TD
    A["Initialize centers"] --> B["Compute memberships for every sample-center pair"]
    B --> C["Update centers with membership-weighted averages"]
    C --> D{"Centers converged?"}
    D -- "No" --> B
    D -- "Yes" --> E["Convert highest membership to display labels"]
```

**Key parameters**

- `--clusters`: number of fuzzy centers.
- `--fuzziness`: controls how soft the memberships are. Values closer to `1` behave more like hard clustering.
- `--max-iter`: optimization budget.
- `--tolerance`: convergence threshold.

**Strengths**

- Preserves uncertainty at cluster boundaries.
- More expressive than hard assignments when groups overlap.
- Useful for exploratory analysis and soft decision support.

**Limitations**

- Still requires the cluster count in advance.
- Still does not have a native noise model.
- Costs more than K-Means because it updates the full membership matrix.

**Use when**

- Samples can reasonably belong to more than one group.
- You want interpretable center-based clustering with soft confidence.
- You need a bridge between hard partitioning and probabilistic thinking.

**Avoid when**

- Noise rejection is the first priority.
- You need a very large-scale solution with minimal memory use.
- Non-convex cluster shapes dominate the problem.

**Complexity**

- Time: `O(nkt)` with a larger constant than K-Means.
- Space: `O(nk)` because the membership matrix is stored explicitly.

**Run**

```bash
python3 fuzzy_c_means.py --samples 320 --clusters 4 --fuzziness 2.0 --output fuzzy_c_means.png
```

**Core source snippet**

```python
def update_memberships(points, centers, fuzziness):
    distances = euclidean_distance_matrix(points, centers)
    memberships = np.zeros_like(distances)
    exponent = 2.0 / (fuzziness - 1.0)

    for sample_index, distance_row in enumerate(distances):
        zero_mask = distance_row == 0.0
        if np.any(zero_mask):
            # If a sample lands exactly on a center, give that center full weight.
            memberships[sample_index, zero_mask] = 1.0 / float(zero_mask.sum())
            continue

        ratios = (distance_row[:, None] / distance_row[None, :]) ** exponent
        memberships[sample_index] = 1.0 / ratios.sum(axis=1)
    return memberships
```

Source: [fuzzy_c_means.py](fuzzy_c_means.py)

![Fuzzy C-Means output](fuzzy_c_means.png)

## Agglomerative Hierarchical Clustering

**Core idea**

Agglomerative clustering starts with one cluster per point and repeatedly merges the closest pair of clusters. Instead of committing to one level immediately, it builds structure from fine detail to coarse detail.
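
Because `linkage_distance` in [agglomerative_hierarchical.py](agglomerative_hierarchical.py) is a plain function, the three linkage rules can be compared on a toy distance matrix, assuming the module imports cleanly:

```python
import numpy as np
from agglomerative_hierarchical import linkage_distance

points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]])
diffs = points[:, None, :] - points[None, :, :]
distances = np.sqrt((diffs ** 2).sum(axis=2))

# Two obvious clusters: {0, 1} on the left, {2, 3} on the right.
for rule in ("single", "average", "complete"):
    print(rule, round(linkage_distance(distances, [0, 1], [2, 3], rule), 4))
# single uses the closest cross-pair (5.0), complete the farthest (~5.099),
# and average sits in between.
```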

**Algorithm steps**

1. Treat each point as its own cluster.
2. Compute inter-cluster distances with a linkage rule.
3. Merge the closest pair of clusters.
4. Repeat until only the requested number of clusters remains.

**Process map**

```mermaid
flowchart TD
    A["Start with one cluster per point"] --> B["Measure distances between clusters"]
    B --> C["Merge the closest two clusters"]
    C --> D{"Target number of clusters reached?"}
    D -- "No" --> B
    D -- "Yes" --> E["Cut the hierarchy and output labels"]
```

**Key parameters**

- `--clusters`: number of clusters to keep at the cut level.
- `--linkage`: `single`, `average`, or `complete`.

**Strengths**

- Produces a multi-level view of similarity.
- Can capture structure missed by centroid-only methods.
- Does not need initial seeds.

**Limitations**

- The readable demo implementation is intentionally naive and expensive.
- Earlier merges cannot be undone later.
- Sensitive to the linkage rule.

**Use when**

- The dataset is modest in size.
- You care about merge behavior and cluster granularity.
- You want to compare different linkage strategies.

**Avoid when**

- You need highly scalable clustering on large datasets.
- You need strong noise handling without pre-processing.

**Complexity**

- Demo time: `O(n^3)` because cluster distances are recomputed naively.
- Space: `O(n^2)` for the pairwise distance matrix.

**Run**

```bash
python3 agglomerative_hierarchical.py --samples 140 --clusters 4 --linkage average --output agglomerative_hierarchical.png
```

**Core source snippet**

```python
while len(clusters) > n_clusters:
    best_pair = None
    best_distance = float("inf")

    for left_id, right_id in combinations(sorted(clusters), 2):
        candidate_distance = linkage_distance(
            distances,
            clusters[left_id],
            clusters[right_id],
            linkage,
        )
        if candidate_distance < best_distance:
            best_distance = candidate_distance
            best_pair = (left_id, right_id)

    # Merge the closest two clusters and keep going upward.
    left_id, right_id = best_pair
    clusters[next_cluster_id] = clusters.pop(left_id) + clusters.pop(right_id)
    next_cluster_id += 1
```

Source: [agglomerative_hierarchical.py](agglomerative_hierarchical.py)

![Agglomerative hierarchical output](agglomerative_hierarchical.png)

## DBSCAN

**Core idea**

DBSCAN groups points by local density. Points in dense regions become core points, sparse regions can become noise, and cluster shapes do not need to be spherical.
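
A minimal sketch of the first two steps, in the same all-pairs style as the demo (conventions such as whether a point counts itself toward `min_samples` may differ from [dbscan.py](dbscan.py)):

```python
import numpy as np

def eps_neighborhoods(points, eps, min_samples):
    """All-pairs version: eps-neighborhood index lists plus a core-point mask."""
    diffs = points[:, None, :] - points[None, :, :]
    within_eps = np.sqrt((diffs ** 2).sum(axis=2)) <= eps
    neighborhoods = [np.flatnonzero(row) for row in within_eps]
    core_mask = within_eps.sum(axis=1) >= min_samples  # self-inclusive count
    return neighborhoods, core_mask
```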

**Algorithm steps**

1. For each point, find all neighbors inside radius `eps`.
2. Mark points with at least `min_samples` neighbors as core points.
3. Expand clusters outward from core points.
4. Attach border points to reachable clusters.
5. Leave isolated points as noise.

**Process map**

```mermaid
flowchart TD
    A["Build eps-neighborhood for each point"] --> B["Mark core points with at least min_samples neighbors"]
    B --> C["Expand density-reachable clusters from core points"]
    C --> D["Attach border points to reachable clusters"]
    D --> E["Mark the remaining samples as noise"]
```

**Key parameters**

- `--eps`: neighborhood radius.
- `--min-samples`: density threshold for core points.

**Strengths**

- Naturally identifies noise.
- Works well on irregular cluster shapes.
- Does not require the cluster count in advance.

**Limitations**

- A single global density threshold can fail on strongly varying densities.
- Parameter tuning can be data dependent.
- This readable demo uses a full pairwise distance matrix, so it favors clarity over scale.

**Use when**

- Outliers matter.
- Clusters are curved, chained, or otherwise non-convex.
- You do not know the cluster count beforehand.

**Avoid when**

- Cluster densities differ dramatically.
- You need a one-shot default that works equally well across many scales.

**Complexity**

- Demo time: `O(n^2)` with the all-pairs distance matrix.
- Demo space: `O(n^2)`.
- Indexed production implementations, such as the scikit-learn call sketched below, can be much faster on suitable data.
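
For example, scikit-learn's `DBSCAN` accepts the same two knobs and chooses an indexing structure automatically:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
points = rng.normal(size=(240, 2))

labels = DBSCAN(eps=0.45, min_samples=5).fit_predict(points)
print("noise points:", int((labels == -1).sum()))  # -1 marks noise
```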

**Run**

```bash
python3 dbscan.py --samples 240 --eps 0.45 --min-samples 5 --output dbscan.png
```

**Core source snippet**

```python
for point_index in range(len(points)):
    if labels[point_index] != UNVISITED_LABEL:
        continue
    if not core_mask[point_index]:
        labels[point_index] = NOISE_LABEL
        continue

    labels[point_index] = cluster_id
    queue = deque(neighborhoods[point_index])

    # Expand the cluster through density-reachable core points.
    while queue:
        neighbor_index = queue.popleft()
        if labels[neighbor_index] == NOISE_LABEL:
            labels[neighbor_index] = cluster_id
        if labels[neighbor_index] != UNVISITED_LABEL:
            continue
        labels[neighbor_index] = cluster_id
        if core_mask[neighbor_index]:
            queue.extend(neighborhoods[neighbor_index])
```

Source: [dbscan.py](dbscan.py)

![DBSCAN output](dbscan.png)

## OPTICS

**Core idea**

OPTICS can be read as "DBSCAN without forcing one global density scale first." It orders points by density reachability and then extracts clusters from the resulting reachability landscape.
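
Since the fitted model exposes `ordering_` and `reachability_`, the reachability landscape itself can be plotted in a few lines. This standalone sketch mirrors the idea behind [optics.py](optics.py) without claiming to reproduce its exact figure:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import OPTICS

rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.2, size=(80, 2)),  # dense blob
    rng.normal(loc=(3.0, 3.0), scale=0.6, size=(80, 2)),  # sparser blob
])

model = OPTICS(min_samples=6).fit(points)
reachability = model.reachability_[model.ordering_]
reachability[np.isinf(reachability)] = np.nan  # skip undefined first entries

plt.plot(reachability)  # valleys correspond to dense clusters
plt.xlabel("ordering index")
plt.ylabel("reachability distance")
plt.savefig("reachability_demo.png")
```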

**Algorithm steps**

1. Visit points in an order guided by density reachability.
2. Record each point's reachability distance.
3. Build a reachability plot that shows where dense valleys appear.
4. Extract clusters from that ordering using a rule such as `xi`.

**Process map**

```mermaid
flowchart TD
    A["Order points by density reachability"] --> B["Record reachability distance for each visit"]
    B --> C["Build the reachability plot"]
    C --> D["Extract valleys as clusters with xi and minimum size"]
    D --> E["Return labels plus diagnostic ordering"]
```

**Key parameters**

- `--min-samples`: neighborhood size for density estimation.
- `--xi`: steepness threshold used to split valleys in the reachability plot.
- `--min-cluster-size`: minimum size for extracted clusters.

**Strengths**

- More flexible than DBSCAN when local densities differ.
- The reachability plot provides additional diagnostic value.
- Still handles noise and arbitrary cluster geometry well.

**Limitations**

- Harder to explain than DBSCAN to beginners.
- Cluster extraction has more moving parts.
- Results can still shift noticeably with parameter changes.

**Use when**

- DBSCAN either over-merges or over-splits because density is uneven.
- You want a density-based method with a richer diagnostic view.

**Avoid when**

- You need the simplest possible explanation for stakeholders.
- A centroid model already fits the data well.

**Complexity**

- Often near `O(n log n)` with appropriate indexing.
- Worst-case complexity can rise to `O(n^2)`.
- This repository uses scikit-learn's implementation for a practical, runnable demo.

**Run**

```bash
python3 optics.py --samples 240 --min-samples 6 --xi 0.08 --min-cluster-size 24 --output optics.png
```

**Core source snippet**

```python
model = OPTICS(
    min_samples=args.min_samples,
    xi=args.xi,
    min_cluster_size=args.min_cluster_size,
    cluster_method="xi",
)
labels = model.fit_predict(points)

# OPTICS exposes an ordering plus reachability values, so we can
# inspect the density landscape instead of only the final labels.
ordering = model.ordering_
reachability = model.reachability_[ordering].copy()
```

Source: [optics.py](optics.py)

![OPTICS output](optics.png)

## Spectral Clustering

**Core idea**

Spectral clustering converts the dataset into a graph, studies the graph's Laplacian structure, and then partitions the data in a lower-dimensional embedding derived from eigenvectors. That makes it especially helpful when geometric centers are not the right mental model.
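
A compact sketch of that pipeline on a dense matrix follows. scikit-learn's `SpectralClustering` uses a normalized Laplacian and sparse eigensolvers, so treat this only as a shape-of-the-idea illustration:

```python
import numpy as np

def spectral_embedding(points, n_neighbors, n_components):
    """Unnormalized-Laplacian embedding: kNN graph -> Laplacian -> eigenvectors."""
    diffs = points[:, None, :] - points[None, :, :]
    distances = np.sqrt((diffs ** 2).sum(axis=2))

    # Symmetric kNN adjacency (column 0 of argsort is the point itself).
    nearest = np.argsort(distances, axis=1)[:, 1 : n_neighbors + 1]
    adjacency = np.zeros_like(distances)
    adjacency[np.repeat(np.arange(len(points)), n_neighbors), nearest.ravel()] = 1.0
    adjacency = np.maximum(adjacency, adjacency.T)

    laplacian = np.diag(adjacency.sum(axis=1)) - adjacency
    _, eigenvectors = np.linalg.eigh(laplacian)
    # Drop the trivial constant eigenvector; keep the next n_components.
    return eigenvectors[:, 1 : n_components + 1]

# The resulting embedding can then be fed to any k-means implementation.
```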

**Algorithm steps**

1. Build a similarity graph from local neighborhoods.
2. Compute spectral information from the graph Laplacian.
3. Embed the points using the leading eigenvectors.
4. Partition that embedding into the requested number of clusters.

**Process map**

```mermaid
flowchart TD
    A["Build a nearest-neighbor similarity graph"] --> B["Compute the graph Laplacian spectrum"]
    B --> C["Embed samples with leading eigenvectors"]
    C --> D["Cluster the embedding"]
    D --> E["Map the partition back to the original samples"]
```

**Key parameters**

- `--clusters`: number of graph partitions to extract.
- `--neighbors`: number of nearest neighbors used to build the affinity graph.

**Strengths**

- Very effective on non-convex shapes such as rings and moons.
- Uses graph structure instead of assuming center-based geometry.
- Often gives elegant results on manifold-like data.

**Limitations**

- Requires the cluster count in advance.
- Memory hungry because the affinity structure grows quickly with `n`.
- Does not explicitly model noise.

**Use when**

- Neighborhood relationships are more informative than global Euclidean distance.
- You want to separate intertwined or ring-like structures.

**Avoid when**

- The dataset is very large and memory is tight.
- You need explicit outlier handling.

**Complexity**

- Time is often dominated by eigendecomposition and can approach `O(n^3)`.
- Space is typically `O(n^2)` because of the graph representation.

**Run**

```bash
python3 spectral_clustering.py --samples 320 --clusters 2 --neighbors 12 --output spectral_clustering.png
```

**Core source snippet**

```python
model = SpectralClustering(
    n_clusters=args.clusters,
    affinity="nearest_neighbors",
    n_neighbors=args.neighbors,
    assign_labels="kmeans",
    random_state=args.seed,
)

# The graph Laplacian embedding happens inside fit_predict.
labels = model.fit_predict(points)
```

Source: [spectral_clustering.py](spectral_clustering.py)

![Spectral clustering output](spectral_clustering.png)

## Compatibility, Testing, and Design Notes

- All scripts are designed for Python 3.9 and newer. They avoid Python 3.10+ only syntax so the same files run on Python 3.9 through current Python 3 releases.
- The first four algorithms prioritize readability over raw performance. That is why the repository keeps explicit NumPy implementations instead of hiding everything behind a library call.
- The DBSCAN demo intentionally uses a full pairwise distance matrix because it is easier to read. Production systems often replace that with spatial indexing.
- The OPTICS and Spectral demos use scikit-learn because those algorithms are better shown as practical runnable examples than reimplemented in a long educational derivation.
- The CI workflow tests Python `3.9` and the latest available `3.x` runtime through GitHub Actions.

If you want to compare all demos side by side, run them once with their defaults and inspect the generated figures in the repository root.


================================================
FILE: README.zh-CN.md
================================================
# Clustering Playbook: A Trilingual Clustering Algorithm Tutorial

```bash
echo "                                                                                ";
echo " ██████╗██╗     ██╗   ██╗███████╗████████╗███████╗██████╗ ██╗███╗   ██╗ ██████╗ ";
echo "██╔════╝██║     ██║   ██║██╔════╝╚══██╔══╝██╔════╝██╔══██╗██║████╗  ██║██╔════╝ ";
echo "██║     ██║     ██║   ██║███████╗   ██║   █████╗  ██████╔╝██║██╔██╗ ██║██║  ███╗";
echo "██║     ██║     ██║   ██║╚════██║   ██║   ██╔══╝  ██╔══██╗██║██║╚██╗██║██║   ██║";
echo "╚██████╗███████╗╚██████╔╝███████║   ██║   ███████╗██║  ██║██║██║ ╚████║╚██████╔╝";
echo " ╚═════╝╚══════╝ ╚═════╝ ╚══════╝   ╚═╝   ╚══════╝╚═╝  ╚═╝╚═╝╚═╝  ╚═══╝ ╚═════╝ ";
echo "                                                                                ";
echo "      ██████╗ ██╗      █████╗ ██╗   ██╗██████╗  ██████╗  ██████╗ ██╗  ██╗       ";
echo "      ██╔══██╗██║     ██╔══██╗╚██╗ ██╔╝██╔══██╗██╔═══██╗██╔═══██╗██║ ██╔╝       ";
echo "      ██████╔╝██║     ███████║ ╚████╔╝ ██████╔╝██║   ██║██║   ██║█████╔╝        ";
echo "      ██╔═══╝ ██║     ██╔══██║  ╚██╔╝  ██╔══██╗██║   ██║██║   ██║██╔═██╗        ";
echo "      ██║     ███████╗██║  ██║   ██║   ██████╔╝╚██████╔╝╚██████╔╝██║  ██╗       ";
echo "      ╚═╝     ╚══════╝╚═╝  ╚═╝   ╚═╝   ╚═════╝  ╚═════╝  ╚═════╝ ╚═╝  ╚═╝       ";
echo "                                                                                ";
```

Language: [English](README.md) | [简体中文](README.zh-CN.md) | [日本語](README.ja.md)

This is a fully refurbished repository for learning clustering algorithms. The project keeps the hand-written teaching implementations of the first four classic algorithms and adds two more modern clustering methods that are common in practice, while restructuring the old blog-style content into a trilingual tutorial repository that is easy to read, run, and maintain on GitHub.

## What This Repository Provides

- Python 3.9+ scripts that run directly from the repository root.
- Four hand-written teaching implementations: K-Means++, Fuzzy C-Means, agglomerative hierarchical clustering, and DBSCAN.
- Two supplementary algorithm demos: OPTICS and spectral clustering.
- Every script supports `--seed`, `--samples`, `--output`, and `--show`.
- Shared utility module: [clustering_utils.py](clustering_utils.py).
- Smoke tests: [tests/test_repository.py](tests/test_repository.py).
- GitHub Actions CI: [.github/workflows/ci.yml](.github/workflows/ci.yml).

## Clustering at a Glance

Clustering is essentially about organizing similar samples into groups without any labels. Different algorithms, however, have completely different notions of "the same cluster": some look at distance to a center, some at whether groups can be merged step by step, some at local density, and some at connectivity in a graph. Understanding this difference matters more than memorizing algorithm names.

```mermaid
flowchart TD
    A["Raw unlabeled samples"] --> B{"Which points should be grouped together?"}
    B --> C["Closer to some center<br/>K-Means++, Fuzzy C-Means"]
    B --> D["Keep merging the closest clusters<br/>Agglomerative Hierarchical"]
    B --> E["Density-reachable locally<br/>DBSCAN, OPTICS"]
    B --> F["More similar in the neighbor graph<br/>Spectral Clustering"]
```

## Quick Start

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

Run any demo directly:

```bash
python3 k_means_plus_plus.py
python3 fuzzy_c_means.py
python3 agglomerative_hierarchical.py
python3 dbscan.py
python3 optics.py
python3 spectral_clustering.py
```

Common parameters supported by every script:

- `--seed`: random seed, for reproducible experiments.
- `--samples`: number of samples.
- `--output`: output image path.
- `--show`: display the image after saving.

Run the tests:

```bash
pytest
```

## Repository Layout

| File | Description |
| --- | --- |
| [clustering_utils.py](clustering_utils.py) | Shared logic: data generation, plotting, argument validation, k-means++ initialization. |
| [k_means_plus_plus.py](k_means_plus_plus.py) | Hand-written K-Means++ demo. |
| [fuzzy_c_means.py](fuzzy_c_means.py) | Hand-written Fuzzy C-Means demo. |
| [agglomerative_hierarchical.py](agglomerative_hierarchical.py) | Hand-written agglomerative hierarchical clustering demo. |
| [dbscan.py](dbscan.py) | Hand-written DBSCAN demo. |
| [optics.py](optics.py) | scikit-learn based OPTICS demo, including a reachability plot. |
| [spectral_clustering.py](spectral_clustering.py) | scikit-learn based spectral clustering demo. |
| [requirements.txt](requirements.txt) | Minimal dependencies for running the scripts and tests. |

## Algorithm Comparison

| Algorithm | Family | Needs `k` in advance? | Detects noise? | Suits non-convex shapes? | Typical complexity | Best-fit scenario |
| --- | --- | --- | --- | --- | --- | --- |
| K-Means++ | Prototype / partitioning | Yes | No | Generally no | Time `O(nkt)`, space `O(n + k)` | Compact, roughly spherical clusters; efficient baseline |
| Fuzzy C-Means | Soft partitioning | Yes | No | Generally no | Time `O(nkt)`, space `O(nk)` | Keeping each sample's memberships in several clusters |
| Agglomerative Hierarchical | Hierarchical | Not while merging, but a final cut level must be chosen | Limited | Sometimes | Teaching version: time `O(n^3)`, space `O(n^2)` | Smaller datasets with a focus on hierarchy |
| DBSCAN | Density-based | No | Yes | Yes | Teaching version: time `O(n^2)`, space `O(n^2)` | Irregular clusters, outliers, unknown cluster count |
| OPTICS | Density-based ordering | No | Yes | Yes | Commonly near `O(n log n)`, worst `O(n^2)` | Marked local density differences; DBSCAN hard to tune |
| Spectral Clustering | Graph partitioning / spectral | Yes | No explicit noise model | Yes | Time often near `O(n^3)`, space `O(n^2)` | Non-convex manifolds; graph similarity matters more |

## Quick Selection Advice

- Cluster count known and clusters roughly blob-shaped: try K-Means++ first.
- Sample boundaries are fuzzy and you want to keep "partially belongs to several clusters" information: use Fuzzy C-Means.
- You want to watch the merging process from fine to coarse granularity: use agglomerative hierarchical clustering.
- You need to handle noise and arbitrarily shaped clusters: prefer DBSCAN.
- DBSCAN is overly sensitive to one parameter set and density differs a lot across regions: try OPTICS.
- The data looks more like an "adjacency graph problem" than "finding a few centers": consider spectral clustering.

## How to Read the Demos

- The scripts in this repository are deliberately short; the point is to show each algorithm's most essential step explicitly rather than hiding the logic inside library calls.
- The figure in each section shows the final clustering result, while the "Core source snippet" shows the key move that truly sets the algorithm apart from the others.
- When comparing them, the three things worth checking together are: whether noise is recognized, whether non-convex shapes are preserved, and whether the cluster count must be known in advance.

## Decision Table

| If your problem looks more like this | Try first | Why it is a good first choice |
| --- | --- | --- |
| Clear blob-shaped groups with a known cluster count | K-Means++ | Fast, stable, easy to explain |
| Heavy overlap where hard labels lose information | Fuzzy C-Means | Keeps soft memberships instead of forcing one class |
| You want a fine-to-coarse hierarchy | Agglomerative Hierarchical | The merge path is itself part of the result |
| Irregular shapes and important outliers | DBSCAN | Density expansion naturally separates noise |
| Very different densities across regions | OPTICS | More flexible than single-threshold DBSCAN |
| Data like rings, moons, or neighbor-graph structure | Spectral Clustering | Relies on graph structure instead of mean centers |

## Parameter Cheat Sheet

| Algorithm | Key parameters | Increasing usually means | Decreasing usually means |
| --- | --- | --- | --- |
| K-Means++ | `--clusters`, `--tolerance` | More clusters cut finer; a larger tolerance stops earlier | Fewer clusters merge more easily; a smaller tolerance iterates longer |
| Fuzzy C-Means | `--clusters`, `--fuzziness` | A larger fuzziness exponent makes memberships "softer" | A smaller exponent approaches a hard partition |
| Agglomerative | `--clusters`, `--linkage` | More target clusters keep finer structure | Fewer target clusters force more merging |
| DBSCAN | `--eps`, `--min-samples` | Larger `eps` or smaller `min-samples` merges points more readily | Smaller `eps` or larger `min-samples` fragments clusters and yields more noise |
| OPTICS | `--xi`, `--min-cluster-size` | A larger minimum cluster size is more conservative, favoring big clusters | A smaller minimum allows finer structure but splits more easily |
| Spectral | `--clusters`, `--neighbors` | More neighbors smooth the graph | Fewer neighbors stress local structure but can disconnect the graph |

## K-Means++

**Core idea**

K-Means++ still optimizes the k-means objective, but it chooses the initial centers more cleverly. With the initial centers spread farther apart, it usually suffers fewer bad local optima and converges faster.

**Steps**

1. Pick the first center at random.
2. Pick subsequent centers so that points farther from the nearest existing center are more likely to be chosen.
3. Assign each point to the nearest center.
4. Update each center with the mean of the samples in its cluster.
5. Repeat "assign + update" until the centers barely move.

**Process map**

```mermaid
flowchart TD
    A["Initialize centers k-means++ style"] --> B["Assign each point to the nearest center"]
    B --> C["Update centers with cluster means"]
    C --> D{"Did centers move less than the threshold?"}
    D -- "No" --> B
    D -- "Yes" --> E["Output labels, centers, and the trace"]
```

**Key parameters**

- `--clusters`: number of clusters.
- `--max-iter`: maximum number of iterations.
- `--tolerance`: stop when center movement falls below this threshold.

**Strengths**

- Fast, with intuitive results.
- Works well for blob-like, roughly spherical clusters.
- More stable than plain random initialization.

**Limitations**

- The cluster count must be known in advance.
- Fairly sensitive to outliers.
- Usually helpless against curved, ring-shaped, or nested clusters.

**Use when**

- You need a strong, simple baseline.
- The data is fairly regular and the clusters are well separated.

**Avoid when**

- There is a lot of noise.
- Cluster shapes are clearly non-convex or non-spherical.
- You want to keep soft memberships.

**Complexity**

- Time: `O(nkt)`.
- Space: `O(n + k)`.

**Run**

```bash
python3 k_means_plus_plus.py --samples 320 --clusters 4 --output k_means_plus_plus.png
```

**Core source snippet**

```python
def run_kmeans(points, n_clusters, *, seed, max_iter, tolerance):
    rng = np.random.default_rng(seed)
    centers = initialize_kmeans_plus_plus(points, n_clusters, rng)
    center_trace = [centers.copy()]

    for iteration in range(1, max_iter + 1):
        labels, inertia = assign_points(points, centers)
        updated_centers = update_centers(points, labels, centers, rng)
        center_trace.append(updated_centers.copy())

        # Stop once every center move becomes tiny.
        center_shift = np.linalg.norm(updated_centers - centers, axis=1).max()
        centers = updated_centers
        if center_shift <= tolerance:
            break

    final_labels, inertia = assign_points(points, centers)
    return KMeansResult(final_labels, centers, center_trace, inertia, iteration)
```

Source: [k_means_plus_plus.py](k_means_plus_plus.py)

![K-Means++ output](k_means_plus_plus.png)

## Fuzzy C-Means

**Core idea**

Fuzzy C-Means builds on center-based clustering with the idea that every sample has a membership in every cluster. Rather than forcing a point into exactly one cluster, it preserves uncertainty and overlap.

**Steps**

1. Initialize centers k-means++ style.
2. Compute each sample's membership in each center.
3. Update centers with membership-weighted averages.
4. Repeat until the centers converge.
5. For plotting, use the cluster with the highest membership as the display label.

**Process map**

```mermaid
flowchart TD
    A["Initialize centers"] --> B["Compute each sample's membership in each center"]
    B --> C["Update centers with membership-weighted averages"]
    C --> D{"Have the centers converged?"}
    D -- "No" --> B
    D -- "Yes" --> E["Use the highest membership as the display label"]
```

**Key parameters**

- `--clusters`: number of centers.
- `--fuzziness`: fuzziness exponent; the closer to `1`, the closer to hard clustering.
- `--max-iter`: maximum number of iterations.
- `--tolerance`: center convergence threshold.

**Strengths**

- Describes the "fuzzy belonging" of boundary samples.
- Expresses cluster overlap better than a hard partition.

**Limitations**

- Also needs a predetermined cluster count.
- No native noise modeling.
- Maintains the full membership matrix, so it costs more than K-Means.

**Use when**

- Samples may be close to several clusters at once.
- You want to keep the "degree of belonging" as part of the result.

**Avoid when**

- The primary goal is to remove noise.
- The data is large and memory is a concern.

**Complexity**

- Time: `O(nkt)`.
- Space: `O(nk)`.

**Run**

```bash
python3 fuzzy_c_means.py --samples 320 --clusters 4 --fuzziness 2.0 --output fuzzy_c_means.png
```

**Core source snippet**

```python
def update_memberships(points, centers, fuzziness):
    distances = euclidean_distance_matrix(points, centers)
    memberships = np.zeros_like(distances)
    exponent = 2.0 / (fuzziness - 1.0)

    for sample_index, distance_row in enumerate(distances):
        zero_mask = distance_row == 0.0
        if np.any(zero_mask):
            # If a sample lands exactly on a center, give that center full weight.
            memberships[sample_index, zero_mask] = 1.0 / float(zero_mask.sum())
            continue

        ratios = (distance_row[:, None] / distance_row[None, :]) ** exponent
        memberships[sample_index] = 1.0 / ratios.sum(axis=1)
    return memberships
```

Source: [fuzzy_c_means.py](fuzzy_c_means.py)

![Fuzzy C-Means output](fuzzy_c_means.png)

## Agglomerative Hierarchical Clustering

**Core idea**

Agglomerative hierarchical clustering starts with every point as its own cluster and gradually merges the two most similar clusters. Instead of fixing one result from the start, it builds a fine-to-coarse hierarchy.
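
Since `run_agglomerative` in [agglomerative_hierarchical.py](agglomerative_hierarchical.py) is importable, a quick end-to-end check on two well-separated blobs can look like the sketch below (assuming the module imports without side effects; the exact label values are an assumption):

```python
import numpy as np
from agglomerative_hierarchical import run_agglomerative

rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.3, size=(20, 2)),
    rng.normal(loc=(4.0, 4.0), scale=0.3, size=(20, 2)),
])

result = run_agglomerative(points, 2, linkage="average")
print(sorted(np.bincount(result.labels)))  # expected: [20, 20]
```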

**Steps**

1. Treat each sample as an independent cluster.
2. Compute inter-cluster distances with a linkage rule.
3. Merge the two closest clusters.
4. Repeat until the target number of clusters remains.

**Process map**

```mermaid
flowchart TD
    A["Start with each point as its own cluster"] --> B["Compute inter-cluster distances"]
    B --> C["Merge the two closest clusters"]
    C --> D{"Target cluster count reached?"}
    D -- "No" --> B
    D -- "Yes" --> E["Output the result at the current level"]
```

**Key parameters**

- `--clusters`: number of clusters to keep at the end.
- `--linkage`: inter-cluster distance rule; one of `single`, `average`, `complete`.

**Strengths**

- Lets you observe multi-level structure.
- Needs no initial centers.
- Easy to compare different linkage rules.

**Limitations**

- This teaching implementation emphasizes readability and has high complexity.
- Early bad merges cannot be undone later.
- Fairly sensitive to the linkage choice.

**Use when**

- The dataset is not large.
- You want to understand how clusters merge step by step.

**Avoid when**

- You need large-scale, high-performance clustering.
- You need strong noise robustness without preprocessing.

**Complexity**

- Time: about `O(n^3)` in the teaching version.
- Space: `O(n^2)`.

**Run**

```bash
python3 agglomerative_hierarchical.py --samples 140 --clusters 4 --linkage average --output agglomerative_hierarchical.png
```

**Core source snippet**

```python
while len(clusters) > n_clusters:
    best_pair = None
    best_distance = float("inf")

    for left_id, right_id in combinations(sorted(clusters), 2):
        candidate_distance = linkage_distance(
            distances,
            clusters[left_id],
            clusters[right_id],
            linkage,
        )
        if candidate_distance < best_distance:
            best_distance = candidate_distance
            best_pair = (left_id, right_id)

    # Merge the closest two clusters and keep going upward.
    left_id, right_id = best_pair
    clusters[next_cluster_id] = clusters.pop(left_id) + clusters.pop(right_id)
    next_cluster_id += 1
```

Source: [agglomerative_hierarchical.py](agglomerative_hierarchical.py)

![Agglomerative hierarchical clustering output](agglomerative_hierarchical.png)

## DBSCAN

**Core idea**

DBSCAN no longer relies on "finding a few centers"; it defines clusters through local density. Sufficiently dense regions form clusters, while sparse, isolated points are treated as noise, which makes it well suited to irregular shapes and outliers occurring together.

**Steps**

1. For every point, find all neighbors within its `eps` neighborhood.
2. Points whose neighbor count reaches `min_samples` are marked as core points.
3. Expand clusters outward from core points.
4. Assign border points to a reachable cluster.
5. Keep points that fit in no cluster as noise.

**Process map**

```mermaid
flowchart TD
    A["Build the eps-neighborhood of each point"] --> B["Mark core points using min_samples"]
    B --> C["Expand density-reachable clusters from core points"]
    C --> D["Attach border points to their clusters"]
    D --> E["Mark the remaining points as noise"]
```

**Key parameters**

- `--eps`: neighborhood radius.
- `--min-samples`: minimum number of neighborhood points needed to be a core point.

**Strengths**

- Recognizes noise naturally.
- Friendlier to arbitrarily shaped clusters.
- No need to predetermine the cluster count.

**Limitations**

- A single global threshold tends to fail when densities differ greatly across regions.
- Parameter tuning depends on the specific data.
- This teaching version uses the full distance matrix, leaning toward readability rather than large-scale performance.

**Use when**

- Cluster shapes are irregular.
- The data has clear outliers.
- The cluster count is unknown beforehand.

**Avoid when**

- Cluster densities differ drastically.
- You expect one fixed parameter set to fit every scale.

**Complexity**

- Teaching-version time: `O(n^2)`.
- Teaching-version space: `O(n^2)`.

**Run**

```bash
python3 dbscan.py --samples 240 --eps 0.45 --min-samples 5 --output dbscan.png
```

**Core source snippet**

```python
for point_index in range(len(points)):
    if labels[point_index] != UNVISITED_LABEL:
        continue
    if not core_mask[point_index]:
        labels[point_index] = NOISE_LABEL
        continue

    labels[point_index] = cluster_id
    queue = deque(neighborhoods[point_index])

    # Expand the cluster through density-reachable core points.
    while queue:
        neighbor_index = queue.popleft()
        if labels[neighbor_index] == NOISE_LABEL:
            labels[neighbor_index] = cluster_id
        if labels[neighbor_index] != UNVISITED_LABEL:
            continue
        labels[neighbor_index] = cluster_id
        if core_mask[neighbor_index]:
            queue.extend(neighborhoods[neighbor_index])
```

Source: [dbscan.py](dbscan.py)

![DBSCAN output](dbscan.png)

## OPTICS

**Core idea**

OPTICS can be understood as "first order the samples by density reachability, then extract clusters from that ordering." It is a density-based method like DBSCAN, but it is not locked to one global density threshold from the start, so it copes better with regions of differing density.

**Steps**

1. Visit samples according to density reachability.
2. Record each point's reachability distance.
3. Build the reachability plot.
4. Extract clusters from the plot with rules such as `xi`.

**Process map**

```mermaid
flowchart TD
    A["Visit samples by density reachability"] --> B["Record the reachability distance of each visit"]
    B --> C["Form the reachability plot"]
    C --> D["Extract valley structures with xi and a minimum cluster size"]
    D --> E["Output labels and diagnostics"]
```

**Key parameters**

- `--min-samples`: neighborhood size for local density estimation.
- `--xi`: steepness threshold for cutting clusters out of the reachability plot.
- `--min-cluster-size`: minimum cluster size allowed during extraction.

**Strengths**

- Handles variable-density data better than DBSCAN.
- The reachability plot has real explanatory value.
- Also suits arbitrarily shaped clusters and keeps noise.

**Limitations**

- Harder for beginners to understand than DBSCAN.
- More parameters, so explanation and tuning cost more.

**Use when**

- DBSCAN keeps over-merging or over-splitting.
- You want to inspect the density structure itself, not only the final labels.

**Avoid when**

- You need the easiest density-based method to explain.
- A center-based method already describes the data well.

**Complexity**

- Commonly close to `O(n log n)`.
- Up to `O(n^2)` in the worst case.
- This repository uses the scikit-learn implementation, focusing on a runnable demo and the reachability plot.

**Run**

```bash
python3 optics.py --samples 240 --min-samples 6 --xi 0.08 --min-cluster-size 24 --output optics.png
```

**Core source snippet**

```python
model = OPTICS(
    min_samples=args.min_samples,
    xi=args.xi,
    min_cluster_size=args.min_cluster_size,
    cluster_method="xi",
)
labels = model.fit_predict(points)

# OPTICS exposes an ordering plus reachability values, so we can
# inspect the density landscape instead of only the final labels.
ordering = model.ordering_
reachability = model.reachability_[ordering].copy()
```

Source: [optics.py](optics.py)

![OPTICS output](optics.png)

## 谱聚类

**核心思想**

谱聚类先把样本转换成图,再利用图拉普拉斯矩阵的谱信息,把样本嵌入到一个更适合切分的低维空间里。它擅长处理“靠中心不好描述、但靠邻域关系很好描述”的数据结构。

**算法步骤**

1. 根据近邻关系构造相似图。
2. 计算图拉普拉斯矩阵的谱信息。
3. 用特征向量把样本映射到新的表示空间。
4. 在这个表示空间中进行划分。

**过程图**

```mermaid
flowchart TD
    A["构造近邻相似图"] --> B["计算图拉普拉斯谱"]
    B --> C["用主要特征向量嵌入样本"]
    C --> D["在嵌入空间里完成划分"]
    D --> E["映射回原始样本得到标签"]
```

**关键参数**

- `--clusters`:要切出的簇数。
- `--neighbors`:构图时使用的近邻数。

**优点**

- 对环形、月牙形等非凸结构很有效。
- 不依赖“均值中心”这种假设。
- 对图结构问题很自然。

**局限**

- 仍然需要预先给定簇数。
- 内存开销较大。
- 没有显式噪声模型。

**Good fit**

- Pronounced non-convex structure.
- Adjacency matters more than global Euclidean distance.

**Poor fit**

- Very large sample counts under tight memory.
- Outliers must be identified explicitly.

**Complexity**

- Time is usually dominated by the eigendecomposition and can approach `O(n^3)`.
- Space is typically `O(n^2)`.

**Run command**

```bash
python3 spectral_clustering.py --samples 320 --clusters 2 --neighbors 12 --output spectral_clustering.png
```

**Core source snippet**

```python
model = SpectralClustering(
    n_clusters=args.clusters,
    affinity="nearest_neighbors",
    n_neighbors=args.neighbors,
    assign_labels="kmeans",
    random_state=args.seed,
)

# The graph Laplacian embedding happens inside fit_predict.
labels = model.fit_predict(points)
```

Source: [spectral_clustering.py](spectral_clustering.py)

![Spectral clustering example figure](spectral_clustering.png)

## Compatibility, Testing, and Implementation Notes

- All scripts target Python 3.9+ and avoid syntax that only exists in Python 3.10 or later, so the same code covers Python 3.9 through recent Python 3 releases.
- The first four algorithms keep hand-written implementations so the principles and key steps stay easy to read; top performance is not the goal.
- The teaching DBSCAN deliberately uses the full distance matrix in exchange for an intuitive implementation.
- OPTICS and spectral clustering use scikit-learn, because those two work better as runnable, practical examples than as yet another pile of low-level rewrites in the repository.
- CI runs `pytest` on Python `3.9` and on the latest `3.x` that GitHub Actions provides.

If you want to compare all the algorithms side by side, the simplest approach is to run all six scripts with their default parameters, for example with the loop below, and then compare the figures generated in the repository root.
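
A minimal shell loop for that comparison; the script names are the six demos in this repository, and each writes its default PNG next to the code.

```bash
for script in k_means_plus_plus fuzzy_c_means agglomerative_hierarchical dbscan optics spectral_clustering; do
    python3 "${script}.py"
done
```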


================================================
FILE: agglomerative_hierarchical.py
================================================
"""Teaching-oriented agglomerative hierarchical clustering demo."""

from __future__ import annotations

from dataclasses import dataclass
from itertools import combinations

import numpy as np

from clustering_utils import (
    build_common_parser,
    format_cluster_summary,
    make_convex_dataset,
    pairwise_distances,
    plot_clusters,
    positive_int,
)


@dataclass
class AgglomerativeResult:
    """Container for the final hierarchical clustering state."""

    labels: np.ndarray
    centers: np.ndarray
    merge_distances: list[float]


def linkage_distance(
    distances: np.ndarray,
    left_indices: list[int],
    right_indices: list[int],
    linkage: str,
) -> float:
    """Compute the inter-cluster distance for a linkage strategy."""
    block = distances[np.ix_(left_indices, right_indices)]
    if linkage == "single":
        return float(block.min())
    if linkage == "complete":
        return float(block.max())
    return float(block.mean())


def run_agglomerative(
    points: np.ndarray,
    n_clusters: int,
    *,
    linkage: str,
) -> AgglomerativeResult:
    """Cluster samples with a naive but readable agglomerative algorithm."""
    if n_clusters > len(points):
        raise ValueError("n_clusters must not exceed the number of samples")

    distances = pairwise_distances(points)
    clusters: dict[int, list[int]] = {index: [index] for index in range(len(points))}
    next_cluster_id = len(points)
    merge_distances: list[float] = []

    while len(clusters) > n_clusters:
        best_pair: tuple[int, int] | None = None
        best_distance = float("inf")
        cluster_ids = sorted(clusters)

        for left_id, right_id in combinations(cluster_ids, 2):
            candidate_distance = linkage_distance(
                distances,
                clusters[left_id],
                clusters[right_id],
                linkage,
            )
            if candidate_distance < best_distance:
                best_distance = candidate_distance
                best_pair = (left_id, right_id)

        if best_pair is None:
            break

        left_id, right_id = best_pair
        merged_members = clusters.pop(left_id) + clusters.pop(right_id)
        clusters[next_cluster_id] = merged_members
        next_cluster_id += 1
        merge_distances.append(best_distance)

    labels = np.full(len(points), -1, dtype=int)
    ordered_clusters = sorted(
        clusters.values(),
        key=lambda members: points[members, 0].mean(),
    )
    centers = []
    for label, members in enumerate(ordered_clusters):
        labels[members] = label
        centers.append(points[members].mean(axis=0))

    return AgglomerativeResult(
        labels=labels,
        centers=np.asarray(centers),
        merge_distances=merge_distances,
    )


def parse_args():
    """Parse command-line arguments."""
    parser = build_common_parser(
        "Run a teaching-oriented agglomerative hierarchical clustering demo.",
        default_samples=140,
        default_output="agglomerative_hierarchical.png",
    )
    parser.add_argument(
        "--clusters",
        type=positive_int,
        default=4,
        help="Number of clusters to keep after all merges.",
    )
    parser.add_argument(
        "--linkage",
        choices=("single", "average", "complete"),
        default="average",
        help="Linkage rule used to compare clusters.",
    )
    return parser.parse_args()


def main() -> None:
    """Generate data, run agglomerative clustering, and save the figure."""
    args = parse_args()
    points = make_convex_dataset(args.samples, args.seed, n_clusters=args.clusters)
    result = run_agglomerative(points, args.clusters, linkage=args.linkage)

    output_path = plot_clusters(
        points,
        result.labels,
        title=f"Agglomerative Hierarchical Clustering ({args.linkage.title()} Linkage)",
        output_path=args.output,
        show=args.show,
        centers=result.centers,
    )
    final_merge = result.merge_distances[-1] if result.merge_distances else 0.0
    print(
        f"Saved {output_path} | final_merge_distance={final_merge:.2f} | "
        f"{format_cluster_summary(result.labels)}"
    )


if __name__ == "__main__":
    main()


================================================
FILE: clustering_utils.py
================================================
"""Shared utilities for the clustering demo scripts."""

from __future__ import annotations

import argparse
import os
from pathlib import Path
from typing import Iterable, Sequence

import numpy as np
from sklearn.datasets import make_blobs, make_circles, make_moons

NOISE_LABEL = -1


def positive_int(value: str) -> int:
    """Parse a positive integer for argparse."""
    parsed = int(value)
    if parsed <= 0:
        raise argparse.ArgumentTypeError("value must be a positive integer")
    return parsed


def non_negative_int(value: str) -> int:
    """Parse a non-negative integer for argparse."""
    parsed = int(value)
    if parsed < 0:
        raise argparse.ArgumentTypeError("value must be a non-negative integer")
    return parsed


def positive_float(value: str) -> float:
    """Parse a positive float for argparse."""
    parsed = float(value)
    if parsed <= 0:
        raise argparse.ArgumentTypeError("value must be a positive number")
    return parsed


def fuzziness_value(value: str) -> float:
    """Parse the fuzzy c-means exponent, which must stay above 1."""
    parsed = float(value)
    if parsed <= 1.0:
        raise argparse.ArgumentTypeError("fuzziness must be greater than 1")
    return parsed


def build_common_parser(
    description: str,
    *,
    default_samples: int,
    default_output: str,
) -> argparse.ArgumentParser:
    """Create a parser with the shared CLI flags used across the demos."""
    parser = argparse.ArgumentParser(description=description)
    parser.add_argument("--seed", type=int, default=42, help="Random seed.")
    parser.add_argument(
        "--samples",
        type=positive_int,
        default=default_samples,
        help="Number of samples to generate.",
    )
    parser.add_argument(
        "--output",
        type=Path,
        default=Path(default_output),
        help="Target image path.",
    )
    parser.add_argument(
        "--show",
        action="store_true",
        help="Display the plot after saving it.",
    )
    return parser


def load_pyplot(show: bool):
    """Import matplotlib.pyplot with a safe backend for headless runs."""
    import matplotlib

    is_headless = os.name != "nt" and not os.environ.get("DISPLAY")
    if not show or is_headless:
        matplotlib.use("Agg")

    import matplotlib.pyplot as plt

    return plt


def save_figure(figure, output_path: Path, show: bool) -> Path:
    """Persist a matplotlib figure and optionally display it."""
    output_path = Path(output_path)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    figure.tight_layout()
    figure.savefig(output_path, dpi=200, bbox_inches="tight")

    plt = load_pyplot(show)
    if show:
        plt.show()
    plt.close(figure)
    return output_path


def default_blob_centers(n_clusters: int) -> np.ndarray:
    """Return readable fixed centers for the synthetic blob dataset."""
    presets = np.array(
        [
            (-5.0, -5.0),
            (-5.0, 4.5),
            (4.5, -4.5),
            (4.5, 4.5),
            (0.0, 0.0),
            (-8.0, 0.5),
            (8.0, 0.5),
            (0.0, 8.0),
        ],
        dtype=float,
    )
    if n_clusters <= len(presets):
        return presets[:n_clusters]

    angles = np.linspace(0.0, 2.0 * np.pi, n_clusters, endpoint=False)
    return np.column_stack((7.0 * np.cos(angles), 7.0 * np.sin(angles)))


def allocate_counts(total: int, weights: Sequence[float]) -> list[int]:
    """Split a total count across weighted buckets while preserving the sum."""
    normalized = np.asarray(weights, dtype=float)
    normalized = normalized / normalized.sum()
    counts = [max(1, int(total * weight)) for weight in normalized]

    difference = total - sum(counts)
    index = 0
    while difference != 0:
        bucket = index % len(counts)
        if difference > 0:
            counts[bucket] += 1
            difference -= 1
        elif counts[bucket] > 1:
            counts[bucket] -= 1
            difference += 1
        index += 1
    return counts


def make_convex_dataset(
    samples: int,
    seed: int,
    *,
    n_clusters: int = 4,
    cluster_std: float = 0.85,
) -> np.ndarray:
    """Generate a well-separated blob dataset for centroid-style clustering."""
    data, _ = make_blobs(
        n_samples=samples,
        centers=default_blob_centers(n_clusters),
        cluster_std=cluster_std,
        random_state=seed,
    )
    return data.astype(np.float64)


def make_density_dataset(samples: int, seed: int) -> np.ndarray:
    """Generate a dataset with irregular shapes, mixed density, and noise."""
    moon_samples, blob_samples, noise_samples = allocate_counts(samples, (0.64, 0.26, 0.10))

    moons, _ = make_moons(n_samples=moon_samples, noise=0.07, random_state=seed)
    moons = moons * np.array([2.4, 1.7]) + np.array([-4.0, 0.45])

    blob, _ = make_blobs(
        n_samples=blob_samples,
        centers=np.array([[4.9, -0.9]]),
        cluster_std=0.5,
        random_state=seed + 1,
    )

    rng = np.random.default_rng(seed + 2)
    noise = rng.uniform(low=(-8.0, -4.5), high=(8.0, 4.5), size=(noise_samples, 2))

    data = np.vstack((moons, blob, noise))
    rng.shuffle(data)
    return data.astype(np.float64)


def make_spectral_dataset(samples: int, seed: int) -> np.ndarray:
    """Generate a pair of nested non-convex clusters for spectral clustering."""
    data, _ = make_circles(
        n_samples=samples,
        factor=0.45,
        noise=0.05,
        random_state=seed,
    )
    transform = np.array([[1.0, 0.25], [-0.1, 1.1]])
    return (data @ transform) * 4.5


def squared_distance_matrix(points: np.ndarray, centers: np.ndarray) -> np.ndarray:
    """Return squared Euclidean distances between samples and centers."""
    deltas = points[:, None, :] - centers[None, :, :]
    return np.einsum("ijk,ijk->ij", deltas, deltas)


def euclidean_distance_matrix(points: np.ndarray, others: np.ndarray) -> np.ndarray:
    """Return Euclidean distances between all rows in two arrays."""
    return np.sqrt(np.maximum(squared_distance_matrix(points, others), 0.0))


def pairwise_distances(points: np.ndarray) -> np.ndarray:
    """Return an all-pairs Euclidean distance matrix for a point cloud."""
    return euclidean_distance_matrix(points, points)


def initialize_kmeans_plus_plus(
    points: np.ndarray,
    n_clusters: int,
    rng: np.random.Generator,
) -> np.ndarray:
    """Initialize centers with the k-means++ seeding strategy."""
    if n_clusters > len(points):
        raise ValueError("n_clusters must be smaller than or equal to the sample count")

    centers = np.empty((n_clusters, points.shape[1]), dtype=np.float64)
    first_index = int(rng.integers(len(points)))
    centers[0] = points[first_index]
    closest_distances = squared_distance_matrix(points, centers[:1]).ravel()

    for center_index in range(1, n_clusters):
        total_distance = closest_distances.sum()
        if total_distance == 0.0:
            chosen_index = int(rng.integers(len(points)))
        else:
            probabilities = closest_distances / total_distance
            chosen_index = int(rng.choice(len(points), p=probabilities))

        centers[center_index] = points[chosen_index]
        closest_distances = np.minimum(
            closest_distances,
            squared_distance_matrix(points, centers[center_index : center_index + 1]).ravel(),
        )
    return centers


def plot_clusters(
    points: np.ndarray,
    labels: np.ndarray,
    *,
    title: str,
    output_path: Path,
    show: bool,
    centers: np.ndarray | None = None,
    center_trace: Sequence[np.ndarray] | None = None,
    core_mask: np.ndarray | None = None,
) -> Path:
    """Create a scatter plot for clustered 2D points."""
    plt = load_pyplot(show)
    figure, axis = plt.subplots(figsize=(8.5, 6.5))

    unique_labels = sorted(int(label) for label in np.unique(labels))
    palette = plt.get_cmap("tab10", max(len(unique_labels), 1))

    cluster_order = 0
    for label in unique_labels:
        cluster_points = points[labels == label]
        if len(cluster_points) == 0:
            continue

        if label == NOISE_LABEL:
            axis.scatter(
                cluster_points[:, 0],
                cluster_points[:, 1],
                c="#94a3b8",
                marker="x",
                s=48,
                label="Noise",
            )
            continue

        axis.scatter(
            cluster_points[:, 0],
            cluster_points[:, 1],
            color=palette(cluster_order),
            s=42,
            alpha=0.86,
            label=f"Cluster {cluster_order + 1}",
        )
        cluster_order += 1

    if core_mask is not None and np.any(core_mask):
        core_points = points[core_mask]
        axis.scatter(
            core_points[:, 0],
            core_points[:, 1],
            facecolors="none",
            edgecolors="#0f172a",
            linewidths=0.8,
            s=80,
            label="Core points",
        )

    if center_trace:
        stacked_trace = np.asarray(center_trace)
        if stacked_trace.ndim == 3:
            for center_index in range(stacked_trace.shape[1]):
                path = stacked_trace[:, center_index, :]
                axis.plot(
                    path[:, 0],
                    path[:, 1],
                    linestyle="--",
                    linewidth=1.5,
                    color="#111827",
                    alpha=0.8,
                )

    if centers is not None and len(centers) > 0:
        axis.scatter(
            centers[:, 0],
            centers[:, 1],
            marker="*",
            c="#111827",
            s=250,
            edgecolors="white",
            linewidths=0.7,
            label="Centers",
        )

    axis.set_title(title, fontsize=14, fontweight="bold")
    axis.set_xlabel("Feature 1")
    axis.set_ylabel("Feature 2")
    axis.grid(alpha=0.22, linestyle=":")
    axis.legend(loc="best", frameon=True)

    return save_figure(figure, output_path, show)


def format_cluster_summary(labels: np.ndarray) -> str:
    """Return a human-readable summary of cluster counts."""
    unique, counts = np.unique(labels, return_counts=True)
    parts = []
    for label, count in zip(unique, counts):
        name = "noise" if int(label) == NOISE_LABEL else f"cluster {int(label) + 1}"
        parts.append(f"{name}: {int(count)}")
    return ", ".join(parts)


def count_clusters(labels: np.ndarray) -> int:
    """Count non-noise clusters in a label vector."""
    return int(sum(label != NOISE_LABEL for label in np.unique(labels)))


def ensure_python_39_compatible_features() -> None:
    """Keep a no-op sentinel that documents the compatibility target."""
    return None


================================================
FILE: dbscan.py
================================================
"""Teaching-oriented DBSCAN demo implemented with NumPy only."""

from __future__ import annotations

from collections import deque
from dataclasses import dataclass

import numpy as np

from clustering_utils import (
    NOISE_LABEL,
    build_common_parser,
    count_clusters,
    format_cluster_summary,
    make_density_dataset,
    pairwise_distances,
    plot_clusters,
    positive_float,
    positive_int,
)

UNVISITED_LABEL = -99


@dataclass
class DBSCANResult:
    """Container for DBSCAN labels and the detected core points."""

    labels: np.ndarray
    core_mask: np.ndarray


def run_dbscan(points: np.ndarray, *, eps: float, min_samples: int) -> DBSCANResult:
    """Cluster samples with a readable DBSCAN implementation."""
    distances = pairwise_distances(points)
    neighborhoods = [np.flatnonzero(distances[index] <= eps).tolist() for index in range(len(points))]
    core_mask = np.array([len(neighbors) >= min_samples for neighbors in neighborhoods], dtype=bool)
    labels = np.full(len(points), UNVISITED_LABEL, dtype=int)

    cluster_id = 0
    for point_index in range(len(points)):
        if labels[point_index] != UNVISITED_LABEL:
            continue

        if not core_mask[point_index]:
            labels[point_index] = NOISE_LABEL
            continue

        labels[point_index] = cluster_id
        queue = deque(neighborhoods[point_index])

        # Breadth-first expansion keeps the implementation easy to follow.
        while queue:
            neighbor_index = queue.popleft()

            if labels[neighbor_index] == NOISE_LABEL:
                labels[neighbor_index] = cluster_id

            if labels[neighbor_index] != UNVISITED_LABEL:
                continue

            labels[neighbor_index] = cluster_id
            if core_mask[neighbor_index]:
                queue.extend(
                    candidate
                    for candidate in neighborhoods[neighbor_index]
                    if labels[candidate] in (UNVISITED_LABEL, NOISE_LABEL)
                )

        cluster_id += 1

    labels[labels == UNVISITED_LABEL] = NOISE_LABEL
    return DBSCANResult(labels=labels, core_mask=core_mask)


def parse_args():
    """Parse command-line arguments."""
    parser = build_common_parser(
        "Run a teaching-oriented DBSCAN clustering demo.",
        default_samples=240,
        default_output="dbscan.png",
    )
    parser.add_argument(
        "--eps",
        type=positive_float,
        default=0.45,
        help="Neighborhood radius used to define density reachability.",
    )
    parser.add_argument(
        "--min-samples",
        type=positive_int,
        default=5,
        help="Minimum number of points in an eps-neighborhood to mark a core point.",
    )
    return parser.parse_args()


def main() -> None:
    """Generate data, run DBSCAN, and save the result figure."""
    args = parse_args()
    points = make_density_dataset(args.samples, args.seed)
    result = run_dbscan(points, eps=args.eps, min_samples=args.min_samples)

    output_path = plot_clusters(
        points,
        result.labels,
        title="DBSCAN on Mixed-Density Shapes and Noise",
        output_path=args.output,
        show=args.show,
        core_mask=result.core_mask,
    )
    print(
        f"Saved {output_path} | clusters={count_clusters(result.labels)} | "
        f"{format_cluster_summary(result.labels)}"
    )


if __name__ == "__main__":
    main()


================================================
FILE: fuzzy_c_means.py
================================================
"""Readable fuzzy c-means clustering demo implemented with NumPy only."""

from __future__ import annotations

from dataclasses import dataclass

import numpy as np

from clustering_utils import (
    build_common_parser,
    euclidean_distance_matrix,
    format_cluster_summary,
    fuzziness_value,
    initialize_kmeans_plus_plus,
    make_convex_dataset,
    plot_clusters,
    positive_float,
    positive_int,
)


@dataclass
class FuzzyCMeansResult:
    """Container for the final fuzzy clustering state."""

    labels: np.ndarray
    centers: np.ndarray
    memberships: np.ndarray
    center_trace: list[np.ndarray]
    iterations: int


def update_memberships(
    points: np.ndarray,
    centers: np.ndarray,
    fuzziness: float,
) -> np.ndarray:
    """Compute the soft assignment matrix for all samples."""
    distances = euclidean_distance_matrix(points, centers)
    memberships = np.zeros_like(distances)
    exponent = 2.0 / (fuzziness - 1.0)

    for sample_index, distance_row in enumerate(distances):
        zero_mask = distance_row == 0.0
        if np.any(zero_mask):
            memberships[sample_index, zero_mask] = 1.0 / float(zero_mask.sum())
            continue

        ratios = (distance_row[:, None] / distance_row[None, :]) ** exponent
        memberships[sample_index] = 1.0 / ratios.sum(axis=1)
    return memberships


def update_centers(
    points: np.ndarray,
    memberships: np.ndarray,
    fuzziness: float,
) -> np.ndarray:
    """Update cluster centers with membership-weighted averages."""
    weights = memberships**fuzziness
    numerator = weights.T @ points
    denominator = weights.sum(axis=0)[:, None]
    return numerator / denominator


def run_fuzzy_c_means(
    points: np.ndarray,
    n_clusters: int,
    *,
    seed: int,
    fuzziness: float,
    max_iter: int,
    tolerance: float,
) -> FuzzyCMeansResult:
    """Cluster samples with fuzzy c-means."""
    rng = np.random.default_rng(seed)
    centers = initialize_kmeans_plus_plus(points, n_clusters, rng)
    center_trace = [centers.copy()]
    memberships = np.zeros((len(points), n_clusters), dtype=np.float64)

    iterations = 0
    for iteration in range(1, max_iter + 1):
        memberships = update_memberships(points, centers, fuzziness)
        updated_centers = update_centers(points, memberships, fuzziness)
        center_trace.append(updated_centers.copy())

        center_shift = np.linalg.norm(updated_centers - centers, axis=1).max()
        centers = updated_centers
        iterations = iteration
        if center_shift <= tolerance:
            break

    labels = memberships.argmax(axis=1)
    return FuzzyCMeansResult(
        labels=labels,
        centers=centers,
        memberships=memberships,
        center_trace=center_trace,
        iterations=iterations,
    )


def parse_args():
    """Parse command-line arguments."""
    parser = build_common_parser(
        "Run a teaching-oriented fuzzy c-means clustering demo.",
        default_samples=320,
        default_output="fuzzy_c_means.png",
    )
    parser.add_argument(
        "--clusters",
        type=positive_int,
        default=4,
        help="Number of cluster centers to fit.",
    )
    parser.add_argument(
        "--fuzziness",
        type=fuzziness_value,
        default=2.0,
        help="The fuzziness exponent m. Values near 1 make assignments harder.",
    )
    parser.add_argument(
        "--max-iter",
        type=positive_int,
        default=120,
        help="Maximum number of optimization steps.",
    )
    parser.add_argument(
        "--tolerance",
        type=positive_float,
        default=1e-3,
        help="Stop when every center moves less than this threshold.",
    )
    return parser.parse_args()


def main() -> None:
    """Generate data, run fuzzy c-means, and save the result figure."""
    args = parse_args()
    points = make_convex_dataset(args.samples, args.seed, n_clusters=args.clusters)
    result = run_fuzzy_c_means(
        points,
        args.clusters,
        seed=args.seed,
        fuzziness=args.fuzziness,
        max_iter=args.max_iter,
        tolerance=args.tolerance,
    )

    output_path = plot_clusters(
        points,
        result.labels,
        title="Fuzzy C-Means on Convex Blob Clusters",
        output_path=args.output,
        show=args.show,
        centers=result.centers,
        center_trace=result.center_trace,
    )
    print(
        f"Saved {output_path} | iterations={result.iterations} | "
        f"{format_cluster_summary(result.labels)}"
    )


if __name__ == "__main__":
    main()


================================================
FILE: k_means_plus_plus.py
================================================
"""Readable k-means++ clustering demo implemented with NumPy only."""

from __future__ import annotations

from dataclasses import dataclass

import numpy as np

from clustering_utils import (
    build_common_parser,
    format_cluster_summary,
    initialize_kmeans_plus_plus,
    make_convex_dataset,
    plot_clusters,
    positive_float,
    positive_int,
    squared_distance_matrix,
)


@dataclass
class KMeansResult:
    """Container for the final clustering state."""

    labels: np.ndarray
    centers: np.ndarray
    center_trace: list[np.ndarray]
    inertia: float
    iterations: int


def assign_points(points: np.ndarray, centers: np.ndarray) -> tuple[np.ndarray, float]:
    """Assign every point to the closest center and return the inertia."""
    distances = squared_distance_matrix(points, centers)
    labels = distances.argmin(axis=1)
    inertia = float(np.take_along_axis(distances, labels[:, None], axis=1).sum())
    return labels, inertia


def update_centers(
    points: np.ndarray,
    labels: np.ndarray,
    current_centers: np.ndarray,
    rng: np.random.Generator,
) -> np.ndarray:
    """Recompute centers and reseed empty clusters with far-away samples."""
    new_centers = current_centers.copy()
    distances = squared_distance_matrix(points, current_centers)
    farthest_index = int(np.argmax(distances.min(axis=1)))

    for cluster_index in range(len(current_centers)):
        members = points[labels == cluster_index]
        if len(members) == 0:
            new_centers[cluster_index] = points[farthest_index]
            farthest_index = int(rng.integers(len(points)))
            continue
        new_centers[cluster_index] = members.mean(axis=0)
    return new_centers


def run_kmeans(
    points: np.ndarray,
    n_clusters: int,
    *,
    seed: int,
    max_iter: int,
    tolerance: float,
) -> KMeansResult:
    """Cluster samples with k-means++ initialization."""
    if n_clusters > len(points):
        raise ValueError("n_clusters must not exceed the number of samples")

    rng = np.random.default_rng(seed)
    centers = initialize_kmeans_plus_plus(points, n_clusters, rng)
    center_trace = [centers.copy()]

    iterations = 0
    inertia = 0.0
    for iteration in range(1, max_iter + 1):
        labels, inertia = assign_points(points, centers)
        updated_centers = update_centers(points, labels, centers, rng)
        center_trace.append(updated_centers.copy())

        center_shift = np.linalg.norm(updated_centers - centers, axis=1).max()
        centers = updated_centers
        iterations = iteration
        if center_shift <= tolerance:
            break

    final_labels, inertia = assign_points(points, centers)
    return KMeansResult(
        labels=final_labels,
        centers=centers,
        center_trace=center_trace,
        inertia=inertia,
        iterations=iterations,
    )


def parse_args():
    """Parse command-line arguments."""
    parser = build_common_parser(
        "Run a teaching-oriented k-means++ clustering demo.",
        default_samples=320,
        default_output="k_means_plus_plus.png",
    )
    parser.add_argument(
        "--clusters",
        type=positive_int,
        default=4,
        help="Number of clusters to recover.",
    )
    parser.add_argument(
        "--max-iter",
        type=positive_int,
        default=100,
        help="Maximum number of optimization steps.",
    )
    parser.add_argument(
        "--tolerance",
        type=positive_float,
        default=1e-3,
        help="Stop when every center moves less than this threshold.",
    )
    return parser.parse_args()


def main() -> None:
    """Generate data, run k-means++, and save the result figure."""
    args = parse_args()
    points = make_convex_dataset(args.samples, args.seed, n_clusters=args.clusters)
    result = run_kmeans(
        points,
        args.clusters,
        seed=args.seed,
        max_iter=args.max_iter,
        tolerance=args.tolerance,
    )

    output_path = plot_clusters(
        points,
        result.labels,
        title="K-Means++ on Convex Blob Clusters",
        output_path=args.output,
        show=args.show,
        centers=result.centers,
        center_trace=result.center_trace,
    )
    print(
        f"Saved {output_path} | iterations={result.iterations} | "
        f"inertia={result.inertia:.2f} | {format_cluster_summary(result.labels)}"
    )


if __name__ == "__main__":
    main()


================================================
FILE: optics.py
================================================
"""OPTICS demo focused on variable-density clustering."""

from __future__ import annotations

import numpy as np
from sklearn.cluster import OPTICS

from clustering_utils import (
    NOISE_LABEL,
    build_common_parser,
    count_clusters,
    format_cluster_summary,
    load_pyplot,
    make_density_dataset,
    positive_float,
    positive_int,
    save_figure,
)


def plot_optics_result(
    points: np.ndarray,
    labels: np.ndarray,
    model: OPTICS,
    *,
    output_path,
    show: bool,
):
    """Plot both the clustered points and the OPTICS reachability landscape."""
    plt = load_pyplot(show)
    figure, (scatter_axis, reachability_axis) = plt.subplots(
        2,
        1,
        figsize=(9.0, 9.5),
        gridspec_kw={"height_ratios": [2.2, 1.3]},
    )

    unique_labels = sorted(int(label) for label in np.unique(labels))
    palette = plt.get_cmap("tab10", max(len(unique_labels), 1))

    cluster_order = 0
    for label in unique_labels:
        cluster_points = points[labels == label]
        if label == NOISE_LABEL:
            scatter_axis.scatter(
                cluster_points[:, 0],
                cluster_points[:, 1],
                c="#94a3b8",
                marker="x",
                s=48,
                label="Noise",
            )
            continue

        color = palette(cluster_order)
        scatter_axis.scatter(
            cluster_points[:, 0],
            cluster_points[:, 1],
            color=color,
            s=42,
            alpha=0.86,
            label=f"Cluster {cluster_order + 1}",
        )
        cluster_order += 1

    ordering = model.ordering_
    reachability = model.reachability_[ordering].copy()
    finite_reachability = reachability[np.isfinite(reachability)]
    replacement = float(finite_reachability.max() * 1.05) if finite_reachability.size else 1.0
    reachability[~np.isfinite(reachability)] = replacement
    ordered_labels = labels[ordering]

    cluster_order = 0
    for label in unique_labels:
        mask = ordered_labels == label
        if not np.any(mask):
            continue
        if label == NOISE_LABEL:
            color = "#94a3b8"
            name = "Noise"
        else:
            color = palette(cluster_order)
            name = f"Cluster {cluster_order + 1}"
            cluster_order += 1
        reachability_axis.bar(
            np.flatnonzero(mask),
            reachability[mask],
            width=1.0,
            color=color,
            alpha=0.9,
            label=name,
        )

    scatter_axis.set_title("OPTICS on Mixed-Density Shapes and Noise", fontsize=14, fontweight="bold")
    scatter_axis.set_xlabel("Feature 1")
    scatter_axis.set_ylabel("Feature 2")
    scatter_axis.grid(alpha=0.22, linestyle=":")
    scatter_axis.legend(loc="best")

    reachability_axis.set_title("Reachability Plot", fontsize=12, fontweight="bold")
    reachability_axis.set_xlabel("Sample order")
    reachability_axis.set_ylabel("Reachability distance")
    reachability_axis.grid(alpha=0.22, linestyle=":")

    return save_figure(figure, output_path, show)


def parse_args():
    """Parse command-line arguments."""
    parser = build_common_parser(
        "Run an OPTICS demo for variable-density clustering.",
        default_samples=240,
        default_output="optics.png",
    )
    parser.add_argument(
        "--min-samples",
        type=positive_int,
        default=6,
        help="Minimum neighborhood size used by OPTICS.",
    )
    parser.add_argument(
        "--xi",
        type=positive_float,
        default=0.08,
        help="Steepness threshold used to extract clusters from the reachability plot.",
    )
    parser.add_argument(
        "--min-cluster-size",
        type=positive_int,
        default=24,
        help="Minimum cluster size used when extracting clusters.",
    )
    return parser.parse_args()


def main() -> None:
    """Generate data, run OPTICS, and save the figure."""
    args = parse_args()
    points = make_density_dataset(args.samples, args.seed)
    model = OPTICS(
        min_samples=args.min_samples,
        xi=args.xi,
        min_cluster_size=args.min_cluster_size,
        cluster_method="xi",
    )
    labels = model.fit_predict(points)
    output_path = plot_optics_result(points, labels, model, output_path=args.output, show=args.show)
    print(
        f"Saved {output_path} | clusters={count_clusters(labels)} | "
        f"{format_cluster_summary(labels)}"
    )


if __name__ == "__main__":
    main()


================================================
FILE: requirements.txt
================================================
numpy>=1.23
matplotlib>=3.7
scikit-learn>=1.3
pytest>=8.0


================================================
FILE: spectral_clustering.py
================================================
"""Spectral clustering demo for non-convex cluster shapes."""

from __future__ import annotations

import warnings

from sklearn.cluster import SpectralClustering

from clustering_utils import (
    build_common_parser,
    count_clusters,
    format_cluster_summary,
    make_spectral_dataset,
    plot_clusters,
    positive_int,
)


def parse_args():
    """Parse command-line arguments."""
    parser = build_common_parser(
        "Run a spectral clustering demo on non-convex ring-shaped data.",
        default_samples=320,
        default_output="spectral_clustering.png",
    )
    parser.add_argument(
        "--clusters",
        type=positive_int,
        default=2,
        help="Number of graph partitions to recover.",
    )
    parser.add_argument(
        "--neighbors",
        type=positive_int,
        default=12,
        help="Number of nearest neighbors used to build the affinity graph.",
    )
    return parser.parse_args()


def main() -> None:
    """Generate data, run spectral clustering, and save the figure."""
    args = parse_args()
    points = make_spectral_dataset(args.samples, args.seed)
    model = SpectralClustering(
        n_clusters=args.clusters,
        affinity="nearest_neighbors",
        n_neighbors=args.neighbors,
        assign_labels="kmeans",
        random_state=args.seed,
    )
    with warnings.catch_warnings():
        warnings.filterwarnings(
            "ignore",
            message="Graph is not fully connected, spectral embedding may not work as expected.",
        )
        labels = model.fit_predict(points)

    output_path = plot_clusters(
        points,
        labels,
        title="Spectral Clustering on Nested Non-Convex Rings",
        output_path=args.output,
        show=args.show,
    )
    print(
        f"Saved {output_path} | clusters={count_clusters(labels)} | "
        f"{format_cluster_summary(labels)}"
    )


if __name__ == "__main__":
    main()


================================================
FILE: tests/test_repository.py
================================================
"""Smoke tests for the clustering demo repository."""

from __future__ import annotations

import re
import subprocess
import sys
from pathlib import Path

import pytest

ROOT = Path(__file__).resolve().parents[1]
README_FILES = [
    ROOT / "README.md",
    ROOT / "README.zh-CN.md",
    ROOT / "README.ja.md",
]
SCRIPT_CASES = [
    ("k_means_plus_plus.py", ["--samples", "80"]),
    ("fuzzy_c_means.py", ["--samples", "80"]),
    ("agglomerative_hierarchical.py", ["--samples", "56"]),
    ("dbscan.py", ["--samples", "120"]),
    ("optics.py", ["--samples", "120"]),
    ("spectral_clustering.py", ["--samples", "120"]),
]


@pytest.mark.parametrize(("script_name", "extra_args"), SCRIPT_CASES)
def test_demo_scripts_smoke(script_name: str, extra_args: list[str], tmp_path: Path) -> None:
    """Every demo script should run end-to-end and emit an image."""
    output_path = tmp_path / f"{Path(script_name).stem}.png"
    result = subprocess.run(
        [sys.executable, str(ROOT / script_name), "--output", str(output_path), *extra_args],
        cwd=ROOT,
        capture_output=True,
        text=True,
        timeout=180,
        check=False,
    )
    message = f"stdout:\n{result.stdout}\n\nstderr:\n{result.stderr}"
    assert result.returncode == 0, message
    assert output_path.exists(), message
    assert output_path.stat().st_size > 0, message


def test_readmes_have_balanced_code_fences_and_valid_local_links() -> None:
    """The localized READMEs should keep local links in sync with the repo."""
    markdown_link_pattern = re.compile(r"!?\[[^\]]+\]\(([^)]+)\)")

    for readme_path in README_FILES:
        text = readme_path.read_text(encoding="utf-8")
        assert text.count("```") % 2 == 0, readme_path.name
        assert not text.startswith("---\nlayout:"), readme_path.name
        assert "README.md" in text
        assert "README.zh-CN.md" in text
        assert "README.ja.md" in text

        for match in markdown_link_pattern.finditer(text):
            target = match.group(1).split("#", 1)[0]
            if not target or target.startswith(("http://", "https://", "mailto:")):
                continue
            candidate = (readme_path.parent / target).resolve()
            assert candidate.exists(), f"{readme_path.name} links to missing file: {target}"


def test_expected_repository_files_exist() -> None:
    """Key scripts and generated figures should stay present."""
    expected_paths = [
        ROOT / "clustering_utils.py",
        ROOT / "k_means_plus_plus.py",
        ROOT / "fuzzy_c_means.py",
        ROOT / "agglomerative_hierarchical.py",
        ROOT / "dbscan.py",
        ROOT / "optics.py",
        ROOT / "spectral_clustering.py",
        ROOT / "k_means_plus_plus.png",
        ROOT / "fuzzy_c_means.png",
        ROOT / "agglomerative_hierarchical.png",
        ROOT / "dbscan.png",
        ROOT / "optics.png",
        ROOT / "spectral_clustering.png",
    ]
    for path in expected_paths:
        assert path.exists(), path.name
SYMBOL INDEX (49 symbols across 8 files)

FILE: agglomerative_hierarchical.py
  class AgglomerativeResult (line 21) | class AgglomerativeResult:
  function linkage_distance (line 29) | def linkage_distance(
  function run_agglomerative (line 44) | def run_agglomerative(
  function parse_args (line 101) | def parse_args():
  function main (line 123) | def main() -> None:

FILE: clustering_utils.py
  function positive_int (line 16) | def positive_int(value: str) -> int:
  function non_negative_int (line 24) | def non_negative_int(value: str) -> int:
  function positive_float (line 32) | def positive_float(value: str) -> float:
  function fuzziness_value (line 40) | def fuzziness_value(value: str) -> float:
  function build_common_parser (line 48) | def build_common_parser(
  function load_pyplot (line 77) | def load_pyplot(show: bool):
  function save_figure (line 90) | def save_figure(figure, output_path: Path, show: bool) -> Path:
  function default_blob_centers (line 104) | def default_blob_centers(n_clusters: int) -> np.ndarray:
  function allocate_counts (line 126) | def allocate_counts(total: int, weights: Sequence[float]) -> list[int]:
  function make_convex_dataset (line 146) | def make_convex_dataset(
  function make_density_dataset (line 163) | def make_density_dataset(samples: int, seed: int) -> np.ndarray:
  function make_spectral_dataset (line 185) | def make_spectral_dataset(samples: int, seed: int) -> np.ndarray:
  function squared_distance_matrix (line 197) | def squared_distance_matrix(points: np.ndarray, centers: np.ndarray) -> ...
  function euclidean_distance_matrix (line 203) | def euclidean_distance_matrix(points: np.ndarray, others: np.ndarray) ->...
  function pairwise_distances (line 208) | def pairwise_distances(points: np.ndarray) -> np.ndarray:
  function initialize_kmeans_plus_plus (line 213) | def initialize_kmeans_plus_plus(
  function plot_clusters (line 243) | def plot_clusters(
  function format_cluster_summary (line 335) | def format_cluster_summary(labels: np.ndarray) -> str:
  function count_clusters (line 345) | def count_clusters(labels: np.ndarray) -> int:
  function ensure_python_39_compatible_features (line 350) | def ensure_python_39_compatible_features() -> None:

FILE: dbscan.py
  class DBSCANResult (line 26) | class DBSCANResult:
  function run_dbscan (line 33) | def run_dbscan(points: np.ndarray, *, eps: float, min_samples: int) -> D...
  function parse_args (line 76) | def parse_args():
  function main (line 98) | def main() -> None:

FILE: fuzzy_c_means.py
  class FuzzyCMeansResult (line 23) | class FuzzyCMeansResult:
  function update_memberships (line 33) | def update_memberships(
  function update_centers (line 54) | def update_centers(
  function run_fuzzy_c_means (line 66) | def run_fuzzy_c_means(
  function parse_args (line 103) | def parse_args():
  function main (line 137) | def main() -> None:

FILE: k_means_plus_plus.py
  class KMeansResult (line 22) | class KMeansResult:
  function assign_points (line 32) | def assign_points(points: np.ndarray, centers: np.ndarray) -> tuple[np.n...
  function update_centers (line 40) | def update_centers(
  function run_kmeans (line 61) | def run_kmeans(
  function parse_args (line 100) | def parse_args():
  function main (line 128) | def main() -> None:

FILE: optics.py
  function plot_optics_result (line 21) | def plot_optics_result(
  function parse_args (line 108) | def parse_args():
  function main (line 136) | def main() -> None:

FILE: spectral_clustering.py
  function parse_args (line 19) | def parse_args():
  function main (line 41) | def main() -> None:

FILE: tests/test_repository.py
  function test_demo_scripts_smoke (line 29) | def test_demo_scripts_smoke(script_name: str, extra_args: list[str], tmp...
  function test_readmes_have_balanced_code_fences_and_valid_local_links (line 46) | def test_readmes_have_balanced_code_fences_and_valid_local_links() -> None:
  function test_expected_repository_files_exist (line 66) | def test_expected_repository_files_exist() -> None:

