Full Code of MingchaoZhu/DeepLearning for AI

master c433d6c4acbb cached

16 files

140.3 KB

44.3k tokens

412 symbols

Repository: MingchaoZhu/DeepLearning
Branch: master
Commit: c433d6c4acbb
Files: 16
Total size: 140.3 KB

Directory structure:
gitextract_8qsunllr/

├── .gitattributes
├── LICENSE
├── README.md
├── code/
│   ├── chapter 11.py
│   ├── chapter5.py
│   ├── chapter6.py
│   ├── chapter7.py
│   ├── chapter8.py
│   ├── chapter9.py
│   └── method/
│       ├── __init__.py
│       ├── activation/
│       │   └── activation.py
│       ├── optimizer/
│       │   └── optimizer.py
│       └── weight/
│           └── weight.py
├── contents.txt
├── reference.txt
└── update.txt

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitattributes
================================================
*.txt linguist-language=python


================================================
FILE: LICENSE
================================================
MIT License

Copyright (c) 2020 Mingchao Zhu

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.


================================================
FILE: README.md
================================================
# Deep Learning

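**Deep Learning** is often described as the only comprehensive book in the field of deep learning, and is also known as the **"AI bible" (Deep Learning)**. It is written by three world-renowned experts: Ian Goodfellow, Yoshua Bengio, and Aaron Courville. The book covers the background in mathematics and related concepts, including linear algebra, probability theory, information theory, numerical optimization, and the relevant parts of machine learning. It also introduces the deep learning techniques used by practitioners in industry, including deep feedforward networks, regularization, optimization algorithms, convolutional networks, sequence modeling, and practical methodology, and it surveys applications such as natural language processing, speech recognition, computer vision, online recommender systems, bioinformatics, and video games. Finally, the book presents several research directions, covering theoretical topics such as linear factor models, autoencoders, representation learning, structured probabilistic models, Monte Carlo methods, the partition function, approximate inference, and deep generative models. It is suitable for undergraduate and graduate students in related fields.

<img src="https://github.com/MingchaoZhu/DeepLearning/blob/master/docs/cover.jpg" width="200" height="300" alt="Deep Learning cover" align=center>

You can download the Chinese edition [pdf](https://github.com/MingchaoZhu/DeepLearning/releases/download/v0.0.1/DL_cn.pdf) and the English edition [pdf](https://github.com/MingchaoZhu/DeepLearning/releases/download/v0.0.0/DL_en.pdf) of *Deep Learning* and read them directly.

For the work of this project, you can directly download [深度学习_原理与代码实现.pdf](https://github.com/MingchaoZhu/DeepLearning/releases/download/v1.1.1/default.pdf) (Deep Learning: Principles and Code Implementation; the book will be updated continuously).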

---

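*Deep Learning* can be considered an introductory classic for deep learning and artificial intelligence; many algorithm enthusiasts, machine learning training courses, and interviews at internet companies refer to it. However, the book is dense, and no official code implementation is provided, so some parts are hard to understand. This project **re-describes the concepts in the book based on mathematical derivations and the principles behind them**, and reproduces the book's content in **Python** (mainly with the numpy library) (**source-level code implementation. Both the derivations and the code are in the pdf files in the download area**; the code for the important parts is also placed in the **code folder**).

My own ability is limited, but I sincerely hope this work can help more people learn deep learning algorithms. I need your suggestions and help. If you find errors or unclear explanations while reading, please collect your suggestions and raise them in Issues. If you would like to join the writing of this work or have other questions, you can contact me by email. If you use this book in your work or blog, please include a reference link.

While writing, I consulted many excellent works available online; all reference resources are listed in the `reference.txt` file.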

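# Message

This work consists of writing the book [深度学习_原理与代码实现.pdf](https://github.com/MingchaoZhu/DeepLearning/releases/download/v1.1.1/default.pdf). As you can see in the pdf file, every concept covered in *Deep Learning* is given a detailed description, a derivation at the level of principles, and a code implementation. The code does not call any deep learning framework such as Tensorflow, PyTorch, or MXNet, and not even sklearn (the parts of the pdf that use sklearn do so only to verify that the code is correct). All code is implemented from first principles (with Python's basic library NumPy) and is commented in detail, consistent with the explanation of the principles placed above each code section, so you can read the principles and the code together.

This work started from my own passion, but completing it requires a great deal of time and energy; I usually write until two or three in the morning. The derivations, code, and figures are all polished gradually, and I will ensure the quality of this work. The work will keep being updated until it is finished, and the chapters already uploaded will also continue to be supplemented. If, while reading, you come across a concept you would like described or an error, please let me know by email.

Thank you very much for your recognition and for spreading the word. Finally, please wait for the next update.

I am Mingchao Zhu (朱明超); my email is: deityrayleigh@gmail.com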

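# Update Notes

2020/3/:

```python
1. Revised the decision tree part of Chapter 5, adding the principles of ID3 and CART; the code implementation focuses on CART.
2. Chapter 7: added the derivation of the optimal solutions under L1 and L2 regularization (i.e. why L1 yields sparse solutions).
3. Chapter 7: added derivations and code implementations of ensemble learning methods, including Bagging (random forest) and Boosting (Adaboost, GBDT, XGBoost).
4. Chapter 8: added derivations of Newton's method and quasi-Newton methods (DFP, BFGS, L-BFGS).
5. Chapter 11: added derivations and code implementations of Bayesian linear regression, Gaussian process regression (GPR), and Bayesian optimization.
```
The contents of each later update will be collected in the `update.txt` file.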

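# Chapter Contents and File Downloads

Besides the concepts from the book *Deep Learning*, **this project also adds supplementary material to individual chapters, for example the analysis and code implementation of random forest, Adaboost, GBDT, and XGBoost in the ensemble learning part of Chapter 7, or the description of some current mainstream methods in Chapter 12**. The top-level chapter list and the pdf download links are given in the table below; for the actual table of contents inside each pdf file, see `contents.txt`. You can download individual chapters from the pdf links below, or download all files directly from the [releases](https://github.com/MingchaoZhu/DeepLearning/releases) page.

| Chinese chapter | English chapter | Download<br />(with derivations and code) |
| ------------ | ------------ | ------------ |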
| 第一章 前言 | 1 Introduction |  |
| 第二章 线性代数 | 2 Linear Algebra | [pdf](https://github.com/MingchaoZhu/DeepLearning/raw/master/2%20%E7%BA%BF%E6%80%A7%E4%BB%A3%E6%95%B0.pdf) |
| 第三章 概率与信息论                 | 3 Probability and Information Theory | [pdf](https://github.com/MingchaoZhu/DeepLearning/raw/master/3%20%E6%A6%82%E7%8E%87%E4%B8%8E%E4%BF%A1%E6%81%AF%E8%AE%BA.pdf) |
| 第四章 数值计算                     | 4 Numerical Computation | [pdf](https://github.com/MingchaoZhu/DeepLearning/raw/master/4%20%E6%95%B0%E5%80%BC%E8%AE%A1%E7%AE%97.pdf) |
| 第五章 机器学习基础                 | 5 Machine Learning Basics | [pdf](https://github.com/MingchaoZhu/DeepLearning/raw/master/5%20%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0%E5%9F%BA%E7%A1%80.pdf) |
| 第六章 深度前馈网络                 | 6 Deep Feedforward Networks | [pdf](https://github.com/MingchaoZhu/DeepLearning/raw/master/6%20%E6%B7%B1%E5%BA%A6%E5%89%8D%E9%A6%88%E7%BD%91%E7%BB%9C.pdf) |
| 第七章 深度学习中的正则化           | 7 Regularization for Deep Learning | [pdf](https://github.com/MingchaoZhu/DeepLearning/raw/master/7%20%E6%B7%B1%E5%BA%A6%E5%AD%A6%E4%B9%A0%E4%B8%AD%E7%9A%84%E6%AD%A3%E5%88%99%E5%8C%96.pdf) |
| 第八章 深度模型中的优化 | 8 Optimization for Training Deep Models | [pdf](https://github.com/MingchaoZhu/DeepLearning/raw/master/8%20%E6%B7%B1%E5%BA%A6%E6%A8%A1%E5%9E%8B%E4%B8%AD%E7%9A%84%E4%BC%98%E5%8C%96.pdf) |
| 第九章 卷积网络 | 9 Convolutional Networks | [pdf](https://github.com/MingchaoZhu/DeepLearning/raw/master/9%20%E5%8D%B7%E7%A7%AF%E7%BD%91%E7%BB%9C.pdf) |
| 第十章 序列建模：循环和递归网络 | 10 Sequence Modeling: Recurrent and Recursive Nets |  |
| 第十一章 实践方法论                 | 11 Practical Methodology | [pdf](https://github.com/MingchaoZhu/DeepLearning/raw/master/11%20%E5%AE%9E%E8%B7%B5%E6%96%B9%E6%B3%95%E8%AE%BA.pdf) |
| 第十二章 应用 | 12 Applications |  |
| 第十三章 线性因子模型 | 13 Linear Factor Models |  |
| 第十四章 自编码器                   | 14 Autoencoders |  |
| 第十五章 表示学习                   | 15 Representation Learning |  |
| 第十六章 深度学习中的结构化概率模型 | 16 Structured Probabilistic Models for Deep Learning |  |
| 第十七章 蒙特卡罗方法 | 17 Monte Carlo Methods |  |
| 第十八章 直面配分函数 | 18 Confronting the Partition Function |  |
| 第十九章 近似推断                   | 19 Approximate Inference |  |
| 第二十章 深度生成模型 | 20 Deep Generative Models |  |

Chapters not yet uploaded will be uploaded gradually.

# Acknowledgements

Thanks for recognizing and promoting this project.

+ 专知: https://mp.weixin.qq.com/s/dVD-vKJsMGqnBz2v4O-Q3Q
+ GitHubDaily: https://m.weibo.cn/5722964389/4504392843690487
+ 程序员遇见GitHub: https://mp.weixin.qq.com/s/EzFOnwpkv7mr2TSjPtVG9A
+ 爱可可: https://m.weibo.cn/1402400261/4503389646699745

# Sponsorship

Writing this project takes time and energy. If this project has helped you, you can buy the author an ice cream:

<img src="./docs/pay.jpg" width="200" height="200" alt="Payment" align=center>


================================================
FILE: code/chapter 11.py
================================================
import pandas as pd
import numpy as np
import itertools
import time
import re
from abc import ABC, abstractmethod
from scipy.stats import norm
import matplotlib.pyplot as plt


def cal_conf_matrix(labels, preds):
    """
    Compute the 2x2 confusion matrix [[TP, FN], [FP, TN]].

    Parameters:
    labels: sample labels (ground truth, 0 or 1)
    preds: predicted probabilities of the positive class
    """
    result = pd.DataFrame({'probability': np.array(preds), 'label': np.array(labels)})
    is_pos = result['label'] == 1
    is_neg = result['label'] == 0
    pred_pos = result['probability'] >= 0.5  # note: 0.5 is used as the decision threshold
    cm = np.zeros((2, 2), dtype=int)
    cm[0,0] = (is_pos & pred_pos).sum()   # TP
    cm[0,1] = (is_pos & ~pred_pos).sum()  # FN
    cm[1,0] = (is_neg & pred_pos).sum()   # FP
    cm[1,1] = (is_neg & ~pred_pos).sum()  # TN
    return cm


def cal_PRF1(labels, preds):
    """
    Compute precision (P), recall (R), and the F1 score.
    """
    cm = cal_conf_matrix(labels, preds)
    P = cm[0,0]/(cm[0,0]+cm[1,0])
    R = cm[0,0]/(cm[0,0]+cm[0,1])
    F1 = 2*P*R/(P+R)
    return P, R, F1
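

# Illustrative sketch, not part of the original file: the same P / R / F1
# formulas as cal_PRF1 above, computed directly from raw counts and with a
# guard against empty denominators. The helper name `prf1_from_counts` and the
# convention of returning 0.0 when a ratio is undefined are assumptions made
# for this example.
def prf1_from_counts(tp, fp, fn):
    # Precision: true positives / everything predicted positive
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    # Recall: true positives / everything actually positive
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    # F1: harmonic mean of precision and recall
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1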


def cal_PRcurve(labels, preds):
    """
    Compute the points on the precision-recall (PR) curve.
    """
    n_sample = len(labels)
    result = pd.DataFrame(index=range(0,n_sample),columns=('probability','label'))
    result['label'] = np.array(labels)
    result['probability'] = np.array(preds)
    result.sort_values('probability',inplace=True,ascending=False)
    PandR = pd.DataFrame(index=range(len(labels)),columns=('P','R'))
    for j in range(len(result)):
        # Use each probability in turn as the threshold: the top j+1 samples are predicted positive
        result_j = result.head(n=j+1)
        P = len(result_j[result_j['label']==1])/float(len(result_j))  # actual positives so far / samples predicted positive so far
        R = len(result_j[result_j['label']==1])/float(len(result[result['label']==1]))  # true positives so far / all actual positives
        PandR.iloc[j] = [P,R]
    return PandR


def cal_ROCcurve(labels, preds):
    """
    Compute the points on the ROC curve.
    """
    n_sample = len(labels)
    result = pd.DataFrame(index=range(0,n_sample),columns=('probability','label'))
    result['label'] = np.array(labels)
    result['probability'] = np.array(preds)
    # Compute TPR and FPR
    result.sort_values('probability',inplace=True,ascending=False)
    TPRandFPR=pd.DataFrame(index=range(len(result)),columns=('TPR','FPR'))
    for j in range(len(result)):
        # Use each probability in turn as the threshold: the top j+1 samples are predicted positive
        result_j=result.head(n=j+1)
        TPR=len(result_j[result_j['label']==1])/float(len(result[result['label']==1]))  # true positives so far / all actual positives
        FPR=len(result_j[result_j['label']==0])/float(len(result[result['label']==0]))  # false positives so far / all actual negatives
        TPRandFPR.iloc[j]=[TPR,FPR]
    return TPRandFPR


def timeit(func):
    """
    Decorator that measures the execution time of a function.
    """
    def wrapper(*args, **kwargs):
        time_start = time.time()
        result = func(*args, **kwargs)
        time_end = time.time()
        exec_time = time_end - time_start
        print("{function} exec time: {time}s".format(function=func.__name__,time=exec_time))
        return result
    return wrapper
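

# Illustrative sketch, not part of the original file: a variant of the
# decorator above that uses functools.wraps so the decorated function keeps its
# own name and docstring, and time.perf_counter, a monotonic clock intended for
# measuring short durations. The name `timeit_wraps` is hypothetical.
import functools
import time


def timeit_wraps(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print("{} exec time: {:.6f}s".format(func.__name__, elapsed))
        return result
    return wrapper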

@timeit
def area_auc(labels, preds):
    """
    Compute AUC as the area under the ROC curve.
    """
    TPRandFPR = cal_ROCcurve(labels, preds)
    # Compute AUC as the sum of the areas of the small rectangles under the curve
    auc = 0.
    prev_x = 0
    for x, y in zip(TPRandFPR.FPR,TPRandFPR.TPR):
        if x != prev_x:
            auc += (x - prev_x) * y
            prev_x = x
    return auc

@timeit
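def naive_auc(labels, preds):
    """
    Compute AUC with the probabilistic (pair-counting) method.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    total_pair = n_pos * n_neg  # total number of (positive, negative) pairs
    labels_preds = zip(labels, preds)
    labels_preds = sorted(labels_preds,key=lambda x:x[1])  # sort by predicted probability, ascending
    count_neg = 0  # number of negative samples seen so far
    satisfied_pair = 0   # number of pairs in which the positive sample is ranked above the negative one
    for i in range(len(labels_preds)):
        if labels_preds[i][0] == 1:
            satisfied_pair += count_neg  # every negative seen so far is ranked below this positive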
        else:
            count_neg += 1
    return satisfied_pair / float(total_pair)
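

# Illustrative sketch, not part of the original file: the pairwise definition
# of AUC written out explicitly, counting a tied (positive, negative) pair as
# one half. naive_auc above relies on a stable sort, so with tied scores its
# result depends on the input order. The function name `auc_pairwise` and the
# 0.5 convention for ties are assumptions made for this example; the double
# loop costs O(n_pos * n_neg) and is meant for small inputs only.
def auc_pairwise(labels, preds):
    pos = [p for l, p in zip(labels, preds) if l == 1]
    neg = [p for l, p in zip(labels, preds) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half a win
    return wins / (len(pos) * len(neg))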


#####----Bayesian Hyperparameter Optimization----####
class KernelBase(ABC):
    
    def __init__(self):
        super().__init__()
        self.params = {}
        self.hyperparams = {}

    @abstractmethod
    def _kernel(self, X, Y):
        raise NotImplementedError

    def __call__(self, X, Y=None):
        return self._kernel(X, Y)

    def __str__(self):
        P, H = self.params, self.hyperparams
        p_str = ", ".join(["{}={}".format(k, v) for k, v in P.items()])
        return "{}({})".format(H["op"], p_str)

    def summary(self):
        return {
            "op": self.hyperparams["op"],
            "params": self.params,
            "hyperparams": self.hyperparams,
        }


class RBFKernel(KernelBase):
    
    def __init__(self, sigma=None):
        """
        RBF kernel.
        """
        super().__init__()
        self.hyperparams = {"op": "RBFKernel"}
        self.params = {"sigma": sigma}  # If sigma is None, it defaults to np.sqrt(n_features / 2), where n_features is the number of features.

    def _kernel(self, X, Y=None):
        """
        Compute the RBF kernel between every pair of rows of X and Y. If Y is None, Y = X.

        Parameters:
        X: input array of shape (n_samples, n_features)
        Y: input array of shape (m_samples, n_features)
        """
        X = X.reshape(-1, 1) if X.ndim == 1 else X
        Y = X if Y is None else Y
        Y = Y.reshape(-1, 1) if Y.ndim == 1 else Y
        assert X.ndim == 2 and Y.ndim == 2, "X and Y must have 2 dimensions"
        sigma = np.sqrt(X.shape[1] / 2) if self.params["sigma"] is None else self.params["sigma"]
        X, Y = X / sigma, Y / sigma
        D = -2 * X @ Y.T + np.sum(Y**2, axis=1) + np.sum(X**2, axis=1)[:, np.newaxis]
        D[D < 0] = 0
        return np.exp(-0.5 * D)
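

# Illustrative sketch, not part of the original file: a check that the
# expansion used in RBFKernel._kernel above,
#     ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y,
# gives the same Gram matrix as explicit pairwise differences. The function
# names and the explicit `sigma` argument are assumptions made for this example.
import numpy as np


def rbf_gram_direct(X, Y, sigma=1.0):
    # Explicit pairwise differences, shape (n, m, d); simple but memory-hungry
    diff = X[:, np.newaxis, :] - Y[np.newaxis, :, :]
    sq_dist = np.sum(diff ** 2, axis=2)
    return np.exp(-0.5 * sq_dist / sigma ** 2)


def rbf_gram_expanded(X, Y, sigma=1.0):
    # Same quantity through the expanded squared norm, using one matrix product
    Xs, Ys = X / sigma, Y / sigma
    D = np.sum(Xs ** 2, axis=1)[:, np.newaxis] + np.sum(Ys ** 2, axis=1) - 2 * Xs @ Ys.T
    D = np.maximum(D, 0)  # clip tiny negative values caused by round-off
    return np.exp(-0.5 * D)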
    

class KernelInitializer(object):
    
    def __init__(self, param=None):
        self.param = param

    def __call__(self):
        r = r"([a-zA-Z0-9]*)=([^,)]*)"
        kr_str = self.param.lower()
        kwargs = dict([(i, eval(j)) for (i, j) in re.findall(r, self.param)])
        if "rbf" in kr_str:
            kernel = RBFKernel(**kwargs)
        else:
            raise NotImplementedError("{}".format(kr_str))
        return kernel


class GPRegression:
    """
    Gaussian process regression.
    """
    def __init__(self, kernel="RBFKernel", sigma=1e-10):
        self.kernel = KernelInitializer(kernel)()
        self.params = {"GP_mean": None, "GP_cov": None, "X": None}
        self.hyperparams = {"kernel": str(self.kernel), "sigma": sigma}

    def fit(self, X, y):
        """
        Build the GP prior from the observed samples.

        Parameters:
        X: input array of shape (n_samples, n_features)
        y: target values for X, of shape (n_samples)
        """
        mu = np.zeros(X.shape[0])
        Cov = self.kernel(X, X)
        self.params["X"] = X
        self.params["y"] = y
        self.params["GP_cov"] = Cov
        self.params["GP_mean"] = mu

    def predict(self, X_star, conf_interval=0.95):
        """
        Predict for new samples X_star.

        Parameters:
        X_star: input array of shape (n_samples, n_features)
        conf_interval: confidence level, float in (0, 1), default=0.95
        """
        X = self.params["X"]
        y = self.params["y"]
        K = self.params["GP_cov"]
        sigma = self.hyperparams["sigma"]
        K_star = self.kernel(X_star, X)
        K_star_star = self.kernel(X_star, X_star)
        sig = np.eye(K.shape[0]) * sigma
        K_y_inv = np.linalg.pinv(K + sig)
        mean = K_star @ K_y_inv @ y
        cov = K_star_star - K_star @ K_y_inv @ K_star.T
        percentile = norm.ppf(conf_interval)
        conf = percentile * np.sqrt(np.diag(cov))
        return mean, conf, cov
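

# Illustrative sketch, not part of the original file: the same GP posterior
# formulas as GPRegression.predict above,
#     mean = K_* (K + noise I)^{-1} y
#     cov  = K_** - K_* (K + noise I)^{-1} K_*^T,
# but solved with a Cholesky factorization instead of np.linalg.pinv. This is
# a swapped-in numerical technique, not the author's method. The function name
# `gp_posterior_sketch`, the `length_scale` and `noise` parameters, and the
# fixed RBF kernel are assumptions made for this example.
import numpy as np


def gp_posterior_sketch(X, y, X_star, length_scale=1.0, noise=1e-8):
    def k(A, B):
        # RBF kernel from squared distances (clipped at 0 against round-off)
        d = np.sum(A ** 2, axis=1)[:, None] + np.sum(B ** 2, axis=1)[None, :] - 2 * A @ B.T
        return np.exp(-0.5 * np.maximum(d, 0) / length_scale ** 2)

    K = k(X, X) + noise * np.eye(len(X))   # noisy training covariance
    K_s = k(X_star, X)                     # cross-covariance, shape (n_star, n)
    K_ss = k(X_star, X_star)
    L = np.linalg.cholesky(K)              # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # (K)^{-1} y
    v = np.linalg.solve(L, K_s.T)                         # L^{-1} K_s^T
    mean = K_s @ alpha
    cov = K_ss - v.T @ v
    return mean, cov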


class BayesianOptimization:
    
    def __init__(self):
        self.model = GPRegression()
        
    def acquisition_function(self, Xsamples):
        mu, _, cov = self.model.predict(Xsamples)
        mu = mu if mu.ndim==1 else (mu.T)[0]
        ysample = np.random.multivariate_normal(mu, cov) 
        return ysample
    
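    def opt_acquisition(self, X, n_samples=20):
        # Candidate search strategy: common choices are random search, grid search, or local search.
        # Here we use simple random search; the range of the candidates can also be set here.
        Xsamples = np.random.randint(low=1,high=50,size=n_samples*X.shape[1])
        Xsamples = Xsamples.reshape(n_samples, X.shape[1])
        # Evaluate the acquisition function and take the candidate with the largest value
        scores = self.acquisition_function(Xsamples)
        ix = np.argmax(scores)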
        return Xsamples[ix, 0]
    
    def fit(self, f, X, y):
        # Fit the GPR model
        self.model.fit(X, y)
        # Optimization loop
        for i in range(15):
            x_star = self.opt_acquisition(X)  # next point to sample
            y_star = f(x_star)
            mean, conf, cov = self.model.predict(np.array([[x_star]]))
            # Add the new observation to the data set
            X = np.vstack((X, [[x_star]]))
            y = np.vstack((y, [[y_star]]))
            # Update the GPR model
            self.model.fit(X, y)
        ix = np.argmax(y)
        print('Best Result: x=%.3f, y=%.3f' % (X[ix], y[ix]))
        return X[ix], y[ix]    



================================================
FILE: code/chapter5.py
================================================
import numpy as np
import cvxopt
import math


########-----NaiveBayes------#########
class NaiveBayes():
    
    def __init__(self):
        self.parameters = [] # mean and variance of every feature for every class
        self.y = None
        self.classes = None

    def fit(self, X, y):
        self.y = y
        self.classes = np.unique(y) # class labels
        # Compute the mean and variance of every feature for every class
        for i, c in enumerate(self.classes):
            # Select the rows of X that belong to class c
            X_where_c = X[np.where(self.y == c)]
            self.parameters.append([])
            # Store the mean and variance of each feature
            for col in X_where_c.T:
                parameters = {"mean": col.mean(), "var": col.var()}
                self.parameters[i].append(parameters)
    
    def _calculate_prior(self, c):
        """
        Prior: relative frequency of class c in the training labels.
        """
        frequency = np.mean(self.y == c)
        return frequency

    def _calculate_likelihood(self, mean, var, X):
        """
        Likelihood of a feature value under a Gaussian with the given mean and variance.
        """
        # Gaussian density
        eps = 1e-4 # avoids division by zero
        coeff = 1.0 / math.sqrt(2.0 * math.pi * var + eps)
        exponent = math.exp(-(math.pow(X - mean, 2) / (2 * var + eps)))
        return coeff * exponent
    
    def _calculate_probabilities(self, X):
        posteriors = []
        for i, c in enumerate(self.classes):
            posterior = self._calculate_prior(c)
            for feature_value, params in zip(X, self.parameters[i]):
                # Independence assumption
                # P(x1,x2|Y) = P(x1|Y)*P(x2|Y)
                likelihood = self._calculate_likelihood(params["mean"], params["var"], feature_value)
                posterior *= likelihood
            posteriors.append(posterior)
        # Return the class with the largest posterior probability
        return self.classes[np.argmax(posteriors)]
    
    def predict(self, X):
        y_pred = [self._calculate_probabilities(sample) for sample in X]
        return y_pred
    
    def score(self, X, y):
        y_pred = self.predict(X)
        accuracy = np.sum(y == y_pred, axis=0) / len(y)
        return accuracy
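

# Illustrative sketch, not part of the original file: the same Gaussian
# likelihood as NaiveBayes._calculate_likelihood above, evaluated in log space.
# Multiplying many small likelihoods can underflow to 0.0 when there are many
# features; summing logarithms keeps the class scores comparable. The function
# names and the (mean, var) tuple layout are assumptions made for this example.
import math


def gaussian_log_likelihood(x, mean, var, eps=1e-4):
    # log of: 1 / sqrt(2*pi*var + eps) * exp(-(x - mean)^2 / (2*var + eps))
    return -0.5 * math.log(2.0 * math.pi * var + eps) - (x - mean) ** 2 / (2.0 * var + eps)


def log_posterior_score(x_row, prior, feature_params):
    # log prior plus the sum of per-feature log likelihoods (independence assumption)
    score = math.log(prior)
    for value, (mean, var) in zip(x_row, feature_params):
        score += gaussian_log_likelihood(value, mean, var)
    return score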


########-----LogisticRegression------#########
def Sigmoid(x):
    return 1/(1 + np.exp(-x))

class LogisticRegression():

    def __init__(self, learning_rate=.1):
        self.param = None
        self.learning_rate = learning_rate
        self.sigmoid = Sigmoid

    def _initialize_parameters(self, X):
        n_features = np.shape(X)[1]
        # Initialize the parameters theta uniformly in [-1/sqrt(N), 1/sqrt(N)]
        limit = 1 / math.sqrt(n_features)
        self.param = np.random.uniform(-limit, limit, (n_features,))

    def fit(self, X, y, n_iterations=4000):
        self._initialize_parameters(X)
        # Iteratively update the parameters theta
        for i in range(n_iterations):
            # Compute the predictions
            y_pred = self.sigmoid(X.dot(self.param))
            # Gradient step that minimizes the loss
            self.param -= self.learning_rate * -(y - y_pred).dot(X)

    def predict(self, X):
        y_pred = self.sigmoid(X.dot(self.param))
        return y_pred

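    def score(self, X, y):
        # predict() returns probabilities, so threshold at 0.5 before comparing with the labels
        y_pred = (self.predict(X) >= 0.5).astype(int)
        accuracy = np.sum(y == y_pred, axis=0) / len(y)
        return accuracy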
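

# Illustrative sketch, not part of the original file: one gradient step of the
# update used in LogisticRegression.fit above,
#     theta <- theta - lr * X^T (sigmoid(X theta) - y),
# written with plain Python lists so each term is visible. The gradient is
# summed over samples (not averaged), matching the update in fit. The function
# name `logistic_gradient_step` is an assumption made for this example.
import math


def logistic_gradient_step(theta, X, y, learning_rate=0.1):
    # Predicted probabilities sigmoid(x . theta) for every row x
    preds = [1.0 / (1.0 + math.exp(-sum(t * xi for t, xi in zip(theta, row)))) for row in X]
    # Gradient of the negative log-likelihood with respect to each theta_j
    grad = [sum((p - yi) * row[j] for p, yi, row in zip(preds, y, X)) for j in range(len(theta))]
    return [t - learning_rate * g for t, g in zip(theta, grad)]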
    

########-----SupportVectorMachine------#########
# Suppress cvxopt solver output
cvxopt.solvers.options['show_progress'] = False

def linear_kernel(**kwargs):
    """
    Linear kernel.
    """
    def f(x1, x2):
        return np.inner(x1, x2)
    return f

def polynomial_kernel(power, coef, **kwargs):
    """
    Polynomial kernel.
    """
    def f(x1, x2):
        return (np.inner(x1, x2) + coef)**power
    return f

def rbf_kernel(gamma, **kwargs):
    """
    Gaussian (RBF) kernel.
    """
    def f(x1, x2):
        distance = np.linalg.norm(x1 - x2) ** 2
        return np.exp(-gamma * distance)
    return f

class SupportVectorMachine():

    def __init__(self, kernel=linear_kernel, power=4, gamma=None, coef=4):
        self.kernel = kernel
        self.power = power
        self.gamma = gamma
        self.coef = coef
        self.lagr_multipliers = None
        self.support_vectors = None
        self.support_vector_labels = None
        self.intercept = None

    def fit(self, X, y):

        n_samples, n_features = np.shape(X)

        # gamma defaults to 1 / n_features
        if not self.gamma:
            self.gamma = 1 / n_features
        
        # Instantiate the kernel function
        self.kernel = self.kernel(
            power=self.power,
            gamma=self.gamma,
            coef=self.coef)

        # Compute the Gram matrix
        kernel_matrix = np.zeros((n_samples, n_samples))
        for i in range(n_samples):
            for j in range(n_samples):
                kernel_matrix[i, j] = self.kernel(X[i], X[j])
        
        # Set up the quadratic program
        # in the form min (1/2)x.T*P*x+q.T*x, s.t. G*x<=h, A*x=b
        P = cvxopt.matrix(np.outer(y, y) * kernel_matrix, tc='d')
        q = cvxopt.matrix(np.ones(n_samples) * -1)
        A = cvxopt.matrix(y, (1, n_samples), tc='d')
        b = cvxopt.matrix(0, tc='d')

        G = cvxopt.matrix(np.identity(n_samples) * -1)
        h = cvxopt.matrix(np.zeros(n_samples))

        # Solve the quadratic program with cvxopt
        minimization = cvxopt.solvers.qp(P, q, G, h, A, b)
        lagr_mult = np.ravel(minimization['x'])
        # Indices of the non-zero alpha values
        idx = lagr_mult > 1e-7
        # alpha values
        self.lagr_multipliers = lagr_mult[idx]
        # Support vectors
        self.support_vectors = X[idx]
        # Labels of the support vectors
        self.support_vector_labels = y[idx]

        # Compute the intercept b from the first support vector
        self.intercept = self.support_vector_labels[0]
        for i in range(len(self.lagr_multipliers)):
            self.intercept -= self.lagr_multipliers[i] * self.support_vector_labels[
                i] * self.kernel(self.support_vectors[i], self.support_vectors[0])

    def predict(self, X):
        y_pred = []
        for sample in X:
            # For the input x, compute f(x)
            prediction = 0
            for i in range(len(self.lagr_multipliers)):
                prediction += self.lagr_multipliers[i] * self.support_vector_labels[
                    i] * self.kernel(self.support_vectors[i], sample)
            prediction += self.intercept
            y_pred.append(np.sign(prediction))
        return np.array(y_pred)
    
    def score(self, X, y):
        y_pred = self.predict(X)
        accuracy = np.sum(y == y_pred, axis=0) / len(y)
        return accuracy

    
########-----KNN------#########
class KNN():
    
    def __init__(self, k=10):
        self._k = k

    def fit(self, X, y):
        self._unique_labels = np.unique(y)
        self._class_num = len(self._unique_labels)
        self._datas = X
        self._labels = y.astype(np.int32)

    def predict(self, X):
        # Squared Euclidean distances via ||x||^2 - 2 x.d + ||d||^2
        dist = np.sum(np.square(X), axis=1, keepdims=True) - 2 * np.dot(X, self._datas.T)
        dist = dist + np.sum(np.square(self._datas), axis=1, keepdims=True).T
        dist = np.argsort(dist)[:,:self._k]
        return np.array([np.argmax(np.bincount(self._labels[dist][i])) for i in range(len(X))])

    def score(self, X, y):
        y_pred = self.predict(X)
        accuracy = np.sum(y == y_pred, axis=0) / len(y)
        return accuracy

    
########-----DecisionTree------#########
class DecisionNode():

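    def __init__(self, feature_i=None, threshold=None,
                 value=None, true_branch=None, false_branch=None):
        self.feature_i = feature_i          # index of the feature tested at this node
        self.threshold = threshold          # threshold of the feature tested at this node
        self.value = value                  # node value (if the node is a leaf)
        self.true_branch = true_branch      # left subtree (threshold met: samples whose feature value is >= the split value)
        self.false_branch = false_branch    # right subtree (threshold not met: samples whose feature value is < the split value)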

        
def divide_on_feature(X, feature_i, threshold):
    """
    Split the data set into two subsets according to the split feature and the split value.
    """
    split_func = None
    if isinstance(threshold, int) or isinstance(threshold, float):
        split_func = lambda sample: sample[feature_i] >= threshold
    else:
        split_func = lambda sample: sample[feature_i] == threshold

    X_1 = np.array([sample for sample in X if split_func(sample)])
    X_2 = np.array([sample for sample in X if not split_func(sample)])

    # Return a tuple: the two subsets usually differ in length, and recent NumPy
    # versions refuse to build a single array from ragged inputs
    return X_1, X_2


class DecisionTree(object):

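    def __init__(self, min_samples_split=2, min_impurity=1e-7,
                 max_depth=float("inf"), loss=None):
        self.root = None  # root node
        self.min_samples_split = min_samples_split  # minimum number of samples required to split
        self.min_impurity = min_impurity  # minimum impurity gain required to split
        self.max_depth = max_depth  # maximum depth of the tree
        self._impurity_calculation = None  # function that scores a split, e.g. information gain for classification trees
        self._leaf_value_calculation = None  # function that computes the value of y at a leaf
        self.one_dim = None  # whether y is one-dimensional (i.e. not one-hot encoded)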

    def fit(self, X, y):
        self.one_dim = len(np.shape(y)) == 1
        self.root = self._build_tree(X, y)

    def _build_tree(self, X, y, current_depth=0):
        """
        Build the decision tree recursively.
        """
        largest_impurity = 0
        best_criteria = None    # feature index and threshold of the best split so far
        best_sets = None        # data subsets of the best split

        if len(np.shape(y)) == 1:
            y = np.expand_dims(y, axis=1)

        Xy = np.concatenate((X, y), axis=1)

        n_samples, n_features = np.shape(X)

        if n_samples >= self.min_samples_split and current_depth <= self.max_depth:
            # Score the candidate splits of every feature
            for feature_i in range(n_features):
                feature_values = np.expand_dims(X[:, feature_i], axis=1)
                unique_values = np.unique(feature_values)

                # Try every value of feature i as a threshold to find the best split
                for threshold in unique_values:
                    # Split X and y by whether feature i meets the threshold; Xy1 is the subset that meets it
                    Xy1, Xy2 = divide_on_feature(Xy, feature_i, threshold)

                    if len(Xy1) > 0 and len(Xy2) > 0:
                        # Extract the y columns from Xy
                        y1 = Xy1[:, n_features:]
                        y2 = Xy2[:, n_features:]

                        # Compute the impurity gain of this split
                        impurity = self._impurity_calculation(y, y1, y2)

                        # Keep the split if its gain is larger
                        if impurity > largest_impurity:
                            largest_impurity = impurity
                            best_criteria = {"feature_i": feature_i, "threshold": threshold}
                            best_sets = {
                                "leftX": Xy1[:, :n_features],   # X of the left subtree
                                "lefty": Xy1[:, n_features:],   # y of the left subtree
                                "rightX": Xy2[:, :n_features],  # X of the right subtree
                                "righty": Xy2[:, n_features:]   # y of the right subtree
                                }

        if largest_impurity > self.min_impurity:
            # Build the left and right subtrees
            true_branch = self._build_tree(best_sets["leftX"], best_sets["lefty"], current_depth + 1)
            false_branch = self._build_tree(best_sets["rightX"], best_sets["righty"], current_depth + 1)
            return DecisionNode(feature_i=best_criteria["feature_i"], threshold=best_criteria[
                                "threshold"], true_branch=true_branch, false_branch=false_branch)

        # Otherwise this is a leaf: compute its value
        leaf_value = self._leaf_value_calculation(y)

        return DecisionNode(value=leaf_value)


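    def predict_value(self, x, tree=None):
        """
        Predict a sample by searching down the tree recursively.
        """
        # Start at the root node
        if tree is None:
            tree = self.root

        # Recursion exit: a leaf has been reached
        if tree.value is not None:
            return tree.value

        # Look up the feature tested at the current node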
        feature_value = x[tree.feature_i]

        branch = tree.false_branch
        if isinstance(feature_value, int) or isinstance(feature_value, float):
            if feature_value >= tree.threshold:
                branch = tree.true_branch
        elif feature_value == tree.threshold:
            branch = tree.true_branch

        return self.predict_value(x, branch)

    def predict(self, X):
        y_pred = [self.predict_value(sample) for sample in X]
        return y_pred

    def score(self, X, y):
        y_pred = self.predict(X)
        accuracy = np.sum(y == y_pred, axis=0) / len(y)
        return accuracy
    
    def print_tree(self, tree=None, indent=" "):
        """
        Print the tree.
        """
        if not tree:
            tree = self.root

        if tree.value is not None:
            print(tree.value)
        else:
            print("feature|threshold -> %s | %s" % (tree.feature_i, tree.threshold))
            print("%sT->" % (indent), end="")
            self.print_tree(tree.true_branch, indent + indent)
            print("%sF->" % (indent), end="")
            self.print_tree(tree.false_branch, indent + indent)


def calculate_entropy(y):
    log2 = lambda x: math.log(x) / math.log(2)
    unique_labels = np.unique(y)
    entropy = 0
    for label in unique_labels:
        count = len(y[y == label])
        p = count / len(y)
        entropy += -p * log2(p)
    return entropy


def calculate_gini(y):
    unique_labels = np.unique(y)
    var = 0
    for label in unique_labels:
        count = len(y[y == label])
        p = count / len(y)
        var += p ** 2
    return 1 - var
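

# Illustrative sketch, not part of the original file: a small worked example of
# the two impurity measures defined above, using only the standard library.
# For a balanced two-class node the entropy is 1 bit and the Gini impurity is
# 0.5; for a pure node both are 0. The function name `entropy_and_gini` is an
# assumption made for this example.
import math
from collections import Counter


def entropy_and_gini(labels):
    n = len(labels)
    probs = [count / n for count in Counter(labels).values()]
    entropy = -sum(p * math.log2(p) for p in probs)   # H = -sum p log2 p
    gini = 1.0 - sum(p * p for p in probs)            # G = 1 - sum p^2
    return entropy, gini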


class ClassificationTree(DecisionTree):
    """
    Classification tree: decision nodes are chosen by information gain or Gini gain, and leaves use a majority vote.
    """
    def _calculate_gini_index(self, y, y1, y2):
        """
        Compute the Gini gain (decrease in Gini impurity) of a split.
        """
        p = len(y1) / len(y)
        gini = calculate_gini(y)
        gini_index = gini - p * \
            calculate_gini(y1) - (1 - p) * \
            calculate_gini(y2)
        return gini_index
    
    
    def _calculate_information_gain(self, y, y1, y2):
        """
        Compute the information gain of a split.
        """
        p = len(y1) / len(y)
        entropy = calculate_entropy(y)
        info_gain = entropy - p * \
            calculate_entropy(y1) - (1 - p) * \
            calculate_entropy(y2)
        return info_gain

    def _majority_vote(self, y):
        """
        Majority vote.
        """
        most_common = None
        max_count = 0
        for label in np.unique(y):
            count = len(y[y == label])
            if count > max_count:
                most_common = label
                max_count = count
        return most_common

    def fit(self, X, y):
        self._impurity_calculation = self._calculate_gini_index
        self._leaf_value_calculation = self._majority_vote
        super(ClassificationTree, self).fit(X, y)


def calculate_mse(y):
    return np.mean((y - np.mean(y)) ** 2)


def calculate_variance(y):
    n_samples = np.shape(y)[0]
    variance = (1 / n_samples) * np.diag((y - np.mean(y)).T.dot(y - np.mean(y)))
    return variance


class RegressionTree(DecisionTree):
    """
    回归树，在决策书节点选择计算MSE/方差降低，在叶子节点选择均值。
    """
    def _calculate_mse(self, y, y1, y2):
        """
        计算MSE降低
        """
        mse_tot = calculate_mse(y)
        mse_1 = calculate_mse(y1)
        mse_2 = calculate_mse(y2)
        frac_1 = len(y1) / len(y)
        frac_2 = len(y2) / len(y)
        mse_reduction = mse_tot - (frac_1 * mse_1 + frac_2 * mse_2)
        return mse_reduction
    
    def _calculate_variance_reduction(self, y, y1, y2):
        """
        Compute the reduction in variance produced by splitting y into y1 and y2.
        """
        var_tot = calculate_variance(y)
        var_1 = calculate_variance(y1)
        var_2 = calculate_variance(y2)
        frac_1 = len(y1) / len(y)
        frac_2 = len(y2) / len(y)
        variance_reduction = var_tot - (frac_1 * var_1 + frac_2 * var_2)
        return sum(variance_reduction)

    def _mean_of_y(self, y):
        """
        Compute the mean of the targets.
        """
        value = np.mean(y, axis=0)
        return value if len(value) > 1 else value[0]

    def fit(self, X, y):
        self._impurity_calculation = self._calculate_mse
        self._leaf_value_calculation = self._mean_of_y
        super(RegressionTree, self).fit(X, y)


########-----PCA------#########
class PCA():
    
    def __init__(self):
        pass
    
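    def fit(self, X, n_components):
        n_samples = np.shape(X)[0]
        X_centered = X - X.mean(axis=0)
        covariance_matrix = (1 / (n_samples - 1)) * X_centered.T.dot(X_centered)

        # Eigendecomposition of the covariance matrix; eigh is used because the
        # matrix is symmetric, which guarantees real eigenvalues.
        eigenvalues, eigenvectors = np.linalg.eigh(covariance_matrix)

        # Sort the eigenvalues (and their eigenvectors) in descending order
        idx = eigenvalues.argsort()[::-1]
        eigenvalues = eigenvalues[idx][:n_components]
        eigenvectors = np.atleast_1d(eigenvectors[:, idx])[:, :n_components]

        # Project onto the leading components to get the low-dimensional representation
        X_transformed = X.dot(eigenvectors)
        return X_transformed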


########-----KMeans------#########
def distEclud(x,y):
    """
    Compute the Euclidean distance between x and y.
    """
    return np.sqrt(np.sum((x-y)**2))  

def randomCent(dataSet,k):
    """
    Build a set of k random centroids by picking k samples from the data set.
    """
    m,n = dataSet.shape
    centroids = np.zeros((k,n))
    for i in range(k):
        index = int(np.random.uniform(0,m))
        centroids[i,:] = dataSet[index,:]
    return centroids

class KMeans():
    
    def __init__(self):
        self.dataSet = None
        self.k = None
        
    def fit(self, dataSet, k):
        self.dataSet = dataSet
        self.k = k
        m = np.shape(dataSet)[0]
        # Column 0: index of the cluster each sample belongs to
        # Column 1: squared distance from the sample to its cluster centroid
        clusterAssment = np.asmatrix(np.zeros((m,2)))
        clusterChange = True
        centroids = randomCent(self.dataSet,k)
        while clusterChange:
            clusterChange = False
            for i in range(m):
                minDist = np.inf
                minIndex = -1
                # Scan all centroids and find the nearest one
                for j in range(k):
                    distance = distEclud(centroids[j,:], self.dataSet[i,:])
                    if distance < minDist:
                        minDist = distance
                        minIndex = j
                # Record whether the sample moved to a different cluster
                if clusterAssment[i,0] != minIndex:
                    clusterChange = True
                # Always refresh the row so the stored error follows the moved centroids
                clusterAssment[i,:] = minIndex, minDist**2
            # Update the centroids
            for j in range(k):
                pointsInCluster = dataSet[np.nonzero(clusterAssment[:,0].A == j)[0]]  # all points assigned to cluster j
                if len(pointsInCluster) > 0:  # keep the old centroid if the cluster is empty
                    centroids[j,:] = np.mean(pointsInCluster,axis=0)   # column-wise mean over the points

        return centroids,clusterAssment


================================================
FILE: code/chapter6.py
================================================
from abc import ABC, abstractmethod
import numpy as np
import time
import re
import inspect
from collections import OrderedDict

import sys
sys.path.append('../')
from method.optimizer import OptimizerInitializer
from method.weight import WeightInitializer
from method.activation import ActivationInitializer


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


def softmax(x):
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / e_x.sum(axis=-1, keepdims=True)
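

# Illustrative sketch (not part of the original module): why softmax above
# subtracts the row maximum before exponentiating. Pure Python is used so the
# overflow is visible; the name _sketch_stable_softmax is hypothetical.
import math


def _sketch_stable_softmax(scores):
    shift = max(scores)  # after shifting, every exponent is <= 0, so exp() cannot overflow
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# math.exp(1000.0) raises OverflowError, yet _sketch_stable_softmax([1000.0, 1001.0])
# is well defined because only the differences between scores matter.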
    
    
class LayerBase(ABC):
    
    def __init__(self, optimizer="sgd"):
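        self.X = []   # inputs to the layer
        self.gradients = {}  # gradients of the layer parameters
        self.params = {}  # layer parameters
        self.derived_variables = {}  # intermediate values, reset by flush_gradients()
        self.acti_fn = None   # activation function of the layer
        self.optimizer = OptimizerInitializer(optimizer)()  # optimizer used by the layer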

    @abstractmethod
    def _init_params(self, **kwargs):
        """
        Initialize the layer parameters.
        """
        raise NotImplementedError

    @abstractmethod
    def forward(self, X, **kwargs):
        """
        Forward pass.
        """
        raise NotImplementedError

    @abstractmethod
    def backward(self, out, **kwargs):
        """
        Backward pass.
        """
        raise NotImplementedError

    def flush_gradients(self):
        """
        Reset the stored inputs, accumulated gradients, and derived variables.
        """
        self.X = []
        for k, v in self.gradients.items():
            self.gradients[k] = np.zeros_like(v)

        for k, v in self.derived_variables.items():
            self.derived_variables[k] = []

    def update(self):
        """
        Update the parameters with the optimizer.
        """
        for k, v in self.gradients.items():
            if k in self.params:
                self.params[k] = self.optimizer(self.params[k], v, k)
    
    
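class FullyConnected(LayerBase):
    """
    Fully connected layer implementing a = g(x @ W + b). The forward pass takes x
    and returns a; the backward pass takes dL/da and returns dL/dx.
    """

    def __init__(self, n_out, acti_fn, init_w, optimizer=None):
        """
        Parameters:
        acti_fn: activation function, str
        init_w: weight initialization method, str
        n_out: output dimension of the layer, int
        optimizer: optimization method
        """
        super().__init__(optimizer)

        self.n_in = None  # input dimension of the layer, int
        self.n_out = n_out  # output dimension of the layer, int
        self.acti_fn = ActivationInitializer(acti_fn)()
        self.init_w = init_w
        self.init_weights = WeightInitializer(mode=init_w)
        self.is_initialized = False  # whether the parameters have been initialized, bool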
    
    def _init_params(self):
        b = np.zeros((1, self.n_out))
        W = self.init_weights((self.n_in, self.n_out))
        self.params = {"W": W, "b": b}
        self.gradients = {"W": np.zeros_like(W), "b": np.zeros_like(b)}
        self.derived_variables = {"Z": []}
        self.is_initialized = True
        
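    def forward(self, X, retain_derived=True):
        """
        Forward pass of the fully connected layer (see the backpropagation section for the derivation).

        Parameters:
        X: input array of shape (n_samples, n_in), float
        retain_derived: whether to keep intermediate values for reuse in the backward pass, bool
        """
        if not self.is_initialized:  # initialize the parameters on first use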
            self.n_in = X.shape[1]
            self._init_params()
            
        W = self.params["W"]
        b = self.params["b"]
        z = X @ W + b
        a = self.acti_fn.forward(z)
        
        if retain_derived:
            self.X.append(X)
            
        return a
    
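    def backward(self, dLda, retain_grads=True):
        """
        Backward pass of the fully connected layer (see the backpropagation section for the derivation).

        Parameters:
        dLda: gradient of the loss with respect to the layer output, shape (n_samples, n_out), float
        retain_grads: whether to accumulate the parameter gradients during this pass, bool
        """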
        if not isinstance(dLda, list):
            dLda = [dLda]
        
        dX = []
        X = self.X
        for da, x in zip(dLda, X):
            dx, dw, db = self._bwd(da, x)
            dX.append(dx)

            if retain_grads:
                self.gradients["W"] += dw
                self.gradients["b"] += db
        
        return dX[0] if len(X) == 1 else dX

    def _bwd(self, dLda, X):
        W = self.params["W"]
        b = self.params["b"]

        Z = X @ W + b
        dZ = dLda * self.acti_fn.grad(Z)

        dX = dZ @ W.T
        dW = X.T @ dZ
        db = dZ.sum(axis=0, keepdims=True)
        return dX, dW, db
    
    @property
    def hyperparams(self):
        return {
            "layer": "FullyConnected",
            "init_w": self.init_w,
            "n_in": self.n_in,
            "n_out": self.n_out,
            "acti_fn": str(self.acti_fn),
            "optimizer": {
                "hyperparams": self.optimizer.hyperparams,
            },
            "components": {
                k: v for k, v in self.params.items()
            }
        }
    
    
class ObjectiveBase(ABC):
    
    def __init__(self):
        super().__init__()

    @abstractmethod
    def loss(self, y_true, y_pred):
        """
        Compute the loss.
        """
        raise NotImplementedError

    @abstractmethod
    def grad(self, y_true, y_pred, **kwargs):
        """
        Compute the gradient of the cost function.
        """
        raise NotImplementedError


class SquaredError(ObjectiveBase):
    """
    Quadratic (squared-error) cost function.
    """
    def __init__(self):
        super().__init__()

    def __call__(self, y_true, y_pred):
        return self.loss(y_true, y_pred)

    def __str__(self):
        return "SquaredError"

    @staticmethod
    def loss(y_true, y_pred):
        """
        Parameters:
        y_true: true values of the n training samples, array of shape (n, m)
        y_pred: predicted values of the n training samples, array of shape (n, m)
        """
        (n, _) = y_true.shape
        return 0.5 * np.linalg.norm(y_pred - y_true) ** 2 / n

    @staticmethod
    def grad(y_true, y_pred, z, acti_fn):
        (n, _) = y_true.shape
        return (y_pred - y_true) * acti_fn.grad(z) / n


class CrossEntropy(ObjectiveBase):
    """
    Cross-entropy cost function.
    """
    def __init__(self):
        super().__init__()

    def __call__(self, y_true, y_pred):
        return self.loss(y_true, y_pred)

    def __str__(self):
        return "CrossEntropy"

    @staticmethod
    def loss(y_true, y_pred):
        """
        Parameters:
        y_true: true labels of the n training samples, binary array of shape (n, m) (each row is one-hot encoded)
        y_pred: predicted probabilities for the n training samples, shape (n, m)
        """
        (n, _) = y_true.shape
        eps = np.finfo(float).eps  # avoid np.log(0)
        cross_entropy = -np.sum(y_true * np.log(y_pred + eps)) / n
        return cross_entropy
    
    @staticmethod
    def grad(y_true, y_pred):
        (n, _) = y_true.shape
        grad = (y_pred - y_true) / n
        return grad
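

# Illustrative sketch (not part of the original module): a finite-difference check
# of why CrossEntropy.grad above can return (y_pred - y_true) without multiplying
# by an activation derivative. For a single sample with p = softmax(z), the
# gradient of -sum(y * log(p)) with respect to z is p - y. All names below are
# hypothetical helpers written in pure Python.
import math


def _sketch_softmax_for_grad(z):
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def _sketch_ce_loss(z, y_onehot):
    p = _sketch_softmax_for_grad(z)
    return -sum(t * math.log(q) for t, q in zip(y_onehot, p))


def _sketch_grad_gap(z, y_onehot, h=1e-6):
    # Largest gap between the analytic gradient p - y and a central-difference estimate
    p = _sketch_softmax_for_grad(z)
    gap = 0.0
    for i in range(len(z)):
        z_plus = list(z)
        z_plus[i] += h
        z_minus = list(z)
        z_minus[i] -= h
        numeric = (_sketch_ce_loss(z_plus, y_onehot) - _sketch_ce_loss(z_minus, y_onehot)) / (2 * h)
        gap = max(gap, abs(numeric - (p[i] - y_onehot[i])))
    return gap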
    
    
def minibatch(X, batchsize=256, shuffle=True):
    """
    Split the data set into batches of indices for mini-batch training.
    """
    N = X.shape[0]
    idx = np.arange(N)
    n_batches = int(np.ceil(N / batchsize))

    if shuffle:
        np.random.shuffle(idx)

    def mb_generator():
        for i in range(n_batches):
            yield idx[i * batchsize : (i + 1) * batchsize]

    return mb_generator(), n_batches
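

# Illustrative sketch (not part of the original module): how the slicing in
# minibatch above partitions indices, shown without shuffling. The name
# _sketch_batches is hypothetical. The last batch is shorter whenever the sample
# count is not a multiple of the batch size.
def _sketch_batches(n_samples, batchsize):
    indices = list(range(n_samples))
    n_batches = -(-n_samples // batchsize)  # ceiling division
    return [indices[i * batchsize:(i + 1) * batchsize] for i in range(n_batches)]


# With 10 samples and a batch size of 4, the batches have 4, 4 and 2 indices.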


class DFN(object):
    
    def __init__(
        self,
        hidden_dims_1=None,
        hidden_dims_2=None,
        optimizer="sgd(lr=0.01)",
        init_w="std_normal",
        loss=CrossEntropy()
    ):
        self.optimizer = optimizer
        self.init_w = init_w
        self.loss = loss
        self.hidden_dims_1 = hidden_dims_1
        self.hidden_dims_2 = hidden_dims_2
        self.is_initialized = False
    
    def _set_params(self):
        """
        Build the model layers.
        FC1 -> Sigmoid -> FC2 -> Softmax
        """
        self.layers = OrderedDict()
        self.layers["FC1"] = FullyConnected(
            n_out=self.hidden_dims_1,
            acti_fn="sigmoid", 
            init_w=self.init_w,
            optimizer=self.optimizer
        )
        self.layers["FC2"] = FullyConnected(
            n_out=self.hidden_dims_2,
            acti_fn="affine(slope=1, intercept=0)",
            init_w=self.init_w,
            optimizer=self.optimizer
        )
        self.is_initialized = True
    
    def forward(self, X_train):
        Xs = {}
        out = X_train
        for k, v in self.layers.items():
            Xs[k] = out
            out = v.forward(out)
        return out, Xs
    
    def backward(self, grad):
        dXs = {}
        out = grad
        for k, v in reversed(list(self.layers.items())):
            dXs[k] = out
            out = v.backward(out)
        return out, dXs
    
    def update(self):
        """
        Apply the gradient update to every layer.
        """
        for k, v in reversed(list(self.layers.items())):
            v.update()
        self.flush_gradients()

    def flush_gradients(self, curr_loss=None):
        """
        Reset the gradients after an update.
        """
        for k, v in self.layers.items():
            v.flush_gradients()
    
    def fit(self, X_train, y_train, n_epochs=20, batch_size=64, verbose=False, epo_verbose=True):
        """
        Parameters:
        X_train: training data
        y_train: training labels
        n_epochs: number of epochs
        batch_size: mini-batch size used in each epoch
        verbose: whether to print the loss after every batch
        epo_verbose: whether to print the loss after every epoch
        """
        self.verbose = verbose
        self.n_epochs = n_epochs
        self.batch_size = batch_size
        
        if not self.is_initialized:
            self.n_features = X_train.shape[1]
            self._set_params()
        
        prev_loss = np.inf
        for i in range(n_epochs):
            loss, epoch_start = 0.0, time.time()
            batch_generator, n_batch = minibatch(X_train, self.batch_size, shuffle=True)

            for j, batch_idx in enumerate(batch_generator):
                batch_len, batch_start = len(batch_idx), time.time()
                X_batch, y_batch = X_train[batch_idx], y_train[batch_idx]
                out, _ = self.forward(X_batch)
                y_pred_batch = softmax(out)
                batch_loss = self.loss(y_batch, y_pred_batch)
                grad = self.loss.grad(y_batch, y_pred_batch)
                _, _ = self.backward(grad)
                self.update()
                loss += batch_loss

                if self.verbose:
                    fstr = "\t[Batch {}/{}] Train loss: {:.3f} ({:.1f}s/batch)"
                    print(fstr.format(j + 1, n_batch, batch_loss, time.time() - batch_start))

            loss /= n_batch
            if epo_verbose:
                fstr = "[Epoch {}] Avg. loss: {:.3f}  Delta: {:.3f} ({:.2f}m/epoch)"
                print(fstr.format(i + 1, loss, prev_loss - loss, (time.time() - epoch_start) / 60.0))
            prev_loss = loss
            
    def evaluate(self, X_test, y_test, batch_size=128):
        acc = 0.0
        batch_generator, n_batch = minibatch(X_test, batch_size, shuffle=True)
        for j, batch_idx in enumerate(batch_generator):
            batch_len, batch_start = len(batch_idx), time.time()
            X_batch, y_batch = X_test[batch_idx], y_test[batch_idx]
            y_pred_batch, _ = self.forward(X_batch)
            y_pred_batch = np.argmax(y_pred_batch, axis=1)
            y_batch = np.argmax(y_batch, axis=1)
            acc += np.sum(y_pred_batch == y_batch)
        return acc / X_test.shape[0]
    
    @property
    def hyperparams(self):
        return {
            "init_w": self.init_w,
            "loss": str(self.loss),
            "optimizer": self.optimizer,
            "hidden_dims_1": self.hidden_dims_1,
            "hidden_dims_2": self.hidden_dims_2,
            "components": {k: v.params for k, v in self.layers.items()}
        }
    

================================================
FILE: code/chapter7.py
================================================
from abc import ABC, abstractmethod
import numpy as np
import math
import re
import progressbar
from chapter5 import RegressionTree, DecisionTree, ClassificationTree
from chapter6 import DFN  # DFN is used by BaggingModel below

#########---Regularizer---######
class RegularizerBase(ABC):
    
    def __init__(self, **kwargs):
        super().__init__()
    
    @abstractmethod
    def loss(self, **kwargs):
        raise NotImplementedError
    
    @abstractmethod
    def grad(self, **kwargs):
        raise NotImplementedError

class L1Regularizer(RegularizerBase):
    
    def __init__(self, lambd=0.001):
        super().__init__()
        self.lambd = lambd
    
    def loss(self, params):
        loss = 0
        pattern = re.compile(r'^W\d+')
        for key, val in params.items():
            if pattern.match(key):
                # L1 penalty lambd * sum(|W|); no 0.5 factor, so it matches grad() = lambd * sign(W)
                loss += np.sum(np.abs(val)) * self.lambd
        return loss
    
    def grad(self, params):
        for key, val in params.items():
            grad = self.lambd * np.sign(val)
        return grad
    
class L2Regularizer(RegularizerBase):
    
    def __init__(self, lambd=0.001):
        super().__init__()
        self.lambd = lambd
        
    def loss(self, params):
        loss = 0
        for key, val in params.items():
            loss +=  0.5 * np.sum(np.square(val)) * self.lambd
        return loss
    
    def grad(self, params):
        for key, val in params.items():
            grad = self.lambd * val
        return grad
    
class RegularizerInitializer(object):
    
    def __init__(self, regular_name="l2"):
        self.regular_name = regular_name
    
    def __call__(self):
        r = r"([a-zA-Z]*)=([^,)]*)"
        regular_str = self.regular_name.lower()
        kwargs = dict([(i, eval(j)) for (i, j) in re.findall(r, regular_str)])
        if  "l1" in regular_str.lower():
            regular = L1Regularizer(**kwargs)
        elif "l2" in regular_str.lower():
            regular = L2Regularizer(**kwargs)
        else:
            raise ValueError("Unrecognized regular: {}".format(regular_str))
        return regular
    

#######----Dataset Augmentation----####
class Image(object):
    
    def __init__(self, image):
        self._set_params(image)
        
    def _set_params(self, image):
        self.img = image 
        self.row = image.shape[0] # image height
        self.col = image.shape[1] # image width
        self.transform = None

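    def Translation(self, delta_x, delta_y):
        """
        Translation.

        Parameters:
        delta_x: shift along the row axis; operate() reads the source pixel at row i + delta_x,
                 so a positive value moves the image up and a negative value moves it down
        delta_y: shift along the column axis; a positive value moves the image left and a
                 negative value moves it right
        """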
        self.transform = np.array([[1, 0, delta_x], 
                                   [0, 1, delta_y], 
                                   [0,  0,  1]])

    def Resize(self, alpha):
        """
        Scaling.

        Parameters:
        alpha: scaling factor; set to 1 for no scaling
        """
        self.transform = np.array([[alpha, 0, 0], 
                                   [0, alpha, 0], 
                                   [0,  0,  1]])

    def HorMirror(self): 
        """
        Horizontal mirror.
        """
        self.transform = np.array([[1,  0,  0], 
                                   [0, -1, self.col-1], 
                                   [0,  0,  1]])

    def VerMirror(self): 
        """
        Vertical mirror.
        """
        self.transform = np.array([[-1, 0, self.row-1], 
                                   [0,  1,  0], 
                                   [0,  0,  1]])

    def Rotate(self, angle): 
        """
        Rotation.

        Parameters:
        angle: rotation angle in radians (passed directly to math.cos and math.sin)
        """
        self.transform = np.array([[math.cos(angle),-math.sin(angle),0],
                                   [math.sin(angle), math.cos(angle),0],
                                   [    0,              0,         1]])        

    def operate(self):
        temp = np.zeros(self.img.shape, dtype=self.img.dtype)
        for i in range(self.row):
            for j in range(self.col):
                temp_pos = np.array([i, j, 1])
                [x,y,z] = np.dot(self.transform, temp_pos)
                x = int(x)
                y = int(y)

                if x>=self.row or y>=self.col or x<0 or y<0:
                    temp[i,j,:] = 0
                else:
                    temp[i,j,:] = self.img[x,y]
        return temp
    
    def __call__(self, act):
        r = r"([a-zA-Z]*)=([^,)]*)"
        act_str = act.lower()
        kwargs = dict([(i, eval(j)) for (i, j) in re.findall(r, act_str)])
        if "translation" in act_str:
            self.Translation(**kwargs)
        elif "resize" in act_str:
            self.Resize(**kwargs)
        elif "hormirror" in act_str:
            self.HorMirror(**kwargs)
        elif "vermirror" in act_str:
            self.VerMirror(**kwargs)
        elif "rotate" in act_str:
            self.Rotate(**kwargs)
        return self.operate()

    
#######----Early Stopping----####
def early_stopping(valid):
    """
    Parameters:
    valid: list of validation accuracies, one per evaluation
    """
    if len(valid) > 5:
        if valid[-1] < valid[-5] and valid[-2] < valid[-5] and valid[-3] < valid[-5] and valid[-4] < valid[-5]:
            return True
    return False
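

# Illustrative sketch (not part of the original module): the rule in early_stopping
# above, generalized with a hypothetical `patience` parameter. With patience=4 it
# makes the same comparison: stop once each of the last four accuracies is below
# the one recorded just before them. The name _sketch_should_stop is hypothetical.
def _sketch_should_stop(valid_acc, patience=4):
    if len(valid_acc) <= patience + 1:
        return False
    reference = valid_acc[-(patience + 1)]
    return all(acc < reference for acc in valid_acc[-patience:])


# An accuracy history that peaks at 0.9 and then stays lower for four evaluations,
# such as [0.5, 0.6, 0.9, 0.8, 0.85, 0.7, 0.6], triggers a stop.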


#####---Bagging--#####
def bootstrap_sample(X, Y):
    N, M = X.shape
    idxs = np.random.choice(N, N, replace=True)
    return X[idxs], Y[idxs]
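

# Illustrative sketch (not part of the original module): what drawing N indices
# with replacement, as bootstrap_sample does above, implies for each base model.
# Some rows are drawn several times and others never; the rows never drawn are
# the "out-of-bag" rows. The name _sketch_bootstrap_indices and the fixed seed
# are assumptions made for this example.
import random


def _sketch_bootstrap_indices(n, seed=0):
    rng = random.Random(seed)
    drawn = [rng.randrange(n) for _ in range(n)]      # n draws with replacement
    out_of_bag = sorted(set(range(n)) - set(drawn))   # rows that were never drawn
    return drawn, out_of_bag


# For large n, about 1 - 1/e (roughly 63%) of the rows are expected to appear in a sample.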

class BaggingModel(object):

    def __init__(self, n_models):
        """
        Parameters:
        n_models: number of network models in the ensemble
        """
        self.models = []
        self.n_models = n_models

    def fit(self, X, Y):
        self.models = []
        for i in range(self.n_models):
            print("training {} base model:".format(i))
            X_samp, Y_samp = bootstrap_sample(X, Y)
            model = DFN(hidden_dims_1=200, hidden_dims_2=10)
            model.fit(X_samp, Y_samp)
            self.models.append(model)

    def predict(self, X):
        model_preds = np.array([[np.argmax(t.forward(x)[0]) for x in X] for t in self.models])
        return self._vote(model_preds)

    def _vote(self, predictions):
        out = [np.bincount(x).argmax() for x in predictions.T]
        return np.array(out)
    
    def evaluate(self, X_test, y_test):
        acc = 0.0
        y_pred = self.predict(X_test)
        y_true = np.argmax(y_test, axis=1)
        acc += np.sum(y_pred == y_true)
        return acc / X_test.shape[0]
    

#####----Dropout----#######
class Dropout(ABC):
    
    def __init__(self, wrapped_layer, p):
        """
        Parameters:
        wrapped_layer: the layer that dropout is applied to
        p: probability of keeping each unit
        """
        super().__init__()
        self._base_layer = wrapped_layer
        self.p = p
        self._init_wrapper_params()
        
    def _init_wrapper_params(self):
        self._wrapper_derived_variables = {"dropout_mask": None}
        self._wrapper_hyperparams = {"wrapper": "Dropout", "p": self.p}
        
    def flush_gradients(self):
        """
        Reset the gradients by delegating to the base layer.
        """
        self._base_layer.flush_gradients()

    def update(self):
        """
        Update the parameters by delegating to the base layer.
        """
        self._base_layer.update()

    def forward(self, X, is_train=True):
        """
        Parameters:
        X: input array
        is_train: whether the layer is in training mode, bool
        """
        mask = np.ones(X.shape).astype(bool)
        if is_train:
            mask = (np.random.rand(*X.shape) < self.p) / self.p
            X = mask * X
        self._wrapper_derived_variables["dropout_mask"] = mask
        return self._base_layer.forward(X)
        
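    def backward(self, dLda):
        dX = self._base_layer.backward(dLda)
        mask = self._wrapper_derived_variables["dropout_mask"]
        # forward() fed mask * X to the base layer, so the gradient with respect to X is masked the same way
        if mask is not None and isinstance(dX, np.ndarray):
            dX = dX * mask
        return dX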
    
    @property
    def hyperparams(self):
        hp = self._base_layer.hyperparams
        hpw = self._wrapper_hyperparams
        if "wrappers" in hp:
            hp["wrappers"].append(hpw)
        else:
            hp["wrappers"] = [hpw]
        return hp
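

# Illustrative sketch (not part of the original module): the inverted-dropout
# scaling used in Dropout.forward above, in pure Python. Each unit is kept with
# probability keep_prob and the survivors are divided by keep_prob, so the
# expected activation is unchanged and no rescaling is needed at test time.
# The name _sketch_inverted_dropout and the fixed seed are assumptions.
import random


def _sketch_inverted_dropout(values, keep_prob, seed=0):
    rng = random.Random(seed)
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]


# With keep_prob = 0.8, every output is either 0.0 or the input times 1.25, and
# the mean over many units stays close to the input mean.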


#####----Bagging----#######
# Progress bar
bar_widgets = [
    'Training: ', progressbar.Percentage(), ' ', progressbar.Bar(marker="-", left="[", right="]"),
    ' ', progressbar.ETA()
]

def get_random_subsets(X, y, n_subsets, replacements=True):
    """Draw random subsets of the training data (with replacement by default)."""
    n_samples = np.shape(X)[0]
    # Concatenate X and y, then shuffle the rows
    Xy = np.concatenate((X, y.reshape((1, len(y))).T), axis=1)
    np.random.shuffle(Xy)
    subsets = []
    # Without replacement, each subset uses 50% of the training data;
    # with replacement (the default), each subset has as many samples as the training set.
    subsample_size = int(n_samples // 2)
    if replacements:
        subsample_size = n_samples
    for _ in range(n_subsets):
        idx = np.random.choice(
            range(n_samples),
            size=subsample_size,
            replace=replacements)
        X = Xy[idx][:, :-1]
        y = Xy[idx][:, -1]
        subsets.append([X, y])
    return subsets


class Bagging():
    """
    Bagging classifier. Uses an ensemble of classification trees, each trained on a
    random subset of the training data.
    """
    def __init__(self, n_estimators=100, max_features=None, min_samples_split=2,
                 min_gain=0, max_depth=float("inf")):
        self.n_estimators = n_estimators    # number of trees
        self.min_samples_split = min_samples_split   # minimum number of samples required to split
        self.min_gain = min_gain            # minimum impurity decrease (information gain) required to split
        self.max_depth = max_depth          # maximum depth of a tree
        self.progressbar = progressbar.ProgressBar(widgets=bar_widgets)

        # Initialize the decision trees
        self.trees = []
        for _ in range(n_estimators):
            self.trees.append(
                ClassificationTree(
                    min_samples_split=self.min_samples_split,
                    min_impurity=min_gain,
                    max_depth=self.max_depth))

    def fit(self, X, y):
        # Draw a random subset of the data set for each tree
        subsets = get_random_subsets(X, y, self.n_estimators)
        for i in self.progressbar(range(self.n_estimators)):
            X_subset, y_subset = subsets[i]
            # Train one tree on its random subset of the training data
            self.trees[i].fit(X_subset, y_subset)

    def predict(self, X):
        y_preds = np.empty((X.shape[0], len(self.trees)))
        # Every tree predicts on the data
        for i, tree in enumerate(self.trees):
            # Predict from the features
            prediction = tree.predict(X)
            y_preds[:, i] = prediction

        y_pred = []
        # For each sample, predict the most common class among the trees
        for sample_predictions in y_preds:
            y_pred.append(np.bincount(sample_predictions.astype('int')).argmax())
        return y_pred
    
    def score(self, X, y):
        y_pred = self.predict(X)
        accuracy = np.sum(y == y_pred, axis=0) / len(y)
        return accuracy

    
#####----RandomForest----#######
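class RandomForest():
    """
    Random forest classifier. Uses an ensemble of classification trees, each trained on a
    random subset of the training data restricted to a random subset of the features.
    """
    def __init__(self, n_estimators=100, max_features=None, min_samples_split=2,
                 min_gain=0, max_depth=float("inf")):
        self.n_estimators = n_estimators    # number of trees
        self.max_features = max_features    # maximum number of features used by each tree
        self.min_samples_split = min_samples_split   # minimum number of samples required to split
        self.min_gain = min_gain            # minimum impurity decrease (information gain) required to split
        self.max_depth = max_depth          # maximum depth of a tree
        self.progressbar = progressbar.ProgressBar(widgets=bar_widgets)

        # Initialize the decision trees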
        self.trees = []
        for _ in range(n_estimators):
            self.trees.append(
                ClassificationTree(
                    min_samples_split=self.min_samples_split,
                    min_impurity=min_gain,
                    max_depth=self.max_depth))

    def fit(self, X, y):
        n_features = np.shape(X)[1]
        # If max_features is not set, default to sqrt(n_features)
        if not self.max_features:
            self.max_features = int(math.sqrt(n_features))

        # Draw a random subset of the data set for each tree
        subsets = get_random_subsets(X, y, self.n_estimators)

        for i in self.progressbar(range(self.n_estimators)):
            X_subset, y_subset = subsets[i]
            # Choose a random subset of the features
            idx = np.random.choice(range(n_features), size=self.max_features, replace=True)
            # Store the feature indices for use at prediction time
            self.trees[i].feature_indices = idx
            # Keep only the selected features
            X_subset = X_subset[:, idx]
            # Train one tree on the feature subset and targets (the rows are also a random subset of the training data)
            self.trees[i].fit(X_subset, y_subset)

    def predict(self, X):
        y_preds = np.empty((X.shape[0], len(self.trees)))
        # Every tree predicts on the data
        for i, tree in enumerate(self.trees):
            # Use the features this tree was trained on
            idx = tree.feature_indices
            # Predict from those features
            prediction = tree.predict(X[:, idx])
            y_preds[:, i] = prediction

        y_pred = []
        # For each sample, predict the most common class among the trees
        for sample_predictions in y_preds:
            y_pred.append(np.bincount(sample_predictions.astype('int')).argmax())
        return y_pred
    
    def score(self, X, y):
        y_pred = self.predict(X)
        accuracy = np.sum(y == y_pred, axis=0) / len(y)
        return accuracy

    
#####----Adaboost----#######
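# Decision stump, used as the weak (base) classifier of Adaboost
class DecisionStump():

    def __init__(self):
        self.polarity = 1            # whether the stump's default output class is 1 or -1
        self.feature_index = None    # index of the feature used for classification
        self.threshold = None        # threshold on that feature
        self.alpha = None            # weight reflecting the classifier's accuracy

class Adaboost():
    """
    Adaboost algorithm.
    """
    def __init__(self, n_estimators=5):
        self.n_estimators = n_estimators    # number of weak classifiers to use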
        self.progressbar = progressbar.ProgressBar(widgets=bar_widgets)

    def fit(self, X, y):
        n_samples, n_features = np.shape(X)
        # Initialize the sample weights (D in the text) to 1/N
        w = np.full(n_samples, (1 / n_samples))
        self.trees = []
        # Boosting iterations
        for _ in self.progressbar(range(self.n_estimators)):
            tree = DecisionStump()
            min_error = float('inf')    # smallest weighted error found for any feature threshold
            # Iterate over every feature and its unique values to find the best threshold for predicting y
            for feature_i in range(n_features):
                feature_values = np.expand_dims(X[:, feature_i], axis=1)
                unique_values = np.unique(feature_values)
                # Try every unique value of the feature as the threshold
                for threshold in unique_values:
                    p = 1
                    # Predict 1 for all samples by default
                    prediction = np.ones(np.shape(y))
                    # Samples below the threshold are predicted as -1
                    prediction[X[:, feature_i] < threshold] = -1
                    # Weighted error rate
                    error = sum(w[y != prediction])
                    # If the error exceeds 50%, flip the stump's polarity,
                    # e.g. error = 0.8 => (1 - error) = 0.2.
                    # Before the flip, class 1 is the default output; after it, class -1 is the default.
                    if error > 0.5:
                        error = 1 - error
                        p = -1
                    # Keep this threshold if it gives the smallest error so far
                    if error < min_error:
                        tree.polarity = p
                        tree.threshold = threshold
                        tree.feature_index = feature_i
                        min_error = error

            # alpha is used to update the sample weights and is also the coefficient of this base classifier
            tree.alpha = 0.5 * math.log((1.0 - min_error) / (min_error + 1e-10))
            # Predict 1 for all samples by default
            predictions = np.ones(np.shape(y))
            # Samples below the threshold are predicted as -1, taking the stump's polarity into account
            negative_idx = (tree.polarity * X[:, tree.feature_index] < tree.polarity * tree.threshold)
            predictions[negative_idx] = -1
            # New weights: misclassified samples gain weight, correctly classified samples lose weight
            w *= np.exp(-tree.alpha * y * predictions)
            w /= np.sum(w)
            # Store the classifier
            self.trees.append(tree)

    def predict(self, X):
        n_samples = np.shape(X)[0]
        y_pred = np.zeros((n_samples, 1))
        # Predict the samples with every base classifier
        for tree in self.trees:
            # Predict 1 for all samples by default
            predictions = np.ones(np.shape(y_pred))
            negative_idx = (tree.polarity * X[:, tree.feature_index] < tree.polarity * tree.threshold)
            predictions[negative_idx] = -1
            # Weighted sum of the base classifiers, with weights alpha
            y_pred += tree.alpha * predictions
        # Return the predictions as 1 or -1
        y_pred = np.sign(y_pred).flatten()
        return y_pred
    
    def score(self, X, y):
        y_pred = self.predict(X)
        accuracy = np.sum(y == y_pred, axis=0) / len(y)
        return accuracy
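

# Illustrative sketch (not part of the original module): one round of the weight
# update in Adaboost.fit above, on four samples with labels in {-1, 1}. Unlike the
# class above, this sketch omits the 1e-10 guard and therefore assumes
# 0 < error < 1. The name _sketch_adaboost_round is hypothetical.
import math


def _sketch_adaboost_round(weights, y_true, y_pred):
    # Weighted error of the weak learner
    error = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    alpha = 0.5 * math.log((1.0 - error) / error)
    # t * p is -1 for a misclassified sample (weight grows) and 1 otherwise (weight shrinks)
    new_w = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(new_w)
    return alpha, [w / total for w in new_w]


# With uniform weights 0.25 and one mistake out of four, the error is 0.25 and
# alpha = 0.5 * ln(3). After normalization the misclassified sample carries weight
# 0.5 and the three correct samples carry 1/6 each.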

    
#####----GBDT----#######
class Loss(ABC):

    def __init__(self):
        super().__init__()

    @abstractmethod    
    def loss(self, y_true, y_pred):
        raise NotImplementedError()

    @abstractmethod    
    def grad(self, y, y_pred):
        raise NotImplementedError()

class SquareLoss(Loss):
    
    def __init__(self): 
        pass

    def loss(self, y, y_pred):
        pass

    def grad(self, y, y_pred):
        return -(y - y_pred)
    
    def hess(self, y, y_pred):
        return 1

class CrossEntropyLoss(Loss):
    
    def __init__(self): 
        pass

    def loss(self, y, y_pred):
        pass

    def grad(self, y, y_pred):
        return - (y - y_pred)  
    
    def hess(self, y, y_pred):
        return y_pred * (1-y_pred)


def softmax(x):
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / e_x.sum(axis=-1, keepdims=True)


def line_search(self, y, y_pred, h_pred):
    Lp = 2 * np.sum((y - y_pred) * h_pred)
    Lpp = np.sum(h_pred * h_pred)
    return 1 if np.sum(Lpp) == 0 else Lp / Lpp


def to_categorical(x, n_classes=None):
    """
    One-hot encoding.
    """
    if not n_classes:
        n_classes = np.amax(x) + 1
    one_hot = np.zeros((x.shape[0], n_classes))
    one_hot[np.arange(x.shape[0]), x] = 1
    return one_hot


class GradientBoostingDecisionTree(object):
    """
    GBDT. Fits a sequence of base learners (regression trees) to the gradient of the loss function.
    """
    def __init__(self, n_estimators, learning_rate=1, min_samples_split=2,
                 min_impurity=1e-7, max_depth=float("inf"), is_regression=False, line_search=False):
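        self.n_estimators = n_estimators         # number of boosting iterations
        self.learning_rate = learning_rate       # step size along the negative gradient (learning rate)
        self.min_samples_split = min_samples_split    # minimum number of samples required to split a node
        self.min_impurity = min_impurity         # minimum impurity required to split
        self.max_depth = max_depth               # maximum depth of a tree
        self.is_regression = is_regression       # regression (True) or classification (False)
        self.line_search = line_search           # whether to use line search
        self.progressbar = progressbar.ProgressBar(widgets=bar_widgets)
        # Regression uses the squared loss; classification uses the cross-entropy loss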
        self.loss = SquareLoss()
        if not self.is_regression:
            self.loss = CrossEntropyLoss()

    def fit(self, X, Y):
        # For classification, convert Y to a one-hot encoding
        if not self.is_regression:
            Y = to_categorical(Y.flatten())
        else:
            Y = Y.reshape(-1, 1) if len(Y.shape) == 1 else Y
        self.out_dims = Y.shape[1]
        self.trees = np.empty((self.n_estimators, self.out_dims), dtype=object)
        # Start from zero so that training matches predict(), which sums the tree outputs only
        Y_pred = np.zeros(np.shape(Y))
        self.weights = np.ones((self.n_estimators, self.out_dims))
        self.weights[1:, :] *= self.learning_rate
        # Boosting iterations
        for i in self.progressbar(range(self.n_estimators)):
            for c in range(self.out_dims):
                tree = RegressionTree(
                        min_samples_split=self.min_samples_split,
                        min_impurity=self.min_impurity,
                        max_depth=self.max_depth)
                # Compute the gradient of the loss and fit the tree to the negative gradient
                if not self.is_regression:   
                    Y_hat = softmax(Y_pred)
                    y, y_pred = Y[:, c], Y_hat[:, c]
                else:
                    y, y_pred = Y[:, c], Y_pred[:, c]
                neg_grad = -1 * self.loss.grad(y, y_pred)
                tree.fit(X, neg_grad)
                # Predict with the newly fitted base learner
                h_pred = tree.predict(X)
                # line search
                if self.line_search:
                    self.weights[i, c] *= line_search(y, y_pred, h_pred)
                # Add the weighted prediction of the new learner to the additive model
                Y_pred[:, c] += np.multiply(self.weights[i, c], h_pred)
                self.trees[i, c] = tree
    
    def predict(self, X):
        Y_pred = np.zeros((X.shape[0], self.out_dims))
        # Generate predictions
        for c in range(self.out_dims):
            # Accumulate the weighted tree outputs, starting from zero
            y_pred = np.zeros(X.shape[0])
            for i in range(self.n_estimators):
                y_pred += np.multiply(self.weights[i, c], self.trees[i, c].predict(X))
            Y_pred[:, c] = y_pred
        if not self.is_regression:
            # For classification, return the most likely class
            Y_pred = Y_pred.argmax(axis=1)
        return Y_pred
    
    def score(self, X, y):
        y_pred = self.predict(X)
        accuracy = np.sum(y == y_pred, axis=0) / len(y)
        return accuracy


class GradientBoostingRegressor(GradientBoostingDecisionTree):
    
    def __init__(self, n_estimators=200, learning_rate=1, min_samples_split=2,
                 min_impurity=1e-7, max_depth=float("inf"), is_regression=True, line_search=False):
        super(GradientBoostingRegressor, self).__init__(n_estimators=n_estimators, 
            learning_rate=learning_rate, 
            min_samples_split=min_samples_split, 
            min_impurity=min_impurity,
            max_depth=max_depth,
            is_regression=is_regression,
            line_search=line_search)


class GradientBoostingClassifier(GradientBoostingDecisionTree):
    
    def __init__(self, n_estimators=200, learning_rate=1, min_samples_split=2,
                 min_impurity=1e-7, max_depth=float("inf"), is_regression=False, line_search=False):
        super(GradientBoostingClassifier, self).__init__(n_estimators=n_estimators, 
            learning_rate=learning_rate, 
            min_samples_split=min_samples_split, 
            min_impurity=min_impurity,
            max_depth=max_depth,
            is_regression=is_regression,
            line_search=line_search)

        
#####----XGBoost----#######
class XGBoostRegressionTree(DecisionTree):
    """
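    XGBoost regression tree. It builds on the decision tree from Chapter 5, so split points are
    found greedily (by enumerating every candidate split point on each feature).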
    """
    def __init__(self, min_samples_split=2, min_impurity=1e-7,
                 max_depth=float("inf"), loss=None, gamma=0., lambd=0.):
        super(XGBoostRegressionTree, self).__init__(min_impurity=min_impurity, 
            min_samples_split=min_samples_split, 
            max_depth=max_depth)
        self.gamma = gamma   # penalty coefficient on the number of leaves
        self.lambd = lambd   # penalty coefficient on the leaf weights
        self.loss = loss     # loss function
    
    def _split(self, y):
        # y holds y_true in the left half of its columns and y_pred in the right half
        col = int(np.shape(y)[1]/2)
        y, y_pred = y[:, :col], y[:, col:]
        return y, y_pred

    def _gain(self, y, y_pred):
        # Structure score of a node: G^2 / (H + lambda), with G and H the summed gradients and Hessians
        numerator = np.power(self.loss.grad(y, y_pred).sum(), 2)
        denominator = self.loss.hess(y, y_pred).sum()
        return numerator / (denominator + self.lambd)

    def _gain_by_taylor(self, y, y1, y2):
        # Separate y_true and y_pred for the parent node and for the left and right children
        y, y_pred = self._split(y)
        y1, y1_pred = self._split(y1)
        y2, y2_pred = self._split(y2)
        true_gain = self._gain(y1, y1_pred)
        false_gain = self._gain(y2, y2_pred)
        gain = self._gain(y, y_pred)
        # Compute the gain of the split
        return 0.5 * (true_gain + false_gain - gain) - self.gamma

    def _approximate_update(self, y):
        y, y_pred = self._split(y)
        # Compute the leaf weight
        gradient = self.loss.grad(y, y_pred).sum()
        hessian = self.loss.hess(y, y_pred).sum()
        leaf_approximation = -gradient / (hessian + self.lambd)
        return leaf_approximation

    def fit(self, X, y):
        self._impurity_calculation = self._gain_by_taylor
        self._leaf_value_calculation = self._approximate_update
        super(XGBoostRegressionTree, self).fit(X, y)
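

# ---- Illustrative sketch (not part of the original implementation) ----
# A small standalone version of the two second-order quantities used by the
# tree: the optimal leaf weight -G / (H + lambda) and the split gain
# 0.5 * (G_L^2 / (H_L + lambda) + G_R^2 / (H_R + lambda) - G^2 / (H + lambda)) - gamma,
# where G and H are sums of per-sample gradients and Hessians. The function
# names and the boolean `left_mask` argument are assumptions made for this example.
import numpy as np


def xgb_leaf_weight_sketch(grad, hess, lambd=0.0):
    return -np.sum(grad) / (np.sum(hess) + lambd)


def xgb_split_gain_sketch(grad, hess, left_mask, lambd=0.0, gamma=0.0):
    def score(g, h):
        # Structure score of one node
        return np.sum(g) ** 2 / (np.sum(h) + lambd)

    left = score(grad[left_mask], hess[left_mask])
    right = score(grad[~left_mask], hess[~left_mask])
    parent = score(grad, hess)
    return 0.5 * (left + right - parent) - gamma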


class XGBoost(object):
    """
    XGBoost learner.
    """
    def __init__(self, n_estimators=200, learning_rate=0.001, min_samples_split=2,
                 min_impurity=1e-7, max_depth=2, is_regression=False, gamma=0., lambd=0.):
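        self.n_estimators = n_estimators            # number of trees
        self.learning_rate = learning_rate          # step size along the negative gradient (learning rate)
        self.min_samples_split = min_samples_split  # minimum number of samples required to split a node
        self.min_impurity = min_impurity            # minimum impurity required to split
        self.max_depth = max_depth                  # maximum depth of a tree
        self.gamma = gamma                          # penalty coefficient on the number of leaves
        self.lambd = lambd                          # penalty coefficient on the leaf weights
        self.is_regression = is_regression          # regression (True) or classification (False)
        self.progressbar = progressbar.ProgressBar(widgets=bar_widgets)
        # Regression uses the squared loss; classification uses the cross-entropy loss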
        self.loss = SquareLoss()
        if not self.is_regression:
            self.loss = CrossEntropyLoss()

    def fit(self, X, Y):
        # For classification, convert Y to a one-hot encoding
        if not self.is_regression:
            Y = to_categorical(Y.flatten())
        else:
            Y = Y.reshape(-1, 1) if len(Y.shape) == 1 else Y
        self.out_dims = Y.shape[1]
        self.trees = np.empty((self.n_estimators, self.out_dims), dtype=object)
        Y_pred = np.zeros(np.shape(Y))
        self.weights = np.ones((self.n_estimators, self.out_dims))
        self.weights[1:, :] *= self.learning_rate
        # Boosting iterations
        for i in self.progressbar(range(self.n_estimators)):
            for c in range(self.out_dims):
                tree = XGBoostRegressionTree(
                        min_samples_split=self.min_samples_split,
                        min_impurity=self.min_impurity,
                        max_depth=self.max_depth,
                        loss=self.loss,
                        gamma=self.gamma,
                        lambd=self.lambd)
                # Compute the gradient of the loss and train on it
                if not self.is_regression:   
                    Y_hat = softmax(Y_pred)
                    y, y_pred = Y[:, c], Y_hat[:, c]
                else:
                    y, y_pred = Y[:, c], Y_pred[:, c]

                y, y_pred = y.reshape(-1, 1), y_pred.reshape(-1, 1)
                y_and_ypred = np.concatenate((y, y_pred), axis=1)
                tree.fit(X, y_and_ypred)
                # Predict with the newly fitted base learner
                h_pred = tree.predict(X)
                # Add the weighted prediction of the new learner to the additive model
                Y_pred[:, c] += np.multiply(self.weights[i, c], h_pred)
                self.trees[i, c] = tree

    def predict(self, X):
        Y_pred = np.zeros((X.shape[0], self.out_dims))
        # Generate predictions
        for c in range(self.out_dims):
            # Accumulate the weighted tree outputs, starting from zero
            y_pred = np.zeros(X.shape[0])
            for i in range(self.n_estimators):
                y_pred += np.multiply(self.weights[i, c], self.trees[i, c].predict(X))
            Y_pred[:, c] = y_pred
        if not self.is_regression:
            # For classification, return the most likely class
            Y_pred = Y_pred.argmax(axis=1)
        return Y_pred
    
    def score(self, X, y):
        y_pred = self.predict(X)
        accuracy = np.sum(y == y_pred, axis=0) / len(y)
        return accuracy
    
    
class XGBRegressor(XGBoost):
    
    def __init__(self, n_estimators=200, learning_rate=1, min_samples_split=2,
                 min_impurity=1e-7, max_depth=float("inf"), is_regression=True,
                 gamma=0., lambd=0.):
        super(XGBRegressor, self).__init__(n_estimators=n_estimators, 
            learning_rate=learning_rate, 
            min_samples_split=min_samples_split, 
            min_impurity=min_impurity,
            max_depth=max_depth,
            is_regression=is_regression,
            gamma=gamma,
            lambd=lambd)


class XGBClassifier(XGBoost):
    
    def __init__(self, n_estimators=200, learning_rate=1, min_samples_split=2,
                 min_impurity=1e-7, max_depth=float("inf"), is_regression=False,
                 gamma=0., lambd=0.):
        super(XGBClassifier, self).__init__(n_estimators=n_estimators, 
            learning_rate=learning_rate, 
            min_samples_split=min_samples_split, 
            min_impurity=min_impurity,
            max_depth=max_depth,
            is_regression=is_regression,
            gamma=gamma,
            lambd=lambd)        


================================================
FILE: code/chapter8.py
================================================
from chapter6 import LayerBase
import numpy as np

######### Optimizers: see method/optimizer #######


######## Parameter initialization: see method/weight #####


######## BatchNorm1D #####
class BatchNorm1D(LayerBase):

    def __init__(self, momentum=0.9, epsilon=1e-5, optimizer=None):
        """
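        Parameters:
        momentum: momentum term, float (default: 0.9). The closer it is to 1, the less the layer
                    depends on the current batch and the smoother running_mean and running_var become

        epsilon: small constant that avoids division by zero, float (default: 1e-5)
        optimizer: the optimizer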
        """
        super().__init__(optimizer)

        self.n_in = None
        self.n_out = None
        self.epsilon = epsilon
        self.momentum = momentum
        self.params = {
            "scaler": None,
            "intercept": None,
            "running_var": None,
            "running_mean": None,
        }
        self.is_initialized = False

    def _init_params(self):
        scaler = np.random.rand(self.n_in)
        intercept = np.zeros(self.n_in)
        running_mean = np.zeros(self.n_in)
        running_var = np.ones(self.n_in)
        
        self.params = {
            "scaler": scaler,
            "intercept": intercept,
            "running_mean": running_mean,
            "running_var": running_var,
        }
        self.gradients = {
            "scaler": np.zeros_like(scaler),
            "intercept": np.zeros_like(intercept),
        }
        self.is_initialized = True

    def reset_running_stats(self):
        self.params["running_mean"] = np.zeros(self.n_in)
        self.params["running_var"] = np.ones(self.n_in)

    def forward(self, X, is_train=True, retain_derived=True):
        """
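        Forward pass of BN for mini-batch training; the theory is covered in the accompanying text.

        [train]: Y = scaler * norm(X) + intercept, where norm(X) = (X - mean(X)) / sqrt(var(X) + epsilon)

        [test]: Y = scaler * running_norm(X) + intercept,
                    where running_norm(X) = (X - running_mean) / sqrt(running_var + epsilon)

        Parameters:
        X: input array of shape (n_samples, n_in), float
        is_train: whether the layer is in the training phase, bool
        retain_derived: whether to keep intermediate variables for reuse during backpropagation, bool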
        """
        if not self.is_initialized:
            self.n_in = self.n_out = X.shape[1]
            self._init_params()

        epsi, momentum = self.hyperparams["epsilon"], self.hyperparams["momentum"]
        rm, rv = self.params["running_mean"], self.params["running_var"]

        scaler, intercept = self.params["scaler"], self.params["intercept"]
        X_mean, X_var = self.params["running_mean"], self.params["running_var"]

        if is_train and retain_derived:
            X_mean, X_var = X.mean(axis=0), X.var(axis=0) 
            self.params["running_mean"] = momentum * rm + (1.0 - momentum) * X_mean
            self.params["running_var"] = momentum * rv + (1.0 - momentum) * X_var

        if retain_derived:
            self.X.append(X)

        X_hat = (X - X_mean) / np.sqrt(X_var + epsi)
        y = scaler * X_hat + intercept
        return y

    def backward(self, dLda, retain_grads=True):
        """
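        Backward pass of BN; the theory is covered in the accompanying text.

        Parameters:
        dLda: gradient of the loss with respect to the layer output, shape (n_samples, n_out), float
        retain_grads: whether to accumulate the parameter gradients computed during the backward pass, bool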
        """
        if not isinstance(dLda, list):
            dLda = [dLda]

        dX = []
        X = self.X
        for da, x in zip(dLda, X):
            dx, dScaler, dIntercept = self._bwd(da, x)
            dX.append(dx)

            if retain_grads:
                self.gradients["scaler"] += dScaler
                self.gradients["intercept"] += dIntercept

        return dX[0] if len(X) == 1 else dX

    def _bwd(self, dLda, X):
        scaler = self.params["scaler"]
        epsi = self.hyperparams["epsilon"]

        n_ex, n_in = X.shape
        X_mean, X_var = X.mean(axis=0), X.var(axis=0)
        X_hat = (X - X_mean) / np.sqrt(X_var + epsi)
        
        dIntercept = dLda.sum(axis=0)
        dScaler = np.sum(dLda * X_hat, axis=0)
        dX_hat = dLda * scaler
        
        dX = (n_ex * dX_hat - dX_hat.sum(axis=0) - X_hat * (dX_hat * X_hat).sum(axis=0)) / (
            n_ex * np.sqrt(X_var + epsi)
        )

        return dX, dScaler, dIntercept
    
    @property
    def hyperparams(self):
        return {
            "layer": "BatchNorm1D",
            "acti_fn": None,
            "n_in": self.n_in,
            "n_out": self.n_out,
            "epsilon": self.epsilon,
            "momentum": self.momentum,
            "optimizer": {
                "cache": self.optimizer.cache,
                "hyperparams": self.optimizer.hyperparams,
            },
        }
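

# ---- Illustrative sketch (not part of the original implementation) ----
# A compact, standalone version of the training-mode computation described in
# the forward docstring: normalise each feature with the batch mean and variance,
# then scale and shift; the running statistics are an exponential moving average
# of the batch statistics. The function names below are assumptions made for
# this example and do not depend on LayerBase.
import numpy as np


def batchnorm_forward_sketch(X, scaler, intercept, epsilon=1e-5):
    # Per-feature statistics over the batch dimension
    mean, var = X.mean(axis=0), X.var(axis=0)
    X_hat = (X - mean) / np.sqrt(var + epsilon)
    return scaler * X_hat + intercept, mean, var


def update_running_sketch(running, batch_stat, momentum=0.9):
    # momentum close to 1 means the running statistic changes slowly
    return momentum * running + (1.0 - momentum) * batch_stat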


================================================
FILE: code/chapter9.py
================================================
from abc import ABC, abstractmethod
import numpy as np
from chapter6 import LayerBase, CrossEntropy, FullyConnected, minibatch, softmax
from collections import OrderedDict


########## Padding ################
def calc_pad_dims_sameconv_2D(X_shape, out_dim, kernel_shape, stride, dilation=1):
    """
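    Compute the amount of padding for a 'same' convolution, so that the output has the same size
    as the input. Padding, stride, and dilation rate are all taken into account; the amount of
    padding follows from the output-size formula for dilated convolution.

    Parameters:
    X_shape: shape of the input array, (n_samples, in_rows, in_cols, in_ch)
    out_dim: output dimensions, (out_rows, out_cols)
    kernel_shape: kernel shape, (fr, fc)
    stride: convolution stride, int
    dilation: dilation rate, int, default=1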
    """
    d = dilation
    fr, fc = kernel_shape
    out_rows, out_cols = out_dim
    n_ex, in_rows, in_cols, in_ch = X_shape

    # Account for the dilation rate
    _fr, _fc = fr + (fr-1) * (d-1), fc + (fc-1) * (d-1)

    # Compute the padding size
    pr = int((stride * (out_rows-1) + _fr - in_rows) / 2)
    pc = int((stride * (out_cols-1) + _fc - in_cols) / 2)

    # Sanity check: if the sizes differ, add asymmetric zero padding (at the right/bottom)
    out_rows1 = int(1 + (in_rows + 2 * pr - _fr) / stride)
    out_cols1 = int(1 + (in_cols + 2 * pc - _fc) / stride)
    
    pr1, pr2 = pr, pr
    if out_rows1 == out_rows - 1:
        pr1, pr2 = pr, pr + 1
    elif out_rows1 != out_rows:
        raise AssertionError

    pc1, pc2 = pc, pc
    if out_cols1 == out_cols - 1:
        pc1, pc2 = pc, pc + 1
    elif out_cols1 != out_cols:
        raise AssertionError
        
    # Return the padding for X as (top, bottom, left, right): row padding first, then column padding
    return (pr1, pr2, pc1, pc2)
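

# ---- Illustrative sketch (not part of the original implementation) ----
# The padding computation relies on the output-size formula for a dilated
# convolution: out = floor((in + pad_total - effective_kernel) / stride) + 1,
# with effective_kernel = k + (k - 1) * (dilation - 1). The hypothetical helper
# below evaluates it for one spatial dimension; `pad_total` is the sum of the
# padding on both sides and is a name chosen for this example.
def conv_out_dim_sketch(in_dim, kernel, stride=1, pad_total=0, dilation=1):
    effective = kernel + (kernel - 1) * (dilation - 1)
    return (in_dim + pad_total - effective) // stride + 1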


def pad2D(X, pad, kernel_shape=None, stride=None, dilation=1):
    """
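    2D zero padding.

    Parameters:
    X: input array of shape (n_samples, in_rows, in_cols, in_ch);
        padding is applied along in_rows and in_cols
    pad: amount of padding; a 4-tuple, an int, or 'same' / 'valid'
        a 4-tuple gives the zero padding as (top, bottom, left, right), i.e. rows first, then columns,
        an int pads all four sides with `pad` zeros,
        'same' pads so that the result is a same convolution,
        'valid' pads so that the result is a valid convolution (no padding)
    kernel_shape: kernel shape, (fr, fc)
    stride: convolution stride, int
    dilation: dilation rate, int, default=1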
    """
    p = pad
    if isinstance(p, int):
        p = (p, p, p, p)

    if isinstance(p, tuple):
        X_pad = np.pad(
            X,
            pad_width=((0, 0), (p[0], p[1]), (p[2], p[3]), (0, 0)),
            mode="constant",
            constant_values=0,
        )

    # 'same' convolution: compute the padding size first
    if p == "same" and kernel_shape and stride is not None:
        p = calc_pad_dims_sameconv_2D(
            X.shape, X.shape[1:3], kernel_shape, stride, dilation=dilation
        )
        X_pad, p = pad2D(X, p)
        
    if p == "valid":
        p = (0, 0, 0, 0)
        X_pad, p = pad2D(X, p)
        
    return X_pad, p


####### conv2D ##################
def conv2D(X, W, stride, pad, dilation=1):
    """
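    2D convolution.

    Parameters:
    X: input array of shape (n_samples, in_rows, in_cols, in_ch)
    W: kernel parameters of the convolutional layer, shape (kernel_rows, kernel_cols, in_ch, out_ch)
    stride: convolution stride, int
    pad: amount of padding; a 4-tuple, an int, or 'same' / 'valid'
        a 4-tuple gives the zero padding as (top, bottom, left, right),
        an int pads all four sides with `pad` zeros,
        'same' pads so that the result is a same convolution,
        'valid' pads so that the result is a valid convolution (no padding)
    dilation: dilation rate, int, default=1

    Returns:
    Z: convolution result of shape (n_samples, out_rows, out_cols, out_ch)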
    """
    s, d = stride, dilation
    X_pad, p = pad2D(X, pad, W.shape[:2], stride=s, dilation=d)

    pr1, pr2, pc1, pc2 = p
    fr, fc, in_ch, out_ch = W.shape
    n_samp, in_rows, in_cols, in_ch = X.shape

    # Account for the dilation rate
    _fr, _fc = fr + (fr-1) * (d-1), fc + (fc-1) * (d-1)

    out_rows = int((in_rows + pr1 + pr2 - _fr) / s + 1)
    out_cols = int((in_cols + pc1 + pc2 - _fc) / s + 1)

    Z = np.zeros((n_samp, out_rows, out_cols, out_ch))
    for m in range(n_samp):
        for c in range(out_ch):
            for i in range(out_rows):
                for j in range(out_cols):
                    i0, i1 = i * s, (i * s) + fr + (fr-1) * (d-1)
                    j0, j1 = j * s, (j * s) + fc + (fc-1) * (d-1)

                    window = X_pad[m, i0 : i1 : d, j0 : j1 : d, :]
                    Z[m, i, j, c] = np.sum(window * W[:, :, :, c])
    return Z


####### conv2D GEMM ############
"""
GEMM implementation of conv2D: X and W are converted to 2D matrices.
Here X becomes (kernel_rows*kernel_cols*in_ch, n_samples*out_rows*out_cols)
and W becomes (out_ch, kernel_rows*kernel_cols*in_ch).
"""
def _im2col_indices(X_shape, fr, fc, p, s, d=1):
    """
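    Generate indices along the (c, h_in, w_in) dimensions of the input.

    Returns:
    i: row indices into the input, (kernel_rows*kernel_cols*in_ch, out_rows*out_cols); the second-dimension coordinate in the figure
    j: column indices into the input, (kernel_rows*kernel_cols*in_ch, out_rows*out_cols); the third-dimension coordinate in the figure
    k: channel indices into the input, (kernel_rows*kernel_cols*in_ch, 1); the first-dimension coordinate in the figure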
    """
    pr1, pr2, pc1, pc2 = p
    n_ex, n_in, in_rows, in_cols = X_shape

    # Account for the dilation rate
    _fr, _fc = fr + (fr-1) * (d-1), fc + (fc-1) * (d-1)

    out_rows = int((in_rows + pr1 + pr2 - _fr) / s + 1)
    out_cols = int((in_cols + pc1 + pc2 - _fc) / s + 1)

    # i0/i1/j0/j1 are used to build i, j, k. i0/j0 follow the figure; i1/j1 come from sliding the window
    i0 = np.repeat(np.arange(fr), fc)
    i0 = np.tile(i0, n_in) * d
    i1 = s * np.repeat(np.arange(out_rows), out_cols)
    j0 = np.tile(np.arange(fc), fr * n_in) * d
    j1 = s * np.tile(np.arange(out_cols), out_rows)

    i = i0.reshape(-1, 1) + i1.reshape(1, -1)
    j = j0.reshape(-1, 1) + j1.reshape(1, -1)
    k = np.repeat(np.arange(n_in), fr * fc).reshape(-1, 1)
    return k, i, j



def im2col(X, W_shape, pad, stride, dilation=1):
    """
    im2col implementation.

    Parameters:
    X: input array of shape (n_samples, in_rows, in_cols, in_ch), not yet zero-padded
    W_shape: shape of the layer's kernels, (kernel_rows, kernel_cols, in_ch, out_ch)
    pad: amount of padding; a 4-tuple, an int, or 'same' / 'valid'
        a 4-tuple gives the zero padding as (top, bottom, left, right),
        an int pads all four sides with `pad` zeros,
        'same' pads so that the result is a same convolution,
        'valid' pads so that the result is a valid convolution (no padding)
    stride: convolution stride, int
    dilation: dilation rate, int, default=1

    Returns:
    X_col: result of shape (kernel_rows*kernel_cols*n_in, n_samples*out_rows*out_cols)
    p: padding, 4-tuple
    """
    fr, fc, n_in, n_out = W_shape
    s, p, d = stride, pad, dilation
    n_samp, in_rows, in_cols, n_in = X.shape

    X_pad, p = pad2D(X, p, W_shape[:2], stride=s, dilation=d)
    pr1, pr2, pc1, pc2 = p

    # Move the channel dimension of the input to the second position
    X_pad = X_pad.transpose(0, 3, 1, 2)

    k, i, j = _im2col_indices((n_samp, n_in, in_rows, in_cols), fr, fc, p, s, d)

    # X_col.shape = (n_samples, kernel_rows*kernel_cols*n_in, out_rows*out_cols)
    X_col = X_pad[:, k, i, j]
    X_col = X_col.transpose(1, 2, 0).reshape(fr * fc * n_in, -1)
    return X_col, p


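def conv2D_gemm(X, W, stride=1, pad='same', dilation=1):
    """
    2D convolution that relies on "im2col" to perform the convolution as a single matrix multiplication.

    Parameters:
    X: input array of shape (n_samples, in_rows, in_cols, in_ch)
    W: kernel parameters of the convolutional layer, shape (kernel_rows, kernel_cols, in_ch, out_ch)
    stride: convolution stride, int
    pad: amount of padding; a 4-tuple, an int, or 'same' / 'valid'
        a 4-tuple gives the zero padding as (top, bottom, left, right),
        an int pads all four sides with `pad` zeros,
        'same' pads so that the result is a same convolution,
        'valid' pads so that the result is a valid convolution (no padding)
    dilation: dilation rate, int, default=1

    Returns:
    Z: convolution result of shape (n_samples, out_rows, out_cols, out_ch)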
    """
    s, d = stride, dilation
    _, p = pad2D(X, pad, W.shape[:2], s, dilation=dilation)

    pr1, pr2, pc1, pc2 = p
    fr, fc, in_ch, out_ch = W.shape
    n_samp, in_rows, in_cols, in_ch = X.shape
    
    # Account for the dilation rate
    _fr, _fc = fr + (fr-1) * (d-1), fc + (fc-1) * (d-1)

    # Output dimensions, from the output-size formula
    out_rows = int((in_rows + pr1 + pr2 - _fr) / s + 1)
    out_cols = int((in_cols + pc1 + pc2 - _fc) / s + 1)

    # Convert X and W to 2D matrices and multiply them
    X_col, _ = im2col(X, W.shape, p, s, d)
    W_col = W.transpose(3, 2, 0, 1).reshape(out_ch, -1)

    Z = (W_col @ X_col).reshape(out_ch, out_rows, out_cols, n_samp).transpose(3, 1, 2, 0)

    return Z
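

# ---- Illustrative sketch (not part of the original implementation) ----
# A reduced check of the GEMM idea, assuming stride 1, no padding and no
# dilation: gather every receptive field into one row of a 2D matrix, multiply
# once by the flattened kernels, and compare with a direct sliding-window
# convolution. For brevity this sketch stores one patch per row, which is the
# transpose of the column layout used by im2col; both function names are
# assumptions made for this example.
import numpy as np


def naive_conv_valid_sketch(X, W):
    n, rows, cols, _ = X.shape
    fr, fc, _, out_ch = W.shape
    out_r, out_c = rows - fr + 1, cols - fc + 1
    Z = np.zeros((n, out_r, out_c, out_ch))
    for i in range(out_r):
        for j in range(out_c):
            window = X[:, i:i + fr, j:j + fc, :]  # (n, fr, fc, in_ch)
            # Contract the window with every kernel at once -> (n, out_ch)
            Z[:, i, j, :] = np.tensordot(window, W, axes=([1, 2, 3], [0, 1, 2]))
    return Z


def gemm_conv_valid_sketch(X, W):
    n, rows, cols, _ = X.shape
    fr, fc, _, out_ch = W.shape
    out_r, out_c = rows - fr + 1, cols - fc + 1
    # One flattened patch per row: (n * out_r * out_c, fr * fc * in_ch)
    patches = np.stack(
        [X[:, i:i + fr, j:j + fc, :].reshape(n, -1)
         for i in range(out_r) for j in range(out_c)],
        axis=1,
    ).reshape(n * out_r * out_c, -1)
    # A single matrix product replaces the nested loops
    Z = patches @ W.reshape(-1, out_ch)
    return Z.reshape(n, out_r, out_c, out_ch)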


########### Conv2D ##################
class Conv2D(LayerBase):
    
    def __init__(
        self,
        out_ch,
        kernel_shape,
        pad=0,
        stride=1,
        dilation=1,
        acti_fn=None,
        optimizer=None,
        init_w="glorot_uniform",
    ):
        """
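        2D convolution.

        Parameters:
        out_ch: number of kernels (output channels), int
        kernel_shape: shape of a single kernel, 2-tuple
        acti_fn: activation function, str
        pad: amount of padding; a 4-tuple, an int, or 'same' / 'valid'
            a 4-tuple gives the zero padding as (top, bottom, left, right),
            an int pads all four sides with `pad` zeros,
            'same' pads so that the result is a same convolution,
            'valid' pads so that the result is a valid convolution (no padding)
        stride: convolution stride, int
        dilation: dilation rate, int, default=1
        init_w: weight initialization method, str
        optimizer: optimization method, str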
        """
        super().__init__(optimizer)

        self.pad = pad
        self.in_ch = None
        self.out_ch = out_ch
        self.stride = stride
        self.dilation = dilation
        self.kernel_shape = kernel_shape
        self.init_w = init_w
        self.init_weights = WeightInitializer(mode=init_w)
        self.acti_fn = ActivationInitializer(acti_fn)()
        self.params = {"W": None, "b": None}
        self.is_initialized = False

    def _init_params(self):
        fr, fc = self.kernel_shape
        W = self.init_weights((fr, fc, self.in_ch, self.out_ch))
        b = np.zeros((1, 1, 1, self.out_ch))

        self.params = {"W": W, "b": b}
        self.gradients = {"W": np.zeros_like(W), "b": np.zeros_like(b)}
        self.derived_variables = {"Y": []}
        self.is_initialized = True

    def forward(self, X, retain_derived=True):
        """
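        Forward pass of the convolutional layer; the theory is covered in the accompanying text.

        Parameters:
        X: input array of shape (n_samples, in_rows, in_cols, in_ch)
        retain_derived: whether to keep intermediate variables for reuse during backpropagation, bool

        Returns:
        a: layer output of shape (n_samples, out_rows, out_cols, out_ch)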
        """
        if not self.is_initialized:
            self.in_ch = X.shape[3]
            self._init_params()

        W = self.params["W"]
        b = self.params["b"]

        n_samp, in_rows, in_cols, in_ch = X.shape
        s, p, d = self.stride, self.pad, self.dilation

        # Convolution
        Y = conv2D(X, W, s, p, d) + b
        a = self.acti_fn(Y)

        if retain_derived:
            self.X.append(X)
            self.derived_variables["Y"].append(Y)

        return a

    def backward(self, dLda, retain_grads=True):
        """
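        Backward pass of the convolutional layer; the theory is covered in the accompanying text.

        Parameters:
        dLda: gradient of the loss with respect to the layer output, shape (n_samples, out_rows, out_cols, out_ch)
        retain_grads: whether to accumulate the parameter gradients computed during the backward pass, bool

        Returns:
        dXs: i.e. dX, the gradient of the loss with respect to the layer input, shape (n_samples, in_rows, in_cols, in_ch)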
        """
        if not isinstance(dLda, list):
            dLda = [dLda]

        W = self.params["W"]
        b = self.params["b"]
        Ys = self.derived_variables["Y"]
        Xs, d = self.X, self.dilation
        (fr, fc), s, p = self.kernel_shape, self.stride, self.pad
        dXs = []
        
        for X, Y, da in zip(Xs, Ys, dLda):
            n_samp, out_rows, out_cols, out_ch = da.shape
            X_pad, (pr1, pr2, pc1, pc2) = pad2D(X, p, self.kernel_shape, s, d)

            dY = da * self.acti_fn.grad(Y)

            dX = np.zeros_like(X_pad)
            dW, db = np.zeros_like(W), np.zeros_like(b)
            for m in range(n_samp):
                for i in range(out_rows):
                    for j in range(out_cols):
                        for c in range(out_ch):
                            i0, i1 = i * s, (i * s) + fr + (fr-1) * (d-1)
                            j0, j1 = j * s, (j * s) + fc + (fc-1) * (d-1)

                            wc = W[:, :, :, c]
                            kernel = dY[m, i, j, c]
                            window = X_pad[m, i0:i1:d, j0:j1:d, :]

                            db[:, :, :, c] += kernel
                            dW[:, :, :, c] += window * kernel
                            dX[m, i0:i1:d, j0:j1:d, :] += (
                                wc * kernel
                            )

            if retain_grads:
                self.gradients["W"] += dW
                self.gradients["b"] += db

            pr2 = None if pr2 == 0 else -pr2
            pc2 = None if pc2 == 0 else -pc2
            dXs.append(dX[:, pr1:pr2, pc1:pc2, :])
            
        return dXs[0] if len(Xs) == 1 else dXs
    
    @property
    def hyperparams(self):
        return {
            "layer": "Conv2D",
            "pad": self.pad,
            "init_w": self.init_w,
            "in_ch": self.in_ch,
            "out_ch": self.out_ch,
            "stride": self.stride,
            "dilation": self.dilation,
            "acti_fn": str(self.acti_fn),
            "kernel_shape": self.kernel_shape,
            "optimizer": {
                "cache": self.optimizer.cache,
                "hyperparams": self.optimizer.hyperparams,
            },
        }


######### Conv2D GEMM #############
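def col2im(X_col, X_shape, W_shape, pad, stride, dilation=1):
    """
    col2im implementation: converts a 2D matrix back into a 4D image.

    Parameters:
    X_col: the (column) matrix produced by im2col from X, shape (Q, Z); see the accompanying text for the exact shape
    X_shape: shape of the original input array, (n_samples, in_rows, in_cols, in_ch),
             not yet zero-padded
    W_shape: shape of the kernels, 4-tuple (kernel_rows, kernel_cols, in_ch, out_ch)
    pad: amount of padding, 4-tuple
            zero padding as (top, bottom, left, right)
    stride: convolution stride, int
    dilation: dilation rate, int, default=1

    Returns:
    img: result of shape (n_samples, in_ch, in_rows, in_cols); the caller transposes it back to channels-last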
    """
    s, d = stride, dilation
    pr1, pr2, pc1, pc2 = pad
    fr, fc, n_in, n_out = W_shape
    n_samp, in_rows, in_cols, n_in = X_shape

    X_pad = np.zeros((n_samp, n_in, in_rows + pr1 + pr2, in_cols + pc1 + pc2))
    k, i, j = _im2col_indices((n_samp, n_in, in_rows, in_cols), fr, fc, pad, s, d)

    X_col_reshaped = X_col.reshape(n_in * fr * fc, -1, n_samp)
    X_col_reshaped = X_col_reshaped.transpose(2, 0, 1)

    np.add.at(X_pad, (slice(None), k, i, j), X_col_reshaped)

    pr2 = None if pr2 == 0 else -pr2
    pc2 = None if pc2 == 0 else -pc2
    return X_pad[:, :, pr1:pr2, pc1:pc2]


class Conv2D_gemm(LayerBase):
    
    def __init__(
        self,
        out_ch,
        kernel_shape,
        pad=0,
        stride=1,
        dilation=1,
        acti_fn=None,
        optimizer=None,
        init_w="glorot_uniform",
    ):
        """
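        2D convolution.

        Parameters:
        out_ch: number of kernels (output channels), int
        kernel_shape: shape of a single kernel, 2-tuple
        acti_fn: activation function, str
        pad: amount of padding; a 4-tuple, an int, or 'same' / 'valid'
            a 4-tuple gives the zero padding as (top, bottom, left, right),
            an int pads all four sides with `pad` zeros,
            'same' pads so that the result is a same convolution,
            'valid' pads so that the result is a valid convolution (no padding)
        stride: convolution stride, int
        dilation: dilation rate, int, default=1
        init_w: weight initialization method, str
        optimizer: optimization method, str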
        """
        super().__init__(optimizer)

        self.pad = pad
        self.in_ch = None
        self.out_ch = out_ch
        self.stride = stride
        self.dilation = dilation
        self.kernel_shape = kernel_shape
        self.init_w = init_w
        self.init_weights = WeightInitializer(mode=init_w)
        self.acti_fn = ActivationInitializer(acti_fn)()
        self.params = {"W": None, "b": None}
        self.is_initialized = False

    def _init_params(self):
        fr, fc = self.kernel_shape
        W = self.init_weights((fr, fc, self.in_ch, self.out_ch))
        b = np.zeros((1, 1, 1, self.out_ch))

        self.params = {"W": W, "b": b}
        self.gradients = {"W": np.zeros_like(W), "b": np.zeros_like(b)}
        self.derived_variables = {"Y": []}
        self.is_initialized = True

    def forward(self, X, retain_derived=True):
        """
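        Forward pass of the convolutional layer; the theory is covered in the accompanying text.

        Parameters:
        X: input array of shape (n_samples, in_rows, in_cols, in_ch)
        retain_derived: whether to keep intermediate variables for reuse during backpropagation, bool

        Returns:
        a: layer output of shape (n_samples, out_rows, out_cols, out_ch)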
        """
        if not self.is_initialized:
            self.in_ch = X.shape[3]
            self._init_params()

        W = self.params["W"]
        b = self.params["b"]

        n_samp, in_rows, in_cols, in_ch = X.shape
        s, p, d = self.stride, self.pad, self.dilation

        # Convolution
        Y = conv2D_gemm(X, W, s, p, d) + b
        a = self.acti_fn(Y)

        if retain_derived:
            self.X.append(X)
            self.derived_variables["Y"].append(Y)

        return a

    def backward(self, dLda, retain_grads=True):
        """
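        Backward pass of the convolutional layer; the theory is covered in the accompanying text.

        Parameters:
        dLda: gradient of the loss with respect to the layer output, shape (n_samples, out_rows, out_cols, out_ch)
        retain_grads: whether to accumulate the parameter gradients computed during the backward pass, bool

        Returns:
        dX: gradient of the loss with respect to the layer input, shape (n_samples, in_rows, in_cols, in_ch)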
        """
        if not isinstance(dLda, list):
            dLda = [dLda]

        dX = []
        X = self.X
        Y = self.derived_variables["Y"]

        for da, x, y in zip(dLda, X, Y):
            dx, dw, db = self._bwd(da, x, y)
            dX.append(dx)

            if retain_grads:
                self.gradients["W"] += dw
                self.gradients["b"] += db

        return dX[0] if len(X) == 1 else dX

    def _bwd(self, dLda, X, Y):
        W = self.params["W"]
        d = self.dilation
        fr, fc, in_ch, out_ch = W.shape
        n_samp, out_rows, out_cols, out_ch = dLda.shape
        (fr, fc), s, p = self.kernel_shape, self.stride, self.pad
        
        dLdy = dLda * self.acti_fn.grad(Y)
        dLdy_col = dLdy.transpose(3, 1, 2, 0).reshape(out_ch, -1)
        W_col = W.transpose(3, 2, 0, 1).reshape(out_ch, -1).T
        X_col, p = im2col(X, W.shape, p, s, d)

        dW = (dLdy_col @ X_col.T).reshape(out_ch, in_ch, fr, fc).transpose(2, 3, 1, 0)
        db = dLdy_col.sum(axis=1).reshape(1, 1, 1, -1)

        dX_col = W_col @ dLdy_col
        dX = col2im(dX_col, X.shape, W.shape, p, s, d).transpose(0, 2, 3, 1)

        return dX, dW, db
    
    @property
    def hyperparams(self):
        return {
            "layer": "Conv2D",
            "pad": self.pad,
            "init_w": self.init_w,
            "in_ch": self.in_ch,
            "out_ch": self.out_ch,
            "stride": self.stride,
            "dilation": self.dilation,
            "acti_fn": str(self.acti_fn),
            "kernel_shape": self.kernel_shape,
            "optimizer": {
                "cache": self.optimizer.cache,
                "hyperparams": self.optimizer.hyperparams,
            },
        }
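

# Illustrative sketch, not part of the original module: the layer above relies on
# conv2D_gemm / im2col, which are defined elsewhere in the repository. The idea of
# a GEMM-based convolution can be shown in a reduced setting (stride 1, no padding,
# no dilation) by unrolling every receptive field into a row and multiplying by the
# reshaped kernel. The names conv2d_direct_sketch and conv2d_im2col_sketch are
# hypothetical and only assume the same (n, rows, cols, ch) layout used above.
```python
import numpy as np


def conv2d_direct_sketch(X, W):
    """Loop-based valid cross-correlation with stride 1.

    X: (n_samp, in_rows, in_cols, in_ch), W: (fr, fc, in_ch, out_ch).
    """
    n_samp, in_rows, in_cols, _ = X.shape
    fr, fc, _, out_ch = W.shape
    out_rows, out_cols = in_rows - fr + 1, in_cols - fc + 1
    Y = np.zeros((n_samp, out_rows, out_cols, out_ch))
    for i in range(out_rows):
        for j in range(out_cols):
            patch = X[:, i:i + fr, j:j + fc, :]  # (n_samp, fr, fc, in_ch)
            # Contract the window and input-channel axes against the kernel
            Y[:, i, j, :] = np.tensordot(patch, W, axes=([1, 2, 3], [0, 1, 2]))
    return Y


def conv2d_im2col_sketch(X, W):
    """Same result, computed as one matrix product over unrolled patches."""
    n_samp, in_rows, in_cols, in_ch = X.shape
    fr, fc, _, out_ch = W.shape
    out_rows, out_cols = in_rows - fr + 1, in_cols - fc + 1

    # One row per output position, one column per kernel entry
    cols = np.empty((n_samp * out_rows * out_cols, fr * fc * in_ch))
    row = 0
    for m in range(n_samp):
        for i in range(out_rows):
            for j in range(out_cols):
                cols[row] = X[m, i:i + fr, j:j + fc, :].ravel()
                row += 1

    W_mat = W.reshape(fr * fc * in_ch, out_ch)  # same C-order as the patches
    return (cols @ W_mat).reshape(n_samp, out_rows, out_cols, out_ch)
```
# The two functions should agree up to floating-point error; the second trades
# extra memory for a single large matrix multiplication.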


######## Pool2D ################
class Pool2D(LayerBase):
    
    def __init__(self, kernel_shape, stride=1, pad=0, mode="max", optimizer=None):
        """
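        2D pooling layer.

        Parameters:
        kernel_shape: 2-tuple, size of the pooling window
        stride: int, step of the window along each dimension (as in convolution)
        pad: 4-tuple, int, or str ('same', 'valid'), amount of padding (default: 0);
            same semantics as in convolution
        mode: str, pooling function, one of {"max", "average"} (default: 'max')
        optimizer: str, optimization method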
        """
        super().__init__(optimizer)

        self.pad = pad
        self.mode = mode
        self.in_ch = None
        self.out_ch = None
        self.stride = stride
        self.kernel_shape = kernel_shape
        self.is_initialized = False

    def _init_params(self):
        self.derived_variables = {"out_rows": [], "out_cols": []}
        self.is_initialized = True

    def forward(self, X, retain_derived=True):
        """
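        Forward pass of the pooling layer.

        Parameters:
        X: input array of shape (n_samp, in_rows, in_cols, in_ch)
        retain_derived: bool, whether to keep intermediate variables so that
            the backward pass can reuse them

        Returns:
        Y: output of shape (n_samp, out_rows, out_cols, out_ch)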
        """
        if not self.is_initialized:
            self.in_ch = self.out_ch = X.shape[3]
            self._init_params()

        n_samp, in_rows, in_cols, nc_in = X.shape
        (fr, fc), s, p = self.kernel_shape, self.stride, self.pad
        X_pad, (pr1, pr2, pc1, pc2) = pad2D(X, p, self.kernel_shape, s)

        out_rows = int((in_rows + pr1 + pr2 - fr) / s + 1)
        out_cols = int((in_cols + pc1 + pc2 - fc) / s + 1)

        if self.mode == "max":
            pool_fn = np.max
        elif self.mode == "average":
            pool_fn = np.mean
        else:
            raise ValueError("Unknown pooling mode: {}".format(self.mode))

        Y = np.zeros((n_samp, out_rows, out_cols, self.out_ch))
        for m in range(n_samp):
            for i in range(out_rows):
                for j in range(out_cols):
                    for c in range(self.out_ch):
                        i0, i1 = i * s, (i * s) + fr
                        j0, j1 = j * s, (j * s) + fc

                        xi = X_pad[m, i0:i1, j0:j1, c]
                        Y[m, i, j, c] = pool_fn(xi)

        if retain_derived:
            self.X.append(X)
            self.derived_variables["out_rows"].append(out_rows)
            self.derived_variables["out_cols"].append(out_cols)

        return Y

    def backward(self, dLdy, retain_grads=True):
        """
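        Backward pass of the pooling layer (see the explanation above).

        Parameters:
        dLdy: gradient of the loss with respect to the layer output,
            shape (n_samples, out_rows, out_cols, out_ch)
        retain_grads: bool, whether to compute parameter gradients
            (unused here: a pooling layer has no parameters)

        Returns:
        dXs: i.e. dX, the gradient of the loss with respect to the input of this
            pooling layer, shape (n_samples, in_rows, in_cols, in_ch)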
        """
        if not isinstance(dLdy, list):
            dLdy = [dLdy]

        Xs = self.X
        out_rows = self.derived_variables["out_rows"]
        out_cols = self.derived_variables["out_cols"]

        (fr, fc), s, p = self.kernel_shape, self.stride, self.pad

        dXs = []
        for X, dy, out_row, out_col in zip(Xs, dLdy, out_rows, out_cols):
            n_samp, in_rows, in_cols, nc_in = X.shape
            X_pad, (pr1, pr2, pc1, pc2) = pad2D(X, p, self.kernel_shape, s)

            dX = np.zeros_like(X_pad)
            for m in range(n_samp):
                for i in range(out_row):
                    for j in range(out_col):
                        for c in range(self.out_ch):
                            i0, i1 = i * s, (i * s) + fr
                            j0, j1 = j * s, (j * s) + fc

                            if self.mode == "max":
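                                # Index the padded input: i0..j1 are offsets into X_pad
                                xi = X_pad[m, i0:i1, j0:j1, c]
                                # Route the gradient to a single position, even when
                                # several entries tie for the maximum
                                mask = np.zeros_like(xi).astype(bool)
                                x, y = np.argwhere(xi == np.max(xi))[0]
                                mask[x, y] = True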
                                dX[m, i0:i1, j0:j1, c] += mask * dy[m, i, j, c]
                                
                            elif self.mode == "average":
                                frame = np.ones((fr, fc)) * dy[m, i, j, c]
                                dX[m, i0:i1, j0:j1, c] += frame / np.prod((fr, fc))

            pr2 = None if pr2 == 0 else -pr2
            pc2 = None if pc2 == 0 else -pc2
            dXs.append(dX[:, pr1:pr2, pc1:pc2, :])
            
        return dXs[0] if len(Xs) == 1 else dXs

    @property
    def hyperparams(self):
        return {
            "layer": "Pool2D",
            "acti_fn": None,
            "pad": self.pad,
            "mode": self.mode,
            "in_ch": self.in_ch,
            "out_ch": self.out_ch,
            "stride": self.stride,
            "kernel_shape": self.kernel_shape,
            "optimizer": {
                "cache": self.optimizer.cache,
                "hyperparams": self.optimizer.hyperparams,
            },
        }
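

# Illustrative sketch, not part of the original module: how max pooling routes
# gradients can be seen on one small single-channel array. The helper name
# maxpool_2d_sketch is hypothetical; it assumes no padding and picks the first
# maximum in each window, mirroring the single-True mask used in Pool2D.backward.
```python
import numpy as np


def maxpool_2d_sketch(x, dy, k=2, s=2):
    """Max-pool a 2D array and push the upstream gradient dy back to the input.

    x: (rows, cols) input, dy: (out_rows, out_cols) gradient w.r.t. the output.
    Returns (y, dx).
    """
    out_rows = (x.shape[0] - k) // s + 1
    out_cols = (x.shape[1] - k) // s + 1
    y = np.zeros((out_rows, out_cols))
    dx = np.zeros_like(x, dtype=float)
    for i in range(out_rows):
        for j in range(out_cols):
            win = x[i * s:i * s + k, j * s:j * s + k]
            y[i, j] = win.max()
            # Only the (first) arg-max position receives the gradient
            r, c = np.unravel_index(np.argmax(win), win.shape)
            dx[i * s + r, j * s + c] += dy[i, j]
    return y, dx
```
# Every output cell contributes its gradient to exactly one input cell, so the
# entries of dx sum to the entries of dy.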


############### Flatten ##################
class Flatten(LayerBase):
    
    def __init__(self, keep_dim="first", optimizer=None):
        """
        Flatten a multi-dimensional input.

        Parameters:
        keep_dim: str, which dimension to preserve (default: 'first').
                For an input X: 'first' reshapes X to (X.shape[0], -1),
                'last' reshapes X to (-1, X.shape[-1]), 'none' reshapes X to (1, -1)
        optimizer: optimization method
        """
        super().__init__(optimizer)

        self.keep_dim = keep_dim
        self._init_params()

    def _init_params(self):
        self.X = []
        self.gradients = {}
        self.params = {}
        self.derived_variables = {"in_dims": []}

    def forward(self, X, retain_derived=True):
        """
        Forward pass.

        Parameters:
        X: input array
        retain_derived: bool, whether to record the input shape so that the
            backward pass can restore it
        """
        if retain_derived:
            self.derived_variables["in_dims"].append(X.shape)
        if self.keep_dim == "none":
            return X.flatten().reshape(1, -1)
        rs = (X.shape[0], -1) if self.keep_dim == "first" else (-1, X.shape[-1])
        return X.reshape(*rs)

    def backward(self, dLdy, retain_grads=True):
        """
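        Backward pass.

        Parameters:
        dLdy: gradient of the loss with respect to the layer output
        retain_grads: bool, whether to compute parameter gradients
            (unused here: the layer has no parameters)

        Returns:
        dX: the incoming gradient reshaped to the shape of the original input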
        """
        if not isinstance(dLdy, list):
            dLdy = [dLdy]
        in_dims = self.derived_variables["in_dims"]
        dX = [dy.reshape(*dims) for dy, dims in zip(dLdy, in_dims)]
        return dX[0] if len(dLdy) == 1 else dX

    @property
    def hyperparams(self):
        return {
            "layer": "Flatten",
            "keep_dim": self.keep_dim,
            "optimizer": {
                "cache": self.optimizer.cache,
                "hyperparams": self.optimizer.hyperparams,
            },
        }
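

# Illustrative sketch, not part of the original module: the three keep_dim modes
# are plain reshapes, and the backward pass only has to undo them. The helper
# name flatten_sketch is hypothetical and mirrors the branching in Flatten.forward.
```python
import numpy as np


def flatten_sketch(X, keep_dim="first"):
    """Reshape X the way Flatten.forward does for each keep_dim mode."""
    if keep_dim == "none":
        return X.reshape(1, -1)
    if keep_dim == "first":
        return X.reshape(X.shape[0], -1)   # keep the sample axis
    return X.reshape(-1, X.shape[-1])      # keep the channel axis
```
# For a batch of shape (2, 3, 3, 4) this gives (2, 36), (18, 4) and (1, 72);
# reshaping any of them back to (2, 3, 3, 4) recovers the original array.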


########### LeNet ################
class LeNet(object):
    
    def __init__(
        self,
        fc3_out=128,
        fc4_out=84,
        fc5_out=10,
        conv1_pad=0,
        conv2_pad=0,
        conv1_out_ch=6,
        conv2_out_ch=16,
        conv1_stride=1,
        pool1_stride=2,
        conv2_stride=1,
        pool2_stride=2,
        conv1_kernel_shape=(5, 5),
        pool1_kernel_shape=(2, 2),
        conv2_kernel_shape=(5, 5),
        pool2_kernel_shape=(2, 2),
        optimizer="adam",
        init_w="glorot_normal",
        loss=CrossEntropy()
    ):
        self.optimizer = optimizer
        self.init_w = init_w
        self.loss = loss
        self.fc3_out = fc3_out
        self.fc4_out = fc4_out
        self.fc5_out = fc5_out
        self.conv1_pad = conv1_pad
        self.conv2_pad = conv2_pad
        self.conv1_stride = conv1_stride
        self.conv1_out_ch = conv1_out_ch
        self.pool1_stride = pool1_stride
        self.conv2_out_ch = conv2_out_ch
        self.conv2_stride = conv2_stride
        self.pool2_stride = pool2_stride
        self.conv2_kernel_shape = conv2_kernel_shape
        self.pool2_kernel_shape = pool2_kernel_shape
        self.conv1_kernel_shape = conv1_kernel_shape
        self.pool1_kernel_shape = pool1_kernel_shape
        
        self.is_initialized = False
    
    def _set_params(self):
        """
        Build the model's layers in the following order:
        Conv1 -> Pool1 -> Conv2 -> Pool2 -> Flatten -> FC3 -> FC4 -> FC5 -> Softmax
        """
        self.layers = OrderedDict()
        self.layers["Conv1"] = Conv2D(
            out_ch=self.conv1_out_ch,
            kernel_shape=self.conv1_kernel_shape,
            pad=self.conv1_pad,
            stride=self.conv1_stride,
            acti_fn="sigmoid",
            optimizer=self.optimizer,
            init_w=self.init_w,
        )
        self.layers["Pool1"] = Pool2D(
            mode="max",
            optimizer=self.optimizer,
            stride=self.pool1_stride,
            kernel_shape=self.pool1_kernel_shape,
        )
        self.layers["Conv2"] = Conv2D(
            out_ch=self.conv2_out_ch,
            kernel_shape=self.conv2_kernel_shape,
            pad=self.conv2_pad,
            stride=self.conv2_stride,
            acti_fn="sigmoid",
            optimizer=self.optimizer,
            init_w=self.init_w,
        )
        self.layers["Pool2"] = Pool2D(
            mode="max",
            optimizer=self.optimizer,
            stride=self.pool2_stride,
            kernel_shape=self.pool2_kernel_shape,
        )
        self.layers["Flatten"] = Flatten(optimizer=self.optimizer)
        self.layers["FC3"] = FullyConnected(
            n_out=self.fc3_out,
            acti_fn="sigmoid",
            init_w=self.init_w,
            optimizer=self.optimizer
        )
        self.layers["FC4"] = FullyConnected(
            n_out=self.fc4_out,
            acti_fn="sigmoid",
            init_w=self.init_w,
            optimizer=self.optimizer
        )
        self.layers["FC5"] = FullyConnected(
            n_out=self.fc5_out,
            acti_fn="affine(slope=1, intercept=0)",
            init_w=self.init_w,
            optimizer=self.optimizer
        )
        self.is_initialized = True
    
    def forward(self, X_train):
        Xs = {}
        out = X_train
        for k, v in self.layers.items():
            Xs[k] = out
            out = v.forward(out)
        return out, Xs
    
    def backward(self, grad):
        dXs = {}
        out = grad
        for k, v in reversed(list(self.layers.items())):
            dXs[k] = out
            out = v.backward(out)
        return out, dXs
    
    def update(self):
        """
        Apply the optimizer update to every layer, then reset the gradients.
        """
        for k, v in reversed(list(self.layers.items())):
            v.update()
        self.flush_gradients()
    
    def flush_gradients(self, curr_loss=None):
        """
        Reset the accumulated gradients of every layer after an update.
        """
        for k, v in self.layers.items():
            v.flush_gradients()
    
    def fit(self, X_train, y_train, n_epochs=20, batch_size=64, verbose=False, epo_verbose=True):
        """
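        Parameters:
        X_train: training data
        y_train: training labels
        n_epochs: number of epochs
        batch_size: number of samples per mini-batch
        verbose: whether to print the loss after every batch
        epo_verbose: whether to print the average loss after every epoch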
        """
        self.verbose = verbose
        self.n_epochs = n_epochs
        self.batch_size = batch_size
        
        if not self.is_initialized:
            self.n_features = X_train.shape[1]
            self._set_params()
        
        prev_loss = np.inf
        for i in range(n_epochs):
            loss, epoch_start = 0.0, time.time()
            batch_generator, n_batch = minibatch(X_train, self.batch_size, shuffle=True)

            for j, batch_idx in enumerate(batch_generator):
                batch_len, batch_start = len(batch_idx), time.time()
                X_batch, y_batch = X_train[batch_idx], y_train[batch_idx]
                out, _ = self.forward(X_batch)
                y_pred_batch = softmax(out)
                batch_loss = self.loss(y_batch, y_pred_batch)
                grad = self.loss.grad(y_batch, y_pred_batch)
                _, _ = self.backward(grad)
                self.update()
                loss += batch_loss

                if self.verbose:
                    fstr = "\t[Batch {}/{}] Train loss: {:.3f} ({:.1f}s/batch)"
                    print(fstr.format(j + 1, n_batch, batch_loss, time.time() - batch_start))

            loss /= n_batch
            if epo_verbose:
                fstr = "[Epoch {}] Avg. loss: {:.3f}  Delta: {:.3f} ({:.2f}m/epoch)"
                print(fstr.format(i + 1, loss, prev_loss - loss, (time.time() - epoch_start) / 60.0))
            prev_loss = loss
            
    def evaluate(self, X_test, y_test, batch_size=128):
        acc = 0.0
        batch_generator, n_batch = minibatch(X_test, batch_size, shuffle=True)
        for j, batch_idx in enumerate(batch_generator):
            batch_len, batch_start = len(batch_idx), time.time()
            X_batch, y_batch = X_test[batch_idx], y_test[batch_idx]
            y_pred_batch, _ = self.forward(X_batch)
            y_pred_batch = np.argmax(y_pred_batch, axis=1)
            y_batch = np.argmax(y_batch, axis=1)
            acc += np.sum(y_pred_batch == y_batch)
        return acc / X_test.shape[0]
    
    @property
    def hyperparams(self):
        return {
            "init_w": self.init_w,
            "loss": str(self.loss),
            "optimizer": self.optimizer,
            "fc3_out": self.fc3_out, 
            "fc4_out": self.fc4_out,
            "fc5_out": self.fc5_out,
            "conv1_pad": self.conv1_pad, 
            "conv2_pad": self.conv2_pad, 
            "conv1_stride": self.conv1_stride,
            "conv1_out_ch": self.conv1_out_ch,
            "pool1_stride": self.pool1_stride,
            "conv2_out_ch": self.conv2_out_ch,
            "conv2_stride": self.conv2_stride, 
            "pool2_stride": self.pool2_stride,
            "conv2_kernel_shape": self.conv2_kernel_shape,
            "pool2_kernel_shape": self.pool2_kernel_shape,
            "conv1_kernel_shape": self.conv1_kernel_shape,
            "pool1_kernel_shape": self.pool1_kernel_shape,
            "components": {k: v.params for k, v in self.layers.items()}
        }
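

# Illustrative sketch, not part of the original module: tracing feature-map sizes
# through Conv1 -> Pool1 -> Conv2 -> Pool2 -> Flatten with the default 5x5 kernels,
# 2x2 pooling windows and zero padding. The 28x28 input size is an assumption
# (MNIST-sized images), as is Conv2 emitting conv2_out_ch channels; the helper
# names conv_out_size and lenet_shape_trace are hypothetical.
```python
def conv_out_size(n_in, kernel, stride=1, pad=0):
    """Spatial output size of a sliding window, assuming symmetric padding."""
    return (n_in + 2 * pad - kernel) // stride + 1


def lenet_shape_trace(in_size=28, conv1_out_ch=6, conv2_out_ch=16):
    """Return the (rows, cols, channels) shape after each layer."""
    trace = {}
    size = conv_out_size(in_size, kernel=5, stride=1)
    trace["Conv1"] = (size, size, conv1_out_ch)
    size = conv_out_size(size, kernel=2, stride=2)
    trace["Pool1"] = (size, size, conv1_out_ch)
    size = conv_out_size(size, kernel=5, stride=1)
    trace["Conv2"] = (size, size, conv2_out_ch)
    size = conv_out_size(size, kernel=2, stride=2)
    trace["Pool2"] = (size, size, conv2_out_ch)
    # Flatten keeps the sample axis and merges everything else
    trace["Flatten"] = (size * size * conv2_out_ch,)
    return trace
```
# Under these assumptions the spatial size goes 28 -> 24 -> 12 -> 8 -> 4, so FC3
# would receive 4 * 4 * 16 = 256 features per sample.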


############# LeNet GEMM ################
class LeNet_gemm(object):
    
    def __init__(
        self,
        fc3_out=128,
        fc4_out=84,
        fc5_out=10,
        conv1_pad=0,
        conv2_pad=0,
        conv1_out_ch=6,
        conv2_out_ch=16,
        conv1_stride=1,
        pool1_stride=2,
        conv2_stride=1,
        pool2_stride=2,
        conv1_kernel_shape=(5, 5),
        pool1_kernel_shape=(2, 2),
        conv2_kernel_shape=(5, 5),
        pool2_kernel_shape=(2, 2),
        optimizer="adam",
        init_w="glorot_normal",
        loss=CrossEntropy()
    ):
        self.optimizer = optimizer
        self.init_w = init_w
        self.loss = loss
        self.fc3_out = fc3_out
        self.fc4_out = fc4_out
        self.fc5_out = fc5_out
        self.conv1_pad = conv1_pad
        self.conv2_pad = conv2_pad
        self.conv1_stride = conv1_stride
        self.conv1_out_ch = conv1_out_ch
        self.pool1_stride = pool1_stride
        self.conv2_out_ch = conv2_out_ch
        self.conv2_stride = conv2_stride
        self.pool2_stride = pool2_stride
        self.conv2_kernel_shape = conv2_kernel_shape
        self.pool2_kernel_shape = pool2_kernel_shape
        self.conv1_kernel_shape = conv1_kernel_shape
        self.pool1_kernel_shape = pool1_kernel_shape
        
        self.is_initialized = False
    
    def _set_params(self):
        """
        Build the model's layers in the following order:
        Conv1 -> Pool1 -> Conv2 -> Pool2 -> Flatten -> FC3 -> FC4 -> FC5 -> Softmax
        """
        self.layers = OrderedDict()
        self.layers["Conv1"] = Conv2D_gemm(
            out_ch=self.conv1_out_ch,
            kernel_shape=self.conv1_kernel_shape,
            pad=self.conv1_pad,
            stride=self.conv1_stride,
            acti_fn="sigmoid",
            optimizer=self.optimizer,
            init_w=self.init_w,
        )
        self.layers["Pool1"] = Pool2D(
            mode="max",
            optimizer=self.optimizer,
            stride=self.pool1_stride,
            kernel_shape=self.pool1_kernel_shape,
        )
        self.layers["Conv2"] = Conv2D_gemm(
            out_ch=self.conv2_out_ch,
            kernel_shape=self.conv2_kernel_shape,
            pad=self.conv2_pad,
            stride=self.conv2_stride,
            acti_fn="sigmoid",
            optimizer=self.optimizer,
            init_w=self.init_w,
        )
        self.layers["Pool2"] = Pool2D(
            mode="max",
            optimizer=self.optimizer,
            stride=self.pool2_stride,
            kernel_shape=self.pool2_kernel_shape,
        )
        self.layers["Flatten"] = Flatten(optimizer=self.optimizer)
        self.layers["FC3"] = FullyConnected(
            n_out=self.fc3_out,
            acti_fn="sigmoid",
            init_w=self.init_w,
            optimizer=self.optimizer
        )
        self.layers["FC4"] = FullyConnected(
            n_out=self.fc4_out,
            acti_fn="sigmoid",
            init_w=self.init_w,
            optimizer=self.optimizer
        )
        self.layers["FC5"] = FullyConnected(
            n_out=self.fc5_out,
            acti_fn="affine(slope=1, intercept=0)",
            init_w=self.init_w,
            optimizer=self.optimizer
        )
        self.is_initialized = True
    
    def forward(self, X_train):
        Xs = {}
        out = X_train
        for k, v in self.layers.items():
            Xs[k] = out
            out = v.forward(out)
        return out, Xs
    
    def backward(self, grad):
        dXs = {}
        out = grad
        for k, v in reversed(list(self.layers.items())):
            dXs[k] = out
            out = v.backward(out)
        return out, dXs
    
    def update(self):
        """
        Apply the optimizer update to every layer, then reset the gradients.
        """
        for k, v in reversed(list(self.layers.items())):
            v.update()
        self.flush_gradients()
    
    def flush_gradients(self, curr_loss=None):
        """
        Reset the accumulated gradients of every layer after an update.
        """
        for k, v in self.layers.items():
            v.flush_gradients()
    
    def fit(self, X_train, y_train, n_epochs=20, batch_size=64, verbose=False, epo_verbose=True):
        """
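        Parameters:
        X_train: training data
        y_train: training labels
        n_epochs: number of epochs
        batch_size: number of samples per mini-batch
        verbose: whether to print the loss after every batch
        epo_verbose: whether to print the average loss after every epoch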
        """
        self.verbose = verbose
        self.n_epochs = n_epochs
        self.batch_size = batch_size
        
        if not self.is_initialized:
            self.n_features = X_train.shape[1]
            self._set_params()
        
        prev_loss = np.inf
        for i in range(n_epochs):
            loss, epoch_start = 0.0, time.time()
            batch_generator, n_batch = minibatch(X_train, self.batch_size, shuffle=True)

            for j, batch_idx in enumerate(batch_generator):
                batch_len, batch_start = len(batch_idx), time.time()
                X_batch, y_batch = X_train[batch_idx], y_train[batch_idx]
                out, _ = self.forward(X_batch)
                y_pred_batch = softmax(out)
                batch_loss = self.loss(y_batch, y_pred_batch)
                grad = self.loss.grad(y_batch, y_pred_batch)
                _, _ = self.backward(grad)
                self.update()
                loss += batch_loss

                if self.verbose:
                    fstr = "\t[Batch {}/{}] Train loss: {:.3f} ({:.1f}s/batch)"
                    print(fstr.format(j + 1, n_batch, batch_loss, time.time() - batch_start))

            loss /= n_batch
            if epo_verbose:
                fstr = "[Epoch {}] Avg. loss: {:.3f}  Delta: {:.3f} ({:.2f}m/epoch)"
                print(fstr.format(i + 1, loss, prev_loss - loss, (time.time() - epoch_start) / 60.0))
            prev_loss = loss
            
    def evaluate(self, X_test, y_test, batch_size=128):
        acc = 0.0
        batch_generator, n_batch = minibatch(X_test, batch_size, shuffle=True)
        for j, batch_idx in enumerate(batch_generator):
            batch_len, batch_start = len(batch_idx), time.time()
            X_batch, y_batch = X_test[batch_idx], y_test[batch_idx]
            y_pred_batch, _ = self.forward(X_batch)
            y_pred_batch = np.argmax(y_pred_batch, axis=1)
            y_batch = np.argmax(y_batch, axis=1)
            acc += np.sum(y_pred_batch == y_batch)
        return acc / X_test.shape[0]
    
    @property
    def hyperparams(self):
        return {
            "init_w": self.init_w,
            "loss": str(self.loss),
            "optimizer": self.optimizer,
            "fc3_out": self.fc3_out, 
            "fc4_out": self.fc4_out,
            "fc5_out": self.fc5_out,
            "conv1_pad": self.conv1_pad, 
            "conv2_pad": self.conv2_pad, 
            "conv1_stride": self.conv1_stride,
            "conv1_out_ch": self.conv1_out_ch,
            "pool1_stride": self.pool1_stride,
            "conv2_out_ch": self.conv2_out_ch,
            "conv2_stride": self.conv2_stride, 
            "pool2_stride": self.pool2_stride,
            "conv2_kernel_shape": self.conv2_kernel_shape,
            "pool2_kernel_shape": self.pool2_kernel_shape,
            "conv1_kernel_shape": self.conv1_kernel_shape,
            "pool1_kernel_shape": self.pool1_kernel_shape,
            "components": {k: v.params for k, v in self.layers.items()}
        }
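

# Illustrative sketch, not part of the original module: fit() applies softmax to
# the FC5 output and then asks the loss for a gradient to start backpropagation.
# A common formulation is that the gradient of softmax + cross-entropy with
# respect to the logits is (p - y). Whether the repository's CrossEntropy.grad
# returns exactly this quantity is not shown in this section, so treat the code
# below as an assumption; all *_sketch names are hypothetical.
```python
import numpy as np


def softmax_sketch(z):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def cross_entropy_sketch(y, p, eps=1e-12):
    """Summed cross-entropy between one-hot labels y and probabilities p."""
    return -np.sum(y * np.log(p + eps))


def logits_grad_sketch(y, p):
    """Analytic gradient of the loss above with respect to the logits."""
    return p - y


def numeric_logits_grad_sketch(y, z, h=1e-6):
    """Central-difference gradient, used only to check the analytic form."""
    g = np.zeros_like(z)
    for idx in np.ndindex(*z.shape):
        zp, zm = z.copy(), z.copy()
        zp[idx] += h
        zm[idx] -= h
        g[idx] = (cross_entropy_sketch(y, softmax_sketch(zp))
                  - cross_entropy_sketch(y, softmax_sketch(zm))) / (2 * h)
    return g
```
# Comparing logits_grad_sketch against numeric_logits_grad_sketch on a few random
# logits is a cheap sanity check before trusting a hand-written backward pass.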


================================================
FILE: code/method/__init__.py
================================================
from . import optimizer
from . import activation



================================================
FILE: code/method/activation/activation.py
================================================
from abc import ABC, abstractmethod
import numpy as np
import re


class ActivationBase(ABC):
    
    def __init__(self, **kwargs):
        super().__init__()

    def __call__(self, z):
        if z.ndim == 1:
            z = z.reshape(1, -1)
        return self.forward(z)

    @abstractmethod
    def forward(self, z):
        raise NotImplementedError

    @abstractmethod
    def grad(self, x, **kwargs):
        raise NotImplementedError


class Sigmoid(ActivationBase):
    """
    Sigmoid(x) = 1 / (1 + e^(-x))
    """

    def __init__(self):
        super().__init__()

    def __str__(self):
        return "Sigmoid"

    def forward(self, z):
        return 1 / (1 + np.exp(-z))

    def grad(self, x):
        return self.forward(x) * (1 - self.forward(x))
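

# Illustrative sketch, not part of the original module: every grad() method in
# this file can be sanity-checked against a central finite difference. The
# example does this for the sigmoid; the *_sketch names are hypothetical and the
# step size h is an arbitrary choice.
```python
import numpy as np


def sigmoid_sketch(z):
    return 1 / (1 + np.exp(-z))


def sigmoid_grad_sketch(z):
    s = sigmoid_sketch(z)
    return s * (1 - s)


def central_difference_sketch(f, z, h=1e-5):
    """Numerical derivative of an element-wise function f at z."""
    return (f(z + h) - f(z - h)) / (2 * h)


def max_grad_error_sketch(z):
    """Largest gap between the analytic and the numerical derivative."""
    numeric = central_difference_sketch(sigmoid_sketch, z)
    return np.max(np.abs(numeric - sigmoid_grad_sketch(z)))
```
# The same pattern applies to any activation here: swap in its forward and grad
# functions and look at the maximum absolute error.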


class Tanh(ActivationBase):
    """
    Tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
    """

    def __init__(self):
        super().__init__()

    def __str__(self):
        return "Tanh"

    def forward(self, z):
        return np.tanh(z)

    def grad(self, x):
        return 1 - np.tanh(x) ** 2
    
    
class ReLU(ActivationBase):
    """
    ReLU(x) =
            x   if x > 0
            0   otherwise
    """

    def __init__(self):
        super().__init__()

    def __str__(self):
        return "ReLU"

    def forward(self, z):
        return np.clip(z, 0, np.inf)

    def grad(self, x):
        return (x > 0).astype(int)


class LeakyReLU(ActivationBase):
    """
    LeakyReLU(x) =
            alpha * x   if x < 0
            x           otherwise
    """

    def __init__(self, alpha=0.3):
        self.alpha = alpha
        super().__init__()

    def __str__(self):
        return "Leaky ReLU(alpha={})".format(self.alpha)

    def forward(self, z):
        _z = z.copy()
        _z[z < 0] = _z[z < 0] * self.alpha
        return _z

    def grad(self, x):
        out = np.ones_like(x)
        out[x < 0] *= self.alpha
        return out


class Affine(ActivationBase):
    """
    Affine(x) = slope * x + intercept
    """

    def __init__(self, slope=1, intercept=0):
        self.slope = slope
        self.intercept = intercept
        super().__init__()

    def __str__(self):
        return "Affine(slope={}, intercept={})".format(self.slope, self.intercept)

    def forward(self, z):
        return self.slope * z + self.intercept

    def grad(self, x):
        return self.slope * np.ones_like(x)


class SoftPlus(ActivationBase):
    """
    SoftPlus(x) = log(1 + e^x)
    """

    def __init__(self):
        super().__init__()

    def __str__(self):
        return "SoftPlus"

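    def forward(self, z):
        # log(1 + e^z) computed as logaddexp(0, z), which does not overflow for large z
        return np.logaddexp(0, z)

    def grad(self, x):
        # e^x / (1 + e^x) rewritten as the sigmoid 1 / (1 + e^(-x)) to avoid inf / inf
        return 1 / (1 + np.exp(-x))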
    
    
class ELU(ActivationBase):
    """
    ELU(x) =
            x                   if x >= 0
            alpha * (e^x - 1)   otherwise
    """

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        super().__init__()

    def __str__(self):
        return "ELU(alpha={})".format(self.alpha)

    def forward(self, z):
        return np.where(z > 0, z, self.alpha * (np.exp(z) - 1))

    def grad(self, x):
        return np.where(x >= 0, np.ones_like(x), self.alpha * np.exp(x))


class Exponential(ActivationBase):
    """
    Exponential(x) = e^x
    """

    def __init__(self):
        super().__init__()

    def __str__(self):
        return "Exponential"

    def forward(self, z):
        return np.exp(z)

    def grad(self, x):
        return np.exp(x)


class SELU(ActivationBase):
    """
    SELU(x) = scale * ELU(x, alpha)
            = scale * x                     if x >= 0
              scale * [alpha * (e^x - 1)]   otherwise
    """

    def __init__(self):
        self.alpha = 1.6732632423543772848170429916717
        self.scale = 1.0507009873554804934193349852946
        self.elu = ELU(alpha=self.alpha)
        super().__init__()

    def __str__(self):
        return "SELU"

    def forward(self, z):
        return self.scale * self.elu.forward(z)

    def grad(self, x):
        return np.where(
            x >= 0, np.ones_like(x) * self.scale, np.exp(x) * self.alpha * self.scale
        )


class HardSigmoid(ActivationBase):
    """
    HardSigmoid(x) =
            0               if x < -2.5
            0.2 * x + 0.5   if -2.5 <= x <= 2.5
            1               if x > 2.5
    """

    def __init__(self):
        super().__init__()

    def __str__(self):
        return "Hard Sigmoid"

    def forward(self, z):
        return np.clip((0.2 * z) + 0.5, 0.0, 1.0)

    def grad(self, x):
        return np.where((x >= -2.5) & (x <= 2.5), 0.2, 0)

    
class ActivationInitializer(object):
    
    def __init__(self, acti_name="affine(slope=1, intercept=0)"):
        self.acti_name = acti_name

    def __call__(self):
        acti_str = self.acti_name.lower()
        if acti_str == "relu":
            acti_fn = ReLU()
        elif acti_str == "tanh":
            acti_fn = Tanh()
        elif acti_str == "sigmoid":
            acti_fn = Sigmoid()
        elif "affine" in acti_str:
            r = r"affine\(slope=(.*), intercept=(.*)\)"
            slope, intercept = re.match(r, acti_str).groups()
            acti_fn = Affine(float(slope), float(intercept))
        elif "leaky relu" in acti_str:
            r = r"leaky relu\(alpha=(.*)\)"
            alpha = re.match(r, acti_str).groups()[0]
            acti_fn = LeakyReLU(float(alpha))
        else:
            raise ValueError("Unknown activation: {}".format(acti_str))
        return acti_fn


================================================
FILE: code/method/optimizer/optimizer.py
================================================
from abc import ABC, abstractmethod
import numpy as np
import re


class OptimizerBase(ABC):
    
    def __init__(self):
        pass
        
    def __call__(self, params, params_grad, params_name):
        """
        Parameters:
        params: parameter to update, e.g. the weight matrix W
        params_grad: gradient of that parameter
        params_name: name of that parameter
        """
        return self.update(params, params_grad, params_name)
    
    @abstractmethod
    def update(self, params, params_grad, params_name):
        raise NotImplementedError

        
class SGD(OptimizerBase):
    """
    Stochastic gradient descent (SGD) optimizer.
    """
    
    def __init__(self, lr=0.01):
        super().__init__()
        self.lr = lr 
        self.cache = {}
        
    def __str__(self):
        return "SGD(lr={})".format(self.hyperparams["lr"])
    
    def update(self, params, params_grad, params_name):
        update_value = self.lr * params_grad
        return params - update_value
    
    @property
    def hyperparams(self):
        return {
            "op": "SGD",
            "lr": self.lr
        }

class Momentum(OptimizerBase):
    
    def __init__(
        self, lr=0.001, momentum=0.0, **kwargs
    ):
        """
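        Parameters:
        lr: learning rate, float (default: 0.001)
        momentum: float in [0, 1], the alpha coefficient of the momentum method;
            controls how quickly the contribution of past gradients decays (default: 0.0)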
        """
        super().__init__()
        self.lr = lr 
        self.momentum = momentum
        self.cache = {}

    def __str__(self):
        return "Momentum(lr={}, momentum={})".format(self.lr, self.momentum)

    def update(self, param, param_grad, param_name):
        C = self.cache
        lr, momentum = self.lr, self.momentum

        if param_name not in C:  # save v
            C[param_name] = np.zeros_like(param_grad)

        update = momentum * C[param_name] - lr * param_grad
        self.cache[param_name] = update
        return param + update
    
    @property
    def hyperparams(self):
        return {
            "op": "Momentum",
            "lr": self.lr,
            "momentum": self.momentum
        }
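

# Illustrative sketch, not part of the original module: the update rule used by
# Momentum.update, v <- momentum * v - lr * grad followed by param <- param + v,
# applied to the toy objective f(w) = 0.5 * ||w||^2, whose gradient is w itself.
# The function names, the objective and the hyperparameter values are assumptions
# chosen only for illustration.
```python
import numpy as np


def momentum_step_sketch(param, grad, velocity, lr=0.1, momentum=0.9):
    """One momentum update; returns the new parameter and the new velocity."""
    velocity = momentum * velocity - lr * grad
    return param + velocity, velocity


def minimize_quadratic_sketch(w0, n_steps=200, lr=0.1, momentum=0.9):
    """Run momentum updates on f(w) = 0.5 * ||w||^2 starting from w0."""
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(n_steps):
        grad = w  # gradient of the toy objective
        w, v = momentum_step_sketch(w, grad, v, lr, momentum)
    return w
```
# With momentum=0.0 the rule reduces to plain gradient descent; larger values let
# past gradients keep pushing the parameter in the same direction.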
    

class AdaGrad(OptimizerBase):

    def __init__(self, lr=0.001, eps=1e-7, **kwargs):
        """
        Parameters:
        lr: learning rate, float (default: 0.001)
        eps: small constant (the delta term) that keeps the denominator away from zero
        """
        super().__init__()
        self.lr = lr
        self.eps = eps
        self.cache = {}

    def __str__(self):
        return "AdaGrad(lr={}, eps={})".format(self.lr, self.eps)

    def update(self, param, param_grad, param_name):
        C = self.cache
        lr, eps = self.hyperparams["lr"], self.hyperparams["eps"]

        if param_name not in C:  # save r
            C[param_name] = np.zeros_like(param_grad)

        C[param_name] += param_grad ** 2
        update = lr * param_grad / (np.sqrt(C[param_name]) + eps)
        self.cache = C
        return param - update

    @property
    def hyperparams(self):
        return {
            "op": "AdaGrad",
            "lr": self.lr,
            "eps": self.eps
        }
    
    
class RMSProp(OptimizerBase):
    
    def __init__(
        self, lr=0.001, decay=0.9, eps=1e-7, **kwargs
    ):
        """
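        Parameters:
        lr: learning rate, float (default: 0.001)
        eps: small constant (the delta term) that keeps the denominator away from zero
        decay: decay rate of the running average of squared gradients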
        """
        super().__init__()
        self.lr = lr
        self.eps = eps
        self.decay = decay
        self.cache = {}

    def __str__(self):
        return "RMSProp(lr={}, eps={}, decay={})".format(
            self.lr, self.eps, self.decay
        )

    def update(self, param, param_grad, param_name):
        C = self.cache
        lr, eps = self.hyperparams["lr"], self.hyperparams["eps"]
        decay = self.hyperparams["decay"]

        if param_name not in C:  # save r
            C[param_name] = np.zeros_like(param_grad)

        C[param_name] = decay * C[param_name] + (1 - decay) * param_grad ** 2
        update = lr * param_grad / (np.sqrt(C[param_name]) + eps)
        self.cache = C
        return param - update
    
    @property
    def hyperparams(self):
        return {
            "op": "RMSProp",
            "lr": self.lr,
            "eps": self.eps,
            "decay": self.decay
        }    
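

# Illustrative sketch, not part of the original module: RMSProp replaces the
# running sum with an exponential moving average, so under a constant gradient
# the step magnitude settles near lr instead of shrinking toward 0. Because r
# starts at 0 and is not bias-corrected in this class, the first step is
# larger, about lr / sqrt(1 - decay). The helper name and its defaults are
# assumptions made only for this example.
def _rmsprop_step_sizes(grad=2.0, lr=0.001, decay=0.9, eps=1e-7, n_steps=200):
    import math  # local import keeps the sketch self-contained

    r, steps = 0.0, []
    for _ in range(n_steps):
        r = decay * r + (1 - decay) * grad ** 2         # moving average of grad ** 2
        steps.append(lr * grad / (math.sqrt(r) + eps))  # same form as update
    return steps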
    
    
class AdaDelta(OptimizerBase):
    
    def __init__(
        self, lr=0.001, decay=0.95, eps=1e-7, **kwargs
    ):
        """
        Parameters:
        lr: learning rate, float (default: 0.001); stored but not used by update()
        eps: delta term that keeps the denominator from being 0
        decay: decay rate
        """
        super().__init__()
        self.lr = lr
        self.eps = eps
        self.decay = decay
        self.cache = {}

    def __str__(self):
        return "AdaDelta(eps={}, decay={})".format(self.eps, self.decay)

    def update(self, param, param_grad, param_name):
        C = self.cache
        eps = self.hyperparams["eps"]
        decay = self.hyperparams["decay"]

        if param_name not in C:  # save r, delta_theta
            C[param_name] = {
                "r": np.zeros_like(param_grad),
                "d": np.zeros_like(param_grad)
            }

        C[param_name]["r"] = decay * C[param_name]["r"] + (1 - decay) * param_grad ** 2
        update = (np.sqrt(C[param_name]["d"] + eps)) * param_grad / (np.sqrt(C[param_name]["r"]) + eps)
        C[param_name]["d"] = decay * C[param_name]["d"] + (1 - decay) * update ** 2
        self.cache = C
        return param - update
    
    @property
    def hyperparams(self):
        return {
            "op": "AdaDelta",
            "eps": self.eps,
            "decay": self.decay
        }
    
    
class Adam(OptimizerBase):
    
    def __init__(
        self,
        lr=0.001,
        decay1=0.9,
        decay2=0.999,
        eps=1e-7,
        **kwargs
    ):
        """
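        Parameters:
        lr: learning rate, float (default: 0.001)
        eps: delta term that keeps the denominator from being 0
        decay1: exponential decay rate of past gradients; can be read as tracking the gradient mean (default: 0.9)
        decay2: exponential decay rate of past squared gradients; can be read as tracking the gradient variance (default: 0.999)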
        """
        super().__init__()
        self.lr = lr
        self.decay1 = decay1
        self.decay2 = decay2
        self.eps = eps
        self.cache = {}

    def __str__(self):
        return "Adam(lr={}, decay1={}, decay2={}, eps={})".format(
            self.lr, self.decay1, self.decay2, self.eps
        )

    def update(self, param, param_grad, param_name, cur_loss=None):
        C = self.cache
        d1, d2 = self.hyperparams["decay1"], self.hyperparams["decay2"]
        lr, eps = self.hyperparams["lr"], self.hyperparams["eps"]

        if param_name not in C:
            C[param_name] = {
                "t": 0,
                "mean": np.zeros_like(param_grad),
                "var": np.zeros_like(param_grad),
            }

        t = C[param_name]["t"] + 1
        mean = C[param_name]["mean"]
        var = C[param_name]["var"]

        C[param_name]["t"] = t
        C[param_name]["mean"] = d1 * mean + (1 - d1) * param_grad
        C[param_name]["var"] = d2 * var + (1 - d2) * param_grad ** 2
        self.cache = C

        m_hat = C[param_name]["mean"] / (1 - d1 ** t)
        v_hat = C[param_name]["var"] / (1 - d2 ** t)
        update = lr * m_hat / (np.sqrt(v_hat) + eps)
        return param - update

    @property
    def hyperparams(self):
        return {
            "op": "Adam",
            "lr": self.lr,
            "eps": self.eps,
            "decay1": self.decay1,
            "decay2": self.decay2
        }    
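

# Illustrative sketch, not part of the original module: the bias correction in
# Adam.update divides the moving averages by (1 - decay ** t). At t = 1 this
# recovers m_hat = grad and v_hat = grad ** 2, so the first update has
# magnitude close to lr as long as |grad| is much larger than eps. The helper
# name and its defaults are assumptions made only for this example.
def _adam_first_update(grad, lr=0.001, decay1=0.9, decay2=0.999, eps=1e-7):
    import math  # local import keeps the sketch self-contained

    t = 1
    mean = (1 - decay1) * grad        # first moment after one step
    var = (1 - decay2) * grad ** 2    # second moment after one step
    m_hat = mean / (1 - decay1 ** t)  # bias-corrected first moment
    v_hat = var / (1 - decay2 ** t)   # bias-corrected second moment
    return lr * m_hat / (math.sqrt(v_hat) + eps)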
    
    
class OptimizerInitializer(ABC):
    
    def __init__(self, opti_name="sgd"):
        self.opti_name = opti_name
    
    def __call__(self):
        r = r"([a-zA-Z_][a-zA-Z0-9_]*)=([^,)]*)"  # keys may contain digits, e.g. decay1
        opti_str = self.opti_name.lower()
        kwargs = dict([(i, eval(j)) for (i, j) in re.findall(r, opti_str)])
        if "sgd" in opti_str:
            optimizer = SGD(**kwargs)
        elif "momentum" in opti_str:
            optimizer = Momentum(**kwargs)    
        elif "adagrad" in opti_str:
            optimizer = AdaGrad(**kwargs)
        elif "rmsprop" in opti_str:
            optimizer = RMSProp(**kwargs)
        elif "adadelta" in opti_str:
            optimizer = AdaDelta(**kwargs)
        elif "adam" in opti_str:
            optimizer = Adam(**kwargs)
        else:
            raise NotImplementedError("{}".format(opti_str))
        return optimizer
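

# Illustrative sketch, not part of the original module: the initializer above
# turns a string such as "adam(lr=0.01)" into keyword arguments with eval().
# Assuming only Python literals (numbers) need to be supported, ast.literal_eval
# is a more restrictive way to read the values. The helper name and the key
# pattern used here (letters, digits, underscores) are assumptions made only
# for this example.
def _parse_optimizer_kwargs(opti_name):
    import ast
    import re

    pattern = r"([a-z_][a-z0-9_]*)\s*=\s*([^,)]*)"
    pairs = re.findall(pattern, opti_name.lower())
    # literal_eval accepts Python literals only and raises on anything else
    return {key: ast.literal_eval(value.strip()) for key, value in pairs}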
        


================================================
FILE: code/method/weight/weight.py
================================================
from abc import ABC, abstractmethod
import numpy as np
import re


def calc_fan(weight_shape):
    """
    Compute fan-in and fan-out for a weight matrix.

    Parameters:
    weight_shape: shape of the weights
    """
    if len(weight_shape) == 2:  
        fan_in, fan_out = weight_shape
    elif len(weight_shape) in [3, 4]:
        in_ch, out_ch = weight_shape[-2:]
        kernel_size = np.prod(weight_shape[:-2])
        fan_in, fan_out = in_ch * kernel_size, out_ch * kernel_size
    else:
        raise ValueError("Unrecognized weight dimension: {}".format(weight_shape))
    return fan_in, fan_out
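

# Illustrative sketch, not part of the original module: a worked instance of
# the 3-D/4-D branch of calc_fan, assuming the weight layout implied there,
# (kernel_h, kernel_w, in_ch, out_ch). For a 3x3 kernel with 16 input and 32
# output channels, fan_in = 16 * 9 = 144 and fan_out = 32 * 9 = 288. The helper
# name is hypothetical; math.prod needs Python 3.8 or newer.
def _calc_fan_example(weight_shape=(3, 3, 16, 32)):
    import math  # local import keeps the sketch self-contained

    in_ch, out_ch = weight_shape[-2:]
    kernel_size = math.prod(weight_shape[:-2])
    fan_in, fan_out = in_ch * kernel_size, out_ch * kernel_size
    he_std = math.sqrt(2 / fan_in)  # std that he_normal computes for this shape
    return fan_in, fan_out, he_std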


class random_uniform:
    """
    Initialize the network weights W from Uniform(-b, b).

    Parameters:
    weight_shape: shape of the weights
    """
    def __init__(self, b=1.0):
        self.b = b
        
    def __call__(self, weight_shape):
        return np.random.uniform(-self.b, self.b, size=weight_shape)


class random_normal:
    """
    Initialize the network weights W from TruncatedNormal(0, std).

    Parameters:
    weight_shape: shape of the weights
    std: standard deviation of the weights
    """
    def __init__(self, std=0.01):
        self.std = std
        
    def __call__(self, weight_shape):
        return truncated_normal(0, self.std, weight_shape)

    
# def random_uniform(weight_shape, b=1.0):
#     """
#     Initialize the network weights W from Uniform(-b, b).

#     Parameters:
#     weight_shape: shape of the weights
#     """
#     return np.random.uniform(-b, b, size=weight_shape)


# def random_normal(weight_shape, std=1.0):
#     """
#     Initialize the network weights W from TruncatedNormal(0, std).

#     Parameters:
#     weight_shape: shape of the weights
#     std: standard deviation of the weights
#     """
#     return truncated_normal(0, std, weight_shape)
    

class he_uniform:
    """
    Initialize the network weights W from Uniform(-b, b) with b = sqrt(6 / fan_in); commonly used with ReLU activation layers.

    Parameters:
    weight_shape: shape of the weights
    """
    def __init__(self):
        pass
    
    def __call__(self, weight_shape):
        fan_in, fan_out = calc_fan(weight_shape)
        b = np.sqrt(6 / fan_in)
        return np.random.uniform(-b, b, size=weight_shape)
    
    
class he_normal:
    """
    Initialize the network weights W from TruncatedNormal(0, std) with std = sqrt(2 / fan_in); commonly used with ReLU activation layers.

    Parameters:
    weight_shape: shape of the weights
    """
    def __init__(self):
        pass
    
    def __call__(self, weight_shape):
        fan_in, fan_out = calc_fan(weight_shape)
        std = np.sqrt(2 / fan_in)
        return truncated_normal(0, std, weight_shape)
    
    
    
# def he_uniform(weight_shape):
#     """
#     Initialize the network weights W from Uniform(-b, b) with b = sqrt(6 / fan_in); commonly used with ReLU activation layers.

#     Parameters:
#     weight_shape: shape of the weights
#     """
#     fan_in, fan_out = calc_fan(weight_shape)
#     b = np.sqrt(6 / fan_in)
#     return np.random.uniform(-b, b, size=weight_shape)
    
    
# def he_normal(weight_shape):
#     """
#     Initialize the network weights W from TruncatedNormal(0, std) with std = sqrt(2 / fan_in); commonly used with ReLU activation layers.

#     Parameters:
#     weight_shape: shape of the weights
#     """
#     fan_in, fan_out = calc_fan(weight_shape)
#     std = np.sqrt(2 / fan_in)
#     return truncated_normal(0, std, weight_shape)
    

class glorot_uniform:
    """
    Initialize the network weights W from Uniform(-b, b) with b = gain * sqrt(6 / (fan_in + fan_out));
    commonly used with tanh and sigmoid activation layers.

    Parameters:
    weight_shape: shape of the weights
    """
    def __init__(self, gain=1.0):
        self.gain = gain
        
    def __call__(self, weight_shape):
        fan_in, fan_out = calc_fan(weight_shape)
        b = self.gain * np.sqrt(6 / (fan_in + fan_out))
        return np.random.uniform(-b, b, size=weight_shape)
    

class glorot_normal:
    """
    Initialize the network weights W from TruncatedNormal(0, std) with std = gain * sqrt(2 / (fan_in + fan_out));
    commonly used with tanh and sigmoid activation layers.

    Parameters:
    weight_shape: shape of the weights
    """
    def __init__(self, gain=1.0):
        self.gain = gain
        
    def __call__(self, weight_shape):
        fan_in, fan_out = calc_fan(weight_shape)
        std = self.gain * np.sqrt(2 / (fan_in + fan_out))
        return truncated_normal(0, std, weight_shape)
    
    
# def glorot_uniform(weight_shape, gain=1.0):
#     """
#     Initialize the network weights W from Uniform(-b, b) with b = gain * sqrt(6 / (fan_in + fan_out));
#     commonly used with tanh and sigmoid activation layers.

#     Parameters:
#     weight_shape: shape of the weights
#     """
#     fan_in, fan_out = calc_fan(weight_shape)
#     b = gain * np.sqrt(6 / (fan_in + fan_out))
#     return np.random.uniform(-b, b, size=weight_shape)
    
    
# def glorot_normal(weight_shape, gain=1.0):
#     """
#     Initialize the network weights W from TruncatedNormal(0, std) with std = gain * sqrt(2 / (fan_in + fan_out));
#     commonly used with tanh and sigmoid activation layers.

#     Parameters:
#     weight_shape: shape of the weights
#     """
#     fan_in, fan_out = calc_fan(weight_shape)
#     std = gain * np.sqrt(2 / (fan_in + fan_out))
#     return truncated_normal(0, std, weight_shape)


def truncated_normal(mean, std, out_shape):
    """
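    Generate a truncated normal distribution by rejection sampling.

    Parameters:
    mean: mean of the normal distribution
    std: standard deviation of the normal distribution
    out_shape: shape of the output matrix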
    """
    samples = np.random.normal(loc=mean, scale=std, size=out_shape)
    reject = np.logical_or(samples >= mean + 2 * std, samples <= mean - 2 * std)
    while reject.any():
        resamples = np.random.normal(loc=mean, scale=std, size=reject.sum())
        samples[reject] = resamples
        reject = np.logical_or(samples >= mean + 2 * std, samples <= mean - 2 * std)
    return samples
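

# Illustrative sketch, not part of the original module: the same rejection
# rule as truncated_normal for a single scalar, using only the standard
# library. Draws outside (mean - 2 * std, mean + 2 * std) are discarded and
# redrawn. The helper name and the rng argument are assumptions made only for
# this example.
def _truncated_normal_scalar(mean=0.0, std=1.0, rng=None):
    import random  # local import keeps the sketch self-contained

    rng = rng or random.Random()
    while True:
        sample = rng.gauss(mean, std)
        if mean - 2 * std < sample < mean + 2 * std:  # keep only interior draws
            return sample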

    
class WeightInitializer(object):

    def __init__(self, mode="he_normal"):
        """
        mode: weight initialization strategy, str (default: 'he_normal')
        """
        self.mode = mode
        r = r"([a-zA-Z]*)=([^,)]*)"
        mode_str = self.mode.lower()
        kwargs = dict([(i, eval(j)) for (i, j) in re.findall(r, mode_str)])
        
        if "random_uniform" in mode_str:
            self.init_fn = random_uniform(**kwargs)
        elif "random_normal" in mode_str:
            self.init_fn = random_normal(**kwargs)
        elif "he_uniform" in mode_str:
            self.init_fn = he_uniform(**kwargs)
        elif "he_normal" in mode_str:
            self.init_fn = he_normal(**kwargs)
        elif "glorot_uniform" in mode_str:
            self.init_fn = glorot_uniform(**kwargs)
        elif "glorot_normal" in mode_str:
            self.init_fn = glorot_normal(**kwargs)
        else:
            raise ValueError("Unrecognized initialization mode: {}".format(mode_str))
    
    def __call__(self, weight_shape):
        W = self.init_fn(weight_shape)
        return W


================================================
FILE: contents.txt
================================================
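Note: this table of contents follows that of the book Deep Learning (《深度学习》). Given the content of this project it could be subdivided further, but it stops at the third level here.

**Table of Contents**: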

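- Chapter 2 Linear Algebra
  - 1 Scalars, Vectors, Matrices, and Tensors
  - 2 Matrix Transpose
  - 3 Matrix Addition
  - 4 Matrix Multiplication
  - 5 Identity Matrix
  - 6 Matrix Inverse
  - 7 Norms
  - 8 Eigendecomposition
  - 9 Singular Value Decomposition
  - 10 PCA (Principal Component Analysis)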


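- Chapter 3 Probability and Information Theory
  - 1 Probability
    - 1.1 Probability and Random Variables
    - 1.2 Probability Distributions
      - 1.2.1 Probability Mass Function
      - 1.2.2 Probability Density Function
      - 1.2.3 Cumulative Distribution Function
    - 1.3 Conditional Probability and Conditional Independence
    - 1.4 Measures of Random Variables
    - 1.5 Common Probability Distributions
      - 1.5.1 Bernoulli Distribution (Two-Point Distribution)
      - 1.5.2 Categorical Distribution (Multinoulli Distribution)
      - 1.5.3 Gaussian Distribution (Normal Distribution)
      - 1.5.4 Multivariate Gaussian Distribution (Multivariate Normal Distribution)
      - 1.5.5 Exponential Distribution
      - 1.5.6 Laplace Distribution
      - 1.5.7 Dirac Distribution
    - 1.6 Useful Properties of Common Functions
      - 1.6.1 Logistic Sigmoid Function
      - 1.6.2 Softplus Function
  - 2 Information Theory
  - 3 Graphical Models
    - 3.1 Directed Graphical Models
      - 3.1.1 Independence in Bayesian Networks
    - 3.2 Undirected Graphical Models
      - 3.2.1 Conditional Independence in Markov Networks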


- Chapter 4 Numerical Computation
  - 1 Overflow and Underflow
  - 2 Optimization Methods
    - 2.1 Gradient Descent
    - 2.2 Newton's Method
    - 2.3 Constrained Optimization


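- Chapter 5 Machine Learning Basics
  - 1 Learning Algorithms
    - 1.1 Example: Linear Regression
  - 2 Capacity, Overfitting, and Underfitting
    - 2.1 Generalization
    - 2.2 Capacity
  - 3 Hyperparameters and Validation Sets
  - 4 Bias and Variance
    - 4.1 Bias
    - 4.2 Variance
    - 4.3 Relationship Between Error, Bias, and Variance
  - 5 Maximum Likelihood Estimation
  - 6 Bayesian Statistics
  - 7 Maximum A Posteriori Estimation
    - 7.1 Example: Linear Regression
  - 8 Supervised Learning Methods
    - 8.1 Probabilistic Supervised Learning
    - 8.2 Support Vector Machines
      - 8.2.1 Kernel Trick
    - 8.3 k-Nearest Neighbors
    - 8.4 Decision Trees
      - 8.4.1 Feature Selection
      - 8.4.2 Decision Tree Generation
      - 8.4.3 Decision Tree Regularization
  - 9 Unsupervised Learning Methods
    - 9.1 Principal Component Analysis
    - 9.2 k-Means Clustering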


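- Chapter 6 Deep Feedforward Networks
  - 1 Deep Feedforward Networks
  - 2 DFN Design
    - 2.1 Hidden Units
    - 2.2 Output Units
    - 2.3 Cost Functions
    - 2.4 Architecture Design
  - 3 Back-Propagation Algorithm
    - 3.1 Training a Single Neuron
    - 3.2 Training a Multilayer Neural Network
      - 3.2.1 Defining Weight Initialization Methods
      - 3.2.2 Defining Activation Functions
      - 3.2.3 Defining Optimization Methods
      - 3.2.4 Defining the Layer Framework
      - 3.2.5 Defining Cost Functions
      - 3.2.6 Defining the Deep Feedforward Network
  - 4 Universal Approximation Theorem for Neural Networks
  - 5 Example: Learning XOR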


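- Chapter 7 Regularization for Deep Learning
  - 1 Parameter Norm Penalties
    - 1.1 L2 Regularization
    - 1.2 L1 Regularization
    - 1.3 Summary (Solutions of L2 and L1 Regularization)
    - 1.4 Norm Penalties as Constraints
    - 1.5 Under-Constrained Problems
  - 2 Data Augmentation
    - 2.1 Dataset Augmentation
    - 2.2 Noise Robustness
  - 3 Training Schemes
    - 3.1 Semi-Supervised Learning
    - 3.2 Multitask Learning
    - 3.3 Early Stopping
  - 4 Model Representation
    - 4.1 Parameter Tying and Sharing
    - 4.2 Sparse Representations
    - 4.3 Bagging and Other Ensemble Methods
      - 4.3.1 Bagging
      - 4.3.2 Random Forests
      - 4.3.3 How These Methods Address Overfitting
    - 4.4 Dropout
  - 5 Sample Tests
  - 6 Supplementary Material
    - 6.1 Boosting
      - 6.1.1 Forward Stagewise Additive Model
      - 6.1.2 AdaBoost Algorithm
      - 6.1.3 Boosting Tree and GBDT Algorithms
      - 6.1.4 XGBoost Algorithm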


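- Chapter 8 Optimization for Training Deep Models
  - 1 Basic Optimization Algorithms
    - 1.1 Gradient
      - 1.1.1 Gradient Descent
      - 1.1.2 Stochastic Gradient Descent
    - 1.2 Momentum
      - 1.2.1 Momentum Algorithm
      - 1.2.2 NAG Algorithm
    - 1.3 Adaptive Learning Rates
      - 1.3.1 AdaGrad Algorithm
      - 1.3.2 RMSProp Algorithm
      - 1.3.3 AdaDelta Algorithm
      - 1.3.4 Adam Algorithm
    - 1.4 Approximate Second-Order Methods
      - 1.4.1 Newton's Method
      - 1.4.2 Quasi-Newton Methods
  - 2 Optimization Strategies
    - 2.1 Parameter Initialization
  - 3 Batch Normalization
  - 4 Coordinate Descent
  - 5 Polyak Averaging
  - 6 Supervised Pretraining
  - 7 Designing Models to Aid Optimization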


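- Chapter 9 Convolutional Networks
  - 1 The Convolution Operation
  - 2 Pooling
  - 3 Convolution in Deep Learning Frameworks
    - 3.1 Multiple Parallel Convolutions
    - 3.2 Inputs and Kernels
    - 3.3 Padding
    - 3.4 Convolution Stride
  - 4 More Convolution Strategies
    - 4.1 Depthwise Separable Convolution
    - 4.2 Group Convolution
    - 4.3 Dilated Convolution
  - 5 GEMM Conversion
  - 6 Training Convolutional Networks
    - 6.1 Convolutional Network Diagram
    - 6.2 Single Convolutional/Pooling Layer
      - 6.2.1 Derivative of the Convolution Function and Backpropagation
      - 6.2.2 Derivative of the Pooling Function and Backpropagation
    - 6.3 Multiple Convolutional/Pooling Layers
    - 6.4 Flatten Layer & Fully Connected Layer
  - 7 Translation Equivariance
  - 8 Representative Convolutional Neural Networks
    - 8.1 Convolutional Neural Network (LeNet)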


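- Chapter 11 Practical Methodology
  - 1 Practical Methodology
  - 2 Performance Metrics
    - 2.1 Error Rate and Accuracy
    - 2.2 Precision, Recall, and F1 Score
      - 2.2.1 Confusion Matrix
      - 2.2.2 Definitions of Precision and Recall and Their Relationship
      - 2.2.3 F1 Score
    - 2.3 PR Curve
    - 2.4 ROC Curve and AUC
      - 2.4.1 ROC Curve
      - 2.4.2 Computing AUC
    - 2.5 Coverage
    - 2.6 Bottlenecks in Metric Performance
  - 3 Default Baseline Models
  - 4 Determining Whether to Gather More Data
  - 5 Selecting Hyperparameters
    - 5.1 Manual Hyperparameter Tuning
    - 5.2 Automatic Hyperparameter Optimization Algorithms
      - 5.2.1 Grid Search
      - 5.2.2 Random Search
      - 5.2.3 Model-Based Hyperparameter Optimization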


================================================
FILE: reference.txt
================================================
**References**:

- General references
  - https://github.com/exacity/deeplearningbook-chinese/

- Linear algebra
  - https://github.com/npetrakis/YearOfML

- Probability and information theory
  - https://github.com/akhilvasvani/Probability-and-Information-Theory
  - https://docs.scipy.org/doc/scipy/reference/stats.html
  - https://www.joinquant.com/view/community/detail/be3d8bc42275ea491897ac13fbf5838f
  - https://my.oschina.net/dillan/blog/134011
  
- Numerical computation

- Machine learning basics
  - 《统计学习方法》 (Statistical Learning Methods)
  - 《机器学习》 (Machine Learning)
  - https://github.com/paruby/ml-basics
  - https://github.com/akhilvasvani/machine-learning-basics
  - https://blog.csdn.net/ajianyingxiaoqinghan/article/details/72897399

- Deep feedforward networks
  - https://peterroelants.github.io/posts/cross-entropy-logistic/
  - https://blog.csdn.net/weixin_36586536/article/details/80468426
  - https://github.com/yg19930918/deep-learning-from-scratch-master
  - https://www.cnblogs.com/34fj/p/9036369.html
  - https://github.com/peterroelants/peterroelants.github.io

- Regularization for deep learning
  - https://zhuanlan.zhihu.com/p/35893078
  - https://www.jiqizhixin.com/articles/2017-06-23-5
  - https://www.zybuluo.com/songying/note/1400484
  - https://zhuanlan.zhihu.com/p/37120298
  - https://kevinzakka.github.io/2016/09/14/batch_normalization/
  - http://gitlinux.net/2018-10-29-xgboost/
  - https://medium.com/swlh/boosting-and-bagging-explained-with-examples-5353a36eb78d
  - http://www.ccs.neu.edu/home/vip/teach/MLcourse/4_boosting/slides/gradient_boosting.pdf
  - https://blog.csdn.net/liangjun_feng/article/details/79603705
  - https://blog.csdn.net/sinat_22594309/article/details/60957594
  - https://www.zybuluo.com/yxd/note/611571
  - http://freemind.pluskid.org/machine-learning/sparsity-and-some-basics-of-l1-regularization/#ed61992b37932e208ae114be75e42a3e6dc34cb3

- Optimization for training deep models
  - https://zhuanlan.zhihu.com/p/32626442
  - https://github.com/exacity/deeplearningbook-chinese
  - http://cthorey.github.io./backpropagation/
  - http://www.ludoart.cn/2019/02/22/Optimization-Methods/
  - https://blog.csdn.net/itplus/article/details/21897715
  
- Convolutional neural networks
  - https://www.slideshare.net/kuwajima/cnnbp
  - https://github.com/exacity/simplified-deeplearning
  - https://github.com/yg19930918/deep-learning-from-scratch-master
  - https://adventuresinmachinelearning.com/keras-tutorial-cnn-11-lines/
  - https://zhangting2020.github.io/2018/05/30/Transform-Invariance/
  - https://zh.gluon.ai/chapter_convolutional-neural-networks/lenet.html
  - https://zhuanlan.zhihu.com/p/32702031
  - https://blog.csdn.net/marsjhao/article/details/73088850

- Practical methodology
  - https://github.com/masakazu-ishihata/BayesianOptimization
  - https://github.com/bjzhao143/MLwithPython
  - https://medium.com/inveterate-learner/deep-learning-book-chapter-11-c6ad1d3c3c08
  - https://www.alexejgossmann.com/auc/
  - https://machinelearningmastery.com/roc-curves-and-precision-recall-curves-for-imbalanced-classification/
  - https://www.yuque.com/books/share/f4031f65-70c1-4909-ba01-c47c31398466/kqbfug
  - http://bridg.land/posts/gaussian-processes-1
  - https://zhuanlan.zhihu.com/p/76269142
 


================================================
FILE: update.txt
================================================
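**Changelog**:

2020/3/:

 	1. Revised the decision tree part of Chapter 5, adding the principles of ID3 and CART; the code implementation focuses on CART.
 	2. Chapter 7: added the derivation of the optimal solutions of L1 and L2 regularization (i.e., why L1 yields sparse solutions).
 	3. Chapter 7: added derivations and code implementations of ensemble learning methods, including Bagging (random forest) and Boosting (Adaboost, GBDT, XGBoost).
 	4. Chapter 8: added derivations of Newton's method and quasi-Newton methods (DFP, BFGS, L-BFGS).
 	5. Chapter 11: added derivations and code implementations of Gaussian process regression (GPR) and Bayesian optimization.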


SYMBOL INDEX (412 symbols across 9 files)

FILE: code/chapter 11.py
  function cal_conf_matrix (line 10) | def cal_conf_matrix(labels, preds):
  function cal_PRF1 (line 30) | def cal_PRF1(labels, preds):
  function cal_PRcurve (line 41) | def cal_PRcurve(labels, preds):
  function cal_ROCcurve (line 62) | def cal_ROCcurve(labels, preds):
  function timeit (line 84) | def timeit(func):
  function area_auc (line 98) | def area_auc(labels, preds):
  function naive_auc (line 113) | def naive_auc(labels, preds):
  class KernelBase (line 133) | class KernelBase(ABC):
    method __init__ (line 135) | def __init__(self):
    method _kernel (line 141) | def _kernel(self, X, Y):
    method __call__ (line 144) | def __call__(self, X, Y=None):
    method __str__ (line 147) | def __str__(self):
    method summary (line 152) | def summary(self):
  class RBFKernel (line 160) | class RBFKernel(KernelBase):
    method __init__ (line 162) | def __init__(self, sigma=None):
    method _kernel (line 170) | def _kernel(self, X, Y=None):
  class KernelInitializer (line 189) | class KernelInitializer(object):
    method __init__ (line 191) | def __init__(self, param=None):
    method __call__ (line 194) | def __call__(self):
  class GPRegression (line 205) | class GPRegression:
    method __init__ (line 209) | def __init__(self, kernel="RBFKernel", sigma=1e-10):
    method fit (line 214) | def fit(self, X, y):
    method predict (line 229) | def predict(self, X_star, conf_interval=0.95):
  class BayesianOptimization (line 252) | class BayesianOptimization:
    method __init__ (line 254) | def __init__(self):
    method acquisition_function (line 257) | def acquisition_function(self, Xsamples):
    method opt_acquisition (line 263) | def opt_acquisition(self, X, n_samples=20):
    method fit (line 273) | def fit(self, f, X, y):

FILE: code/chapter5.py
  class NaiveBayes (line 7) | class NaiveBayes():
    method __init__ (line 9) | def __init__(self):
    method fit (line 14) | def fit(self, X, y):
    method _calculate_prior (line 27) | def _calculate_prior(self, c):
    method _calculate_likelihood (line 34) | def _calculate_likelihood(self, mean, var, X):
    method _calculate_probabilities (line 44) | def _calculate_probabilities(self, X):
    method predict (line 57) | def predict(self, X):
    method score (line 61) | def score(self, X, y):
  function Sigmoid (line 68) | def Sigmoid(x):
  class LogisticRegression (line 71) | class LogisticRegression():
    method __init__ (line 73) | def __init__(self, learning_rate=.1):
    method _initialize_parameters (line 78) | def _initialize_parameters(self, X):
    method fit (line 84) | def fit(self, X, y, n_iterations=4000):
    method predict (line 93) | def predict(self, X):
    method score (line 97) | def score(self, X, y):
  function linear_kernel (line 107) | def linear_kernel(**kwargs):
  function polynomial_kernel (line 115) | def polynomial_kernel(power, coef, **kwargs):
  function rbf_kernel (line 123) | def rbf_kernel(gamma, **kwargs):
  class SupportVectorMachine (line 132) | class SupportVectorMachine():
    method __init__ (line 134) | def __init__(self, kernel=linear_kernel, power=4, gamma=None, coef=4):
    method fit (line 144) | def fit(self, X, y):
    method predict (line 192) | def predict(self, X):
    method score (line 204) | def score(self, X, y):
  class KNN (line 211) | class KNN():
    method __init__ (line 213) | def __init__(self, k=10):
    method fit (line 216) | def fit(self, X, y):
    method predict (line 222) | def predict(self, X):
    method predict (line 242) | def predict(self, X):
    method score (line 254) | def score(self, X, y):
  class DecisionNode (line 261) | class DecisionNode():
    method __init__ (line 263) | def __init__(self, feature_i=None, threshold=None,
  function divide_on_feature (line 272) | def divide_on_feature(X, feature_i, threshold):
  class DecisionTree (line 288) | class DecisionTree(object):
    method __init__ (line 290) | def __init__(self, min_samples_split=2, min_impurity=1e-7,
    method fit (line 300) | def fit(self, X, y):
    method _build_tree (line 304) | def _build_tree(self, X, y, current_depth=0):
    method predict_value (line 362) | def predict_value(self, x, tree=None):
    method predict (line 386) | def predict(self, X):
    method score (line 390) | def score(self, X, y):
    method print_tree (line 395) | def print_tree(self, tree=None, indent=" "):
  function calculate_entropy (line 412) | def calculate_entropy(y):
  function calculate_gini (line 423) | def calculate_gini(y):
  class ClassificationTree (line 433) | class ClassificationTree(DecisionTree):
    method _calculate_gini_index (line 437) | def _calculate_gini_index(self, y, y1, y2):
    method _calculate_information_gain (line 449) | def _calculate_information_gain(self, y, y1, y2):
    method _majority_vote (line 460) | def _majority_vote(self, y):
    method fit (line 473) | def fit(self, X, y):
  function calculate_mse (line 479) | def calculate_mse(y):
  function calculate_variance (line 483) | def calculate_variance(y):
  class RegressionTree (line 489) | class RegressionTree(DecisionTree):
    method _calculate_mse (line 493) | def _calculate_mse(self, y, y1, y2):
    method _calculate_variance_reduction (line 505) | def _calculate_variance_reduction(self, y, y1, y2):
    method _mean_of_y (line 517) | def _mean_of_y(self, y):
    method fit (line 524) | def fit(self, X, y):
  class PCA (line 531) | class PCA():
    method __init__ (line 533) | def __init__(self):
    method fit (line 536) | def fit(self, X, n_components):
  function distEclud (line 553) | def distEclud(x,y):
  function randomCent (line 559) | def randomCent(dataSet,k):
  class KMeans (line 570) | class KMeans():
    method __init__ (line 572) | def __init__(self):
    method fit (line 576) | def fit(self, dataSet, k):

FILE: code/chapter6.py
  function sigmoid (line 15) | def sigmoid(x):
  function softmax (line 19) | def softmax(x):
  class LayerBase (line 24) | class LayerBase(ABC):
    method __init__ (line 26) | def __init__(self, optimizer="sgd"):
    method _init_params (line 34) | def _init_params(self, **kwargs):
    method forward (line 41) | def forward(self, X, **kwargs):
    method backward (line 48) | def backward(self, out, **kwargs):
    method flush_gradients (line 54) | def flush_gradients(self):
    method update (line 65) | def update(self):
  class FullyConnected (line 74) | class FullyConnected(LayerBase):
    method __init__ (line 79) | def __init__(self, n_out, acti_fn, init_w, optimizer=None):
    method _init_params (line 96) | def _init_params(self):
    method forward (line 104) | def forward(self, X, retain_derived=True):
    method backward (line 126) | def backward(self, dLda, retain_grads=True):
    method _bwd (line 149) | def _bwd(self, dLda, X):
    method hyperparams (line 162) | def hyperparams(self):
  class ObjectiveBase (line 178) | class ObjectiveBase(ABC):
    method __init__ (line 180) | def __init__(self):
    method loss (line 184) | def loss(self, y_true, y_pred):
    method grad (line 191) | def grad(self, y_true, y_pred, **kwargs):
  class SquaredError (line 198) | class SquaredError(ObjectiveBase):
    method __init__ (line 202) | def __init__(self):
    method __call__ (line 205) | def __call__(self, y_true, y_pred):
    method __str__ (line 208) | def __str__(self):
    method loss (line 212) | def loss(y_true, y_pred):
    method grad (line 222) | def grad(y_true, y_pred, z, acti_fn):
  class CrossEntropy (line 227) | class CrossEntropy(ObjectiveBase):
    method __init__ (line 231) | def __init__(self):
    method __call__ (line 234) | def __call__(self, y_true, y_pred):
    method __str__ (line 237) | def __str__(self):
    method loss (line 241) | def loss(y_true, y_pred):
    method grad (line 253) | def grad(y_true, y_pred):
  function minibatch (line 259) | def minibatch(X, batchsize=256, shuffle=True):
  class DFN (line 277) | class DFN(object):
    method __init__ (line 279) | def __init__(
    method _set_params (line 294) | def _set_params(self):
    method forward (line 314) | def forward(self, X_train):
    method backward (line 322) | def backward(self, grad):
    method update (line 330) | def update(self):
    method flush_gradients (line 338) | def flush_gradients(self, curr_loss=None):
    method fit (line 345) | def fit(self, X_train, y_train, n_epochs=20, batch_size=64, verbose=Fa...
    method evaluate (line 389) | def evaluate(self, X_test, y_test, batch_size=128):
    method hyperparams (line 402) | def hyperparams(self):

FILE: code/chapter7.py
  class RegularizerBase (line 9) | class RegularizerBase(ABC):
    method __init__ (line 11) | def __init__(self, **kwargs):
    method loss (line 15) | def loss(self, **kwargs):
    method grad (line 19) | def grad(self, **kwargs):
  class L1Regularizer (line 22) | class L1Regularizer(RegularizerBase):
    method __init__ (line 24) | def __init__(self, lambd=0.001):
    method loss (line 28) | def loss(self, params):
    method grad (line 36) | def grad(self, params):
  class L2Regularizer (line 41) | class L2Regularizer(RegularizerBase):
    method __init__ (line 43) | def __init__(self, lambd=0.001):
    method loss (line 47) | def loss(self, params):
    method grad (line 53) | def grad(self, params):
  class RegularizerInitializer (line 58) | class RegularizerInitializer(object):
    method __init__ (line 60) | def __init__(self, regular_name="l2"):
    method __call__ (line 63) | def __call__(self):
  class Image (line 77) | class Image(object):
    method __init__ (line 79) | def __init__(self, image):
    method _set_params (line 82) | def _set_params(self, image):
    method Translation (line 88) | def Translation(self, delta_x, delta_y):
    method Resize (line 100) | def Resize(self, alpha):
    method HorMirror (line 111) | def HorMirror(self):
    method VerMirror (line 119) | def VerMirror(self):
    method Rotate (line 127) | def Rotate(self, angle):
    method operate (line 138) | def operate(self):
    method __call__ (line 153) | def __call__(self, act):
  function early_stopping (line 171) | def early_stopping(valid):
  function bootstrap_sample (line 183) | def bootstrap_sample(X, Y):
  class BaggingModel (line 188) | class BaggingModel(object):
    method __init__ (line 190) | def __init__(self, n_models):
    method fit (line 198) | def fit(self, X, Y):
    method predict (line 207) | def predict(self, X):
    method _vote (line 211) | def _vote(self, predictions):
    method evaluate (line 215) | def evaluate(self, X_test, y_test):
  class Dropout (line 224) | class Dropout(ABC):
    method __init__ (line 226) | def __init__(self, wrapped_layer, p):
    method _init_wrapper_params (line 237) | def _init_wrapper_params(self):
    method flush_gradients (line 241) | def flush_gradients(self):
    method update (line 247) | def update(self):
    method forward (line 253) | def forward(self, X, is_train=True):
    method backward (line 266) | def backward(self, dLda):
    method hyperparams (line 270) | def hyperparams(self):
  function get_random_subsets (line 287) | def get_random_subsets(X, y, n_subsets, replacements=True):
  class Bagging (line 309) | class Bagging():
    method __init__ (line 313) | def __init__(self, n_estimators=100, max_features=None, min_samples_sp...
    method fit (line 330) | def fit(self, X, y):
    method predict (line 338) | def predict(self, X):
    method score (line 352) | def score(self, X, y):
  class RandomForest (line 359) | class RandomForest():
    method __init__ (line 363) | def __init__(self, n_estimators=100, max_features=None, min_samples_sp...
    method fit (line 381) | def fit(self, X, y):
    method predict (line 401) | def predict(self, X):
    method score (line 417) | def score(self, X, y):
  class DecisionStump (line 425) | class DecisionStump():
    method __init__ (line 427) | def __init__(self):
  class Adaboost (line 433) | class Adaboost():
    method __init__ (line 437) | def __init__(self, n_estimators=5):
    method fit (line 441) | def fit(self, X, y):
    method predict (line 489) | def predict(self, X):
    method score (line 504) | def score(self, X, y):
  class Loss (line 511) | class Loss(ABC):
    method __init__ (line 513) | def __init__(self):
    method loss (line 517) | def loss(self, y_true, y_pred):
    method grad (line 521) | def grad(self, y, y_pred):
  class SquareLoss (line 524) | class SquareLoss(Loss):
    method __init__ (line 526) | def __init__(self):
    method loss (line 529) | def loss(self, y, y_pred):
    method grad (line 532) | def grad(self, y, y_pred):
    method hess (line 535) | def hess(self, y, y_pred):
  class CrossEntropyLoss (line 538) | class CrossEntropyLoss(Loss):
    method __init__ (line 540) | def __init__(self):
    method loss (line 543) | def loss(self, y, y_pred):
    method grad (line 546) | def grad(self, y, y_pred):
    method hess (line 549) | def hess(self, y, y_pred):
  function softmax (line 553) | def softmax(x):
  function line_search (line 558) | def line_search(self, y, y_pred, h_pred):
  function to_categorical (line 564) | def to_categorical(x, n_classes=None):
  class GradientBoostingDecisionTree (line 575) | class GradientBoostingDecisionTree(object):
    method __init__ (line 579) | def __init__(self, n_estimators, learning_rate=1, min_samples_split=2,
    method fit (line 594) | def fit(self, X, Y):
    method predict (line 629) | def predict(self, X):
    method score (line 643) | def score(self, X, y):
  class GradientBoostingRegressor (line 649) | class GradientBoostingRegressor(GradientBoostingDecisionTree):
    method __init__ (line 651) | def __init__(self, n_estimators=200, learning_rate=1, min_samples_spli...
  class GradientBoostingClassifier (line 662) | class GradientBoostingClassifier(GradientBoostingDecisionTree):
    method __init__ (line 664) | def __init__(self, n_estimators=200, learning_rate=1, min_samples_spli...
  class XGBoostRegressionTree (line 676) | class XGBoostRegressionTree(DecisionTree):
    method __init__ (line 680) | def __init__(self, min_samples_split=2, min_impurity=1e-7,
    method _split (line 689) | def _split(self, y):
    method _gain (line 695) | def _gain(self, y, y_pred):
    method _gain_by_taylor (line 701) | def _gain_by_taylor(self, y, y1, y2):
    method _approximate_update (line 712) | def _approximate_update(self, y):
    method fit (line 720) | def fit(self, X, y):
  class XGBoost (line 726) | class XGBoost(object):
    method __init__ (line 730) | def __init__(self, n_estimators=200, learning_rate=0.001, min_samples_...
    method fit (line 746) | def fit(self, X, Y):
    method predict (line 783) | def predict(self, X):
    method score (line 797) | def score(self, X, y):
  class XGBRegressor (line 803) | class XGBRegressor(XGBoost):
    method __init__ (line 805) | def __init__(self, n_estimators=200, learning_rate=1, min_samples_spli...
  class XGBClassifier (line 818) | class XGBClassifier(XGBoost):
    method __init__ (line 820) | def __init__(self, n_estimators=200, learning_rate=1, min_samples_spli...

FILE: code/chapter8.py
  class BatchNorm1D (line 11) | class BatchNorm1D(LayerBase):
    method __init__ (line 13) | def __init__(self, momentum=0.9, epsilon=1e-5, optimizer=None):
    method _init_params (line 36) | def _init_params(self):
    method reset_running_stats (line 54) | def reset_running_stats(self):
    method forward (line 58) | def forward(self, X, is_train=True, retain_derived=True):
    method backward (line 94) | def backward(self, dLda, retain_grads=True):
    method _bwd (line 117) | def _bwd(self, dLda, X):
    method hyperparams (line 136) | def hyperparams(self):

FILE: code/chapter9.py
  function calc_pad_dims_sameconv_2D (line 8) | def calc_pad_dims_sameconv_2D(X_shape, out_dim, kernel_shape, stride, di...
  function pad2D (line 52) | def pad2D(X, pad, kernel_shape=None, stride=None, dilation=1):
  function conv2D (line 95) | def conv2D(X, W, stride, pad, dilation=1):
  function _im2col_indices (line 145) | def _im2col_indices(X_shape, fr, fc, p, s, d=1):
  function im2col (line 177) | def im2col(X, W_shape, pad, stride, dilation=1):
  function conv2D_gemm (line 214) | def conv2D_gemm(X, W, stride=0, pad='same', dilation=1):
  class Conv2D (line 256) | class Conv2D(LayerBase):
    method __init__ (line 258) | def __init__(
    method _init_params (line 300) | def _init_params(self):
    method forward (line 310) | def forward(self, X, retain_derived=True):
    method backward (line 341) | def backward(self, dLda, retain_grads=True):
    method hyperparams (line 398) | def hyperparams(self):
  function col2im (line 417) | def col2im(X_col, X_shape, W_shape, pad, stride, dilation=0):
  class Conv2D_gemm (line 452) | class Conv2D_gemm(LayerBase):
    method __init__ (line 454) | def __init__(
    method _init_params (line 496) | def _init_params(self):
    method forward (line 506) | def forward(self, X, retain_derived=True):
    method backward (line 537) | def backward(self, dLda, retain_grads=True):
    method _bwd (line 565) | def _bwd(self, dLda, X, Y):
    method hyperparams (line 586) | def hyperparams(self):
  class Pool2D (line 605) | class Pool2D(LayerBase):
    method __init__ (line 607) | def __init__(self, kernel_shape, stride=1, pad=0, mode="max", optimize...
    method _init_params (line 629) | def _init_params(self):
    method forward (line 633) | def forward(self, X, retain_derived=True):
    method backward (line 678) | def backward(self, dLdy, retain_grads=True):
    method hyperparams (line 729) | def hyperparams(self):
  class Flatten (line 747) | class Flatten(LayerBase):
    method __init__ (line 749) | def __init__(self, keep_dim="first", optimizer=None):
    method _init_params (line 764) | def _init_params(self):
    method forward (line 770) | def forward(self, X, retain_derived=True):
    method backward (line 785) | def backward(self, dLdy, retain_grads=True):
    method hyperparams (line 803) | def hyperparams(self):
  class LeNet (line 815) | class LeNet(object):
    method __init__ (line 817) | def __init__(
    method _set_params (line 859) | def _set_params(self):
    method forward (line 916) | def forward(self, X_train):
    method backward (line 924) | def backward(self, grad):
    method update (line 932) | def update(self):
    method flush_gradients (line 940) | def flush_gradients(self, curr_loss=None):
    method fit (line 947) | def fit(self, X_train, y_train, n_epochs=20, batch_size=64, verbose=Fa...
    method evaluate (line 991) | def evaluate(self, X_test, y_test, batch_size=128):
    method hyperparams (line 1004) | def hyperparams(self):
  class LeNet_gemm (line 1029) | class LeNet_gemm(object):
    method __init__ (line 1031) | def __init__(
    method _set_params (line 1073) | def _set_params(self):
    method forward (line 1130) | def forward(self, X_train):
    method backward (line 1138) | def backward(self, grad):
    method update (line 1146) | def update(self):
    method flush_gradients (line 1154) | def flush_gradients(self, curr_loss=None):
    method fit (line 1161) | def fit(self, X_train, y_train, n_epochs=20, batch_size=64, verbose=Fa...
    method evaluate (line 1205) | def evaluate(self, X_test, y_test, batch_size=128):
    method hyperparams (line 1218) | def hyperparams(self):

FILE: code/method/activation/activation.py
  class ActivationBase (line 6) | class ActivationBase(ABC):
    method __init__ (line 8) | def __init__(self, **kwargs):
    method __call__ (line 11) | def __call__(self, z):
    method forward (line 17) | def forward(self, z):
    method grad (line 21) | def grad(self, x, **kwargs):
  class Sigmoid (line 25) | class Sigmoid(ActivationBase):
    method __init__ (line 30) | def __init__(self):
    method __str__ (line 33) | def __str__(self):
    method forward (line 36) | def forward(self, z):
    method grad (line 39) | def grad(self, x):
  class Tanh (line 43) | class Tanh(ActivationBase):
    method __init__ (line 48) | def __init__(self):
    method __str__ (line 51) | def __str__(self):
    method forward (line 54) | def forward(self, z):
    method grad (line 57) | def grad(self, x):
  class ReLU (line 61) | class ReLU(ActivationBase):
    method __init__ (line 68) | def __init__(self):
    method __str__ (line 71) | def __str__(self):
    method forward (line 74) | def forward(self, z):
    method grad (line 77) | def grad(self, x):
  class LeakyReLU (line 81) | class LeakyReLU(ActivationBase):
    method __init__ (line 88) | def __init__(self, alpha=0.3):
    method __str__ (line 92) | def __str__(self):
    method forward (line 95) | def forward(self, z):
    method grad (line 100) | def grad(self, x):
  class Affine (line 106) | class Affine(ActivationBase):
    method __init__ (line 111) | def __init__(self, slope=1, intercept=0):
    method __str__ (line 116) | def __str__(self):
    method forward (line 119) | def forward(self, z):
    method grad (line 122) | def grad(self, x):
  class SoftPlus (line 126) | class SoftPlus(ActivationBase):
    method __init__ (line 131) | def __init__(self):
    method __str__ (line 134) | def __str__(self):
    method forward (line 137) | def forward(self, z):
    method grad (line 140) | def grad(self, x):
  class ELU (line 144) | class ELU(ActivationBase):
    method __init__ (line 151) | def __init__(self, alpha=1.0):
    method __str__ (line 155) | def __str__(self):
    method forward (line 158) | def forward(self, z):
    method grad (line 161) | def grad(self, x):
  class Exponential (line 165) | class Exponential(ActivationBase):
    method __init__ (line 170) | def __init__(self):
    method __str__ (line 173) | def __str__(self):
    method forward (line 176) | def forward(self, z):
    method grad (line 179) | def grad(self, x):
  class SELU (line 183) | class SELU(ActivationBase):
    method __init__ (line 190) | def __init__(self):
    method __str__ (line 196) | def __str__(self):
    method forward (line 199) | def forward(self, z):
    method grad (line 202) | def grad(self, x):
  class HardSigmoid (line 208) | class HardSigmoid(ActivationBase):
    method __init__ (line 216) | def __init__(self):
    method __str__ (line 219) | def __str__(self):
    method forward (line 222) | def forward(self, z):
    method grad (line 225) | def grad(self, x):
  class ActivationInitializer (line 229) | class ActivationInitializer(object):
    method __init__ (line 231) | def __init__(self, acti_name="affine(slope=1, intercept=0)"):
    method __call__ (line 234) | def __call__(self):

FILE: code/method/optimizer/optimizer.py
  class OptimizerBase (line 6) | class OptimizerBase(ABC):
    method __init__ (line 8) | def __init__(self):
    method __call__ (line 11) | def __call__(self, params, params_grad, params_name):
    method update (line 21) | def update(self, params, params_grad, params_name):
  class SGD (line 25) | class SGD(OptimizerBase):
    method __init__ (line 30) | def __init__(self, lr=0.01):
    method __str__ (line 35) | def __str__(self):
    method update (line 38) | def update(self, params, params_grad, params_name):
    method hyperparams (line 43) | def hyperparams(self):
  class Momentum (line 49) | class Momentum(OptimizerBase):
    method __init__ (line 51) | def __init__(
    method __str__ (line 64) | def __str__(self):
    method update (line 67) | def update(self, param, param_grad, param_name):
    method hyperparams (line 79) | def hyperparams(self):
  class AdaGrad (line 87) | class AdaGrad(OptimizerBase):
    method __init__ (line 89) | def __init__(self, lr=0.001, eps=1e-7, **kwargs):
    method __str__ (line 100) | def __str__(self):
    method update (line 103) | def update(self, param, param_grad, param_name):
    method hyperparams (line 116) | def hyperparams(self):
  class RMSProp (line 124) | class RMSProp(OptimizerBase):
    method __init__ (line 126) | def __init__(
    method __str__ (line 141) | def __str__(self):
    method update (line 146) | def update(self, param, param_grad, param_name):
    method hyperparams (line 160) | def hyperparams(self):
  class AdaDelta (line 169) | class AdaDelta(OptimizerBase):
    method __init__ (line 171) | def __init__(
    method __str__ (line 186) | def __str__(self):
    method update (line 189) | def update(self, param, param_grad, param_name):
    method hyperparams (line 207) | def hyperparams(self):
  class Adam (line 215) | class Adam(OptimizerBase):
    method __init__ (line 217) | def __init__(
    method __str__ (line 239) | def __str__(self):
    method update (line 244) | def update(self, param, param_grad, param_name, cur_loss=None):
    method hyperparams (line 271) | def hyperparams(self):
  class OptimizerInitializer (line 281) | class OptimizerInitializer(ABC):
    method __init__ (line 283) | def __init__(self, opti_name="sgd"):
    method __call__ (line 286) | def __call__(self):

FILE: code/method/weight/weight.py
  function calc_fan (line 6) | def calc_fan(weight_shape):
  class random_uniform (line 24) | class random_uniform:
    method __init__ (line 31) | def __init__(self, b=1.0):
    method __call__ (line 34) | def __call__(self, weight_shape):
  class random_normal (line 38) | class random_normal:
    method __init__ (line 46) | def __init__(self, std=0.01):
    method __call__ (line 49) | def __call__(self, weight_shape):
  class he_uniform (line 74) | class he_uniform:
    method __init__ (line 81) | def __init__(self):
    method __call__ (line 84) | def __call__(self, weight_shape):
  class he_normal (line 90) | class he_normal:
    method __init__ (line 97) | def __init__(self):
    method __call__ (line 100) | def __call__(self, weight_shape):
  class glorot_uniform (line 131) | class glorot_uniform:
    method __init__ (line 139) | def __init__(self, gain=1.0):
    method __call__ (line 142) | def __call__(self, weight_shape):
  class glorot_normal (line 148) | class glorot_normal:
    method __init__ (line 156) | def __init__(self, gain=1.0):
    method __call__ (line 159) | def __call__(self, weight_shape):
  function truncated_normal (line 191) | def truncated_normal(mean, std, out_shape):
  class WeightInitializer (line 209) | class WeightInitializer(object):
    method __init__ (line 211) | def __init__(self, mode="he_normal"):
    method __call__ (line 235) | def __call__(self, weight_shape):

Condensed preview: 16 files, each showing path, character count, and a content snippet. The full structured content is 151K chars.

[
  {
    "path": ".gitattributes",
    "chars": 31,
    "preview": "*.txt linguist-language=python\n"
  },
  {
    "path": "LICENSE",
    "chars": 1069,
    "preview": "MIT License\n\nCopyright (c) 2020 Mingchao Zhu\n\nPermission is hereby granted, free of charge, to any person obtaining a co"
  },
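  {
    "path": "README.md",
    "chars": 5127,
    "preview": "# Deep Learning\n\n**Deep Learning** is the only comprehensive book in the field of deep learning; its full title is also given as **Deep Learning: The AI Bible (Deep Learning)**, written by three world-renowned experts IanGoodfellow, YoshuaBengio, AaronCo"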
  },
  {
    "path": "code/chapter 11.py",
    "chars": 8784,
    "preview": "import pandas as pd\nimport numpy as np\nimport itertools\nimport time\nimport re\nfrom scipy.stats import norm\nimport matplo"
  },
  {
    "path": "code/chapter5.py",
    "chars": 18739,
    "preview": "import numpy as np\nimport cvxopt\nimport math\n\n\n########-----NaiveBayes------#########\nclass NaiveBayes():\n    \n    def _"
  },
  {
    "path": "code/chapter6.py",
    "chars": 11260,
    "preview": "from abc import ABC, abstractmethod\nimport numpy as np\nimport time\nimport re\nimport inspect\nfrom collections import Orde"
  },
  {
    "path": "code/chapter7.py",
    "chars": 28055,
    "preview": "from abc import ABC, abstractmethod\nimport numpy as np\nimport math\nimport re\nimport progressbar\nfrom chapter5 import Reg"
  },
  {
    "path": "code/chapter8.py",
    "chars": 4565,
    "preview": "from chapter import LayerBase\nimport numpy as np\n\n######### Optimization methods (Optimizer): see method/optimizer #######\n\n\n######## Parameter initialization (P"
  },
  {
    "path": "code/chapter9.py",
    "chars": 39262,
    "preview": "from abc import ABC, abstractmethod\nimport numpy as np\nfrom chapter6 import LayerBase, CrossEntropy, FullyConnected, min"
  },
  {
    "path": "code/method/__init__.py",
    "chars": 50,
    "preview": "from . import optimizer\nfrom . import activation\n\n"
  },
  {
    "path": "code/method/activation/activation.py",
    "chars": 5546,
    "preview": "from abc import ABC, abstractmethod\nimport numpy as np\nimport re\n\n\nclass ActivationBase(ABC):\n    \n    def __init__(self"
  },
  {
    "path": "code/method/optimizer/optimizer.py",
    "chars": 8166,
    "preview": "from abc import ABC, abstractmethod\nimport numpy as np\nimport re\n\n\nclass OptimizerBase(ABC):\n    \n    def __init__(self)"
  },
  {
    "path": "code/method/weight/weight.py",
    "chars": 6221,
    "preview": "from abc import ABC, abstractmethod\nimport numpy as np\nimport re\n\n\ndef calc_fan(weight_shape):\n    \"\"\"\n    For the weight matrix, compute fan-i"
  },
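  {
    "path": "contents.txt",
    "chars": 3472,
    "preview": "Note: this table of contents follows that of the book Deep Learning. Given this project's content, it could be divided more finely; here it stops at the third level.\n\n**Contents**:\n\n- Chapter 2 Linear Algebra\n  - 1 Scalars, vectors, matrices, tensors\n  - 2 Matrix transpose\n  - 3 Matrix addition\n "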
  },
  {
    "path": "reference.txt",
    "chars": 3027,
    "preview": "**References**:\n\n- General references\n  - https://github.com/exacity/deeplearningbook-chinese/\n\n- Linear algebra\n  - https://github.com/npetrakis/Year"
  },
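  {
    "path": "update.txt",
    "chars": 273,
    "preview": "**Update log**:\n\n2020/3/:\n\n \t1. Revised the decision tree section of Chapter 5, adding the principles of ID3 and CART; the code implementation focuses on CART.\n \t2. Chapter 7 adds the derivation of the optimal solutions for L1 and L2 regularization (i.e., why L1 yields sparse solutions).\n \t3. Chapter 7"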
  }
]

About this extraction

This page contains the full source code of the MingchaoZhu/DeepLearning GitHub repository, extracted and formatted as plain text. The extraction includes 16 files (140.3 KB), approximately 44.3k tokens, and a symbol index with 412 extracted functions, classes, methods, constants, and types.
