Full Code of hzjken/HFT-price-prediction for AI

Repository: hzjken/HFT-price-prediction
Branch: master
Commit: ffa850903027
Files: 6
Total size: 14.1 MB

Directory structure:
gitextract_hkpi1sqm/

├── README.md
├── data.csv
├── features.txt
├── lgbm.joblib
├── modelling_pipeline.py
└── rf.joblib

================================================
FILE CONTENTS
================================================

================================================
FILE: README.md
================================================
# HFT-price-prediction
A project that uses tree-based machine learning models to predict whether an instrument's price will move up or down in high-frequency trading.

## Project Background
A hands-on data science exercise from a high-frequency trading company.

## Task
Build a model on the given data to predict whether the trading price will go up or down in the near future (a classification problem).

## Data Explanation
### Feature Columns
<b>timestamp</b>  str, datetime string.<br>
<b>bid_price</b>  float, price of current bid in the market.<br>
<b>bid_qty</b>  float, quantity currently available at the bid price.<br>
<b>ask_price</b>  float, price of current ask in the market.<br>
<b>ask_qty</b>  float, quantity currently available at the ask price.<br>
<b>trade_price</b>  float, last traded price.<br>
<b>sum_trade_1s</b>  float, sum of quantity traded over the last second.<br>
<b>bid_advance_time</b>  float, seconds since bid price last advanced.<br>
<b>ask_advance_time</b>  float, seconds since ask price last advanced.<br>
<b>last_trade_time</b>  float, seconds since last trade.<br>
### Labels
<b>_1s_side</b> int<br>
<b>_3s_side</b> int<br>
<b>_5s_side</b> int<br>
Labels indicate the type of the first event that will happen in the next x seconds, where:<br>
<b>0</b> -- No price change.<br>
<b>1</b> -- Bid price decreased.<br>
<b>2</b> -- Ask price increased.<br>

## Process
### Preprocessing
<b>data type conversion</b>: **`preprocessing()`**<br>
<b>data check</b>: **`check_null()`**<br>
<b>missing value handling</b>: **`fill_null()`**.
The null check shows that most null values of sum_trade_1s occur when last_trade_time is larger than 1 second, in
which case sum_trade_1s should be 0. We therefore assume that every null sum_trade_1s can be filled with 0. Under the
same assumption, a null last_trade_time is filled with the previous record's last_trade_time plus the elapsed time,
provided the interval between the two records is at most 1 second; otherwise it stays null.
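
The filling rule above can be sketched roughly as follows. This is a minimal illustration on toy data, not the repository's `fill_null()`; the helper name `fill_trade_nulls` and the sample values are assumptions made for the example.

```python
import numpy as np
import pandas as pd


def fill_trade_nulls(df: pd.DataFrame) -> pd.DataFrame:
    """Fill sum_trade_1s with 0 and extend last_trade_time across gaps of at most 1 second."""
    df = df.sort_values("timestamp").reset_index(drop=True).copy()
    df["sum_trade_1s"] = df["sum_trade_1s"].fillna(0.0)

    values = []
    prev_time, prev_value = None, np.nan
    for ts, value in zip(df["timestamp"], df["last_trade_time"]):
        if pd.isnull(value) and prev_time is not None:
            gap = (ts - prev_time).total_seconds()
            # Only extrapolate across short gaps; otherwise leave the value missing.
            value = prev_value + gap if gap <= 1 else np.nan
        values.append(value)
        prev_time, prev_value = ts, value
    df["last_trade_time"] = values
    return df


# Toy records (hypothetical values): the second row is 0.4 s after the first,
# the third row is 2.6 s after the second.
toy = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2019-01-01 09:00:00.000",
        "2019-01-01 09:00:00.400",
        "2019-01-01 09:00:03.000",
    ]),
    "sum_trade_1s": [2.0, np.nan, np.nan],
    "last_trade_time": [0.1, np.nan, np.nan],
})
filled = fill_trade_nulls(toy)
print(filled[["sum_trade_1s", "last_trade_time"]])
```

Rows whose gap to the previous record exceeds 1 second keep a null last_trade_time, so they still have to be dropped or handled later in the pipeline.
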
### Feature Engineering
<b>correlation filter</b>: **`correlation_filter.filter()`**, remove columns that are highly correlated to reduce data redundancy.<br>
<b>logical feature engineering</b>: **`feature_eng.basic_features()`**, build up some features based on trading logic.<br>
<b>time-rolling feature engineering</b>: **`feature_eng.lag_rolling_features()`**, build up features by lagging and rolling of time-series.<br>
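
As a rough sketch of the correlation filter and the lag/rolling features, assuming pandas and a toy quote table: `drop_correlated` is a hypothetical, simplified greedy filter (the repository groups correlated columns instead), and the column values are made up. The generated column names follow the pattern seen in `features.txt`.

```python
import pandas as pd


def drop_correlated(df, threshold=0.99):
    """Greedy filter: keep the first column of each highly correlated pair, drop the other."""
    corr = df.corr().abs()
    cols = list(corr.columns)
    dropped = []
    for i, first in enumerate(cols):
        if first in dropped:
            continue
        for second in cols[i + 1:]:
            if second not in dropped and corr.loc[first, second] >= threshold:
                dropped.append(second)
    return df.drop(columns=dropped), dropped


index = pd.to_datetime([
    "2019-01-01 09:00:00.0",
    "2019-01-01 09:00:00.5",
    "2019-01-01 09:00:01.2",
    "2019-01-01 09:00:03.0",
    "2019-01-01 09:00:03.1",
])
quotes = pd.DataFrame({
    "bid_price": [100.0, 100.5, 100.5, 101.0, 100.0],
    "ask_price": [100.5, 101.0, 101.0, 101.5, 100.5],  # always bid_price + 0.5
    "bid_qty": [5.0, 7.0, 6.0, 9.0, 4.0],
}, index=index)

quotes, dropped = drop_correlated(quotes)

# Lag feature: value of the previous record.
quotes["bid_qty_lag_1"] = quotes["bid_qty"].shift(1)
# Count-based window: mean over the last 3 records (NaN until 3 records exist).
quotes["bid_qty_rolling_mean_3"] = quotes["bid_qty"].rolling(window=3).mean()
# Time-based window: max over the trailing 1 second (needs a sorted DatetimeIndex).
quotes["bid_qty_rolling_max_1s"] = quotes["bid_qty"].rolling(window="1s").max()

print(dropped)
print(quotes)
```

Count-based windows leave nulls in the first rows, which is why the pipeline drops incomplete rows after feature generation.
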
### Feature Selection
**`feature_selection.select()`**, Hybrid approach of genetic algorithm selection plus feature importance selection.<br>
<b>genetic algorithm selection</b>: **`feature_selection.GA_features()`** <br>
<b>feature importance selection</b>: **`feature_selection.rf_imp_features()`** <br>
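
The hybrid selection can be pictured as a union of two feature sets. The sketch below uses NumPy only; the importance values are random stand-ins for random forest importances, and `ga_features` is a hard-coded, hypothetical result rather than the output of a real genetic search.

```python
import numpy as np


def top_importance_features(names, importance_runs, top_perc=0.05):
    """Keep features whose mean importance is at or above the (1 - top_perc) percentile."""
    avg = np.mean(importance_runs, axis=0)
    threshold = np.percentile(avg, (1 - top_perc) * 100)
    return [name for name, value in zip(names, avg) if value >= threshold]


rng = np.random.default_rng(0)
names = [f"f{i}" for i in range(40)]
runs = rng.random((3, 40))   # stand-in for 3 importance vectors from repeated fits
runs[:, 7] += 5.0            # make one feature clearly dominant

imp_features = top_importance_features(names, runs)
ga_features = ["f1", "f7", "f12"]  # hypothetical genetic-algorithm result

# Final feature set: anything picked by either method.
selected = sorted(set(imp_features) | set(ga_features))
print(imp_features, selected)
```

Averaging importances over several fits reduces the noise of a single small forest before the percentile cut is applied.
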
### Modelling
Ensemble of lightGBM and random forest models: their predicted class probabilities are averaged and the most probable class is returned.<br>
<b>random forest</b>: **`model.random_forest()`** <br>
<b>lightGBM</b>: **`model.lightgbm()`** <br>
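
A minimal sketch of the averaging step, assuming each model exposes class probabilities with columns ordered as labels 0, 1, 2. The probability arrays below are made-up numbers, not outputs of the saved models.

```python
import numpy as np


def soft_vote(*probas):
    """Average class-probability matrices and return the most probable class per row."""
    avg = np.mean(probas, axis=0)
    return np.argmax(avg, axis=1)


# Hypothetical predict_proba outputs for three records and three classes.
lgbm_proba = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.3, 0.3, 0.4]])
rf_proba = np.array([[0.5, 0.2, 0.3], [0.1, 0.3, 0.6], [0.4, 0.35, 0.25]])

pred = soft_vote(lgbm_proba, rf_proba)
print(pred)
```

The returned index equals the class label only when both models list their classes in the same order, which is worth checking through each model's `classes_` attribute.
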
### Parameter Tuning
The size of the search space decides whether grid search or genetic search is used to tune the lightGBM model's parameters.<br>
<b>grid search</b>: **`model.GS_tune_lgbm()`** <br>
<b>genetic search</b>: **`model.GA_tune_lgbm()`** <br>
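
The choice between the two tuners can be sketched with the standard library alone. The cut-off of 1000 combinations matches the check in `model.lightgbm()`; the grid values and the helper name `choose_search` are assumptions for illustration.

```python
from itertools import product
from math import prod

# Hypothetical, reduced parameter grid.
param_grid = {
    "learning_rate": [0.0005, 0.001, 0.0015],
    "n_estimators": [800, 1200, 1600],
    "max_depth": [3, 4],
    "colsample_bytree": [0.2, 0.3, 0.4],
}


def choose_search(grid, max_grid_size=1000):
    """Pick exhaustive grid search for small grids and genetic search for large ones."""
    n_combinations = prod(len(values) for values in grid.values())
    strategy = "genetic" if n_combinations > max_grid_size else "grid"
    return strategy, n_combinations


strategy, n_combinations = choose_search(param_grid)

# Enumerate every combination, as a grid search would.
combos = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
print(strategy, n_combinations, combos[0])
```

Grid search evaluates every combination, so its cost grows multiplicatively with each added parameter; a genetic search samples the space instead and trades completeness for time.
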
## Performance
Out-of-sample classification accuracy is roughly 76-78%, which we consider acceptable for predicting short-term price movement.
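
This README does not state how the accuracy figure was measured. One common way to obtain an out-of-sample estimate for time-ordered data is to hold out the most recent rows, sketched below with made-up labels; `chronological_split` and `accuracy` are hypothetical helpers, not part of the repository.

```python
import numpy as np


def chronological_split(n_rows, test_fraction=0.2):
    """Return (train_idx, test_idx) with the most recent rows held out for testing."""
    n_test = int(n_rows * test_fraction)
    cut = n_rows - n_test
    return np.arange(cut), np.arange(cut, n_rows)


def accuracy(y_true, y_pred):
    """Share of rows whose predicted label equals the true label."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))


# Made-up labels and predictions, ordered by time.
y_true = np.array([0, 1, 2, 0, 0, 2, 1, 0, 2, 0])
y_pred = np.array([0, 1, 1, 0, 0, 2, 1, 0, 2, 1])

train_idx, test_idx = chronological_split(len(y_true))
test_accuracy = accuracy(y_true[test_idx], y_pred[test_idx])
print(test_idx, test_accuracy)
```

Splitting by time rather than at random keeps lagged and rolling features from leaking future information into the training rows.
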


================================================
FILE: data.csv
================================================
[File too large to display: 14.1 MB]

================================================
FILE: features.txt
================================================
{"keep_features": ["bid_ask_qty_diff_diff_lag_5", "up_down_rolling_std_5", "spread_diff_rolling_mean_20", "spread_diff_rolling_mean_5s", "bid_price_rolling_std_1s", "bid_advance_time_rolling_mean_1s", "ask_qty_diff_rolling_max_10s", "ask_price_diff_rolling_std_3s", "ask_qty_rolling_std_10s", "bid_ask_qty_diff_rolling_std_20", "ask_advance_time_lag_2", "bid_ask_qty_total_rolling_max_10", "bid_ask_qty_diff_rolling_sum_5", "bid_qty_rolling_min_5", "bid_ask_qty_diff_diff_rolling_sum_3s", "sum_trade_1s_rolling_std_1s", "spread_rolling_mean_1s", "trade_price_diff_rolling_sum_10", "ask_qty_diff_rolling_sum_10s", "ask_price_diff_rolling_mean_5s", "sum_trade_1s_diff_rolling_sum_20", "bid_price_lag_5", "sum_trade_1s_rolling_mean_5", "bid_ask_qty_diff_rolling_min_5", "bid_ask_qty_diff_diff_rolling_std_3s", "bid_ask_qty_total_rolling_min_5", "bid_advance_time_diff_lag_2", "trade_price_compare", "bid_ask_qty_diff_diff_rolling_mean_20", "trade_price_diff_rolling_sum_3s", "bid_ask_qty_diff_rolling_sum_1s", "bid_qty", "ask_advance_time_rolling_mean_5s", "spread_diff_rolling_std_1s", "trade_price_compare_diff_rolling_std_1s", "bid_ask_qty_diff", "ask_qty_lag_1", "ask_qty_diff_rolling_sum_1s", "trade_price_compare_diff_rolling_sum_5", "spread", "bid_qty_lag_1", "bid_ask_qty_diff_rolling_mean_10", "bid_qty_lag_2", "bid_price_lag_3", "ask_qty_rolling_min_3s", "ask_advance_time_lag_4", "spread_diff_rolling_std_3s", "bid_qty_rolling_max_20", "ask_qty_lag_3", "bid_qty_diff_lag_5", "bid_price_diff_rolling_sum_5s", "trade_price_compare_diff_lag_4", "bid_price_diff_lag_4", "bid_qty_diff_rolling_sum_1s", "bid_ask_qty_diff_diff_rolling_max_1s", "bid_advance_time_rolling_mean_3s", "ask_advance_time_diff_lag_1", "ask_qty_rolling_min_5", "spread_rolling_std_3s", "bid_advance_time_rolling_std_20", "ask_qty_diff_rolling_min_20", "sum_trade_1s_rolling_mean_10", "spread_diff_rolling_std_20", "ask_qty_rolling_mean_5", "bid_qty_rolling_min_10", "trade_price_compare_diff_lag_5", 
"bid_price_rolling_std_5", "trade_price_rolling_mean_10", "sum_trade_1s_diff_rolling_std_10", "bid_advance_time_diff_rolling_sum_5s", "ask_qty_lag_2", "trade_price_pos_diff_rolling_std_10s", "ask_advance_time_diff_rolling_mean_5", "ask_qty_rolling_min_10", "sum_trade_1s_diff_lag_5", "last_trade_time_diff_lag_4", "bid_qty_diff_rolling_std_5", "bid_price_diff_lag_3", "ask_advance_time_lag_3", "ask_qty_rolling_mean_20", "ask_qty_diff_rolling_mean_5", "bid_ask_qty_diff_diff_rolling_sum_10s", "bid_advance_time_rolling_mean_5s", "sum_trade_1s_lag_1", "bid_qty_rolling_min_3s", "bid_qty_rolling_max_5s", "sum_trade_1s_diff_lag_2", "bid_ask_qty_total_rolling_max_10s", "bid_qty_rolling_mean_10", "bid_advance_time_lag_1", "bid_ask_qty_diff_lag_1", "bid_ask_qty_diff_diff_rolling_min_1s", "bid_qty_diff_rolling_std_10s", "bid_price_rolling_std_5s", "ask_qty_diff_rolling_std_5s", "bid_qty_diff_rolling_max_10", "last_trade_time", "ask_qty_diff_rolling_mean_1s", "trade_price_pos_diff_rolling_mean_3s", "bid_ask_qty_total_diff_rolling_max_3s", "ask_qty_diff_rolling_sum_3s", "last_trade_time_diff_rolling_mean_5s", "bid_ask_qty_total_diff_rolling_max_10", "bid_qty_rolling_mean_5", "ask_qty", "bid_ask_qty_diff_diff_rolling_mean_5s", "bid_ask_qty_total_diff_rolling_sum_5", "bid_qty_rolling_min_20", "last_trade_time_diff_rolling_sum_5", "bid_price_rolling_mean_10s", "ask_advance_time_diff_rolling_mean_1s", "sum_trade_1s_diff"], "correlation_remove": ["ask_price"]}

================================================
FILE: modelling_pipeline.py
================================================
import pandas as pd
import numpy as np
import json
from itertools import product
from bisect import bisect_left
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import TimeSeriesSplit
from genetic_selection import GeneticSelectionCV
from lightgbm import LGBMClassifier
from evolutionary_search import EvolutionaryAlgorithmSearchCV
from sklearn.model_selection import GridSearchCV
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
from scipy.stats import mode


def preprocessing(data):
    '''align data type and time order'''
    float_list = [
        'bid_price',
        'bid_qty',
        'ask_price',
        'ask_qty',
        'trade_price',
        'sum_trade_1s',
        'bid_advance_time',
        'ask_advance_time',
        'last_trade_time',
    ]

    data['timestamp'] = pd.to_datetime(data['timestamp'])
    for i in float_list:
        data[i] = data[i].astype(float)

    data = data.sort_values(by='timestamp', ascending=True).reset_index(drop=True)
    return data


def check_null(data):
    '''check null values in dataframe'''
    data = data.copy()
    have_null_cols = list(data.columns[data.isnull().any()])
    print('Columns with null values are {}'.format(', '.join(have_null_cols)))
    for i in have_null_cols:
        print('number of rows that column {} is null: {}'.format(i, data[i].isnull().sum()))
        print('null percentage is {}'.format(round(data[i].isnull().sum() / data.shape[0], 2)))

    stat1 = data['sum_trade_1s'][data['last_trade_time'].isnull()].notnull().sum()
    stat2 = data['last_trade_time'][data['sum_trade_1s'].isnull()].notnull().sum()
    stat3 = data['sum_trade_1s'][data['last_trade_time'] >= 1].isnull().sum()
    stat4 = stat3 / data['sum_trade_1s'].isnull().sum()
    print('number of rows where sum_trade_1s is not null while last_trade_time is null: {}'.format(stat1))
    print('number of rows where last_trade_time is not null while sum_trade_1s is null: {}'.format(stat2))
    print('number of rows where sum_trade_1s is null and last_trade_time >= 1: {}, percentage: {}'.format(stat3, round(stat4, 2)))


def fill_null(data):
    '''
    based on the null check and basic logic, most of the sum_trade_1s null value happens when last_trade_time larger
    than 1 sec (in this case sum_trade_1s should be 0). Therefore, we make an assumption that all the sum_trade_1s null
    value could be filled with 0. Based on such assumption, last_trade_time can be filled with last_trade_time of the
    previous record plus a time movement if record interval is smaller than 1 sec.
    '''

    class last_trade_time_filler:
        prev_last_trade_time = None
        prev_timestamp = None

        @classmethod
        def fill(cls, index):
            last_trade_time = data.loc[index, 'last_trade_time']
            timestamp = data.loc[index, 'timestamp']

            if pd.isnull(last_trade_time):
                # total_seconds() covers the whole gap; .microseconds is only the sub-second component
                time_interval = (timestamp - cls.prev_timestamp).total_seconds()
                if time_interval <= 1:
                    last_trade_time = cls.prev_last_trade_time + time_interval
                else:
                    last_trade_time = np.nan

            cls.prev_last_trade_time = last_trade_time
            cls.prev_timestamp = timestamp

            return last_trade_time

    data = data.copy()
    data.loc[data['sum_trade_1s'].isnull(), 'sum_trade_1s'] = 0
    data['last_trade_time'] = data.index.map(last_trade_time_filler.fill)
    print('number of columns with null values is now: {}'.format(len(list(data.columns[data.isnull().any()]))))

    return data


def x_y_split(data):
    label_cols = ['_1s_side', '_3s_side', '_5s_side']
    # keep the original column order; iterating over a set gives an arbitrary order
    feature_cols = [col for col in data.columns if col not in label_cols]
    y = data[label_cols].copy()
    x = data[feature_cols].copy()

    return x, y


class correlation_filter:
    remove_cols = []

    @classmethod
    def filter(cls, x, threshold=0.99):
        x = x.copy()
        index2col = {i: col for i, col in enumerate(x.columns)}
        corr = np.array(x.corr())
        correlated_pairs = list(zip(*np.where(np.abs(corr) >= threshold)))
        to_be_delete = []
        for i, j in correlated_pairs:
            former = index2col[i]
            latter = index2col[j]
            if former != latter:
                add = True
                for k, del_set in enumerate(to_be_delete):
                    # a set never equals the dict literal {}, so test for an empty intersection with bool()
                    has_intersect = bool({former, latter} & del_set)
                    if has_intersect:
                        add = False
                        to_be_delete[k] = del_set | {former, latter}
                if add:
                    to_be_delete.append({former, latter})

        for i in to_be_delete:
            delete_set = i.copy()
            delete_set.pop()
            x = x.drop(list(delete_set), axis=1)
            cls.remove_cols += list(delete_set)

        return x


class feature_eng:
    timestamp = None
    max_lag = 5
    num_window = [5, 10, 20]
    sec_window = [1, 3, 5, 10]
    rolling_sum_cols = []
    rolling_mean_cols = []
    rolling_max_cols = []
    rolling_min_cols = []
    rolling_std_cols = []

    @staticmethod
    def bid_ask_spread(data):
        data['spread'] = data['ask_price'] - data['bid_price']

    @staticmethod
    def bid_ask_qty_comb(data):
        data['bid_ask_qty_total'] = data['ask_qty'] + data['bid_qty']
        data['bid_ask_qty_diff'] = data['ask_qty'] - data['bid_qty']

    @staticmethod
    def trade_price_feature(data):
        data['trade_price_compare'] = 0  # trade price strictly between current bid and ask price
        # trade price at or below the current bid price
        data.loc[data['trade_price'] <= data['bid_price'], 'trade_price_compare'] = -1
        # trade price at or above the current ask price
        data.loc[data['trade_price'] >= data['ask_price'], 'trade_price_compare'] = 1

        # whether the last trade happened on the bid side or the ask side at the time it occurred
        last_trade_timestamp = data['timestamp'] - pd.to_timedelta(data['last_trade_time'], unit='s')
        idx_list = [bisect_left(data['timestamp'], i) for i in list(last_trade_timestamp)]
        trade_price_pos = []
        for i, index in enumerate(idx_list):
            index1 = index
            index2 = index1 + 1 if index1 < data.shape[0] - 1 else index1
            bid1 = data['bid_price'][index1]
            bid2 = data['bid_price'][index2]
            ask1 = data['ask_price'][index1]
            ask2 = data['ask_price'][index2]
            trade_price = data['trade_price'][i]
            if (bid1 <= trade_price <= bid2) or (bid2 <= trade_price <= bid1):
                trade_price_pos.append(-1)  # happen on bid side
            elif (ask1 <= trade_price <= ask2) or (ask2 <= trade_price <= ask1):
                trade_price_pos.append(1)  # happen on ask side
            else:
                trade_price_pos.append(0)  # unknown case
        data['trade_price_pos'] = trade_price_pos

    @staticmethod
    def diff_feature(data):
        for i in set(data.columns) - {'timestamp'}:
            new_name = '{}_diff'.format(i)
            data[new_name] = data[i] - data[i].shift(1)

    @staticmethod
    def up_or_down(data):
        data['up_down'] = 0
        data.loc[data['bid_price_diff'] < 0, 'up_down'] = -1
        data.loc[data['ask_price_diff'] > 0, 'up_down'] = 1

    @staticmethod
    def lag_feature(data, col, lag):
        new_col_name = '{}_lag_{}'.format(col, lag)
        data[new_col_name] = data[col].shift(lag)

    @staticmethod
    def rolling_feature(data, col, window, feature):
        rolling = data[col].rolling(window=window)
        new_col = '{}_rolling_{}_{}'.format(col, feature, window)

        if feature == 'sum':
            data[new_col] = rolling.sum()
        elif feature == 'mean':
            data[new_col] = rolling.mean()
        elif feature == 'max':
            data[new_col] = rolling.max()
        elif feature == 'min':
            data[new_col] = rolling.min()
        elif feature == 'std':
            data[new_col] = rolling.std()
        elif feature == 'mode':
            data[new_col] = rolling.apply(lambda x: mode(x)[0])

    @classmethod
    def basic_features(cls, data):
        data = data.copy()
        cls.timestamp = data['timestamp']

        cls.bid_ask_spread(data)
        cls.bid_ask_qty_comb(data)
        cls.trade_price_feature(data)
        cls.diff_feature(data)
        cls.up_or_down(data)

        data = data.drop('timestamp', axis=1)
        return data

    @classmethod
    def lag_rolling_features(cls, data):
        data = data.copy()

        # get lag and rolling feature based on previous n records
        rolling_cols = set(data.columns) - {'trade_price_compare', 'trade_price_pos'}
        cls.rolling_sum_cols = [i for i in rolling_cols if 'diff' in i or 'up_down' in i]
        cls.rolling_mean_cols = rolling_cols
        cls.rolling_max_cols = [i for i in rolling_cols if 'bid_qty' in i or 'ask_qty' in i]
        cls.rolling_min_cols = [i for i in rolling_cols if 'bid_qty' in i or 'ask_qty' in i]
        cls.rolling_std_cols = rolling_cols

        for col in rolling_cols:
            for lag in range(1, cls.max_lag + 1):
                cls.lag_feature(data, col, lag)

        for col in rolling_cols:
            for num_window in cls.num_window:
                if col in cls.rolling_sum_cols:
                    cls.rolling_feature(data, col, num_window, 'sum')
                if col in cls.rolling_mean_cols:
                    cls.rolling_feature(data, col, num_window, 'mean')
                if col in cls.rolling_max_cols:
                    cls.rolling_feature(data, col, num_window, 'max')
                if col in cls.rolling_min_cols:
                    cls.rolling_feature(data, col, num_window, 'min')
                if col in cls.rolling_std_cols:
                    cls.rolling_feature(data, col, num_window, 'std')

        # get rolling feature based on previous n seconds
        data.index = cls.timestamp
        for col in rolling_cols:
            for sec_window in cls.sec_window:
                sec_window = '{}s'.format(sec_window)
                if col in cls.rolling_sum_cols:
                    cls.rolling_feature(data, col, sec_window, 'sum')
                if col in cls.rolling_mean_cols:
                    cls.rolling_feature(data, col, sec_window, 'mean')
                if col in cls.rolling_max_cols:
                    cls.rolling_feature(data, col, sec_window, 'max')
                if col in cls.rolling_min_cols:
                    cls.rolling_feature(data, col, sec_window, 'min')
                if col in cls.rolling_std_cols:
                    cls.rolling_feature(data, col, sec_window, 'std')
                if col in ['up_down', 'trade_price_compare', 'trade_price_pos']:
                    cls.rolling_feature(data, col, sec_window, 'mode')

        return data

    @staticmethod
    def remove_na(x, y):
        x = x.reset_index(drop=True)
        x = x.dropna()
        y = y.loc[x.index, :].reset_index(drop=True)
        x = x.reset_index(drop=True)
        return x, y


class feature_selection:
    '''feature selection combining feature importance ranking and GA optimization based on random forest model'''

    @classmethod
    def select(cls, x, y):
        rf_imp_features = cls.rf_imp_features(x, y)
        ga_features = cls.GA_features(x, y)
        features = set(rf_imp_features) | set(ga_features)

        return list(features)

    @classmethod
    def rf_imp_features(cls, x, y, top_perc=0.05):
        '''select top features based on feature importance ranking among all the features'''
        feature_imp = cls.rf_importance_selection(x, y)
        perc_threshold = np.percentile(feature_imp['avg_importance'], int((1 - top_perc) * 100))
        features = list(feature_imp.loc[feature_imp['avg_importance'] >= perc_threshold, 'feature'])

        return features

    @staticmethod
    def rf_importance_selection(x, y, iter_time=3):
        feature_imp = pd.DataFrame(np.zeros((x.shape[1], iter_time + 2)))
        feature_imp.columns = ['feature'] + ['importance_{}'.format(i) for i in range(1, iter_time + 1)] + [
            'avg_importance']
        for col in feature_imp.columns:
            feature_imp[col] = list(x.columns)

        for i in range(1, iter_time + 1):
            col = 'importance_{}'.format(i)
            rf = RandomForestClassifier(n_estimators=10, max_depth=8)
            rf.fit(x, y)
            feature_imp_dict = dict(zip(x.columns, rf.feature_importances_))
            feature_imp[col] = feature_imp[col].replace(feature_imp_dict)

        feature_imp['avg_importance'] = feature_imp.iloc[:, 1:-1].mean(axis=1)
        return feature_imp

    @staticmethod
    def GA_features(x, y):
        rf = RandomForestClassifier(max_depth=8, n_estimators=10)
        selector = GeneticSelectionCV(
            rf,
            cv=TimeSeriesSplit(n_splits=4),
            verbose=1,
            scoring="accuracy",
            max_features=80,
            n_population=200,
            crossover_proba=0.5,
            mutation_proba=0.2,
            n_generations=100,
            crossover_independent_proba=0.5,
            mutation_independent_proba=0.05,
            tournament_size=3,
            n_gen_no_change=5,
            caching=True,
            n_jobs=-1
        )
        selector = selector.fit(x, y)
        features = x.columns[selector.support_]

        return features


class model:
    lgbm_paramgrid = {
        'learning_rate': np.arange(0.0005, 0.0015, 0.0001),
        'n_estimators': range(800, 2000, 200),
        'max_depth': [3, 4],
        'colsample_bytree': np.arange(0.2, 0.5, 0.1),
        'reg_alpha': [1],
        'reg_lambda': [1]
    }

    @staticmethod
    def random_forest(x, y):
        rf = RandomForestClassifier(n_estimators=200, max_depth=8)
        rf.fit(x, y)
        return rf

    @classmethod
    def lightgbm(cls, x, y):
        keys, vals = list(zip(*cls.lgbm_paramgrid.items()))
        products = list(product(*vals))
        param_comb = [dict(zip(keys, i)) for i in products]

        if len(param_comb) > 1000:
            best_param = cls.GA_tune_lgbm(x, y)
        else:
            best_param = cls.GS_tune_lgbm(x, y)

        lightgbm = LGBMClassifier(**best_param)
        lightgbm.fit(x, y)

        return lightgbm

    @classmethod
    def GA_tune_lgbm(cls, x, y):
        tuner = EvolutionaryAlgorithmSearchCV(
            estimator=LGBMClassifier(),
            params=cls.lgbm_paramgrid,
            scoring="accuracy",
            cv=TimeSeriesSplit(n_splits=4),
            verbose=1,
            population_size=50,
            gene_mutation_prob=0.2,
            gene_crossover_prob=0.5,
            tournament_size=3,
            generations_number=20,
        )
        tuner.fit(x, y)
        return tuner.best_params_

    @classmethod
    def GS_tune_lgbm(cls, x, y):
        tuner = GridSearchCV(
            estimator=LGBMClassifier(),
            param_grid=cls.lgbm_paramgrid,
            scoring="accuracy",
            cv=TimeSeriesSplit(n_splits=4),
            verbose=1,
            n_jobs=-1,
        )
        tuner.fit(x, y)
        return tuner.best_params_


class feature:
    @staticmethod
    def save(features, correlation_remove):
        final = {
            'keep_features': features,
            'correlation_remove': correlation_remove
        }

        with open('features.txt', 'w') as f:
            f.write(json.dumps(final))

    @staticmethod
    def load():
        with open('features.txt', 'r') as f:
            features = f.read()
            features = json.loads(features)

        return features


def train_model(data, target_label):
    data = data.copy()
    data = preprocessing(data)
    check_null(data)
    data = fill_null(data)
    x, y = x_y_split(data)
    x = feature_eng.basic_features(x)
    x = correlation_filter.filter(x)
    x = feature_eng.lag_rolling_features(x)
    x, y = feature_eng.remove_na(x, y)
    y = y[target_label]
    features = feature_selection.select(x, y)
    feature.save(features, correlation_filter.remove_cols)
    lightgbm = model.lightgbm(x[features], y)
    rf = model.random_forest(x[features], y)
    joblib.dump(rf, 'rf.joblib')
    joblib.dump(lightgbm, 'lgbm.joblib')


def predict(data, target_label):
    '''returns both the prediction and the target_label'''
    saved = feature.load()  # read features.txt once
    features = saved['keep_features']
    correlation_remove = saved['correlation_remove']
    data = data.copy()
    data = preprocessing(data)
    data = fill_null(data)
    x, y = x_y_split(data)
    x = feature_eng.basic_features(x)
    x = x.drop(correlation_remove, axis=1)
    x = feature_eng.lag_rolling_features(x)
    x, y = feature_eng.remove_na(x, y)
    y = y[target_label]
    x = x[features]
    lgbm = joblib.load('lgbm.joblib')
    rf = joblib.load('rf.joblib')
    lgbm_predict = lgbm.predict_proba(x)
    rf_predict = rf.predict_proba(x)
    final_predict = (lgbm_predict + rf_predict) / 2
    final_predict = np.argmax(final_predict, axis=1)

    return final_predict, y


if __name__ == '__main__':
    data = pd.read_csv('data.csv')
    target_label = '_5s_side'
    train_model(data, target_label)
    pred, true_val = predict(data, target_label)


SYMBOL INDEX (32 symbols across 1 files)

FILE: modelling_pipeline.py
  function preprocessing (line 16) | def preprocessing(data):
  function check_null (line 38) | def check_null(data):
  function fill_null (line 56) | def fill_null(data):
  function x_y_split (line 93) | def x_y_split(data):
  class correlation_filter (line 102) | class correlation_filter:
    method filter (line 106) | def filter(cls, x, threshold=0.99):
  class feature_eng (line 134) | class feature_eng:
    method bid_ask_spread (line 146) | def bid_ask_spread(data):
    method bid_ask_qty_comb (line 150) | def bid_ask_qty_comb(data):
    method trade_price_feature (line 155) | def trade_price_feature(data):
    method diff_feature (line 183) | def diff_feature(data):
    method up_or_down (line 189) | def up_or_down(data):
    method lag_feature (line 195) | def lag_feature(data, col, lag):
    method rolling_feature (line 200) | def rolling_feature(data, col, window, feature):
    method basic_features (line 218) | def basic_features(cls, data):
    method lag_rolling_features (line 232) | def lag_rolling_features(cls, data):
    method remove_na (line 281) | def remove_na(x, y):
  class feature_selection (line 289) | class feature_selection:
    method select (line 293) | def select(cls, x, y):
    method rf_imp_features (line 301) | def rf_imp_features(cls, x, y, top_perc=0.05):
    method rf_importance_selection (line 310) | def rf_importance_selection(x, y, iter_time=3):
    method GA_features (line 328) | def GA_features(x, y):
  class model (line 353) | class model:
    method random_forest (line 364) | def random_forest(x, y):
    method lightgbm (line 370) | def lightgbm(cls, x, y):
    method GA_tune_lgbm (line 386) | def GA_tune_lgbm(cls, x, y):
    method GS_tune_lgbm (line 403) | def GS_tune_lgbm(cls, x, y):
  class feature (line 416) | class feature:
    method save (line 418) | def save(features, correlation_remove):
    method load (line 428) | def load():
  function train_model (line 436) | def train_model(data, target_label):
  function predict (line 455) | def predict(data, target_label):
