Repository: hzjken/HFT-price-prediction
Branch: master
Commit: ffa850903027
Files: 6
Total size: 14.1 MB

Directory structure:
gitextract_hkpi1sqm/
├── README.md
├── data.csv
├── features.txt
├── lgbm.joblib
├── modelling_pipeline.py
└── rf.joblib

================================================
FILE CONTENTS
================================================

================================================
FILE: README.md
================================================
# HFT-price-prediction
A project that uses tree-based machine learning models to predict whether an instrument's price will move up or down in high-frequency trading.

## Project Background
A hands-on data science exercise from a high-frequency trading company.

## Task
Build a model on the given data to predict whether the trading price will go up or down in the near future (a classification problem).

## Data Explanation
### Feature Columns
timestamp str, datetime string.
bid_price float, price of current bid in the market.
bid_qty float, quantity currently available at the bid price.
ask_price float, price of current ask in the market.
ask_qty float, quantity currently available at the ask price.
trade_price float, last traded price.
sum_trade_1s float, sum of quantity traded over the last second.
bid_advance_time float, seconds since bid price last advanced.
ask_advance_time float, seconds since ask price last advanced.
last_trade_time float, seconds since last trade.
### Labels
_1s_side int
_3s_side int
_5s_side int
Labels indicate the type of the first event that will happen in the next x seconds (x = 1, 3 or 5), where:
0 -- No price change.
1 -- Bid price decreased.
2 -- Ask price increased.
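
As a small illustration of the label encoding above, the sketch below maps the three codes to readable names and tallies a label column, which can help when judging class balance. The `LABEL_NAMES` mapping, the `label_distribution` helper and the sample values are hypothetical and not part of the repository.

```python
from collections import Counter

# Label codes as described above; the names on the right are illustrative,
# not identifiers from the repository.
LABEL_NAMES = {0: "no_change", 1: "bid_down", 2: "ask_up"}


def label_distribution(labels):
    """Return the share of each label name in an iterable of label codes."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {name: counts.get(code, 0) / total for code, name in LABEL_NAMES.items()}


# Hypothetical values of a `_5s_side` column.
sample = [0, 0, 1, 2, 0, 2, 1, 0]
print(label_distribution(sample))
```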
## Process
### Preprocessing
data type conversion: **`preprocessing()`**
data check: **`check_null()`**
missing value handling: **`fill_null()`**. According to the null check, most null values of sum_trade_1s occur when last_trade_time is larger than 1 sec, in which case sum_trade_1s should be 0. We therefore assume that every null sum_trade_1s can be filled with 0. Under that assumption, a null last_trade_time is filled with the last_trade_time of the previous record plus the time elapsed since that record, if the interval between the two records is smaller than 1 sec.
### Feature Engineering
correlation filter: **`correlation_filter.filter()`**, removes highly correlated columns to reduce data redundancy.
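logical feature engineering: **`feature_eng.basic_features()`**, builds features from trading logic (bid-ask spread, bid/ask quantity total and difference, trade price position relative to the quotes, first differences).
time-rolling feature engineering: **`feature_eng.lag_rolling_features()`**, builds lagged values (up to 5 records back) and rolling statistics over the last 5, 10 and 20 records and over the last 1, 3, 5 and 10 seconds.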
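
To make the lag and rolling features concrete, here is a minimal pandas sketch that adds one lag column, one record-count rolling mean and one time-based rolling mean. The helper name `add_lag_and_rolling`, the default windows and the toy data are assumptions for illustration; the repository builds many more columns, but the naming pattern (`<col>_lag_<n>`, `<col>_rolling_<stat>_<window>`) follows `features.txt`.

```python
import pandas as pd


def add_lag_and_rolling(df, col, lag=1, n_window=3, sec_window="2s"):
    """Add one lag column, one count-based and one time-based rolling mean."""
    out = df.copy()
    out[f"{col}_lag_{lag}"] = out[col].shift(lag)
    out[f"{col}_rolling_mean_{n_window}"] = out[col].rolling(window=n_window).mean()
    # Time-based windows need a sorted DatetimeIndex, so roll on a re-indexed copy.
    by_time = out.set_index("timestamp")[col].rolling(window=sec_window).mean()
    out[f"{col}_rolling_mean_{sec_window}"] = by_time.to_numpy()
    return out


# Toy quotes with uneven spacing between records (hypothetical values).
ticks = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2020-01-01 00:00:00.0",
        "2020-01-01 00:00:00.5",
        "2020-01-01 00:00:01.0",
        "2020-01-01 00:00:04.0",
    ]),
    "bid_qty": [10.0, 20.0, 30.0, 40.0],
})
features = add_lag_and_rolling(ticks, "bid_qty")
print(features)
```

The count-based window leaves the first rows empty until enough records exist, while the time-based window only looks at records inside the trailing interval, so the last row (three seconds after the previous one) averages over itself alone.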
### Feature Selection
**`feature_selection.select()`**, a hybrid approach that takes the union of genetic algorithm selection and feature importance selection.
genetic algorithm selection: **`feature_selection.GA_features()`**
feature importance selection: **`feature_selection.rf_imp_features()`**
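
The hybrid selection can be sketched as a percentile cut on importances followed by a set union. This is a simplified sketch: the importance values and the list standing in for the genetic algorithm's output are made up, and `top_importance_features` and `hybrid_select` are illustrative names rather than the repository's functions.

```python
import numpy as np


def top_importance_features(importances, top_perc=0.05):
    """Keep features whose importance is at or above the (1 - top_perc) percentile."""
    names = list(importances)
    values = np.array([importances[name] for name in names])
    threshold = np.percentile(values, (1 - top_perc) * 100)
    return [name for name, value in zip(names, values) if value >= threshold]


def hybrid_select(importances, ga_selected, top_perc=0.05):
    """Union of importance-ranked features and GA-selected features."""
    return sorted(set(top_importance_features(importances, top_perc)) | set(ga_selected))


# Hypothetical importances and a hypothetical GA result.
importances = {"spread": 0.4, "bid_qty": 0.3, "ask_qty": 0.2, "last_trade_time": 0.1}
ga_selected = ["ask_qty"]
print(hybrid_select(importances, ga_selected, top_perc=0.25))
```

Taking the union means a feature survives if either method favours it, so the final set is never smaller than the larger of the two selections.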
### Modelling
Ensemble of a LightGBM model and a random forest model; their predicted class probabilities are averaged.
random forest: **`model.random_forest()`**
lightGBM: **`model.lightgbm()`**
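
The ensemble step in `predict()` averages the two models' class probabilities and takes the most likely class. A minimal NumPy sketch of that soft vote follows; the probability arrays are invented stand-ins for `predict_proba` output, and `soft_vote` is a hypothetical helper name.

```python
import numpy as np


def soft_vote(proba_a, proba_b):
    """Average two (n_samples, n_classes) probability arrays and pick the top class."""
    avg = (np.asarray(proba_a) + np.asarray(proba_b)) / 2
    # argmax returns a column index; it equals the label only when the
    # classes are 0, 1, 2 in column order, as assumed here.
    return np.argmax(avg, axis=1)


# Hypothetical probabilities for two samples and three classes.
lgbm_proba = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
rf_proba = [[0.2, 0.2, 0.6], [0.3, 0.3, 0.4]]
print(soft_vote(lgbm_proba, rf_proba))
```

In the first sample the random forest alone would pick class 2, but the averaged probabilities favour class 0, which shows how one confident model can outweigh the other.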
### Parameter Tuning
The size of the search space decides whether grid search or genetic search is used to tune the LightGBM model's parameters: genetic search when the grid holds more than 1000 combinations, grid search otherwise.
grid search: **`model.GS_tune_lgbm()`**
genetic search: **`model.GA_tune_lgbm()`**
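
The choice between the two tuners depends only on how many parameter combinations the grid contains. The sketch below counts them and applies the 1000-combination cut-off used in `model.lightgbm()`; the helper name `choose_search` and both example grids are hypothetical.

```python
from math import prod


def choose_search(param_grid, max_grid_size=1000):
    """Return the search type and the number of parameter combinations."""
    n_comb = prod(len(values) for values in param_grid.values())
    return ("genetic" if n_comb > max_grid_size else "grid", n_comb)


# Hypothetical grids: a small one and one too large to enumerate cheaply.
small_grid = {"learning_rate": [0.001, 0.01], "n_estimators": [100, 200, 300], "max_depth": [3, 4]}
large_grid = {"n_estimators": list(range(50)), "max_depth": list(range(30))}
print(choose_search(small_grid))
print(choose_search(large_grid))
```

Multiplying the list lengths gives the same count as enumerating every combination, without building the full list in memory.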
## Performance
Out-of-sample classification accuracy is roughly 76-78%, which means its prediction of the short-term future price movement is acceptable.

================================================
FILE: data.csv
================================================
[File too large to display: 14.1 MB]

================================================
FILE: features.txt
================================================
{"keep_features": ["bid_ask_qty_diff_diff_lag_5", "up_down_rolling_std_5", "spread_diff_rolling_mean_20", "spread_diff_rolling_mean_5s", "bid_price_rolling_std_1s", "bid_advance_time_rolling_mean_1s", "ask_qty_diff_rolling_max_10s", "ask_price_diff_rolling_std_3s", "ask_qty_rolling_std_10s", "bid_ask_qty_diff_rolling_std_20", "ask_advance_time_lag_2", "bid_ask_qty_total_rolling_max_10", "bid_ask_qty_diff_rolling_sum_5", "bid_qty_rolling_min_5", "bid_ask_qty_diff_diff_rolling_sum_3s", "sum_trade_1s_rolling_std_1s", "spread_rolling_mean_1s", "trade_price_diff_rolling_sum_10", "ask_qty_diff_rolling_sum_10s", "ask_price_diff_rolling_mean_5s", "sum_trade_1s_diff_rolling_sum_20", "bid_price_lag_5", "sum_trade_1s_rolling_mean_5", "bid_ask_qty_diff_rolling_min_5", "bid_ask_qty_diff_diff_rolling_std_3s", "bid_ask_qty_total_rolling_min_5", "bid_advance_time_diff_lag_2", "trade_price_compare", "bid_ask_qty_diff_diff_rolling_mean_20", "trade_price_diff_rolling_sum_3s", "bid_ask_qty_diff_rolling_sum_1s", "bid_qty", "ask_advance_time_rolling_mean_5s", "spread_diff_rolling_std_1s", "trade_price_compare_diff_rolling_std_1s", "bid_ask_qty_diff", "ask_qty_lag_1", "ask_qty_diff_rolling_sum_1s", "trade_price_compare_diff_rolling_sum_5", "spread", "bid_qty_lag_1", "bid_ask_qty_diff_rolling_mean_10", "bid_qty_lag_2", "bid_price_lag_3", "ask_qty_rolling_min_3s", "ask_advance_time_lag_4", "spread_diff_rolling_std_3s", "bid_qty_rolling_max_20", "ask_qty_lag_3", "bid_qty_diff_lag_5", "bid_price_diff_rolling_sum_5s", "trade_price_compare_diff_lag_4", "bid_price_diff_lag_4", 
"bid_qty_diff_rolling_sum_1s", "bid_ask_qty_diff_diff_rolling_max_1s", "bid_advance_time_rolling_mean_3s", "ask_advance_time_diff_lag_1", "ask_qty_rolling_min_5", "spread_rolling_std_3s", "bid_advance_time_rolling_std_20", "ask_qty_diff_rolling_min_20", "sum_trade_1s_rolling_mean_10", "spread_diff_rolling_std_20", "ask_qty_rolling_mean_5", "bid_qty_rolling_min_10", "trade_price_compare_diff_lag_5", "bid_price_rolling_std_5", "trade_price_rolling_mean_10", "sum_trade_1s_diff_rolling_std_10", "bid_advance_time_diff_rolling_sum_5s", "ask_qty_lag_2", "trade_price_pos_diff_rolling_std_10s", "ask_advance_time_diff_rolling_mean_5", "ask_qty_rolling_min_10", "sum_trade_1s_diff_lag_5", "last_trade_time_diff_lag_4", "bid_qty_diff_rolling_std_5", "bid_price_diff_lag_3", "ask_advance_time_lag_3", "ask_qty_rolling_mean_20", "ask_qty_diff_rolling_mean_5", "bid_ask_qty_diff_diff_rolling_sum_10s", "bid_advance_time_rolling_mean_5s", "sum_trade_1s_lag_1", "bid_qty_rolling_min_3s", "bid_qty_rolling_max_5s", "sum_trade_1s_diff_lag_2", "bid_ask_qty_total_rolling_max_10s", "bid_qty_rolling_mean_10", "bid_advance_time_lag_1", "bid_ask_qty_diff_lag_1", "bid_ask_qty_diff_diff_rolling_min_1s", "bid_qty_diff_rolling_std_10s", "bid_price_rolling_std_5s", "ask_qty_diff_rolling_std_5s", "bid_qty_diff_rolling_max_10", "last_trade_time", "ask_qty_diff_rolling_mean_1s", "trade_price_pos_diff_rolling_mean_3s", "bid_ask_qty_total_diff_rolling_max_3s", "ask_qty_diff_rolling_sum_3s", "last_trade_time_diff_rolling_mean_5s", "bid_ask_qty_total_diff_rolling_max_10", "bid_qty_rolling_mean_5", "ask_qty", "bid_ask_qty_diff_diff_rolling_mean_5s", "bid_ask_qty_total_diff_rolling_sum_5", "bid_qty_rolling_min_20", "last_trade_time_diff_rolling_sum_5", "bid_price_rolling_mean_10s", "ask_advance_time_diff_rolling_mean_1s", "sum_trade_1s_diff"], "correlation_remove": ["ask_price"]}

================================================
FILE: modelling_pipeline.py
================================================
import pandas as pd
import numpy as np
import json
import joblib  # `sklearn.externals.joblib` was removed in scikit-learn 0.23
from itertools import product
from bisect import bisect_left
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import TimeSeriesSplit
from genetic_selection import GeneticSelectionCV
from lightgbm import LGBMClassifier
from evolutionary_search import EvolutionaryAlgorithmSearchCV
from sklearn.model_selection import GridSearchCV
from scipy.stats import mode


def preprocessing(data):
    '''align data type and time order'''
    float_list = [
        'bid_price',
        'bid_qty',
        'ask_price',
        'ask_qty',
        'trade_price',
        'sum_trade_1s',
        'bid_advance_time',
        'ask_advance_time',
        'last_trade_time',
    ]
    data['timestamp'] = pd.to_datetime(data['timestamp'])
    for i in float_list:
        data[i] = data[i].astype(float)
    data = data.sort_values(by='timestamp', ascending=True).reset_index(drop=True)
    return data


def check_null(data):
    '''check null values in dataframe'''
    data = data.copy()
    have_null_cols = list(data.columns[data.isnull().any()])
    print('Columns with null values are {}'.format(', '.join(have_null_cols)))
    for i in have_null_cols:
        print('number of rows that column {} is null: {}'.format(i, data[i].isnull().sum()))
        print('null percentage is {}'.format(round(data[i].isnull().sum() / data.shape[0], 2)))

    # cross-check how the two columns with nulls relate to each other
    stat1 = data['sum_trade_1s'][data['last_trade_time'].isnull()].notnull().sum()
    stat2 = data['last_trade_time'][data['sum_trade_1s'].isnull()].notnull().sum()
    stat3 = data['sum_trade_1s'][data['last_trade_time'] >= 1].isnull().sum()
    stat4 = stat3 / data['sum_trade_1s'].isnull().sum()
    print('number of rows sum_trade_1s is not null when last_trade_time is null: {}'.format(stat1))
    print('number of rows last_trade_time is not null when sum_trade_1s is null: {}'.format(stat2))
    print('number of rows sum_trade_1s null at last_trade_time >= 1: {}, percentage: {}'.format(stat3, round(stat4, 2)))


def fill_null(data):
    '''
    Based on the null check and basic logic, most of the sum_trade_1s null values
    happen when last_trade_time is larger than 1 sec (in this case sum_trade_1s
    should be 0). Therefore, we assume that all sum_trade_1s null values can be
    filled with 0. Based on that assumption, last_trade_time can be filled with
    the last_trade_time of the previous record plus the elapsed time if the
    record interval is smaller than 1 sec.
    '''

    class last_trade_time_filler:
        prev_last_trade_time = None
        prev_timestamp = None

        @classmethod
        def fill(cls, index):
            last_trade_time = data.loc[index, 'last_trade_time']
            timestamp = data.loc[index, 'timestamp']
            # the first record has no predecessor, so a null there stays null
            if pd.isnull(last_trade_time) and cls.prev_timestamp is not None:
                # total_seconds() gives the full interval; `.microseconds` is only
                # the sub-second component and would always be below 1 sec
                time_interval = (timestamp - cls.prev_timestamp).total_seconds()
                if time_interval <= 1:
                    last_trade_time = cls.prev_last_trade_time + time_interval
                else:
                    last_trade_time = np.nan
            cls.prev_last_trade_time = last_trade_time
            cls.prev_timestamp = timestamp
            return last_trade_time

    data = data.copy()
    data.loc[data['sum_trade_1s'].isnull(), 'sum_trade_1s'] = 0
    data['last_trade_time'] = data.index.map(last_trade_time_filler.fill)
    print('number of null columns is: {} now'.format(len(list(data.columns[data.isnull().any()]))))
    return data


def x_y_split(data):
    label_cols = ['_1s_side', '_3s_side', '_5s_side']
    feature_cols = list(set(data.columns) - set(label_cols))
    y = data[label_cols].copy()
    x = data[feature_cols].copy()
    return x, y


class correlation_filter:
    remove_cols = []

    @classmethod
    def filter(cls, x, threshold=0.99):
        x = x.copy()
        index2col = {i: col for i, col in enumerate(x.columns)}
        corr = np.array(x.corr())
        correlated_pairs = list(zip(*np.where(np.abs(corr) >= threshold)))
        to_be_delete = []
        for i, j in correlated_pairs:
            former = index2col[i]
            latter = index2col[j]
            if former != latter:
                group = {former, latter}
                # merge the pair with every existing group it overlaps; the original
                # test `(...) != {}` compared a set with a dict and was always True
                for del_set in [s for s in to_be_delete if s & group]:
                    group |= del_set
                    to_be_delete.remove(del_set)
                to_be_delete.append(group)

        for i in to_be_delete:
            delete_set = i.copy()
            delete_set.pop()  # keep one column of each correlated group
            x = x.drop(list(delete_set), axis=1)
            cls.remove_cols += list(delete_set)
        return x


class feature_eng:
    timestamp = None
    max_lag = 5
    num_window = [5, 10, 20]
    sec_window = [1, 3, 5, 10]
    rolling_sum_cols = []
    rolling_mean_cols = []
    rolling_max_cols = []
    rolling_min_cols = []
    rolling_std_cols = []

    @staticmethod
    def bid_ask_spread(data):
        data['spread'] = data['ask_price'] - data['bid_price']

    @staticmethod
    def bid_ask_qty_comb(data):
        data['bid_ask_qty_total'] = data['ask_qty'] + data['bid_qty']
        data['bid_ask_qty_diff'] = data['ask_qty'] - data['bid_qty']

    @staticmethod
    def trade_price_feature(data):
        data['trade_price_compare'] = 0  # when trade price between current bid and ask price
        data.loc[data['trade_price'] <= data['bid_price'], 'trade_price_compare'] = -1  # trade price on current bid side
        data.loc[data['trade_price'] >= data['ask_price'], 'trade_price_compare'] = 1  # trade price on current ask side

        # whether trade price happens on bid side or ask side during the time it happens
        last_trade_timestamp = data['timestamp'] - pd.to_timedelta(data['last_trade_time'], unit='s')
        idx_list = [bisect_left(data['timestamp'], i) for i in list(last_trade_timestamp)]
        trade_price_pos = []
        for i, index in enumerate(idx_list):
            index1 = index
            index2 = index1 + 1 if index1 < data.shape[0] - 1 else index1
            bid1 = data['bid_price'][index1]
            bid2 = data['bid_price'][index2]
            ask1 = data['ask_price'][index1]
            ask2 = data['ask_price'][index2]
            trade_price = data['trade_price'][i]
            if (bid1 <= trade_price <= bid2) or (bid2 <= trade_price <= bid1):
                trade_price_pos.append(-1)  # happen on bid side
            elif (ask1 <= trade_price <= ask2) or (ask2 <= trade_price <= ask1):
                trade_price_pos.append(1)  # happen on ask side
            else:
                trade_price_pos.append(0)  # unknown case
        data['trade_price_pos'] = trade_price_pos

    @staticmethod
    def diff_feature(data):
        for i in set(data.columns) - {'timestamp'}:
            new_name = '{}_diff'.format(i)
            data[new_name] = data[i] - data[i].shift(1)

    @staticmethod
    def up_or_down(data):
        data['up_down'] = 0
        data.loc[data['bid_price_diff'] < 0, 'up_down'] = -1
        data.loc[data['ask_price_diff'] > 0, 'up_down'] = 1

    @staticmethod
    def lag_feature(data, col, lag):
        new_col_name = '{}_lag_{}'.format(col, lag)
        data[new_col_name] = data[col].shift(lag)

    @staticmethod
    def rolling_feature(data, col, window, feature):
        # `window` is a record count (int) or a time offset string such as '3s'
        rolling = data[col].rolling(window=window)
        new_col = '{}_rolling_{}_{}'.format(col, feature, window)
        if feature == 'sum':
            data[new_col] = rolling.sum()
        elif feature == 'mean':
            data[new_col] = rolling.mean()
        elif feature == 'max':
            data[new_col] = rolling.max()
        elif feature == 'min':
            data[new_col] = rolling.min()
        elif feature == 'std':
            data[new_col] = rolling.std()
        elif feature == 'mode':
            data[new_col] = rolling.apply(lambda x: mode(x)[0])

    @classmethod
    def basic_features(cls, data):
        data = data.copy()
        cls.timestamp = data['timestamp']
        cls.bid_ask_spread(data)
        cls.bid_ask_qty_comb(data)
        cls.trade_price_feature(data)
        cls.diff_feature(data)
        cls.up_or_down(data)
        data = data.drop('timestamp', axis=1)
        return data

    @classmethod
    def lag_rolling_features(cls, data):
        data = data.copy()

        # get lag and rolling feature based on previous n records
        rolling_cols = set(data.columns) - {'trade_price_compare', 'trade_price_pos'}
        cls.rolling_sum_cols = [i for i in rolling_cols if 'diff' in i or 'up_down' in i]
        cls.rolling_mean_cols = rolling_cols
        cls.rolling_max_cols = [i for i in rolling_cols if 'bid_qty' in i or 'ask_qty' in i]
        cls.rolling_min_cols = [i for i in rolling_cols if 'bid_qty' in i or 'ask_qty' in i]
        cls.rolling_std_cols = rolling_cols

        for col in rolling_cols:
            for lag in range(1, cls.max_lag + 1):
                cls.lag_feature(data, col, lag)

        for col in rolling_cols:
            for num_window in cls.num_window:
                if col in cls.rolling_sum_cols:
                    cls.rolling_feature(data, col, num_window, 'sum')
                if col in cls.rolling_mean_cols:
                    cls.rolling_feature(data, col, num_window, 'mean')
                if col in cls.rolling_max_cols:
                    cls.rolling_feature(data, col, num_window, 'max')
                if col in cls.rolling_min_cols:
                    cls.rolling_feature(data, col, num_window, 'min')
                if col in cls.rolling_std_cols:
                    cls.rolling_feature(data, col, num_window, 'std')

        # get rolling feature based on previous n seconds
        data.index = cls.timestamp
        for col in rolling_cols:
            for sec_window in cls.sec_window:
                sec_window = '{}s'.format(sec_window)
                if col in cls.rolling_sum_cols:
                    cls.rolling_feature(data, col, sec_window, 'sum')
                if col in cls.rolling_mean_cols:
                    cls.rolling_feature(data, col, sec_window, 'mean')
                if col in cls.rolling_max_cols:
                    cls.rolling_feature(data, col, sec_window, 'max')
                if col in cls.rolling_min_cols:
                    cls.rolling_feature(data, col, sec_window, 'min')
                if col in cls.rolling_std_cols:
                    cls.rolling_feature(data, col, sec_window, 'std')
                if col in ['up_down', 'trade_price_compare', 'trade_price_pos']:
                    cls.rolling_feature(data, col, sec_window, 'mode')
        return data

    @staticmethod
    def remove_na(x, y):
        # drop rows with missing features and keep the labels aligned with them
        x = x.reset_index(drop=True)
        x = x.dropna()
        y = y.loc[x.index, :].reset_index(drop=True)
        x = x.reset_index(drop=True)
        return x, y


class feature_selection:
    '''feature selection combining feature importance ranking and GA optimization based on random forest model'''

    @classmethod
    def select(cls, x, y):
        rf_imp_features = cls.rf_imp_features(x, y)
        ga_features = cls.GA_features(x, y)
        features = set(rf_imp_features) | set(ga_features)
        return list(features)

    @classmethod
    def rf_imp_features(cls, x, y, top_perc=0.05):
        '''select top features based on feature importance ranking among all the features'''
        feature_imp = cls.rf_importance_selection(x, y)
        perc_threshold = np.percentile(feature_imp['avg_importance'], int((1 - top_perc) * 100))
        features = list(feature_imp.loc[feature_imp['avg_importance'] >= perc_threshold, 'feature'])
        return features

    @staticmethod
    def rf_importance_selection(x, y, iter_time=3):
        '''average the random forest feature importance over iter_time fits'''
        feature_imp = pd.DataFrame(np.zeros((x.shape[1], iter_time + 2)))
        feature_imp.columns = ['feature'] + ['importance_{}'.format(i) for i in range(1, iter_time + 1)] + [
            'avg_importance']
        for col in feature_imp.columns:
            feature_imp[col] = list(x.columns)
        for i in range(1, iter_time + 1):
            col = 'importance_{}'.format(i)
            rf = RandomForestClassifier(n_estimators=10, max_depth=8)
            rf.fit(x, y)
            feature_imp_dict = dict(zip(x.columns, rf.feature_importances_))
            # map feature names to importances so the column holds floats
            feature_imp[col] = feature_imp['feature'].map(feature_imp_dict)
        feature_imp['avg_importance'] = feature_imp.iloc[:, 1:-1].mean(axis=1)
        return feature_imp

    @staticmethod
    def GA_features(x, y):
        rf = RandomForestClassifier(max_depth=8, n_estimators=10)
        selector = GeneticSelectionCV(
            rf,
            cv=TimeSeriesSplit(n_splits=4),
            verbose=1,
            scoring="accuracy",
            max_features=80,
            n_population=200,
            crossover_proba=0.5,
            mutation_proba=0.2,
            n_generations=100,
            crossover_independent_proba=0.5,
            mutation_independent_proba=0.05,
            tournament_size=3,
            n_gen_no_change=5,
            caching=True,
            n_jobs=-1
        )
        selector = selector.fit(x, y)
        features = x.columns[selector.support_]
        return features


class model:
    lgbm_paramgrid = {
        'learning_rate': np.arange(0.0005, 0.0015, 0.0001),
        'n_estimators': range(800, 2000, 200),
        'max_depth': [3, 4],
        'colsample_bytree': np.arange(0.2, 0.5, 0.1),
        'reg_alpha': [1],
        'reg_lambda': [1]
    }

    @staticmethod
    def random_forest(x, y):
        rf = RandomForestClassifier(n_estimators=200, max_depth=8)
        rf.fit(x, y)
        return rf

    @classmethod
    def lightgbm(cls, x, y):
        # enumerate the grid to decide between genetic search and grid search
        keys, vals = list(zip(*cls.lgbm_paramgrid.items()))
        products = list(product(*vals))
        param_comb = [dict(zip(keys, i)) for i in products]
        if len(param_comb) > 1000:
            best_param = cls.GA_tune_lgbm(x, y)
        else:
            best_param = cls.GS_tune_lgbm(x, y)
        lightgbm = LGBMClassifier(**best_param)
        lightgbm.fit(x, y)
        return lightgbm

    @classmethod
    def GA_tune_lgbm(cls, x, y):
        tuner = EvolutionaryAlgorithmSearchCV(
            estimator=LGBMClassifier(),
            params=cls.lgbm_paramgrid,
            scoring="accuracy",
            cv=TimeSeriesSplit(n_splits=4),
            verbose=1,
            population_size=50,
            gene_mutation_prob=0.2,
            gene_crossover_prob=0.5,
            tournament_size=3,
            generations_number=20,
        )
        tuner.fit(x, y)
        return tuner.best_params_

    @classmethod
    def GS_tune_lgbm(cls, x, y):
        tuner = GridSearchCV(
            estimator=LGBMClassifier(),
            param_grid=cls.lgbm_paramgrid,
            scoring="accuracy",
            cv=TimeSeriesSplit(n_splits=4),
            verbose=1,
            n_jobs=-1,
        )
        tuner.fit(x, y)
        return tuner.best_params_


class feature:

    @staticmethod
    def save(features, correlation_remove):
        final = {
            'keep_features': features,
            'correlation_remove': correlation_remove
        }
        with open('features.txt', 'w') as f:
            f.write(json.dumps(final))

    @staticmethod
    def load():
        with open('features.txt', 'r') as f:
            features = f.read()
        features = json.loads(features)
        return features


def train_model(data, target_label):
    data = data.copy()
    data = preprocessing(data)
    check_null(data)
    data = fill_null(data)
    x, y = x_y_split(data)
    x = feature_eng.basic_features(x)
    x = correlation_filter.filter(x)
    x = feature_eng.lag_rolling_features(x)
    x, y = feature_eng.remove_na(x, y)
    y = y[target_label]
    features = feature_selection.select(x, y)
    feature.save(features, correlation_filter.remove_cols)
    lightgbm = model.lightgbm(x[features], y)
    rf = model.random_forest(x[features], y)
    joblib.dump(rf, 'rf.joblib')
    joblib.dump(lightgbm, 'lgbm.joblib')


def predict(data, target_label):
    '''returns both the prediction and the true values of target_label'''
    saved = feature.load()
    features = saved['keep_features']
    correlation_remove = saved['correlation_remove']
    data = data.copy()
    data = preprocessing(data)
    data = fill_null(data)
    x, y = x_y_split(data)
    x = feature_eng.basic_features(x)
    x = x.drop(correlation_remove, axis=1)
    x = feature_eng.lag_rolling_features(x)
    x, y = feature_eng.remove_na(x, y)
    y = y[target_label]
    x = x[features]
    lgbm = joblib.load('lgbm.joblib')
    rf = joblib.load('rf.joblib')
    # average the class probabilities of the two models, then take the top class
    lgbm_predict = lgbm.predict_proba(x)
    rf_predict = rf.predict_proba(x)
    final_predict = (lgbm_predict + rf_predict) / 2
    final_predict = np.argmax(final_predict, axis=1)
    return final_predict, y


if __name__ == '__main__':
    data = pd.read_csv('data.csv')
    target_label = '_5s_side'
    train_model(data, target_label)
    pred, true_val = predict(data, target_label)