Machine Learning

The machine learning class

This class does most of the heavy lifting for machine learning applications in the model.

class machine_learning.MachineLearning(config: dict, estimator: RandomForestClassifier | RandomForestRegressor)

Bases: object

fit_predict(X_train: ndarray | DataFrame, y_train: ndarray, X_test: ndarray | DataFrame, out_dir: str, run_nr: int, tune_hyperparameters=False, n_jobs=2, verbose=0) → Tuple[ndarray, ndarray, ndarray]

Fits the classifier on the training-data and dumps the fitted classifier to file with pickle so that it can be reused during projections. The classifier is then used to make predictions for the test-data, including the probabilities of those predictions. If tune_hyperparameters is True, the hyperparameters of the classifier are tuned with GridSearchCV.

Parameters:

X_train (np.ndarray, pd.DataFrame) – training-data of variable values.
y_train (np.ndarray) – training-data of conflict data.
X_test (np.ndarray, pd.DataFrame) – test-data of variable values.
out_dir (str) – path to output folder.
run_nr (int) – number of the fit/predict repetition and of the classifier created in it.
tune_hyperparameters (bool, optional) – whether to tune hyperparameters. Defaults to False.
n_jobs (int, optional) – Number of cores to be used. Defaults to 2.
verbose (int, optional) – Verbosity level. Defaults to 0.

Returns:

np.ndarray: array with the predictions made.
np.ndarray: array with probabilities of the predictions made.
np.ndarray: dataframe containing permutation importances of variables.

Return type:

Tuple[np.ndarray, np.ndarray, np.ndarray]
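The fit, dump, and predict sequence described above can be sketched roughly as follows. This is a minimal illustration and not the method's actual implementation: the function name `fit_predict_sketch`, the file name pattern `clf_{run_nr}.pkl`, the toy data, and the choice to compute permutation importances on the training data are all assumptions, and hyperparameter tuning is left out.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance


def fit_predict_sketch(estimator, X_train, y_train, X_test, out_dir, run_nr):
    """Hypothetical stand-in for MachineLearning.fit_predict (no tuning)."""
    estimator.fit(X_train, y_train)

    # Persist the fitted estimator; the file name is an assumption.
    clf_path = os.path.join(out_dir, f"clf_{run_nr}.pkl")
    with open(clf_path, "wb") as f:
        pickle.dump(estimator, f)

    y_pred = estimator.predict(X_test)
    y_prob = estimator.predict_proba(X_test)  # classifiers only

    # Permutation importances on the training data (an assumption).
    result = permutation_importance(
        estimator, X_train, y_train, n_repeats=3, random_state=0
    )
    return y_pred, y_prob, result.importances_mean, clf_path


# Toy data: 200 samples, 3 variables, binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)

with tempfile.TemporaryDirectory() as out_dir:
    y_pred, y_prob, importances, clf_path = fit_predict_sketch(
        RandomForestClassifier(n_estimators=10, random_state=0),
        X[:150], y[:150], X[150:], out_dir, run_nr=1,
    )
    # Reload the pickled estimator, as would happen during projections.
    with open(clf_path, "rb") as f:
        reloaded = pickle.load(f)
    same_after_reload = bool(np.array_equal(reloaded.predict(X[150:]), y_pred))
```

Note that `predict_proba` exists only on classifiers, so a sketch like this would need adjusting for a RandomForestRegressor.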

split_scale_train_test_split(X: ndarray | DataFrame, Y: ndarray)

Splits and transforms the X-array (sample data) and Y-array (target data) into test-data and training-data. The fraction used for the split is specified in the configuration file. Additionally, the unique identifier and geometry of each data point in both test-data and training-data are retrieved in separate arrays.

Parameters:

X (array) – array containing the variable values plus unique identifier and geometry information.
Y (array) – array containing merely the binary conflict classifier data.

Returns:

arrays containing training-set and test-set for X-data and Y-data as well as IDs and geometry.

Return type:

arrays
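A rough sketch of such a split is shown below. It is illustrative only: the column layout (ID in column 0, geometry in column 1, variables after that), the function name `split_scale_sketch`, the `test_fraction` argument, and the choice to fit the scaler on the training portion only are assumptions, since the order of scaling and splitting in the actual method is not stated here.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler


def split_scale_sketch(X, Y, test_fraction, scaler):
    """Hypothetical stand-in for split_scale_train_test_split."""
    # Assumed layout: column 0 = ID, column 1 = geometry, rest = variables.
    ids, geoms = X[:, 0], X[:, 1]
    variables = X[:, 2:].astype(float)

    # Split all four arrays consistently so IDs and geometries stay aligned.
    (X_train, X_test, y_train, y_test,
     id_train, id_test, geom_train, geom_test) = train_test_split(
        variables, Y, ids, geoms, test_size=test_fraction, random_state=0
    )

    # Fit the scaler on the training portion only (an assumption).
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    return (X_train, X_test, y_train, y_test,
            id_train, id_test, geom_train, geom_test)


# Toy input: 20 data points with ID, geometry string, and 2 variables.
rng = np.random.default_rng(1)
n = 20
X = np.empty((n, 4), dtype=object)
X[:, 0] = np.arange(n)
X[:, 1] = [f"POINT ({i} {i})" for i in range(n)]
X[:, 2:] = rng.normal(size=(n, 2))
Y = rng.integers(0, 2, size=n)

(X_train, X_test, y_train, y_test,
 id_train, id_test, geom_train, geom_test) = split_scale_sketch(
    X, Y, test_fraction=0.25, scaler=MinMaxScaler()
)
```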

Other functions

Functions for machine learning applications in the model.

machine_learning.apply_gridsearchCV(estimator: RandomForestClassifier | RandomForestRegressor, X_train: ndarray, y_train: ndarray, n_jobs=2, verbose=0) → RandomForestClassifier | RandomForestRegressor

Applies grid search to find the best hyperparameters for the given estimator (RandomForestClassifier or RandomForestRegressor).

Parameters:

estimator (Union[RandomForestClassifier, RandomForestRegressor]) – Estimator to be used in the grid search.
X_train (np.ndarray) – Feature matrix.
y_train (np.ndarray) – Target vector.
n_jobs (int, optional) – Number of cores to be used. Defaults to 2.
verbose (int, optional) – Verbosity level. Defaults to 0.

Returns:

Best estimator of the grid search.

Return type:

Union[ensemble.RandomForestClassifier, ensemble.RandomForestRegressor]
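For orientation, a grid search over a random forest with scikit-learn's GridSearchCV might look like the sketch below. The parameter grid, the number of cross-validation folds, and the toy data are assumptions chosen for illustration; the grid actually used by `apply_gridsearchCV` is not documented here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grid only; not the grid used by the model.
param_grid = {"n_estimators": [5, 10], "max_depth": [2, 4]}

# Toy training data: 60 samples, 3 variables, binary target.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(60, 3))
y_train = (X_train[:, 0] > 0).astype(int)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,        # assumed number of folds
    n_jobs=1,
    verbose=0,
)
search.fit(X_train, y_train)

# With the default refit=True, this is the estimator refitted on all
# training data using the best parameter combination found.
best_estimator = search.best_estimator_
```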

machine_learning.define_scaling(config: dict) → MinMaxScaler | StandardScaler | RobustScaler | QuantileTransformer

Defines scaling method based on model configurations.

Parameters:

config (dict) – Parsed configuration-settings of the model.

Returns:

the specified scaling method instance.

Return type:

scaler
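One plausible way to map a configuration entry to one of the four scalers in the signature is a simple lookup table, sketched below. The configuration keys `"machine_learning"` and `"scaler"`, the function name `define_scaling_sketch`, and the error handling are assumptions; the real configuration layout is not shown in this section.

```python
from sklearn.preprocessing import (
    MinMaxScaler,
    QuantileTransformer,
    RobustScaler,
    StandardScaler,
)

# Scaler classes named in the signature, keyed by class name.
SCALERS = {
    "MinMaxScaler": MinMaxScaler,
    "StandardScaler": StandardScaler,
    "RobustScaler": RobustScaler,
    "QuantileTransformer": QuantileTransformer,
}


def define_scaling_sketch(config):
    """Hypothetical stand-in for define_scaling; config keys are assumed."""
    name = config["machine_learning"]["scaler"]
    if name not in SCALERS:
        raise ValueError(f"Unknown scaling method: {name!r}")
    # Return a fresh, unfitted scaler instance.
    return SCALERS[name]()


scaler = define_scaling_sketch({"machine_learning": {"scaler": "RobustScaler"}})
```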

machine_learning.predictive(X: ndarray, estimator: RandomForestClassifier, scaler: MinMaxScaler | StandardScaler | RobustScaler | QuantileTransformer) → DataFrame

Predictive model that uses the already fitted classifier to make annual projections for the projection period. Like the other models, it reads data which are then scaled and used in conjunction with the classifier to project conflict risk.

Parameters:

X (np.ndarray) – array containing the variable values plus unique identifier and geometry information.
estimator (RandomForestClassifier) – the fitted RandomForestClassifier.
scaler (scaler) – the fitted specified scaling method instance.

Returns:

dataframe containing model output on polygon-basis.

Return type:

pd.DataFrame
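The projection step can be illustrated with the sketch below, which applies an already fitted scaler and classifier to new data and collects the result per data point. The column layout of `X` (ID, geometry, then variables), the output column names, and the function name `predictive_sketch` are assumptions made for this example and may differ from the actual output.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler


def predictive_sketch(X, estimator, scaler):
    """Hypothetical stand-in for predictive; column names are assumed."""
    # Assumed layout: column 0 = ID, column 1 = geometry, rest = variables.
    ids, geoms = X[:, 0], X[:, 1]
    # Reuse the fitted scaler: transform only, no refitting.
    variables = scaler.transform(X[:, 2:].astype(float))
    return pd.DataFrame({
        "ID": ids,
        "geometry": geoms,
        "y_pred": estimator.predict(variables),
        "y_prob_1": estimator.predict_proba(variables)[:, 1],
    })


# Fit a scaler and classifier on toy "historical" data.
rng = np.random.default_rng(2)
X_hist = rng.normal(size=(100, 2))
y_hist = (X_hist[:, 0] > 0).astype(int)
scaler = StandardScaler().fit(X_hist)
clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(scaler.transform(X_hist), y_hist)

# Toy projection data for 10 polygons.
n = 10
X_proj = np.empty((n, 4), dtype=object)
X_proj[:, 0] = np.arange(n)
X_proj[:, 1] = [f"POINT ({i} {i})" for i in range(n)]
X_proj[:, 2:] = rng.normal(size=(n, 2))

df = predictive_sketch(X_proj, clf, scaler)
```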