CoPro#

This is the documentation of CoPro, a machine-learning tool for conflict risk projections.

A software description paper was published in JOSS.

Main goal#

CoPro applies machine-learning techniques to project areas at risk of conflict. It was developed with a clear application in mind: unravelling the interplay of socio-economic development, climate change, and conflict occurrence. Nevertheless, we put a lot of emphasis on making it flexible. We hope that other, related questions about climate and conflict can be tackled as well, and that process understanding is deepened further.

Contents#

Installation#

From GitHub#

To install CoPro from GitHub, first clone the code. It is advised to create a separate environment first.

Note

We recommend using Anaconda or Miniconda to install CoPro, as these were used to develop and test the model. For installation instructions, see here.

$ git clone https://github.com/JannisHoch/copro.git
$ cd path/to/copro
$ conda env create -f environment.yml

It is now possible to activate this environment with

$ conda activate copro

To install CoPro in editable mode in this environment, run this command next in the CoPro-folder:

$ pip install -e .

From PyPI#

To install CoPro directly from PyPI, use the following command.

Note

Only the stable version 0.1.2 can be installed from PyPI. For the latest version, please install from GitHub.

$ pip install copro==0.1.2

Using Copro#

Model workflow#

copro trains a Random Forest classifier model to predict the probability of conflict occurrence. To that end, it needs conflict data (from UCDP) and a set of features describing potential conflict drivers. The temporal resolution of the model is annual, i.e., feature data should be annual as well, and conflicts are predicted for each year.

The model is trained on a training set, and evaluated on a test set. It is possible to use multiple model instances to account for variations due to the train/test split. This is done with the n_runs key in the [machine_learning] section of the Reference configuration file. Overall model performance is evaluated by averaging the results of all model instances.

The final conflict state of the reference period is used as the initial condition for the prediction. Each model instance starts from there and forward-predicts conflict occurrence probability for the prediction period.
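The workflow above can be sketched with plain scikit-learn. This is an illustrative sketch of the pattern only, not CoPro's actual code; the data, variable names, and classifier settings are made up:

```python
# Illustrative sketch of CoPro's workflow pattern (NOT its actual code):
# several model instances, each with its own train/test split, whose
# predicted conflict probabilities are averaged.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))                           # stand-in feature data
y = (X[:, 0] + rng.normal(size=200) > 0).astype(int)    # stand-in binary conflict labels

n_runs = 10                                             # cf. n_runs in [machine_learning]
probs = []
for run in range(n_runs):
    X_tr, _, y_tr, _ = train_test_split(X, y, train_size=0.7, random_state=run)
    clf = RandomForestClassifier(random_state=run).fit(X_tr, y_tr)
    probs.append(clf.predict_proba(X)[:, 1])            # P(conflict) per sample

mean_prob = np.mean(probs, axis=0)                      # average over model instances
```

Averaging over several differently-split instances smooths out the sensitivity of any single model to its particular train/test split.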

Via the command line#

The most convenient way to use copro is via the command line. This allows you to run the model with a single command, and to specify all model configurations in one file.

Command line script#

copro contains a command line script which is automatically installed when following the Installation instructions.

Information about the script can be obtained with the following command:

copro_runner --help

This should yield the following output:

Usage: copro_runner [OPTIONS] CFG

Main command line script to execute the model.

Args:     CFG (str): (relative) path to cfg-file

Options:
-c, --cores INTEGER    Number of jobs to run in parallel. Default is 0.
-v, --verbose INTEGER  Verbosity level of the output. Default is 0.
--help                 Show this message and exit.

Model configuration file#

The command line script takes one argument, which is a model configuration file. This file covers all necessary information to run the model for the reference period.

In case projections should be made, each projection is specified in a separate (reduced) configuration file. Multiple files can be specified. The name of each projection is taken from its key in the [PROJ_files] section.

Note

The file extension of the configuration files is not important - we use .cfg.

Reference configuration file#

The configuration file for the reference period needs to contain the following sections.

Note

All paths should be relative to input_dir.

[general]
input_dir=./example_data
output_dir=./OUT

[settings]
# start year
y_start=2000
# end year
y_end=2012

[PROJ_files]
# cfg-files
proj_nr_1=./example_settings_proj.cfg

[pre_calc]
# if nothing is specified, the XY array will be stored in output_dir
# if XY already pre-calculated, then provide path to npy-file
XY=

[extent]
shp=path/to/polygons.shp

[conflict]
# PRIO/UCDP dataset
conflict_file=path/to/ged201.csv
min_nr_casualties=1
# 1=state-based armed conflict; 2=non-state conflict; 3=one-sided violence
type_of_violence=1,2,3

[data]
# specify the path to the nc-file, whether the variable shall be log-transformed (True, False), and which statistical function should be applied
# these three settings need to be separated by a comma
# NOTE: variable name here needs to be identical with variable name in nc-file
# NOTE: only statistical functions supported by rasterstats are valid
precipitation=path/to/precipitation_data.nc,True,mean
temperature=path/to/temperature_data.nc,True,min
gdp=path/to/gdp_data.nc,False,max

[machine_learning]
# choose from: MinMaxScaler, StandardScaler, RobustScaler, QuantileTransformer
scaler=QuantileTransformer
train_fraction=0.7
# number of model instances
n_runs=10
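A file like this can be parsed with the standard library's RawConfigParser, which is the type CoPro's API expects. The snippet below is a minimal illustration of reading a few of the keys, not CoPro's own configuration reader:

```python
# Minimal sketch: parsing a CoPro-style cfg-file with the stdlib
# RawConfigParser (the object type passed around in the CoPro API).
from configparser import RawConfigParser

cfg_text = """
[general]
input_dir=./example_data
output_dir=./OUT

[settings]
y_start=2000
y_end=2012

[machine_learning]
scaler=QuantileTransformer
train_fraction=0.7
n_runs=10
"""

config = RawConfigParser()
config.read_string(cfg_text)

train_fraction = config.getfloat("machine_learning", "train_fraction")
n_runs = config.getint("machine_learning", "n_runs")
# the model loops over every year of the reference period
years = list(range(config.getint("settings", "y_start"),
                   config.getint("settings", "y_end") + 1))
```
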
Projection configuration file#

Per projection, a separate configuration file is needed. This file needs to contain the following sections.

[general]
input_dir=./example_data
verbose=True

[settings]
# end year of projections
y_proj=2015

[pre_calc]
# if nothing is specified, the XY array will be stored in output_dir
# if XY already pre-calculated, then provide (absolute) path to npy-file
XY=

[data]
# specify the path to the nc-file, whether the variable shall be log-transformed (True, False), and which statistical function should be applied
# these three settings need to be separated by a comma
# NOTE: variable name here needs to be identical with variable name in nc-file
# NOTE: only statistical functions supported by rasterstats are valid
precipitation=path/to/precipitation_data.nc,True,mean
temperature=path/to/temperature_data.nc,True,min
gdp=path/to/gdp_data.nc,False,max

Note

The projection data can be in the same file as the reference data or in separate files. Note that it’s important to ensure reference and projection data are consistent and biases are removed.

API#

For bespoke applications, it is possible to use copro as a library. Please find more information in the API documentation.

API documentation#

This section contains the API documentation of the copro package.

Contents#

Main Model#
The models class#

The models class contains all the steps required to prepare and run the conflict projections. It essentially wraps much of the functionality of the Machine Learning class.

Machine Learning#
The machine learning class#

This class does most of the heavy lifting for machine learning applications in the model.

class machine_learning.MachineLearning(config: RawConfigParser)[source]#

Bases: object

fit_predict(X_train: ndarray | DataFrame, y_train: ndarray, X_test: ndarray | DataFrame, out_dir: str, run_nr: int, tune_hyperparameters=False, n_jobs=2, verbose=0) Tuple[ndarray, ndarray][source]#

Fits the classifier based on the training-data and makes predictions. The fitted classifier is dumped to file with pickle so it can be reused during projections. Makes predictions with the test-data, including the probabilities of those predictions. If specified, the hyperparameters of the classifier are tuned with GridSearchCV.

Parameters:
  • X_train (np.ndarray, pd.DataFrame) – training-data of variable values.

  • y_train (np.ndarray) – training-data of conflict data.

  • X_test (np.ndarray, pd.DataFrame) – test-data of variable values.

  • out_dir (str) – path to output folder.

  • run_nr (int) – number of fit/predict repetition and created classifier.

  • tune_hyperparameters (bool, optional) – whether to tune hyperparameters. Defaults to False.

  • n_jobs (int, optional) – Number of cores to be used. Defaults to 2.

  • verbose (int, optional) – Verbosity level. Defaults to 0.

Returns:

arrays including the predictions made and their probabilities

Return type:

arrays
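The fit/predict/dump pattern described above can be sketched with plain scikit-learn and pickle. This is a hedged illustration: the file name clf_0.pkl and the data are assumptions for the example, not necessarily CoPro's naming scheme or inputs:

```python
# Sketch of the fit -> predict -> pickle pattern that fit_predict() describes.
# File name "clf_0.pkl" is illustrative, not necessarily CoPro's convention.
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(100, 3)), rng.integers(0, 2, 100)
X_test = rng.normal(size=(40, 3))

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
y_pred = clf.predict(X_test)                  # class predictions
y_prob = clf.predict_proba(X_test)[:, 1]      # probability of conflict

out_dir = tempfile.mkdtemp()                  # stands in for out_dir
with open(os.path.join(out_dir, "clf_0.pkl"), "wb") as f:
    pickle.dump(clf, f)                       # reused later for projections
```
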

split_scale_train_test_split(X: ndarray | DataFrame, Y: ndarray)[source]#

Splits and transforms the X-array (sample data) and Y-array (target data) into test-data and training-data. The fraction of data used for the split is specified in the configuration file. Additionally, the unique identifier and geometry of each data point in both test-data and training-data are retrieved in separate arrays.

Parameters:
  • X (array) – array containing the variable values plus unique identifier and geometry information.

  • Y (array) – array containing merely the binary conflict classifier data.

Returns:

arrays containing training-set and test-set for X-data and Y-data as well as IDs and geometry.

Return type:

arrays

Other functions#

Functions for machine learning applications in the model.

machine_learning.apply_gridsearchCV(estimator: RandomForestClassifier, X_train: ndarray, y_train: ndarray, n_jobs=2, verbose=0) RandomForestClassifier[source]#

Applies grid search to find the best hyperparameters for the RandomForestClassifier.

Parameters:
  • estimator (RandomForestClassifier) – Estimator to be used in the grid search.

  • X_train (np.ndarray) – Feature matrix.

  • y_train (np.ndarray) – Target vector.

  • n_jobs (int, optional) – Number of cores to be used. Defaults to 2.

  • verbose (int, optional) – Verbosity level. Defaults to 0.

Returns:

Best estimator of the grid search.

Return type:

RandomForestClassifier
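The grid-search step can be sketched with scikit-learn's GridSearchCV directly. The parameter grid below is illustrative only; CoPro's actual grid is not documented in this section:

```python
# Sketch of hyperparameter tuning with GridSearchCV; the param_grid here
# is an assumption for illustration, not CoPro's documented grid.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
X_train, y_train = rng.normal(size=(60, 3)), rng.integers(0, 2, 60)

grid = GridSearchCV(
    RandomForestClassifier(random_state=1),
    param_grid={"n_estimators": [10, 20], "max_depth": [2, None]},
    cv=3,
    n_jobs=2,     # cf. the n_jobs argument above
    verbose=0,
)
best = grid.fit(X_train, y_train).best_estimator_   # best estimator of the search
```
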

machine_learning.define_scaling(config: RawConfigParser) MinMaxScaler | StandardScaler | RobustScaler | QuantileTransformer[source]#

Defines scaling method based on model configurations.

Parameters:

config (ConfigParser-object) – object containing the parsed configuration-settings of the model.

Returns:

the specified scaling method instance.

Return type:

scaler
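What define_scaling likely does can be sketched as a name-to-class lookup over the four documented scaler options. The mapping below is an assumption based on those options, not CoPro's actual implementation:

```python
# Sketch of mapping the configured scaler name to a scikit-learn instance;
# an assumption based on the four options listed in [machine_learning].
from sklearn.preprocessing import (
    MinMaxScaler,
    QuantileTransformer,
    RobustScaler,
    StandardScaler,
)

_SCALERS = {
    "MinMaxScaler": MinMaxScaler,
    "StandardScaler": StandardScaler,
    "RobustScaler": RobustScaler,
    "QuantileTransformer": QuantileTransformer,
}

def pick_scaler(name: str):
    """Return a fresh scaler instance for a configured scaler name."""
    try:
        return _SCALERS[name]()
    except KeyError:
        raise ValueError(f"unsupported scaler: {name}")
```
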

machine_learning.load_clfs(config: RawConfigParser, out_dir: str) list[str][source]#

Loads the paths to all previously fitted classifiers to a list. Classifiers were saved to file in fit_predict(). With this list, the classifiers can be loaded again during projections.

Parameters:
  • config (ConfigParser-object) – object containing the parsed configuration-settings of the model.

  • out_dir (path) – path to output folder.

Returns:

list with file names of classifiers.

Return type:

list

machine_learning.predictive(X: ndarray, clf: RandomForestClassifier, scaler: MinMaxScaler | StandardScaler | RobustScaler | QuantileTransformer) DataFrame[source]#

Predictive model that uses the already fitted classifier to make annual projections for the projection period. Like the reference model, it reads data which are then scaled and used in conjunction with the classifier to project conflict risk.

Parameters:
  • X (np.ndarray) – array containing the variable values plus unique identifier and geometry information.

  • clf (RandomForestClassifier) – the fitted RandomForestClassifier.

  • scaler (scaler) – the fitted specified scaling method instance.

Returns:

containing model output on polygon-basis.

Return type:

pd.DataFrame
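The projection step can be sketched as: scale the new feature data with the scaler fitted on the reference data, then let the fitted classifier estimate the conflict probability per polygon. The column names and data below are illustrative, not CoPro's actual output schema:

```python
# Sketch of predictive(): transform projection data with the already-fitted
# scaler, then predict conflict probability per polygon. Column names are
# illustrative assumptions, not CoPro's actual output schema.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(2)
X_ref = rng.normal(size=(50, 3))
y_ref = rng.integers(0, 2, 50)

scaler = QuantileTransformer(n_quantiles=50).fit(X_ref)   # fitted on reference data
clf = RandomForestClassifier(random_state=2).fit(scaler.transform(X_ref), y_ref)

X_proj = rng.normal(size=(8, 3))                          # one projection year
df = pd.DataFrame({
    "poly_id": np.arange(8),                              # illustrative polygon IDs
    "chance_of_conflict": clf.predict_proba(scaler.transform(X_proj))[:, 1],
})
```

Note that the scaler must not be re-fitted on the projection data; reusing the reference-period fit keeps reference and projection features on the same scale.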

Authors#

  • Jannis M. Hoch (Fathom)

  • Sophie de Bruin (VU Amsterdam)

  • Niko Wanders (Utrecht University)
