CoPro¶
This is the documentation of CoPro, a machine-learning tool for conflict risk projections.
A software description paper was published in JOSS.
Main goal¶
With CoPro it is possible to apply machine-learning techniques to make projections of future areas at risk of conflict. CoPro was developed with a rather clear application in mind: unravelling the interplay of socio-economic development, climate change, and conflict occurrence. Nevertheless, we put a lot of emphasis on making it flexible. We hope that other, related questions about climate and conflict can be tackled as well, and that process understanding is deepened further.
Contents¶
Installation¶
From GitHub¶
To install CoPro from GitHub, first clone the repository. It is advised to create a separate environment beforehand.
Note
We recommend using Anaconda or Miniconda to install CoPro, as these were used to develop and test the model. For installation instructions, see here.
$ git clone https://github.com/JannisHoch/copro.git
$ cd path/to/copro
$ conda env create -f environment.yml
It is now possible to activate this environment with
$ conda activate copro
To install CoPro in editable mode in this environment, run this command next in the CoPro-folder:
$ pip install -e .
From PyPI¶
Todo
This is not yet supported. Feel invited to provide a pull request enabling installation via PyPI.
From conda¶
Todo
This is not yet supported. Feel invited to provide a pull request enabling installation via conda.
Model execution¶
To be able to run the model, the conda environment has to be activated first.
$ conda activate copro
Runner script¶
To run the model, a command line script is provided. The usage of the script is as follows:
Usage: copro_runner [OPTIONS] CFG
Main command line script to execute the model.
All settings are read from cfg-file.
One cfg-file is a required argument to train, test, and evaluate the model.
Multiple classifiers are trained based on different train-test data combinations.
Additional cfg-files for multiple projections can be provided as optional arguments, whereby each file corresponds to one projection to be made.
Per projection, each classifier is used to create separate projection outcomes per time step (year).
All outcomes are combined after each time step to obtain the common projection outcome.
Args: CFG (str): (relative) path to cfg-file
Options:
-plt, --make_plots add additional output plots
-v, --verbose command line switch to turn on verbose mode
Help information can be accessed with
$ copro_runner --help
All data and settings are retrieved from the configuration-file (cfg-file, see Settings), which needs to be provided as a command line argument. In the cfg-file, the various settings of the simulation are defined.
A typical command would thus look like this:
$ copro_runner settings.cfg
In case issues occur, updating setuptools may be required.
$ pip3 install --upgrade pip setuptools
Settings¶
The cfg-file¶
The main model settings need to be specified in a configuration file (cfg-file). This file looks like this:
[general]
input_dir=./path/to/input_data
output_dir=./path/to/store/output
# 1: all data. 2: leave-one-out model. 3: single variable model. 4: dubbelsteenmodel
# Note that only 1 supports sensitivity_analysis
model=1
verbose=True
[settings]
# start year
y_start=2000
# end year
y_end=2012
[PROJ_files]
# cfg-files
proj_nr_1=./path/to/projection/settings_proj.cfg
[pre_calc]
# if nothing is specified, the XY array will be stored in output_dir
# if XY already pre-calculated, then provide path to npy-file
XY=
[extent]
shp=folder/with/polygons.shp
[conflict]
# either specify path to file or state 'download' to download latest PRIO/UCDP dataset
conflict_file=folder/with/conflict_data.csv
min_nr_casualties=1
# 1=state-based armed conflict. 2=non-state conflict. 3=one-sided violence
type_of_violence=1,2,3
[climate]
shp=folder/with/climate_zones.shp
# define either one or more classes (use abbreviations!) or specify nothing for not filtering
zones=
code2class=folder/with/classification_codes.txt
[data]
# specify the path to the nc-file, whether the variable shall be log-transformed (True, False), and which statistical function should be applied
# these three settings need to be separated by a comma
# NOTE: variable name here needs to be identical with variable name in nc-file
# NOTE: only statistical functions supported by rasterstats are valid
precipitation=folder/with/precipitation_data.nc,False,mean
temperature=folder/with/temperature_data.nc,False,mean
population=folder/with/population_data.nc,True,sum
[machine_learning]
# choose from: MinMaxScaler, StandardScaler, RobustScaler, QuantileTransformer
scaler=QuantileTransformer
# choose from: NuSVC, KNeighborsClassifier, RFClassifier
model=RFClassifier
train_fraction=0.7
# number of repetitions
n_runs=10
Note
All paths for input_dir, output_dir, and in [PROJ_files] are relative to the location of the cfg-file.
Important
Empty spaces should be avoided in the cfg-file, except for those lines commented out with ‘#’.
The sections¶
Here, the different sections are explained briefly.
[general]¶
input_dir: (relative) path to the directory where the input data is stored. This requires all input data to be stored in one main folder; sub-folders are possible.
output_dir: (relative) path to the directory where output will be stored. If the folder does not exist yet, it will be created. CoPro will automatically create the sub-folders _REF for output of the reference run, and _PROJ for output from the (various) projection runs.
model: the type of simulation to be run can be specified here. Currently, four different models are available:
‘all data’: all variable values are used to fit the model and predict results.
‘leave one out’: values of each variable are left out once, resulting in n-1 runs with n being the number of variables. This model can be used to identify the relative influence of one variable within the variable set.
‘single variables’: each variable is used as sole predictor once. With this model, the explanatory power of each variable on its own can be assessed.
‘dubbelsteen’: the relation between variables and conflict is abolished by shuffling the binary conflict data randomly. By doing so, the lower boundary of the model can be estimated.
Note
All model types except ‘all_data’ will be deprecated in a future release.
verbose: if True, additional messages will be printed.
[settings]¶
y_start: the start year of the reference run.
y_end: the end year of the reference run.
The period between y_start and y_end will be used to train and test the model.
y_proj: the end year of the projection run.
The period between y_end and y_proj will be used to make annual projections.
[PROJ_files]¶
A key section. Here, one (slightly different) cfg-file per projection needs to be provided. This way, multiple projection runs can be defined from within the “main” cfg-file.
The convention is that the projection name is defined as the entry's name, and the path to the corresponding cfg-file as its value. For example, the projections “SSP1” and “SSP2” would be defined as
SSP1=/path/to/ssp1.cfg
SSP2=/path/to/ssp2.cfg
A cfg-file for a projection is shorter than the main cfg-file used as command line argument and looks like this:
[general]
input_dir=./path/to/input_data
verbose=True
[settings]
# year for which projection is to be made
y_proj=2050
[data]
# specify the path to the nc-file, whether the variable shall be log-transformed (True, False), and which statistical function should be applied
# these three settings need to be separated by a comma
# NOTE: variable name here needs to be identical with variable name in nc-file
# NOTE: only statistical functions supported by rasterstats are valid
precipitation=folder/with/precipitation_data.nc,False,mean
temperature=folder/with/temperature_data.nc,False,mean
population=folder/with/population_data.nc,True,sum
[pre_calc]¶
XY: if the XY-data was already pre-calculated in a previous run and stored as npy-file, it can be specified here and will be loaded from file to save time. If nothing is specified, the model will by default save the XY-data to the output directory as XY.npy.
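If the stored file needs to be inspected outside of CoPro, it can be read back with NumPy. A minimal sketch with an illustrative path; allow_pickle is presumably needed since the array also holds non-numeric columns (ID, geometry):
import numpy as np

# load a previously stored XY-array (path is illustrative)
XY = np.load('./path/to/store/output/_REF/XY.npy', allow_pickle=True)
print(XY.shape)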
[extent]¶
shp: the provided shape-file defines the boundaries for which the model is applied. At the same time, it also defines at which aggregation level the output is determined.
Note
The shp-file can contain multiple polygons covering the study area. Their size defines the output aggregation level. It is also possible to provide only one polygon, but model behaviour is not well tested for this case.
[conflict]¶
conflict_file: path to the csv-file containing the conflict dataset. It is also possible to specify download; then the latest conflict dataset (currently version 20.1) is downloaded and used as input.
min_nr_casualties: minimum number of reported casualties required for a conflict to be considered in the model.
type_of_violence: the types of violence to be considered can be specified here. Multiple values can be specified. Types of violence are:
‘state-based armed conflict’: a contested incompatibility that concerns government and/or territory where the use of armed force between two parties, of which at least one is the government of a state, results in at least 25 battle-related deaths in one calendar year.
‘non-state conflict’: the use of armed force between two organized armed groups, neither of which is the government of a state, which results in at least 25 battle-related deaths in a year.
‘one-sided violence’: the deliberate use of armed force by the government of a state or by a formally organized group against civilians which results in at least 25 deaths in a year.
Important
CoPro currently only works with UCDP data.
[climate]¶
shp: the provided shape-file defines the areas of the different Köppen-Geiger climate zones.
zones: abbreviations of the climate zones to be considered in the model. Can either be ‘None’ or one or multiple abbreviations.
code2class: converts the abbreviations to the class-numbers used in the shp-file.
Warning
The code2class-file should not be altered!
[data]¶
In this section, all variables to be used in the model need to be provided. The paths are relative to input_dir. Only netCDF-files with annual data are supported. The main convention is that the name of the entry agrees with the variable name in the nc-file.
For example, if the variable precipitation is provided in a nc-file, this should be noted as follows:
[data]
precipitation=folder/with/precipitation_data.nc
CoPro furthermore requires information on whether the values sampled from a file are to be log-transformed. Besides, it is possible to define a statistical function that is applied when sampling from file per polygon of the shp-file.
CoPro makes use of the zonal_stats function available within rasterstats. To determine, for instance, the mean value of precipitation per polygon without log-transformation, the following notation is required:
[data]
precipitation=folder/with/precipitation_data.nc,False,mean
[machine_learning]¶
scaler: the scaling algorithm used to scale the variable values to comparable scales. Currently supported are MinMaxScaler, StandardScaler, RobustScaler, and QuantileTransformer.
model: the machine learning algorithm to be applied. Currently supported are NuSVC, KNeighborsClassifier, and RFClassifier.
train_fraction: the fraction of the XY-data to be used to train the model. The remaining data (1 - train_fraction) will be used to predict and evaluate the model.
n_runs: the number of classifiers to use, i.e. the number of repetitions with different train-test data combinations.
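For orientation, a train_fraction of 0.7 corresponds to the following split in plain scikit-learn; this is a minimal sketch with random dummy data, not CoPro's internal code:
from sklearn.model_selection import train_test_split
import numpy as np

X = np.random.rand(100, 3)        # hypothetical sample matrix
Y = np.random.randint(0, 2, 100)  # hypothetical binary target data

# train_fraction=0.7: 70 % of the XY-data for training, 30 % for testing
X_train, X_test, y_train, y_test = train_test_split(X, Y, train_size=0.7)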
Workflow¶
This page provides a short example workflow in Jupyter Notebooks. It is designed such that the main steps, features, assumptions, and outcomes of CoPro become clear.
As model input data, the data set downloadable from Zenodo was used.
Even though the model can be fully executed from within notebooks, the main (and more convenient) way of model execution is the command line script (see Runner script).
An interactive version of the content shown here can be accessed via Binder.
Model initialization and selection procedure¶
In this notebook, we will show how CoPro is initialized and the selection procedure of spatial aggregation units and conflicts works.
Model initialization¶
Start with loading the required packages.
[1]:
from copro import utils, selection, plots, data
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import os, sys
import warnings
warnings.simplefilter("ignore")
For better reproducibility, the version numbers of all key packages used to run this notebook are provided.
[2]:
utils.show_versions()
Python version: 3.7.8 | packaged by conda-forge | (default, Jul 31 2020, 01:53:57) [MSC v.1916 64 bit (AMD64)]
copro version: 0.0.8
geopandas version: 0.9.0
xarray version: 0.15.1
rasterio version: 1.1.0
pandas version: 1.0.3
numpy version: 1.18.1
scikit-learn version: 0.23.2
matplotlib version: 3.2.1
seaborn version: 0.11.0
rasterstats version: 0.14.0
The configurations-file (cfg-file)¶
In the configurations-file (cfg-file), all the settings for the analysis are defined. The cfg-file contains, amongst others, all paths to input files, settings for the machine-learning model, and the various selection criteria for spatial aggregation units and conflicts. Note that the cfg-file can be stored anywhere, not necessarily in the same directory where the model data is stored (as in this example case). Make sure that the paths in the cfg-file are updated if you use relative paths and change the folder location of the cfg-file!
[3]:
settings_file = 'example_settings.cfg'
Based on this cfg-file, the set-up of the run can be initialized. Here, the cfg-file is parsed (i.e. read) and all settings and paths become ‘known’ to the model. Also, the output folder is created (if it does not exist yet) and the cfg-file is copied to the output folder for improved reusability.
If you set verbose=True, then additional statements are printed during model execution. This can help to track the behaviour of the model.
[4]:
main_dict, root_dir = utils.initiate_setup(settings_file, verbose=False)
#### CoPro version 0.0.8 ####
#### For information about the model, please visit https://copro.readthedocs.io/ ####
#### Copyright (2020-2021): Jannis M. Hoch, Sophie de Bruin, Niko Wanders ####
#### Contact via: j.m.hoch@uu.nl ####
#### The model can be used and shared under the MIT license ####
INFO: reading model properties from example_settings.cfg
INFO: verbose mode on: False
INFO: saving output to main folder C:\Users\hoch0001\Documents\_code\copro\example\./OUT
One of the outputs is a dictionary (here main_dict) containing the parsed configurations (they are stored in computer memory, hence the slightly odd specification) as well as the output directories of both the reference run and the various projection runs specified in the cfg-file. For the reference run, only the respective entries are required.
[5]:
config_REF = main_dict['_REF'][0]
print('the configuration of the reference run is {}'.format(config_REF))
out_dir_REF = main_dict['_REF'][1]
print('the output directory of the reference run is {}'.format(out_dir_REF))
the configuration of the reference run is <configparser.RawConfigParser object at 0x000001FBC8FA9D08>
the output directory of the reference run is C:\Users\hoch0001\Documents\_code\copro\example\./OUT\_REF
Filter conflicts and spatial aggregation units¶
Background¶
As conflict database, we use the UCDP Georeferenced Event Dataset. Not all conflicts in the database may need to be used for a simulation. This can be, for example, because they belong to a type of conflict we are not interested in, or because they simply fall outside our area-of-interest. Therefore, it is possible to filter the conflicts on various properties:
min_nr_casualties: minimum number of casualties of a reported conflict;
type_of_violence: 1=state-based armed conflict; 2=non-state conflict; 3=one-sided violence.
To unravel the interplay between climate and conflict, it may be beneficial to run the model only for conflicts in particular climate zones. It is hence also possible to select only those conflicts that fall within a climate zone following the Köppen-Geiger classification.
Selection procedure¶
In the selection procedure, we first load the conflict database and convert it to a georeferenced dataframe (geo-dataframe). To define the study area, a shape-file containing polygons (in this case water provinces) is loaded and converted to a geo-dataframe as well.
We then apply the selection criteria (see above) as specified in the cfg-file, and keep the remaining data points and associated polygons.
[7]:
conflict_gdf, selected_polygons_gdf, global_df = selection.select(config_REF, out_dir_REF, root_dir)
INFO: reading csv file to dataframe C:\Users\hoch0001\Documents\_code\copro\example\./example_data\UCDP/ged201.csv
INFO: filtering based on conflict properties.
With the chosen settings, the following picture of polygons and conflict data points is obtained.
[8]:
fig, ax = plt.subplots(1, 1, figsize=(20, 10))
conflict_gdf.plot(ax=ax, c='r', column='best', cmap='magma',
                  vmin=int(config_REF.get('conflict', 'min_nr_casualties')), vmax=conflict_gdf.best.mean(),
                  legend=True,
                  legend_kwds={'label': "# casualties", 'orientation': "vertical", 'pad': 0.05})
selected_polygons_gdf.boundary.plot(ax=ax);
[figure: map of selected polygons and conflict locations]
It is nicely visible that for this example run not all provinces are considered; the focus is on the specified climate zones only.
Temporary files¶
To be able to also run the following notebooks, some of the data has to be written to file temporarily. This is not part of the CoPro workflow but merely needed to split up the workflow in different notebooks outlining the main steps to go through when using CoPro.
[9]:
if not os.path.isdir('temp_files'):
    os.makedirs('temp_files')
[10]:
conflict_gdf.to_file(os.path.join('temp_files', 'conflicts.shp'))
selected_polygons_gdf.to_file(os.path.join('temp_files', 'polygons.shp'))
[11]:
global_df['ID'] = global_df.index.values
global_arr = global_df.to_numpy()
np.save(os.path.join('temp_files', 'global_df'), global_arr)
Obtaining samples matrix and target values¶
In this notebook, we will show how CoPro reads the input data and derives the samples matrix and target values needed to establish a machine-learning model.
Preparations¶
Start with loading the required packages.
[1]:
from copro import utils, pipeline, data
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import geopandas as gpd
import os, sys
import warnings
from shutil import copyfile
warnings.simplefilter("ignore")
For better reproducibility, the version numbers of all key packages used to run this notebook are provided.
[2]:
utils.show_versions()
Python version: 3.7.8 | packaged by conda-forge | (default, Jul 31 2020, 01:53:57) [MSC v.1916 64 bit (AMD64)]
copro version: 0.0.8
geopandas version: 0.9.0
xarray version: 0.15.1
rasterio version: 1.1.0
pandas version: 1.0.3
numpy version: 1.18.1
scikit-learn version: 0.23.2
matplotlib version: 3.2.1
seaborn version: 0.11.0
rasterstats version: 0.14.0
To be able to run this notebook as well, some of the previously saved temporary files need to be loaded.
[3]:
conflict_gdf = gpd.read_file(os.path.join('temp_files', 'conflicts.shp'))
selected_polygons_gdf = gpd.read_file(os.path.join('temp_files', 'polygons.shp'))
The configurations-file (cfg-file)¶
To be able to continue the simulation with the same settings as in the previous notebook, the cfg-file has to be read again and the model needs to be initialised subsequently. This is not needed if CoPro is run from command line. Please see the previous notebook for additional information.
[4]:
settings_file = 'example_settings.cfg'
[5]:
main_dict, root_dir = utils.initiate_setup(settings_file, verbose=False)
#### CoPro version 0.0.8 ####
#### For information about the model, please visit https://copro.readthedocs.io/ ####
#### Copyright (2020-2021): Jannis M. Hoch, Sophie de Bruin, Niko Wanders ####
#### Contact via: j.m.hoch@uu.nl ####
#### The model can be used and shared under the MIT license ####
INFO: reading model properties from example_settings.cfg
INFO: verbose mode on: False
INFO: saving output to main folder C:\Users\hoch0001\Documents\_code\copro\example\./OUT
[6]:
config_REF = main_dict['_REF'][0]
out_dir_REF = main_dict['_REF'][1]
Reading the files and storing the data¶
Background¶
This is an essential part of CoPro. For a machine-learning model to work, it requires a samples matrix (X), representing here the socio-economic and hydro-climatic ‘drivers’ of conflict, and target values (Y) representing the (observed) conflicts themselves. By fitting a machine-learning model, a relation between X and Y is established, which in turn can be used to make projections.
Additional information can be found on scikit-learn.
Since CoPro simulates conflict risk spatially explicitly for each polygon (here water provinces), it is furthermore necessary to associate each polygon with the corresponding data points in X and Y. We therefore also keep a polygon-ID and its geometry, and track them throughout the modelling chain.
Implementation¶
CoPro goes through all model years as specified in the cfg-file. Per year, CoPro loops over all polygons remaining after the selection procedure (see previous notebook) and does the following to obtain the X-data:
Assign ID to polygon and retrieve geometry information;
Calculate a statistical value per polygon from each input file specified in the cfg-file in section ‘data’. Which statistical value is to be computed needs to be specified in the cfg-file. There, it is furthermore possible to specify whether values are to be log-transformed.
Note that CoPro applies a 1-year time-lag by default. That means that for a given year J, the data from year J-1 is read. This avoids issues with reverse causality, i.e. we assume that a driver results in conflict (or not) with a one-year delay and not immediately in the same year.
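The following minimal sketch illustrates the default 1-year time-lag; the loop bounds follow the example cfg-file (y_start=2000, y_end=2012), and the sampling step is only a placeholder comment:
# for each simulation year J, predictor values are sampled from year J-1;
# the first year is skipped to start up the model
y_start, y_end = 2000, 2012
for year in range(y_start + 1, y_end + 1):
    data_year = year - 1
    # sample the nc-files for data_year, sample conflict occurrence for year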
And to obtain the Y-data:
Assign a Boolean value indicating whether a conflict took place in a polygon or not - the number of casualties or conflicts per year is not relevant in this case.
All information is stored in an X-array and a Y-array. The X-array has 2+n columns, where n denotes the number of samples provided. The Y-array has only 1 column, consisting of zeros and ones. In both arrays, the number of rows equals the number of years times the number of polygons. In case a row contains a missing value (e.g. because one input dataset does not cover this polygon), the entire row is removed from the XY-array.
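A minimal sketch of the row-removal step, assuming a purely numeric XY-array for simplicity (the actual array also contains ID and geometry columns):
import numpy as np

XY = np.array([
    [12.3, 301.2, 1.0],    # complete row: kept
    [np.nan, 287.4, 0.0],  # missing sample value: entire row removed
])
XY_clean = XY[~np.isnan(XY).any(axis=1)]
print(XY_clean.shape)  # (1, 3)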
Note that the sample values can still vary widely depending on their units, measurement, etc. In the next notebook, the X-data will be scaled to make the different values in the samples matrix comparable.
[7]:
X, Y = pipeline.create_XY(config_REF, out_dir_REF, root_dir, selected_polygons_gdf, conflict_gdf)
INFO: reading data for period from 2000 to 2012
INFO: skipping first year 2000 to start up model
INFO: entering year 2001
INFO: entering year 2002
INFO: entering year 2003
INFO: entering year 2004
INFO: entering year 2005
INFO: entering year 2006
INFO: entering year 2007
INFO: entering year 2008
INFO: entering year 2009
INFO: entering year 2010
INFO: entering year 2011
INFO: entering year 2012
Saving data to file¶
Depending on sample and file size, obtaining the X-array and Y-array can be time-consuming. Therefore, CoPro automatically stores a combined XY-array as npy-file to the output folder, if not specified otherwise in the cfg-file. With this file, future runs using the same data but possibly different machine-learning settings can be executed in less time.
Let's check if this file exists.
[8]:
os.path.isfile(os.path.join(out_dir_REF, 'XY.npy'))
[8]:
True
Temporary files¶
By default, a binary map of conflict per polygon for the last year of the simulation period is stored to the output directory. Since the output directory is created from scratch at each model initialisation, we need to temporarily store this map in another folder to be used in subsequent notebooks.
[9]:
%%capture
for root, dirs, files in os.walk(os.path.join(out_dir_REF, 'files')):
    for file in files:
        fname = file
        print(fname)
        copyfile(os.path.join(out_dir_REF, 'files', str(fname)),
                 os.path.join('temp_files', str(fname)))
Initializing and executing the machine-learning model¶
In this notebook, we will show how CoPro creates, trains, and tests a machine-learning model based on the settings and data shown in the previous notebooks.
Preparations¶
Start with loading the required packages.
[1]:
from copro import utils, pipeline, evaluation, plots, machine_learning
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import geopandas as gpd
import seaborn as sbs
import os, sys
from sklearn import metrics
from shutil import copyfile
import warnings
warnings.simplefilter("ignore")
For better reproducibility, the version numbers of all key packages used to run this notebook are provided.
[2]:
utils.show_versions()
Python version: 3.7.8 | packaged by conda-forge | (default, Jul 31 2020, 01:53:57) [MSC v.1916 64 bit (AMD64)]
copro version: 0.0.8
geopandas version: 0.9.0
xarray version: 0.15.1
rasterio version: 1.1.0
pandas version: 1.0.3
numpy version: 1.18.1
scikit-learn version: 0.23.2
matplotlib version: 3.2.1
seaborn version: 0.11.0
rasterstats version: 0.14.0
To be able to run this notebook as well, some of the previously saved data needs to be loaded.
[3]:
conflict_gdf = gpd.read_file(os.path.join('temp_files', 'conflicts.shp'))
selected_polygons_gdf = gpd.read_file(os.path.join('temp_files', 'polygons.shp'))
[4]:
global_arr = np.load(os.path.join('temp_files', 'global_df.npy'), allow_pickle=True)
global_df = pd.DataFrame(data=global_arr, columns=['geometry', 'ID'])
global_df.set_index(global_df.ID, inplace=True)
global_df.drop(['ID'] , axis=1, inplace=True)
The configurations-file (cfg-file)¶
To be able to continue the simulation with the same settings as in the previous notebook, the cfg-file has to be read again and the model needs to be initialised subsequently. This is not needed if CoPro is run from command line. Please see the first notebook for additional information.
[5]:
settings_file = 'example_settings.cfg'
[6]:
main_dict, root_dir = utils.initiate_setup(settings_file, verbose=False)
#### CoPro version 0.0.8 ####
#### For information about the model, please visit https://copro.readthedocs.io/ ####
#### Copyright (2020-2021): Jannis M. Hoch, Sophie de Bruin, Niko Wanders ####
#### Contact via: j.m.hoch@uu.nl ####
#### The model can be used and shared under the MIT license ####
INFO: reading model properties from example_settings.cfg
INFO: verbose mode on: False
INFO: saving output to main folder C:\Users\hoch0001\Documents\_code\copro\example\./OUT
[7]:
config_REF = main_dict['_REF'][0]
out_dir_REF = main_dict['_REF'][1]
Loading the XY-data¶
To avoid reading the XY-data again (see the previous notebook), we can load the data directly from an XY.npy file which is automatically written to the output path. We saw that this file was created, but since no XY-data is specified in the config-file initially, we have to set the path manually. Note that this detour is only necessary due to the splitting of the workflow into different notebooks!
[8]:
config_REF.set('pre_calc', 'XY', str(os.path.join(out_dir_REF, 'XY.npy')))
To double-check, see if the manually specified file actually exists.
[9]:
os.path.isfile(config_REF.get('pre_calc', 'XY'))
[9]:
True
[10]:
X, Y = pipeline.create_XY(config_REF, out_dir_REF, root_dir, selected_polygons_gdf, conflict_gdf)
INFO: loading XY data from file C:\Users\hoch0001\Documents\_code\copro\example\./OUT\_REF\XY.npy
Scaler and classifier¶
Background¶
In principle, one can put all kinds of data into the samples matrix X, leading to a wide spread of orders of magnitude, units, distributions, etc. It is therefore necessary to scale (or transform) the data in the X-array such that sensible comparisons and computations are possible. To that end, a scaling technique is applied.
Once there is a scaled X-array, a machine-learning model can be fitted with it together with the target values Y.
Implementation¶
CoPro supports four different scaling techniques. For more info, see the scikit-learn documentation.
MinMaxScaler;
StandardScaler;
RobustScaler;
QuantileTransformer.
From the wide range of machine-learning models, CoPro employs three different ones from the category of supervised learning.
NuSVC;
KNeighborsClassifier;
RFClassifier.
Note that CoPro largely uses the default parameterization of the scalers and models. An extensive GridSearchCV did not show any significant improvements when changing the parameters. There is currently no way to provide parameters other than those currently set.
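For reference, instantiating the scaler and classifier with the parameterization used in this example run (as printed by pipeline.prepare_ML() in the next cell) would look roughly like this in plain scikit-learn:
from sklearn.preprocessing import QuantileTransformer
from sklearn.ensemble import RandomForestClassifier

# parameterization as reported for this example run
scaler = QuantileTransformer(random_state=42)
clf = RandomForestClassifier(n_estimators=1000, class_weight={1: 100}, random_state=42)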
Let’s see which scaling technique and which supervised classifier are specified for the example run.
[11]:
scaler, clf = pipeline.prepare_ML(config_REF)
print('As scaling technique, it is used: {}'.format(scaler))
print('As supervised classifing technique, it is used: {}'.format(clf))
As scaling technique, it is used: QuantileTransformer(random_state=42)
As supervised classifing technique, it is used: RandomForestClassifier(class_weight={1: 100}, n_estimators=1000,
random_state=42)
Output initialization¶
Since the model is run multiple times to test various random train-test data combinations, we need to initialize a few lists first to append the output per run.
[12]:
out_X_df = evaluation.init_out_df()
out_y_df = evaluation.init_out_df()
[13]:
out_dict = evaluation.init_out_dict()
[14]:
tprs, aucs, mean_fpr = evaluation.init_out_ROC_curve()
ML-model execution¶
The crux of the matter! This is where the magic happens, and not only once. To make sure that any coincidental results are ruled out, we run the model multiple times. Each time, different parts of the XY-array are used for training and testing. By using a sufficient number of runs and averaging the overall results, we should get a good picture of what the model is capable of. The number of runs as well as the split between training and testing data needs to be specified in the cfg-file.
Per repetition, the model is evaluated. The main evaluation metrics are the mean ROC-score and ROC-curve, plotted at the end of all runs. Additional evaluation metrics are computed as described below.
[15]:
# - create plot instance
fig, (ax1) = plt.subplots(1, 1, figsize=(20, 10))

# - go through all n model executions
for n in range(config_REF.getint('machine_learning', 'n_runs')):

    print('INFO: run {} of {}'.format(n+1, config_REF.getint('machine_learning', 'n_runs')))

    # - run machine learning model and return outputs
    X_df, y_df, eval_dict = pipeline.run_reference(X, Y, config_REF, scaler, clf, out_dir_REF, run_nr=n+1)

    # - select sub-dataset with only datapoints with observed conflicts
    X1_df, y1_df = utils.get_conflict_datapoints_only(X_df, y_df)

    # - append per model execution
    out_X_df = evaluation.fill_out_df(out_X_df, X_df)
    out_y_df = evaluation.fill_out_df(out_y_df, y_df)
    out_dict = evaluation.fill_out_dict(out_dict, eval_dict)

    # - plot ROC curve per model execution
    tprs, aucs = plots.plot_ROC_curve_n_times(ax1, clf, X_df.to_numpy(), y_df.y_test.to_list(),
                                              tprs, aucs, mean_fpr)

# - plot mean ROC curve
plots.plot_ROC_curve_n_mean(ax1, tprs, aucs, mean_fpr)

plt.savefig('../docs/_static/roc_curve.png', dpi=300, bbox_inches='tight')
INFO: run 1 of 10
No handles with labels found to put in legend.
INFO: run 2 of 10
No handles with labels found to put in legend.
INFO: run 3 of 10
No handles with labels found to put in legend.
INFO: run 4 of 10
No handles with labels found to put in legend.
INFO: run 5 of 10
No handles with labels found to put in legend.
INFO: run 6 of 10
No handles with labels found to put in legend.
INFO: run 7 of 10
No handles with labels found to put in legend.
INFO: run 8 of 10
No handles with labels found to put in legend.
INFO: run 9 of 10
No handles with labels found to put in legend.
INFO: run 10 of 10
No handles with labels found to put in legend.
[figure: ROC curves per model execution and mean ROC curve]
Model evaluation¶
For all data points¶
During the model runs, the computed model evaluation scores per model execution were stored to a dictionary. Currently, the evaluation scores used are:
Accuracy: the fraction of correct predictions;
Precision: the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative;
Recall: the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples;
F1 score: the F1 score can be interpreted as a weighted average of precision and recall, reaching its best value at 1 and its worst at 0;
Cohen-Kappa score: used to measure inter-rater reliability. It is generally considered a more robust measure than a simple percent-agreement calculation, as κ takes into account the possibility of the agreement occurring by chance;
Brier score: the lower the Brier score is for a set of predictions, the better the predictions are calibrated (hence the naming with “loss”). Note that the Brier loss score is relatively sensitive to imbalanced datasets;
ROC score: a value of 0.5 suggests no skill, e.g. a curve along the diagonal, whereas a value of 1.0 suggests perfect skill, with all points along the left y-axis and top x-axis toward the top-left corner. A value of 0.0 suggests perfectly incorrect predictions. Note that the ROC score is relatively insensitive to imbalanced datasets;
AP score: the average_precision_score function computes the average precision (AP) from prediction scores. The value is between 0 and 1; higher is better.
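All of these scores are available in scikit-learn. The following sketch shows how they could be computed for a single run; it mirrors the metric names used below but is not necessarily CoPro's internal implementation:
from sklearn import metrics

# y_test: observed conflict (0/1); y_pred: predicted class; y_prob: probability of class 1
def evaluate(y_test, y_pred, y_prob):
    return {
        'Accuracy': metrics.accuracy_score(y_test, y_pred),
        'Precision': metrics.precision_score(y_test, y_pred),
        'Recall': metrics.recall_score(y_test, y_pred),
        'F1 score': metrics.f1_score(y_test, y_pred),
        'Cohen-Kappa score': metrics.cohen_kappa_score(y_test, y_pred),
        'Brier loss score': metrics.brier_score_loss(y_test, y_prob),
        'ROC AUC score': metrics.roc_auc_score(y_test, y_prob),
        'AP score': metrics.average_precision_score(y_test, y_prob),
    }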
Let’s check the mean scores over all runs:
[16]:
for key in out_dict:
    print('average {0} of run with {1} repetitions is {2:0.3f}'.format(key, config_REF.getint('machine_learning', 'n_runs'), np.mean(out_dict[key])))
average Accuracy of run with 10 repetitions is 0.883
average Precision of run with 10 repetitions is 0.717
average Recall of run with 10 repetitions is 0.496
average F1 score of run with 10 repetitions is 0.585
average Cohen-Kappa score of run with 10 repetitions is 0.520
average Brier loss score of run with 10 repetitions is 0.090
average ROC AUC score of run with 10 repetitions is 0.863
average AP score of run with 10 repetitions is 0.636
So how are, e.g. accuracy, precision, and recall distributed?
[17]:
plots.metrics_distribution(out_dict, metrics=['Accuracy', 'Precision', 'Recall'], figsize=(20, 5));
[figure: distributions of accuracy, precision, and recall]
Based on all data points, the confusion matrix can be plotted. This is a relatively straightforward way to visualize how well the observations are predicted by the model. Ideally, all True label and Predicted label pairs have the highest values.
[18]:
fig, ax = plt.subplots(1, 1, figsize=(8, 8))
metrics.plot_confusion_matrix(clf, out_X_df.to_numpy(), out_y_df.y_test.to_list(), ax=ax);
[figure: confusion matrix]
In out_y_df, all predictions are stored. This includes the actual value y_test (i.e. whether a conflict was observed or not) and the predicted outcome y_pred, together with the probabilities of this outcome. Additionally, CoPro adds a column with a Boolean indicator whether the prediction was correct (y_test=y_pred) or not.
[19]:
out_y_df.head()
[19]:
|   | ID   | geometry                                          | y_test | y_pred | y_prob_0 | y_prob_1 | correct_pred |
|---|------|---------------------------------------------------|--------|--------|----------|----------|--------------|
| 0 | 1009 | POLYGON ((29 6.696147705436432, 29.05159624587... | 0      | 0      | 0.989    | 0.011    | 1            |
| 1 | 1525 | POLYGON ((0.5770535604073238 6, 0.578418291470... | 0      | 0      | 0.998    | 0.002    | 1            |
| 2 | 1307 | (POLYGON ((-14.94260162796269 16.6312412609754... | 0      | 0      | 0.976    | 0.024    | 1            |
| 3 | 118  | POLYGON ((25.29046121561265 -18.03749999982506... | 0      | 0      | 0.914    | 0.086    | 1            |
| 4 | 45   | POLYGON ((9.821052248962189 28.22336190952456,... | 0      | 0      | 0.966    | 0.034    | 1            |
Per unique polygon¶
Thus far, we merely looked at numerical scores for all predictions. This of course tells us a lot about the quality of the machine-learning model, but not so much about what this looks like spatially. We therefore combine the observations and predictions made with the associated polygons based on a ‘global’ dataframe functioning as a look-up table. By this means, each model prediction (i.e. each row in out_y_df) can be connected to its polygon using a unique polygon-ID.
[20]:
df_hit, gdf_hit = evaluation.polygon_model_accuracy(out_y_df, global_df)
First, let’s have a look at how often each polygon occurs in all test samples, i.e. those obtained by appending the test samples per model execution. Besides, the overall relative distribution is visualized.
[21]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 10))
gdf_hit.plot(ax=ax1, column='nr_predictions', legend=True, cmap='Blues')
selected_polygons_gdf.boundary.plot(ax=ax1, color='0.5')
ax1.set_title('number of predictions made per polygon')
sbs.distplot(df_hit.nr_predictions.values, ax=ax2)
ax2.set_title('distribution of predictions');
[figure: number of predictions made per polygon and their distribution]
By repeating the model n times, the aim is to represent all polygons in the resulting test sample. The fraction is computed below.
Note that it should be close to 100 % but may be slightly less. This can happen if input variables have no data for some polygons, leading to the removal of those polygons from the analysis, or because some polygons and input data may not overlap.
[22]:
print('{0:0.2f} % of all active polygons are considered in test sample'.format(len(gdf_hit)/len(selected_polygons_gdf)*100))
100.00 % of all active polygons are considered in test sample
By aggregating results per polygon, we can now assess model output spatially. Three main aspects are presented here:
The total number of conflict events per water province;
The chance of a correct prediction, defined as the ratio of the number of correct predictions to the overall number of predictions made;
The mean conflict probability, defined as the mean value of all probabilities of conflict occurring (y_prob_1) in a polygon.
[24]:
fig, axes = plt.subplots(1, 3, figsize=(20, 20), sharex=True, sharey=True)

gdf_hit.plot(ax=axes[0], column='nr_observed_conflicts', legend=True, cmap='Reds',
             legend_kwds={'label': "nr_observed_conflicts", 'orientation': "horizontal", 'fraction': 0.045, 'pad': 0.05})
selected_polygons_gdf.boundary.plot(ax=axes[0], color='0.5')

gdf_hit.plot(ax=axes[1], column='fraction_correct_predictions', legend=True,
             legend_kwds={'label': "fraction_correct_predictions", 'orientation': "horizontal", 'fraction': 0.045, 'pad': 0.05})
selected_polygons_gdf.boundary.plot(ax=axes[1], color='0.5')

gdf_hit.plot(ax=axes[2], column='probability_of_conflict', legend=True, cmap='Blues', vmin=0, vmax=1,
             legend_kwds={'label': "mean conflict probability", 'orientation': "horizontal", 'fraction': 0.045, 'pad': 0.05})
selected_polygons_gdf.boundary.plot(ax=axes[2], color='0.5')

plt.tight_layout();
[figure: maps of observed conflicts, fraction of correct predictions, and mean conflict probability per polygon]
Preparing for projections¶
In this notebook, we have trained and tested our model with various combinations of data. Subsequently, the average performance of the model was evaluated with a range of metrics.
If we want to re-use our model and make projections, it is necessary to save the model (that is, the n fitted classifiers). They can then be loaded, and one or more projections can be made with variable values other than those used for this reference run.
To that end, the classifier is fitted again, but this time with all data, i.e. without a split-sample test. That way, the classifier fit is most robust.
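A minimal sketch of what this amounts to, assuming the scaler and classifier from above; file naming and location are handled internally by CoPro and are illustrative here:
import pickle

# refit the classifier on the full, scaled dataset (no split-sample test);
# the first two columns of X (ID, geometry) are skipped
clf.fit(scaler.fit_transform(X[:, 2:]), Y)

# pickle the fitted classifier so it can be re-loaded for projection runs
with open('clf.pkl', 'wb') as f:
    pickle.dump(clf, f)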
[25]:
%%capture
for root, dirs, files in os.walk(os.path.join(out_dir_REF, 'clfs')):
    for file in files:
        fname = file
        print(fname)
        copyfile(os.path.join(out_dir_REF, 'clfs', str(fname)),
                 os.path.join('temp_files', str(fname)))
Projecting conflict risk¶
In this notebook, we will show how CoPro uses a number of previously fitted classifiers and projects conflict risk forward in time. Eventually, these forward predictions based on multiple classifiers can be merged into a robust estimate of future conflict risk.
Preparations¶
Start with loading the required packages.
[1]:
from copro import utils, pipeline, evaluation, plots, machine_learning
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import geopandas as gpd
import seaborn as sbs
import os, sys
from sklearn import metrics
from shutil import copyfile
import warnings
import glob
warnings.simplefilter("ignore")
For better reproducibility, the version numbers of all key packages are provided.
[2]:
utils.show_versions()
Python version: 3.7.8 | packaged by conda-forge | (default, Jul 31 2020, 01:53:57) [MSC v.1916 64 bit (AMD64)]
copro version: 0.0.8
geopandas version: 0.9.0
xarray version: 0.15.1
rasterio version: 1.1.0
pandas version: 1.0.3
numpy version: 1.18.1
scikit-learn version: 0.23.2
matplotlib version: 3.2.1
seaborn version: 0.11.0
rasterstats version: 0.14.0
To be able to run this notebook as well, some of the previously saved data needs to be loaded from a temporary location.
[3]:
conflict_gdf = gpd.read_file(os.path.join('temp_files', 'conflicts.shp'))
selected_polygons_gdf = gpd.read_file(os.path.join('temp_files', 'polygons.shp'))
[4]:
global_arr = np.load(os.path.join('temp_files', 'global_df.npy'), allow_pickle=True)
global_df = pd.DataFrame(data=global_arr, columns=['geometry', 'ID'])
global_df.set_index(global_df.ID, inplace=True)
global_df.drop(['ID'] , axis=1, inplace=True)
The configurations-file (cfg-file)¶
To be able to continue the simulation with the same settings as in the previous notebook, the cfg-file has to be read again and the model needs to be initialised subsequently. This is not needed if CoPro is run from command line. Please see the first notebook for additional information.
[5]:
settings_file = 'example_settings.cfg'
[6]:
main_dict, root_dir = utils.initiate_setup(settings_file, verbose=False)
#### CoPro version 0.0.8 ####
#### For information about the model, please visit https://copro.readthedocs.io/ ####
#### Copyright (2020-2021): Jannis M. Hoch, Sophie de Bruin, Niko Wanders ####
#### Contact via: j.m.hoch@uu.nl ####
#### The model can be used and shared under the MIT license ####
INFO: reading model properties from example_settings.cfg
INFO: verbose mode on: False
INFO: saving output to main folder C:\Users\hoch0001\Documents\_code\copro\example\./OUT
[7]:
config_REF = main_dict['_REF'][0]
out_dir_REF = main_dict['_REF'][1]
In addition to the config-object and output path for the reference period, main_dict also contains the equivalents for the projection run. In the cfg-file, an extra cfg-file can be provided per projection.
[8]:
config_REF.items('PROJ_files')
[8]:
[('proj_nr_1', './example_settings_proj.cfg')]
In this example, the file is called example_settings_proj.cfg and the name of the projection is proj_nr_1.
[9]:
config_PROJ = main_dict['proj_nr_1'][0]
print('the configuration of the projection run is {}'.format(config_PROJ))
out_dir_PROJ = main_dict['proj_nr_1'][1]
print('the output directory of the projection run is {}'.format(out_dir_PROJ))
the configuration of the projection run is [<configparser.RawConfigParser object at 0x0000021E18A03508>]
the output directory of the projection run is C:\Users\hoch0001\Documents\_code\copro\example\./OUT\_PROJ\proj_nr_1
In the previous notebooks, conflict at the last year of the reference period as well as the classifiers were stored temporarily to another folder than the output folder. Now let's copy these files back to the folders where they belong.
[10]:
%%capture
for root, dirs, files in os.walk('temp_files'):

    # conflicts at last time step
    files = glob.glob(os.path.abspath('./temp_files/conflicts_in*'))
    for file in files:
        fname = file.rsplit('\\')[-1]
        print(fname)
        copyfile(os.path.join('temp_files', fname),
                 os.path.join(out_dir_REF, 'files', str(fname)))

    # classifiers
    files = glob.glob(os.path.abspath('./temp_files/clf*'))
    for file in files:
        fname = file.rsplit('\\')[-1]
        print(fname)
        copyfile(os.path.join('temp_files', fname),
                 os.path.join(out_dir_REF, 'clfs', str(fname)))
Similarly, we need to load the sample data (X) for the reference run as we need to fit the scaler with this data before we can make comparable and consistent projections.
[11]:
config_REF.set('pre_calc', 'XY', str(os.path.join(out_dir_REF, 'XY.npy')))
X, Y = pipeline.create_XY(config_REF, out_dir_REF, root_dir, selected_polygons_gdf, conflict_gdf)
INFO: loading XY data from file C:\Users\hoch0001\Documents\_code\copro\example\./OUT\_REF\XY.npy
Lastly, we need to get the scaler for the samples matrix again. The pre-computed and already fitted classifiers are directly loaded from file (see above). The clf returned here will not be used.
[12]:
scaler, clf = pipeline.prepare_ML(config_REF)
Project!¶
With all this in place, we can now make projections. Under the hood, various steps are taken for each projection run specified:
Load the corresponding ConfigParser-object;
Determine the projection period, defined as the period between the last year of the reference run and the projection year specified in the cfg-file of the projection run;
Make a separate projection per classifier (the number of classifiers, or model runs, is specified in the cfg-file):
in the first year of the projection period, use conflict data from the last year of the reference run, i.e. still observed conflict data;
in all following years, use the conflict data projected for the previous year with this specific classifier;
all other variables are read from file for all years.
Per year, merge the conflict risk projected by all classifiers and derive a fractional conflict risk per polygon.
For detailed information, please see the documentation and code of copro.pipeline.run_prediction(). As this is one function doing all the work, it is not possible to split up the workflow in more detail here; a simplified conceptual sketch is given below instead.
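Conceptually, these steps boil down to a nested loop over years and classifiers. The following is a strongly simplified, hypothetical sketch; all helper names and the dummy data are illustrative only and do not mirror the actual implementation of copro.pipeline.run_prediction():
import numpy as np

y_end, y_proj = 2012, 2015      # from the example cfg-files
n_polygons = 5                  # stand-in for the selected polygons
fitted_clfs = []                # would be loaded from the pickled clf-files

# conflict state at the end of the reference run (observed data)
conflict_state = {id(c): np.zeros(n_polygons) for c in fitted_clfs}

for year in range(y_end + 1, y_proj + 1):
    outcomes = []
    for c in fitted_clfs:
        X_vars = np.random.rand(n_polygons, 3)                    # stand-in for variables read from file
        X_proj = np.column_stack([X_vars, conflict_state[id(c)]])
        y_hat = c.predict(X_proj)                                 # projection of this classifier
        conflict_state[id(c)] = y_hat                             # fed back into the next year
        outcomes.append(y_hat)
    if outcomes:
        # fraction of classifiers projecting conflict = fractional conflict risk
        fractional_risk = np.mean(outcomes, axis=0)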
[13]:
all_y_df = pipeline.run_prediction(scaler.fit(X[: , 2:]), main_dict, root_dir, selected_polygons_gdf)
INFO: loading config-object for projection run: proj_nr_1
INFO: the projection period is 2013 to 2015
INFO: making projection for year 2013
INFO: making projection for year 2014
INFO: making projection for year 2015
Analysis of projection¶
All the previously used evaluation metrics are no longer applicable, as there are no target values anymore. We can still inspect the mean conflict probability per polygon as computed by the model.
[15]:
# link projection outcome to polygons via unique polygon-ID
df_hit, gdf_hit = evaluation.polygon_model_accuracy(all_y_df, global_df, make_proj=True)
# and plot
fig, ax = plt.subplots(1, 1, figsize=(10, 10))
gdf_hit.plot(ax=ax, column='probability_of_conflict', legend=True, figsize=(20, 10), cmap='Blues', vmin=0, vmax=1,
             legend_kwds={'label': "mean conflict probability", 'orientation': "vertical", 'fraction': 0.045})
selected_polygons_gdf.boundary.plot(ax=ax, color='0.5');
[figure: map of mean projected conflict probability per polygon]
Projection output¶
The conflict projection per year is also stored in the output folder of the projection run as geoJSON-files. These files can be used to post-process the data with the scripts provided with CoPro, or loaded into bespoke scripts and functions written by the user.
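For instance, the files can be loaded with geopandas; a small sketch with an illustrative path and the column names described in the Output section:
import geopandas as gpd

# load the merged projection output of one year (path is illustrative)
gdf = gpd.read_file('./OUT/_PROJ/proj_nr_1/output_in_2015.geojson')
gdf.plot(column='probability_of_conflict', legend=True, vmin=0, vmax=1)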
Output¶
Output folder structure¶
All output is stored in the output folder as specified in the configurations-file (cfg-file) under [general].
[general]
output_dir=./path/to/store/output
By default, CoPro creates two sub-folders: _REF and _PROJ. In the latter, another sub-folder will be created per projection defined in the cfg-file. In the example below, this would be the folders /_PROJ/SSP1 and /_PROJ/SSP2.
[PROJ_files]
SSP1=/path/to/ssp1.cfg
SSP2=/path/to/ssp2.cfg
List of output files¶
Important
Not all model types provide the output mentioned below. If the ‘leave-one-out’ or ‘single variable’ model is selected, only the metrics are stored to a csv-file.
_REF¶
In addition to the output files listed below, the cfg-file is automatically copied to the _REF folder.
selected_polygons.shp: Shapefile containing all remaining polygons after the selection procedure.
selected_conflicts.shp: Shapefile containing all remaining conflict points after the selection procedure.
XY.npy: NumPy-array containing geometry, ID, and scaled data of sample (X) and target data (Y). Can be provided in the cfg-file to save time in a next run; the file can be loaded with numpy.load().
raw_output_data.npy: NumPy-array containing each single prediction made in the reference run. Will contain multiple predictions per polygon. The file can be loaded with numpy.load().
evaluation_metrics.csv: Various evaluation metrics determined per repetition of the split-sample tests. The file can e.g. be loaded with pandas.read_csv().
feature_importance.csv: Importance of each model variable in making projections. This is a property of RF Classifiers and thus only obtainable if the RF Classifier is used.
permutation_importance.csv: Mean permutation importance per model variable. Computed with sklearn.inspection.permutation_importance.
ROC_data_tprs.csv and ROC_data_aucs.csv: True-positive rates respectively area-under-curve values per repetition of the split-sample test. Files can e.g. be loaded with pandas.read_csv() and used to later plot the ROC-curve.
output_for_REF.geojson: GeoJSON-file containing the resulting conflict risk estimates per polygon based on out-of-sample projections of the _REF run.
Conflict risk per polygon¶
At the end of all model repetitions, the resulting raw_output_data.npy file contains multiple out-of-sample predictions per polygon. By aggregating results per polygon, it is possible to assess model output spatially, as stored in output_for_REF.geojson.
The main output metrics are calculated per polygon and saved to output_per_polygon.shp:
nr_predictions: the number of predictions made;
nr_correct_predictions: the number of correct predictions made;
nr_observed_conflicts: the number of observed conflict events;
nr_predicted_conflicts: the number of predicted conflicts;
min_prob_1: minimum probability of conflict in all repetitions;
probability_of_conflict (POC): probability of conflict averaged over all repetitions;
max_prob_1: maximum probability of conflict in all repetitions;
fraction_correct_predictions (FOP): ratio of the number of correct predictions over the total number of predictions made;
chance_of_conflict: ratio of the number of conflict predictions over the total number of predictions made.
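A short sketch of loading and inspecting these per-polygon results with geopandas; the file is the geoJSON listed under _REF above, and the path is illustrative:
import geopandas as gpd

# load the aggregated per-polygon output of the reference run
gdf = gpd.read_file('./OUT/_REF/output_for_REF.geojson')
print(gdf[['nr_predictions', 'fraction_correct_predictions', 'chance_of_conflict']].head())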
_PROJ¶
Per projection, CoPro creates one output file per projection year.
output_in_<YEAR>: GeoJSON-file containing model output per polygon averaged over all classifier instances per YEAR of the projection. The number of instances is set with n_runs in the [machine_learning] section.
Conflict risk per polygon¶
During the projection run, each classifier instance produces its own output per YEAR. CoPro merges these outputs into one output_in_<YEAR>.geojson file.
As there are no observations available for the projection period, the output metrics differ from the reference run:
nr_predictions: the number of predictions made, i.e. the number of classifier instances;
nr_predicted_conflicts: the number of predicted conflicts;
min_prob_1: minimum probability of conflict in all outputs of classifier instances;
probability_of_conflict (POC): probability of conflict averaged over all outputs of classifier instances;
max_prob_1: maximum probability of conflict in all outputs of classifier instances;
chance_of_conflict: ratio of the number of conflict predictions over the total number of predictions made.
Postprocessing¶
There are several command line scripts available for post-processing. In addition to quick plots to evaluate model output, they also produce files for use in bespoke plotting and analysis scripts.
The scripts are located under /copro/scripts/postprocessing
.
The help print-outs shown here can always be accessed with
python <SCRIPT_FILE_NAME> --help
plot_value_over_time.py¶
Usage: python plot_value_over_time.py [OPTIONS] INPUT_DIR OUTPUT_DIR
Quick and dirty function to plot the development of a column in the
outputted geoJSON-files over time. The script uses all geoJSON-files
located in input-dir and retrieves values from them. It is possible to
obtain the development for multiple polygons (indicated via their ID) or
the entire study area. If the latter, then different statistics can be
chosen (mean, max, min, std).
Args:
input-dir (str): path to input directory with geoJSON-files located per projection year.
output-dir (str): path to directory where output will be stored.
Output:
a csv-file containing values per time step.
a png-file showing development over time.
Options:
-id, --polygon-id TEXT
-s, --statistics TEXT which statistical method to use (mean, max, min,
std). note: has only effect if with "-id all"!
-c, --column TEXT column name
-t, --title TEXT title for plot and file_object name
--verbose / --no-verbose verbose on/off
avg_over_time.py¶
Usage: python avg_over_time.py [OPTIONS] INPUT_DIR OUTPUT_DIR SELECTED_POLYGONS
Post-processing script to calculate average model output over a
user-specified period, or over all output geoJSON-files stored in input-dir.
Computed average values can be outputted as geoJSON-file, png-file, or both.
Args:
input_dir: path to input directory with geoJSON-files located per projection year.
output_dir (str): path to directory where output will be stored.
selected_polygons (str): path to a shp-file with all polygons used in a CoPro run.
Output:
geoJSON-file with average column value per polygon (if geojson is set).
png-file with plot of average column value per polygon (if png is set)
Options:
-t0, --start-year INTEGER
-t1, --end-year INTEGER
-c, --column TEXT column name
--geojson / --no-geojson save output to geojson or not
--png / --no-png save output to png or not
--verbose / --no-verbose verbose on/off
plot_polygon_vals.py¶
Usage: python plot_polygon_vals.py [OPTIONS] FILE_OBJECT OUTPUT_DIR
Quick and dirty function to plot the column values of a geojson file with
minimum user input, and save plot. Mainly used for quick inspection of
model output in specific years.
Args:
file-object (str): path to geoJSON-file whose values are to be plotted.
output-dir (str): path to directory where plot will be saved.
Output:
a png-file of values per polygon.
Options:
-c, --column TEXT column name
-t, --title TEXT title for plot and file_object name
-v0, --minimum-value FLOAT
-v1, --maximum-value FLOAT
-cmap, --color-map TEXT
geojson2gif.py¶
Usage: python geojson2gif.py [OPTIONS] INPUT_DIR OUTPUT_DIR
Function to convert column values of all geoJSON-files in a directory into
one GIF-file. The function provides several options to modify the design
of the GIF-file. The GIF-file is based on png-files of the column value per
geoJSON-file. It is possible to keep these png-files as simple plots of
values per time step.
Args:
input-dir (str): path to directory where geoJSON-files are stored.
output_dir (str): path to directory where GIF-file will be stored.
Output:
GIF-file with animated column values per input geoJSON-file.
Options:
-c, --column TEXT column name
-cmap, --color-map TEXT
-v0, --minimum-value FLOAT
-v1, --maximum-value FLOAT
--delete / --no-delete whether or not to delete png-files
API docs¶
This section contains the Documentation of the Application Programming Interface (API) of ‘copro’.
The model pipeline¶
Top-level function to create the X-array and Y-array.
Top-level function to instantiate the scaler and model as specified in model configurations.
Top-level function to run one of the four supported models.
Top-level function to execute the projections.
The various models¶
Main model workflow when all XY-data is used.
Model workflow when each variable is left out from the analysis once.
Model workflow when the model is based on only one single variable.
Model workflow when the relation between variables and conflict is based on randomness.
Predictive model using the already fitted classifier to make annual projections for the projection period.
Note
The ‘leave_one_out’, ‘single_variables’, and ‘dubbelsteen’ models are only tested in beta-state. They will most likely be deprecated in the near future.
Selecting polygons and conflicts¶
Main function performing the selection procedure.
Filters the conflict database according to certain conflict properties such as number of casualties, type of violence, or country.
Reduces the geo-dataframe to those entries falling into a specified time period.
As the original conflict data has global extent, this function clips the database to those entries which have occurred on a specified continent.
This function allows for selecting only those conflicts and polygons falling in specified climate zones.
Machine learning¶
Defines the scaling method based on model configurations.
Defines the model based on model configurations.
Splits and transforms the X-array (or sample data) and Y-array (or target data) into test-data and training-data.
Fits the classifier based on training-data and makes predictions.
(Re)fits a classifier with all available data and pickles it.
Loads the paths to all previously fitted classifiers into a list.
Variable values¶
This function extracts a value from a netCDF-file (specified in the cfg-file) for each polygon specified in extent_gdf for a given year.
This function extracts a value from a netCDF-file (specified in the cfg-file) for each polygon specified in extent_gdf for a given year.
Warning
Reading files with a float timestamp will most likely be deprecated in the near future.
XY-Data¶
Initiates an empty dictionary to contain the XY-data for each polygon, i.e. …
Initiates an empty dictionary to contain the X-data for each polygon, i.e. …
Fills the (XY-)dictionary with data for each variable and conflict for each polygon for each simulation year.
Fills the X-dictionary with the sample data besides any conflict-related data for each polygon and each year.
Fills the X-dictionary with the conflict data for each polygon and each year.
Separates the XY-array into an array containing information about variable values (X-array or sample data) and conflict data (Y-array or target data).
For each polygon, determines its neighboring polygons.
Filters all polygons which are actually neighbors to a given polygon.
Work with conflict data¶
Creates a list for each timestep with boolean information whether a conflict took place in a polygon or not.
Creates a list for each timestep with boolean information whether a conflict took place in a polygon at the previous timestep or not.
Creates a list for each timestep with boolean information whether a conflict took place in a polygon or not.
Determines whether conflict took place in the neighbouring polygons of a polygon i_poly.
Extracts and returns a list with unique identifiers for each polygon used in the model.
Extracts geometry information for each polygon from the geodataframe and saves it to a list.
Separates the unique identifier, geometry information, and data from the variable-containing X-array.
Stacks together the arrays with unique identifier, geometry, test data, and predicted data into a dataframe.
Model evaluation¶
Initiates the main model evaluation dictionary for a range of model metric scores.
Appends the computed metric score per run to the main output dictionary.
Initiates an empty main output dataframe.
Appends the output dataframe of each simulation to the main output dataframe.
Computes a range of model evaluation metrics and appends the resulting scores to a dictionary.
Determines a range of model accuracy values for each polygon.
Initiates empty lists for a range of variables needed to plot the ROC-curve per simulation.
Saves the data needed to plot mean ROC and standard deviation to csv-files.
Computes the correlation matrix for a dataframe.
Determines the relative importance of each feature (i.e. …).
Returns a dataframe with the mean permutation importance of the features used to train a RF tree model.
Plotting¶
Creates a plotting instance of the boundaries of all selected polygons.
Creates a plotting instance of the best casualties estimates of the selected conflicts.
Plots the value distribution of a range of evaluation metrics based on all model simulations.
Plots the correlation matrix of a dataframe.
Plots the ROC-curve per model simulation to a pre-initiated matplotlib-instance.
Plots the mean ROC-curve to a pre-initiated matplotlib-instance.
Auxiliary functions¶
click.echos a header with main model information.
Georeferences a pandas dataframe using longitude and latitude columns of that dataframe.
click.echos the version numbers of the main python-packages used.
Reads the model configuration file.
This function parses the (various) cfg-files for projections.
Determines the period for which projections need to be made.
Creates the output folder at the location specified in the cfg-file, and returns a dictionary with config-objects and out-dir per run.
If specified in the cfg-file, the PRIO/UCDP data is directly downloaded and used as model input.
Initiates the model set-up.
Creates an array with an identical percentage of conflict points as the input array.
Retrieves unique ID and geometry information from the geo-dataframe for a global look-up dataframe.
Filters out only those polygons where conflict was actually observed in the test-sample.
Saves a dictionary to csv-file.
Saves an argument (either dictionary or dataframe) to npy-file.
Authors¶
Jannis M. Hoch (Utrecht University)
Sophie de Bruin (Utrecht University, PBL)
Niko Wanders (Utrecht University)
Corresponding author: Jannis M. Hoch (j.m.hoch@uu.nl)