6. Feature Extractors

In the previous page, we introduced training data collectors, which solve the optimization problem and collect raw training data, such as the optimal solution. In this page, we introduce feature extractors, which take the raw training data, stored in HDF5 files, and extract relevant information in order to train a machine learning model.

6.1. Overview

Feature extraction is an important step of the process of building a machine learning model because it helps to reduce the complexity of the data and convert it into a format that is more easily processed. Previous research has proposed converting absolute variable coefficients, for example, into relative values which are invariant to various transformations, such as problem scaling, making them more amenable to learning. Various other transformations have also been described.

In the framework, we treat data collection and feature extraction as two separate steps to accelerate the model development cycle. Specifically, collectors are typically time-consuming, as they often need to solve the problem to optimality, and therefore focus on collecting and storing all data that may or may not be relevant, in its raw format. Feature extractors, on the other hand, focus entirely on filtering the data and improving its representation, and are therefore much faster to run. Experimenting with new data representations, therefore, can be done without resolving the instances.

In MIPLearn, extractors implement the abstract class FeatureExtractor, which has methods that take as input an H5File and produce either: (i) instance features, which describe the entire instances; (ii) variable features, which describe a particular decision variables; or (iii) constraint features, which describe a particular constraint. The extractor is free to implement only a subset of these methods, if it is known that it will not be used with a machine learning component that requires the other types of features.

6.2. H5FieldsExtractor

H5FieldsExtractor, the most simple extractor in MIPLearn, simple extracts data that is already available in the HDF5 file, assembles it into a matrix and returns it as-is. The fields used to build instance, variable and constraint features are user-specified. The class also performs checks to ensure that the shapes of the returned matrices make sense.

Example

The example below demonstrates the usage of H5FieldsExtractor in a randomly generated instance of the multi-dimensional knapsack problem.

[5]:
from glob import glob
from shutil import rmtree

import numpy as np
from scipy.stats import uniform, randint

from miplearn.collectors.basic import BasicCollector
from miplearn.extractors.fields import H5FieldsExtractor
from miplearn.h5 import H5File
from miplearn.io import write_pkl_gz
from miplearn.problems.multiknapsack import (
    MultiKnapsackGenerator,
    build_multiknapsack_model,
)

# Set random seed to make example reproducible
np.random.seed(42)

# Generate some random multiknapsack instances
rmtree("data/multiknapsack/", ignore_errors=True)
write_pkl_gz(
    MultiKnapsackGenerator(
        n=randint(low=10, high=11),
        m=randint(low=5, high=6),
        w=uniform(loc=0, scale=1000),
        K=uniform(loc=100, scale=0),
        u=uniform(loc=1, scale=0),
        alpha=uniform(loc=0.25, scale=0),
        w_jitter=uniform(loc=0.95, scale=0.1),
        p_jitter=uniform(loc=0.75, scale=0.5),
        fix_w=True,
    ).generate(10),
    "data/multiknapsack",
)

# Run the basic collector
BasicCollector().collect(
    glob("data/multiknapsack/*"),
    build_multiknapsack_model,
    n_jobs=4,
)

ext = H5FieldsExtractor(
    # Use as instance features the value of the LP relaxation and the
    # vector of objective coefficients.
    instance_fields=[
        "lp_obj_value",
        "static_var_obj_coeffs",
    ],
    # For each variable, use as features the optimal value of the LP
    # relaxation, the variable objective coefficient, the variable's
    # value its reduced cost.
    var_fields=[
        "lp_obj_value",
        "static_var_obj_coeffs",
        "lp_var_values",
        "lp_var_reduced_costs",
    ],
    # For each constraint, use as features the RHS, dual value and slack.
    constr_fields=[
        "static_constr_rhs",
        "lp_constr_dual_values",
        "lp_constr_slacks",
    ],
)

with H5File("data/multiknapsack/00000.h5") as h5:
    # Extract and print instance features
    x1 = ext.get_instance_features(h5)
    print("instance features", x1.shape, "\n", x1)

    # Extract and print variable features
    x2 = ext.get_var_features(h5)
    print("variable features", x2.shape, "\n", x2)

    # Extract and print constraint features
    x3 = ext.get_constr_features(h5)
    print("constraint features", x3.shape, "\n", x3)

instance features (11,)
 [-1531.24308771  -350.          -692.          -454.
  -709.          -605.          -543.          -321.
  -674.          -571.          -341.        ]
variable features (10, 4)
 [[-1.53124309e+03 -3.50000000e+02  0.00000000e+00  9.43468018e+01]
 [-1.53124309e+03 -6.92000000e+02  2.51703322e-01  0.00000000e+00]
 [-1.53124309e+03 -4.54000000e+02  0.00000000e+00  8.25504150e+01]
 [-1.53124309e+03 -7.09000000e+02  1.11373022e-01  0.00000000e+00]
 [-1.53124309e+03 -6.05000000e+02  1.00000000e+00 -1.26055283e+02]
 [-1.53124309e+03 -5.43000000e+02  0.00000000e+00  1.68693771e+02]
 [-1.53124309e+03 -3.21000000e+02  1.07488781e-01  0.00000000e+00]
 [-1.53124309e+03 -6.74000000e+02  8.82293701e-01  0.00000000e+00]
 [-1.53124309e+03 -5.71000000e+02  0.00000000e+00  1.41129074e+02]
 [-1.53124309e+03 -3.41000000e+02  1.28830120e-01  0.00000000e+00]]
constraint features (5, 3)
 [[ 1.3100000e+03 -1.5978307e-01  0.0000000e+00]
 [ 9.8800000e+02 -3.2881632e-01  0.0000000e+00]
 [ 1.0040000e+03 -4.0601316e-01  0.0000000e+00]
 [ 1.2690000e+03 -1.3659772e-01  0.0000000e+00]
 [ 1.0070000e+03 -2.8800571e-01  0.0000000e+00]]

Warning

You should ensure that the number of features remains the same for all relevant HDF5 files. In the previous example, to illustrate this issue, we used variable objective coefficients as instance features. While this is allowed, note that this requires all problem instances to have the same number of variables; otherwise the number of features would vary from instance to instance and MIPLearn would be unable to concatenate the matrices.

6.3. AlvLouWeh2017Extractor

Alvarez, Louveaux and Wehenkel (2017) proposed a set features to describe a particular decision variable in a given node of the branch-and-bound tree, and applied it to the problem of mimicking strong branching decisions. The class AlvLouWeh2017Extractor implements a subset of these features (40 out of 64), which are available outside of the branch-and-bound tree. Some features are derived from the static defintion of the problem (i.e. from objective function and constraint data), while some features are derived from the solution to the LP relaxation. The features have been designed to be: (i) independent of the size of the problem; (ii) invariant with respect to irrelevant problem transformations, such as row and column permutation; and (iii) independent of the scale of the problem. We refer to the paper for a more complete description.

Example

[6]:
from miplearn.extractors.AlvLouWeh2017 import AlvLouWeh2017Extractor
from miplearn.h5 import H5File

# Build the extractor
ext = AlvLouWeh2017Extractor()

# Open previously-created multiknapsack training data
with H5File("data/multiknapsack/00000.h5") as h5:
    # Extract and print variable features
    x1 = ext.get_var_features(h5)
    print("x1", x1.shape, "\n", x1.round(1))
x1 (10, 40)
 [[-1.00e+00  1.00e+20  1.00e-01  1.00e+00  0.00e+00  1.00e+00  6.00e-01
   0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00
   0.00e+00  1.00e+00  6.00e-01  1.00e+00  1.75e+01  1.00e+00  2.00e-01
   1.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00
   0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00
   0.00e+00  1.00e+00 -1.00e+00  0.00e+00  1.00e+20]
 [-1.00e+00  1.00e+20  1.00e-01  1.00e+00  1.00e-01  1.00e+00  1.00e+00
   0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00
   0.00e+00  1.00e+00  7.00e-01  1.00e+00  5.10e+00  1.00e+00  2.00e-01
   1.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00
   0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00
   3.00e-01 -1.00e+00 -1.00e+00  0.00e+00  0.00e+00]
 [-1.00e+00  1.00e+20  1.00e-01  1.00e+00  0.00e+00  1.00e+00  9.00e-01
   0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00
   0.00e+00  1.00e+00  5.00e-01  1.00e+00  1.30e+01  1.00e+00  2.00e-01
   1.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00
   0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00
   0.00e+00  1.00e+00 -1.00e+00  0.00e+00  1.00e+20]
 [-1.00e+00  1.00e+20  1.00e-01  1.00e+00  2.00e-01  1.00e+00  9.00e-01
   0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00
   0.00e+00  1.00e+00  8.00e-01  1.00e+00  3.40e+00  1.00e+00  2.00e-01
   1.00e+00  1.00e-01  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00
   0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00
   1.00e-01 -1.00e+00 -1.00e+00  0.00e+00  0.00e+00]
 [-1.00e+00  1.00e+20  1.00e-01  1.00e+00  1.00e-01  1.00e+00  7.00e-01
   0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00
   0.00e+00  1.00e+00  6.00e-01  1.00e+00  3.80e+00  1.00e+00  2.00e-01
   1.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00
   0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00
   0.00e+00 -1.00e+00 -1.00e+00  0.00e+00  0.00e+00]
 [-1.00e+00  1.00e+20  1.00e-01  1.00e+00  1.00e-01  1.00e+00  8.00e-01
   0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00
   0.00e+00  1.00e+00  7.00e-01  1.00e+00  3.30e+00  1.00e+00  2.00e-01
   1.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00
   0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00
   0.00e+00  1.00e+00 -1.00e+00  0.00e+00  1.00e+20]
 [-1.00e+00  1.00e+20  1.00e-01  1.00e+00  0.00e+00  1.00e+00  3.00e-01
   0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00
   0.00e+00  1.00e+00  1.00e+00  1.00e+00  5.70e+00  1.00e+00  1.00e-01
   1.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00
   0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00
   1.00e-01 -1.00e+00 -1.00e+00  0.00e+00  0.00e+00]
 [-1.00e+00  1.00e+20  1.00e-01  1.00e+00  1.00e-01  1.00e+00  6.00e-01
   0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00
   0.00e+00  1.00e+00  8.00e-01  1.00e+00  6.80e+00  1.00e+00  2.00e-01
   1.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00
   0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00
   1.00e-01 -1.00e+00 -1.00e+00  0.00e+00  0.00e+00]
 [-1.00e+00  1.00e+20  1.00e-01  1.00e+00  4.00e-01  1.00e+00  6.00e-01
   0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00
   0.00e+00  1.00e+00  8.00e-01  1.00e+00  1.40e+00  1.00e+00  1.00e-01
   1.00e+00  1.00e-01  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00
   0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00
   0.00e+00  1.00e+00 -1.00e+00  0.00e+00  1.00e+20]
 [-1.00e+00  1.00e+20  1.00e-01  1.00e+00  0.00e+00  1.00e+00  5.00e-01
   0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00
   0.00e+00  1.00e+00  5.00e-01  1.00e+00  7.60e+00  1.00e+00  1.00e-01
   1.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00
   0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00  0.00e+00
   1.00e-01 -1.00e+00 -1.00e+00  0.00e+00  0.00e+00]]

References

  • Alvarez, Alejandro Marcos. Computational and theoretical synergies between linear optimization and supervised machine learning. (2016). University of Liège.

  • Alvarez, Alejandro Marcos, Quentin Louveaux, and Louis Wehenkel. A machine learning-based approximation of strong branching. INFORMS Journal on Computing 29.1 (2017): 185-195.