5. Training Data Collectors

The first step in solving mixed-integer optimization problems with the assistance of supervised machine learning is solving a large set of training instances and collecting the raw training data. In this section, we describe the various training data collectors included in MIPLearn. By convention, the framework stores all training data in files with a specific data format (namely, HDF5); we also briefly describe this format and the rationale for choosing it.

5.1. Overview

In MIPLearn, a collector is a class that solves or analyzes a problem and collects raw data that may later be useful to machine learning methods. By convention, collectors take as input: (i) a list of problem data filenames, in gzipped pickle format, ending with .pkl.gz; (ii) a function that builds the optimization model, such as build_tsp_model. After processing is done, collectors store the training data in an HDF5 file located alongside the problem data. For example, if the problem data is stored in the file problem.pkl.gz, then the collector writes to problem.h5. Collectors are, in general, very time-consuming, as they may need to solve the problem to optimality, potentially multiple times.
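As a minimal sketch of this calling convention, the snippet below runs BasicCollector (described in Section 5.3, where a complete worked example is also given) on a set of problem data files:

from glob import glob

from miplearn.collectors.basic import BasicCollector
from miplearn.problems.tsp import build_tsp_model

# Run the collector on every problem data file; afterwards, each
# data/tsp/<name>.pkl.gz has a corresponding data/tsp/<name>.h5.
BasicCollector().collect(glob("data/tsp/*.pkl.gz"), build_tsp_model)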

5.2. HDF5 Format

MIPLearn stores all training data in HDF5 (Hierarchical Data Format, Version 5) files. The HDF format was originally developed by the National Center for Supercomputing Applications (NCSA) for storing and organizing large amounts of data, and supports a variety of data types, including integers, floating-point numbers, strings, and arrays. Compared to other formats, such as CSV, JSON or SQLite, the HDF5 format provides several advantages for MIPLearn, including:

  • Storage of multiple scalars, vectors and matrices in a single file — This allows MIPLearn to store all training data related to a given problem instance in a single file, which makes training data easier to store, organize and transfer.

  • High-performance partial I/O — Partial I/O allows MIPLearn to read a single element from the training data (e.g. value of the optimal solution) without loading the entire file to memory or reading it from beginning to end, which dramatically improves performance and reduces memory requirements. This is especially important when processing a large number of training data files.

  • On-the-fly compression — HDF5 files can be transparently compressed using gzip, which reduces storage requirements and accelerates network transfers.

  • Stable, portable and well-supported data format — Training data files are typically expensive to generate. Having a stable and well-supported data format ensures that these files remain usable in the future, potentially even by other non-Python MIP/ML frameworks.

MIPLearn currently uses HDF5 as a simple key-value store for numerical data; more advanced features of the format, such as metadata, are not currently used. Although files generated by MIPLearn can be read with any HDF5 library, such as h5py, some convenience functions are provided to make access simpler and less error-prone. Specifically, the class H5File, which is built on top of h5py, provides the methods put_scalar, put_array, put_sparse and put_bytes to store, respectively, scalar values, dense multi-dimensional arrays, sparse multi-dimensional arrays and arbitrary binary data. The corresponding get methods are also provided. Compared to pure h5py methods, these methods automatically perform type checking and gzip compression. The example below shows their usage.

Example

[3]:
import numpy as np
import scipy.sparse

from miplearn.h5 import H5File

# Set random seed to make example reproducible
np.random.seed(42)

# Create a new empty HDF5 file
with H5File("test.h5", "w") as h5:
    # Store a scalar
    h5.put_scalar("x1", 1)
    h5.put_scalar("x2", "hello world")

    # Store a dense array and a dense matrix
    h5.put_array("x3", np.array([1, 2, 3]))
    h5.put_array("x4", np.random.rand(3, 3))

    # Store a sparse matrix
    h5.put_sparse("x5", scipy.sparse.random(5, 5, 0.5))

# Re-open the file we just created and print
# previously-stored data
with H5File("test.h5", "r") as h5:
    print("x1 =", h5.get_scalar("x1"))
    print("x2 =", h5.get_scalar("x2"))
    print("x3 =", h5.get_array("x3"))
    print("x4 =", h5.get_array("x4"))
    print("x5 =", h5.get_sparse("x5"))
x1 = 1
x2 = hello world
x3 = [1 2 3]
x4 = [[0.37454012 0.9507143  0.7319939 ]
 [0.5986585  0.15601864 0.15599452]
 [0.05808361 0.8661761  0.601115  ]]
x5 =   (2, 3)   0.68030757
  (3, 2)        0.45049927
  (4, 0)        0.013264962
  (0, 2)        0.94220173
  (4, 2)        0.5632882
  (2, 1)        0.3854165
  (1, 1)        0.015966251
  (3, 0)        0.23089382
  (4, 4)        0.24102546
  (1, 3)        0.68326354
  (3, 1)        0.6099967
  (0, 3)        0.8331949
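Since these are ordinary HDF5 files, they can also be opened directly with h5py. As a brief illustration of partial I/O, the sketch below reads a single entry of the matrix x4 created above without loading the entire dataset, and inspects the compression filter that H5File applied:

import h5py

with h5py.File("test.h5", "r") as f:
    # Partial I/O: read one entry of the 3x3 matrix "x4"
    # without loading the whole dataset into memory.
    print(f["x4"][0, 0])

    # H5File stores datasets with gzip compression.
    print(f["x4"].compression)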

5.3. Basic Collector

BasicCollector is the most fundamental collector, and performs the following steps:

  1. Extracts all model data, such as objective function coefficients and constraint right-hand sides, into NumPy arrays, which can later be accessed easily and efficiently without rebuilding the model or invoking the solver;

  2. Solves the linear relaxation of the problem and stores its optimal solution, basis status and sensitivity information, among other data;

  3. Solves the original mixed-integer optimization problem to optimality and stores its optimal solution, along with solve statistics such as the number of explored nodes and the wallclock time.

Data extracted in steps 1, 2 and 3 above are prefixed, respectively, with static_, lp_ and mip_. The complete set of fields is shown in the table below.

Data fields

| Field | Type | Description |
|-------|------|-------------|
| static_constr_lhs | (nconstrs, nvars) | Constraint left-hand sides, in sparse matrix format |
| static_constr_names | (nconstrs,) | Constraint names |
| static_constr_rhs | (nconstrs,) | Constraint right-hand sides |
| static_constr_sense | (nconstrs,) | Constraint senses ("<", ">" or "=") |
| static_obj_offset | float | Constant value added to the objective function |
| static_sense | str | "min" if minimization problem or "max" otherwise |
| static_var_lower_bounds | (nvars,) | Variable lower bounds |
| static_var_names | (nvars,) | Variable names |
| static_var_obj_coeffs | (nvars,) | Objective coefficients |
| static_var_types | (nvars,) | Types of the decision variables ("C", "B" and "I" for continuous, binary and integer, respectively) |
| static_var_upper_bounds | (nvars,) | Variable upper bounds |
| lp_constr_basis_status | (nconstrs,) | Constraint basis status (0 for basic, -1 for non-basic) |
| lp_constr_dual_values | (nconstrs,) | Constraint dual values (shadow prices) |
| lp_constr_sa_rhs_{up,down} | (nconstrs,) | Sensitivity information for the constraint RHS |
| lp_constr_slacks | (nconstrs,) | Constraint slacks in the solution to the LP relaxation |
| lp_obj_value | float | Optimal value of the LP relaxation |
| lp_var_basis_status | (nvars,) | Variable basis status (0, -1, -2 or -3 for basic, non-basic at lower bound, non-basic at upper bound, and superbasic, respectively) |
| lp_var_reduced_costs | (nvars,) | Variable reduced costs |
| lp_var_sa_{obj,ub,lb}_{up,down} | (nvars,) | Sensitivity information for the variable objective coefficients, lower and upper bounds |
| lp_var_values | (nvars,) | Optimal solution to the LP relaxation |
| lp_wallclock_time | float | Time taken to solve the LP relaxation (in seconds) |
| mip_constr_slacks | (nconstrs,) | Constraint slacks in the best MIP solution |
| mip_gap | float | Relative MIP optimality gap |
| mip_node_count | float | Number of explored branch-and-bound nodes |
| mip_obj_bound | float | Dual bound |
| mip_obj_value | float | Value of the best MIP solution |
| mip_var_values | (nvars,) | Best MIP solution |
| mip_wallclock_time | float | Time taken to solve the MIP (in seconds) |

Example

The example below shows how to generate a few random instances of the traveling salesman problem, store their problem data, run the collector, and print some of the training data to the screen.

[4]:
import random
import numpy as np
from scipy.stats import uniform, randint
from glob import glob

from miplearn.problems.tsp import (
    TravelingSalesmanGenerator,
    build_tsp_model,
)
from miplearn.io import write_pkl_gz
from miplearn.h5 import H5File
from miplearn.collectors.basic import BasicCollector

# Set random seed to make example reproducible.
random.seed(42)
np.random.seed(42)

# Generate a few instances of the traveling salesman problem.
data = TravelingSalesmanGenerator(
    n=randint(low=10, high=11),
    x=uniform(loc=0.0, scale=1000.0),
    y=uniform(loc=0.0, scale=1000.0),
    gamma=uniform(loc=0.90, scale=0.20),
    fix_cities=True,
    round=True,
).generate(10)

# Save instance data to data/tsp/00000.pkl.gz, data/tsp/00001.pkl.gz, ...
write_pkl_gz(data, "data/tsp")

# Solve all instances and collect basic solution information.
# Process at most four instances in parallel.
bc = BasicCollector()
bc.collect(glob("data/tsp/*.pkl.gz"), build_tsp_model, n_jobs=4)

# Read and print some training data for the first instance.
with H5File("data/tsp/00000.h5", "r") as h5:
    print("lp_obj_value = ", h5.get_scalar("lp_obj_value"))
    print("mip_obj_value = ", h5.get_scalar("mip_obj_value"))
lp_obj_value =  2909.0
mip_obj_value =  2921.0
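Vector and matrix fields from the table above are accessed in the same way. For instance, the sketch below reads the best MIP solution and the sparse constraint matrix of the same instance:

with H5File("data/tsp/00000.h5", "r") as h5:
    # Dense vector fields are read with get_array.
    print("mip_var_values =", h5.get_array("mip_var_values"))

    # Constraint left-hand sides are stored as a sparse matrix
    # and read with get_sparse.
    lhs = h5.get_sparse("static_constr_lhs")
    print("static_constr_lhs shape:", lhs.shape)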