{ "cells": [ { "cell_type": "markdown", "id": "505cea0b-5f5d-478a-9107-42bb5515937d", "metadata": {}, "source": [ "# Training Data Collectors\n", "The first step in solving mixed-integer optimization problems with the assistance of supervised machine learning methods is solving a large set of training instances and collecting the raw training data. In this section, we describe the various training data collectors included in MIPLearn. The framework also follows the convention of storing all training data in files with a specific data format (namely, HDF5); we briefly describe this format and the rationale for choosing it.\n", "\n", "## Overview\n", "\n", "In MIPLearn, a **collector** is a class that solves or analyzes the problem and collects raw data which may later be useful for machine learning methods. Collectors, by convention, take as input: (i) a list of problem data filenames, in gzipped pickle format, ending with `.pkl.gz`; (ii) a function that builds the optimization model, such as `build_tsp_model`. After processing is done, collectors store the training data in an HDF5 file located alongside the problem data. For example, if the problem data is stored in the file `problem.pkl.gz`, then the collector writes to `problem.h5`. Collectors are, in general, very time-consuming, as they may need to solve the problem to optimality, potentially multiple times.\n", "\n", "## HDF5 Format\n", "\n", "MIPLearn stores all training data in [HDF5][HDF5] (Hierarchical Data Format, Version 5) files. The HDF format was originally developed by the [National Center for Supercomputing Applications][NCSA] (NCSA) for storing and organizing large amounts of data, and supports a variety of data types, including integers, floating-point numbers, strings, and arrays. 
Compared to other formats, such as CSV, JSON or SQLite, the HDF5 format provides several advantages for MIPLearn, including:\n", "\n", "- *Storage of multiple scalars, vectors and matrices in a single file* --- This allows MIPLearn to store all training data related to a given problem instance in a single file, which makes training data easier to store, organize and transfer.\n", "- *High-performance partial I/O* --- Partial I/O allows MIPLearn to read a single element from the training data (e.g., the value of the optimal solution) without loading the entire file into memory or reading it from beginning to end, which dramatically improves performance and reduces memory requirements. This is especially important when processing a large number of training data files.\n", "- *On-the-fly compression* --- HDF5 files can be transparently compressed using the gzip method, which reduces storage requirements and accelerates network transfers.\n", "- *Stable, portable and well-supported data format* --- Training data files are typically expensive to generate. Having a stable and well-supported data format ensures that these files remain usable in the future, potentially even by other non-Python MIP/ML frameworks.\n", "\n", "MIPLearn currently uses HDF5 as a simple key-value store for numerical data; more advanced features of the format, such as metadata, are not currently used. Although files generated by MIPLearn can be read with any HDF5 library, such as [h5py][h5py], some convenience functions are provided to make access simpler and less error-prone. Specifically, the class [H5File][H5File], which is built on top of h5py, provides the methods [put_scalar][put_scalar], [put_array][put_array], [put_sparse][put_sparse] and [put_bytes][put_bytes] to store, respectively, scalar values, dense multi-dimensional arrays, sparse multi-dimensional arrays and arbitrary binary data. The corresponding *get* methods are also provided. 
Compared to pure h5py methods, these methods automatically perform type-checking and gzip compression. The example below shows their usage.\n", "\n", "[HDF5]: https://en.wikipedia.org/wiki/Hierarchical_Data_Format\n", "[NCSA]: https://en.wikipedia.org/wiki/National_Center_for_Supercomputing_Applications\n", "[h5py]: https://www.h5py.org/\n", "[H5File]: ../../api/helpers/#miplearn.h5.H5File\n", "[put_scalar]: ../../api/helpers/#miplearn.h5.H5File.put_scalar\n", "[put_array]: ../../api/helpers/#miplearn.h5.H5File.put_array\n", "[put_sparse]: ../../api/helpers/#miplearn.h5.H5File.put_sparse\n", "[put_bytes]: ../../api/helpers/#miplearn.h5.H5File.put_bytes\n", "\n", "\n", "### Example" ] }, { "cell_type": "code", "execution_count": 3, "id": "f906fe9c", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "x1 = 1\n", "x2 = hello world\n", "x3 = [1 2 3]\n", "x4 = [[0.37454012 0.9507143 0.7319939 ]\n", " [0.5986585 0.15601864 0.15599452]\n", " [0.05808361 0.8661761 0.601115 ]]\n", "x5 = (2, 3)\t0.68030757\n", " (3, 2)\t0.45049927\n", " (4, 0)\t0.013264962\n", " (0, 2)\t0.94220173\n", " (4, 2)\t0.5632882\n", " (2, 1)\t0.3854165\n", " (1, 1)\t0.015966251\n", " (3, 0)\t0.23089382\n", " (4, 4)\t0.24102546\n", " (1, 3)\t0.68326354\n", " (3, 1)\t0.6099967\n", " (0, 3)\t0.8331949\n" ] } ], "source": [ "import numpy as np\n", "import scipy.sparse\n", "\n", "from miplearn.h5 import H5File\n", "\n", "# Set random seed to make the example reproducible\n", "np.random.seed(42)\n", "\n", "# Create a new empty HDF5 file\n", "with H5File(\"test.h5\", \"w\") as h5:\n", "    # Store a scalar\n", "    h5.put_scalar(\"x1\", 1)\n", "    h5.put_scalar(\"x2\", \"hello world\")\n", "\n", "    # Store a dense array and a dense matrix\n", "    h5.put_array(\"x3\", np.array([1, 2, 3]))\n", "    h5.put_array(\"x4\", np.random.rand(3, 3))\n", "\n", "    # Store a sparse matrix\n", "    h5.put_sparse(\"x5\", scipy.sparse.random(5, 5, 
0.5))\n", "\n", "# Re-open the file we just created and print\n", "# previously-stored data\n", "with H5File(\"test.h5\", \"r\") as h5:\n", "    print(\"x1 =\", h5.get_scalar(\"x1\"))\n", "    print(\"x2 =\", h5.get_scalar(\"x2\"))\n", "    print(\"x3 =\", h5.get_array(\"x3\"))\n", "    print(\"x4 =\", h5.get_array(\"x4\"))\n", "    print(\"x5 =\", h5.get_sparse(\"x5\"))" ] }, { "cell_type": "markdown", "id": "d0000c8d", "metadata": {}, "source": [ "## Basic collector\n", "\n", "[BasicCollector][BasicCollector] is the most fundamental collector, and performs the following steps:\n", "\n", "1. Extracts all model data, such as the objective function and constraint right-hand sides, into NumPy arrays, which can later be accessed easily and efficiently without rebuilding the model or invoking the solver;\n", "2. Solves the linear relaxation of the problem and stores its optimal solution, basis status and sensitivity information, among other data;\n", "3. Solves the original mixed-integer optimization problem to optimality and stores its optimal solution, along with solve statistics, such as the number of explored nodes and the wallclock time.\n", "\n", "Data extracted in Steps 1, 2 and 3 above are prefixed, respectively, with `static_`, `lp_` and `mip_`. 
The entire set of fields is shown in the table below.\n", "\n", "[BasicCollector]: ../../api/collectors/#miplearn.collectors.basic.BasicCollector\n" ] }, { "cell_type": "markdown", "id": "6529f667", "metadata": {}, "source": [ "### Data fields\n", "\n", "| Field | Type | Description |\n", "|-----------------------------------|---------------------|-------------|\n", "| `static_constr_lhs` | `(nconstrs, nvars)` | Constraint left-hand sides, in sparse matrix format |\n", "| `static_constr_names` | `(nconstrs,)` | Constraint names |\n", "| `static_constr_rhs` | `(nconstrs,)` | Constraint right-hand sides |\n", "| `static_constr_sense` | `(nconstrs,)` | Constraint senses (`\"<\"`, `\">\"` or `\"=\"`) |\n", "| `static_obj_offset` | `float` | Constant value added to the objective function |\n", "| `static_sense` | `str` | `\"min\"` for minimization problems, `\"max\"` otherwise |\n", "| `static_var_lower_bounds` | `(nvars,)` | Variable lower bounds |\n", "| `static_var_names` | `(nvars,)` | Variable names |\n", "| `static_var_obj_coeffs` | `(nvars,)` | Objective coefficients |\n", "| `static_var_types` | `(nvars,)` | Types of the decision variables (`\"C\"`, `\"B\"` and `\"I\"` for continuous, binary and integer, respectively) |\n", "| `static_var_upper_bounds` | `(nvars,)` | Variable upper bounds |\n", "| `lp_constr_basis_status` | `(nconstrs,)` | Constraint basis status (`0` for basic, `-1` for non-basic) |\n", "| `lp_constr_dual_values` | `(nconstrs,)` | Constraint dual values (shadow prices) |\n", "| `lp_constr_sa_rhs_{up,down}` | `(nconstrs,)` | Sensitivity information for the constraint right-hand sides |\n", "| `lp_constr_slacks` | `(nconstrs,)` | Constraint slacks in the solution to the LP relaxation |\n", "| `lp_obj_value` | `float` | Optimal value of the LP relaxation |\n", "| `lp_var_basis_status` | `(nvars,)` | Variable basis status (`0`, `-1`, `-2` 
or `-3` for basic, non-basic at lower bound, non-basic at upper bound, and superbasic, respectively) |\n", "| `lp_var_reduced_costs` | `(nvars,)` | Variable reduced costs |\n", "| `lp_var_sa_{obj,ub,lb}_{up,down}` | `(nvars,)` | Sensitivity information for the variable objective coefficients, upper bounds and lower bounds |\n", "| `lp_var_values` | `(nvars,)` | Optimal solution to the LP relaxation |\n", "| `lp_wallclock_time` | `float` | Time taken to solve the LP relaxation (in seconds) |\n", "| `mip_constr_slacks` | `(nconstrs,)` | Constraint slacks in the best MIP solution |\n", "| `mip_gap` | `float` | Relative MIP optimality gap |\n", "| `mip_node_count` | `float` | Number of explored branch-and-bound nodes |\n", "| `mip_obj_bound` | `float` | Dual bound |\n", "| `mip_obj_value` | `float` | Value of the best MIP solution |\n", "| `mip_var_values` | `(nvars,)` | Best MIP solution |\n", "| `mip_wallclock_time` | `float` | Time taken to solve the MIP (in seconds) |" ] }, { "cell_type": "markdown", "id": "f2894594", "metadata": {}, "source": [ "### Example\n", "\n", "The example below shows how to generate a few random instances of the traveling salesman problem, store their problem data, run the collector and print some of the training data to the screen." 
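, "\n", "Any field listed in the table above can be retrieved individually via partial I/O, without reading the rest of the file. A minimal sketch (the filename `problem.h5` is hypothetical; it stands for any file previously produced by a collector):\n", "\n", "```python\n", "from miplearn.h5 import H5File\n", "\n", "# Read LP basis and sensitivity fields from a previously\n", "# collected file (hypothetical filename)\n", "with H5File(\"problem.h5\", \"r\") as h5:\n", "    basis = h5.get_array(\"lp_var_basis_status\")\n", "    reduced_costs = h5.get_array(\"lp_var_reduced_costs\")\n", "```\n"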
] }, { "cell_type": "code", "execution_count": 4, "id": "ac6f8c6f", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "lp_obj_value = 2909.0\n", "mip_obj_value = 2921.0\n" ] } ], "source": [ "import random\n", "import numpy as np\n", "from scipy.stats import uniform, randint\n", "from glob import glob\n", "\n", "from miplearn.problems.tsp import (\n", "    TravelingSalesmanGenerator,\n", "    build_tsp_model,\n", ")\n", "from miplearn.io import write_pkl_gz\n", "from miplearn.h5 import H5File\n", "from miplearn.collectors.basic import BasicCollector\n", "\n", "# Set random seed to make the example reproducible.\n", "random.seed(42)\n", "np.random.seed(42)\n", "\n", "# Generate a few instances of the traveling salesman problem.\n", "data = TravelingSalesmanGenerator(\n", "    n=randint(low=10, high=11),\n", "    x=uniform(loc=0.0, scale=1000.0),\n", "    y=uniform(loc=0.0, scale=1000.0),\n", "    gamma=uniform(loc=0.90, scale=0.20),\n", "    fix_cities=True,\n", "    round=True,\n", ").generate(10)\n", "\n", "# Save instance data to data/tsp/00000.pkl.gz, data/tsp/00001.pkl.gz, ...\n", "write_pkl_gz(data, \"data/tsp\")\n", "\n", "# Solve all instances and collect basic solution information.\n", "# Process at most four instances in parallel.\n", "bc = BasicCollector()\n", "bc.collect(glob(\"data/tsp/*.pkl.gz\"), build_tsp_model, n_jobs=4)\n", "\n", "# Read and print some training data for the first instance.\n", "with H5File(\"data/tsp/00000.h5\", \"r\") as h5:\n", "    print(\"lp_obj_value =\", h5.get_scalar(\"lp_obj_value\"))\n", "    print(\"mip_obj_value =\", h5.get_scalar(\"mip_obj_value\"))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { 
"codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.16" } }, "nbformat": 4, "nbformat_minor": 5 }