{ "cells": [ { "cell_type": "markdown", "id": "cdc6ebe9-d1d4-4de1-9b5a-4fc8ef57b11b", "metadata": {}, "source": [ "# Feature Extractors\n", "\n", "In the previous page, we introduced *training data collectors*, which solve the optimization problem and collect raw training data, such as the optimal solution. In this page, we introduce **feature extractors**, which take the raw training data, stored in HDF5 files, and extract relevant information in order to train a machine learning model." ] }, { "cell_type": "markdown", "id": "b4026de5", "metadata": {}, "source": [ "\n", "## Overview\n", "\n", "Feature extraction is an important step of the process of building a machine learning model because it helps to reduce the complexity of the data and convert it into a format that is more easily processed. Previous research has proposed converting absolute variable coefficients, for example, into relative values which are invariant to various transformations, such as problem scaling, making them more amenable to learning. Various other transformations have also been described.\n", "\n", "In the framework, we treat data collection and feature extraction as two separate steps to accelerate the model development cycle. Specifically, collectors are typically time-consuming, as they often need to solve the problem to optimality, and therefore focus on collecting and storing all data that may or may not be relevant, in its raw format. Feature extractors, on the other hand, focus entirely on filtering the data and improving its representation, and are therefore much faster to run. Experimenting with new data representations, therefore, can be done without resolving the instances.\n", "\n", "In MIPLearn, extractors implement the abstract class [FeatureExtractor][FeatureExtractor], which has methods that take as input an [H5File][H5File] and produce either: (i) instance features, which describe the entire instances; (ii) variable features, which describe a particular decision variables; or (iii) constraint features, which describe a particular constraint. The extractor is free to implement only a subset of these methods, if it is known that it will not be used with a machine learning component that requires the other types of features.\n", "\n", "[FeatureExtractor]: ../../api/collectors/#miplearn.features.fields.FeaturesExtractor\n", "[H5File]: ../../api/helpers/#miplearn.h5.H5File" ] }, { "cell_type": "markdown", "id": "b2d9736c", "metadata": {}, "source": [ "\n", "## H5FieldsExtractor\n", "\n", "[H5FieldsExtractor][H5FieldsExtractor], the most simple extractor in MIPLearn, simple extracts data that is already available in the HDF5 file, assembles it into a matrix and returns it as-is. The fields used to build instance, variable and constraint features are user-specified. The class also performs checks to ensure that the shapes of the returned matrices make sense." ] }, { "cell_type": "markdown", "id": "e8184dff", "metadata": {}, "source": [ "### Example\n", "\n", "The example below demonstrates the usage of H5FieldsExtractor in a randomly generated instance of the multi-dimensional knapsack problem." ] }, { "cell_type": "code", "execution_count": 5, "id": "ed9a18c8", "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "instance features (11,) \n", " [-1531.24308771 -350. -692. -454.\n", " -709. -605. -543. -321.\n", " -674. -571. -341. ]\n", "variable features (10, 4) \n", " [[-1.53124309e+03 -3.50000000e+02 0.00000000e+00 9.43468018e+01]\n", " [-1.53124309e+03 -6.92000000e+02 2.51703322e-01 0.00000000e+00]\n", " [-1.53124309e+03 -4.54000000e+02 0.00000000e+00 8.25504150e+01]\n", " [-1.53124309e+03 -7.09000000e+02 1.11373022e-01 0.00000000e+00]\n", " [-1.53124309e+03 -6.05000000e+02 1.00000000e+00 -1.26055283e+02]\n", " [-1.53124309e+03 -5.43000000e+02 0.00000000e+00 1.68693771e+02]\n", " [-1.53124309e+03 -3.21000000e+02 1.07488781e-01 0.00000000e+00]\n", " [-1.53124309e+03 -6.74000000e+02 8.82293701e-01 0.00000000e+00]\n", " [-1.53124309e+03 -5.71000000e+02 0.00000000e+00 1.41129074e+02]\n", " [-1.53124309e+03 -3.41000000e+02 1.28830120e-01 0.00000000e+00]]\n", "constraint features (5, 3) \n", " [[ 1.3100000e+03 -1.5978307e-01 0.0000000e+00]\n", " [ 9.8800000e+02 -3.2881632e-01 0.0000000e+00]\n", " [ 1.0040000e+03 -4.0601316e-01 0.0000000e+00]\n", " [ 1.2690000e+03 -1.3659772e-01 0.0000000e+00]\n", " [ 1.0070000e+03 -2.8800571e-01 0.0000000e+00]]\n" ] } ], "source": [ "from glob import glob\n", "from shutil import rmtree\n", "\n", "import numpy as np\n", "from scipy.stats import uniform, randint\n", "\n", "from miplearn.collectors.basic import BasicCollector\n", "from miplearn.extractors.fields import H5FieldsExtractor\n", "from miplearn.h5 import H5File\n", "from miplearn.io import write_pkl_gz\n", "from miplearn.problems.multiknapsack import (\n", " MultiKnapsackGenerator,\n", " build_multiknapsack_model,\n", ")\n", "\n", "# Set random seed to make example reproducible\n", "np.random.seed(42)\n", "\n", "# Generate some random multiknapsack instances\n", "rmtree(\"data/multiknapsack/\", ignore_errors=True)\n", "write_pkl_gz(\n", " MultiKnapsackGenerator(\n", " n=randint(low=10, high=11),\n", " m=randint(low=5, high=6),\n", " w=uniform(loc=0, scale=1000),\n", " K=uniform(loc=100, scale=0),\n", " u=uniform(loc=1, scale=0),\n", " alpha=uniform(loc=0.25, scale=0),\n", " w_jitter=uniform(loc=0.95, scale=0.1),\n", " p_jitter=uniform(loc=0.75, scale=0.5),\n", " fix_w=True,\n", " ).generate(10),\n", " \"data/multiknapsack\",\n", ")\n", "\n", "# Run the basic collector\n", "BasicCollector().collect(\n", " glob(\"data/multiknapsack/*\"),\n", " build_multiknapsack_model,\n", " n_jobs=4,\n", ")\n", "\n", "ext = H5FieldsExtractor(\n", " # Use as instance features the value of the LP relaxation and the\n", " # vector of objective coefficients.\n", " instance_fields=[\n", " \"lp_obj_value\",\n", " \"static_var_obj_coeffs\",\n", " ],\n", " # For each variable, use as features the optimal value of the LP\n", " # relaxation, the variable objective coefficient, the variable's\n", " # value its reduced cost.\n", " var_fields=[\n", " \"lp_obj_value\",\n", " \"static_var_obj_coeffs\",\n", " \"lp_var_values\",\n", " \"lp_var_reduced_costs\",\n", " ],\n", " # For each constraint, use as features the RHS, dual value and slack.\n", " constr_fields=[\n", " \"static_constr_rhs\",\n", " \"lp_constr_dual_values\",\n", " \"lp_constr_slacks\",\n", " ],\n", ")\n", "\n", "with H5File(\"data/multiknapsack/00000.h5\") as h5:\n", " # Extract and print instance features\n", " x1 = ext.get_instance_features(h5)\n", " print(\"instance features\", x1.shape, \"\\n\", x1)\n", "\n", " # Extract and print variable features\n", " x2 = ext.get_var_features(h5)\n", " print(\"variable features\", x2.shape, \"\\n\", x2)\n", "\n", " # Extract and print constraint features\n", " x3 = ext.get_constr_features(h5)\n", " print(\"constraint features\", x3.shape, \"\\n\", x3)\n" ] }, { "cell_type": "markdown", "id": "2da2e74e", "metadata": {}, "source": [ "\n", "[H5FieldsExtractor]: ../../api/collectors/#miplearn.features.fields.H5FieldsExtractor" ] }, { "cell_type": "markdown", "id": "d879c0d3", "metadata": {}, "source": [ "