Table of Contents

Background

Resources

Analytics is used for:

  • Descriptive - what happened?
  • Predictive - what will happen?
  • Prescriptive - which action would be best?
  • Other general questions

“Modeling”:

  • Describe real life with mathematics (abstract)
  • analyze the math
  • convert it back into an answer in real life (integrate up)

“Model” can refer to:

  • regression model
  • regression model with x y z parameters
  • regression model with x y z parameters predicting abc with equation d + e = f
  • turtles all the way down

Module 2: Classification

putting things in categories

  • When making a classifier/regression line (dot plot), you want to minimize not just mistakes but also near-mistakes (since a small measurement error would turn a near-mistake into a mistake)
  • how much to penalize each kind of mistake depends on how risk-tolerant you are in each direction (eating a poisonous plant is worse than skipping a good plant)

Types of Data

  • Structured:
    • Quantitative (numbers)
    • Qualitative (categorical)
    • Unrelated (no relationship between data points) vs related
      • Time Series - same data recorded over time (could be equal or unequal intervals)
  • Unstructured
    • handwritten text

Classification Approaches (both are ML algorithms):

Support Vector Machine Model

More Info

  • A support vector line is a boundary line with all of one class's data points on one side; the data points lying right on it are the support vectors
  • the “machine” automatically generates such lines
  • ironically, the actual classification line shouldn't touch any data points; it sits midway between the two support vector lines
  • You might need to scale data before running SVM so that it doesn’t get skewed
  • near-zero coefficients probably not relevant for classification
  • Works the same in all dimensions
  • Doesn’t have to be a straight line (kernel methods)
  • We might need to step back to decide should this be a classification problem or a probability problem

Formula: given data \(x_{ij}\), the classifier is the line \(a_{0} + \sum_{j}a_{j}x_{j} = 0\) (“the sum over all factors j of \(a_{j}\) times \(x_{j}\), plus an intercept”)
Maximize the gap (margin) by minimizing \(\sum_{j} a_{j}^{2}\)

Example: \(0 = 1,000,000 + 5\cdot Income + 700 \cdot CreditScore\). Sum of squared coefficients: \(\sum_{j} a_{j}^{2} = 5^{2} + 700^{2} = 490,025\). A small change in the 700 coefficient produces a large change in this sum while changes in the 5 coefficient barely matter, which is why the data should be scaled first.
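A minimal SVM sketch with the kernlab package (the package choice and the data frame `credit` with a binary response column `default` are assumptions, not from the notes):

```r
library(kernlab)

# Linear ("vanilladot") SVM classifier; scaled = TRUE lets ksvm scale the data first
model <- ksvm(as.matrix(credit[, -ncol(credit)]), as.factor(credit$default),
              type = "C-svc", kernel = "vanilladot", C = 100, scaled = TRUE)

# Recover the classifier line 0 = a0 + sum_j a_j * x_j (in terms of the scaled data)
a  <- colSums(model@xmatrix[[1]] * model@coef[[1]])
a0 <- -model@b

pred <- predict(model, as.matrix(credit[, -ncol(credit)]))
sum(pred == credit$default) / nrow(credit)    # fraction classified correctly
```

Coefficients near zero in `a` would suggest those factors matter little for the classification.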

Adjusting Data - Scaling

Between 0 and 1: \(x_{ij}^{scaled} = \frac{x_{ij} - x_{min_{j}}}{x_{max_{j}} - x_{min_{j}}}\)
Between a and b: \(x_{ij}^{scaled[a,b]} = x_{ij}^{scaled[0,1]}(b-a)+a\)
Standardizing to a normal distribution: \(x_{ij}^{std} = \frac{x_{ij} - \mu_{j}}{\sigma_{j}}\)

  • Mean = 0, standard deviation = 1
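A minimal sketch of these scalings in R (the matrix `X` of predictors and the bounds a, b are assumptions):

```r
X01 <- apply(X, 2, function(col) (col - min(col)) / (max(col) - min(col)))  # each column to [0, 1]

a <- -1; b <- 1                  # example target bounds
Xab <- X01 * (b - a) + a         # rescale from [0, 1] to [a, b]

Xstd <- scale(X)                 # standardize: mean 0, standard deviation 1 per column
```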

K Nearest Neighbor Algorithm

More Info

  • Picks the classification based on the k nearest neighbor points, instead of using a classification line
  • Complexities:
    • different distance formulas?
    • attributes can be weighted by importance
    • unimportant attributes can be removed
    • choosing a good value of k

R

  • Beginner Tutorial
  • vec1 <- c(1,2,3,4,5) (c stands for concatenate/combine) (it coerces all elements to a single type: all strings, all numbers, all bools, etc.)
  • list() (like a JS object: can be nested and contain anything; elements can optionally be named)
  • data.frame() (has to be a regular/rectangular matrix where each sub-vector has the same length, basically just a table); you can name columns like data.frame(a = vec1, b = vec2)
  • matrix(vec1, 2, 3) - vector, and then dimensions (2 rows, 3 columns)
  • vec1[2] - indexing to call a value from vec1, does not start with 0
  • names(list) - prints out the names of the list
  • list$name like getting the key of the object, returns values stored under that key, also works for data frames
  • myfunc <- function(arg1, arg2) { results <- arg1 + arg2; return(results) }
  • install.packages("ggplot2") once; then, whenever you use it in a script, call library(ggplot2)
  • ggplot2:: then Tab to search the package for specific functions (e.g. geom_...)
  • mydata <- read.table("c:/mydata.csv", header=TRUE, sep=",", row.names="id")
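A runnable recap of the snippets above (the example values are arbitrary; the CSV path is the notes' example and won't exist on every machine, so that line is commented out):

```r
vec1 <- c(1, 2, 3, 4, 5)
vec2 <- c("a", "b", "c", "d", "e")
df   <- data.frame(a = vec1, b = vec2)    # rectangular table with named columns
m    <- matrix(1:6, 2, 3)                 # 2 rows, 3 columns
lst  <- list(nums = vec1, tab = df)       # nested container; $ pulls out an element
lst$nums[2]                               # indexing starts at 1, not 0

myfunc <- function(arg1, arg2) {
  results <- arg1 + arg2
  return(results)
}
myfunc(2, 3)

# mydata <- read.table("c:/mydata.csv", header = TRUE, sep = ",", row.names = "id")
```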

Classifiers

Office Hours

  • August 25th: isye6501_office_hour_fa22_week00_thu.pdf, Tips_to_be_successful_in_this_class.pdf, r_markdown_primer.pdf (note: the R Markdown primer wasn't covered in office hours on Thursday; it will be covered Monday if time permits, but is being made available as early as possible), week00t_qa_oms.csv

Pseudocode for Q2.2.3 (KKNN)

  • (they used Leave One Out Cross-Validation (LOOCV))
  • create a vector of 0s called “predicted”, one entry per data point
  • for each data point (row), build a model from the rest of the data, excluding that point
  • then use that model to predict the held-out point and see how well it performed
  • turn the prediction into 1/0 for good/bad and store it in the “predicted” vector
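A minimal R sketch of that pseudocode using the kknn package (the data frame `data` with a 0/1 response column `R1`, and k = 12, are assumptions based on the homework setup):

```r
library(kknn)

predicted <- rep(0, nrow(data))                    # vector of 0s, one per data point

for (i in 1:nrow(data)) {
  # fit on every row except i, then predict the held-out row i
  model <- kknn(R1 ~ ., train = data[-i, ], test = data[i, ], k = 12, scale = TRUE)
  predicted[i] <- as.integer(fitted(model) + 0.5)  # round the prediction to 0/1
}

sum(predicted == data$R1) / nrow(data)             # fraction predicted correctly
```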

Week2

Validation

  • training set - building the model
  • validation set - picking a model (if you make multiple models)
  • test set - testing the model to see its actual effectiveness (accounting for randomness)
  • Rule of thumb for dividing data:
    • 70-90% Training 10-30% Test (for 2 sets)
    • 50-70% Training, split the rest between validation & testing (for 3 sets)
  • Methods for sampling:
    • random - less bias but could be unevenly distributed
    • sound-off rotation (taking every n-th point in turn) - evenly distributed across the set, but could have number-pattern bias
  • Cross-validation
    • K-fold - subdivide the data into multiple groups (usually 10), then permute through them using one as validation and rest as training, then average the results
      • do you set aside a test set during k-fold???
    • train.kknn() - leave-one-out cross-validation over a range of k; specify kmax
    • cv.kknn() - cross-validation for a single k; specify the number of folds (kcv)
    • sample() - draw random row indices for splitting the data
    • the “caret” R library can also perform tasks like splitting data for cross-validation
    • fitted(model) - shows the fitted values at the different k's
    • predict() - generate predictions from a fitted model on new data
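A minimal sketch of these helpers (same assumed `data`/`R1` setup as above; the split fractions are just the rule of thumb from the notes):

```r
library(kknn)

# Leave-one-out cross-validation over k = 1..30
loocv <- train.kknn(R1 ~ ., data = data, kmax = 30, scale = TRUE)
loocv$best.parameters          # best k (and kernel) found

# 10-fold cross-validation for a single k
cv <- cv.kknn(R1 ~ ., data = data, kcv = 10, k = 12)

# Manual random 70/15/15 train/validation/test split using sample()
n     <- nrow(data)
idx   <- sample(n)                                   # random permutation of row numbers
train <- data[idx[1:round(0.70 * n)], ]
valid <- data[idx[(round(0.70 * n) + 1):round(0.85 * n)], ]
test  <- data[idx[(round(0.85 * n) + 1):n], ]
```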

Clustering

  • Distance norms - different distance formulas. For multiple dimensions: the p-norm/Minkowski distance; the ∞-norm distance takes the largest absolute difference across the coordinates
  • beneficial to standardize data to normal distribution
  • k-means algorithm
    1. pick k cluster centers in data range
    2. assign each point to the cluster center it’s closest to
    3. recalculate cluster center to be better fitted (centroid)
    4. re-assign each point to a closer cluster center
    5. keep repeating until no data changes
      • this is a form of expectation-maximization (EM) algorithm (alternating those two steps)
      • it’s a heuristic algorithm, but because it runs fast, you can run it multiple times with different initial cluster centers, and choose the best solution of these. You can also try running it with different numbers of clusters
  • You should investigate outliers to understand what they mean and how they impact the clustering
  • to figure out the # of clusters: use an elbow diagram, plotting the total distance of points to their cluster centers vs the number of clusters, to see where the marginal benefit of adding more clusters diminishes (the kink in the curve); see the sketch after this list
  • on the y axis you'll use something like within-cluster sum of squares (or between-cluster sum of squares), vs k values on the x axis
  • Classification vs Clustering - Supervised vs Unsupervised: classification is used when you know the correct classification/categorization of the data points (supervised learning); clustering is for when you're trying to find ways to group the data points and don't know a priori how to categorize them (unsupervised).
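A minimal k-means and elbow-diagram sketch (base R; the iris measurements are just a stand-in dataset):

```r
X <- scale(iris[, 1:4])                       # standardize the data first

km <- kmeans(X, centers = 3, nstart = 25)     # nstart = rerun with different random starting centers
km$cluster                                    # cluster assignment for each point
km$tot.withinss                               # total within-cluster sum of squares

# Elbow diagram: total within-cluster sum of squares vs number of clusters k
wss <- sapply(1:10, function(k) kmeans(X, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "number of clusters k", ylab = "within-cluster sum of squares")
```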

Data Prep & Outliers

Change Detection

  • CUSUM (Cumulative Sum) Approach: “Has the mean of the observed distribution gone beyond a critical level?”
  • \(x_{t}\) = observed value at time \(t\); \(\mu\) = mean of \(x\) if no change is occurring; \(T\) = threshold; \(C\) = slack value
  • Formula (detecting an increase): \(S_{t} = \max\{0,\; S_{t-1} + (x_{t} - \mu - C)\}\)
  • a change is detected when \(S_{t} \geq T\)
  • bigger \(C\) makes detection less sensitive; smaller \(C\) makes it more sensitive
  • at each time period we observe how far above the expected mean the value is and add that to the previous \(S_{t-1}\) to get a running total; if the total would drop below 0, we reset it to 0
  • can plot a threshold line
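A minimal CUSUM sketch in base R following the formula above, written here for detecting a decrease (the observation vector and the mu/C/threshold values are all assumptions):

```r
x  <- c(88, 89, 91, 87, 85, 84, 80, 78, 75, 74, 70, 68)   # observations over time
mu <- mean(x[1:5])     # mean over a period assumed to have no change
C  <- 2                # slack: bigger C = less sensitive
thresh <- 10           # threshold T

S <- numeric(length(x))                       # S_1 starts at 0
for (t in 2:length(x)) {
  S[t] <- max(0, S[t - 1] + (mu - x[t] - C))  # accumulate how far BELOW expected we are
}

which(S >= thresh)[1]                         # first period where a change is detected
plot(S, type = "b"); abline(h = thresh, lty = 2)   # plot with the threshold line
```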

Exponential Smoothing

\(S_{t} = \alpha x_{t} + (1 - \alpha ) S_{t-1}\)

  • where \(S_{t}\) is the expected measurement and \(x_{t}\) is the actual measurement
  • \(\alpha\) is a constant modulating fluctuation sensitivity. \(0 < \alpha < 1\)
  • \(\alpha \to 0\) if there's a lot of randomness/fluctuation (trust the previous estimate more)
  • \(\alpha \to 1\) if there's not a lot of fluctuation (trust the latest observation more)
  • how to start: initial condition \(S_{1} = x_{1}\)
  • \(T_{t}\) - the trend at time t
  • Observation accounting for trends: \(S_{t} = \alpha x_{t} + (1 - \alpha ) ( S_{t-1} + T_{t-1})\)
  • Trend at time t: \(T_{t} = \beta (S_{t} - S_{t-1}) + (1 - \beta ) T_{t-1}\)
  • Initial condition: \(T_{1} = 0\)

Cyclic Patterns

  • Seasonalities, multiplicative way:
  • \(L\) - length of the cycle
  • \(C_{t}\) - the multiplicative seasonality factor for time t
    • this inflates or deflates the observation
  • \[S_{t} = \alpha x_{t}/C_{t-L} + (1 - \alpha)(S_{t-1}+T_{t-1})\]
  • \[C_{t} = \gamma (x_{t}/S_{t}) +(1 - \gamma)C_{t-L}\]
  • initial condition: set the first L values of C to be 1 (first cycle)
  • single, double, or triple exponential smoothing, depending on how many components (baseline, trend, seasonality) you include
  • triple exponential smoothing is called Winters' method, or Holt-Winters
  • additive vs multiplicative seasonality:
    • additive means that the trend doesn’t affect the magnitude of the cycles
    • multiplicative means that the magnitude of the wave changes as the trend changes

Forecasting

  • \(F_{t+1} = \alpha \cdot (\text{best guess of } x_{t+1}) + (1 - \alpha) S_{t} = \alpha S_{t} + (1 - \alpha) S_{t}\), so \(F_{t+1} = S_{t}\)
  • baseline only, forecast for time period \(t+k\): \(F_{t+k} = S_{t}, \; k = 1, 2, ...\)
  • including trend: \(F_{t+1} = S_{t} + T_{t}\), and in general \(F_{t+k} = S_{t} + kT_{t}, \; k = 1, 2, ...\)
  • including seasonality: the next seasonality factor is \(C_{t+1} = C_{(t+1)-L}\), so \(F_{t+1} = (S_{t} + T_{t})C_{(t+1)-L}\), and in general \(F_{t+k} = (S_{t} + kT_{t})C_{(t+k)-L}, \; k = 1, 2, ...\)

  • How to find good values for \(\alpha, \beta, \gamma\)?
    • optimization: minimize the sum of squared forecast errors \(\sum_{t}(F_{t}-x_{t})^{2}\) (see the sketch below)
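A minimal sketch of fitting \(\alpha\) by minimizing squared forecast error in base R (the series is the built-in co2 data; the single-parameter, no-trend/no-seasonality setup is an assumption to keep it short):

```r
x <- as.numeric(co2)[1:60]             # any univariate series will do

sse <- function(alpha, x) {
  S   <- x[1]                          # initial condition S_1 = x_1
  err <- 0
  for (t in 2:length(x)) {
    err <- err + (S - x[t])^2          # forecast for time t is the previous S
    S   <- alpha * x[t] + (1 - alpha) * S   # exponential smoothing update
  }
  err
}

best <- optimize(sse, interval = c(0, 1), x = x)
best$minimum                            # fitted alpha
```

In practice, R's HoltWinters() does this parameter fitting automatically.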

ARIMA Model (autoregressive integrated moving average)

  • 3 key parts:
  • d-th order differencing
  • p-th order autoregression
  • q-th order moving average
  • ARIMA(0,0,0) = white noise, no patterns
  • ARIMA(0,1,0) = random walk
  • ARIMA(p,0,0) = autoregressive model, the other components inactive
  • ARIMA(0,0,q) = moving average model, other components inactive
  • ARIMA(0,1,1) = basic exponential smoothing model
  • ARIMA model is good for more stable data, exponential smoothing is better for rough data
  • need at least 40 data points for ARIMA to work well
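A minimal ARIMA sketch (the forecast package and its auto.arima function are assumptions; base R's arima() is also shown; AirPassengers ships with R):

```r
library(forecast)

fit <- auto.arima(AirPassengers)        # picks the (p, d, q) order (and seasonal terms) automatically
fit
forecast(fit, h = 12)                   # forecast the next 12 periods

fit2 <- arima(AirPassengers, order = c(1, 1, 1))   # or specify p, d, q yourself (base R)
predict(fit2, n.ahead = 12)$pred
```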

GARCH (Generalized Autoregressive Conditional Heteroskedasticity)

  • estimate / forecast the variance
  • estimate the error
  • often used to estimate portfolio volatility/risk

Holt-Winters function in R

  • xhat - smoothed series (predictions)
  • the additive model is most appropriate if the magnitude of the seasonal fluctuations, or the variation around the trend-cycle, does not vary with the level of the time series; when that variation appears to be proportional to the level of the series, a multiplicative model is more appropriate
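A minimal HoltWinters() sketch (base R; AirPassengers ships with R and has multiplicative-looking seasonality):

```r
hw <- HoltWinters(AirPassengers, seasonal = "multiplicative")
c(hw$alpha, hw$beta, hw$gamma)   # fitted smoothing parameters
head(hw$fitted)                  # columns: xhat (smoothed series), level, trend, season
plot(hw)                         # observed vs smoothed

# Single exponential smoothing only: turn the trend and seasonality components off
ses <- HoltWinters(AirPassengers, beta = FALSE, gamma = FALSE)
```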

Regression

  • Simple linear regression with one predictor, linear relationship between predictor and response
  • \(y\) - response (new car sales)
  • \(x_{1}\) - predictor (workforce participation)
  • regression equation: \(y = a_{0} + a_{1}x_{1}\)
  • with \(m\) predictors: \(y = a_{0} + a_{1}x_{1} + a_{2}x_{2} ... + a_{m}x_{m}\)
  • shortened: \(y = a_{0} + \sum_{j=1}^{m} a_{j}x_{j}\)
  • Accuracy of the model is measured by the sum of squared errors: \(\sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2}\)
  • where \(y_{i}\) is cars sold at data point \(i\), and \(\hat{y}_{i}\) is cars predicted at data point \(i\)
  • Maximum likelihood fit: if the errors are normally distributed, minimizing the sum of squared errors gives the same coefficients as the maximum likelihood fit
  • Akaike Information Criterion (AIC):
    • L*: maximum likelihood value
    • k: number of parameters being estimated
    • “ideal world” formula: \(AIC = 2k - 2\ln(L^{*})\)
    • \(2k\) is the penalty term; it balances likelihood with simplicity and helps avoid overfitting
    • it works best for infinitely large datasets; for “small” datasets, the corrected version (AICc) is used
  • Bayesian Information Criterion (BIC): has a larger penalty term, so it works better with fewer parameters; requires more data points than parameters
    • formula: \(BIC = k\ln(n) - 2\ln(L^{*})\)
    • L*: maximum likelihood value
    • k: number of parameters being estimated
    • n: number of data points
  • some non-linear relationships can be handled by rescaling: if the \(y\) values are not normally distributed, you transform the \(y\) values with the Box-Cox method before fitting the regression
  • you can get p-values for each coefficient, to see how important it is to the model
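A minimal regression sketch (base R plus MASS, which ships with R; mtcars and the choice of predictors are just for illustration):

```r
model <- lm(mpg ~ wt + hp, data = mtcars)
summary(model)                            # coefficients with p-values for each predictor
sum(residuals(model)^2)                   # sum of squared errors
AIC(model); BIC(model)                    # information criteria for model comparison
predict(model, data.frame(wt = 3, hp = 120))   # prediction for a new point

# Box-Cox: find a power transformation of y that makes the residuals closer to normal
library(MASS)
bc     <- boxcox(mpg ~ wt + hp, data = mtcars)   # plots log-likelihood vs lambda
lambda <- bc$x[which.max(bc$y)]
```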

Q8.2 helpful links:

  • https://stats.stackexchange.com/questions/58141/interpreting-plot-lm
  • https://stats.stackexchange.com/questions/52089/what-does-having-constant-variance-in-a-linear-regression-model-mean/52107#52107
  • https://stats.stackexchange.com/questions/74622/converting-standardized-betas-back-to-original-variables


Sample quiz

1a. x2 is not important
1b. Figure A, because Figure B overfits the bottom-right quadrant (each leaf should have at least 5% of the data points in it)

2a. Logistic regression
2b. Appropriate models: CART, k-nearest neighbor, support vector machine
2c. CUSUM

4a. \(x_t\) is hours of operation
4b. 170 is L (the length of the cycle)
4c. Lower values of C mean the response is sooner: vehicles built early in the batch tend to break down more quickly
4d. Positive values of T mean the response is getting slower over time: it's taking longer for vehicles to break down
4e. The first component will be the most important, because the two components are highly correlated and the second one becomes essentially redundant

Advanced Data Preparation

  • Box-Cox method - scaling unevenly distributed data (heteroskedastic)
  • Detrending Data - for example, adjusting for inflation
    • factor-by-factor \(y = a_{0} + a_{1}x\) - if you know the inflation each year you just go year by year to adjust the exact inflation
    • alternatively, a simple version just uses linear regression
    • example: \(Price = -45,600 + 23.2 \cdot Year\), so \(\text{De-trended Price} = \text{Actual Price} - (-45,600 + 23.2 \cdot Year)\)
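A minimal sketch of the regression-based version (the data frame `sales` with `year` and `price` columns is an assumption):

```r
trend <- lm(price ~ year, data = sales)
sales$detrended <- sales$price - predict(trend, sales)   # equivalently, residuals(trend)
```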

Feature Extraction - Principal Component Analysis

  • Figuring out which subset of predictors is the most important
  • Useful for high-dimensional and correlated data
  • Potential issues:
    • you would need a lot of data esp timespan of data, but over that timespan the factors may change (new stock company rises)
    • some factors might be highly correlated (e.g. companies in the same sector tend to move together)
    • beneficial to standardize data to normal distribution
  • PCA transforms data
    • removes correlations from data
    • ranks coordinates by importance
    • if you focus on the first n components, you reduce random effects, first components have higher signal to noise ratio
    • the way it works: you perform a change of basis of the coordinate system so that the new components are orthogonal to each other (0 correlation); the new basis gives you standardized terms for things like spread, and PCA identifies the weightier (higher-variance) components
    • you can have linear and nonlinear transformations depending on the kernel you use, just like SVM
    • PCA-based models can be transformed back to the original factor space

    Get the betas from this PC regression model and transform them into the alphas that are in terms of the original scaled variables. Then unscale those coefficients to get them in terms of the original unscaled variables. You will need to do this in order to make a prediction for the new point given in the previous homework, and it requires some matrix multiplication; use the lectures for guidance.

    Compare the quality of fit of this model to the models you made in previous weeks. Use R² to start with, but also try computing adjusted R² and/or cross-validated R².

Use prcomp for PCA, and include the scale option to scale the data. Don't forget to UNSCALE the coefficients. Steps (a sketch follows the list):

  • scale the data
  • perform PCA to transform data
  • Create a linear regression model using transformed data
  • reverse the PCA transformation to get original variables
  • descale coefficients
  • predict using test data point
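A minimal sketch of those steps (the data frame `df` with a response column `Crime`, the choice of k = 4 components, and the new point `new_point` are assumptions echoing the course's crime-data homework):

```r
predictors <- df[, names(df) != "Crime"]

pca <- prcomp(predictors, scale. = TRUE)       # scale. = TRUE standardizes each predictor
summary(pca)                                   # proportion of variance per principal component

k       <- 4
pc_data <- data.frame(pca$x[, 1:k], Crime = df$Crime)
model   <- lm(Crime ~ ., data = pc_data)       # regression on the transformed data

# Reverse the PCA transformation: betas (PC space) -> alphas (scaled original space)
alphas_scaled <- pca$rotation[, 1:k] %*% coef(model)[-1]

# Descale back to the original, unscaled variables
alphas    <- alphas_scaled / pca$scale
intercept <- coef(model)[1] - sum(alphas * pca$center)

# Predict for the new data point
as.numeric(intercept + as.matrix(new_point[, names(predictors)]) %*% alphas)
```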

Meta:

  • the eigenvectors (prcomp refers to them as rotations) are how you transform your data
  • the eigenvalues tell you how much variance is explained by each principal component (prcomp reports their square roots as sdev)
    • the larger the eigenvalue, the more variance is captured in the PC
    • the components are always ordered from largest to smallest variance
    • PCs are selected by picking the top x PCs, or by picking however many PCs capture 80-95% of the variance

Advanced Regression

  • Classification Problems - Classification and Regression Trees (CART)
  • Decision Making - Decision Trees
  • you can split the data into branches, and create separate models for different branches or sub-branches. the endings of these branches are called “Leaves”
  • Branching - for each leaf (using half the data):
    • calculate the variance of the response within the leaf
    • try splitting on each factor and find the split with the biggest variance decrease
    • set a threshold for variance decrease
    • keep splitting until the variance decrease no longer passes the threshold
    • there should also be enough data points in each branch - at least 5% of the data
  • Pruning - validation with second half of the data:
    • calculate estimation error with and without branching
    • if branching increases error, remove branching
  • Random Forest Method - generate many different trees and average the result:
    1. randomly pick n points for each tree (sample w/ replacement)
    2. randomly pick a subset of factors, and choose the best factor of the subset for branching (1 + log(n) is a common subset size to use)
    3. don’t prune the tree.
      • regression trees: use average predicted response
      • classification trees: use mode / most common predicted response
      • benefits: less overfitting, better overall estimates
      • drawbacks: hard to explain/interpret results, doesn’t give us a specific model
  • Logistic Regression / LOGIT Model
    • Receiver Operating Characteristic (ROC) Curve
    • Area Under the Curve (AUC) / Concordance Index - the probability that a randomly chosen “yes” is ranked above a randomly chosen “no”; 0.5 means the model is no better than random
  • Confusion Matrix - a 2×2 table (like a Punnett square) of model classification vs true classification (yes/no)
    • True Positive (TP), False Positive (FP), True Negative (TN), False Negative (FN)
    • Sensitivity - \(TP / (TP + FN)\)
    • Specificity - \(TN / (TN + FP )\)
  • Parametric vs Non-Parametric - assuming a specific form for the model/classification vs not choosing a form (e.g. KNN, clustering)
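A minimal logistic-regression and confusion-matrix sketch (the data frame `df` with a 0/1 response `y`, the 0.5 cutoff, and the pROC package are assumptions; the cutoff you actually pick should depend on the relative cost of false positives vs false negatives):

```r
model <- glm(y ~ ., data = df, family = binomial(link = "logit"))
prob  <- predict(model, df, type = "response")    # predicted probabilities
pred  <- as.integer(prob > 0.5)                   # threshold the probabilities

cm <- table(predicted = pred, actual = df$y)      # confusion matrix
TP <- cm["1", "1"]; TN <- cm["0", "0"]
FP <- cm["1", "0"]; FN <- cm["0", "1"]
TP / (TP + FN)                                    # sensitivity
TN / (TN + FP)                                    # specificity

library(pROC)
auc(roc(df$y, prob))                              # area under the ROC curve
```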

Midterm 1 Cheat Sheet

Variable Selection

  • tbd

Design of Experiments (DOE)

  • compare and control
  • blocking
    • a blocking factor creates variation that needs to be accounted for. like de-trending
    • you might need to subdivide the data into blocks, for example red cars -> red sports cars, red family cars, etc.
  • A/B Testing requires:
    1. collecting data quickly
    2. representative sample
    3. sample is small compared to population
  • Multiple Factors - Full Factorial Design
    • test every combination
    • ANOVA analysis to determine factor importance
  • Fractional Factorial Design - reduce the # of combinations that need to be tested
    • balanced design: each factor appears same # of times
  • Independent factors
    • use subset of combinations,
    • regression to estimate effects
  • Exploration vs Exploitation trade-off
    • more info vs immediate value. multi-armed bandit
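A minimal fractional factorial sketch (the FrF2 package is an assumption; 16 runs covering 10 two-level factors mirrors the style of design discussed in the course):

```r
library(FrF2)

design <- FrF2(nruns = 16, nfactors = 10)   # balanced design: each factor level appears equally often
design                                      # the 16 combinations to test
```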

Probability-Based Models

  • Bernoulli Distribution / Binomial Distribution
    • modeling binary events (not necessarily 50/50 though)
    • “probability of getting x successes out of n independent identically distributed Bernoulli probability trials”
    • large n approaches normal distribution
    • Probability Mass Function: \(P(X = x) = \binom{n}{x}p^{x}(1-p)^{n-x} = \frac{n!}{(n-x)!\,x!}\,p^{x}(1-p)^{n-x}\)
  • Geometric Distribution
    • probability of having x Bernoulli(p) failures until 1st success
    • probability mass function: \(P(X = x) = (1-p)^{x}p\)
    • if it fits a geometric distribution, then the trials are independent, if not, then not. (example: TSA screening)
  • Poisson Distribution
    • good at modeling random arrivals - independent and identically distributed (i.i.d.)
    • Probability mass function: \(f_{x}(x) = \frac{ \lambda ^{x} e ^{-\lambda} }{x!}\)
  • Exponential Distribution (time between arrivals): \(f_{x}(x) = \lambda e ^{-\lambda x}\)
  • Weibull - time between failures
    • \[f_{x}(x) = \frac{k}{\lambda} (\frac{x}{\lambda})^{k-1}e^{-(x/\lambda )^{k}}\]
    • k >1 = increasing failure rate, k<1 = decreasing failure rate, k=1 = constant
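A minimal sketch of these distributions using base R's built-in density functions (the parameter values are arbitrary examples):

```r
dbinom(3, size = 10, prob = 0.25)      # binomial: P(X = 3) in 10 Bernoulli(0.25) trials
dgeom(4, prob = 0.25)                  # geometric: P(4 failures before the first success)
dpois(2, lambda = 5)                   # Poisson: P(2 arrivals) when the mean arrival rate is 5
dexp(0.5, rate = 5)                    # exponential: density of an inter-arrival time of 0.5
dweibull(2, shape = 1.5, scale = 3)    # Weibull: shape k > 1 means an increasing failure rate
```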

Queuing Models

  • General Arrival Distribution (A)
  • General Service distribution (S)
  • # of servers (c)
  • size of queue (K)
  • population size (N)
  • Queuing discipline (D) (FIFO/LIFO)
  • model extensions: hang-ups, balking, etc.

Simulations

  • deterministic vs stochastic
  • continuous (differential equations) vs discrete events
  • valuable for prescriptive analytics
  • parts:
    • entities (things that move, e.g. bags, people)
    • modules (parts of process, e.g. queues, storage)
    • actions
    • resources (i.e. workers)
    • decision points
    • statistical tracking
    • etc.
  • must validate against real data
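A minimal stochastic simulation sketch in base R: a single-server queue with random (exponential) arrivals and service times; the rates are arbitrary assumptions:

```r
set.seed(1)
n       <- 1000
arrive  <- cumsum(rexp(n, rate = 1.0))   # entities arrive as a Poisson process
service <- rexp(n, rate = 1.2)           # random service times at the single resource

start <- finish <- numeric(n)
for (i in 1:n) {
  # an entity starts service at its arrival time or when the server frees up, whichever is later
  start[i]  <- if (i == 1) arrive[i] else max(arrive[i], finish[i - 1])
  finish[i] <- start[i] + service[i]
}

mean(start - arrive)   # average time spent waiting in the queue; validate against real data
```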

Missing Data

  • remove datapoints with missing values
  • add categorical variables about missing data
  • hedge your bets and duplicate it with and without missing data
  • create trees with different ways of handling missing data
  • imputation:
    • 1: mean/median/mode (mode for categorical) - easy to compute, but could be biased
    • 2: use a predictive model like regression - more complex; does not capture all the variability, and could be overfit
    • 3: imputation with perturbation (add normally-distributed variation) - less accurate than 2 on average, but captures the variability better
  • no more than 5% of data (per factor) should be imputed
  • compounding errors with each layer of technique …but regular data also has errors!
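A minimal imputation sketch (the mice package, the data frame `df` with missing values, and the column name `V1` are assumptions):

```r
colMeans(is.na(df))                      # fraction missing per factor (rule of thumb: keep under ~5%)

# Option 1: mean imputation for a single numeric column
df1 <- df
df1$V1[is.na(df1$V1)] <- mean(df1$V1, na.rm = TRUE)

# Options 2/3: model-based imputation; "pmm" (predictive mean matching) draws from
# similar predicted values, which adds perturbation-like variability
library(mice)
imp         <- mice(df, m = 5, method = "pmm")
df_complete <- complete(imp, 1)          # one of the 5 completed datasets
```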

Optimization

Statistical vs optimization view of what “variable” means: in statistics the data \(x_{ij}\) are the variables; in optimization the data are constant terms and the decision variables are what the solver chooses

  • basically everything is optimization tbh
  • Linear Program - linear function/equations/inequalities (FASTEST)
    • Network Model - shortest path, assignment model, maximum flow,
  • Convex Quadratic Program- (FAST)
  • Convex Program(SLOWER)
    • concave - maximizing
    • convex - minimizing
  • Integer Program - linear program but some variables have to be integers/binary, not continuous (PARTLY UNSOLVABLE) because tree-based solving method
  • General non-convex problem - (HARD)
  • uncertainty
    • model conservatively/include margins
    • scenario modeling - a robust solution satisfies every scenario; alternatively, weight each scenario's cost by its probability (expected cost)
  • Dynamic Program
    • states (situation and values)
    • decisions (choice of next state )
    • bellman’s equation - optimize decisions
  • Stochastic Dynamic Program - Dynamic program with probabilities of states incorporated into decisions
  • Markov Decision Process - discrete # states/decisions, depends on current state only
  • Newton's method / stepping in an improving direction - works for convex problems, not for others (may find a local optimum instead of the global one, or wander outside the region of interest)

Stochastic optimization refers to a collection of methods for minimizing or maximizing an objective function when randomness is present.
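A minimal linear program sketch illustrating the fastest case above (the lpSolve package and the toy objective/constraints are assumptions): maximize 3x + 2y subject to x + y ≤ 4 and x + 3y ≤ 6 with x, y ≥ 0.

```r
library(lpSolve)

sol <- lp(direction    = "max",
          objective.in = c(3, 2),
          const.mat    = matrix(c(1, 1,
                                  1, 3), nrow = 2, byrow = TRUE),
          const.dir    = c("<=", "<="),
          const.rhs    = c(4, 6))

sol$solution   # optimal x and y (non-negativity is assumed by lp by default)
sol$objval     # optimal objective value
```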

Advanced Models

  • Nonparametric tests: for when we don't know the distribution. They use medians/ranks of the data, because without knowing the exact distribution we can't rely on the mean
  • Two data sets: McNemar's Test - compares results on pairs of binary (yes/no) responses using the binomial distribution; throws out the cases where the paired results are the same
    • Wilcoxon Signed Rank Test for medians: is the median different from m? (the test statistic is approximately normally distributed); can be used on pairs of variables to compare whether two sets of observations are similar (paired numerical data)
    • Mann-Whitney Test - for non-paired samples from two data sets
  • One Data set:
    • Wilcoxon Signed Rank Test for Medians: - comparing possible medians
  • Decision tree:
    • 1 data set (compare mean, median, etc to specific value of M) or
    • 2 data sets (compare mean, median etc to each other)
      • paired or unpaired?
    • what metric are you analyzing?
      • mean/exact values -> Parametric
      • median/ranking -> Non-parametric
      • fraction of success/failure/binary outcomes -> Binomial
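A minimal sketch of these tests in base R (the samples x, y and the hypothesized median m are made-up assumptions):

```r
set.seed(1)
x <- rnorm(30, mean = 5)
y <- rnorm(30, mean = 5.5)
m <- 5

wilcox.test(x, mu = m)               # one data set: is the median different from m?
wilcox.test(x, y, paired = TRUE)     # two paired samples (Wilcoxon signed rank)
wilcox.test(x, y)                    # two unpaired samples (Mann-Whitney)
mcnemar.test(matrix(c(20, 5, 12, 13), nrow = 2))   # paired binary outcomes (2x2 counts)
t.test(x, mu = m)                    # parametric counterpart, for comparison
```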

Bayesian Models / Empirical Bayesian Modeling

  • overall distribution is known/estimable but we have little data
  • single observation combined with broader observations

High-Interconnected Sub-populations

  • louvain algorithm - decomposing graphs into sub-communities/cliques - heuristic
    • start with each node as its own “community”
    • if node i is moved into another community, how much does the modularity go up?
      • selects the move that has the biggest increase in modularity
    • until no move increases modularity
    • then restart process, but now using “super nodes” and “super arcs” instead of individual nodes, including a self-weight (weight of all arcs in the supernode)
  • maximize “modularity” of a graph
  • \(a_{ij}\) = weight on the arcs between nodes i and j
  • \(w_{i}\) = total weight of arcs connected to i
  • \(W\) = total weight of all the arcs
  • modularity = \(\frac{1}{2W} \sum_{i,j \text{ in same community}} \left( a_{ij} - \frac{w_{i}w_{j}}{2W} \right)\)
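A minimal Louvain sketch (the igraph package is an assumption; the built-in Zachary karate-club graph is a standard toy example):

```r
library(igraph)

g  <- make_graph("Zachary")
cl <- cluster_louvain(g)    # Louvain community detection
membership(cl)              # which community each node ends up in
modularity(cl)              # modularity of the final partition
```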

Neural Networks & Deep Learning

  • react to patterns we don’t understand/can’t formalize
  • DL is similar to NN but with more layers

Competitive Models / Decision Making / Game Theory

  • considering behavioural adjustments to policies, auctions, negotiations
  • you can use optimization to pick the best strategy

Survival Models

  • insurance, effects of treatment/transplant, maintenance/replacement
  • Cox proportional hazards model: \(h(t)\) is the hazard (risk of the event happening) at time t, and \(h_{0}(t)\) is the baseline hazard when all predictor variables are 0
  • \[h(t) = h_{0} e^{(\beta_{1}x_{1}+ ... + \beta_{n}x_{n})}\]
  • “censored data” - no data before or after a certain time. can complicate predictions but there’s ways around it
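A minimal Cox proportional hazards sketch (the survival package ships with R; its built-in lung dataset, which includes censored observations, is used as the example):

```r
library(survival)

fit <- coxph(Surv(time, status) ~ age + sex, data = lung)   # Surv() handles the censoring
summary(fit)                                                # exp(coef) gives hazard ratios
```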

Gradient Boosting

  • used with factor-based models
  • start with one model, then boost it by adding further models
  • train the next model to fit/counteract the errors (as measured by the quality metric) in the current model's predictions, stepping in the gradient direction
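A minimal gradient boosting sketch (the xgboost package and the mtcars predictor/response split are assumptions):

```r
library(xgboost)

X <- as.matrix(mtcars[, -1])   # predictors (factor-based model inputs)
y <- mtcars$mpg                # response

# Each round fits a small tree to the errors left by the previous rounds
fit  <- xgboost(data = X, label = y, nrounds = 100, max_depth = 3, eta = 0.1,
                objective = "reg:squarederror", verbose = 0)
pred <- predict(fit, X)
```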

Case Studies

Presentation format:

  • given x, use y, to do z

Case 1: Power Company

  • turn power off for people who were never going to pay
  • not for those that forgot or got behind
  • logistical problems:
    • manual shut off
    • go to location
    • more work than the company could handle
  • considerations:
    • which shutoffs should be done each month?
    • some of worker’s time is taken up by travel
    • some shut-offs shouldn’t be done at all, how to identify?
    • how should shut-offs be prioritized
  • what data do you want?
  • which analytics models will you use
    • different approaches
    • cannot tell whether a model works unless tested on real data