Variance in NumPy and Pandas

  • calculate some summary statistics to get an idea of the distribution
    • use `numpy` to calculate mean and variance

      import numpy as np
      
      X = [15, 8, 13, 7, 7, 12, 15, 6, 8, 9]
      mean = np.mean(X)
      var = np.var(X)
      
      print(f"Mean={mean:.2f}, Variance={var:.2f}")
      
      #+RESULTS:

Mean=10.00, Variance=10.60

  • let's try `pandas`

    import pandas as pd
    
    X = pd.Series([15, 8, 13, 7, 7, 12, 15, 6, 8, 9])
    mean = X.mean()
    var = X.var()
    
    print(f"Mean={mean:.2f}, Variance={var:.2f}")
    
    #+RESULTS:

Mean=10.00, Variance=11.78

This discrepancy arises because `numpy` and `pandas` use different default formulas for the variance of an array: `np.var` computes the population variance by default, while `pandas.Series.var` computes the sample variance

population variance \(\sigma^2 = \frac{\sum_{i=1}^N (x_i-\mu)^2}{N}\)

sample variance \(s^2 = \frac{\sum_{i=1}^n (x_i-\bar{x})^2}{n-1}\)

notice the differences between these equations

  • in the numerator's sum, \(\sigma^2\) is calculated using the population mean, \(\mu\)
    • while \(s^2\) is calculated using the sample mean, \(\bar{x}\)
  • in the denominator, \(\sigma^2\) divides by the total population size \(N\)
    • while \(s^2\) divides by the sample size minus one, \(n-1\)
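Both libraries expose this choice through a `ddof` ("delta degrees of freedom") argument, where the divisor is the number of observations minus `ddof`. A quick check, reusing the data above, shows the two agree once `ddof` matches:

```python
import numpy as np
import pandas as pd

X = [15, 8, 13, 7, 7, 12, 15, 6, 8, 9]

# numpy defaults to ddof=0 (population variance); ask for the sample variance
print(f"numpy,  ddof=1: {np.var(X, ddof=1):.2f}")         # 11.78

# pandas defaults to ddof=1 (sample variance); ask for the population variance
print(f"pandas, ddof=0: {pd.Series(X).var(ddof=0):.2f}")  # 10.60
```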

as \(n\) grows, the distinction between \(n\) and \(n-1\) becomes less and less significant
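A quick loop makes this concrete: the ratio between the two divisors, \(n/(n-1)\), shrinks toward 1 as the sample grows

```python
# ratio between the population divisor n and the sample divisor n-1
for n in [2, 10, 100, 10000]:
    print(f"n={n:>5}: n/(n-1) = {n / (n - 1):.4f}")
```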

  • \(\mu\) is the true population mean given you have all the data
    • when calculating sample variance, you only have an estimate of \(\mu\)
      • which is the sample mean \(\bar{x}\)

using the sample mean instead of the true population mean tends to underestimate the sum of squared deviations, because \(\bar{x}\) is by definition the value that minimizes that sum; dividing by \(n\) would therefore underestimate the true population variance on average, and dividing by \(n-1\) (Bessel's correction) compensates for this bias
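A small simulation illustrates the bias (assuming, for illustration, a standard normal population with true variance 1): averaged over many small samples, the `ddof=0` variance lands near \(\frac{n-1}{n}\sigma^2\), while `ddof=1` recovers \(\sigma^2\)

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 5, 100_000

# draw many independent samples of size n from a population with variance 1
samples = rng.standard_normal((trials, n))

biased = np.var(samples, axis=1, ddof=0).mean()    # divides by n
unbiased = np.var(samples, axis=1, ddof=1).mean()  # divides by n-1

print(f"mean ddof=0 variance: {biased:.3f}   (expected ~ {(n - 1) / n:.3f})")
print(f"mean ddof=1 variance: {unbiased:.3f} (expected ~ 1.000)")
```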