Generating missing values in tabular data¶
This tutorial shows how to generate missing values in existing tabular data and how to visualize both the original and the transformed data.
In [1]:
from sklearn.datasets import make_blobs
from badgers.generators.tabular_data.missingness import MissingCompletelyAtRandom, DummyMissingNotAtRandom, DummyMissingAtRandom
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
In [2]:
def plot_missing(X, y, Xt):
    """
    Utility function that plots the original data (left) and the transformed data (right).
    Rows of Xt that contain at least one missing value are drawn as black crosses
    at their original coordinates.
    """
    # rows with at least one missing value after the transformation
    missing_mask = np.isnan(Xt).any(axis=1)
    fig, axes = plt.subplots(1, 2, sharex=True, sharey=True, figsize=(8, 4))
    for label in np.unique(y):
        # original data, one color per class
        ix = y == label
        axes[0].scatter(X[ix, 0], X[ix, 1], c=f'C{label}', label=f'{label}')
        # transformed data, complete rows only
        ix = (y == label) & ~missing_mask
        axes[1].scatter(Xt[ix, 0], Xt[ix, 1], c=f'C{label}', label=f'{label}')
    # plot missing values
    axes[1].scatter(X[missing_mask, 0], X[missing_mask, 1], marker='x', color='black', label='missing')
    axes[0].set_title('Original')
    axes[1].set_title('Transformed')
    axes[0].set_xlabel('dimension 0', fontsize=10)
    axes[1].set_xlabel('dimension 0', fontsize=10)
    axes[0].set_ylabel('dimension 1', fontsize=10)
    axes[1].set_ylabel('dimension 1', fontsize=10)
    axes[0].legend()
    axes[1].legend()
    return fig, axes
Load and prepare data¶
We first generate a toy dataset with make_blobs from sklearn.datasets.
In [3]:
X, y = make_blobs(random_state=0)
Missing completely at random (MCAR)¶
Missing completely at random means that the probability that a value is missing does not depend on any value in the data, observed or missing.
In [4]:
trf = MissingCompletelyAtRandom(percentage_missing=25)
Xt, _ = trf.generate(X.copy(), y)
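To make the idea behind the cell above concrete, here is a minimal NumPy-only sketch of a "completely at random" mechanism. It is not the badgers implementation: the helper name mcar_sketch and the choice to hide a fixed share of all cells (rather than rows) are assumptions made for illustration.

```python
import numpy as np

def mcar_sketch(X, percentage_missing=25, random_state=0):
    """Hypothetical helper: hide a fixed share of cells, chosen uniformly at random."""
    rng = np.random.default_rng(random_state)
    Xt = np.array(X, dtype=float)  # float copy, the input is left untouched
    n_missing = int(Xt.size * percentage_missing / 100)
    # every cell has the same chance of being picked, whatever its value
    flat_idx = rng.choice(Xt.size, size=n_missing, replace=False)
    rows, cols = np.unravel_index(flat_idx, Xt.shape)
    Xt[rows, cols] = np.nan
    return Xt

X_demo = np.random.default_rng(42).normal(size=(100, 2))
Xt_demo = mcar_sketch(X_demo, percentage_missing=25)
print(np.isnan(Xt_demo).sum())  # 50 of the 200 cells are hidden
```

Because the selection ignores the data entirely, the hidden cells should look like a uniform random sample of the dataset.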
In [5]:
pd.DataFrame(Xt).head()
Out[5]:
| | 0 | 1 |
|---|---|---|
| 0 | NaN | 0.689365 |
| 1 | NaN | 4.690690 |
| 2 | 3.002519 | 0.742654 |
| 3 | NaN | 4.091047 |
| 4 | -0.072283 | 2.883769 |
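The table shows only the first five rows. One way to summarize the missingness of a whole array is to count NaN values per column and per row. The sketch below uses a small hand-made array (Xt_small, a hypothetical stand-in with rounded values from the table above) so that it runs on its own.

```python
import numpy as np

# hypothetical stand-in for a transformed array
Xt_small = np.array([
    [np.nan, 0.69],
    [np.nan, 4.69],
    [3.00, 0.74],
    [np.nan, 4.09],
    [-0.07, 2.88],
])

# share of missing values in each column: 0.6 for column 0, 0.0 for column 1
missing_per_column = np.isnan(Xt_small).mean(axis=0)
# number of rows with at least one missing value: 3
rows_with_missing = int(np.isnan(Xt_small).any(axis=1).sum())

print(missing_per_column, rows_with_missing)
```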
In [6]:
fig, axes = plot_missing(X, y, Xt)
Missing at random (MAR)¶
Missing at random means that the probability that a value is missing depends on other, observed features, but not on the missing value itself.
The DummyMissingAtRandom transformer replaces a value at (row, col) with np.nan
depending on another, randomly chosen feature: the probability of missingness depends linearly on that feature.
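The linear dependence described above can be sketched with plain NumPy. This is an illustration under stated assumptions, not the badgers implementation: the helper mar_sketch is hypothetical, the target and driver columns are fixed instead of chosen randomly, and the driver feature is min-max scaled to obtain sampling weights.

```python
import numpy as np

def mar_sketch(X, percentage_missing=25, target_col=0, driver_col=1, random_state=0):
    """Hypothetical helper: hide values of target_col with a sampling weight
    that grows linearly with the value of driver_col."""
    rng = np.random.default_rng(random_state)
    Xt = np.array(X, dtype=float)  # float copy, the input is left untouched
    driver = Xt[:, driver_col]
    # min-max scale the driver to [0, 1]; the small offset keeps every weight positive
    weights = (driver - driver.min()) / (driver.max() - driver.min()) + 1e-9
    n_missing = int(Xt.shape[0] * percentage_missing / 100)
    rows = rng.choice(Xt.shape[0], size=n_missing, replace=False, p=weights / weights.sum())
    Xt[rows, target_col] = np.nan
    return Xt

X_mar = np.random.default_rng(42).uniform(size=(1000, 2))
Xt_mar = mar_sketch(X_mar)
hidden = np.isnan(Xt_mar[:, 0])
# the driver feature is larger, on average, for rows whose target value was hidden
print(X_mar[hidden, 1].mean(), X_mar[~hidden, 1].mean())
```

Note that the driver column itself stays fully observed, which is what distinguishes this mechanism from MNAR.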
In [7]:
trf = DummyMissingAtRandom(percentage_missing=25)
Xt, _ = trf.generate(X.copy(), y)
In [8]:
pd.DataFrame(Xt).head()
Out[8]:
| | 0 | 1 |
|---|---|---|
| 0 | NaN | 0.689365 |
| 1 | NaN | 4.690690 |
| 2 | NaN | 0.742654 |
| 3 | NaN | 4.091047 |
| 4 | -0.072283 | 2.883769 |
In [9]:
fig, axes = plot_missing(X, y, Xt)
Missing not at random (MNAR)¶
Missing not at random means that the probability that a value is missing depends on the value itself, that is, on the value that would have been observed.
The DummyMissingNotAtRandom transformer simply replaces a value with np.nan
with a probability proportional to the original value.
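A NumPy-only sketch of such a mechanism follows. It is not the badgers implementation: the helper mnar_sketch is hypothetical, it acts on a single column, and it min-max scales the values so that negative values still yield valid sampling weights.

```python
import numpy as np

def mnar_sketch(X, percentage_missing=5, col=0, random_state=0):
    """Hypothetical helper: hide values of one column with a sampling weight
    proportional to the (min-max scaled) value itself."""
    rng = np.random.default_rng(random_state)
    Xt = np.array(X, dtype=float)  # float copy, the input is left untouched
    values = Xt[:, col]
    # larger values get larger weights; the small offset keeps every weight positive
    weights = (values - values.min()) / (values.max() - values.min()) + 1e-9
    n_missing = int(Xt.shape[0] * percentage_missing / 100)
    rows = rng.choice(Xt.shape[0], size=n_missing, replace=False, p=weights / weights.sum())
    Xt[rows, col] = np.nan
    return Xt

X_mnar = np.random.default_rng(42).uniform(size=(2000, 2))
Xt_mnar = mnar_sketch(X_mnar)
hidden_mnar = np.isnan(Xt_mnar[:, 0])
# the hidden values are larger, on average, than the values that remain observed
print(X_mnar[hidden_mnar, 0].mean(), X_mnar[~hidden_mnar, 0].mean())
```

The comparison in the last line is only possible here because the original array is still available; with real MNAR data the hidden values cannot be inspected.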
In [10]:
trf = DummyMissingNotAtRandom(percentage_missing=5)
Xt, _ = trf.generate(X.copy(), y)
In [11]:
pd.DataFrame(Xt).head()
Out[11]:
| | 0 | 1 |
|---|---|---|
| 0 | 2.631858 | 0.689365 |
| 1 | 0.080804 | 4.690690 |
| 2 | NaN | 0.742654 |
| 3 | -0.637628 | 4.091047 |
| 4 | NaN | 2.883769 |
In [12]:
fig, axes = plot_missing(X, y, Xt)