Generating missing values in tabular data

This tutorial shows how to generate missing values in pre-existing tabular data and how to visualize both the original and the transformed data.

In [1]:

from sklearn.datasets import make_blobs
from badgers.generators.tabular_data.missingness import MissingCompletelyAtRandom, DummyMissingNotAtRandom, DummyMissingAtRandom
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [2]:

def plot_missing(X, y, Xt):
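    """
    Plot the original data (left) next to the transformed data (right).
    Rows of Xt that contain at least one NaN are drawn as black crosses
    at their original coordinates taken from X.
    """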
    missing_mask = np.isnan(Xt).any(axis=1)
    fig, axes = plt.subplots(1, 2, sharex=True, sharey=True, figsize=(8,4))
    for label in np.unique(y):
        ix = np.where(y == label)
        axes[0].scatter(X[ix,0],X[ix,1], c = f'C{label}', label = f'{label}')
        ix = np.where(y[~missing_mask] == label )
        axes[1].scatter(Xt[~missing_mask][ix,0], Xt[~missing_mask][ix,1], c = f'C{label}', label = f'{label}')
    # plot missing values
    axes[1].scatter(X[missing_mask][:,0],X[missing_mask][:,1],marker='x', color='black', label = 'missing')
    axes[0].set_title('Original')
    axes[1].set_title('Transformed')
    axes[0].set_xlabel('dimension 0', fontsize=10)
    axes[1].set_xlabel('dimension 0', fontsize=10)
    axes[0].set_ylabel('dimension 1', fontsize=10)
    axes[1].set_ylabel('dimension 1', fontsize=10)
    axes[0].legend()
    axes[1].legend()
    return fig, axes

Load and prepare data

We first generate a synthetic dataset with make_blobs from sklearn.datasets.

In [3]:

X, y = make_blobs(random_state=0)

Generate missing values

Missing value mechanisms are usually categorized as missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR).

Missing completely at random (MCAR)

The MissingCompletelyAtRandom transformer simply replaces individual entries (row, col) with np.nan, each chosen independently at random.
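To make the mechanism concrete, here is a minimal NumPy-only sketch of MCAR masking. It is not the badgers implementation: the helper name mcar_mask, the fraction argument, and the demo array are assumptions made for illustration only.

```python
import numpy as np

def mcar_mask(shape, fraction, rng):
    """Return a boolean mask in which every cell is True (missing)
    independently with probability `fraction` (hypothetical helper)."""
    return rng.random(shape) < fraction

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 2))          # stand-in for real data
mask = mcar_mask(X_demo.shape, 0.25, rng)    # roughly 25% of the cells
X_mcar = np.where(mask, np.nan, X_demo)      # masked cells become NaN
missing_rate = np.isnan(X_mcar).mean()       # achieved fraction of NaN cells
```

Because each cell is drawn independently, the achieved rate only approximates the requested fraction, and the missingness carries no information about the data.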

In [4]:

trf = MissingCompletelyAtRandom(percentage_missing=25)
Xt, _ = trf.generate(X.copy(), y)

In [5]:

pd.DataFrame(Xt).head()

Out[5]:

	0	1
0	NaN	0.689365
1	NaN	4.690690
2	3.002519	0.742654
3	NaN	4.091047
4	-0.072283	2.883769
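A quick way to see how much data was actually removed is to count the NaN entries. The sketch below uses a hypothetical helper, missing_summary, and a small hand-made array built from rounded values of the five rows shown above; in the notebook one would pass Xt instead.

```python
import numpy as np

def missing_summary(Xt):
    """Summarize missingness of a 2D float array (hypothetical helper)."""
    nan_mask = np.isnan(Xt)
    return {
        "per_column": nan_mask.mean(axis=0),               # fraction of NaN per column
        "rows_with_missing": nan_mask.any(axis=1).mean(),  # fraction of incomplete rows
        "overall": nan_mask.mean(),                        # fraction of NaN cells
    }

# Rounded copy of the five rows displayed above, used only as demo input
Xt_demo = np.array([
    [np.nan, 0.69],
    [np.nan, 4.69],
    [3.00, 0.74],
    [np.nan, 4.09],
    [-0.07, 2.88],
])
summary = missing_summary(Xt_demo)
```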

In [6]:

fig, axes = plot_missing(X, y, Xt)

Missing at random (MAR)

Missing at random means that whether a value is missing correlates with some other, observed features, but not with the missing value itself.

The DummyMissingAtRandom transformer replaces a value (row, col) with np.nan depending upon another feature chosen randomly. The probability of missingness depends linearly on the other chosen feature.
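The idea of a missingness probability that depends linearly on another feature can be sketched in plain NumPy as follows. This is an illustration under stated assumptions, not the DummyMissingAtRandom code: the helper mar_mask, the min-max scaling, and the fixed choice of target and driver columns are all hypothetical.

```python
import numpy as np

def mar_mask(X, target_col, driver_col, fraction, rng):
    """Mask entries of `target_col` with a probability that grows
    linearly with the value in `driver_col` (hypothetical helper)."""
    driver = X[:, driver_col]
    # Min-max scale the driver feature to [0, 1]
    scaled = (driver - driver.min()) / (driver.max() - driver.min())
    # Rescale so that the mean probability is about `fraction`
    prob = np.clip(scaled * fraction / scaled.mean(), 0.0, 1.0)
    mask = np.zeros(X.shape, dtype=bool)
    mask[:, target_col] = rng.random(len(X)) < prob
    return mask

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(5000, 2))   # stand-in for real data
mask = mar_mask(X_demo, target_col=0, driver_col=1, fraction=0.25, rng=rng)
X_mar = np.where(mask, np.nan, X_demo)
```

In this sketch, rows with a large value in column 1 are more likely to lose their value in column 0, while column 1 itself stays fully observed.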

In [7]:

trf = DummyMissingAtRandom(percentage_missing=25)
Xt, _ = trf.generate(X.copy(), y)

In [8]:

pd.DataFrame(Xt).head()

Out[8]:

	0	1
0	NaN	0.689365
1	NaN	4.690690
2	NaN	0.742654
3	NaN	4.091047
4	-0.072283	2.883769

In [9]:

fig, axes = plot_missing(X, y, Xt)

Missing not at random (MNAR)

Missing not at random means that whether a value is missing depends on the value itself, that is, on the value it would have had if it had been observed.

The DummyMissingNotAtRandom transformer simply replaces a value with np.nan with a probability proportional to the original value.
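A minimal NumPy sketch of this mechanism is given below. It is an illustration, not the DummyMissingNotAtRandom implementation: the helper mnar_mask is hypothetical, and the per-column min-max scaling is an assumption introduced here so that the values can be turned into valid probabilities.

```python
import numpy as np

def mnar_mask(X, fraction, rng):
    """Mask each cell with a probability that grows with the cell's
    own (min-max scaled) value (hypothetical helper)."""
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    scaled = (X - col_min) / (col_max - col_min)   # each column mapped to [0, 1]
    # Rescale so that the mean probability per column is about `fraction`
    prob = np.clip(scaled * fraction / scaled.mean(axis=0), 0.0, 1.0)
    return rng.random(X.shape) < prob

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(5000, 2))   # stand-in for real data
mask = mnar_mask(X_demo, 0.25, rng)
X_mnar = np.where(mask, np.nan, X_demo)
```

Here large values are removed more often than small ones, so the observed values are a biased sample of the original column.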

In [10]:

trf = DummyMissingNotAtRandom(percentage_missing=5)
Xt, _ = trf.generate(X.copy(), y)

In [11]:

pd.DataFrame(Xt).head()

Out[11]:

	0	1
0	2.631858	0.689365
1	0.080804	4.690690
2	NaN	0.742654
3	-0.637628	4.091047
4	NaN	2.883769

In [12]:

fig, axes = plot_missing(X, y, Xt)