bhepop2.sources.marginal_distributions

This module contains classes describing marginal distribution sources.

In this scope, specific source distributions are known for population subsets. This allows a more precise association of feature values than a global, population-wide distribution.

Module Contents

Classes

MarginalDistributions

Abstract class describing a marginal distributions source.

QualitativeMarginalDistributions

Marginal distributions describing qualitative features.

QuantitativeMarginalDistributions

Marginal distributions describing quantitative features.

Attributes

ALL_LABEL

bhepop2.sources.marginal_distributions.ALL_LABEL = 'all'
class bhepop2.sources.marginal_distributions.MarginalDistributions(data, name=None, attribute_selection: list = None)

Bases: bhepop2.sources.base.EnrichmentSource

Abstract class describing a marginal distributions source.

In this class, distributions are known for subsets of the population whose individuals share a specific attribute value. For instance, the Filosofi data source (INSEE) stores distributions of declared income in administrative areas, for the whole population and for population subsets such as tenants or owners.

In this scope, we use the following terms to describe such marginal distributions:

  • An attribute refers to a piece of information in the initial sample or in the aggregate data.

    For instance: age, profession, ownership, etc.

  • Modalities are the values that partition one attribute.

    For instance, in Filosofi, the ownership attribute can take the values Owner and Tenant.

  • Cross modalities are the intersection of two or more modalities.

    For instance, Owner and above 65 years old.

Each individual in the population thus belongs to exactly one cross modality, and can be matched with the distributions corresponding to its known attributes.
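The vocabulary above can be sketched in plain Python. This is a standalone illustration that does not import bhepop2; the `modalities` dictionary, the `"30_74"` label and the `cross_modality_of` helper are assumptions made for the example.

```python
from itertools import product

# Each attribute is partitioned into modalities.
# "30_74" is a made-up label, added so that the age modalities form a partition.
modalities = {
    "ownership": ["Owner", "Tenant"],
    "age": ["0_29", "30_74", "75_or_more"],
}

# A cross modality picks one modality per attribute.
cross_modalities = list(product(*modalities.values()))


def cross_modality_of(individual):
    """Return the cross modality matching the individual's known attributes."""
    return tuple(individual[attribute] for attribute in modalities)


individual = {"ownership": "Owner", "age": "75_or_more"}
cm = cross_modality_of(individual)
```

Because the modalities of each attribute do not overlap, `cm` appears exactly once in `cross_modalities`.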

_validate_data()

Validate the source data.

Raise a ValueError if data is invalid.

Raises:

ValueError

abstract _validate_data_type()
abstract compute_feature_prob(attribute=ALL_LABEL, modality=ALL_LABEL)

Return a DataFrame containing the probability of being in each feature state while in the given modality.

The resulting DataFrame has the following format: { "feature": [feature_values], "prob": [feature_probs] }

This method accepts attributes and modalities from self.modalities, as well as the (ALL_LABEL, ALL_LABEL) pair, which returns the global distribution.

Parameters:
  • attribute – attribute label

  • modality – modality label

Returns:

DataFrame["feature", "prob"]

get_modality_distribution(attribute, modality)

Get the distribution corresponding to the given attribute and modality.

This method accepts attributes and modalities from self.modalities, as well as the (ALL_LABEL, ALL_LABEL) pair, which returns the global distribution.

Parameters:
  • attribute – attribute label

  • modality – modality label

Returns:

class bhepop2.sources.marginal_distributions.QualitativeMarginalDistributions(data, name=None, attribute_selection: list = None)

Bases: MarginalDistributions

Marginal distributions describing qualitative features.

Input data:

DataFrame with feature values as columns, and probabilities as column values, for each attribute/modality pair. An additional row containing a global distribution (for the whole population) must be present, with attribute and modality equal to ALL_LABEL.

Example:

Table containing qualitative marginal distributions for attributes ownership and age

Red | Green | Blue | attribute | modality
0.3 | 0.3   | 0.4  | all       | all
0.5 | 0.2   | 0.3  | ownership | Owner
0.4 | 0.4   | 0.2  | ownership | Tenant
0   | 0.5   | 0.5  | age       | 0_29
0.7 | 0.1   | 0.2  | age       | 75_or_more
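As a rough sketch of the input format, the example rows can be written as plain Python records and checked before use. This does not reproduce the validation done by the class: the `check_rows` helper is hypothetical, and the check that each row sums to 1 is an assumption about what valid probabilities should look like, not documented behaviour.

```python
import math

ALL_LABEL = "all"  # same value as the module attribute documented above

FEATURE_VALUES = ["Red", "Green", "Blue"]

# One record per attribute/modality pair, copied from the example table.
rows = [
    {"Red": 0.3, "Green": 0.3, "Blue": 0.4, "attribute": ALL_LABEL, "modality": ALL_LABEL},
    {"Red": 0.5, "Green": 0.2, "Blue": 0.3, "attribute": "ownership", "modality": "Owner"},
    {"Red": 0.4, "Green": 0.4, "Blue": 0.2, "attribute": "ownership", "modality": "Tenant"},
    {"Red": 0.0, "Green": 0.5, "Blue": 0.5, "attribute": "age", "modality": "0_29"},
    {"Red": 0.7, "Green": 0.1, "Blue": 0.2, "attribute": "age", "modality": "75_or_more"},
]


def check_rows(rows):
    """Raise ValueError if the global row is missing or a row does not sum to 1."""
    has_global = any(
        r["attribute"] == ALL_LABEL and r["modality"] == ALL_LABEL for r in rows
    )
    if not has_global:
        raise ValueError("missing global distribution row (attribute = modality = 'all')")
    for r in rows:
        total = sum(r[value] for value in FEATURE_VALUES)
        if not math.isclose(total, 1.0):
            raise ValueError(f"probabilities sum to {total} for {r['attribute']}/{r['modality']}")


check_rows(rows)  # the example table passes both checks
```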

_evaluate_feature_values()

Evaluate the feature values from the distributions columns.

Returns:

list of feature values

_validate_data_type()
compute_feature_prob(attribute=ALL_LABEL, modality=ALL_LABEL)

Return a DataFrame containing the probability of being in each feature state while in the given modality.

The resulting DataFrame has the following format: { "feature": [feature_values], "prob": [feature_probs] }

This method accepts attributes and modalities from self.modalities, as well as the (ALL_LABEL, ALL_LABEL) pair, which returns the global distribution.

Parameters:
  • attribute – attribute label

  • modality – modality label

Returns:

DataFrame["feature", "prob"]
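The shape of the result can be illustrated without bhepop2. In the sketch below, the `distributions` mapping and the `feature_prob` function are hypothetical stand-ins, and a plain dict of lists replaces the DataFrame; only the "feature"/"prob" layout and the (ALL_LABEL, ALL_LABEL) default come from the description above.

```python
ALL_LABEL = "all"

# Two rows of the qualitative example, keyed by (attribute, modality).
distributions = {
    (ALL_LABEL, ALL_LABEL): {"Red": 0.3, "Green": 0.3, "Blue": 0.4},
    ("ownership", "Owner"): {"Red": 0.5, "Green": 0.2, "Blue": 0.3},
}


def feature_prob(attribute=ALL_LABEL, modality=ALL_LABEL):
    """Return the feature values and their probabilities for one modality."""
    row = distributions[(attribute, modality)]
    return {"feature": list(row), "prob": list(row.values())}


global_prob = feature_prob()                     # global distribution
owner_prob = feature_prob("ownership", "Owner")  # distribution for owners
```

Calling the function with no argument falls back on the global distribution, mirroring the default values in the signature.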

get_value_for_feature(feature_index, rng)

Return a feature value for the given feature index.

Generate a single value from the feature state corresponding to the given index.

Parameters:
  • feature_index – index of the feature in self.feature_values

  • rng – Numpy random Generator

Returns:

feature value

compare_with_populations(populations, feature_name, **kwargs)

Compare the source data with populations containing the described feature (enriched or original).

The method returns an instance of a PopulationAnalysis subclass, which can be used to generate different kinds of comparisons between the populations and the source data.

Parameters:
  • populations – dict of populations {population_name: population}

  • feature_name – population column containing the feature values

  • kwargs – additional arguments for the analysis instance

Returns:

PopulationAnalysis subclass instance.
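The PopulationAnalysis interface is not described here, so the sketch below only illustrates the kind of comparison involved, in plain Python. The `populations` content, the `"color"` column name and the `observed_frequencies` helper are all hypothetical, and lists of records stand in for the population tables.

```python
from collections import Counter

# Global row of the qualitative example, used as the source distribution.
source_prob = {"Red": 0.3, "Green": 0.3, "Blue": 0.4}

# Hypothetical populations, given as {population_name: population}.
populations = {
    "enriched": [
        {"color": "Red"},
        {"color": "Blue"},
        {"color": "Blue"},
        {"color": "Green"},
    ],
}


def observed_frequencies(population, feature_name):
    """Return the share of individuals holding each feature value."""
    counts = Counter(individual[feature_name] for individual in population)
    size = len(population)
    return {value: counts[value] / size for value in source_prob}


frequencies = observed_frequencies(populations["enriched"], "color")

# Difference between observed share and source probability, per feature value.
gaps = {value: frequencies[value] - source_prob[value] for value in source_prob}
```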

class bhepop2.sources.marginal_distributions.QuantitativeMarginalDistributions(data, name=None, attribute_selection: list = None, abs_minimum: int = 0, relative_maximum: float = 1.5, delta_min: int = None)

Bases: MarginalDistributions, bhepop2.sources.base.QuantitativeAttributes

Marginal distributions describing quantitative features.

Input data:

DataFrame with decile labels as columns (D1, D2 to D9), and decile values as column values, for each attribute/modality pair. An additional row containing a global distribution (for the whole population) must be present, with attribute and modality equal to ALL_LABEL.

Example:

Table containing quantitative marginal distributions for attributes ownership and age

D1     | D9     | attribute | modality
18 852 | 46 522 | all       | all
16 542 | 50 060 | ownership | Owner
8 764  | 29 860 | ownership | Tenant
15 000 | 45 000 | age       | 0_29
20 000 | 65 000 | age       | 75_or_more

_evaluate_feature_values()

Evaluate the feature values from the distribution values and class parameters.

Returns:

list of feature values

_validate_data_type()
compute_feature_prob(attribute=ALL_LABEL, modality=ALL_LABEL)

Return a DataFrame containing the probability of being in each feature state while in the given modality.

The resulting DataFrame has the following format: { "feature": [feature_values], "prob": [feature_probs] }

This method accepts attributes and modalities from self.modalities, as well as the (ALL_LABEL, ALL_LABEL) pair, which returns the global distribution.

Parameters:
  • attribute – attribute label

  • modality – modality label

Returns:

DataFrame["feature", "prob"]

get_value_for_feature(feature_index, rng)

Return a value drawn from the interval corresponding to the feature index.

The first interval is defined as [self._abs_minimum, self.feature_values[0]], and so on for the following intervals. The value is drawn uniformly from the interval.
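A minimal sketch of such a draw, under stated assumptions: the `draw_value` function and the `feature_values` list are hypothetical, and reading the following intervals as [feature_values[i - 1], feature_values[i]] is an interpretation of "and so on". The standard library `random.Random` stands in for the Numpy Generator; both expose `uniform(low, high)`.

```python
import random


def draw_value(feature_index, feature_values, rng, abs_minimum=0):
    """Draw a value uniformly from the interval ending at feature_values[feature_index]."""
    if feature_index == 0:
        lower = abs_minimum  # first interval starts at the absolute minimum
    else:
        lower = feature_values[feature_index - 1]  # assumed: previous upper bound
    upper = feature_values[feature_index]
    return rng.uniform(lower, upper)


feature_values = [10_000, 20_000, 30_000]  # hypothetical interval upper bounds
rng = random.Random(0)

first = draw_value(0, feature_values, rng)   # drawn from [0, 10 000]
second = draw_value(1, feature_values, rng)  # drawn from [10 000, 20 000]
```

How the class bounds the last interval (for instance through relative_maximum) is not covered by this sketch.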

Parameters:
  • feature_index – index of the feature in self.feature_values

  • rng – Numpy random Generator

Returns:

compare_with_populations(populations, feature_name, **kwargs)

Compare the source data with populations containing the described feature (enriched or original).

The method returns an instance of a PopulationAnalysis subclass, which can be used to generate different kinds of comparisons between the populations and the source data.

Parameters:
  • populations – dict of populations {population_name: population}

  • feature_name – population column containing the feature values

  • kwargs – additional arguments for the analysis instance

Returns:

PopulationAnalysis subclass instance.