`bhepop2.sources.marginal_distributions`

This module contains classes describing marginal distribution sources.

In this scope, specific source distributions are known for population subsets. This allows a more precise association of feature values than a single, population-wide distribution.

Module Contents

Classes

`MarginalDistributions`	Abstract class describing a marginal distributions source.
`QualitativeMarginalDistributions`	Marginal distributions describing qualitative features.
`QuantitativeMarginalDistributions`	Marginal distributions describing quantitative features.

Attributes

ALL_LABEL

bhepop2.sources.marginal_distributions.ALL_LABEL = 'all'

class bhepop2.sources.marginal_distributions.MarginalDistributions(data, name=None, attribute_selection: list = None)

Bases: bhepop2.sources.base.EnrichmentSource

Abstract class describing a marginal distributions source.

In this class, distributions are known for subsets of the population whose individuals share a specific attribute value. For instance, the Filosofi data source (INSEE) stores distributions of declared income in administrative areas, for the whole population and for population subsets, such as tenants or owners.

In this scope, we use the following terms to describe such marginal distributions:

An attribute refers to a piece of information in the initial sample or in the aggregate data.
For instance: age, profession, ownership, etc.
Modalities are the partition of one attribute.
For instance, in Filosofi, the ownership attribute can take the values Owner and Tenant.
Cross modalities are the intersection of two or more modalities.
For instance, Owner and above 65 years old.

Each individual of the population then belongs to a single cross modality, and can be matched with the distributions corresponding to its known attributes.
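The vocabulary above can be illustrated with a minimal pandas sketch. The population table and its values are hypothetical and only serve to show how attributes, modalities and cross modalities relate; none of this is bhepop2 code.

```python
import pandas as pd

# Hypothetical synthetic population: one row per individual, one column per attribute.
population = pd.DataFrame(
    {
        "ownership": ["Owner", "Tenant", "Owner", "Tenant"],
        "age": ["0_29", "0_29", "75_or_more", "75_or_more"],
    }
)

# Modalities: the distinct values that partition each attribute.
modalities = {col: sorted(population[col].unique()) for col in population.columns}

# Cross modality: the combination of one modality per attribute.
# Each individual belongs to exactly one of them.
cross_modality = population["ownership"] + " & " + population["age"]

print(modalities)
print(cross_modality.tolist())
```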

_validate_data()

Validate the source data.

Raise a ValueError if data is invalid.

Raises: ValueError

abstract _validate_data_type()

abstract compute_feature_prob(attribute=ALL_LABEL, modality=ALL_LABEL)

Return a DataFrame containing the probability of being in each feature state, given the modality.

The resulting DataFrame is of the following format: { "feature": [feature_values], "prob": [feature_probs] }

This method accepts attributes and modalities from self.modalities, as well as the (ALL_LABEL, ALL_LABEL) pair, which returns the global distribution.

Parameters:

attribute – attribute label
modality – modality label

Returns:

DataFrame["feature", "prob"]

get_modality_distribution(attribute, modality)

Get the distribution corresponding to the given attribute and modality.

This method accepts attributes and modalities from self.modalities, as well as the (ALL_LABEL, ALL_LABEL) pair, which returns the global distribution.

Parameters:

attribute – attribute label
modality – modality label

Returns:

the distribution matching the given attribute and modality
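As an illustration of this kind of lookup, here is a minimal pandas sketch. It assumes the source table carries `attribute` and `modality` columns as in the input examples of this module; the helper `select_distribution` is hypothetical and is not the library's implementation.

```python
import pandas as pd

ALL_LABEL = "all"

# Hypothetical source table following the attribute/modality layout.
data = pd.DataFrame(
    {
        "Red": [0.3, 0.5, 0.4],
        "Green": [0.3, 0.2, 0.4],
        "Blue": [0.4, 0.3, 0.2],
        "attribute": [ALL_LABEL, "ownership", "ownership"],
        "modality": [ALL_LABEL, "Owner", "Tenant"],
    }
)


def select_distribution(table, attribute=ALL_LABEL, modality=ALL_LABEL):
    """Return the single row matching the (attribute, modality) pair."""
    mask = (table["attribute"] == attribute) & (table["modality"] == modality)
    selected = table[mask]
    if len(selected) != 1:
        raise ValueError(f"Expected exactly one row for ({attribute}, {modality})")
    return selected


owner_row = select_distribution(data, "ownership", "Owner")
global_row = select_distribution(data)  # defaults to the (ALL_LABEL, ALL_LABEL) pair
```

Defaulting both arguments to `ALL_LABEL` mirrors the documented behaviour where the (ALL_LABEL, ALL_LABEL) pair gives the global distribution.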

class bhepop2.sources.marginal_distributions.QualitativeMarginalDistributions(data, name=None, attribute_selection: list = None)

Bases: MarginalDistributions

Marginal distributions describing qualitative features.

Input data:

DataFrame with feature values as columns, and probabilities as column values, for each attribute/modality pair. An additional row containing a global distribution (for the whole population) must be present, with attribute and modality equal to ALL_LABEL.

Example:

Table containing qualitative marginal distributions for attributes **ownership** and **age**
Red	Green	Blue	attribute	modality
0.3	0.3	0.4	all	all
0.5	0.2	0.3	ownership	Owner
0.4	0.4	0.2	ownership	Tenant
0	0.5	0.5	age	0_29
…	…	…	…	…
0.7	0.1	0.2	age	75_or_more
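A table of this shape could be built with pandas as sketched below. The rows elided in the example above are simply left out, and the two checks (each row summing to 1, presence of the global row) are illustrative sanity checks, not necessarily what the class validates internally.

```python
import numpy as np
import pandas as pd

ALL_LABEL = "all"

# Rows copied from the example table; elided rows are omitted.
rows = [
    (0.3, 0.3, 0.4, ALL_LABEL, ALL_LABEL),
    (0.5, 0.2, 0.3, "ownership", "Owner"),
    (0.4, 0.4, 0.2, "ownership", "Tenant"),
    (0.0, 0.5, 0.5, "age", "0_29"),
    (0.7, 0.1, 0.2, "age", "75_or_more"),
]
data = pd.DataFrame(rows, columns=["Red", "Green", "Blue", "attribute", "modality"])

# Feature values are the columns that are neither "attribute" nor "modality".
feature_columns = [c for c in data.columns if c not in ("attribute", "modality")]

# Illustrative checks: probabilities of each row sum to 1, and a global row exists.
rows_sum_to_one = np.isclose(data[feature_columns].sum(axis=1), 1.0).all()
has_global_row = (
    (data["attribute"] == ALL_LABEL) & (data["modality"] == ALL_LABEL)
).any()
```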

_evaluate_feature_values()

Evaluate the feature values from the distributions columns.

Returns: list of feature values

_validate_data_type()

compute_feature_prob(attribute=ALL_LABEL, modality=ALL_LABEL)

Return a DataFrame containing the probability of being in each feature state, given the modality.

The resulting DataFrame is of the following format: { "feature": [feature_values], "prob": [feature_probs] }

This method accepts attributes and modalities from self.modalities, as well as the (ALL_LABEL, ALL_LABEL) pair, which returns the global distribution.

Parameters:

attribute – attribute label
modality – modality label

Returns:

DataFrame["feature", "prob"]
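For a qualitative source, the returned format can be pictured as a reshaping of one input row. The sketch below uses a hypothetical row for the (ownership, Owner) pair and plain pandas; it only illustrates the output layout and is not the library's code.

```python
import pandas as pd

# Hypothetical distribution row for the (ownership, Owner) pair.
row = pd.Series({"Red": 0.5, "Green": 0.2, "Blue": 0.3})

# One line per feature value, with its probability in the "prob" column.
feature_prob = pd.DataFrame(
    {"feature": row.index.to_list(), "prob": row.to_list()}
)

print(feature_prob)
```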

get_value_for_feature(feature_index, rng)

Return a feature value for the given feature index.

Generate a single value from the feature state corresponding to the given index.

Parameters:

feature_index – index of the feature in self.feature_values
rng – Numpy random Generator

Returns:

feature value
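For qualitative features, a feature state maps directly to one value, so the lookup can be pictured as a plain index access. In the sketch below, the way `feature_index` is drawn (with `rng.choice` and a probability vector) is an assumption made for illustration; the variable names are hypothetical.

```python
import numpy as np

# Hypothetical feature values and probabilities for one modality.
feature_values = ["Red", "Green", "Blue"]
probs = [0.5, 0.2, 0.3]

rng = np.random.default_rng(seed=0)

# Assumption: the index is sampled according to the feature probabilities.
feature_index = rng.choice(len(feature_values), p=probs)

# The feature value is the entry stored at that index.
value = feature_values[feature_index]
```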

compare_with_populations(populations, feature_name, **kwargs)

Compare the source data with populations containing the described feature (enriched or original).

This method returns an instance of a PopulationAnalysis subclass, which can be used to generate different kinds of comparisons between the populations and the source data.

Parameters:

populations – dict of populations {population_name: population}
feature_name – population column containing the feature values
kwargs – additional arguments for the analysis instance

Returns:

PopulationAnalysis subclass instance.
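To give an idea of what such a comparison involves, the sketch below computes observed feature frequencies for each population and places them next to the source probabilities. It is written with plain pandas on hypothetical data and does not use or reproduce the PopulationAnalysis classes.

```python
import pandas as pd

# Hypothetical global distribution taken from the source.
source_prob = pd.Series({"Red": 0.3, "Green": 0.3, "Blue": 0.4})

# Same shape as the `populations` argument: {population_name: population}.
populations = {
    "enriched": pd.DataFrame({"color": ["Red", "Blue", "Blue", "Green", "Red"]}),
}
feature_name = "color"

comparison = pd.DataFrame({"source": source_prob})
for population_name, population in populations.items():
    # Share of individuals in each feature state.
    observed = population[feature_name].value_counts(normalize=True)
    # Align on the source feature values; states never observed get 0.
    comparison[population_name] = observed.reindex(comparison.index).fillna(0.0)

print(comparison)
```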

class bhepop2.sources.marginal_distributions.QuantitativeMarginalDistributions(data, name=None, attribute_selection: list = None, abs_minimum: int = 0, relative_maximum: float = 1.5, delta_min: int = None)

Bases: MarginalDistributions, bhepop2.sources.base.QuantitativeAttributes

Marginal distributions describing quantitative features.

Input data:

DataFrame with decile labels as columns (D1 to D9), and decile values as column values, for each attribute/modality pair. An additional row containing a global distribution (for the whole population) must be present, with attribute and modality equal to ALL_LABEL.

Example:

Table containing quantitative marginal distributions for attributes **ownership** and **age**
D1	…	D9	attribute	modality
18 852	…	46 522	all	all
16 542	…	50 060	ownership	Owner
8 764	…	29 860	ownership	Tenant
15 000	…	45 000	age	0_29
…	…	…	…	…
20 000	…	65 000	age	75_or_more
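The sketch below builds a table with this layout in pandas. Since the example above elides deciles D2 to D8, all numbers here are made up for illustration; the monotonicity check is a generic sanity check on deciles, not a documented validation step of the class.

```python
import pandas as pd

ALL_LABEL = "all"
decile_columns = [f"D{i}" for i in range(1, 10)]  # D1 ... D9

# Made-up decile values, one row per attribute/modality pair.
data = pd.DataFrame(
    [
        [10_000, 13_000, 16_000, 19_000, 22_000, 25_000, 29_000, 34_000, 42_000, ALL_LABEL, ALL_LABEL],
        [12_000, 15_000, 18_000, 21_000, 24_000, 28_000, 32_000, 38_000, 47_000, "ownership", "Owner"],
        [8_000, 10_000, 12_000, 14_000, 16_000, 18_000, 21_000, 24_000, 29_000, "ownership", "Tenant"],
    ],
    columns=decile_columns + ["attribute", "modality"],
)

# Deciles of a distribution must be non-decreasing from D1 to D9.
deciles_are_sorted = (
    data[decile_columns].apply(lambda r: r.is_monotonic_increasing, axis=1).all()
)
```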

_evaluate_feature_values()

Evaluate the feature values from the distribution values and class parameters.

Returns: list of feature values

_validate_data_type()

compute_feature_prob(attribute=ALL_LABEL, modality=ALL_LABEL)

Return a DataFrame containing the probability of being in each feature state, given the modality.

The resulting DataFrame is of the following format: { "feature": [feature_values], "prob": [feature_probs] }

This method accepts attributes and modalities from self.modalities, as well as the (ALL_LABEL, ALL_LABEL) pair, which returns the global distribution.

Parameters:

attribute – attribute label
modality – modality label

Returns:

DataFrame["feature", "prob"]
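One plausible way to obtain interval probabilities from deciles is to treat the deciles as points of a piecewise-linear cumulative distribution and take differences at the interval bounds. The sketch below follows that idea with NumPy. It is an assumption about the general technique, not a description of the library's algorithm: the decile values, the `upper_bound` (here 1.5 times D9) and the `feature_values` bounds are all hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up deciles D1..D9 for one modality.
deciles = np.array(
    [10_000, 13_000, 16_000, 19_000, 22_000, 25_000, 29_000, 34_000, 42_000],
    dtype=float,
)
abs_minimum = 0.0
upper_bound = 1.5 * deciles[-1]  # assumed upper end of the distribution

# Piecewise-linear CDF through (abs_minimum, 0), (D1, 0.1), ..., (D9, 0.9), (upper_bound, 1).
cdf_x = np.concatenate(([abs_minimum], deciles, [upper_bound]))
cdf_y = np.linspace(0.0, 1.0, len(cdf_x))

# Hypothetical interval upper bounds shared by all modalities.
feature_values = np.array([15_000.0, 30_000.0, upper_bound])

# Probability of each interval = CDF(upper bound) - CDF(previous bound).
cdf_at_bounds = np.interp(feature_values, cdf_x, cdf_y)
probs = np.diff(cdf_at_bounds, prepend=0.0)

feature_prob = pd.DataFrame({"feature": feature_values, "prob": probs})
```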

get_value_for_feature(feature_index, rng)

Return a value drawn from the interval corresponding to the feature index.

The first interval is defined as [self._abs_minimum, self.feature_values[0]], the second as [self.feature_values[0], self.feature_values[1]], and so on. The value is drawn from a uniform distribution over the interval.

Parameters:

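feature_index – index of the interval in self.feature_values
rng – Numpy random Generator

Returns:

value drawn uniformly from the selected interval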
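The interval rule described above can be sketched as follows. The bounds in `feature_values`, the `abs_minimum` value and the helper `draw_value` are hypothetical stand-ins for the instance attributes; only `numpy.random.Generator.uniform` is a real library call.

```python
import numpy as np

abs_minimum = 0
# Hypothetical interval upper bounds.
feature_values = [15_000, 30_000, 63_000]


def draw_value(feature_index, rng):
    """Draw a value uniformly from the interval ending at feature_values[feature_index]."""
    # The first interval starts at abs_minimum, the others at the previous bound.
    lower = abs_minimum if feature_index == 0 else feature_values[feature_index - 1]
    upper = feature_values[feature_index]
    return rng.uniform(lower, upper)


rng = np.random.default_rng(seed=42)
first = draw_value(0, rng)   # lies in [0, 15_000]
second = draw_value(1, rng)  # lies in [15_000, 30_000]
```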

compare_with_populations(populations, feature_name, **kwargs)

Compare the source data with populations containing the described feature (enriched or original).

This method returns an instance of a PopulationAnalysis subclass, which can be used to generate different kinds of comparisons between the populations and the source data.

Parameters:

populations – dict of populations {population_name: population}
feature_name – population column containing the feature values
kwargs – additional arguments for the analysis instance

Returns:

PopulationAnalysis subclass instance.
