bhepop2.sources.marginal_distributions
This module contains classes describing marginal distribution sources.
Here, specific source distributions are known for population subsets, which allows feature values to be assigned more precisely than with a single, population-wide distribution.
Module Contents
Classes
- MarginalDistributions: Abstract class describing marginal distributions source.
- QualitativeMarginalDistributions: Marginal distributions describing qualitative features.
- QuantitativeMarginalDistributions: Marginal distributions describing quantitative features.
Attributes
- bhepop2.sources.marginal_distributions.ALL_LABEL = 'all'
- class bhepop2.sources.marginal_distributions.MarginalDistributions(data, name=None, attribute_selection: list = None)
Bases:
bhepop2.sources.base.EnrichmentSource
Abstract class describing marginal distributions source.
In this class, distributions are known for subsets of the population, each made up of the individuals sharing a specific attribute value. For instance, the Filosofi data source (INSEE) stores distributions of declared income in administrative areas, for the whole population and for population subsets such as tenants or owners.
In this scope, we use the following terms to describe such marginal distributions:
- An attribute refers to a piece of information in the initial sample or in the aggregate data.
For instance: age, profession, ownership, etc.
- Modalities are the values that partition one attribute.
For instance, in Filosofi, the ownership attribute can take the values Owner and Tenant.
- Cross modalities are the intersection of two or more modalities.
For instance, Owner and above 65 years old.
Then, population individuals are part of a single cross modality, and can be matched with distributions corresponding to their known attributes.
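The vocabulary above can be illustrated with a small, library-independent sketch. The attribute names and modalities below are taken from the examples in this page (the age attribute is reduced to two modalities for brevity); the variable names are hypothetical and are not part of bhepop2.

```python
from itertools import product

# Attributes and their modalities (illustrative subset only).
modalities = {
    "ownership": ["Owner", "Tenant"],
    "age": ["0_29", "75_or_more"],
}

# Cross modalities: one modality picked from each attribute.
cross_modalities = list(product(*modalities.values()))

# An individual belongs to exactly one cross modality,
# determined by the values of its attributes.
individual = {"ownership": "Owner", "age": "75_or_more"}
individual_cross_modality = tuple(individual[attr] for attr in modalities)

# The individual can be matched with one marginal distribution
# per known attribute, identified by an (attribute, modality) pair.
matching_pairs = [(attr, individual[attr]) for attr in modalities]
```

Here the individual falls in the cross modality `("Owner", "75_or_more")` and can be matched with the distributions of the pairs `("ownership", "Owner")` and `("age", "75_or_more")`.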
- _validate_data()
Validate the source data.
Raise a SourceValidationError if the data is invalid.
- Raises:
SourceValidationError
- usable_with_population(population)
Check that the population attributes are compatible with the source.
Check that the source attributes are present in the population. Check that the population values of each attribute are in the source distributions.
- Parameters:
population – population DataFrame
- Raises:
PopulationValidationError
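The two checks described for usable_with_population can be sketched with plain pandas. This is only an illustration of the idea, not the library's implementation: `check_population` is a hypothetical helper, the `modalities` mapping is assumed to list the known modalities of each source attribute, and a plain `ValueError` stands in for PopulationValidationError.

```python
import pandas as pd


def check_population(population: pd.DataFrame, modalities: dict) -> None:
    """Hypothetical stand-in for the compatibility checks described above."""
    for attribute, known in modalities.items():
        # Check 1: the source attribute must be a population column.
        if attribute not in population.columns:
            raise ValueError(f"Missing attribute column: {attribute}")
        # Check 2: every population value must be a known modality.
        unknown = set(population[attribute].unique()) - set(known)
        if unknown:
            raise ValueError(f"Unknown modalities for {attribute}: {sorted(unknown)}")


modalities = {"ownership": ["Owner", "Tenant"], "age": ["0_29", "75_or_more"]}
population = pd.DataFrame(
    {"ownership": ["Owner", "Tenant"], "age": ["0_29", "75_or_more"]}
)
check_population(population, modalities)  # compatible: no exception raised
```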
- abstract _validate_data_type()
- abstract compute_feature_prob(attribute=ALL_LABEL, modality=ALL_LABEL)
Return a DataFrame containing the probability to be in each feature state while in the given modality.
The resulting DataFrame has the following format: { "feature": [feature_values], "prob": [feature_probs] }
This method accepts attributes and modalities from self.modalities, as well as the (ALL_LABEL, ALL_LABEL) pair, which returns the global distribution.
- Parameters:
attribute – attribute label
modality – modality label
- Returns:
DataFrame["feature", "prob"]
- get_modality_distribution(attribute, modality)
Get the distribution corresponding to the given attribute and modality.
This method accepts attributes and modalities from self.modalities, as well as the (ALL_LABEL, ALL_LABEL) pair, which returns the global distribution.
- Parameters:
attribute – attribute label
modality – modality label
- Returns:
- class bhepop2.sources.marginal_distributions.QualitativeMarginalDistributions(data, name=None, attribute_selection: list = None)
Bases:
MarginalDistributions
Marginal distributions describing qualitative features.
Input data:
DataFrame with feature values as columns, and probabilities as column values, for each attribute/modality pair. An additional row containing a global distribution (for the whole population) must be present, with attribute and modality equal to ALL_LABEL.
Example:
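Table containing qualitative marginal distributions for attributes ownership and age

| Red | Green | Blue | attribute | modality |
|-----|-------|------|-----------|----------|
| 0.3 | 0.3 | 0.4 | all | all |
| 0.5 | 0.2 | 0.3 | ownership | Owner |
| 0.4 | 0.4 | 0.2 | ownership | Tenant |
| 0 | 0.5 | 0.5 | age | 0_29 |
| … | … | … | … | … |
| 0.7 | 0.1 | 0.2 | age | 75_or_more |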
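As a rough illustration, the rows listed in the example table (leaving out the elided ones) could be assembled into a pandas DataFrame as follows. The sanity checks at the end are assumptions about what well-formed input looks like; they are not taken from the library's own validation.

```python
import pandas as pd

ALL_LABEL = "all"  # same value as bhepop2.sources.marginal_distributions.ALL_LABEL

data = pd.DataFrame(
    {
        "Red": [0.3, 0.5, 0.4, 0.0, 0.7],
        "Green": [0.3, 0.2, 0.4, 0.5, 0.1],
        "Blue": [0.4, 0.3, 0.2, 0.5, 0.2],
        "attribute": [ALL_LABEL, "ownership", "ownership", "age", "age"],
        "modality": [ALL_LABEL, "Owner", "Tenant", "0_29", "75_or_more"],
    }
)

# Feature values are the columns other than attribute and modality.
feature_columns = [c for c in data.columns if c not in ("attribute", "modality")]

# Assumed sanity checks: each row is a probability distribution,
# and the global (ALL_LABEL, ALL_LABEL) row is present.
row_sums = data[feature_columns].sum(axis=1)
has_global_row = (
    (data["attribute"] == ALL_LABEL) & (data["modality"] == ALL_LABEL)
).any()
```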
- _evaluate_feature_values()
Evaluate the feature values from the distributions columns.
- Returns:
list of feature values
- _validate_data_type()
- compute_feature_prob(attribute=ALL_LABEL, modality=ALL_LABEL)
Return a DataFrame containing the probability to be in each feature state while in the given modality.
The resulting DataFrame has the following format: { "feature": [feature_values], "prob": [feature_probs] }
This method accepts attributes and modalities from self.modalities, as well as the (ALL_LABEL, ALL_LABEL) pair, which returns the global distribution.
- Parameters:
attribute – attribute label
modality – modality label
- Returns:
DataFrame["feature", "prob"]
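For qualitative data, the output format described above can be pictured as a reshaping of one input row. The sketch below is an assumption about how such a result could be produced with pandas, not the library's code: `feature_prob` is a hypothetical helper, and the input rows are copied from the example table.

```python
import pandas as pd

ALL_LABEL = "all"


def feature_prob(data, feature_values, attribute=ALL_LABEL, modality=ALL_LABEL):
    """Hypothetical helper: reshape one distribution row to feature/prob columns."""
    mask = (data["attribute"] == attribute) & (data["modality"] == modality)
    row = data.loc[mask, feature_values].iloc[0]
    return pd.DataFrame({"feature": feature_values, "prob": row.to_numpy()})


data = pd.DataFrame(
    {
        "Red": [0.3, 0.5, 0.4],
        "Green": [0.3, 0.2, 0.4],
        "Blue": [0.4, 0.3, 0.2],
        "attribute": [ALL_LABEL, "ownership", "ownership"],
        "modality": [ALL_LABEL, "Owner", "Tenant"],
    }
)
features = ["Red", "Green", "Blue"]

owner = feature_prob(data, features, "ownership", "Owner")
overall = feature_prob(data, features)  # defaults select the global distribution
```

Calling the helper without arguments mirrors the (ALL_LABEL, ALL_LABEL) default of compute_feature_prob and selects the global row.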
- get_value_for_feature(feature_index, rng)
Return a feature value for the given feature index.
Generate a single value from the feature state corresponding to the given index.
- Parameters:
feature_index – index of the feature in self.feature_values
rng – Numpy random Generator
- Returns:
feature value
- compare_with_populations(populations, feature_name, **kwargs)
Compare the source data with populations containing the described feature (enriched or original).
This method returns an instance of a PopulationAnalysis subclass, which can be used to generate different kinds of comparisons between the populations and the source data.
- Parameters:
populations – dict of populations {population_name: population}
feature_name – population column containing the feature values
kwargs – additional arguments for the analysis instance
- Returns:
PopulationAnalysis subclass instance.
- class bhepop2.sources.marginal_distributions.QuantitativeMarginalDistributions(data, name=None, attribute_selection: list = None, abs_minimum: int = 0, relative_maximum: float = 1.5, delta_min: int = None)
Bases:
MarginalDistributions, bhepop2.sources.base.QuantitativeAttributes
Marginal distributions describing quantitative features.
Input data:
DataFrame with decile labels as columns (D1 to D9), and decile values as column values, for each attribute/modality pair. An additional row containing a global distribution (for the whole population) must be present, with attribute and modality equal to ALL_LABEL.
Example:
Table containing quantitative marginal distributions for attributes ownership and age

| D1 | … | D9 | attribute | modality |
|----|---|----|-----------|----------|
| 18 852 | … | 46 522 | all | all |
| 16 542 | … | 50 060 | ownership | Owner |
| 8 764 | … | 29 860 | ownership | Tenant |
| 15 000 | … | 45 000 | age | 0_29 |
| … | … | … | … | … |
| 20 000 | … | 65 000 | age | 75_or_more |
- _evaluate_feature_values()
Evaluate the feature values from the distribution values and class parameters.
- Returns:
list of feature values
- _validate_data_type()
- compute_feature_prob(attribute=ALL_LABEL, modality=ALL_LABEL)
Return a DataFrame containing the probability to be in each feature state while in the given modality.
The resulting DataFrame has the following format: { "feature": [feature_values], "prob": [feature_probs] }
This method accepts attributes and modalities from self.modalities, as well as the (ALL_LABEL, ALL_LABEL) pair, which returns the global distribution.
- Parameters:
attribute – attribute label
modality – modality label
- Returns:
DataFrame["feature", "prob"]
- get_value_for_feature(feature_index, rng)
Return a value drawn from the interval corresponding to the feature index.
The first interval is defined as [self._abs_minimum, self.feature_values[0]], and so on for the following intervals. The value is drawn uniformly within the interval.
- Parameters:
feature_index – index of the feature in self.feature_values
rng – Numpy random Generator
- Returns:
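The uniform draw described for the quantitative get_value_for_feature can be sketched with a NumPy Generator. This is a minimal illustration under stated assumptions, not the library's implementation: `draw_value` is a hypothetical function, "and so on" is read as each later interval running from the previous feature value to the current one, and the bounds below are made-up numbers.

```python
import numpy as np


def draw_value(feature_values, feature_index, rng, abs_minimum=0):
    """Hypothetical sketch: uniform draw in the interval ending at the given index."""
    # The first interval starts at abs_minimum; later intervals are assumed
    # to start at the previous feature value.
    lower = abs_minimum if feature_index == 0 else feature_values[feature_index - 1]
    upper = feature_values[feature_index]
    return rng.uniform(lower, upper)


rng = np.random.default_rng(seed=0)
feature_values = [10_000, 20_000, 30_000]  # illustrative bounds, not real data

first = draw_value(feature_values, 0, rng)   # drawn in [0, 10 000]
second = draw_value(feature_values, 1, rng)  # drawn in [10 000, 20 000]
```

Passing the Generator in explicitly, as the documented signature does, keeps the draws reproducible when the caller controls the seed.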
- compare_with_populations(populations, feature_name, **kwargs)
Compare the source data with populations containing the described feature (enriched or original).
This method returns an instance of a PopulationAnalysis subclass, which can be used to generate different kinds of comparisons between the populations and the source data.
- Parameters:
populations – dict of populations {population_name: population}
feature_name – population column containing the feature values
kwargs – additional arguments for the analysis instance
- Returns:
PopulationAnalysis subclass instance.