bhepop2.analysis
This module provides tools to analyse populations.
Most of the time, a population is analysed by comparing it with reference data.
For enriched populations, comparison with the enrichment source data can be a good way to assess the quality of the enrichment.
Module Contents
Classes
- class bhepop2.analysis.PopulationAnalysis(populations: dict, modalities: dict, feature_column: str, distributions: pandas.DataFrame, distributions_name: str = DEFAULT_SOURCE_NAME, plot_title_format: str = DEFAULT_PLOT_TITLE_FORMAT, output_folder: str = None)
DISCLAIMER: This class only works with MarginalDistributions data.
The PopulationAnalysis class and its subclasses were implemented before the enrichment classes were refactored to compose SyntheticPopulationEnrichment with the more generic EnrichmentSource. This class therefore expects distributions in the format of MarginalDistributions.data rather than generic enrichment source data.
Analysis class for synthetic populations.
Synthetic populations must be identical except for their feature columns.
The values of the feature columns and their distributions are compared between populations and to the reference distribution.
Analysis is performed on the given modalities, which must be a subset of the modalities used for enrichment (and thus available in the population(s) and distributions).
- The following analyses are available:
Graphs comparing the distributions in the population(s) to the original distributions (one per modality)
A table describing the error of the population(s) in comparison to the distributions (one line per modality), ordered by number of individuals in the modality
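The subset requirement on modalities can be checked before building an analysis. The sketch below is stdlib-only and hypothetical: `check_modalities` and the sample dictionaries are illustrative names, not part of bhepop2, and the layout of the dictionaries (attribute name mapped to a list of modalities) is an assumption about the `modalities` argument.

```python
def check_modalities(analysed: dict, enriched: dict) -> list:
    """Return the (attribute, modality) pairs requested for analysis
    that were not part of the enrichment modalities."""
    missing = []
    for attribute, modalities in analysed.items():
        known = set(enriched.get(attribute, []))
        for modality in modalities:
            if modality not in known:
                missing.append((attribute, modality))
    return missing


# Hypothetical modalities, for illustration only
enrichment_modalities = {
    "age": ["0_17", "18_64", "65_plus"],
    "ownership": ["owner", "tenant"],
}
analysis_modalities = {"age": ["0_17", "65_plus"]}

# An empty list means the analysed modalities are a valid subset
missing = check_modalities(analysis_modalities, enrichment_modalities)
```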
- property analysis_table
- CLASS_COLUMN
- VALUE_COLUMN = 'value'
- DEFAULT_PLOT_TITLE_FORMAT = 'Modality {modality} from attribute {attribute}'
- set_output_folder(output_folder)
Set a new output folder for this analysis instance.
- Parameters:
output_folder – valid output folder path
- assert_output_folder()
Check that the output folder is set.
- Raises:
AssertionError
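As an illustration of the pattern described by set_output_folder and assert_output_folder, here is a minimal sketch. `OutputFolderHolder` is a hypothetical stand-in class, and the exact check performed by bhepop2 is an assumption (only "is set" is tested here).

```python
class OutputFolderHolder:
    """Hypothetical stand-in, not the bhepop2 class."""

    def __init__(self, output_folder=None):
        self.output_folder = output_folder

    def set_output_folder(self, output_folder):
        # Replace the output folder used by later exports
        self.output_folder = output_folder

    def assert_output_folder(self):
        # Assumption: "set" means "not None"
        assert self.output_folder is not None, "output folder is not set"


holder = OutputFolderHolder()
try:
    holder.assert_output_folder()
    raised = False
except AssertionError:
    raised = True

holder.set_output_folder("outputs/")
holder.assert_output_folder()  # passes once a folder is set
```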
- _evaluate_analysis_table()
Create a table used for comparing populations/distributions.
- The resulting DataFrame contains the following columns:
attribute: attribute name
modality: modality name
self.PROPORTION_COLUMN: value describing the proportion taken for the corresponding
one column with the observed_name value
one column for each population name
- Returns:
analysis DataFrame
- abstract _format_distributions_for_analysis()
Format the distributions table as an analysis table.
- Returns:
distributions as an analysis table
- _compute_distributions_by_attribute(population: pandas.DataFrame) pandas.DataFrame
Compute the feature values distribution for each modality.
Generate an analysis table for this population.
- Parameters:
population – population DataFrame
- Returns:
analysis table
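The per-modality computation can be pictured as a loop over (attribute, modality) pairs that delegates to a distribution function, in the way the abstract _compute_distribution is delegated to subclasses. This is a stdlib-only sketch using lists of dictionaries instead of DataFrames; `distributions_by_attribute`, `mean_income`, and the sample data are hypothetical names, not the library's implementation.

```python
def distributions_by_attribute(population, modalities, compute):
    """Apply `compute` to the individuals of each (attribute, modality) pair
    and tag the resulting rows with that pair."""
    rows = []
    for attribute, attribute_modalities in modalities.items():
        for modality in attribute_modalities:
            subset = [ind for ind in population if ind.get(attribute) == modality]
            for row in compute(subset):
                rows.append({"attribute": attribute, "modality": modality, **row})
    return rows


# Hypothetical population: one dictionary per individual
population = [
    {"age": "0_17", "ownership": "tenant", "income": 0.0},
    {"age": "18_64", "ownership": "owner", "income": 30000.0},
    {"age": "18_64", "ownership": "tenant", "income": 20000.0},
]


def mean_income(subset):
    # Placeholder distribution function: a single row with the mean income
    if not subset:
        return []
    return [{"value": sum(ind["income"] for ind in subset) / len(subset)}]


table = distributions_by_attribute(
    population, {"age": ["18_64"], "ownership": ["tenant"]}, mean_income
)
```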
- abstract _compute_distribution(population: pandas.DataFrame)
Get distribution of the feature values in the population.
- Parameters:
population – population DataFrame
- Returns:
analysis table of the population
- generate_analysis_plots()
Generate plots comparing the population(s) to the original distributions (one per modality).
Plots are exported to PNG images in the output folder.
- abstract plot_analysis_compare(attribute: str, modality: str)
Generate a plot comparing the populations and the distributions, for the given attribute and modality.
- Parameters:
attribute – attribute value
modality – attribute modality
- Returns:
Plotly Figure
- abstract generate_analysis_error_table(export_csv: bool = True)
Generate a table describing how analysed populations deviate from the original distributions.
- Parameters:
export_csv – boolean indicating whether a CSV export should be performed
- Returns:
error table DataFrame
- get_plot_title(**kwargs) str
Get the plot title for the given keys.
This relies on the plot_title_format attribute, which can be set externally.
- Parameters:
kwargs – keys provided to the plot_title_format string
- Returns:
plot title
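A title format such as DEFAULT_PLOT_TITLE_FORMAT can be filled with keyword arguments. The sketch below assumes the format string is applied with `str.format`, which the documentation does not state explicitly; the standalone function and the sample values are illustrative.

```python
DEFAULT_PLOT_TITLE_FORMAT = "Modality {modality} from attribute {attribute}"


def get_plot_title(plot_title_format: str, **kwargs) -> str:
    # Assumption: the keys are substituted with str.format
    return plot_title_format.format(**kwargs)


title = get_plot_title(DEFAULT_PLOT_TITLE_FORMAT, attribute="age", modality="0_17")
# A custom format only needs to use the same keys
custom_title = get_plot_title("{attribute} = {modality}", attribute="age", modality="0_17")
```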
- class bhepop2.analysis.QuantitativeAnalysis(populations: dict, modalities: dict, feature_column: str, distributions: pandas.DataFrame, distributions_name: str = DEFAULT_SOURCE_NAME, plot_title_format: str = DEFAULT_PLOT_TITLE_FORMAT, output_folder: str = None)
Bases:
PopulationAnalysis
DISCLAIMER: This class only works with MarginalDistributions data.
The PopulationAnalysis class and its subclasses were implemented before the enrichment classes were refactored to compose SyntheticPopulationEnrichment with the more generic EnrichmentSource. This class therefore expects distributions in the format of MarginalDistributions.data rather than generic enrichment source data.
Analysis class for synthetic populations.
Synthetic populations must be identical except for their feature columns.
The values of the feature columns and their distributions are compared between populations and to the reference distribution.
Analysis is performed on the given modalities, which must be a subset of the modalities used for enrichment (and thus available in the population(s) and distributions).
- The following analyses are available:
Graphs comparing the distributions in the population(s) to the original distributions (one per modality)
A table describing the error of the population(s) in comparison to the distributions (one line per modality), ordered by number of individuals in the modality
- CLASS_COLUMN = 'decile'
- plot_analysis_compare(attribute: str, modality: str)
Comparison plot between reference data and simulation
- Parameters:
attribute – attribute value
modality – attribute modality
- Returns:
Plotly figure
- generate_analysis_error_table(export_csv=True)
Generate a table describing how analysed populations deviate from the original distributions.
- Parameters:
export_csv – boolean indicating whether a CSV export should be performed
- Returns:
error table DataFrame
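To make the idea of an error table concrete, the sketch below computes one row per modality and orders rows by number of individuals. The error metric (mean absolute relative error between reference and simulated values), the descending order, and all names are assumptions chosen for illustration; the metric actually used by bhepop2 is not documented here.

```python
def error_table(reference, simulated, counts):
    """Build one row per modality, sorted by number of individuals.

    reference / simulated: modality -> list of values (e.g. decile limits)
    counts: modality -> number of individuals in the population
    """
    rows = []
    for modality, ref_values in reference.items():
        sim_values = simulated[modality]
        # Assumed metric: mean absolute relative error (reference values non-zero)
        errors = [abs(s - r) / r for r, s in zip(ref_values, sim_values)]
        rows.append(
            {
                "modality": modality,
                "individuals": counts[modality],
                "mean_relative_error": sum(errors) / len(errors),
            }
        )
    # Assumed direction: largest modalities first
    rows.sort(key=lambda row: row["individuals"], reverse=True)
    return rows


# Hypothetical values, for illustration only
reference = {"owner": [100.0, 200.0], "tenant": [50.0, 100.0]}
simulated = {"owner": [110.0, 200.0], "tenant": [50.0, 100.0]}
counts = {"owner": 10, "tenant": 40}
table = error_table(reference, simulated, counts)
```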
- _format_distributions_for_analysis()
Format the distributions table as an analysis table.
- Returns:
distributions as an analysis table
- _compute_distribution(population: pandas.DataFrame) pandas.DataFrame
Compute decile distribution of the feature values.
- Parameters:
population – analysed population
- Returns:
dataframe of deciles
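A decile distribution of feature values can be sketched with the standard library. This uses `statistics.quantiles` with its default method and returns the nine inner cut points; whether bhepop2 uses the same interpolation, or also reports the minimum and maximum, is not stated in this page, so treat `decile_cut_points` as a hypothetical helper.

```python
import statistics


def decile_cut_points(values):
    """Return the nine cut points splitting `values` into ten equal-sized groups."""
    # statistics.quantiles requires at least two data points
    return statistics.quantiles(values, n=10)


# Hypothetical feature values (e.g. incomes), for illustration only
incomes = [float(v) for v in range(1, 100)]
deciles = decile_cut_points(incomes)
```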
- class bhepop2.analysis.QualitativeAnalysis(populations: dict, modalities: dict, feature_column: str, distributions: pandas.DataFrame, distributions_name: str = DEFAULT_SOURCE_NAME, plot_title_format: str = DEFAULT_PLOT_TITLE_FORMAT, output_folder: str = None)
Bases:
PopulationAnalysis
DISCLAIMER: This class only works with MarginalDistributions data.
The PopulationAnalysis class and its subclasses were implemented before the enrichment classes were refactored to compose SyntheticPopulationEnrichment with the more generic EnrichmentSource. This class therefore expects distributions in the format of MarginalDistributions.data rather than generic enrichment source data.
Analysis class for synthetic populations.
Synthetic populations must be identical except for their feature columns.
The values of the feature columns and their distributions are compared between populations and to the reference distribution.
Analysis is performed on the given modalities, which must be a subset of the modalities used for enrichment (and thus available in the population(s) and distributions).
- The following analyses are available:
Graphs comparing the distributions in the population(s) to the original distributions (one per modality)
A table describing the error of the population(s) in comparison to the distributions (one line per modality), ordered by number of individuals in the modality
- CLASS_COLUMN = 'feature'
- plot_analysis_compare(attribute: str, modality: str)
Comparison plot between reference data and simulation
- Parameters:
attribute – attribute value
modality – attribute modality
- Returns:
Plotly figure
- _format_distributions_for_analysis()
Format the distributions table as an analysis table.
- Returns:
distributions as an analysis table
- _compute_distribution(population: pandas.DataFrame) pandas.DataFrame
Get distribution of the feature values in the population.
- Parameters:
population – population DataFrame
- Returns:
analysis table of the population
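For a qualitative feature, the distribution is the share of individuals taking each feature value. A minimal stdlib sketch, with the hypothetical helper `feature_proportions` and made-up values, could look like this; it is not the library's implementation.

```python
from collections import Counter


def feature_proportions(values):
    """Return the proportion of individuals for each distinct feature value."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


# Hypothetical qualitative feature values, for illustration only
proportions = feature_proportions(["car", "bike", "car", "walk"])
```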