bhepop2.analysis

This module provides tools to analyse populations.

Most of the time, a population is analysed by comparing it with reference data.

For enriched populations, comparison with the enrichment source data can be a good way to assess the quality of the enrichment.

Module Contents

Classes

PopulationAnalysis

DISCLAIMER: This class only works with MarginalDistributions data.

QuantitativeAnalysis

DISCLAIMER: This class only works with MarginalDistributions data.

QualitativeAnalysis

DISCLAIMER: This class only works with MarginalDistributions data.

class bhepop2.analysis.PopulationAnalysis(populations: dict, modalities: dict, feature_column: str, distributions: pandas.DataFrame, distributions_name: str = DEFAULT_SOURCE_NAME, plot_title_format: str = DEFAULT_PLOT_TITLE_FORMAT, output_folder: str = None)

DISCLAIMER: This class only works with MarginalDistributions data.

The PopulationAnalysis class and its subclasses were implemented before the enrichment classes were refactored. That refactoring composed SyntheticPopulationEnrichment with the more generic EnrichmentSource. This class therefore expects distributions in the format of MarginalDistributions.data rather than generic enrichment source data.


Analysis class for synthetic populations.

Synthetic populations must be identical except for their feature columns.

The values of the feature columns and their distributions are compared between populations and to the reference distribution.

Analysis is performed on the given modalities, which must be a subset of the modalities used for enrichment (and thus available in the population(s) and distributions).

The following analyses are available:
  • Graphs comparing the distributions in the population(s) to the original distributions (one per modality)

  • A table describing the error of the population(s) in comparison to the distributions (one line per modality), ordered by number of individuals in the modality

property analysis_table
CLASS_COLUMN
VALUE_COLUMN = 'value'
DEFAULT_PLOT_TITLE_FORMAT = 'Modality {modality} from attribute {attribute}'
set_output_folder(output_folder)

Set a new output folder for this analysis instance.

Parameters:

output_folder – valid output folder path

assert_output_folder()

Check that the output folder is set.

Raises:

AssertionError
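
A minimal sketch of the kind of guard described here, assuming the folder is kept on an attribute named output_folder. The class below is a hypothetical stand-in written for illustration, not the bhepop2 implementation:

```python
class OutputFolderHolder:
    """Hypothetical stand-in for an analysis object; not the bhepop2 class."""

    def __init__(self, output_folder=None):
        self.output_folder = output_folder

    def set_output_folder(self, output_folder):
        # Replace the folder used by later exports.
        self.output_folder = output_folder

    def assert_output_folder(self):
        # Fail early, before any plot or CSV export is attempted.
        assert self.output_folder is not None, "No output folder set"


holder = OutputFolderHolder()
try:
    holder.assert_output_folder()
    folder_was_set = True
except AssertionError:
    folder_was_set = False

holder.set_output_folder("outputs/")
holder.assert_output_folder()  # passes silently once a folder is set
```

Calling the check before exporting means a missing folder is reported up front rather than as a file-system error partway through an export.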

_evaluate_analysis_table()

Create a table used for comparing populations/distributions.

The resulting DataFrame contains the following columns:
  • attribute: attribute name

  • modality: modality name

  • self.PROPORTION_COLUMN: value describing the proportion taken for the corresponding

  • one column with the observed_name value

  • one column for each population name

Returns:

analysis DataFrame

abstract _format_distributions_for_analysis()

Format the distributions table as an analysis table.

Returns:

distributions as an analysis table

_compute_distributions_by_attribute(population: pandas.DataFrame) pandas.DataFrame

Compute the feature values distribution for each modality.

Generate an analysis table for this population.

Parameters:

population – population DataFrame

Returns:

analysis table

abstract _compute_distribution(population: pandas.DataFrame)

Get distribution of the feature values in the population.

Parameters:

population – population DataFrame

Returns:

analysis table of the population

generate_analysis_plots()

Generate plots comparing the population(s) to the original distributions (one per modality).

Plots are exported to PNG images in the output folder.

abstract plot_analysis_compare(attribute: str, modality: str)

Generate a plot comparing the populations and the distributions, for the given attribute and modality.

Parameters:
  • attribute – attribute value

  • modality – attribute modality

Returns:

Plotly Figure

abstract generate_analysis_error_table(export_csv: bool = True)

Generate a table describing how analysed populations deviate from the original distributions.

Parameters:

export_csv – whether a CSV export should be performed

Returns:

error table DataFrame

get_plot_title(**kwargs) str

Get the plot title for the given keys.

This relies on the plot_title_format attribute, which can be set externally.

Parameters:

kwargs – keys provided to the plot_title_format string

Returns:

plot title
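
As an illustration of how such a title format can be filled in, the sketch below assumes the keys are substituted with Python's str.format. The standalone function is hypothetical and takes the format string as an argument instead of reading it from an instance attribute:

```python
DEFAULT_PLOT_TITLE_FORMAT = "Modality {modality} from attribute {attribute}"


def get_plot_title(plot_title_format=DEFAULT_PLOT_TITLE_FORMAT, **kwargs):
    # Each keyword argument fills the placeholder of the same name.
    return plot_title_format.format(**kwargs)


# "owner" and "ownership" are made-up example values.
title = get_plot_title(modality="owner", attribute="ownership")
custom = get_plot_title(
    "{attribute} / {modality}", modality="owner", attribute="ownership"
)
```

Here title is "Modality owner from attribute ownership" and custom is "ownership / owner". A format string may only use placeholders for which a key is provided, otherwise str.format raises KeyError.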

class bhepop2.analysis.QuantitativeAnalysis(populations: dict, modalities: dict, feature_column: str, distributions: pandas.DataFrame, distributions_name: str = DEFAULT_SOURCE_NAME, plot_title_format: str = DEFAULT_PLOT_TITLE_FORMAT, output_folder: str = None)

Bases: PopulationAnalysis

DISCLAIMER: This class only works with MarginalDistributions data.

The PopulationAnalysis class and its subclasses were implemented before the enrichment classes were refactored. That refactoring composed SyntheticPopulationEnrichment with the more generic EnrichmentSource. This class therefore expects distributions in the format of MarginalDistributions.data rather than generic enrichment source data.


Analysis class for synthetic populations.

Synthetic populations must be identical except for their feature columns.

The values of the feature columns and their distributions are compared between populations and to the reference distribution.

Analysis is performed on the given modalities, which must be a subset of the modalities used for enrichment (and thus available in the population(s) and distributions).

The following analyses are available:
  • Graphs comparing the distributions in the population(s) to the original distributions (one per modality)

  • A table describing the error of the population(s) in comparison to the distributions (one line per modality), ordered by number of individuals in the modality

CLASS_COLUMN = 'decile'
plot_analysis_compare(attribute: str, modality: str)

Comparison plot between reference data and simulation

Parameters:
  • attribute – attribute value

  • modality – attribute modality

Returns:

Plotly figure

generate_analysis_error_table(export_csv=True)

Generate a table describing how analysed populations deviate from the original distributions.

Parameters:

export_csv – boolean indicating whether a CSV export should be performed

Returns:

error table DataFrame
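
The documentation does not state which error metric the table uses. Purely as an illustration of the idea of measuring how populations deviate from reference deciles, the sketch below assumes a mean absolute relative error. The metric, the population names, and all values are hypothetical, and plain dictionaries stand in for the DataFrames the library works with:

```python
# Reference decile values for one modality (made-up numbers).
reference = {"D1": 10000, "D5": 20000, "D9": 40000}

# Decile values observed in two hypothetical populations.
populations = {
    "enriched": {"D1": 11000, "D5": 19000, "D9": 42000},
    "uniform": {"D1": 8000, "D5": 25000, "D9": 30000},
}


def mean_abs_relative_error(observed, expected):
    # Average of |observed - expected| / expected over all deciles.
    errors = [abs(observed[k] - expected[k]) / expected[k] for k in expected]
    return sum(errors) / len(errors)


# One error value per population, i.e. one row of an error table.
error_row = {
    name: mean_abs_relative_error(values, reference)
    for name, values in populations.items()
}
```

With these numbers the "enriched" population gets a lower error than the "uniform" one, which is the kind of comparison an error table makes visible.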

_format_distributions_for_analysis()

Format the distributions table as an analysis table.

Returns:

distributions as an analysis table

_compute_distribution(population: pandas.DataFrame) pandas.DataFrame

Compute decile distribution of the feature values.

Parameters:

population – analysed population

Returns:

dataframe of deciles
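
A rough sketch of what computing deciles of a quantitative feature involves, using only the standard library. The library itself works on pandas DataFrames, and its exact decile definition (interpolation method, whether the minimum and maximum are included) is not given here, so the choices below are assumptions and the values are made up:

```python
import statistics

# Hypothetical feature values, e.g. incomes drawn for one modality.
incomes = [12000, 15000, 18000, 21000, 24000, 27000,
           30000, 36000, 45000, 60000, 90000]

# n=10 returns the 9 cut points D1..D9 separating ten equally sized groups.
# method="inclusive" treats the data as the whole population.
cut_points = statistics.quantiles(incomes, n=10, method="inclusive")

# Label the cut points D1 to D9.
deciles = {f"D{i}": value for i, value in enumerate(cut_points, start=1)}
```

The resulting labels can then be compared one by one with the decile values of the reference distributions.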

class bhepop2.analysis.QualitativeAnalysis(populations: dict, modalities: dict, feature_column: str, distributions: pandas.DataFrame, distributions_name: str = DEFAULT_SOURCE_NAME, plot_title_format: str = DEFAULT_PLOT_TITLE_FORMAT, output_folder: str = None)

Bases: PopulationAnalysis

DISCLAIMER: This class only works with MarginalDistributions data.

The PopulationAnalysis class and its subclasses were implemented before the enrichment classes were refactored. That refactoring composed SyntheticPopulationEnrichment with the more generic EnrichmentSource. This class therefore expects distributions in the format of MarginalDistributions.data rather than generic enrichment source data.


Analysis class for synthetic populations.

Synthetic populations must be identical except for their feature columns.

The values of the feature columns and their distributions are compared between populations and to the reference distribution.

Analysis is performed on the given modalities, which must be a subset of the modalities used for enrichment (and thus available in the population(s) and distributions).

The following analyses are available:
  • Graphs comparing the distributions in the population(s) to the original distributions (one per modality)

  • A table describing the error of the population(s) in comparison to the distributions (one line per modality), ordered by number of individuals in the modality

CLASS_COLUMN = 'feature'
plot_analysis_compare(attribute: str, modality: str)

Comparison plot between reference data and simulation

Parameters:
  • attribute – attribute value

  • modality – attribute modality

Returns:

Plotly figure

_format_distributions_for_analysis()

Format the distributions table as an analysis table.

Returns:

distributions as an analysis table

_compute_distribution(population: pandas.DataFrame) pandas.DataFrame

Get distribution of the feature values in the population.

Parameters:

population – population DataFrame

Returns:

analysis table of the population
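
For a qualitative feature, the distribution amounts to the share of individuals taking each value. A minimal standard-library sketch of that idea follows. The feature values are made up, and the library's own output is a pandas DataFrame whose layout is not reproduced here:

```python
from collections import Counter

# Hypothetical qualitative feature values for the individuals of one modality.
feature_values = ["0 car", "1 car", "1 car", "2 cars",
                  "1 car", "0 car", "2 cars", "1 car"]

# Count each value, then normalise by the number of individuals.
counts = Counter(feature_values)
total = sum(counts.values())
proportions = {value: count / total for value, count in counts.items()}
```

The proportions sum to 1 and can be compared directly with the probabilities given by the reference distributions for the same modality.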