CAMPAREE API¶
The CAMPAREE Controller¶
-
class
camparee.camparee_controller.
CampareeController
[source]¶ The object essentially controls the flow of the pipeline. The run_camparee.py command instantiates a controller and calls one of the controller methods depending upon the pipeline-stage requested. The other methods in the class are helper methods.
-
assemble_input_samples
()[source]¶ Creates a list of sample objects, attached to the controller, that represent those samples that are to be run in the expression pipeline. If not running from the expression pipeline, this method is not used since the sample data is already contained in each packet. For each sample, a unique combination of adapter sequences is provided. The sample name is assumed to be that of the input filename without the extension. Gender may or may not be provided in the configuration data. If not set, the gender will be inferred by the expression pipeline.
-
static
check_file_existence
(directory_path, filename)[source]¶ Helper method to establish whether provided directory path and filename combine to point to an existing file :param directory_path: path to directory holding file :param filename: name of file :return: True if the path is valid and False otherwise
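The check described above can be sketched with the standard library. This is a minimal illustration of the documented behavior, not the CAMPAREE source; the file names in the usage lines are made up.

```python
import os
import tempfile


def check_file_existence(directory_path, filename):
    """Return True if directory_path/filename points to an existing file."""
    return os.path.isfile(os.path.join(directory_path, filename))


# Usage: create a throwaway file and probe for it and for a missing one.
with tempfile.TemporaryDirectory() as tmp_dir:
    open(os.path.join(tmp_dir, "config.yaml"), "w").close()
    found = check_file_existence(tmp_dir, "config.yaml")
    missing = check_file_existence(tmp_dir, "absent.yaml")
```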
-
create_controller_log
()[source]¶ Generates a controller log containing the timestamp, seed, run id and current configuration data so that the user can replicate this run at a later date.
-
create_output_folder_structure
(stage_names)[source]¶ Use the provided stage names, the run id and the top level output directory path from the configuration data to create the top level directory of a preliminary directory structure. The attempt fails if the top level output directory exists but is either not a directory or a non-empty directory, or if the user has insufficient permissions to create the directory. Folders named after the provided stage names (e.g., controller, library_prep_pipeline, sequence_pipeline) are created directly below the top level output directory, and beneath each of these are data and log folders. Additional subdirectories are created later to organize the numerous files expected and avoid congestion. :param stage_names: names of folders directly below the top level output directory (e.g., controller, library_prep)
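A rough sketch of the folder layout and failure conditions described above. The function name, the leaf folder names ("data", "logs"), and the exception types are assumptions made for illustration. How the run id enters the path is not stated here, so it is left out.

```python
import os
import tempfile


def create_output_folders(output_directory_path, stage_names):
    """Create <top>/<stage>/data and <top>/<stage>/logs for each stage."""
    if os.path.exists(output_directory_path):
        if not os.path.isdir(output_directory_path):
            raise NotADirectoryError(output_directory_path)
        if os.listdir(output_directory_path):
            # A non-empty top level directory is treated as a failure.
            raise FileExistsError(output_directory_path)
    os.makedirs(output_directory_path, exist_ok=True)
    for stage_name in stage_names:
        for leaf in ("data", "logs"):  # assumed folder names
            os.makedirs(os.path.join(output_directory_path, stage_name, leaf))


# Usage with a temporary top level directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    top = os.path.join(tmp_dir, "run_1")
    create_output_folders(top, ["controller", "library_prep_pipeline"])
    created = sorted(os.listdir(top))
```

Insufficient permissions surface as a PermissionError from os.makedirs, so the sketch does not check for them separately.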
-
perform_setup
(args, stage_names)[source]¶ This helper method sets up a number of attributes and behaviors in the controller. Stacktraces are suppressed and only user friendly errors are shown when the debugger is off (just a command line arg right now). The full configuration file data and run id are stored and the random seed is set. The initial output folder structure (excluding the subdirectory structure needed to accommodate large numbers of files) is created. The output folder structure depends on the stage names. Also, the controller log is started. :param args: The command line arguments :param stage_names: The stage names
-
plant_seed
(seed)[source]¶ Helper method to add the seed to the controller for later use. The seed, if any, on the command line, takes precedence. If no seed is present on the command line, the controller configuration data is searched for it. If no seed is found there, one will be randomly generated. The seed will be added to the controller log file created for this run so that the user may re-create the run exactly at a later date, assuming all else remains the same. :param seed: The seed value found on the command line, if any.
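The precedence described above (command line, then configuration, then a random value) might look like the following sketch. The function name, the "seed" configuration key, and the range of the generated seed are assumptions for illustration only.

```python
import random


def resolve_seed(command_line_seed, controller_config):
    """Pick a seed: command line first, then config, else random."""
    if command_line_seed is not None:
        return int(command_line_seed)
    config_seed = controller_config.get("seed")  # assumed key name
    if config_seed is not None:
        return int(config_seed)
    # No seed supplied anywhere: generate one so it can be logged.
    return random.randrange(2**32)


from_cli = resolve_seed(7, {"seed": 3})
from_config = resolve_seed(None, {"seed": 3})
generated = resolve_seed(None, {})
```

Comparing against None rather than relying on truthiness keeps a seed of 0 from being mistaken for "no seed".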
-
retrieve_configuration
(configuration_file_path)[source]¶ Helper method to parse the configuration file given by the path into a dictionary attached to the controller object. For convenience, the portion of the configuration file that contains parametric data specific to the controller is set to a separate dictionary also attached to the controller. :param configuration_file_path: The absolute file path of the configuration file
-
run_camparee_pipeline
(args)[source]¶ This is how run_camparee.py calls the camparee pipeline. This method reads the command line arguments, parses the config file, and calls the necessary methods to run camparee. :param args: command line arguments
-
set_run_id
(run_id)[source]¶ Helper method to add the run id to the controller for later use. The run id, if any, on the command line takes precedence. If no run id is present on the command line, the controller configuration data is searched for it. If no run id is found in either place, an error is raised. :param run_id: The run id found on the command line, if any.
-
Expression Simulation Pipeline¶
-
class
camparee.expression_pipeline.
ExpressionPipeline
(configuration, scheduler_mode, output_directory_path, input_samples)[source]¶ This class represents a pipeline of steps that take user supplied fastq files through alignment, variants finding, parental genome construction, annotation, quantification and generation of transcripts and finally the generation of packets of molecules that may be used to simulate RNA sequencing.
-
generate_job_seeds
()[source]¶ Generate one seed per job that needs a seed, returns a dictionary mapping job names to seeds
We generate seeds for each job since jobs potentially run on separate nodes of the cluster and so cannot simply share NumPy seeds. We generate them all ahead of time so that if jobs need to be restarted, they can reuse the same seed.
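The idea of drawing all per-job seeds up front from one master seed can be sketched as follows. The sketch uses the standard library random module rather than NumPy, and the function signature and job names are hypothetical.

```python
import random


def generate_job_seeds(job_names, master_seed):
    """Map each job name to its own seed, derived from one master seed."""
    rng = random.Random(master_seed)
    return {job_name: rng.randrange(2**32) for job_name in job_names}


# Hypothetical job names, for illustration only.
jobs = ["VariantsFinderStep.sample1", "VariantsFinderStep.sample2"]
seeds = generate_job_seeds(jobs, master_seed=42)
seeds_again = generate_job_seeds(jobs, master_seed=42)
```

Because the mapping depends only on the master seed and the order of the job names, a restarted job can look up the same seed it was given the first time.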
-
run_step
(step_name, sample, cmd_line_args, dependency_list=None, jobname_suffix=None)[source]¶ Helper function that runs the given step, with the given parameters. It wraps submission of the step to the scheduler/job monitor.
Parameters: - step_name (string) – Name of the CAMPAREE step to run. It should be in the list of steps stored in the steps dictionary.
- sample (Sample) – Sample to run through the step. For steps that aren’t associated with specific samples, set this to None.
- cmd_line_args (list) – List of positional parameters to pass to the get_commandline_call() method for the given step.
- dependency_list (list) – List of job names (if any) the current step depends on. Default: None.
- jobname_suffix (string) – Suffix to add to job submission ID. Default: None.
-
set_third_party_software
()[source]¶ Helper method to gather the names of all the 3rd party application files or directories and use them to set all the paths needed in the pipeline. Since the third party software is shipped with this application, validation should not be necessary. Software is identified generally by name and not specifically by filename since filenames may contain versioning and other artefacts. :return: the filenames for beagle, star, and kallisto, and the directory name for bowtie2.
-
validate_and_set_optional_inputs
(optional_inputs)[source]¶ Helper method to validate and set optional inputs.
Parameters: optional_inputs (dict) – Optional input files specified in the config file. Returns: True for valid optional inputs and False otherwise. Return type: boolean
-
validate_and_set_output_data
(output)[source]¶ Helper method to validate and set output data. :param output: The output dictionary extracted from the configuration file. :return: True for valid output data and False otherwise
-
validate_and_set_resources
(resources)[source]¶ Since the resources are input file intensive, and since information about resource paths is found in the configuration file, this method validates that all needed resource information is complete, consistent and all input data is found. :param resources: dictionary containing resources from the configuration file :return: True if valid and False otherwise
-
validate_and_set_sample_optional_inputs
(sample_inputs)[source]¶ Helper method to validate and set per sample optional inputs.
Parameters: sample_inputs (dict) – Subject input parameters specified in the ‘data’ section of the config file. Returns: True for valid per sample optional inputs and False otherwise. Return type: boolean
-
Genome Alignment & BAM Indexing Steps¶
-
class
camparee.genome_alignment.
GenomeAlignmentStep
(log_directory_path, data_directory_path, parameters={})[source]¶ -
execute
(sample, star_index_directory_path, star_bin_path)[source]¶ Use STAR to align fastq files for a given sample, to the reference genome.
Parameters: - sample (Sample) – Sample containing paths for FASTQ files to align, or pre-aligned BAM file.
- star_index_directory_path (string) – Path to directory containing STAR index.
- star_bin_path (string) – Path to STAR executable binary.
-
get_commandline_call
(sample, star_index_directory_path, star_bin_path)[source]¶ Prepare command to execute the GenomeAlignment from the command line, given all of the arguments used to run the execute() function.
Parameters: - sample (Sample) – Sample containing paths for FASTQ files to align, or pre-aligned BAM file.
- star_index_directory_path (string) – Path to directory containing STAR index.
- star_bin_path (string) – Path to STAR executable binary.
Returns: Command to execute on the command line. It will perform the same operations as a call to execute() with the same parameters.
Return type: string
-
get_genome_bam_path
(sample)[source]¶ Determine whether user provided a BAM file for the given sample, and return either this path, or the default path used by the GenomeAlignment step.
Parameters: sample (Sample) – Sample containing paths for FASTQ files to align, or pre-aligned BAM file. Returns: Path to BAM file associated with this sample. Either path given by user or default path used by GenomeAlignment step. Return type: string
-
get_validation_attributes
(sample, star_index_directory_path, star_bin_path)[source]¶ Prepare attributes required by is_output_valid() function to validate output generated by the STAR genome job corresponding to the given sample.
Parameters: - sample (Sample) – Sample defining the FASTQ files to be aligned, or the pre-aligned BAM.
- star_index_directory_path (string) – Path to directory containing STAR index. [Note: this parameter is captured just so get_validation_attributes() accepts the same arguments as get_commandline_call(). It is not used here.]
- star_bin_path (string) – Path to STAR executable binary. [Note: this parameter is captured just so get_validation_attributes() accepts the same arguments as get_commandline_call(). It is not used here.]
Returns: The GenomeAlignment job’s data directory, sampleID, BAM path, and a flag indicating whether or not the user provided a pre-aligned BAM file.
Return type: dict
-
static
is_output_valid
(validation_attributes)[source]¶ Check if output of GenomeAlignment for a specific job/execution is correctly formed and valid, given a job’s data directory, sample id, BAM file path, and a flag indicating whether or not the user provided a pre-aligned BAM file. If the user provided a pre-aligned BAM file, this method assumes that the BAM file is complete if it exists. If this script performed the alignment, it will check STAR log files to confirm the BAM file is complete.
Parameters: validation_attributes (dict) – A job’s data_directory, sample_id, path to the BAM file, and a flag indicating whether or not the user provided a pre-aligned BAM. Returns: True - GenomeAlignment output files were created and are well formed. False - GenomeAlignment output files do not exist or are missing data. Return type: boolean
-
-
class
camparee.genome_alignment.
GenomeBamIndexStep
(log_directory_path, data_directory_path, parameters=None)[source]¶ -
execute
(sample, bam_file_path)[source]¶ Build index of a given bam file.
Parameters: - sample (Sample) – Sample associated with BAM file to be indexed.
- bam_file_path (string) – BAM file to be indexed.
-
get_commandline_call
(sample, bam_file_path)[source]¶ Prepare command to execute the GenomeIndex from the command line, given all of the arguments used to run the execute() function.
Parameters: - sample (Sample) – Sample associated with BAM file to be indexed.
- bam_file_path (string) – BAM file to be indexed.
Returns: Command to execute on the command line. It will perform the same operations as a call to execute() with the same parameters.
Return type: string
-
get_validation_attributes
(sample, bam_file_path)[source]¶ Prepare attributes required by is_output_valid() function to validate output generated by the BAM index job corresponding to the given bam file.
Parameters: - sample (Sample) – Sample associated with BAM file to be indexed. [Note: this parameter is captured just so get_validation_attributes() accepts the same arguments as get_commandline_call(). It is not used here.]
- bam_file_path (string) – BAM file to be indexed.
Returns: Path to the BAM file indexed by this step.
Return type: dict
-
static
is_output_valid
(validation_attributes)[source]¶ Check if output of GenomeBamIndexStep for a specific job/execution is valid, given a job’s BAM file path.
Parameters: validation_attributes (dict) – The path to a job’s BAM file. Returns: True - BAM index file was created in same directory as the BAM file. False - BAM index file is missing from same directory as the BAM file. Return type: boolean
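As a rough illustration of the check described above, the sketch below looks for an index file next to the BAM file. The "bam_file_path" dictionary key and the ".bai" suffix appended to the full BAM filename are assumptions; the actual key names and index naming used by CAMPAREE are not given here.

```python
import os
import tempfile


def is_bam_index_present(validation_attributes):
    """Return True if <bam path>.bai exists alongside the BAM file."""
    bam_file_path = validation_attributes["bam_file_path"]  # assumed key
    return os.path.isfile(bam_file_path + ".bai")


# Usage with empty placeholder files standing in for real BAM data.
with tempfile.TemporaryDirectory() as tmp_dir:
    bam_path = os.path.join(tmp_dir, "sample1.bam")
    open(bam_path, "w").close()
    before_indexing = is_bam_index_present({"bam_file_path": bam_path})
    open(bam_path + ".bai", "w").close()
    after_indexing = is_bam_index_present({"bam_file_path": bam_path})
```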
-
Variant Finder Step¶
-
class
camparee.variants_finder.
PositionInfo
(chromosome, position)[source]¶ This class is meant to capture all the read data associated with a particular chromosome and position on the genome. It is used to ascertain whether this position actually holds a variant. If it does, the data is formatted into a string to be written into the variants file.
-
calculate_entropy
()[source]¶ Use the top two abundances (if two) of the variants for the given position to compute an entropy. If only one abundance is given, return 0. :return: entropy for the given position
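The exact formula is not given above. One plausible reading, shown here purely as an assumption, is the base-2 Shannon entropy of the top two abundances after normalizing them to sum to 1. The function name is hypothetical.

```python
import math


def top_two_entropy(abundances):
    """Shannon entropy (base 2) of the two largest abundances."""
    top = sorted(abundances, reverse=True)[:2]
    if len(top) < 2:
        return 0.0  # a single abundance carries no uncertainty
    total = sum(top)
    if total == 0:
        return 0.0
    return -sum(
        (count / total) * math.log2(count / total)
        for count in top
        if count > 0
    )


single = top_two_entropy([29])
balanced = top_two_entropy([10, 10])
skewed = top_two_entropy([29, 3])
```

Under this reading the entropy ranges from 0 (one variant dominates completely) to 1 (the top two variants are equally abundant).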
-
filter_reads
(min_abundance_threshold, reference_base)[source]¶ Filters out of this position any reads that are not considered true variants. Reads with a read count of only 1 are excluded to start with. At most, only the top two remaining reads are retained. The lesser of those two reads may also be removed if it does not satisfy the minimum abundance threshold criterion, which specifies that the percent contribution of the lesser variant reads to the total reads be equal to or greater than the threshold provided. In the event of a tie for one or both of those top two slots, preference is given to the reference base if it is included in the tie. If at any point in filtering, only one read remains and its description matches the reference base, it is removed, leaving no variants. Once complete, the reads for this position object contain only true variants (which may include the reference base if there is another true variant). :param min_abundance_threshold: criterion for minimum abundance threshold :param reference_base: the base of the reference genome at this position.
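The filtering rules above can be sketched on a plain dictionary of description to read count. This is a simplified illustration rather than the PositionInfo implementation. It assumes the threshold is a fraction between 0 and 1 and that "total reads" means the sum of the two retained counts; both are assumptions.

```python
def filter_reads(read_counts, min_abundance_threshold, reference_base):
    """Return the reads that survive filtering as {description: count}."""
    # Drop reads seen only once.
    candidates = {d: c for d, c in read_counts.items() if c > 1}
    # Rank by count (descending); on ties the reference base goes first.
    ranked = sorted(
        candidates.items(),
        key=lambda item: (-item[1], item[0] != reference_base),
    )
    top = ranked[:2]
    if len(top) == 2:
        total = top[0][1] + top[1][1]
        if top[1][1] / total < min_abundance_threshold:
            top = top[:1]  # lesser read fails the abundance criterion
    if len(top) == 1 and top[0][0] == reference_base:
        top = []  # only the reference base remains: no variant here
    return dict(top)


kept_both = filter_reads({"C": 29, "ITTT": 3, "G": 1}, 0.05, "C")
kept_none = filter_reads({"C": 29, "ITTT": 3}, 0.2, "C")
tie_break = filter_reads({"A": 5, "G": 5, "C": 5}, 0.1, "C")
```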
-
-
class
camparee.variants_finder.
Read
¶ A named tuple that possesses all the attributes of a variant. type: match (M), deletion (D), insertion (I); chromosome: chrN; position: position on ref genome; description: description of the variant (e.g., C, IAA, D5, etc.)
-
description
¶ Alias for field number 1
-
position
¶ Alias for field number 0
-
-
class
camparee.variants_finder.
VariantsFinderStep
(log_directory_path, data_directory_path, parameters={})[source]¶ This class creates a text file listing variants for those locations in the reference genome having variants. The variants include snps and indels with the number of reads attributed to each variant. The relevant bam-formatted input file is expected to be indexed and sorted.
This script outputs a file that gives the full breakdown at each location in the genome of the number of A’s, C’s, G’s and T’s as well as the number of each size of insertion and deletion. If it’s an insertion the sequence of the insertion itself is given. So for example a line of output like the following means 29 reads had a C in that location and three reads had an insertion of TTT. chr1:10128503 | C:29 | ITTT:3
Note that only the top two variants are kept and of those the lesser variant’s counts must meet certain user criteria (minimum threshold, read total count) to be considered a variant. Single reads that match the corresponding base in the reference genome are not variants and as such are not kept.
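For readers consuming the variants file, a line in the format shown above can be parsed along these lines. The function name is hypothetical and the sketch assumes the pipe-delimited layout of the example line.

```python
def parse_variant_line(line):
    """Split 'chr:pos | desc:count | ...' into its components."""
    fields = [field.strip() for field in line.split("|")]
    chromosome, position = fields[0].rsplit(":", 1)
    counts = {}
    for field in fields[1:]:
        description, count = field.rsplit(":", 1)
        counts[description] = int(count)
    return chromosome, int(position), counts


parsed = parse_variant_line("chr1:10128503 | C:29 | ITTT:3")
```

Here `parsed` holds the chromosome name, the position as an integer, and a dictionary mapping each variant description (C, ITTT) to its read count.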
-
call_variants
(chromosome, reads)[source]¶ Parses the reads dictionary (read named tuple:read count) for each chromosome - position to create a line with the variants and their counts delimited by pipes. Dumping each chromosome’s worth of data at a time is done to avoid too sizable a dictionary. Additionally, if the user requests a sort by entropy, this function will do that ordering and send that data to stdout. :param chromosome: chromosome under consideration here :param reads: dictionary of reads to read counts
-
collect_reads
(chromosome)[source]¶ Iterate over the input txt file containing cigar, seq, start location, chromosome for each read and consolidate reads for each position on the genome.
-
execute
(sample, alignment_file_path, chr_ploidy_data, reference_genome, seed=None, chromosomes=None)[source]¶ Entry point into variants_finder. Iterates over the chromosomes in the list provided by the chr_ploidy_data keys to pick out variants. Chromosomes that are not pertinent to the sample’s gender are skipped. If no sample gender is specified, only those chromosomes that have the same ploidy for both genders are processed. :param sample: The sample for which the variants are to be found :param chr_ploidy_data: dictionary of chromosomes as keys and a dictionary of male/female ploidy as values. :param reference_genome: A dictionary representation of the reference genome :param seed: Seed for random number generator :param chromosomes: A listing of chromosomes to replace the list obtained from the alignment file. Used for debugging purposes.
-
filter_chromosome_list
(sample, chr_ploidy_data)[source]¶ Culls from the chromosome list, those chromosomes that are either not relevant given the sample gender or not relevant because no sample gender was provided. :param sample: subject sample which contains gender information :param chr_ploidy_data: dictionary of chromosomes as keys and a dictionary of male/female ploidy as values.
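A sketch of the culling logic, taking the gender directly instead of a sample object. The "male" and "female" key names and the rule that a chromosome is irrelevant when its ploidy is 0 for the given gender are assumptions; the equal-ploidy rule for an unspecified gender follows the description of execute().

```python
def filter_chromosome_list(gender, chr_ploidy_data):
    """Return the chromosomes to process for the given gender (or None)."""
    kept = []
    for chromosome, ploidy in chr_ploidy_data.items():
        if gender is None:
            # No gender: keep only chromosomes with equal ploidy.
            if ploidy["male"] == ploidy["female"]:
                kept.append(chromosome)
        elif ploidy[gender] > 0:
            kept.append(chromosome)
    return kept


# Illustrative ploidy data in the assumed layout.
ploidy_data = {
    "chr1": {"male": 2, "female": 2},
    "chrX": {"male": 1, "female": 2},
    "chrY": {"male": 1, "female": 0},
}
unknown = filter_chromosome_list(None, ploidy_data)
female = filter_chromosome_list("female", ploidy_data)
male = filter_chromosome_list("male", ploidy_data)
```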
-
get_commandline_call
(sample, alignment_file_path, chr_ploidy_file_path, reference_genome_file_path, seed=None)[source]¶ Prepare command to execute the VariantsFinder from the command line, given all of the arguments used to run the execute() function.
Parameters: - sample (Sample) – Sample for which variants will be called.
- alignment_file_path (string) – Path to BAM file which will be parsed.
- chr_ploidy_file_path (string) – File that maps chromosome names to their male/female ploidy.
- reference_genome_file_path (string) – File that maps chromosome names in reference to nucleotide sequence.
- seed (integer) – Seed for random number generator. Used so repeated runs will produce the same results.
Returns: Command to execute on the command line. It will perform the same operations as a call to execute() with the same parameters.
Return type: string
-
get_validation_attributes
(sample, alignment_file_path, chr_ploidy_file_path, reference_genome_file_path, seed=None)[source]¶ Prepare attributes required by is_output_valid() function to validate output generated by the VariantsFinder job corresponding to the given sample.
Parameters: - sample (Sample) – Sample for which variants will be called.
- alignment_file_path (string) – Path to BAM file which will be parsed. [Note: this parameter is captured just so get_validation_attributes() accepts the same arguments as get_commandline_call(). It is not used here.]
- chr_ploidy_file_path (string) – File that maps chromosome names to their male/female ploidy. [Note: this parameter is captured just so get_validation_attributes() accepts the same arguments as get_commandline_call(). It is not used here.]
- reference_genome_file_path (string) – File that maps chromosome names in reference to nucleotide sequence. [Note: this parameter is captured just so get_validation_attributes() accepts the same arguments as get_commandline_call(). It is not used here.]
- seed (integer) – Seed for random number generator. Used so repeated runs will produce the same results. [Note: this parameter is captured just so get_validation_attributes() accepts the same arguments as get_commandline_call(). It is not used here.]
Returns: A VariantsFinder job’s data_directory, log_directory, and sample_id.
Return type: dict
-
identify_variant
(position_info, variants)[source]¶ Helper method to filter position reads to identify variants :param position_info: position being evaluated :param variants: growing list of variants to which this position may be added if it contains variants
-
static
is_output_valid
(validation_attributes)[source]¶ Check if output of VariantsFinder for a specific job/execution is correctly formed and valid, given a job’s data directory, log directory, and sample id. Prepare these attributes for a given sample’s jobs using the get_validation_attributes() method.
Parameters: validation_attributes (dict) – A job’s data_directory, log_directory, and sample_id. Returns: True - VariantsFinder output files were created and are well formed. False - VariantsFinder output files do not exist or are missing data. Return type: boolean
-
load_variants
(variants, variants_file_path)[source]¶ Load the variants to a file in the user’s designated output directory one chromosome at a time. The filename has the stem of the alignment filename suffixed with _variants.txt :param variants: variants list for one chromosome.
-
static
main
()[source]¶ Entry point into script. Allows script to be executed/submitted via the command line.
-
remove_clips
(cigar, sequence)[source]¶ Remove soft and hard clips at the beginning and end of the cigar string and remove soft and hard clips at the beginning of the seq as well. Modified cigar string and sequence are returned :param cigar: raw cigar string from read :param sequence: raw sequence string from read :return: tuple of modified cigar and sequence strings (sans clips)
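A minimal sketch of clip removal as described above. It relies on the SAM convention that soft-clipped bases (S) are present in the read sequence while hard-clipped bases (H) are not, so only leading soft clips shorten the sequence. Following the description, trailing clipped bases are left in the sequence. This is an illustration, not the CAMPAREE implementation.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def remove_clips(cigar, sequence):
    """Strip leading/trailing clips from cigar; trim leading soft clips."""
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    leading_soft = 0
    while ops and ops[0][1] in "SH":
        length, op = ops.pop(0)
        if op == "S":
            leading_soft += length  # these bases exist in the sequence
    while ops and ops[-1][1] in "SH":
        ops.pop()
    trimmed_cigar = "".join(f"{length}{op}" for length, op in ops)
    return trimmed_cigar, sequence[leading_soft:]


soft_clipped = remove_clips("2S5M1S", "AACCCCCT")
hard_clipped = remove_clips("3H4M", "ACGT")
```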
-
Intron Quantification Step¶
-
class
camparee.intron_quant.
IntronQuantificationStep
(log_directory_path, data_directory_path, parameters)[source]¶ -
execute
(aligned_file_path, output_directory, geneinfo_file_path)[source]¶ Entry point into the CAMPAREE step.
-
get_commandline_call
(aligned_file_path, output_directory, geneinfo_file_path)[source]¶ Prepare command to execute the IntronQuantification from the command line, given all of the arguments used to run the execute() method.
Parameters: - aligned_file_path (string) – Path to BAM file aligned to genome.
- output_directory (string) – Directory where the following output files will be saved: {CAMPAREE_CONSTANTS.INTRON_OUTPUT_FILENAME}, {CAMPAREE_CONSTANTS.INTRON_OUTPUT_ANTISENSE_FILENAME}, {CAMPAREE_CONSTANTS.INTERGENIC_OUTPUT_FILENAME}.
- geneinfo_file_path (string) – Geneinfo file in BED format with 1-based, inclusive coordinates.
Returns: Command to execute on the command line. It will perform the same operations as a call to execute() with the same parameters.
Return type: string
-
get_validation_attributes
(aligned_file_path, output_directory, geneinfo_file_path)[source]¶ Prepare attributes required by is_output_valid() function to validate output generated by the IntronQuantification job.
Parameters: - aligned_file_path (string) – Path to BAM file aligned to genome. [Note: this parameter is captured just so get_validation_attributes() accepts the same arguments as get_commandline_call(). It is not used here.]
- output_directory (string) – Directory where the following output files are saved: {CAMPAREE_CONSTANTS.INTRON_OUTPUT_FILENAME}, {CAMPAREE_CONSTANTS.INTRON_OUTPUT_ANTISENSE_FILENAME}, {CAMPAREE_CONSTANTS.INTERGENIC_OUTPUT_FILENAME}.
- geneinfo_file_path (string) – Geneinfo file in BED format with 1-based, inclusive coordinates. [Note: this parameter is captured just so get_validation_attributes() accepts the same arguments as get_commandline_call(). It is not used here.]
Returns: An IntronQuantification job’s output_directory.
Return type: dict
-
static
is_output_valid
(validation_attributes)[source]¶ Check if output of this step, for a specific job/execution, is correctly formed and valid, given the dictionary of validation attributes. Prepare these attributes for a given execution by calling the get_validation_attributes() method.
Parameters: validation_attributes (dict) – Key-value pairings of attributes generated by the get_validation_attributes() method. Returns: True - Output files for this step were created and are well formed. False - Output files for this step do not exist or are missing data. Return type: boolean
-
Variant Compilation Step¶
-
class
camparee.variants_compilation.
VariantsCompilationStep
(log_directory_path, data_directory_path, parameters=None)[source]¶ -
execute
(sample_id_list, chr_ploidy_data, reference_genome, seed=None)[source]¶ Entry point into variants_compilation.
Parameters: - sample_id_list (list) – List of sample IDs
- chr_ploidy_data (dict) – Dictionary of chromosomes as keys and a dictionary of male/female ploidy as values.
- reference_genome (dict) – Dictionary representation of the reference genome
- seed (int) – Seed for random number generator. Used so repeated runs will produce the same results.
-
get_commandline_call
(samples, chr_ploidy_file_path, reference_genome_file_path, seed=None)[source]¶ Prepare command to execute the VariantsCompilationStep from the command line, given all of the arguments used to run the execute() function.
Parameters: - samples (list) – List of Sample() objects for which variants have been called and need to be merged.
- chr_ploidy_file_path (string) – File that maps chromosome names to their male/female ploidy.
- reference_genome_file_path (string) – File that maps chromosome names in reference to nucleotide sequence.
- seed (integer) – Seed for random number generator. Used so repeated runs will produce the same results.
Returns: Command to execute on the command line. It will perform the same operations as a call to execute() with the same parameters.
Return type: string
-
get_validation_attributes
(samples, chr_ploidy_file_path, reference_genome_file_path, seed=None)[source]¶ Prepare attributes required by is_output_valid() function to validate output generated by the VariantsCompilationStep job.
Parameters: - samples (list) – List of Sample() objects for which variants have been called and need to be merged. [Note: this parameter is captured just so get_validation_attributes() accepts the same arguments as get_commandline_call(). It is not used here.]
- chr_ploidy_file_path (string) – File that maps chromosome names to their male/female ploidy. [Note: this parameter is captured just so get_validation_attributes() accepts the same arguments as get_commandline_call(). It is not used here.]
- reference_genome_file_path (string) – File that maps chromosome names in reference to nucleotide sequence. [Note: this parameter is captured just so get_validation_attributes() accepts the same arguments as get_commandline_call(). It is not used here.]
- seed (integer) – Seed for random number generator. Used so repeated runs will produce the same results. [Note: this parameter is captured just so get_validation_attributes() accepts the same arguments as get_commandline_call(). It is not used here.]
Returns: A VariantsCompilationStep run’s data_directory and log_directory.
Return type: dict
-
static
is_output_valid
(validation_attributes)[source]¶ Check if output of VariantsCompilationStep for a specific job/execution is correctly formed and valid, given the run’s data and log directories. Prepare these attributes using the get_validation_attributes() method.
Parameters: validation_attributes (dict) – A CAMPAREE run’s data_directory and log_directory. Returns: True - VariantsCompilationStep output files were created and are well formed. False - VariantsCompilationStep output files do not exist or are missing data.
Return type: boolean
-
Beagle Step¶
-
class
camparee.beagle.
BeagleStep
(log_directory_path, data_directory_path, parameters={})[source]¶ -
execute
(beagle_jar_path, seed=None)[source]¶ Entry point into the beagle step. This ends up running the Beagle jar from the command line.
Parameters: - beagle_jar_path (string) – Path to the beagle JAR file.
- seed (int) – Seed for random number generator. Used so repeated runs will produce the same results.
-
get_commandline_call
(beagle_jar_path, seed=None)[source]¶ Prepare command to execute the BeagleStep from the command line, given all of the arguments used to run the execute() function.
Parameters: - beagle_jar_path (string) – Path to the beagle JAR file.
- seed (int) – Seed for random number generator. Used so repeated runs will produce the same results.
Returns: Command to execute on the command line. It will perform the same operations as a call to execute() with the same parameters.
Return type: string
-
get_validation_attributes
(beagle_jar_path, seed=None)[source]¶ Prepare attributes required by is_output_valid() function to validate output generated by the BeagleStep job.
Parameters: - beagle_jar_path (string) – Path to the beagle JAR file. [Note: this parameter is captured just so get_validation_attributes() accepts the same arguments as get_commandline_call(). It is not used here.]
- seed (int) – Seed for random number generator. Used so repeated runs will produce the same results. [Note: this parameter is captured just so get_validation_attributes() accepts the same arguments as get_commandline_call(). It is not used here.]
Returns: A BeagleStep run’s data_directory and log_directory.
Return type: dict
-
static
is_output_valid
(validation_attributes)[source]¶ Check if output of BeagleStep for a specific job/execution is correctly formed and valid, given the run’s data and log directories. Prepare these attributes using the get_validation_attributes() method.
Parameters: validation_attributes (dict) – A CAMPAREE run’s data_directory and log_directory. Returns: True - BeagleStep output files were created and are well formed. False - BeagleStep output files do not exist or are missing data. Return type: boolean
-
Genome Builder Step¶
-
class
camparee.genome_builder.
Genome
(name, chromosome, start_sequence, start_position, genome_output_directory)[source]¶ Holds name, chromosome, current seq, current position (0 indexed) and current offset for a nascent, custom genome. The current offset is such that when it is added to the current position, one arrives at the corresponding position (0 indexed) on the reference genome. The object also provides methods for appending, inserting and deleting based upon instructions in the variants input file.
-
append_segment
(sequence)[source]¶ Append the given sequence segment to the custom genome. Since the sequence segment either has a one to one correspondence with that of the reference genome or is a sequence segment drawn from the reference genome, execution of this method does not alter the current position of the custom genome relative to the current position of the reference genome/variant. So position advances by the sequence segment length but offset remains unchanged. :param sequence: sequence segment to append
-
delete_segment
(length)[source]¶ Skip over (delete) a length of the reference sequence. Since the reference sequence is advancing while the custom sequence is not, the relative current position of the genome again changes relative to the current position of the reference sequence. As such, the current genome position does not advance but the offset increases by the length provided. :param length: number of bases in the reference sequence to skip over.
-
insert_segment
(sequence)[source]¶ Insert the given sequence segment into the custom genome. Since the given sequence segment does not correspond to anything in the reference genome, the current position of the custom genome relative to the current position of the reference genome/variant changes by the length of the sequence segment. Since the custom genome sequence is advancing while the reference sequence is not, the sequence segment length is subtracted from the offset while the genome current position is advanced by the length of the sequence segment. :param sequence: sequence segment to insert
-
save_to_file
()[source]¶ Saves the custom genome sequence into a single line of a fasta file. The genome name is suffixed to the given output filename stem. Since the genome sequence data is saved one chromosome at a time, the output file is appended to. That means that the output file should be empty when the first chromosome sequence is added. Since the sequence in memory is closed at this time, this genome can no longer be modified.
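The position/offset bookkeeping described for append_segment, insert_segment and delete_segment can be sketched as follows. This is a minimal illustration only: the class name CustomGenomeSketch and its attributes are hypothetical, and the real object also tracks name, chromosome and an output directory, which are omitted here.

```python
class CustomGenomeSketch:
    """Hypothetical stand-in showing only the position/offset bookkeeping."""

    def __init__(self):
        self.segments = []   # sequence pieces, joined on demand
        self.position = 0    # 0-indexed position in the custom genome
        self.offset = 0      # position + offset = reference position

    def append_segment(self, sequence):
        # Custom and reference genomes advance together: offset unchanged.
        self.segments.append(sequence)
        self.position += len(sequence)

    def insert_segment(self, sequence):
        # Only the custom genome advances, so the offset shrinks.
        self.segments.append(sequence)
        self.position += len(sequence)
        self.offset -= len(sequence)

    def delete_segment(self, length):
        # Only the reference advances, so the offset grows.
        self.offset += length


genome = CustomGenomeSketch()
genome.append_segment("ACGT")   # 4 reference bases copied
genome.delete_segment(2)        # 2 reference bases skipped
genome.insert_segment("TTT")    # 3 bases with no reference counterpart
genome.append_segment("GG")     # 2 more reference bases copied
reference_position = genome.position + genome.offset
```

After these four calls the custom genome holds 9 bases while 8 reference bases have been consumed, so the offset is -1.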
-
-
class
camparee.genome_builder.
GenomeBuilderStep
(log_directory_path, data_directory_path, parameters={})[source]¶ -
build_sequence_from_variant
(genome, variant, reference_base)[source]¶ Applies the variant provided to the custom genome provided in accordance with the variant’s format (e.g., D indicates delete followed by number of bases to delete, I indicates insert followed by bases to insert, and no D or I indicates a single base change). :param genome: custom genome to which the variant is applied :param variant: variant to apply :param reference_base: base to use in place of indels when the option to ignore indels is selected.
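As a rough illustration of the variant format described above, the following sketch classifies a variant string. The function name classify_variant and the exact string encoding (for example "D2" or "IGA") are assumptions made for this example; the actual encoding used by GenomeBuilderStep may differ.

```python
def classify_variant(variant):
    """Return (kind, payload) for a variant string (hypothetical encoding).

    "D<n>"      -> ("delete", n)         skip n reference bases
    "I<bases>"  -> ("insert", bases)     insert the given bases
    anything else -> ("substitute", variant)  single base change
    """
    if variant.startswith("D"):
        return "delete", int(variant[1:])
    if variant.startswith("I"):
        return "insert", variant[1:]
    return "substitute", variant


examples = [classify_variant(v) for v in ("D2", "IGA", "T")]
```

A caller could then dispatch on the first element to the delete, insert or append operation of the custom genome.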
-
execute
(sample, phased_vcf_file_path, chr_ploidy_data, reference_genome, chromosome_list=None)[source]¶ Entry point for genome builder. Uses chr_ploidy_data and reference_genome resources along with phased vcf data (Beagle-generated by default) and variant finder output to build two custom genomes.
Parameters: - sample (Sample) – Sample for which the genome is being built.
- phased_vcf_file_path (string) – VCF file of phased genotypes for each sample. Generated using Beagle by default, but can be provided by the user.
- chr_ploidy_data (dict) – Dictionary indicating chromosomes to be processed and their ploidy based on sample gender.
- reference_genome (dict) – Dictionary relating chr to its reference sequence.
- chromosome_list (list) – A debug feature that overrides the chr_ploidy_data chr list. Useful for testing a specific chromosome only or a small subset of chromosomes.
-
get_commandline_call
(sample, phased_vcf_file_path, chr_ploidy_file_path, reference_genome_file_path, chromosome_list=None)[source]¶ Prepare command to execute the GenomeBuilderStep from the command line, given all of the arguments used to run the execute() function.
Parameters: - sample (Sample) – Sample for which to construct parental genomes
- phased_vcf_file_path (string) – VCF file of phased genotypes for each sample. Generated using Beagle by default, but can be provided by the user.
- chr_ploidy_file_path (string) – File that maps chromosome names to their male/female ploidy.
- reference_genome_file_path (string) – File that maps chromosome names in reference to nucleotide sequence.
- chromosome_list (list) – A debug feature that overrides the chr_ploidy_data chr list. Useful for testing a specific chromosome only or a small subset of chromosomes.
Returns: Command to execute on the command line. It will perform the same operations as a call to execute() with the same parameters.
Return type: string
-
get_missing_chr_list
()[source]¶ Return a list of those chromosomes from chr_ploidy_data that are missing for the sample’s gender. If no sample gender is specified, only return a list of chromosomes from chr_ploidy_data where the chromosomes are missing for both genders (unlikely scenario). :return: list of chromosomes that are missing for this sample (likely owing to its gender)
-
get_paired_chr_list
()[source]¶ Return a list of those chromosomes from chr_ploidy_data that are paired for the sample’s gender. If no sample gender is specified, only return a list of chromosomes from chr_ploidy_data where the chromosomes are paired for both genders. :return: list of chromosomes that are paired for this sample (likely owing to its gender)
-
get_unpaired_chr_list
()[source]¶ Return a list of those chromosomes from chr_ploidy_data that are unpaired for the sample’s gender. If no sample gender is specified, only return a list of chromosomes from chr_ploidy_data where the chromosomes are unpaired for both genders. :return: list of chromosomes that are unpaired for this sample (likely owing to its gender)
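The three chromosome-list helpers above share one selection rule, which can be sketched as below. The layout of chr_ploidy_data (chromosome name mapped to a dict of per-gender ploidy) and the function name chromosomes_with_ploidy are assumptions for illustration; only the described behavior (use the sample's gender if known, otherwise require agreement across both genders) is taken from the text.

```python
def chromosomes_with_ploidy(chr_ploidy_data, ploidy, gender=None):
    """List chromosomes whose ploidy equals `ploidy`.

    With a gender, only that gender's ploidy is checked. Without one,
    the chromosome must have the requested ploidy for both genders.
    """
    genders = [gender] if gender else ["male", "female"]
    return [
        chrom
        for chrom, by_gender in chr_ploidy_data.items()
        if all(by_gender[g] == ploidy for g in genders)
    ]


# Hypothetical ploidy table: 2 = paired, 1 = unpaired, 0 = missing.
chr_ploidy_data = {
    "chr1": {"male": 2, "female": 2},
    "chrX": {"male": 1, "female": 2},
    "chrY": {"male": 1, "female": 0},
}

paired_male = chromosomes_with_ploidy(chr_ploidy_data, 2, "male")
unpaired_male = chromosomes_with_ploidy(chr_ploidy_data, 1, "male")
missing_female = chromosomes_with_ploidy(chr_ploidy_data, 0, "female")
paired_unknown = chromosomes_with_ploidy(chr_ploidy_data, 2)
```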
-
get_unpaired_chr_variant_data
()[source]¶ There should be at most one variant for any given position in an unpaired chromosome. This method groups the variant records by chromosome for those chromosomes found in the unpaired chr list, adds a single-instance variant to an unpaired_chr_variants list for every such variant found, and returns the list. :return: A list of all unpaired chromosome variants
-
get_validation_attributes
(sample, phased_vcf_file_path, chr_ploidy_file_path, reference_genome_file_path, chromosome_list=None)[source]¶ Prepare attributes required by is_output_valid() function to validate output generated by the GenomeBuilderStep job corresponding to the given sample.
Parameters: - sample (Sample) – Sample for which custom parental genomes will be generated.
- phased_vcf_file_path (string) – VCF file of phased genotypes for each sample. Generated using Beagle by default, but can be provided by the user. [Note: this parameter is captured just so get_validation_attributes() accepts the same arguments as get_commandline_call(). It is not used here.]
- chr_ploidy_file_path (string) – File that maps chromosome names to their male/female ploidy. [Note: this parameter is captured just so get_validation_attributes() accepts the same arguments as get_commandline_call(). It is not used here.]
- reference_genome_file_path (string) – File that maps chromosome names in reference to nucleotide sequence. [Note: this parameter is captured just so get_validation_attributes() accepts the same arguments as get_commandline_call(). It is not used here.]
- chromosome_list (list) – A debug feature that overrides the chr_ploidy_data chr list. Useful for testing a specific chromosome only or a small subset of chromosomes. [Note: this parameter is captured just so get_validation_attributes() accepts the same arguments as get_commandline_call(). It is not used here.]
Returns: A GenomeBuilderStep job’s data_directory, log_directory, sample_id, and a list of the genome names used by the GenomeBuilderStep to refer to each of the parental genomes (i.e. 1 and 2 for male and female parent, respectively).
Return type: dict
-
static
group_data
(lines, group_function)[source]¶ Returns data grouped by the provided function :param lines: the lines of data to be grouped :param group_function: The function to apply to determine the grouping. :return: a generator providing the next key (the grouping parameter) and the grouped data as a list.
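One way to get the behavior described for group_data is with itertools.groupby, as in the sketch below. Whether the real method uses groupby is an assumption; note that groupby only merges consecutive items, so this version expects the lines to be sorted (or at least contiguous) by the grouping key. The name group_data_sketch is hypothetical.

```python
from itertools import groupby


def group_data_sketch(lines, group_function):
    """Yield (key, list_of_lines) for each run of lines sharing a key."""
    for key, group in groupby(lines, key=group_function):
        yield key, list(group)


# Group tab-delimited records by their first column (the chromosome).
lines = ["chr1\t100\tA", "chr1\t250\tG", "chr2\t75\tC"]
grouped = list(group_data_sketch(lines, lambda line: line.split("\t")[0]))
```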
-
static
is_output_valid
(validation_attributes)[source]¶ Check if output of GenomeBuilderStep for a specific job/execution is correctly formed and valid, given a job’s data directory, log directory, and sample id. Prepare these attributes for a given sample’s jobs using the get_validation_attributes() method.
Parameters: validation_attributes (dict) – A job’s data_directory, log_directory, sample_id, and the list of genome names used by the GenomeBuilderStep to refer to each of the parental genomes (i.e. 1 and 2 for male and female parent, respectively). Returns: True - GenomeBuilderStep output files were created and are well formed. False - GenomeBuilderStep output files do not exist or are missing data. Return type: boolean
-
locate_sample
()[source]¶ Find the position of the sample in the phased vcf data :return: The position of the sample in a line of phased vcf data
-
static
main
()[source]¶ Entry point into script. Allows script to be executed/submitted via the command line.
-
make_paired_chromosome
(chromosome, sample_index)[source]¶ Here, the beagle data for the given sample is threaded together with the reference sequence to create a custom sequence for the given chromosome. Below is a snippet of a beagle vcf file for 6 samples along with a header. If phased vcf file comes from a program other than Beagle, it must match this format.
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample1 sample2 sample3 sample4 sample5 sample6
…
chr1 257558 . A G . PASS . GT 0|1 0|0 0|0 0|0 0|0 0|0
chr1 257559 . G C,GAG . PASS . GT 0|1 0|2 0|0 0|0 0|0 0|0
chr1 257560 . C A . PASS . GT 1|0 1|0 0|0 1|0 0|0 0|0
chr1 257570 . C CAA,CA . PASS . GT 1|0 0|0 0|0 0|0 0|1 2|0
Parameters: - chromosome – The chromosome for which the reference sequence is altered by phased vcf data.
- sample_index – identifies the position of the subject sample in the phased vcf data.
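To make the phased VCF format concrete, the sketch below reads the two phased alleles for one sample from a single record. It assumes that sample_index is the column index of the sample within the line (so sample1 is column 9); the function name alleles_for_sample is hypothetical, and the real method additionally threads these alleles into the reference sequence.

```python
def alleles_for_sample(vcf_line, sample_index):
    """Return the (first, second) phased alleles for one sample column."""
    fields = vcf_line.split()
    ref = fields[3]
    alt_alleles = fields[4].split(",")
    # Genotype index 0 refers to REF, 1 and up refer to the ALT entries.
    choices = [ref] + alt_alleles
    first, second = fields[sample_index].split("|")  # "|" marks phased calls
    return choices[int(first)], choices[int(second)]


line = "chr1 257559 . G C,GAG . PASS . GT 0|1 0|2 0|0 0|0 0|0 0|0"
sample1 = alleles_for_sample(line, 9)    # genotype 0|1
sample2 = alleles_for_sample(line, 10)   # genotype 0|2
```

Here sample1 carries G on one haplotype and C on the other, while sample2 carries G and the insertion allele GAG.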
-
make_reference_chromosome
(chromosome)[source]¶ Here, the reference sequence for the given chromosome is copied as is into the custom genomes. :param chromosome: The chromosome for which the reference sequence is used.
-
Update Annotation Step¶
-
class
camparee.update_annotation_for_genome.
UpdateAnnotationForGenomeStep
(log_directory_path, data_directory_path, parameters={})[source]¶ Updates a gene annotation’s coordinates to account for insertions & deletions (indels) introduced by GenomeFilesPreparation when it creates variant genomes. Note, this is designed to update annotation for a single variant genome at a time.
Parameters: - genome_indel_filename (string) –
Path to file containing list of indel locations generated by the GenomeFilesPreparation. This file has no header and contains three tab-delimited columns (example):
1:134937 D 2
1:138813 I 1
- First column = Chromosome and coordinate of indel in the original reference genome. Note, coordinate is zero-based.
- Second column = “D” if variant is a deletion, “I” if variant is an insertion.
- Third column = Length in bases of indel.
- input_annot_filename (string) –
Path to file containing gene/transcript annotations using coordinates to the original reference genome. This file should have 11, tab-delimited columns and includes a header (example):
chrom strand txStart txEnd exonCount exonStarts exonEnds transcriptID geneID geneSymbol biotype
1 + 11869 14409 3 11869,12613,13221 12227,12721,14409 ENST00000456328 ENSG00000223972 DDX11L1 pseudogene
An annotation file with this format can be generated from a GTF file using the convert_gtf_to_annot_file_format() function in the Utils package. A template for this annotation format is available in the class variable Utils.annot_output_format.
- updated_annot_filename (string) – Path to the output file containing gene/transcript annotations with coordinates updated to match the variant genome.
- log_filename (string) – Path to the log file.
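The idea of shifting annotation coordinates by the indels that precede them can be sketched as follows. This is a deliberately simplified illustration, not the step's actual algorithm: the function names are hypothetical, and it ignores features that overlap an indel as well as any conversion between the zero-based indel coordinates and the annotation's coordinate system.

```python
def parse_indels(lines):
    """Parse 'chrom:coord<TAB>D|I<TAB>length' records into tuples."""
    indels = []
    for line in lines:
        location, kind, length = line.rstrip("\n").split("\t")
        chrom, coord = location.rsplit(":", 1)
        indels.append((chrom, int(coord), kind, int(length)))
    return indels


def shift_coordinate(chrom, coord, indels):
    """Shift a reference coordinate by the net length of upstream indels."""
    shift = 0
    for indel_chrom, indel_coord, kind, length in indels:
        if indel_chrom == chrom and indel_coord < coord:
            # Insertions push downstream features right, deletions pull left.
            shift += length if kind == "I" else -length
    return coord + shift


indels = parse_indels(["1:134937\tD\t2", "1:138813\tI\t1"])
after_deletion = shift_coordinate("1", 135000, indels)   # one 2-base deletion upstream
after_both = shift_coordinate("1", 140000, indels)       # deletion and insertion upstream
```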
-
execute
(sample, genome_indel_suffix, input_annot_file_path, chr_ploidy_file_path)[source]¶ Main work-horse function that generates the updated annotation.
Parameters: - genome_indel_suffix (int) – Suffix to apply to obtain proper genome indel file. Should be 1 or 2.
- input_annot_filename (string) – Full path to annotation file with coordinates for reference genome.
-
get_commandline_call
(sample, genome_indel_suffix, input_annot_file_path, chr_ploidy_file_path)[source]¶ Prepare command to execute the UpdateAnnotationForGenomeStep from the command line, given all of the arguments used to run the execute() function.
Parameters: - sample (Sample) – Sample for which to update annotation to parental genomes
- genome_indel_suffix (string) – Suffix to apply to obtain proper genome indel file. This suffix is also used in the name of the updated annotation file.
- input_annot_file_path (string) – Full path to annotation file with coordinates for reference genome.
- chr_ploidy_file_path (string) – File that maps chromosome names to their male/female ploidy.
Returns: Command to execute on the command line. It will perform the same operations as a call to execute() with the same parameters.
Return type: string
-
get_validation_attributes
(sample, genome_indel_suffix, input_annot_file_path, chr_ploidy_file_path)[source]¶ Prepare attributes required by is_output_valid() function to validate output generated by the UpdateAnnotationForGenomeStep job corresponding to the given sample.
Parameters: - sample (Sample) – Sample for which to update annotation to parental genomes
- genome_indel_suffix (string) – Suffix to apply to obtain proper genome indel file. This suffix is also used in the name of the updated annotation file.
- input_annot_file_path (string) – Full path to annotation file with coordinates for reference genome.
- chr_ploidy_file_path (string) – File that maps chromosome names to their male/female ploidy.
Returns: An UpdateAnnotationForGenomeStep job’s data_directory, log_directory, sample_id, and the suffix used when building the updated genome sequence.
Return type: dict
-
static
is_output_valid
(validation_attributes)[source]¶ Check if output of UpdateAnnotationForGenomeStep for a specific job/ execution is correctly formed and valid, given a job’s data directory, log directory, and sample id. Prepare these attributes for a given sample’s jobs using the get_validation_attributes() method.
Parameters: validation_attributes (dict) – A job’s data_directory, log_directory, sample_id, and the suffix used when building the updated genome sequence. Returns: True - UpdateAnnotationForGenomeStep output files were created and are well formed. False - UpdateAnnotationForGenomeStep output files do not exist or are missing data. Return type: boolean
Transcriptome FASTA Preparation Step¶
-
class
camparee.transcriptome_fasta_preparation.
TranscriptomeFastaPreparationStep
(log_directory_path, data_directory_path, parameters={})[source]¶ Produces a transcriptome FASTA file, given a genome FASTA file, a file containing exon locations, and an annotation file. Additionally, any line in the annotation file related to a chromosome not available in the genome fasta file is discarded in a new, trimmed version of the annotation file.
The object is constructed with 2 input file sources (genome fasta, annotation) and 2 output file sources (trimmed annotation, transcriptome fasta). Additionally another output file, named like the genome fasta file but suffixed with ‘_edited’ contains a munged version of the genome fasta file where each chromosome sequence occupies one line.
-
create_exon_location_list
()[source]¶ Generate a unique listing of exon location strings from the provided annotation file. Note that the same exon may appear in multiple transcripts. So the listing is actually a set to avoid duplicate entries.
-
create_exon_sequence_map
(genome_chromosome, sequence)[source]¶ For the given genome chromosome and its sequence, create a dictionary of exon sequences keyed to the exon’s location (i.e., chr:start-end). :param genome_chromosome: given genome chromosome :param sequence: the genome sequence corresponding to the genome chromosome (without line breaks) :return: map of exon location : exon sequence
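A minimal sketch of building such a map is shown below. It assumes the location strings use 1-based, inclusive coordinates and passes the exon locations in explicitly; in the real class they presumably come from object state, and the coordinate convention should be checked against the annotation format. The name exon_sequence_map_sketch is hypothetical.

```python
def exon_sequence_map_sketch(genome_chromosome, sequence, exon_locations):
    """Map 'chr:start-end' strings to exon sequences for one chromosome."""
    exon_map = {}
    for location in exon_locations:
        chrom, span = location.rsplit(":", 1)
        if chrom != genome_chromosome:
            continue  # exon belongs to a different chromosome
        start, end = (int(value) for value in span.split("-"))
        # Assumed 1-based inclusive coordinates -> Python slice [start-1:end].
        exon_map[location] = sequence[start - 1:end]
    return exon_map


exon_locations = {"chr1:2-4", "chr1:7-10", "chr2:1-3"}
exon_map = exon_sequence_map_sketch("chr1", "ACGTACGTAC", exon_locations)
```

Because the locations are held in a set, an exon shared by several transcripts is extracted only once.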
-
execute
(sample_id, genome_suffix, genome_fasta_file_path, annotation_file_path, include_suffix_w_tx_id=False)[source]¶ Main work-horse function that does the work of creating a transcriptome fasta file from the provided inputs.
Parameters: - sample_id (string) – Identifier for sample corresponding to this reference genome. Used to construct output and log paths for this specific execution.
- genome_suffix (string) – Suffix to identify the parent/allele of the source genome. Should be 1 or 2. This same suffix is appended to all output files, and to the individual transcript IDs in the transcriptome FASTA if the include_suffix_w_tx_id parameter is set to True.
- genome_fasta_file_path (string) – Input genome fasta filename containing all the chromosomes of interest. No line breaks are allowed within the chromosome sequence. This is generally prepared by the GenomeBuilderStep, in which case it should have no line breaks within the chromosome sequence.
- annotation_file_path (string) – Input transcript annotation file - fields are (chromosome, strand, start, end, exon count, exon starts, exon ends, transcript ID, etc.). This is generally prepared by the UpdateAnnotationForGenomeStep.
- include_suffix_w_tx_id (boolean) – Append parent/allele suffix to transcript names in FASTA headers of the output file when set to True [Default: False].
-
get_commandline_call
(sample_id, genome_suffix, genome_fasta_file_path, annotation_file_path, include_suffix_w_tx_id=False)[source]¶ Prepare command to execute the TranscriptomeFastaPreparationStep from the command line, given all of the arguments used to run the execute() function.
Parameters: - sample_id (string) – Identifier for sample corresponding to this reference genome. Used to construct output and log paths for this specific execution.
- genome_suffix (string) – Suffix to identify the parent/allele of the source genome. Should be 1 or 2. This same suffix is appended to all output files, and to the individual transcript IDs in the transcriptome FASTA if the include_suffix_w_tx_id parameter is set to True.
- genome_fasta_file_path (string) – Input genome fasta filename containing all the chromosomes of interest. No line breaks are allowed within the chromosome sequence.
- annotation_file_path (string) – Input information about the transcripts - fields are (chromosome, strand, start, end, exon count, exon starts, exon ends, transcript ID, gene ID, gene symbol)
- include_suffix_w_tx_id (boolean) – Append parent/allele suffix to transcript names in FASTA headers of the output file when set to True [Default: False].
Returns: Command to execute on the command line. It will perform the same operations as a call to execute() with the same parameters.
Return type: string
-
get_validation_attributes
(sample_id, genome_suffix, genome_fasta_file_path, annotation_file_path, include_suffix_w_tx_id=False)[source]¶ Prepare attributes required by is_output_valid() function to validate output generated by the TranscriptomeFastaPreparationStep job corresponding to the given input files.
Parameters: - sample_id (string) – Identifier for sample corresponding to this reference genome. Used to construct output and log paths for this specific execution.
- genome_suffix (string) – Suffix to identify the parent/allele of the source genome. Should be 1 or 2. This same suffix is appended to all output files, and to the individual transcript IDs in the transcriptome FASTA if the include_suffix_w_tx_id parameter is set to True.
- genome_fasta_file_path (string) – Input genome fasta filename containing all the chromosomes of interest. No line breaks are allowed within the chromosome sequence.
- annotation_file_path (string) – Input information about the transcripts - fields are (chromosome, strand, start, end, exon count, exon starts, exon ends, transcript ID, gene ID, gene symbol)
- include_suffix_w_tx_id (boolean) – Append parent/allele suffix to transcript names in FASTA headers of the output file when set to True [Default: False].
Returns: A TranscriptomeFastaPreparationStep job’s data_directory, log_directory, input genome_fasta_file_path, input annotation_file_path, and output transcriptome_fasta_file_path used when creating the transcriptome FASTA files.
Return type: dict
-
static
is_output_valid
(validation_attributes)[source]¶ Check if output of TranscriptomeFastaPreparationStep for a specific job/execution is correctly formed and valid, given a job’s data directory, log directory, input genome FASTA filename, input annotation file, and output transcriptome FASTA filename. Prepare these attributes for a given job using the get_validation_attributes() method.
Parameters: validation_attributes (dict) – A job’s data_directory, log_directory, input genome_fasta_file_path, input annotation_file_path, and output transcriptome_fasta_file_path used when creating the transcriptome FASTA files. Returns: True - TranscriptomeFastaPreparationStep output files were created and are well formed. False - TranscriptomeFastaPreparationStep output files do not exist or are missing data. Return type: boolean
-
static
main
()[source]¶ Entry point into script when called directly.
Parses arguments, gathers input and output filenames, and calls methods that perform the actual operation.
-
scrub_genome_fasta_file
()[source]¶ Edits the genome fasta file, creating an edited version (genome fasta filename without extension + _edited.fa). Edits include: 1. Removing supplemental information from the description line 2. Removing internal newlines in the sequence 3. Ensuring all bases in the sequence are represented in upper case. This edited file is the one used in subsequent scripts.
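The three edits listed for scrub_genome_fasta_file can be sketched on an in-memory list of lines, as below. The function name scrub_fasta_lines is hypothetical, and "supplemental information" is assumed here to mean everything after the first whitespace on the description line; the real method works on files and may apply a different rule.

```python
def scrub_fasta_lines(lines):
    """Return FASTA lines with one header and one sequence line per record."""
    scrubbed = []
    sequence_parts = []
    for raw_line in lines:
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if sequence_parts:
                # Flush the previous record's sequence as a single line.
                scrubbed.append("".join(sequence_parts))
                sequence_parts = []
            scrubbed.append(line.split()[0])  # 1. keep only the identifier
        else:
            sequence_parts.append(line.upper())  # 3. upper-case the bases
    if sequence_parts:
        scrubbed.append("".join(sequence_parts))  # 2. no internal newlines
    return scrubbed


raw = [">chr1 dna:chromosome extra info\n", "acgt\n", "NNac\n", ">chr2\n", "ggCC\n"]
edited = scrub_fasta_lines(raw)
```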
-
Kallisto Index Generation & Quantification Steps¶
-
class
camparee.kallisto.
KallistoIndexStep
(log_directory_path, data_directory_path, parameters=None)[source]¶ Wrapper around generating a kallisto transcriptome index.
-
execute
(sample_id, genome_suffix, kallisto_bin_path, transcriptome_fasta_path)[source]¶ Build kallisto index from the given FASTA file of transcripts.
Parameters: - sample_id (string) – Identifier for sample corresponding to reference transcriptome. Used to construct index and log paths for this specific kallisto execution.
- genome_suffix (string) – Suffix to identify the parent/allele of the transcriptome. Should be 1 or 2. This same suffix is appended to all output files/directories.
- kallisto_bin_path (string) – Path to the kallisto executable binary.
- transcriptome_fasta_path (string) – Path to the FASTA file of transcripts, used as the basis for the kallisto index. This is generally prepared by the TranscriptomeFastaPreparationStep.
-
get_commandline_call
(sample_id, genome_suffix, kallisto_bin_path, transcriptome_fasta_path)[source]¶ Prepare command to execute the KallistoIndexStep from the command line, given all of the arguments used to run the execute() function.
Parameters: - sample_id (string) – Identifier for sample corresponding to reference transcriptome. Used to construct index and log paths for this specific kallisto execution.
- genome_suffix (string) – Suffix to identify the parent/allele of the transcriptome. Should be 1 or 2. This same suffix is appended to all output files/directories.
- kallisto_bin_path (string) – Path to the kallisto binary.
- transcriptome_fasta_path (string) – Path to the FASTA file of transcripts, used as the basis for the kallisto index. This is generally prepared by the TranscriptomeFastaPreparationStep.
Returns: Command to execute on the command line. It will perform the same operations as a call to execute() with the same parameters.
Return type: string
-
get_validation_attributes
(sample_id, genome_suffix, kallisto_bin_path, transcriptome_fasta_path)[source]¶ Prepare attributes required by is_output_valid() function to validate output generated by the KallistoIndexStep job.
Parameters: - sample_id (string) – Identifier for sample corresponding to reference transcriptome. Used to construct index and log paths for this specific kallisto execution.
- genome_suffix (string) – Suffix to identify the parent/allele of the transcriptome. Should be 1 or 2. This same suffix is appended to all output files/directories.
- kallisto_bin_path (string) – Path to the kallisto binary. [Note: this parameter is captured just so get_validation_attributes() accepts the same arguments as get_commandline_call(). It is not used here.]
- transcriptome_fasta_path (string) – Path to the FASTA file of transcripts, used as the basis for the kallisto index. This is generally prepared by the TranscriptomeFastaPreparationStep. [Note: this parameter is captured just so get_validation_attributes() accepts the same arguments as get_commandline_call(). It is not used here.]
Returns: A KallistoIndexStep job’s data_directory, log_directory, corresponding sample ID, and genome_suffix.
Return type: dict
-
static
is_output_valid
(validation_attributes)[source]¶ Check if output of KallistoIndexStep for a specific job/execution is correctly formed and valid, given a job’s data directory, log directory, sample ID, and genome suffix. Prepare these attributes for a given job using the get_validation_attributes() method.
Parameters: validation_attributes (dict) – A job’s data_directory, log_directory, corresponding sample_id, and genome_suffix used when creating the kallisto index. Returns: True - KallistoIndexStep output files were created and are well formed. False - KallistoIndexStep output files do not exist or are missing data. Return type: boolean
-
-
class
camparee.kallisto.
KallistoQuantStep
(log_directory_path, data_directory_path, parameters=None)[source]¶ Wrapper around quantifying transcript-level counts with kallisto.
-
execute
(sample, genome_suffix, kallisto_bin_path)[source]¶ Use kallisto to generate transcript-level quantifications from fastq files for a given sample.
Parameters: - sample (Sample) – Sample containing paths for FASTQ files for quantification.
- genome_suffix (string) – Suffix to identify the parent/allele of the transcriptome. Should be 1 or 2. This same suffix is appended to all output files/directories.
- kallisto_bin_path (string) – Path to the kallisto executable binary.
-
get_commandline_call
(sample, genome_suffix, kallisto_bin_path)[source]¶ Prepare command to execute the KallistoQuantStep from the command line, given all of the arguments used to run the execute() function.
Parameters: - sample (Sample) – Sample containing paths to FASTQ files for quantification.
- genome_suffix (string) – Suffix to identify the parent/allele of the transcriptome. Should be 1 or 2. This same suffix is appended to all output files/directories.
- kallisto_bin_path (string) – Path to the kallisto executable binary.
Returns: Command to execute on the command line. It will perform the same operations as a call to execute() with the same parameters.
Return type: string
-
get_validation_attributes
(sample, genome_suffix, kallisto_bin_path)[source]¶ Prepare attributes required by is_output_valid() function to validate output generated by the KallistoQuantStep job.
Parameters: - sample (Sample) – Sample containing paths to FASTQ files for quantification.
- genome_suffix (string) – Suffix to identify the parent/allele of the transcriptome. Should be 1 or 2. This same suffix is appended to all output files/directories.
- kallisto_bin_path (string) – Path to the kallisto executable binary. [Note: this parameter is captured just so get_validation_attributes() accepts the same arguments as get_commandline_call(). It is not used here.]
Returns: A KallistoQuantStep job’s data_directory, log_directory, corresponding sample ID, and genome_suffix.
Return type: dict
-
static
is_output_valid
(validation_attributes)[source]¶ Check if output of KallistoQuantStep for a specific job/execution is correctly formed and valid, given a job’s data directory, log directory, sample ID, and genome suffix. Prepare these attributes for a given job using the get_validation_attributes() method.
Parameters: validation_attributes (dict) – A job’s data_directory, log_directory, corresponding sample_id, and genome_suffix used when generating transcript-level quantifications. Returns: True - KallistoQuantStep output files were created and are well formed. False - KallistoQuantStep output files do not exist or are missing data. Return type: boolean
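For orientation, the kind of kallisto invocations these two wrappers stand for can be sketched as argument lists. The `index -i` and `quant -i ... -o ...` forms are standard kallisto usage for paired-end reads; the function names, the index and output paths, and the choice of flags are assumptions for this example, and the steps may pass additional options.

```python
def kallisto_index_command(kallisto_bin_path, transcriptome_fasta_path, index_path):
    """Build the argument list for creating a kallisto index."""
    return [kallisto_bin_path, "index", "-i", index_path, transcriptome_fasta_path]


def kallisto_quant_command(kallisto_bin_path, index_path, output_dir, fastq_1, fastq_2):
    """Build the argument list for paired-end quantification."""
    return [kallisto_bin_path, "quant", "-i", index_path, "-o", output_dir,
            fastq_1, fastq_2]


# Hypothetical paths; the real steps derive theirs from sample ID and suffix.
index_cmd = kallisto_index_command(
    "/opt/kallisto/kallisto", "transcriptome_1.fa", "transcriptome_1.kallisto.idx")
quant_cmd = kallisto_quant_command(
    "/opt/kallisto/kallisto", "transcriptome_1.kallisto.idx", "quant_1",
    "sample_R1.fastq", "sample_R2.fastq")
# Either list could then be run with subprocess.run(command, check=True).
```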
-
Bowtie2 Index Generation & Alignment Steps¶
-
class
camparee.bowtie2.
Bowtie2AlignStep
(log_directory_path, data_directory_path, parameters={})[source]¶ Wrapper around aligning reads with Bowtie2
-
execute
(sample, genome_suffix, bowtie2_bin_dir)[source]¶ Use Bowtie2 to align fastq files for a given sample to the reference transcriptome.
Parameters: - sample (Sample) – Sample containing paths for FASTQ files for alignment.
- genome_suffix (string) – Suffix to identify the parent/allele of the transcriptome. Should be 1 or 2. This same suffix is appended to all output files/directories.
- bowtie2_bin_dir (string) – Path to the directory containing the bowtie2 executable.
-
get_commandline_call
(sample, genome_suffix, bowtie2_bin_dir)[source]¶ Prepare command to execute the Bowtie2AlignStep from the command line, given all of the arguments used to run the execute() function.
Parameters: - sample (Sample) – Sample containing paths for FASTQ files for alignment.
- genome_suffix (string) – Suffix to identify the parent/allele of the transcriptome. Should be 1 or 2. This same suffix is appended to all output files/directories.
- bowtie2_bin_dir (string) – Path to the directory containing the bowtie2 executable.
Returns: Command to execute on the command line. It will perform the same operations as a call to execute() with the same parameters.
Return type: string
-
get_validation_attributes
(sample, genome_suffix, bowtie2_bin_dir)[source]¶ Prepare attributes required by is_output_valid() function to validate output generated by the Bowtie2AlignStep job.
Parameters: - sample (Sample) – Sample containing paths for FASTQ files for alignment. [Note: only the sample_id is used, but the full Sample object is required here so get_validation_attributes() accepts the same arguments as get_commandline_call().]
- genome_suffix (string) – Suffix to identify the parent/allele of the transcriptome. Should be 1 or 2. This same suffix is appended to all output files/directories.
- bowtie2_bin_dir (string) – Path to the directory containing the bowtie2 executable. [Note: this parameter is captured just so get_validation_attributes() accepts the same arguments as get_commandline_call(). It is not used here.]
Returns: A Bowtie2AlignStep job’s data_directory, log_directory, corresponding sample ID, and genome_suffix.
Return type: dict
-
static
is_output_valid
(validation_attributes)[source]¶ Check if output of Bowtie2AlignStep for a specific job/execution is correctly formed and valid, given a job’s data directory, log directory, sample ID, and genome suffix. Prepare these attributes for a given job using the get_validation_attributes() method.
Parameters: validation_attributes (dict) – A job’s data_directory, log_directory, corresponding sample_id, and genome_suffix used when aligning reads with Bowtie2. Returns: True - Bowtie2AlignStep output files were created and are well formed. False - Bowtie2AlignStep output files do not exist or are missing data. Return type: boolean
-
-
class
camparee.bowtie2.
Bowtie2IndexStep
(log_directory_path, data_directory_path, parameters={})[source]¶ Wrapper around generating a Bowtie2 index.
-
execute
(sample_id, genome_suffix, bowtie2_bin_dir, transcriptome_fasta_path)[source]¶ Build Bowtie2 index from the given FASTA file of transcripts.
Parameters: - sample_id (string) – Identifier for sample corresponding to reference transcriptome. Used to construct index and log paths for this specific Bowtie2 execution.
- genome_suffix (string) – Suffix to identify the parent/allele of the transcriptome. Should be 1 or 2. This same suffix is appended to all output files/directories.
- bowtie2_bin_dir (string) – Path to the directory containing the bowtie2-build executable.
- transcriptome_fasta_path (string) – Path to the FASTA file of transcripts, used as the basis for the Bowtie2 index. This is generally prepared by the TranscriptomeFastaPreparationStep.
-
get_commandline_call
(sample_id, genome_suffix, bowtie2_bin_dir, transcriptome_fasta_path)[source]¶ Prepare command to execute the Bowtie2IndexStep from the command line, given all of the arguments used to run the execute() function.
Parameters: - sample_id (string) – Identifier for sample corresponding to reference transcriptome. Used to construct index and log paths for this specific Bowtie2 execution.
- genome_suffix (string) – Suffix to identify the parent/allele of the transcriptome. Should be 1 or 2. This same suffix is appended to all output files/directories.
- bowtie2_bin_dir (string) – Path to the directory containing the bowtie2-build executable.
- transcriptome_fasta_path (string) – Path to the FASTA file of transcripts, used as the basis for the Bowtie2 index. This is generally prepared by the TranscriptomeFastaPreparationStep.
Returns: Command to execute on the command line. It will perform the same operations as a call to execute() with the same parameters.
Return type: string
-
get_validation_attributes
(sample_id, genome_suffix, bowtie2_bin_dir, transcriptome_fasta_path)[source]¶ Prepare attributes required by is_output_valid() function to validate output generated by the Bowtie2IndexStep job.
Parameters: - sample_id (string) – Identifier for sample corresponding to reference transcriptome. Used to construct index and log paths for this specific Bowtie2 execution.
- genome_suffix (string) – Suffix to identify the parent/allele of the transcriptome. Should be 1 or 2. This same suffix is appended to all output files/directories.
- bowtie2_bin_dir (string) – Path to the directory containing the bowtie2-build executable. [Note: this parameter is captured just so get_validation_attributes() accepts the same arguments as get_commandline_call(). It is not used here.]
- transcriptome_fasta_path (string) – Path to the FASTA file of transcripts, used as the basis for the Bowtie2 index. This is generally prepared by the TranscriptomeFastaPreparationStep. [Note: this parameter is captured just so get_validation_attributes() accepts the same arguments as get_commandline_call(). It is not used here.]
Returns: A Bowtie2IndexStep job’s data_directory, log_directory, corresponding sample ID, and genome_suffix.
Return type: dict
-
static
is_output_valid
(validation_attributes)[source]¶ Check if output of Bowtie2IndexStep for a specific job/execution is correctly formed and valid, given a job’s data directory, log directory, sample ID, and genome suffix. Prepare these attributes for a given job using the get_validation_attributes() method.
Parameters: validation_attributes (dict) – A job’s data_directory, log_directory, corresponding sample_id, and genome_suffix used when creating the Bowtie2 index. Returns: True - Bowtie2IndexStep output files were created and are well formed. False - Bowtie2IndexStep output files do not exist or are missing data. Return type: boolean
-
Transcript Quantification Step¶
-
class
camparee.transcript_gene_quant.
TranscriptGeneQuantificationStep
(log_directory_path, data_directory_path, parameter=None)[source]¶ This class takes a kallisto output file and generates transcript- and gene-level quantification files, and a file of PSI (Percent Spliced In) values for alternative spliceforms.
-
create_transcript_gene_map
()[source]¶ Create dictionary to map transcript id to gene id using geneinfo file
-
execute
(sample_id, tx_abundance_file_path, annotation_file_path)[source]¶ Main work-horse function that generates transcript, gene, and PSI count files from transcript-level kallisto data.
Parameters: - sample_id (string) – Identifier for sample corresponding to the input kallisto file. Used to construct output and log paths for this specific execution.
- tx_abundance_file_path (string) – File of transcript abundances created by kallisto. Likely generated by KallistoQuantStep.
- annotation_file_path (string) – Input transcript annotation file. Used to map transcript IDs to gene IDs. This is generally prepared by the UpdateAnnotationForGenomeStep.
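The relationship between the three outputs can be sketched as follows: gene-level values are sums over a gene's transcripts, and a transcript's PSI is its share of its gene's total. This is a minimal sketch under assumptions; the function name and the in-memory dictionaries are hypothetical, and the handling of genes with zero total abundance is a choice made here, not something this reference specifies.

```python
from collections import defaultdict


def gene_counts_and_psi(tx_counts, tx_to_gene):
    """Aggregate transcript abundances to genes and compute per-transcript PSI.

    tx_counts:  dict of transcript ID -> abundance (e.g. parsed from kallisto output)
    tx_to_gene: dict of transcript ID -> gene ID
    """
    gene_totals = defaultdict(float)
    for tx_id, count in tx_counts.items():
        gene_totals[tx_to_gene[tx_id]] += count

    psi = {}
    for tx_id, count in tx_counts.items():
        total = gene_totals[tx_to_gene[tx_id]]
        # Assumption: transcripts of an unexpressed gene get PSI 0.0.
        psi[tx_id] = count / total if total > 0 else 0.0
    return dict(gene_totals), psi


gene_totals, psi = gene_counts_and_psi(
    {"T1": 30.0, "T2": 10.0, "T3": 5.0},
    {"T1": "G1", "T2": "G1", "T3": "G2"},
)
```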
-
get_commandline_call
(sample_id, tx_abundance_file_path, annotation_file_path)[source]¶ Prepare command to execute the TranscriptGeneQuantificationStep from the command line, given all of the arguments used to run the execute() function.
Parameters: - sample_id (string) – Identifier for sample corresponding to the input kallisto file. Used to construct output and log paths for this specific execution.
- tx_abundance_file_path (string) – File of transcript abundances created by kallisto. Likely generated by KallistoQuantStep.
- annotation_file_path (string) – Input transcript annotation file. Used to map transcript IDs to gene IDs. This is generally prepared by the UpdateAnnotationForGenomeStep.
Returns: Command to execute on the command line. It will perform the same operations as a call to execute() with the same parameters.
Return type: string
-
get_validation_attributes
(sample_id, tx_abundance_file_path, annotation_file_path)[source]¶ Prepare attributes required by is_output_valid() function to validate output generated by the TranscriptGeneQuantificationStep job.
Parameters: - sample_id (string) – Identifier for sample corresponding to the input kallisto file. Used to construct output and log paths for this specific execution.
- tx_abundance_file_path (string) – File of transcript abundances created by kallisto. Likely generated by KallistoQuantStep. [Note: this parameter is captured just so get_validation_attributes() accepts the same arguments as get_commandline_call(). It is not used here.]
- annotation_file_path (string) – Input transcript annotation file. Used to map transcript IDs to gene IDs. This is generally prepared by the UpdateAnnotationForGenomeStep. [Note: this parameter is captured just so get_validation_attributes() accepts the same arguments as get_commandline_call(). It is not used here.]
Returns: A TranscriptGeneQuantificationStep job’s data_directory, log_directory, and corresponding sample ID.
Return type: dict
-
static
is_output_valid
(validation_attributes)[source]¶ Check if output of TranscriptGeneQuantificationStep for a specific job/execution is correctly formed and valid, given a job’s data directory, log directory, and sample ID. Prepare these attributes for a given job using the get_validation_attributes() method.
Parameters: validation_attributes (dict) – A job’s data_directory, log_directory, and corresponding sample_id. Returns: True - TranscriptGeneQuantificationStep output files were created and are well formed. False - TranscriptGeneQuantificationStep output files do not exist or are missing data.
Return type: boolean
-
Allelic Imbalance Quantification Step¶
-
class
camparee.allelic_imbalance_quant.
AllelicImbalanceQuantificationStep
(log_directory_path, data_directory_path, parameters=None)[source]¶ This class contains scripts to output quantification of allelic imbalance.
- It requires
- an input file source for gene info
- Root of the aligned filenames (alignment to transcriptome of each parent with suffixes ‘_1’,’_2’.)
There is one output file with quantification information on the allelic imbalance of genes. Fields in this file: chromosome, strand, start, end, exon count, exon starts, exon ends, gene name.
-
create_transcript_gene_map
()[source]¶ Create dictionary to map transcript id to gene id using geneinfo file. Map ‘*’ to ‘*’ to account for unmapped reads in align_file. Create entries with suffix ‘_1’ and ‘_2’ for each transcript.
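A minimal sketch of such a map is shown below. It assumes the gene info file is tab-delimited with transcript_id in column 8 and gene_id in column 9 and that header lines start with '#', matching the annotation layout described for convert_gtf_to_annot_file_format() in this reference. The function name is hypothetical, and mapping suffixed transcript IDs to unsuffixed gene IDs is an assumption.

```python
import io


def build_transcript_gene_map(annot_lines):
    """Map suffixed transcript IDs to gene IDs, plus '*' -> '*' for unmapped reads."""
    tx_to_gene = {"*": "*"}
    for line in annot_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        transcript_id, gene_id = fields[7], fields[8]
        # One entry per parental copy of the transcript.
        for suffix in ("_1", "_2"):
            tx_to_gene[transcript_id + suffix] = gene_id
    return tx_to_gene


example = io.StringIO(
    "#chrom\tstrand\ttxStart\ttxEnd\texonCount\texonStarts\texonEnds\ttranscript_id\tgene_id\n"
    "1\t+\t100\t500\t2\t100,300\t200,500\tT1\tG1\n"
)
mapping = build_transcript_gene_map(example)
```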
-
execute
(sample_id, genome_alignment_file_path, parent1_annot_file_path, parent2_annot_file_path, parent1_tx_align_file_path, parent2_tx_align_file_path)[source]¶ This is the main method which quantifies allelic imbalance for all genes in the annotation based on the aligned files for parents 1 and 2.
Parameters: - sample_id (string) – Identifier for sample corresponding to the input genome and transcriptome alignment files. Used to construct output and log paths for this specific execution.
- genome_alignment_file_path (string) – Input BAM file of reads aligned to the original reference genome. This is used to identify multimappers so they are excluded from the allelic imbalance quantification. This is generally prepared by GenomeAlignmentStep, or provided by the user.
- parent1_annot_file_path (string) – Input transcript annotation file for parent 1. This is generally prepared by UpdateAnnotationForGenomeStep.
- parent2_annot_file_path (string) – Input transcript annotation file for parent 2. This is generally prepared by UpdateAnnotationForGenomeStep.
- parent1_tx_align_file_path (string) – Input SAM file of reads aligned to the variant genome from parent 1. This is generally prepared by Bowtie2AlignStep.
- parent2_tx_align_file_path (string) – Input SAM file of reads aligned to the variant genome from parent 2. This is generally prepared by Bowtie2AlignStep.
-
get_commandline_call
(sample_id, genome_alignment_file_path, parent1_annot_file_path, parent2_annot_file_path, parent1_tx_align_file_path, parent2_tx_align_file_path)[source]¶ Prepare command to execute the AllelicImbalanceQuantificationStep from the command line, given all of the arguments used to run the execute() function.
Parameters: - sample_id (string) – Identifier for sample corresponding to the input genome and transcriptome alignment files. Used to construct output and log paths for this specific execution.
- genome_alignment_file_path (string) – Input BAM file of reads aligned to the original reference genome. This is used to identify multimappers so they are excluded from the allelic imbalance quantification. This is generally prepared by GenomeAlignmentStep, or provided by the user.
- parent1_annot_file_path (string) – Input transcript annotation file for parent 1. This is generally prepared by UpdateAnnotationForGenomeStep.
- parent2_annot_file_path (string) – Input transcript annotation file for parent 2. This is generally prepared by UpdateAnnotationForGenomeStep.
- parent1_tx_align_file_path (string) – Input SAM file of reads aligned to the variant genome from parent 1. This is generally prepared by Bowtie2AlignStep.
- parent2_tx_align_file_path (string) – Input SAM file of reads aligned to the variant genome from parent 2. This is generally prepared by Bowtie2AlignStep.
Returns: Command to execute on the command line. It will perform the same operations as a call to execute() with the same parameters.
Return type: string
-
get_validation_attributes
(sample_id, genome_alignment_file_path, parent1_annot_file_path, parent2_annot_file_path, parent1_tx_align_file_path, parent2_tx_align_file_path)[source]¶ Prepare attributes required by is_output_valid() function to validate output generated by the AllelicImbalanceQuantificationStep job.
Parameters: - sample_id (string) – Identifier for sample corresponding to the input genome and transcriptome alignment files. Used to construct output and log paths for this specific execution.
- genome_alignment_file_path (string) – Input BAM file of reads aligned to the original reference genome. This is used to identify multimappers so they are excluded from the allelic imbalance quantification. This is generally prepared by GenomeAlignmentStep, or provided by the user. [Note: this parameter is captured just so get_validation_attributes() accepts the same arguments as get_commandline_call(). It is not used here.]
- parent1_annot_file_path (string) – Input transcript annotation file for parent 1. This is generally prepared by UpdateAnnotationForGenomeStep. [Note: this parameter is captured just so get_validation_attributes() accepts the same arguments as get_commandline_call(). It is not used here.]
- parent2_annot_file_path (string) – Input transcript annotation file for parent 2. This is generally prepared by UpdateAnnotationForGenomeStep. [Note: this parameter is captured just so get_validation_attributes() accepts the same arguments as get_commandline_call(). It is not used here.]
- parent1_tx_align_file_path (string) – Input SAM file of reads aligned to the variant genome from parent 1. This is generally prepared by Bowtie2AlignStep. [Note: this parameter is captured just so get_validation_attributes() accepts the same arguments as get_commandline_call(). It is not used here.]
- parent2_tx_align_file_path (string) – Input SAM file of reads aligned to the variant genome from parent 2. This is generally prepared by Bowtie2AlignStep. [Note: this parameter is captured just so get_validation_attributes() accepts the same arguments as get_commandline_call(). It is not used here.]
Returns: An AllelicImbalanceQuantificationStep job’s data_directory, log_directory, and corresponding sample ID.
Return type: dict
-
static
is_output_valid
(validation_attributes)[source]¶ Check if output of AllelicImbalanceQuantificationStep for a specific job/execution is correctly formed and valid, given a job’s data directory, log directory, and sample ID. Prepare these attributes for a given job using the get_validation_attributes() method.
Parameters: validation_attributes (dict) – A job’s data_directory, log_directory, and corresponding sample_id. Returns: True - AllelicImbalanceQuantificationStep output files were created and are well formed. False - AllelicImbalanceQuantificationStep output files do not exist or are missing data.
Return type: boolean
-
static
main
()[source]¶ Entry point into script. Parses the argument list to obtain all the files needed and feeds them to the class constructor. Calls the appropriate methods thereafter.
-
read_info
(in_align_filename)[source]¶ Create dictionary which maps a read id in SAM file to a dictionary with two keys ‘transcript_id’ and ‘NM’. The value associated with ‘transcript_id’ is a list of all transcripts the read aligned to. The value associated with ‘NM’ is the corresponding edit distance information for each alignment. For non-mappers the transcript_id is ‘*’ and edit distance is 100 (Make it read length).
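The dictionary described above can be sketched with plain SAM parsing. This is an illustrative version, not the CAMPAREE implementation: the function name is hypothetical, it reads from an iterable of lines rather than a filename, and falling back to the placeholder distance when a mapped record has no NM tag is an assumption.

```python
def read_info_sketch(sam_lines, unmapped_edit_distance=100):
    """Map read ID -> {'transcript_id': [...], 'NM': [...]} from SAM records."""
    reads = {}
    for line in sam_lines:
        if line.startswith("@") or not line.strip():
            continue  # skip SAM header and blank lines
        fields = line.rstrip("\n").split("\t")
        read_id, transcript_id = fields[0], fields[2]  # QNAME and RNAME columns
        edit_distance = unmapped_edit_distance
        if transcript_id != "*":
            # Optional tags start at column 12; NM:i:<n> holds the edit distance.
            for tag in fields[11:]:
                if tag.startswith("NM:i:"):
                    edit_distance = int(tag[5:])
                    break
        entry = reads.setdefault(read_id, {"transcript_id": [], "NM": []})
        entry["transcript_id"].append(transcript_id)
        entry["NM"].append(edit_distance)
    return reads


sam = [
    "@HD\tVN:1.6",
    "r1\t0\tT1\t10\t42\t5M\t*\t0\t0\tACGTA\tIIIII\tAS:i:0\tNM:i:1",
    "r1\t256\tT2\t20\t1\t5M\t*\t0\t0\tACGTA\tIIIII\tNM:i:2",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGTA\tIIIII",
]
info = read_info_sketch(sam)
```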
Molecule Maker Step¶
-
class
camparee.molecule_maker.
MoleculeMakerStep
(log_directory_path, data_directory_path=None, parameters=None)[source]¶ MoleculeMaker generates molecules based off of gene, intron, and allelic quantification files as well as customized genomic sequence and annotation
-
execute
(sample, sample_data_directory, output_type, output_molecule_count, seed=None, molecules_per_packet=None, rng=None)[source]¶ This is the main method that generates simulated molecules and saves/ exports them in the desired format. It uses the gene, transcript, intron, and allelic imbalance distributions generated by the other CAMPAREE steps.
Parameters: - sample (Sample) – Sample object corresponding to the input distributions. When exporting molecule packets, this Sample object is used to instantiate the MoleculePacket object.
- sample_data_directory (string) – Path to directory containing the data for the sample.
- output_type (string) – Type of file or object used to save or export simulated molecules. Should be one of the keys of MoleculeMakerStep.OUTPUT_OPTIONS_W_EXTENSIONS.
- output_molecule_count (integer) – Total number of molecules to save/export for the current Sample.
- seed (integer) – [OPTIONAL] Seed for random number generator. Used so repeated runs can produce the same results.
- molecules_per_packet (integer) – [OPTIONAL] Maximum number of molecules in each molecule packet. Must be positive, non-zero integer (this is not currently checked).
- rng (numpy Generator) – [OPTIONAL] If provided, will use this for generating random numbers. Otherwise, uses default RNG
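The roles of seed, output_molecule_count, and molecules_per_packet can be illustrated with a small sketch that draws gene IDs in proportion to their quantified abundance and then groups the draws into packets. The standard-library random.Random is used here as a stand-in for the numpy Generator named above, and both function names are hypothetical. The explicit check on molecules_per_packet is an addition in this sketch; as noted above, the real method does not currently perform it.

```python
import random


def sample_gene_ids(gene_ids, gene_quants, output_molecule_count, seed=None):
    """Draw gene IDs with probability proportional to their quantification."""
    rng = random.Random(seed)  # stand-in for a seeded numpy Generator
    return rng.choices(gene_ids, weights=gene_quants, k=output_molecule_count)


def split_into_packets(molecules, molecules_per_packet):
    """Group molecules into packets holding at most molecules_per_packet each."""
    if molecules_per_packet <= 0:
        raise ValueError("molecules_per_packet must be a positive integer")
    return [molecules[i:i + molecules_per_packet]
            for i in range(0, len(molecules), molecules_per_packet)]


genes = ["G1", "G2", "G3"]
quants = [1.0, 0.0, 3.0]
first_run = sample_gene_ids(genes, quants, 10, seed=7)
second_run = sample_gene_ids(genes, quants, 10, seed=7)
packets = split_into_packets(first_run, 4)
```

Reusing the same seed reproduces the same draws, which is the property the seed parameter is meant to provide.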
-
get_commandline_call
(sample, sample_data_directory, output_type, output_molecule_count, seed=None, molecules_per_packet=None)[source]¶ Prepare command to execute the MoleculeMakerStep from the command line, given all of the arguments used to run the execute() function.
Parameters: - sample (Sample) – Sample object corresponding to the input distributions. When exporting molecule packets, this Sample object is used to instantiate the MoleculePacket object.
- sample_data_directory (string) – Path to directory containing the sample data
- output_type (string) – Type of file or object used to save or export simulated molecules. Should be one of the keys of MoleculeMakerStep.OUTPUT_OPTIONS_W_EXTENSIONS.
- output_molecule_count (integer) – Total number of molecules to save/export for the current Sample.
- seed (integer) – [OPTIONAL] Seed for random number generator. Used so repeated runs can produce the same results.
- molecules_per_packet (integer) – [OPTIONAL] Maximum number of molecules in each molecule packet. Must be positive, non-zero integer (this is not currently checked).
Returns: Command to execute on the command line. It will perform the same operations as a call to execute() with the same parameters.
Return type: string
-
get_validation_attributes
(sample, sample_data_directory, output_type, output_molecule_count, seed=None, molecules_per_packet=None)[source]¶ Prepare attributes required by is_output_valid() function to validate output generated by the MoleculeMakerStep job.
Parameters: - sample (Sample) – Sample object corresponding to the input distributions. When exporting molecule packets, this Sample object is used to instantiate the MoleculePacket object.
- sample_data_directory (string) – Path to directory containing all the sample data.
- output_type (string) – Type of file or object used to save or export simulated molecules. Should be one of the keys of MoleculeMakerStep.OUTPUT_OPTIONS_W_EXTENSIONS.
- output_molecule_count (integer) – Total number of molecules to save/export for the current Sample.
- seed (integer) – [OPTIONAL] Seed for random number generator. Used so repeated runs can produce the same results. [Note: this parameter is captured just so get_validation_attributes() accepts the same arguments as get_commandline_call(). It is not used here.]
- molecules_per_packet (integer) – [OPTIONAL] Maximum number of molecules in each molecule packet. Must be positive, non-zero integer (this is not currently checked).
Returns: A MoleculeMakerStep job’s sample_data_directory, log_directory, corresponding sample ID, output file type, output molecule count, and the number of molecules per packet.
Return type: dict
-
static
is_output_valid
(validation_attributes)[source]¶ Check if output of MoleculeMakerStep for a specific job/execution is correctly formed and valid, given a job’s data directory, log directory, sample ID, output file type, output molecule count, and the number of molecules per packet (if provided). Prepare these attributes for a given job using the get_validation_attributes() method.
Parameters: validation_attributes (dict) – A job’s data_directory, log_directory, corresponding sample_id, output file type, output molecule count, and the number of molecules per packet (if provided). Returns: True - MoleculeMakerStep output files were created and are well formed. False - MoleculeMakerStep output files do not exist or are missing data. Return type: boolean
-
load_allelic_quants
(file_path)[source]¶ Reads allelic quantification file into a dictionary: gene_id -> (allele 1 probability, allele 2 probability)
-
load_gene_quants
(file_path)[source]¶ Read in a gene quantification file as two lists of gene IDs and of their read quantifications
-
load_indels
(file_path, genome)[source]¶ Read in the file of indel locations for a given custom genome
Parameters: - file_path – path to the indel file
- genome – genomic sequences of this allele
The indel file is tab-separated with format “chrom:start type length” and looks like the following:
1:4897762 I 2
1:7172141 I 2
1:7172378 D 1
The file is assumed to be sorted by start, with no overlapping indels.
Returns a ‘split cigar string’, meaning a list of tuples (op, length), where op is one of M, I, D and length is the length of the match, insert, or deletion. Good for use with beers_utils.cigar.
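One way to turn such a file into split cigar tuples is sketched below. Several details are assumptions made only for this example: the function name, the use of reference chromosome lengths in place of the genome argument, the coordinate convention (start is taken to be the number of reference bases preceding the indel), and returning one list per chromosome.

```python
from collections import defaultdict


def indels_to_split_cigar(indel_lines, reference_lengths):
    """Convert sorted, non-overlapping indels to per-chromosome (op, length) lists."""
    indels_by_chrom = defaultdict(list)
    for line in indel_lines:
        if not line.strip():
            continue
        location, indel_type, length = line.split()
        chrom, start = location.split(":")
        indels_by_chrom[chrom].append((int(start), indel_type, int(length)))

    cigars = {}
    for chrom, chrom_length in reference_lengths.items():
        cigar, consumed = [], 0  # consumed = reference bases accounted for so far
        for start, indel_type, length in indels_by_chrom[chrom]:
            if start > consumed:
                cigar.append(("M", start - consumed))
            cigar.append((indel_type, length))
            # A deletion skips reference bases; an insertion consumes none.
            consumed = start + length if indel_type == "D" else start
        if chrom_length > consumed:
            cigar.append(("M", chrom_length - consumed))
        cigars[chrom] = cigar
    return cigars


cigars = indels_to_split_cigar(["1:5\tI\t2", "1:10\tD\t1"], {"1": 20, "2": 8})
```

In this example the M and D lengths on chromosome 1 add up to the reference length of 20, which is a quick consistency check on the output.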
-
load_intron_quants
(file_path)[source]¶ Load an intron quantification file as two dictionaries, (transcript ID -> sum FPK of all introns in transcript) and (transcript ID -> list of FPKs of each intron in transcript)
-
load_isoform_quants
(file_path)[source]¶ Reads an isoform quant file into a dictionary gene -> (list of transcript IDs, list of psi values)
-
static
main
()[source]¶ Entry point into script. Parses the argument list to obtain all the files needed and feeds them to the class constructor. Calls the appropriate methods thereafter.
-
Annotation Info¶
-
class
camparee.annotation_info.
AnnotationInfo
(geneinfo_file_path, chrom_lengths, flank_size=1500)[source]¶ Data structure containing all the information in a gene info file.
Stores genes, transcripts, intergenic regions, and exons both for easy access by ID and for quick lookup by position, using sorted lists by start position, allowing binary search.
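The position lookup enabled by start-sorted lists can be sketched with the standard bisect module. This is a simplified illustration rather than the class's actual code: the function name and the (start, end, name) tuples are hypothetical, coordinates are treated as inclusive, and regions are assumed not to overlap.

```python
import bisect


def find_region_at(regions, position):
    """Return the name of the region containing position, or None.

    regions: list of (start, end, name) tuples sorted by start, non-overlapping.
    """
    starts = [start for start, _end, _name in regions]
    # Index of the last region whose start is <= position.
    index = bisect.bisect_right(starts, position) - 1
    if index >= 0 and regions[index][1] >= position:
        return regions[index][2]
    return None


regions = [(100, 200, "geneA"), (300, 400, "geneB")]
hit = find_region_at(regions, 150)
miss = find_region_at(regions, 250)
```

In practice the list of starts would be built once and reused, so each lookup costs only the binary search.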
-
add_flanks
(flank_size)[source]¶ Add flanks to each genic region up to size flank_size on each end. These account for reads that go past the “ends” of the gene but should be thought of as belonging to that gene rather than to intergenic regions. Flanks are added as first/last introns to every transcript, so all transcripts will have at least two introns. Sometimes there is no room for a flank to be added, in which case the flank is given start/stop coordinates of 0,0.
Parameters: flank_size – Size of flanks to add Returns: flanked_genics, dictionary of genic regions, sorted by start position with flanks added
-
-
class
camparee.annotation_info.
Gene
(info, gene_id, chrom, strand, start, end, transcripts=None)[source]¶ Representation of a gene, including its transcripts.
start, end coordinates indicate the min/max of the start/end coordinates of all its transcripts
-
class
camparee.annotation_info.
Region
(info, chrom, strand, start, end, comment=None)[source]¶ Any genomic span of a chromosome. strand is +, -, or . if the region is not strand-specific (e.g., an intergenic region). ‘comment’ is any extra information to carry along, for the purposes of debugging/printing.
CAMPAREE Utils¶
-
exception
camparee.camparee_utils.
CampareeException
[source]¶ Base class for other Camparee exceptions.
-
class
camparee.camparee_utils.
CampareeUtils
[source]¶ Utilities for steps in the CAMPAREE expression pipeline.
-
static
convert_gtf_to_annot_file_format
(input_gtf_filename, output_annot_filename)[source]¶ Convert a GTF file to a tab-delimited annotation file with one line per transcript. Each line in the annotation file will have the following columns:
1 - chrom 2 - strand 3 - txStart 4 - txEnd 5 - exonCount 6 - exonStarts 7 - exonEnds 8 - transcript_id 9 - gene_id 10 - genesymbol 11 - biotype
This method derives transcript info from the “exon” lines in the GTF file and assumes the exons are listed in the order they appear in the transcript, as opposed to their genomic coordinates. The annotation file will list exons in order by plus-strand coordinates, so this method reverses the order of exons for all minus-strand transcripts.
Note, this function will add a header to the output file, marked by a ‘#’ character prefix.
See website for standard 9-column GTF specification: https://useast.ensembl.org/info/website/upload/gff.html
Parameters: - input_gtf_filename (string) – Path to GTF file to be converted to annotation file format.
- output_annot_filename (string) – Path to output file in annotation format.
Returns: Set of unique chromosome/contig names contained in input GTF file. Only GTF entries of “exon” feature type contribute to this set.
Return type: set
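The exon reordering for minus-strand transcripts can be sketched as follows. The function name is hypothetical, and writing exonStarts and exonEnds as comma-separated lists is an assumption; this reference names those columns but does not give their delimiter.

```python
def exons_to_annot_line(chrom, strand, exons, transcript_id, gene_id,
                        gene_symbol="", biotype=""):
    """Build one tab-delimited annotation line from a transcript's exons.

    exons: list of (start, end) tuples in transcript order, as assumed for the GTF.
    """
    # Minus-strand exons arrive in descending genomic order, so reverse them
    # to list exons by plus-strand coordinates.
    if strand == "-":
        exons = exons[::-1]
    exon_starts = ",".join(str(start) for start, _end in exons)
    exon_ends = ",".join(str(end) for _start, end in exons)
    fields = [chrom, strand, str(exons[0][0]), str(exons[-1][1]),
              str(len(exons)), exon_starts, exon_ends,
              transcript_id, gene_id, gene_symbol, biotype]
    return "\t".join(fields)


line = exons_to_annot_line("1", "-", [(300, 400), (100, 200)],
                           "T1", "G1", "SYM", "protein_coding")
```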
-
static
create_chr_ploidy_data
(chr_ploidy_file_path)[source]¶ Parses the chr_ploidy_data from its tab delimited resource file into a dictionary of dictionaries like so: {
‘1’: {‘male’: 2, ‘female’: 2}, ‘X’: {‘male’: 1, ‘female’: 2}, …} :param chr_ploidy_file_path: full path to the chr_ploidy data file :return: chr_ploidy_data expressed as a dictionary of dictionaries as shown above.
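A minimal sketch of this parsing is shown below. The column layout of the resource file is not given in this reference, so the header names used here (chr, male, female) are assumptions, as is the function name.

```python
import csv
import io


def parse_chr_ploidy(handle):
    """Parse tab-delimited ploidy data into {chrom: {'male': n, 'female': n}}."""
    # Assumed layout: a header row naming the columns chr, male, and female.
    reader = csv.DictReader(handle, delimiter="\t")
    ploidy = {}
    for row in reader:
        ploidy[row["chr"]] = {"male": int(row["male"]),
                              "female": int(row["female"])}
    return ploidy


example = io.StringIO("chr\tmale\tfemale\n1\t2\t2\nX\t1\t2\nY\t1\t0\n")
chr_ploidy_data = parse_chr_ploidy(example)
```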
-
static
create_genome
(genome_file_path)[source]¶ Creates a genome dictionary from the genome file located at the provided path (if compressed, it must have a gz extension). The file is assumed to contain the chromosome sequences without line breaks. :param genome_file_path: path to reference genome file (either compressed or not) :return: genome as a dictionary with the chromosomes/contigs as keys and the sequences as values.
-
static
create_oneline_seq_fasta
(input_fasta_file_path, output_oneline_fasta_file_path)[source]¶ Helper method to convert a FASTA file containing line breaks embedded within its sequences to a FASTA file containing each sequence on a single line.
Parameters: - input_fasta_file_path (string) – Path to input FASTA file with multi-line sequence data.
- output_oneline_fasta_file_path (string) – Path to output FASTA file to create, where single line sequences will be stored.
Returns: Set of unique chromosome/contig names contained in input FASTA file.
Return type: set
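The conversion can be sketched with file-like objects instead of paths. The function name is hypothetical, and taking the first whitespace-delimited token of each header as the chromosome/contig name is an assumption.

```python
import io


def to_oneline_fasta(in_handle, out_handle):
    """Write each FASTA record with its sequence on one line; return the names seen."""
    names = set()
    pending_newline = False  # True while a sequence line is still open
    for line in in_handle:
        line = line.rstrip("\r\n")
        if not line:
            continue
        if line.startswith(">"):
            if pending_newline:
                out_handle.write("\n")  # close the previous record's sequence
                pending_newline = False
            out_handle.write(line + "\n")
            names.add(line[1:].split()[0])
        else:
            out_handle.write(line)  # append sequence chunk without a line break
            pending_newline = True
    if pending_newline:
        out_handle.write("\n")
    return names


source = io.StringIO(">1 first contig\nACGT\nTTAA\n>2\nGG\nCC\n")
target = io.StringIO()
contig_names = to_oneline_fasta(source, target)
```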
-
static
open_file
(filename, mode='r')[source]¶ Helper method which can open gzipped files by checking the filename for the ‘.gz’ extension. If no ‘.gz’ extension found, this method uses the standard open() function.
Parameters: - filename (string) – Name of the file to open. If this filename has a ‘.gz’ extension, the function uses the gzip package. If not, it uses open() function.
- mode (string) – Access mode (e.g. ‘r’ - read, ‘w’ - write) passed to the open function. Note, to open a file in binary mode, need to explicitly end the mode code with a ‘b’. [DEFAULT: ‘r’ - open file for reading in text mode].
Returns: Pointer to the opened file.
Return type: file object
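The extension check described here can be sketched in a few lines. The function name is hypothetical. One detail is specific to this sketch: gzip.open() defaults to binary mode, so a 't' is appended unless the caller asked for binary, which keeps the text-mode default consistent between the two branches.

```python
import gzip


def open_maybe_gzipped(filename, mode="r"):
    """Open filename with gzip if it ends in '.gz', otherwise with open()."""
    if filename.endswith(".gz"):
        # gzip.open treats 'r'/'w' as binary, so request text mode explicitly
        # unless the caller already specified 'b' or 't'.
        gz_mode = mode if ("b" in mode or "t" in mode) else mode + "t"
        return gzip.open(filename, gz_mode)
    return open(filename, mode)
```

Either branch returns an object usable in a with statement, so calling code does not need to know whether the file is compressed.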
-
Abstract CAMPAREE Step¶
-
class
camparee.abstract_camparee_step.
AbstractCampareeStep
(*args, **kwargs)[source]¶ Abstract class defining the minimal methods required by a step in the CAMPAREE pipeline.
-
get_commandline_call
()[source]¶ Prepare command to execute the step from the command line, given all of the parameters used to call the execute() method.
Parameters: The same or equivalent parameters given to the execute() method. Returns: Command to execute on the command line. It will perform the same operations as a call to execute() with the same parameters. Return type: string
-
get_validation_attributes
()[source]¶ Prepare attributes required by the is_output_valid() method to validate output generated by executing this specific instance of the pipeline step (either through the command line call or the execute method).
Returns: Key-value pairings of attributes accepted by the is_output_valid() method. Return type: dict
-
static
is_output_valid
(validation_attributes)[source]¶ Check if output of this step, for a specific job/execution, is correctly formed and valid, given the dictionary of validation attributes. Prepare these attributes for a given execution by calling the get_validation_attributes() method.
Parameters: validation_attributes (dict) – Key-value pairings of attributes generated by the get_validation_attributes() method. Returns: True - Output files for this step were created and are well formed. False - Output files for this step do not exist or are missing data. Return type: boolean
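The contract described by this abstract class can be sketched with Python's abc module and a toy subclass. Everything below is hypothetical and written only to show how the four methods relate: AbstractStepSketch stands in for AbstractCampareeStep, MarkerFileStep is an invented step that writes one small file, and the module name in the command string does not exist.

```python
import os
from abc import ABC, abstractmethod


class AbstractStepSketch(ABC):
    """Stand-in for the abstract step interface (illustrative only)."""

    @abstractmethod
    def execute(self, *args, **kwargs): ...

    @abstractmethod
    def get_commandline_call(self, *args, **kwargs): ...

    @abstractmethod
    def get_validation_attributes(self, *args, **kwargs): ...

    @staticmethod
    @abstractmethod
    def is_output_valid(validation_attributes): ...


class MarkerFileStep(AbstractStepSketch):
    """Toy step that writes a one-line marker file per sample."""

    def __init__(self, log_directory_path, data_directory_path, parameters=None):
        self.log_directory_path = log_directory_path
        self.data_directory_path = data_directory_path

    def execute(self, sample_id):
        path = os.path.join(self.data_directory_path, f"{sample_id}.marker")
        with open(path, "w") as handle:
            handle.write("done\n")

    def get_commandline_call(self, sample_id):
        # Hypothetical module name; the real steps build their own commands.
        return f"python -m marker_file_step --sample_id {sample_id}"

    def get_validation_attributes(self, sample_id):
        return {"data_directory": self.data_directory_path,
                "log_directory": self.log_directory_path,
                "sample_id": sample_id}

    @staticmethod
    def is_output_valid(validation_attributes):
        path = os.path.join(validation_attributes["data_directory"],
                            f"{validation_attributes['sample_id']}.marker")
        return os.path.isfile(path) and os.path.getsize(path) > 0
```

Because is_output_valid() is static and takes only a dictionary, a scheduler can validate a finished job without holding on to the step instance that launched it.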
-
CAMPAREE Step Provider¶
-
class
camparee.camparee_step_provider.
CampareeStepProvider
[source]¶ Short summary.
-
__steps
¶ Dictionary mapping pipeline step name, which is accessible and used by the rest of the code base, to the corresponding camparee_step. The camparee_step class must extend the AbstractCampareeStep class.
Type: dict
-
get
(step_name)[source]¶ Return camparee_step class corresponding to the given step name.
Parameters: step_name (string) – Name of the step corresponding to a specific camparee step interface. Returns: Class providing interface to the camparee step. Return type: AbstractCampareeStep
-
list_supported_camparee_steps
()[source]¶ Return list of camparee_steps currently registered for use.
Returns: Camparee_steps currently registered for use. Return type: list
-
register_step
(step_name, step_interface, package_name='camparee')[source]¶ Add interface to a given camparee step so it’s accessible and useable by the rest of the code base.
Parameters: - step_name (string) – Name of the step corresponding to a specific pipeline step interface.
- step_interface (AbstractCampareeStep) – Class providing an interface to the camparee pipeline step.
- package_name (string) – Name of the package from which to load the interface class. [Default: camparee].
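A name-to-class registry of this kind can be sketched as below. The class and method signatures are hypothetical and differ from CampareeStepProvider: here a step is registered by class name and module name and resolved with importlib, which is only one plausible reading of the package_name parameter. Standard-library classes are used in place of real pipeline steps so the example is self-contained.

```python
import importlib


class StepRegistrySketch:
    """Map step names to classes loaded by name from a module (illustrative only)."""

    def __init__(self):
        self._steps = {}

    def register_step(self, step_name, class_name, module_name):
        # Import the module and look the class up by name.
        module = importlib.import_module(module_name)
        self._steps[step_name] = getattr(module, class_name)

    def get(self, step_name):
        if step_name not in self._steps:
            raise KeyError(f"No step registered under the name {step_name!r}")
        return self._steps[step_name]

    def list_supported_steps(self):
        return sorted(self._steps)


registry = StepRegistrySketch()
# Stand-ins for pipeline step classes, chosen only because they ship with Python.
registry.register_step("counter", "Counter", "collections")
registry.register_step("ordered", "OrderedDict", "collections")
```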
-