vcfparser package
Submodules
vcfparser.meta_header_parser module
vcfparser.metaviewer module
vcfparser.record_parser module
-
class
vcfparser.record_parser.
Record
(record_values, record_keys)[source]¶ Bases:
object
A class that converts a record line from the input VCF into an accessible record object.
-
__init__
(record_values, record_keys)[source]¶ Initializes the class with header keys and record values.
Parameters: - record_keys (list) –
- list of record keys generated for the record values
- generated from the line in the VCF that starts with #CHROM
- stays the same for a particular VCF file
- record_values (list) –
- list of record values generated from a VCF record line
- generated from the lines below the #CHROM line in the VCF file
- values are updated on each iteration over the records
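As a rough illustration of how the two lists relate, the sketch below pairs the #CHROM header fields with the fields of one record line using plain Python. It does not call vcfparser; the helper name `pair_keys_with_values` and the shortened sample line are assumptions made for the example.

```python
def pair_keys_with_values(header_line, record_line):
    """Pair the #CHROM header fields with the fields of one record line."""
    # Keys come from the header line; the leading '#' is not part of the key.
    record_keys = header_line.lstrip("#").rstrip("\n").split("\t")
    # Values come from a single tab-separated record line.
    record_values = record_line.rstrip("\n").split("\t")
    return dict(zip(record_keys, record_values))


header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tms01e"
line = "2\t15881018\t.\tG\tA,C\t5082.45\tPASS\tAC=2,0\tGT:PI\t./.:."
mapped = pair_keys_with_values(header, line)
print(mapped["CHROM"], mapped["POS"], mapped["ms01e"])
```

The keys are computed once per file, while the values change with every record line, which matches the description of the two parameters above.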
-
get_format_to_sample_map
(sample_names=None, formats=None, convert_to_iupac=None) Parameters: - sample_names (list) – list of sample names that need to be processed (default = all samples are processed)
- formats (list) – list of format tags that need to be processed (default = all format tags are processed)
- convert_to_iupac (list) – list of tags (from FORMAT) that need to be converted into IUPAC bases (default tag = ‘GT’, default output = numeric bases)
Returns: dict of filtered sample names along with filtered format “tags:values”
Return type: dict
Examples
>>> import vcfparser.vcf_parser as vcfparse
>>> myvcf = vcfparse.VcfParser("input_test.vcf")
>>> records = myvcf.parse_records()
>>> record = next(records)
>>> record.mapped_format_to_sample = {'ms01e': {'GT': './.','PI': '.', 'PC': '.'}, 'MA622': {'GT': '0/0', 'PI': '.', 'PC': '.'}, 'MA611': {'GT': '0/0', 'PI': '.', 'PC': '.'}}
>>> record.get_format_to_sample_map(sample_names=['ms01e', 'MA611'], formats=['GT', 'PC'])
{'ms01e': {'GT': './.', 'PC': '.'}, 'MA611': {'GT': '0/0', 'PC': '.'}}
>>> record.get_format_to_sample_map(sample_names=['ms01e', 'MA611'], formats=['GT', 'PC', 'PG'], convert_to_iupac=['GT', 'PG'])
{'ms01e': {'GT': './.', 'PC': '.'}, 'MA611': {'GT': '0/0', 'PC': '.'}}
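The filtering described above can be sketched without vcfparser as follows. This is a minimal sketch, assuming the FORMAT field and each sample field are colon-separated strings; the function name `format_to_sample_map` is hypothetical and IUPAC conversion is left out.

```python
def format_to_sample_map(format_str, sample_strs, sample_names=None, formats=None):
    """Map FORMAT tags to values per sample, keeping only requested samples and tags."""
    tags = format_str.split(":")
    result = {}
    for name, value in sample_strs.items():
        # None means "keep every sample".
        if sample_names is not None and name not in sample_names:
            continue
        tag_values = dict(zip(tags, value.split(":")))
        # None means "keep every FORMAT tag".
        if formats is not None:
            tag_values = {t: v for t, v in tag_values.items() if t in formats}
        result[name] = tag_values
    return result


samples = {"ms01e": "./.:.:.", "MA622": "0/0:.:.", "MA611": "0/0:.:."}
filtered = format_to_sample_map(
    "GT:PI:PC", samples, sample_names=["ms01e", "MA611"], formats=["GT", "PC"]
)
print(filtered)
```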
-
get_full_record_map
(convert_to_iupac=None)[source]¶ Maps record values with record keys.
Parameters: convert_to_iupac (list) – list of genotype tags that need to be converted into IUPAC bases (default tag = ‘GT’, default output = numeric bases) Returns: dict of key: value pairs in which INFO is expanded into a dict and the per-sample values are additionally mapped under ‘samples’ Return type: dict
Examples
>>> record.get_full_record_map() {'CHROM': '2', 'POS': '15881018', 'ID': '.', 'REF': 'G', 'ALT': 'A,C', 'QUAL': '5082.45', 'FILTER': 'PASS', 'INFO': {'AC': '2,0', 'AF': '1.00', 'AN': '8', 'BaseQRankSum': '-7.710e-01', 'ClippingRankSum': '0.00', 'DP': '902', 'ExcessHet': '0.0050', 'FS': '0.000', 'InbreedingCoeff': '0.8004', 'MLEAC': '12,1', 'MLEAF': '0.462,0.038', 'MQ': '60.29', 'MQRankSum': '0.00', 'QD': '33.99', 'ReadPosRankSum': '0.260', 'SF': '0,1,2,3,4,5,6', 'SOR': '0.657', 'set': 'HignConfSNPs'}, 'FORMAT': 'GT:PI:GQ:PG:PM:PW:AD:PL:DP:PB:PC', 'ms01e': './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', 'ms02g': './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', 'ms03g': './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', 'ms04h': '1/1:.:6:1/1:.:1/1:0,2:49,6,0,.,.,.:2:.:.', 'MA611': '0/0:.:78:0/0:.:0/0:29,0,0:0,78,1170,78,1170,1170:29:.:.', 'MA605': '0/0:.:9:0/0:.:0/0:3,0,0:0,9,112,9,112,112:3:.:.', 'MA622': '0/0:.:99:0/0:.:0/0:40,0,0:0,105,1575,105,1575,1575:40:.:.', 'samples': {'ms01e': {'GT': './.', 'PI': '.', 'GQ': '.', 'PG': './.', 'PM': '.', 'PW': './.', 'AD': '0,0', 'PL': '0,0,0,.,.,.', 'DP': '0', 'PB': '.', 'PC': '.'}, 'ms02g': {'GT': './.', 'PI': '.', 'GQ': '.', 'PG': './.', 'PM': '.', 'PW': './.', 'AD': '0,0', 'PL': '0,0,0,.,.,.', 'DP': '0', 'PB': '.', 'PC': '.'}, 'ms03g': {'GT': './.', 'PI': '.', 'GQ': '.', 'PG': './.', 'PM': '.', 'PW': './.', 'AD': '0,0', 'PL': '0,0,0,.,.,.', 'DP': '0', 'PB': '.', 'PC': '.'}, 'ms04h': {'GT': '1/1', 'PI': '.', 'GQ': '6', 'PG': '1/1', 'PM': '.', 'PW': '1/1', 'AD': '0,2', 'PL': '49,6,0,.,.,.', 'DP': '2', 'PB': '.', 'PC': '.'}, 'MA611': {'GT': '0/0', 'PI': '.', 'GQ': '78', 'PG': '0/0', 'PM': '.', 'PW': '0/0', 'AD': '29,0,0', 'PL': '0,78,1170,78,1170,1170', 'DP': '29', 'PB': '.', 'PC': '.'}, 'MA605': {'GT': '0/0', 'PI': '.', 'GQ': '9', 'PG': '0/0', 'PM': '.', 'PW': '0/0', 'AD': '3,0,0', 'PL': '0,9,112,9,112,112', 'DP': '3', 'PB': '.', 'PC': '.'}, 'MA622': {'GT': '0/0', 'PI': '.', 'GQ': '99', 'PG': '0/0', 'PM': '.', 'PW': '0/0', 'AD': '40,0,0', 
'PL': '0,105,1575,105,1575,1575', 'DP': '40', 'PB': '.', 'PC': '.'}}}
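The numeric-to-IUPAC conversion that `convert_to_iupac` refers to can be pictured with the sketch below. It is an assumption about the general idea (index 0 is REF, indices 1..n are the ALT alleles), not vcfparser's own code, and the helper name `genotype_to_bases` is hypothetical.

```python
import re


def genotype_to_bases(genotype, ref, alt):
    """Replace numeric allele indices in a genotype string with bases."""
    alleles = [ref] + alt.split(",")
    # Keep the separators ('/' or '|') by capturing them in the split.
    parts = re.split(r"([/|])", genotype)
    # Missing alleles ('.') and separators are passed through unchanged.
    converted = [alleles[int(p)] if p.isdigit() else p for p in parts]
    return "".join(converted)


# REF and ALT taken from the example record above (REF = 'G', ALT = 'A,C').
for gt in ("1/1", "0/0", "./.", "0|2"):
    print(gt, "->", genotype_to_bases(gt, "G", "A,C"))
```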
-
get_info_as_dict
(info_keys=None) Converts the INFO field to a dict for the required keys.
Parameters: info_keys (list) – keys of interest (default = all keys are mapped)
Returns: key: value pairs for only the required keys
Return type: dict
Notes
If a key has no ‘=’, its value is returned as ‘.’.
Examples
>>> info_str = 'AC=2,0;AF=1.00;AN=8;BaseQRankSum=-7.710e-01'
>>> info_keys = ['AC', 'BaseQRankSum']
>>> record.get_info_as_dict(info_keys)
{'AC': '2,0', 'BaseQRankSum': '-7.710e-01'}
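The behaviour in the note, where a key without ‘=’ maps to ‘.’, can be illustrated with a small stand-alone sketch. The function `info_to_dict` is a hypothetical stand-in written for this example and is not the library's implementation.

```python
def info_to_dict(info_str, info_keys=None):
    """Split a VCF INFO string into a dict, using '.' for keys without '='."""
    mapped = {}
    for field in info_str.split(";"):
        key, sep, value = field.partition("=")
        # partition() returns an empty separator when '=' is absent.
        mapped[key] = value if sep else "."
    if info_keys is None:
        return mapped
    # Keep only the requested keys, in the order they were requested.
    return {k: mapped[k] for k in info_keys if k in mapped}


print(info_to_dict("AC=2,0;AF=1.00;DB", ["AC", "DB"]))
```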
-
static
get_tag_values_from_samples
(order_mapped_samples, tag, sample_names, split_at=None) Extracts the values of a given tag for the given samples from the ordered dict of mapped samples, optionally splitting each value.
Parameters: - order_mapped_samples (OrderedDict) – Ordered dictionary of FORMAT tags mapped to SAMPLE values.
- tag (str) – One of the FORMAT tag.
- sample_names (list) – Name of the samples to extract the values from.
- split_at (str) – Character to split the value string at. e.g “|”, “/”, “,” etc.
Returns: list of lists containing the SAMPLE values for the FORMAT tag
Return type: list of lists
Examples
>>> from collections import OrderedDict
>>> order_mapped_samples = OrderedDict([('ms01e', {'GT': './.', 'PI': '.'}), ('MA622', {'GT': '0/0', 'PI': '.'})])
>>> tag = 'GT'
>>> sample_names = ['ms01e', 'MA622']
>>> record.get_tag_values_from_samples(order_mapped_samples, tag, sample_names)
[['./.'], ['0/0']]
>>> # use split_at="/|" to split the GT values at both "|" and "/"
>>> record.get_tag_values_from_samples(order_mapped_samples, tag, sample_names, split_at="/|")
[['.', '.'], ['0', '0']]
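One way to get the multi-character splitting shown in the example is a regular-expression character class. The sketch below assumes that each character in `split_at` is treated as a separate delimiter; `tag_values` is a hypothetical helper, not the library method.

```python
import re
from collections import OrderedDict


def tag_values(order_mapped_samples, tag, sample_names, split_at=None):
    """Collect one tag's value per sample, optionally split at any given character."""
    values = []
    for name in sample_names:
        value = order_mapped_samples[name][tag]
        if not split_at:
            values.append([value])
        else:
            # Build a character class so "/|" splits at '/' and at '|'.
            pattern = "[" + re.escape(split_at) + "]"
            values.append(re.split(pattern, value))
    return values


mapped = OrderedDict(
    [("ms01e", {"GT": "./.", "PI": "."}), ("MA622", {"GT": "0/0", "PI": "."})]
)
unsplit = tag_values(mapped, "GT", ["ms01e", "MA622"])
split = tag_values(mapped, "GT", ["ms01e", "MA622"], split_at="/|")
print(unsplit, split)
```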
-
hasAllele
(allele='0', tag='GT', bases='numeric') Parameters: - allele (str) – allele to check for in the given samples (default = ‘0’)
- tag (str) – format tags of interest (default = ‘GT’)
- bases (str) – iupac or numeric (default = ‘numeric’)
Returns: dict of sample with values having given allele
Return type: dict
Example
>>> record.hasAllele(allele='0', tag='GT', bases='numeric')
{'MA611': '0/0', 'MA605': '0/0', 'MA622': '0/0'}
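The allele check can be thought of as a membership test on the split genotype. The following is a minimal sketch under that assumption, using a hypothetical helper `samples_with_allele` and the GT values of the example record; it is not how vcfparser is necessarily implemented.

```python
import re


def samples_with_allele(sample_genotypes, allele="0"):
    """Return the samples whose genotype contains the given allele."""
    return {
        name: gt
        for name, gt in sample_genotypes.items()
        # Split at either separator so phased and unphased calls both work.
        if allele in re.split(r"[/|]", gt)
    }


genotypes = {
    "ms01e": "./.", "ms04h": "1/1",
    "MA611": "0/0", "MA605": "0/0", "MA622": "0/0",
}
with_ref = samples_with_allele(genotypes, allele="0")
print(with_ref)
```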
-
hasVAR
(genotype='0/0', tag='GT', bases='numeric') Parameters: - genotype (str) – genotype to check for in the given samples (default = ‘0/0’)
- tag (str) – format tags of interest (default = ‘GT’)
- bases (str) – iupac or numeric (default = ‘numeric’)
Returns: dict of sample with values having given genotype
Return type: dict
Example
>>> record.hasVAR(genotype='0/0')
{'MA611': '0/0', 'MA605': '0/0', 'MA622': '0/0'}
-
has_phased
(tag='GT', bases='numeric')[source]¶ Parameters: - tag (str) – format tags of interest (default = ‘GT’)
- bases (str) – iupac or numeric (default = ‘numeric’)
Returns: dict of samples whose value for the tag contains ‘|’ (phased)
Return type: dict
Examples
>>> rec_keys = ['CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO', 'FORMAT', 'ms01e', 'ms02g', 'ms03g', 'ms04h', 'MA611', 'MA605', 'MA622']
>>> rec_values = ['2', '15881018', '.', 'G', 'A,C', '5082.45', 'PASS', 'AC=2,0;AF=1.00;AN=8;BaseQRankSum=-7.710e-01;ClippingRankSum=0.00;DP=902;ExcessHet=0.0050;FS=0.000;InbreedingCoeff=0.8004;MLEAC=12,1;MLEAF=0.462,0.038;MQ=60.29;MQRankSum=0.00;QD=33.99;ReadPosRankSum=0.260;SF=0,1,2,3,4,5,6;SOR=0.657;set=HignConfSNPs', 'GT:PI:GQ:PG:PM:PW:AD:PL:DP:PB:PC', './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', '1/1:.:6:1/1:.:1/1:0,2:49,6,0,.,.,.:2:.:.', '0/0:.:78:0/0:.:0/0:29,0,0:0,78,1170,78,1170,1170:29:.:.', '0/0:.:9:0/0:.:0/0:3,0,0:0,9,112,9,112,112:3:.:.', '0/0:.:99:0/0:.:0/0:40,0,0:0,105,1575,105,1575,1575:40:.:.']
>>> from vcfparser.record_parser import Record
>>> rec_obj = Record(rec_values, rec_keys)
>>> rec_obj.has_phased(tag="GT", bases="iupac")
{}
-
has_unphased
(tag='GT', bases='numeric')[source]¶ Parameters: - tag (str) – format tags of interest (default = ‘GT’)
- bases (str) – iupac or numeric (default = ‘numeric’)
Returns: dict of samples whose value for the tag contains ‘/’ (unphased)
Return type: dict
Examples
>>> rec_keys = ['CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO', 'FORMAT', 'ms01e', 'ms02g', 'ms03g', 'ms04h', 'MA611', 'MA605', 'MA622']
>>> rec_values = ['2', '15881018', '.', 'G', 'A,C', '5082.45', 'PASS', 'AC=2,0;AF=1.00;AN=8;BaseQRankSum=-7.710e-01;ClippingRankSum=0.00;DP=902;ExcessHet=0.0050;FS=0.000;InbreedingCoeff=0.8004;MLEAC=12,1;MLEAF=0.462,0.038;MQ=60.29;MQRankSum=0.00;QD=33.99;ReadPosRankSum=0.260;SF=0,1,2,3,4,5,6;SOR=0.657;set=HignConfSNPs', 'GT:PI:GQ:PG:PM:PW:AD:PL:DP:PB:PC', './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', '1/1:.:6:1/1:.:1/1:0,2:49,6,0,.,.,.:2:.:.', '0/0:.:78:0/0:.:0/0:29,0,0:0,78,1170,78,1170,1170:29:.:.', '0/0:.:9:0/0:.:0/0:3,0,0:0,9,112,9,112,112:3:.:.', '0/0:.:99:0/0:.:0/0:40,0,0:0,105,1575,105,1575,1575:40:.:.']
>>> from vcfparser.record_parser import Record
>>> rec_obj = Record(rec_values, rec_keys)
>>> rec_obj.has_unphased(tag="GT", bases="iupac")
{'ms01e': './.', 'ms02g': './.', 'ms03g': './.', 'ms04h': 'A/A', 'MA611': 'G/G', 'MA605': 'G/G', 'MA622': 'G/G'}
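In VCF, phased genotypes use ‘|’ between alleles and unphased genotypes use ‘/’. The sketch below separates samples on that basis; `split_by_phase` is a hypothetical helper for illustration and does not reproduce vcfparser internals.

```python
def split_by_phase(sample_genotypes):
    """Separate samples into phased ('|') and unphased ('/') genotype calls."""
    phased = {n: g for n, g in sample_genotypes.items() if "|" in g}
    unphased = {n: g for n, g in sample_genotypes.items() if "/" in g}
    return phased, unphased


# The first three values mirror the example record; 'demo' is an invented phased call.
genotypes = {"ms01e": "./.", "ms04h": "1/1", "MA611": "0/0", "demo": "0|1"}
phased, unphased = split_by_phase(genotypes)
print(phased)
print(unphased)
```

With only the samples of the example record, the phased dict would be empty, which agrees with the `{}` returned by `has_phased` above.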
-
isHETVAR
(tag='GT', bases='numeric')[source]¶ Parameters: - tag (str) – format tags of interest (default = ‘GT’)
- bases (str) – iupac or numeric (default = ‘numeric’)
Returns: dict of samples whose value for the tag is a heterozygous variant
Return type: dict
Examples
>>> record.isHETVAR(tag="GT", bases="numeric")
{}
-
isHOMREF
(tag='GT', bases='numeric')[source]¶ Parameters: - tag (str) – format tags of interest (default = ‘GT’)
- bases (str) – iupac or numeric (default = ‘numeric’)
Returns: dict of samples whose value for the tag is homozygous reference
Return type: dict
Examples
>>> rec_keys = ['CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO', 'FORMAT', 'ms01e', 'ms02g', 'ms03g', 'ms04h', 'MA611', 'MA605', 'MA622']
>>> rec_values = ['2', '15881018', '.', 'G', 'A,C', '5082.45', 'PASS', 'AC=2,0;AF=1.00;AN=8;BaseQRankSum=-7.710e-01;ClippingRankSum=0.00;DP=902;ExcessHet=0.0050;FS=0.000;InbreedingCoeff=0.8004;MLEAC=12,1;MLEAF=0.462,0.038;MQ=60.29;MQRankSum=0.00;QD=33.99;ReadPosRankSum=0.260;SF=0,1,2,3,4,5,6;SOR=0.657;set=HignConfSNPs', 'GT:PI:GQ:PG:PM:PW:AD:PL:DP:PB:PC', './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', '1/1:.:6:1/1:.:1/1:0,2:49,6,0,.,.,.:2:.:.', '0/0:.:78:0/0:.:0/0:29,0,0:0,78,1170,78,1170,1170:29:.:.', '0/0:.:9:0/0:.:0/0:3,0,0:0,9,112,9,112,112:3:.:.', '0/0:.:99:0/0:.:0/0:40,0,0:0,105,1575,105,1575,1575:40:.:.']
>>> from vcfparser.record_parser import Record
>>> rec_obj = Record(rec_values, rec_keys)
>>> rec_obj.isHOMREF(tag="GT", bases="iupac")
{'MA611': 'G/G', 'MA605': 'G/G', 'MA622': 'G/G'}
-
isHOMVAR
(tag='GT', bases='numeric')[source]¶ Parameters: - tag (str) – format tags of interest (default = ‘GT’)
- bases (str) – iupac or numeric (default = ‘numeric’)
Returns: dict of samples whose value for the tag is a homozygous variant
Return type: dict
Examples
>>> record.isHOMVAR(tag="GT", bases="iupac")
{'ms01e': './.', 'ms02g': './.', 'ms03g': './.', 'ms04h': 'A/A', 'MA611': 'G/G', 'MA605': 'G/G', 'MA622': 'G/G'}
-
isMissing
(tag='GT') Parameters: tag (str) – format tags of interest (default = ‘GT’)
Returns: dict of samples whose value for the tag is missing (‘.’)
Return type: dict
Examples
>>> record.isMissing(tag='PI')
{'ms01e': '.', 'ms02g': '.', 'ms03g': '.', 'ms04h': '.', 'MA611': '.', 'MA605': '.', 'MA622': '.'}
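The zygosity and missing-value checks above (isHETVAR, isHOMREF, isHOMVAR, isMissing) rest on the usual reading of a numeric GT string. The sketch below classifies a genotype under that conventional reading; the function `classify_genotype` and its category labels are assumptions made for illustration, and the methods' actual return values may differ.

```python
import re


def classify_genotype(genotype):
    """Classify a numeric GT string as 'missing', 'het', 'homref' or 'homvar'."""
    alleles = re.split(r"[/|]", genotype)
    # Any missing allele makes the whole call missing in this sketch.
    if any(a == "." for a in alleles):
        return "missing"
    # Different allele indices mean a heterozygous call.
    if len(set(alleles)) > 1:
        return "het"
    # All alleles identical: index 0 is the reference allele.
    return "homref" if alleles[0] == "0" else "homvar"


genotypes = {"ms01e": "./.", "ms04h": "1/1", "MA611": "0/0", "demo": "0|1"}
classes = {name: classify_genotype(gt) for name, gt in genotypes.items()}
print(classes)
```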
-
vcfparser.vcf_parser module¶
-
class
vcfparser.vcf_parser.
VcfParser
(filename)[source]¶ Bases:
object
A class to parse the metadata information and yield records from the input VCF.
-
__init__
(filename) Parameters: filename (file) – input VCF file that needs to be parsed. bgzipped files are also supported.
Returns: VCF object for iterating and querying.
Return type: Object
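Supporting both plain and bgzipped input usually comes down to checking for the gzip magic number before opening the file. The sketch below shows that general technique with the standard library, relying on the fact that bgzip output is a valid multi-member gzip stream; `open_vcf` is a hypothetical helper and not necessarily how VcfParser opens files.

```python
import gzip
import os
import tempfile


def open_vcf(filename):
    """Open a plain or gzip/bgzip-compressed VCF as a text stream."""
    with open(filename, "rb") as handle:
        is_gzip = handle.read(2) == b"\x1f\x8b"  # gzip magic number
    if is_gzip:
        return gzip.open(filename, "rt")
    return open(filename, "r")


# Demo with two throwaway files holding the same header line.
tmpdir = tempfile.mkdtemp()
plain_path = os.path.join(tmpdir, "demo.vcf")
gz_path = os.path.join(tmpdir, "demo.vcf.gz")
with open(plain_path, "w") as out:
    out.write("##fileformat=VCFv4.2\n")
with gzip.open(gz_path, "wt") as out:
    out.write("##fileformat=VCFv4.2\n")

with open_vcf(plain_path) as vcf:
    first_plain = vcf.readline()
with open_vcf(gz_path) as vcf:
    first_gz = vcf.readline()
print(first_plain == first_gz)
```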
-
parse_metadata
() Parses the metadata information from the VCF header.
Returns: MetaDataParser object for iterating and querying the metadata information.
Return type: Object
Uses the MetaDataParser class to create the MetaData object.
-
parse_records
(chrom=None, pos_range=None, no_processors=1) Parses the records and yields them one at a time.
Parameters: - chrom (str) – chromosome name or number. Default = None
- pos_range (tuple) – genomic position of interest, e.g. (5, 15). Both upper and lower limits are inclusive. Default = None
- no_of_recs (int) – number of records to process
Uses the Record class to create a Record object.
Yields: Record object for iterating and querying the record information.
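The `chrom` and `pos_range` filters, with both limits inclusive, can be pictured as a generator over the record lines. This is a minimal sketch that yields plain field lists instead of Record objects; `iter_filtered` is a hypothetical name and the input lines are invented for the example.

```python
def iter_filtered(lines, chrom=None, pos_range=None):
    """Yield the fields of record lines that match the chromosome and position range."""
    for line in lines:
        # Skip the '##' metadata lines and the '#CHROM' header line.
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if chrom is not None and fields[0] != str(chrom):
            continue
        if pos_range is not None:
            low, high = pos_range
            # Both limits are inclusive, as documented above.
            if not low <= int(fields[1]) <= high:
                continue
        yield fields


lines = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\n",
    "2\t5\t.\n",
    "2\t15\t.\n",
    "2\t16\t.\n",
    "3\t10\t.\n",
]
selected = list(iter_filtered(lines, chrom="2", pos_range=(5, 15)))
print(selected)
```

Because the function is a generator, records are produced lazily, which keeps memory use flat for large files.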
-