vcfparser package
Submodules
vcfparser.meta_header_parser module
vcfparser.record_parser module
-
class vcfparser.record_parser.Record(line, header_line)
Bases: object
A class to store and extract data from the record lines of a VCF file.
-
__init__(line, header_line)
Initializes the class with a record line and the header line.
Parameters:
- line (str) – a tab-separated data line (record) found below the # CHROM line in the VCF file
- header_line (str) – the line in the VCF file that starts with # CHROM
-
get_info_dict(required_keys=None)
Converts the INFO field to a dict restricted to the required keys.
Parameters: required_keys (list) – keys of interest (default = all keys will be mapped)
Returns: key: value pairs for the required keys only
Return type: dict
Notes
If a key has no ‘=’ then its value is returned as ‘.’.
Examples
>>> info_str = 'AC=2,0;AF=1.00;AN=8;BaseQRankSum'
>>> required_keys = ['AC', 'BaseQRankSum']
>>> get_info_dict(self, required_keys)
{'AC': 2, 'BaseQRankSum': '.'}
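The conversion described above can be sketched in a few lines. This is a minimal standalone illustration, not the package's implementation: the helper name `info_to_dict` is hypothetical, and it assumes values are kept as raw strings, which may differ from what `get_info_dict` actually returns.

```python
def info_to_dict(info_str, required_keys=None):
    """Map a VCF INFO string to a dict (hypothetical helper, not vcfparser code)."""
    info = {}
    for field in info_str.split(';'):
        # Flag-style entries carry no '=', so they get '.' as a placeholder value.
        key, sep, value = field.partition('=')
        info[key] = value if sep else '.'
    if required_keys is None:
        return info
    # Keep only the requested keys, in the order they were requested.
    return {key: info[key] for key in required_keys if key in info}


result = info_to_dict('AC=2,0;AF=1.00;AN=8;BaseQRankSum', ['AC', 'BaseQRankSum'])
print(result)
```

Using `str.partition` rather than `str.split('=')` keeps the flag case and the key=value case in a single code path.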
-
get_mapped_samples(sample_names=None, formats=None)
Parameters:
- sample_names (list) – list of sample names to keep (default = all samples are returned)
- formats (list) – list of format tags to keep (default = all formats are returned)
Returns: dict of the filtered sample names along with their filtered formats
Return type: dict
Examples
>>> mapped_sample = {'ms01e': {'GT': './.', 'PI': '.', 'PC': '.'}, 'MA622': {'GT': '0/0', 'PI': '.', 'PC': '.'}, 'MA611': {'GT': '0/0', 'PI': '.', 'PC': '.'}}
>>> get_mapped_samples(self, sample_names=['ms01e', 'MA611'], formats=['GT', 'PC'])
{'ms01e': {'GT': './.', 'PC': '.'}, 'MA611': {'GT': '0/0', 'PC': '.'}}
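The filtering shown in the example amounts to selecting keys at two levels of a nested dict. The sketch below illustrates that idea on plain dicts; `filter_mapped_samples` is a hypothetical name, and the behaviour for unknown sample names or tags (a `KeyError` here) is an assumption rather than documented package behaviour.

```python
def filter_mapped_samples(mapped_samples, sample_names=None, formats=None):
    """Keep only the requested samples and FORMAT tags (hypothetical helper)."""
    if sample_names is None:
        sample_names = list(mapped_samples)  # default: every sample
    filtered = {}
    for name in sample_names:
        tags = mapped_samples[name]
        wanted = tags if formats is None else formats  # default: every tag
        filtered[name] = {tag: tags[tag] for tag in wanted}
    return filtered


mapped_sample = {
    'ms01e': {'GT': './.', 'PI': '.', 'PC': '.'},
    'MA622': {'GT': '0/0', 'PI': '.', 'PC': '.'},
    'MA611': {'GT': '0/0', 'PI': '.', 'PC': '.'},
}
subset = filter_mapped_samples(mapped_sample, ['ms01e', 'MA611'], ['GT', 'PC'])
print(subset)
```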
-
hasAllele(allele='0', tag='GT', bases='numeric')
Parameters:
- allele (str) – allele to look for in the samples (default = ‘0’)
- tag (str) – format tag of interest (default = ‘GT’)
- bases (str) – iupac or numeric (default = ‘numeric’)
Returns: dict of samples whose values contain the given allele
Return type: dict
-
hasVAR(genotype='0/0', tag='GT', bases='numeric')
Parameters:
- genotype (str) – genotype to look for in the samples (default = ‘0/0’)
- tag (str) – format tag of interest (default = ‘GT’)
- bases (str) – iupac or numeric (default = ‘numeric’)
Returns: dict of samples whose values match the given genotype
Return type: dict
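As a rough illustration of the two lookups above, the sketch below works on a plain sample-to-genotype dict. The helper names are hypothetical, and two points are assumptions rather than documented behaviour: the allele check splits the genotype on `/` or `|`, and the genotype check is an exact string comparison.

```python
import re


def samples_with_allele(genotypes, allele='0'):
    """Samples whose genotype contains the given allele (hypothetical helper)."""
    return {
        sample: gt
        for sample, gt in genotypes.items()
        if allele in re.split(r'[/|]', gt)
    }


def samples_with_genotype(genotypes, genotype='0/0'):
    """Samples whose genotype string equals the given genotype (assumed exact match)."""
    return {sample: gt for sample, gt in genotypes.items() if gt == genotype}


genotypes = {'ms01e': './.', 'ms04h': '1/1', 'MA611': '0/0', 'MA605': '0|1'}
with_ref = samples_with_allele(genotypes, allele='0')
hom_alt = samples_with_genotype(genotypes, genotype='1/1')
print(with_ref, hom_alt)
```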
-
has_phased(tag='GT', bases='numeric')
Parameters:
- tag (str) – format tag of interest (default = ‘GT’)
- bases (str) – iupac or numeric (default = ‘numeric’)
Returns: dict of samples whose values contain ‘|’ (phased) in the given format tag
Return type: dict
Examples
>>> rec_keys_eg = 'CHROM POS ID REF ALT QUAL FILTER INFO FORMAT ms01e ms02g ms03g ms04h MA611 MA605 MA622'
>>> rec_valeg = '2 15881018 . G A,C 5082.45 PASS AC=2,0;AF=1.00;AN=8;BaseQRankSum=-7.710e-01;ClippingRankSum=0.00;DP=902;ExcessHet=0.0050;FS=0.000;InbreedingCoeff=0.8004;MLEAC=12,1;MLEAF=0.462,0.038;MQ=60.29;MQRankSum=0.00;QD=33.99;ReadPosRankSum=0.260;SF=0,1,2,3,4,5,6;SOR=0.657;set=HignConfSNPs GT:PI:GQ:PG:PM:PW:AD:PL:DP:PB:PC ./.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:. ./.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:. ./.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:. 1/1:.:6:1/1:.:1/1:0,2:49,6,0,.,.,.:2:.:. 0/0:.:78:0/0:.:0/0:29,0,0:0,78,1170,78,1170,1170:29:.:. 0/0:.:9:0/0:.:0/0:3,0,0:0,9,112,9,112,112:3:.:. 0/0:.:99:0/0:.:0/0:40,0,0:0,105,1575,105,1575,1575:40:.:.'
>>> from record_parser import Record
>>> rec_obj = Record(rec_valeg, rec_keys_eg)
>>> rec_obj.has_phased(tag="GT", bases="iupac")
{}
-
has_unphased(tag='GT', bases='numeric')
Parameters:
- tag (str) – format tag of interest (default = ‘GT’)
- bases (str) – iupac or numeric (default = ‘numeric’)
Returns: dict of samples whose values contain ‘/’ (unphased) in the given format tag
Return type: dict
Examples
>>> rec_keys_eg = 'CHROM POS ID REF ALT QUAL FILTER INFO FORMAT ms01e ms02g ms03g ms04h MA611 MA605 MA622'
>>> rec_valeg = '2 15881018 . G A,C 5082.45 PASS AC=2,0;AF=1.00;AN=8;BaseQRankSum=-7.710e-01;ClippingRankSum=0.00;DP=902;ExcessHet=0.0050;FS=0.000;InbreedingCoeff=0.8004;MLEAC=12,1;MLEAF=0.462,0.038;MQ=60.29;MQRankSum=0.00;QD=33.99;ReadPosRankSum=0.260;SF=0,1,2,3,4,5,6;SOR=0.657;set=HignConfSNPs GT:PI:GQ:PG:PM:PW:AD:PL:DP:PB:PC ./.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:. ./.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:. ./.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:. 1/1:.:6:1/1:.:1/1:0,2:49,6,0,.,.,.:2:.:. 0/0:.:78:0/0:.:0/0:29,0,0:0,78,1170,78,1170,1170:29:.:. 0/0:.:9:0/0:.:0/0:3,0,0:0,9,112,9,112,112:3:.:. 0/0:.:99:0/0:.:0/0:40,0,0:0,105,1575,105,1575,1575:40:.:.'
>>> from record_parser import Record
>>> rec_obj = Record(rec_valeg, rec_keys_eg)
>>> rec_obj.has_unphased(tag="GT", bases="iupac")
{'ms01e': './.', 'ms02g': './.', 'ms03g': './.', 'ms04h': '1/1', 'MA611': '0/0', 'MA605': '0/0', 'MA622': '0/0'}
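In VCF, a genotype written with `|` is phased and one written with `/` is unphased, so the two checks come down to which separator a sample's value contains. The sketch below shows that on a plain sample-to-genotype dict; `split_by_phase` is a hypothetical helper and not part of the package.

```python
def split_by_phase(genotypes):
    """Separate samples into phased ('|') and unphased ('/') genotypes."""
    phased = {sample: gt for sample, gt in genotypes.items() if '|' in gt}
    unphased = {sample: gt for sample, gt in genotypes.items() if '/' in gt}
    return phased, unphased


# GT values taken from the example record above.
genotypes = {
    'ms01e': './.', 'ms02g': './.', 'ms03g': './.', 'ms04h': '1/1',
    'MA611': '0/0', 'MA605': '0/0', 'MA622': '0/0',
}
phased, unphased = split_by_phase(genotypes)
print(phased)
print(unphased)
```

Every sample in the example record uses `/`, which is why the phased result is empty there.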
-
isHETVAR(tag='GT', bases='numeric')
Parameters:
- tag (str) – format tag of interest (default = ‘GT’)
- bases (str) – iupac or numeric (default = ‘numeric’)
Returns: dict of samples whose values are heterozygous variants
Return type: dict
-
isHOMREF(tag='GT', bases='numeric')
Parameters:
- tag (str) – format tag of interest (default = ‘GT’)
- bases (str) – iupac or numeric (default = ‘numeric’)
Returns: dict of samples whose values are homozygous reference
Return type: dict
Examples
>>> rec_keys_eg = 'CHROM POS ID REF ALT QUAL FILTER INFO FORMAT ms01e ms02g ms03g ms04h MA611 MA605 MA622'
>>> rec_valeg = '2 15881018 . G A,C 5082.45 PASS AC=2,0;AF=1.00;AN=8;BaseQRankSum=-7.710e-01;ClippingRankSum=0.00;DP=902;ExcessHet=0.0050;FS=0.000;InbreedingCoeff=0.8004;MLEAC=12,1;MLEAF=0.462,0.038;MQ=60.29;MQRankSum=0.00;QD=33.99;ReadPosRankSum=0.260;SF=0,1,2,3,4,5,6;SOR=0.657;set=HignConfSNPs GT:PI:GQ:PG:PM:PW:AD:PL:DP:PB:PC ./.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:. ./.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:. ./.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:. 1/1:.:6:1/1:.:1/1:0,2:49,6,0,.,.,.:2:.:. 0/0:.:78:0/0:.:0/0:29,0,0:0,78,1170,78,1170,1170:29:.:. 0/0:.:9:0/0:.:0/0:3,0,0:0,9,112,9,112,112:3:.:. 0/0:.:99:0/0:.:0/0:40,0,0:0,105,1575,105,1575,1575:40:.:.'
>>> from record_parser import Record
>>> rec_obj = Record(rec_valeg, rec_keys_eg)
>>> rec_obj.isHOMREF(tag="GT", bases="iupac")
{'MA611': 'G/G', 'MA605': 'G/G', 'MA622': 'G/G'}
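The example output reports `G/G` where the record stores `0/0`, because `bases="iupac"` replaces allele indices with bases. A minimal sketch of that translation follows, assuming index 0 is REF and indices 1, 2, ... follow the comma-separated ALT list as in the VCF format; `to_bases` is a hypothetical helper, not the package's own routine.

```python
import re


def to_bases(genotype, ref, alt):
    """Translate numeric allele indices into bases (hypothetical helper)."""
    alleles = [ref] + alt.split(',')
    # The capturing group keeps '/' and '|' in the result, so phase is preserved.
    parts = re.split(r'([/|])', genotype)
    # Missing alleles ('.') and separators are passed through unchanged.
    return ''.join(alleles[int(p)] if p.isdigit() else p for p in parts)


print(to_bases('0/0', 'G', 'A,C'))
print(to_bases('1|2', 'G', 'A,C'))
```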
-
isHOMVAR(tag='GT', bases='numeric')
Parameters:
- tag (str) – format tag of interest (default = ‘GT’)
- bases (str) – iupac or numeric (default = ‘numeric’)
Returns: dict of samples whose values are homozygous variants
Return type: dict
-
isMissing(tag='GT')
Parameters: tag (str) – format tag of interest (default = ‘GT’)
Returns: dict of samples whose values are missing
Return type: dict
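The four checks isHETVAR, isHOMREF, isHOMVAR and isMissing partition numeric genotypes into categories. The sketch below shows one way to express that classification; `classify_genotype` is a hypothetical helper, and treating a partially missing genotype such as `./1` as missing is an assumption, not documented package behaviour.

```python
import re


def classify_genotype(genotype):
    """Label a numeric genotype string (hypothetical helper)."""
    alleles = re.split(r'[/|]', genotype)
    if '.' in alleles:
        return 'missing'   # assumption: any missing allele makes the call missing
    if len(set(alleles)) > 1:
        return 'hetvar'    # two different alleles
    # All alleles are identical: index 0 is the reference allele.
    return 'homref' if alleles[0] == '0' else 'homvar'


labels = {gt: classify_genotype(gt) for gt in ['./.', '0/0', '1/1', '0/1', '1|2']}
print(labels)
```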
-
map_records_long()
Maps record values to record keys.
Returns: dict of key: value pairs, with the sample and INFO entries expanded
Return type: dict
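Mapping record values to record keys is essentially a `zip` of the header columns with the tab-separated fields of a data line. The sketch below illustrates this under stated assumptions: `map_record` is a hypothetical function, the nested `samples` key is invented for the illustration, and the real return structure of `map_records_long` may differ.

```python
def map_record(line, header_line):
    """Pair header columns with record values (hypothetical helper)."""
    keys = header_line.lstrip('#').split()
    values = line.rstrip('\n').split('\t')
    # The first nine VCF columns are fixed: CHROM through FORMAT.
    record = dict(zip(keys[:9], values[:9]))
    formats = record['FORMAT'].split(':')
    # Remaining columns are samples; map each value onto the FORMAT tags.
    record['samples'] = {
        name: dict(zip(formats, value.split(':')))
        for name, value in zip(keys[9:], values[9:])
    }
    return record


header = '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tms01e'
line = '2\t15881018\t.\tG\tA,C\t5082.45\tPASS\tAC=2,0;AN=8\tGT:DP\t0/0:29'
record = map_record(line, header)
print(record)
```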
-
static split_tag_from_samples(order_mapped_samples, tag, sample_names)
Splits the given tag's values for the given samples from the OrderedDict of mapped samples.
Parameters:
- order_mapped_samples (OrderedDict) –
- tag (str) –
- sample_names (list) –
Returns: list of lists containing the split tags
Return type: list of list
Examples
>>> order_mapped_samples = OrderedDict([('ms01e', {'GT': './.', 'PI': '.'}), ('MA622', {'GT': '0/0', 'PI': '.'})])
>>> tag = 'GT'
>>> sample_names = ['ms01e', 'MA622']
>>> split_tag_from_samples(order_mapped_samples, tag, sample_names)
[['.', '.'], ['0', '0']]
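The splitting in this example can be reproduced with a short standalone function. This is a sketch only: `split_tag` is a hypothetical name, and splitting on both `/` and `|` is an assumption based on the two genotype separators VCF allows.

```python
import re
from collections import OrderedDict


def split_tag(order_mapped_samples, tag, sample_names):
    """Split each sample's value for `tag` into its alleles (hypothetical helper)."""
    return [
        re.split(r'[/|]', order_mapped_samples[name][tag])
        for name in sample_names
    ]


order_mapped_samples = OrderedDict([
    ('ms01e', {'GT': './.', 'PI': '.'}),
    ('MA622', {'GT': '0/0', 'PI': '.'}),
])
alleles = split_tag(order_mapped_samples, 'GT', ['ms01e', 'MA622'])
print(alleles)
```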
-
vcfparser.vcf_parser module
-
class vcfparser.vcf_parser.VcfParser(filename)
Bases: object
Parses a given VCF file, outputs its metainfo, and yields records.
-
__init__(filename)
Parameters: filename (file) – input VCF file to be parsed. bzip files are also supported.
-
parse_metadata()
Initializes the variables that store the meta information.
-
parse_records(chrom=None, pos_range=None, no_of_recs=1)
Parses records from the file and yields them.
Parameters:
- chrom (str) –
- pos_range (str) –
- no_of_recs (int) –
Yields: Record object on which different operations can be performed to extract the required values
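To illustrate the yield-based pattern, the sketch below walks a VCF text stream, skips the `##` meta lines, remembers the `#CHROM` header, and yields data lines lazily with an optional chromosome filter. It is a standalone approximation under stated assumptions: `iter_records` is a hypothetical function, it yields `(header_line, line)` tuples rather than Record objects, and `pos_range` and `no_of_recs` are left out because their semantics are not described here.

```python
import io


def iter_records(handle, chrom=None):
    """Lazily yield (header_line, record_line) pairs (hypothetical helper)."""
    header_line = None
    for line in handle:
        line = line.rstrip('\n')
        if not line or line.startswith('##'):
            continue                      # blank line or meta information
        if line.startswith('#'):
            header_line = line            # the #CHROM column header
            continue
        if chrom is not None and line.split('\t', 1)[0] != chrom:
            continue                      # record is on another chromosome
        yield header_line, line


vcf_text = (
    '##fileformat=VCFv4.2\n'
    '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
    '2\t15881018\t.\tG\tA,C\t5082.45\tPASS\tDP=902\n'
    '3\t100\t.\tT\tC\t50\tPASS\tDP=10\n'
)
records = list(iter_records(io.StringIO(vcf_text), chrom='2'))
print(records)
```

Because the function is a generator, only one line is held in memory at a time, which matters for large VCF files.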
-