vcfparser package¶

Submodules¶

vcfparser.meta_header_parser module¶

class vcfparser.meta_header_parser.MetaDataParser(header_file)[source]¶

Bases: object

Parses the meta lines of a VCF file.

parse_lines()[source]¶: Parses the metadata lines of the VCF file.

static split_to_dict(string)[source]¶
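
This class is listed without a usage example, so the following is a minimal, standalone sketch of how a VCF meta line can be split into a dict. The name `parse_meta_line` and its return shape are assumptions made for illustration; they are not taken from vcfparser and may differ from what `split_to_dict` actually returns.

```python
import re

def parse_meta_line(line):
    """Split one '##' meta line into (key, value). Hypothetical helper."""
    key, _, value = line.strip()[2:].partition("=")
    if value.startswith("<") and value.endswith(">"):
        # Structured line: collect key=value pairs, keeping quoted text
        # (which may itself contain commas) in one piece.
        pairs = re.findall(r'(\w+)=("[^"]*"|[^,]+)', value[1:-1])
        return key, {k: v.strip('"') for k, v in pairs}
    # Plain line such as ##fileformat=VCFv4.2
    return key, value

print(parse_meta_line("##fileformat=VCFv4.2"))
print(parse_meta_line(
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total depth, all samples">'
))
```

The quoted alternative in the regular expression is tried first, so a comma inside a `Description` does not split the field.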

vcfparser.record_parser module¶

class vcfparser.record_parser.Record(line, header_line)[source]¶

Bases: object

A class to store and extract the data lines of a VCF file.

__init__(line, header_line)[source]¶

Initializes the class with a record line and the header line.

Parameters:	line (str) – tab-separated data line (record) found below the # CHROM line in the VCF file
	header_line (str) – the line in the VCF file that starts with # CHROM

deletion_overlapping_variant()[source]¶

get_info_dict(required_keys=None)[source]¶

Convert Info to dict for required keys

Parameters:	required_keys (list) – Keys of interest (default = all keys will be mapped)
Returns:	key: value pair of only required keys
Return type:	dict

Notes

If a key has no ‘=’ (a flag), its value is returned as ‘.’.

Examples

>>> info_str = 'AC=2,0;AF=1.00;AN=8;BaseQRankSum'
>>> required_keys= ['AC', 'BaseQRankSum']
>>> get_info_dict(self, required_keys)
{'AC':2, 'BaseQRankSum' : '.'}
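
The mapping described above can be sketched without the library. This is an illustrative stand-in, not the implementation of `get_info_dict`: the function name `info_to_dict` is hypothetical, and values are kept as raw strings, which may differ from what the library returns.

```python
def info_to_dict(info_str, required_keys=None):
    """Map a VCF INFO string to a dict. Hypothetical helper."""
    mapped = {}
    for field in info_str.split(";"):
        if not field:
            continue
        # partition() returns an empty separator when '=' is absent,
        # which marks the key as a flag.
        key, sep, value = field.partition("=")
        mapped[key] = value if sep else "."
    if required_keys is None:
        return mapped  # default: every key is mapped
    return {key: mapped[key] for key in required_keys if key in mapped}

info_str = "AC=2,0;AF=1.00;AN=8;BaseQRankSum"
print(info_to_dict(info_str, ["AC", "BaseQRankSum"]))
```

Requested keys that are absent from the INFO string are simply skipped in this sketch; raising an error instead would be an equally reasonable choice.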

get_mapped_samples(sample_names=None, formats=None)[source]¶

Parameters:	sample_names (list) – sample names to keep (default = all samples)
	formats (list) – format tags to keep (default = all formats)
Returns:	dict of the selected samples, each mapped to its selected format tags
Return type:	dict

Examples

>>> mapped_sample = {'ms01e': {'GT': './.','PI': '.', 'PC': '.'}, 'MA622': {'GT': '0/0', 'PI': '.', 'PC': '.'}, 'MA611': {'GT': '0/0', 'PI': '.', 'PC': '.'}}
>>> get_mapped_samples(self, sample_names= ['ms01e', 'MA611'], formats= ['GT', 'PC'])
{'ms01e': {'GT': './.', 'PC': '.'}, 'MA611': {'GT': '0/0', 'PC': '.'}}
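
The filtering shown in the example can be reproduced with plain dict comprehensions. The sketch below is a standalone illustration; `filter_mapped_samples` is a hypothetical name, and passing the mapped dict explicitly is an assumption made so the code runs without a `Record` instance.

```python
def filter_mapped_samples(mapped_samples, sample_names=None, formats=None):
    """Keep only the requested samples and format tags. Hypothetical helper."""
    selected = {}
    for name, tags in mapped_samples.items():
        if sample_names is not None and name not in sample_names:
            continue  # sample was not requested
        selected[name] = {
            tag: value
            for tag, value in tags.items()
            if formats is None or tag in formats
        }
    return selected

mapped_sample = {
    "ms01e": {"GT": "./.", "PI": ".", "PC": "."},
    "MA622": {"GT": "0/0", "PI": ".", "PC": "."},
    "MA611": {"GT": "0/0", "PI": ".", "PC": "."},
}
print(filter_mapped_samples(mapped_sample, ["ms01e", "MA611"], ["GT", "PC"]))
```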

hasAllele(allele='0', tag='GT', bases='numeric')[source]¶

Parameters:	allele (str) – allele to look for in the samples (default = ‘0’)
	tag (str) – format tag of interest (default = ‘GT’)
	bases (str) – ‘iupac’ or ‘numeric’ (default = ‘numeric’)
Returns:	dict of samples whose value for the tag contains the given allele
Return type:	dict

hasINDEL()[source]¶

hasSNP()[source]¶

hasVAR(genotype='0/0', tag='GT', bases='numeric')[source]¶

Parameters:	genotype (str) – genotype to look for in the samples (default = ‘0/0’)
	tag (str) – format tag of interest (default = ‘GT’)
	bases (str) – ‘iupac’ or ‘numeric’ (default = ‘numeric’)
Returns:	dict of samples whose value for the tag equals the given genotype
Return type:	dict

has_phased(tag='GT', bases='numeric')[source]¶

Parameters:	tag (str) – format tag of interest (default = ‘GT’)
	bases (str) – ‘iupac’ or ‘numeric’ (default = ‘numeric’)
Returns:	dict of samples whose value for the tag is phased, i.e. contains ‘|’
Return type:	dict

Examples

>>> rec_keys_eg = 'CHROM        POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  ms01e   ms02g   ms03g   ms04h   MA611   MA605   MA622'

>>> rec_valeg = '2      15881018        .       G       A,C     5082.45 PASS    AC=2,0;AF=1.00;AN=8;BaseQRankSum=-7.710e-01;ClippingRankSum=0.00;DP=902;ExcessHet=0.0050;FS=0.000;InbreedingCoeff=0.8004;MLEAC=12,1;MLEAF=0.462,0.038;MQ=60.29;MQRankSum=0.00;QD=33.99;ReadPosRankSum=0.260;SF=0,1,2,3,4,5,6;SOR=0.657;set=HignConfSNPs GT:PI:GQ:PG:PM:PW:AD:PL:DP:PB:PC        ./.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:. ./.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:. ./.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:. 1/1:.:6:1/1:.:1/1:0,2:49,6,0,.,.,.:2:.:.        0/0:.:78:0/0:.:0/0:29,0,0:0,78,1170,78,1170,1170:29:.:. 0/0:.:9:0/0:.:0/0:3,0,0:0,9,112,9,112,112:3:.:. 0/0:.:99:0/0:.:0/0:40,0,0:0,105,1575,105,1575,1575:40:.:.'
>>> from record_parser import Record
>>> rec_obj = Record(rec_valeg, rec_keys_eg)
>>> rec_obj.has_phased(tag="GT", bases="iupac")
{}

has_unphased(tag='GT', bases='numeric')[source]¶

Parameters:	tag (str) – format tag of interest (default = ‘GT’)
	bases (str) – ‘iupac’ or ‘numeric’ (default = ‘numeric’)
Returns:	dict of samples whose value for the tag is unphased, i.e. contains ‘/’
Return type:	dict

Examples

>>> rec_keys_eg = 'CHROM        POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  ms01e   ms02g   ms03g   ms04h   MA611   MA605   MA622'

>>> rec_valeg = '2      15881018        .       G       A,C     5082.45 PASS    AC=2,0;AF=1.00;AN=8;BaseQRankSum=-7.710e-01;ClippingRankSum=0.00;DP=902;ExcessHet=0.0050;FS=0.000;InbreedingCoeff=0.8004;MLEAC=12,1;MLEAF=0.462,0.038;MQ=60.29;MQRankSum=0.00;QD=33.99;ReadPosRankSum=0.260;SF=0,1,2,3,4,5,6;SOR=0.657;set=HignConfSNPs GT:PI:GQ:PG:PM:PW:AD:PL:DP:PB:PC        ./.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:. ./.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:. ./.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:. 1/1:.:6:1/1:.:1/1:0,2:49,6,0,.,.,.:2:.:.        0/0:.:78:0/0:.:0/0:29,0,0:0,78,1170,78,1170,1170:29:.:. 0/0:.:9:0/0:.:0/0:3,0,0:0,9,112,9,112,112:3:.:. 0/0:.:99:0/0:.:0/0:40,0,0:0,105,1575,105,1575,1575:40:.:.'
>>> from record_parser import Record
>>> rec_obj = Record(rec_valeg, rec_keys_eg)
>>> rec_obj.has_unphased(tag="GT", bases="iupac")
{'ms01e': './.', 'ms02g': './.', 'ms03g': './.', 'ms04h': '1/1', 'MA611': '0/0', 'MA605': '0/0', 'MA622': '0/0'}
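
In VCF, a phased genotype separates its alleles with `|` and an unphased one with `/`, which is why the record above yields an empty dict from has_phased and every sample from has_unphased. A minimal standalone sketch of that split, using a hypothetical helper name and a plain dict of genotypes as assumed input:

```python
def split_by_phase(genotypes):
    """Return (phased, unphased) dicts of sample -> genotype. Hypothetical helper."""
    phased = {name: gt for name, gt in genotypes.items() if "|" in gt}
    unphased = {name: gt for name, gt in genotypes.items() if "/" in gt}
    return phased, unphased

genotypes = {"ms01e": "./.", "ms04h": "1/1", "MA611": "0/0"}
phased, unphased = split_by_phase(genotypes)
print(phased)
print(unphased)
```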

hasnoVAR(tag='GT')[source]¶: Returns samples with empty genotype

isHETVAR(tag='GT', bases='numeric')[source]¶

Parameters:	tag (str) – format tag of interest (default = ‘GT’)
	bases (str) – ‘iupac’ or ‘numeric’ (default = ‘numeric’)
Returns:	dict of samples with a heterozygous variant genotype
Return type:	dict

isHOMREF(tag='GT', bases='numeric')[source]¶

Parameters:	tag (str) – format tag of interest (default = ‘GT’)
	bases (str) – ‘iupac’ or ‘numeric’ (default = ‘numeric’)
Returns:	dict of samples with a homozygous reference genotype
Return type:	dict

Examples

>>> rec_keys_eg = 'CHROM        POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  ms01e   ms02g   ms03g   ms04h   MA611   MA605   MA622'

>>> rec_valeg = '2      15881018        .       G       A,C     5082.45 PASS    AC=2,0;AF=1.00;AN=8;BaseQRankSum=-7.710e-01;ClippingRankSum=0.00;DP=902;ExcessHet=0.0050;FS=0.000;InbreedingCoeff=0.8004;MLEAC=12,1;MLEAF=0.462,0.038;MQ=60.29;MQRankSum=0.00;QD=33.99;ReadPosRankSum=0.260;SF=0,1,2,3,4,5,6;SOR=0.657;set=HignConfSNPs GT:PI:GQ:PG:PM:PW:AD:PL:DP:PB:PC        ./.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:. ./.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:. ./.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:. 1/1:.:6:1/1:.:1/1:0,2:49,6,0,.,.,.:2:.:.        0/0:.:78:0/0:.:0/0:29,0,0:0,78,1170,78,1170,1170:29:.:. 0/0:.:9:0/0:.:0/0:3,0,0:0,9,112,9,112,112:3:.:. 0/0:.:99:0/0:.:0/0:40,0,0:0,105,1575,105,1575,1575:40:.:.'
>>> from record_parser import Record
>>> rec_obj = Record(rec_valeg, rec_keys_eg)
>>> rec_obj.isHOMREF(tag="GT", bases="iupac")
{'MA611': 'G/G', 'MA605': 'G/G', 'MA622': 'G/G'}
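
The genotype predicates in this class (isHOMREF, isHOMVAR, isHETVAR, isMissing) and the `bases` option can be illustrated with a small standalone sketch. Both function names are hypothetical, and the classification rules are the conventional VCF reading of the GT field, not code taken from vcfparser.

```python
import re

def classify_genotype(gt):
    """Label a numeric GT string. Hypothetical helper."""
    alleles = re.split(r"[/|]", gt)
    if "." in alleles:
        return "missing"
    if len(set(alleles)) > 1:
        return "hetvar"
    # All alleles are identical: 0 means the reference allele.
    return "homref" if alleles[0] == "0" else "homvar"

def numeric_to_iupac(gt, ref, alts):
    """Replace allele indexes with bases, keeping the separator."""
    bases = [ref] + list(alts)
    sep = "|" if "|" in gt else "/"
    return sep.join("." if a == "." else bases[int(a)] for a in gt.split(sep))

print(classify_genotype("0/0"), numeric_to_iupac("0/0", "G", ["A", "C"]))
print(classify_genotype("1/1"), numeric_to_iupac("1/1", "G", ["A", "C"]))
```

With REF `G` and ALT `A,C`, index 0 maps to `G`, which matches the `'G/G'` values in the example output above.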

isHOMVAR(tag='GT', bases='numeric')[source]¶

Parameters:	tag (str) – format tag of interest (default = ‘GT’)
	bases (str) – ‘iupac’ or ‘numeric’ (default = ‘numeric’)
Returns:	dict of samples with a homozygous variant genotype
Return type:	dict

isMissing(tag='GT')[source]¶

Parameters:	tag (str) – format tag of interest (default = ‘GT’)
Returns:	dict of samples with a missing genotype
Return type:	dict

iupac_to_numeric()[source]¶

map_records_long()[source]¶

Maps record values with record keys.

Returns:	dict with key value pair with sample and infos modified
Return type:	dict

static split_tag_from_samples(order_mapped_samples, tag, sample_names)[source]¶

Splits the tags of given samples from order_dict of mapped_samples

Parameters:	order_mapped_samples (OrderedDict) –
	tag (str) –
	sample_names (list) –
Returns:	list of lists containing the split tags
Return type:	list of list

Examples

>>> from collections import OrderedDict
>>> order_mapped_samples = OrderedDict([('ms01e', {'GT': './.', 'PI': '.'}), ('MA622', {'GT': '0/0', 'PI': '.'})])
>>> tag = 'GT'
>>> sample_names = ['ms01e', 'MA622']
>>> split_tag_from_samples(order_mapped_samples, tag, sample_names)
[['.', '.'], ['0', '0']]
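
A standalone sketch that reproduces the output above. The name `split_tag` is hypothetical, and splitting on both `/` and `|` is an assumption; the library may treat separators differently.

```python
import re
from collections import OrderedDict

def split_tag(order_mapped_samples, tag, sample_names):
    """Split one tag's value into alleles for each sample. Hypothetical helper."""
    return [
        re.split(r"[/|]", order_mapped_samples[name][tag])
        for name in sample_names
    ]

order_mapped_samples = OrderedDict([
    ("ms01e", {"GT": "./.", "PI": "."}),
    ("MA622", {"GT": "0/0", "PI": "."}),
])
print(split_tag(order_mapped_samples, "GT", ["ms01e", "MA622"]))
```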

unmap_fmt_samples_dict(mapped_dict)[source]¶: Converts mapped dict again into string to write into the file.

vcfparser.vcf_parser module¶

class vcfparser.vcf_parser.VcfParser(filename)[source]¶

Bases: object

Parses a given VCF file, outputs its meta information, and yields its records.

__init__(filename)[source]¶

Parameters:	filename (file) – input VCF file to parse; bzip files are also supported.

parse_metadata()[source]: Initializes the variables that store the meta information.

parse_records(chrom=None, pos_range=None, no_of_recs=1)[source]

Parses records from the file and yields them one at a time.

Parameters:	chrom (str) –
	pos_range (str) –
	no_of_recs (int) –
Yields:	Record object on which we can perform different operations and extract required values
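
To illustrate the yield-based pattern described above, here is a minimal sketch of a generator over VCF data lines. It is not the implementation of `parse_records`: the name `iter_records` is hypothetical, it yields plain dicts instead of `Record` objects, and it only models the `chrom` filter because `pos_range` and `no_of_recs` are not described in this reference.

```python
import io

def iter_records(handle, chrom=None):
    """Yield each data line as a dict keyed by the header columns."""
    header = None
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta lines
        if line.startswith("#"):
            # Column header line, e.g. "#CHROM<TAB>POS<TAB>..."
            header = line.lstrip("#").strip().split("\t")
            continue
        if header is None:
            raise ValueError("data line found before the header line")
        record = dict(zip(header, line.split("\t")))
        if chrom is None or record["CHROM"] == chrom:
            yield record

vcf_text = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\n"
    "2\t15881018\t.\tG\tA,C\n"
    "3\t100\t.\tT\tC\n"
)
for record in iter_records(io.StringIO(vcf_text), chrom="2"):
    print(record["CHROM"], record["POS"], record["ALT"])
```

Because the function is a generator, lines are read lazily, so a large file does not need to be held in memory.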

vcfparser.vcf_writer module¶

class vcfparser.vcf_writer.VCFWriter(filename)[source]¶

Bases: object

A VCF writer that writes the header lines and data lines into a new file.

add_contig(id, length, key='contig')[source]¶

add_filter(id, desc='', key='FILTER')[source]¶

add_filter_long(id, num='.', type='.', desc='', key='FILTER')[source]¶

add_format(id, num='.', type='.', desc='', key='FORMAT')[source]¶

add_header_line(record_keys)[source]¶

add_info(id, num='.', type='.', desc='', key='INFO')[source]¶

add_normal_metadata(key, value)[source]¶: Adds plain key-value metadata such as fileformat, filedate, and reference.

add_record_value(preheader, info, format_, sample_str)[source]¶
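
The writer methods above carry no descriptions, so the sketch below only shows the meta line formats defined by the VCF specification that such methods would plausibly emit. The function names are hypothetical and the code does not call or reproduce `VCFWriter`.

```python
def simple_meta(key, value):
    """Plain key=value meta line, e.g. ##fileformat=VCFv4.2."""
    return f"##{key}={value}"

def structured_meta(key, id_, num=None, type_=None, desc=None):
    """Structured meta line for INFO, FORMAT or FILTER entries."""
    fields = [f"ID={id_}"]
    if num is not None:
        fields.append(f"Number={num}")
    if type_ is not None:
        fields.append(f"Type={type_}")
    if desc is not None:
        fields.append(f'Description="{desc}"')
    body = ",".join(fields)
    return f"##{key}=<{body}>"

def contig_meta(id_, length):
    """Contig meta line with its sequence length."""
    return f"##contig=<ID={id_},length={length}>"

print(simple_meta("fileformat", "VCFv4.2"))
print(structured_meta("INFO", "DP", num=1, type_="Integer", desc="Total depth"))
print(structured_meta("FILTER", "q10", desc="Quality below 10"))
print(contig_meta("20", 62435964))
```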
