vcfparser package

Submodules

vcfparser.meta_header_parser module

class vcfparser.meta_header_parser.MetaDataParser(header_file)[source]

Bases: object

Parses the meta lines of a VCF file.

parse_lines()[source]

Parse the VCF metadata lines.

vcfparser.meta_header_parser.split_to_dict(string)[source]

vcfparser.metaviewer module

class vcfparser.metaviewer.MetaDataViewer(vcf_meta_file, filename='vcfmetafile')[source]

Bases: object

print_requested_metadata(metadata_of_interest)[source]
save_as_json()[source]

Convert the dictionary to a json object and write to a file

save_as_orderdict()[source]

Converts the JSON-style dictionary into a dictionary in which all values that share a key are collected into one list.

save_as_table()[source]

Write the data to a file as text.

vcfparser.metaviewer.obj_to_dict(metainfo)[source]
vcfparser.metaviewer.unpack_str(s)[source]

vcfparser.record_parser module

class vcfparser.record_parser.Alleles(mapped_samples, tag='GT')[source]

Bases: object

__init__(mapped_samples, tag='GT')[source]

This class is used to store sample names with their types.

class vcfparser.record_parser.GenotypeVal(allele)[source]

Bases: object

__init__(allele)[source]

For a given genotype such as ‘0/0’, ‘1|1’ or ‘0/1’, this class computes and stores values such as whether it is homref, hom_alt or hetvar.
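The classification described above can be sketched as follows. This is a minimal illustration, not the GenotypeVal implementation: the function name classify_genotype and the returned labels (including 'missing' for genotypes containing '.') are assumptions made for the example.

```python
import re


def classify_genotype(genotype):
    """Classify a numeric GT string (hypothetical helper, not part of vcfparser)."""
    # Split at either the unphased ("/") or the phased ("|") separator.
    alleles = re.split(r"[/|]", genotype)
    if "." in alleles:
        return "missing"
    if len(set(alleles)) == 1:
        # All alleles identical: reference if they are all "0", alternate otherwise.
        return "homref" if alleles[0] == "0" else "homvar"
    return "hetvar"


for gt in ("0/0", "1|1", "0/1"):
    print(gt, classify_genotype(gt))
```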

class vcfparser.record_parser.Record(record_values, record_keys)[source]

Bases: object

A class that converts a record line from the input VCF into an accessible Record object.

__init__(record_values, record_keys)[source]

Initializes the class with header keys and record values.

Parameters:
  • record_keys (list) –
    • list of record keys generated for the record values
    • generated from string in the VCF that starts with #CHROM
    • stays the same for a particular VCF file
  • record_values (list) –
    • list of record values generated from the VCF record line
    • generated from the lines below #CHROM in the VCF file
    • values are dynamically updated in each for-loop
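How record_keys and record_values relate can be illustrated with plain string handling. The two lines below are shortened, made-up VCF lines, and the code is only a sketch of the idea, not what Record does internally.

```python
# A shortened header line (starts with #CHROM) and one record line below it.
header_line = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tms01e\tMA622"
record_line = "2\t15881018\t.\tG\tA,C\t5082.45\tPASS\tAC=2,0;AN=8\tGT:PI\t./.:.\t0/0:."

# Keys come from the #CHROM line and stay the same for the whole file.
record_keys = header_line.lstrip("#").split("\t")

# Values come from each record line and change on every iteration.
record_values = record_line.split("\t")

# Pairing them gives a simple key-to-value view of the record.
record_map = dict(zip(record_keys, record_values))
print(record_map["CHROM"], record_map["POS"], record_map["MA622"])
```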
deletion_overlapping_variant()[source]
get_format_to_sample_map(sample_names=None, formats=None, convert_to_iupac=None)[source]
Parameters:
  • sample_names (list) – list of sample names to process (default = all samples are processed)
  • formats (list) – list of format tags to process (default = all format tags are processed)
  • convert_to_iupac (list) – list of tags (from FORMAT) to convert into IUPAC bases (default tag = ‘GT’, default output = numeric bases)
Returns:

dict of filtered sample names along with filtered format “tags:values”

Return type:

dict

Examples

>>> import vcfparser.vcf_parser as vcfparse
>>> myvcf = vcfparse.VcfParser("input_test.vcf")
>>> records = myvcf.parse_records()
>>> record = next(records)
>>> record.mapped_format_to_sample = {'ms01e': {'GT': './.','PI': '.', 'PC': '.'}, 'MA622': {'GT': '0/0', 'PI': '.', 'PC': '.'}, 'MA611': {'GT': '0/0', 'PI': '.', 'PC': '.'}}
>>> record.get_format_to_sample_map(sample_names=['ms01e', 'MA611'], formats=['GT', 'PC'])
{'ms01e': {'GT': './.', 'PC': '.'}, 'MA611': {'GT': '0/0', 'PC': '.'}}
>>> record.get_format_to_sample_map(sample_names=['ms01e', 'MA611'], formats=['GT', 'PC', 'PG'], convert_to_iupac=['GT', 'PG'])
{'ms01e': {'GT': './.', 'PC': '.'}, 'MA611': {'GT': '0/0', 'PC': '.'}}
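The filtering shown in the example can be approximated with a standalone function. The helper filter_format_to_sample_map is hypothetical and leaves out IUPAC conversion; it only illustrates selecting samples and FORMAT tags from an already mapped dict.

```python
def filter_format_to_sample_map(mapped, sample_names=None, formats=None):
    """Keep only the requested samples and FORMAT tags (hypothetical helper)."""
    selected = {}
    for sample, tags in mapped.items():
        if sample_names is not None and sample not in sample_names:
            continue
        # Tags that are requested but absent from the sample are simply skipped.
        selected[sample] = {
            tag: value
            for tag, value in tags.items()
            if formats is None or tag in formats
        }
    return selected


mapped_samples = {
    "ms01e": {"GT": "./.", "PI": ".", "PC": "."},
    "MA622": {"GT": "0/0", "PI": ".", "PC": "."},
    "MA611": {"GT": "0/0", "PI": ".", "PC": "."},
}
print(filter_format_to_sample_map(mapped_samples, ["ms01e", "MA611"], ["GT", "PC"]))
```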
get_full_record_map(convert_to_iupac=None)[source]

Maps record values with record keys.

Parameters:convert_to_iupac (list) – list of genotype tags to convert into IUPAC bases (default tag = ‘GT’, default output = numeric bases)
Returns:
  • dict – record keys mapped to record values, with INFO expanded into a dict and the per-sample FORMAT values added under ‘samples’

Examples

>>> record.get_full_record_map()
{'CHROM': '2', 'POS': '15881018', 'ID': '.', 'REF': 'G', 'ALT': 'A,C', 'QUAL': '5082.45', 'FILTER': 'PASS', 'INFO': {'AC': '2,0', 'AF': '1.00', 'AN': '8', 'BaseQRankSum': '-7.710e-01', 'ClippingRankSum': '0.00', 'DP': '902', 'ExcessHet': '0.0050', 'FS': '0.000', 'InbreedingCoeff': '0.8004', 'MLEAC': '12,1', 'MLEAF': '0.462,0.038', 'MQ': '60.29', 'MQRankSum': '0.00', 'QD': '33.99', 'ReadPosRankSum': '0.260', 'SF': '0,1,2,3,4,5,6', 'SOR': '0.657', 'set': 'HignConfSNPs'}, 'FORMAT': 'GT:PI:GQ:PG:PM:PW:AD:PL:DP:PB:PC', 'ms01e': './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', 'ms02g': './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', 'ms03g': './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', 'ms04h': '1/1:.:6:1/1:.:1/1:0,2:49,6,0,.,.,.:2:.:.', 'MA611': '0/0:.:78:0/0:.:0/0:29,0,0:0,78,1170,78,1170,1170:29:.:.', 'MA605': '0/0:.:9:0/0:.:0/0:3,0,0:0,9,112,9,112,112:3:.:.', 'MA622': '0/0:.:99:0/0:.:0/0:40,0,0:0,105,1575,105,1575,1575:40:.:.', 'samples': {'ms01e': {'GT': './.', 'PI': '.', 'GQ': '.', 'PG': './.', 'PM': '.', 'PW': './.', 'AD': '0,0', 'PL': '0,0,0,.,.,.', 'DP': '0', 'PB': '.', 'PC': '.'}, 'ms02g': {'GT': './.', 'PI': '.', 'GQ': '.', 'PG': './.', 'PM': '.', 'PW': './.', 'AD': '0,0', 'PL': '0,0,0,.,.,.', 'DP': '0', 'PB': '.', 'PC': '.'}, 'ms03g': {'GT': './.', 'PI': '.', 'GQ': '.', 'PG': './.', 'PM': '.', 'PW': './.', 'AD': '0,0', 'PL': '0,0,0,.,.,.', 'DP': '0', 'PB': '.', 'PC': '.'}, 'ms04h': {'GT': '1/1', 'PI': '.', 'GQ': '6', 'PG': '1/1', 'PM': '.', 'PW': '1/1', 'AD': '0,2', 'PL': '49,6,0,.,.,.', 'DP': '2', 'PB': '.', 'PC': '.'}, 'MA611': {'GT': '0/0', 'PI': '.', 'GQ': '78', 'PG': '0/0', 'PM': '.', 'PW': '0/0', 'AD': '29,0,0', 'PL': '0,78,1170,78,1170,1170', 'DP': '29', 'PB': '.', 'PC': '.'}, 'MA605': {'GT': '0/0', 'PI': '.', 'GQ': '9', 'PG': '0/0', 'PM': '.', 'PW': '0/0', 'AD': '3,0,0', 'PL': '0,9,112,9,112,112', 'DP': '3', 'PB': '.', 'PC': '.'}, 'MA622': {'GT': '0/0', 'PI': '.', 'GQ': '99', 'PG': '0/0', 'PM': '.', 'PW': '0/0', 'AD': '40,0,0', 'PL': '0,105,1575,105,1575,1575', 
'DP': '40', 'PB': '.', 'PC': '.'}}}
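The 'samples' entry in the output above pairs each FORMAT tag with the corresponding colon-separated sample value. A minimal sketch of that pairing, using a hypothetical helper name and a shortened FORMAT string, could look like this:

```python
def map_format_to_samples(format_str, sample_strs):
    """Pair FORMAT tags with each sample's values (hypothetical helper)."""
    tags = format_str.split(":")
    return {
        name: dict(zip(tags, value.split(":")))
        for name, value in sample_strs.items()
    }


samples = map_format_to_samples(
    "GT:PI:GQ",
    {"ms04h": "1/1:.:6", "MA611": "0/0:.:78"},
)
print(samples["ms04h"])
```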
get_info_as_dict(info_keys=None)[source]

Convert Info to dict for required keys

Parameters:info_keys (list) – Keys of interest (default = all keys will be mapped)
Returns:key: value pair of only required keys
Return type:dict

Notes

If ‘=’ isn’t present then it will return its value as ‘.’.

Examples

>>> info_str = 'AC=2,0;AF=1.00;AN=8;BaseQRankSum'
>>> info_keys= ['AC', 'BaseQRankSum']
>>> record.get_info_as_dict(info_keys)
{'AC': '2,0', 'BaseQRankSum': '-7.710e-01'}
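The rule in the Notes (a key without ‘=’ gets the value ‘.’) can be sketched independently of the Record class. The function info_to_dict is an assumption made for illustration and works on a raw INFO string rather than on a record.

```python
def info_to_dict(info_str, info_keys=None):
    """Convert a raw INFO string into a dict (hypothetical helper)."""
    mapped = {}
    for field in info_str.split(";"):
        key, sep, value = field.partition("=")
        # Flag-style entries have no "=", so they are stored as ".".
        mapped[key] = value if sep else "."
    if info_keys is None:
        return mapped
    return {key: mapped[key] for key in info_keys if key in mapped}


info_str = "AC=2,0;AF=1.00;AN=8;BaseQRankSum"
print(info_to_dict(info_str, ["AC", "BaseQRankSum"]))
```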
get_mapped_tag_list(sample_names=None, tag=None, bases='numeric')[source]
static get_tag_values_from_samples(order_mapped_samples, tag, sample_names, split_at=None)[source]

Splits the tags of given samples from order_dict of mapped_samples

Parameters:
  • order_mapped_samples (OrderedDict) – Ordered dictionary of FORMAT tags mapped to SAMPLE values.
  • tag (str) – One of the FORMAT tags.
  • sample_names (list) – Name of the samples to extract the values from.
  • split_at (str) – Character to split the value string at. e.g “|”, “/”, “,” etc.
Returns:

List of lists containing the SAMPLE values for the FORMAT tag

Return type:

list of list

Examples

>>> from collections import OrderedDict
>>> order_mapped_samples = OrderedDict([('ms01e',{'GT': './.', 'PI': '.'}), ('MA622', {'GT': '0/0','PI': '.'})])
>>> tag = 'GT'
>>> sample_names = ['ms01e', 'MA622']
>>> record.get_tag_values_from_samples(order_mapped_samples, tag, sample_names)
[['./.'], ['0/0']]
>>> # split_at="/|" splits the GT values at both "/" and "|"
>>> record.get_tag_values_from_samples(order_mapped_samples, tag, sample_names, split_at="/|")
[['.', '.'], ['0', '0']]
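Splitting at several separator characters at once is commonly done with a regular-expression character class. The sketch below assumes that behaviour for split_at; the function tag_values is hypothetical and is not the library's static method.

```python
import re
from collections import OrderedDict


def tag_values(mapped_samples, tag, sample_names, split_at=None):
    """Collect one FORMAT tag per sample, optionally split (hypothetical helper)."""
    values = []
    for name in sample_names:
        value = mapped_samples[name][tag]
        if split_at:
            # Treat every character of split_at as a separator, e.g. "/|".
            values.append(re.split("[" + re.escape(split_at) + "]", value))
        else:
            values.append([value])
    return values


mapped = OrderedDict(
    [("ms01e", {"GT": "./.", "PI": "."}), ("MA622", {"GT": "0/0", "PI": "."})]
)
print(tag_values(mapped, "GT", ["ms01e", "MA622"], split_at="/|"))
```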
hasAllele(allele='0', tag='GT', bases='numeric')[source]
Parameters:
  • allele (str) – allele to look for in the samples (default = ‘0’)
  • tag (str) – format tags of interest (default = ‘GT’)
  • bases (str) – iupac or numeric (default = ‘numeric’)
Returns:

dict of sample with values having given allele

Return type:

dict

Example

>>> record.hasAllele(allele='0', tag='GT', bases='numeric')
{'MA611': '0/0', 'MA605': '0/0', 'MA622': '0/0'}
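A check of this kind amounts to splitting each genotype into its alleles and testing membership. The following is a rough sketch with a hypothetical function name, operating on a plain sample-to-GT dict rather than on a Record.

```python
import re


def samples_with_allele(genotypes, allele="0"):
    """Return the samples whose genotype contains the allele (hypothetical helper)."""
    return {
        sample: gt
        for sample, gt in genotypes.items()
        if allele in re.split(r"[/|]", gt)
    }


gt_by_sample = {"ms01e": "./.", "ms04h": "1/1", "MA611": "0/0", "MA605": "0/0"}
print(samples_with_allele(gt_by_sample, allele="0"))
```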
hasINDEL()[source]
hasSNP(tag='GT', bases='numeric')[source]
hasVAR(genotype='0/0', tag='GT', bases='numeric')[source]
Parameters:
  • genotype (str) – genotype to look for in the samples (default = ‘0/0’)
  • tag (str) – format tags of interest (default = ‘GT’)
  • bases (str) – iupac or numeric (default = ‘numeric’)
Returns:

dict of sample with values having given genotype

Return type:

dict

Example

>>> record.hasVAR(genotype='0/0')
{'MA611': '0/0', 'MA605': '0/0', 'MA622': '0/0'}

has_phased(tag='GT', bases='numeric')[source]
Parameters:
  • tag (str) – format tags of interest (default = ‘GT’)
  • bases (str) – iupac or numeric (default = ‘numeric’)
Returns:

dict of samples whose values for the given tag contain ‘|’ (phased)

Return type:

dict

Examples

>>> rec_keys = ['CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO', 'FORMAT', 'ms01e', 'ms02g', 'ms03g', 'ms04h', 'MA611', 'MA605', 'MA622']
>>> rec_values = ['2', '15881018', '.', 'G', 'A,C', '5082.45', 'PASS', 'AC=2,0;AF=1.00;AN=8;BaseQRankSum=-7.710e-01;ClippingRankSum=0.00;DP=902;ExcessHet=0.0050;FS=0.000;InbreedingCoeff=0.8004;MLEAC=12,1;MLEAF=0.462,0.038;MQ=60.29;MQRankSum=0.00;QD=33.99;ReadPosRankSum=0.260;SF=0,1,2,3,4,5,6;SOR=0.657;set=HignConfSNPs', 'GT:PI:GQ:PG:PM:PW:AD:PL:DP:PB:PC', './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', '1/1:.:6:1/1:.:1/1:0,2:49,6,0,.,.,.:2:.:.', '0/0:.:78:0/0:.:0/0:29,0,0:0,78,1170,78,1170,1170:29:.:.', '0/0:.:9:0/0:.:0/0:3,0,0:0,9,112,9,112,112:3:.:.', '0/0:.:99:0/0:.:0/0:40,0,0:0,105,1575,105,1575,1575:40:.:.']
>>> from vcfparser.record_parser import Record
>>> rec_obj = Record(rec_values, rec_keys)
>>> rec_obj.has_phased(tag="GT", bases="iupac")
{}
has_unphased(tag='GT', bases='numeric')[source]
Parameters:
  • tag (str) – format tags of interest (default = ‘GT’)
  • bases (str) – iupac or numeric (default = ‘numeric’)
Returns:

dict of samples whose values for the given tag contain ‘/’ (unphased)

Return type:

dict

Examples

>>> rec_keys = ['CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO', 'FORMAT', 'ms01e', 'ms02g', 'ms03g', 'ms04h', 'MA611', 'MA605', 'MA622']
>>> rec_values = ['2', '15881018', '.', 'G', 'A,C', '5082.45', 'PASS', 'AC=2,0;AF=1.00;AN=8;BaseQRankSum=-7.710e-01;ClippingRankSum=0.00;DP=902;ExcessHet=0.0050;FS=0.000;InbreedingCoeff=0.8004;MLEAC=12,1;MLEAF=0.462,0.038;MQ=60.29;MQRankSum=0.00;QD=33.99;ReadPosRankSum=0.260;SF=0,1,2,3,4,5,6;SOR=0.657;set=HignConfSNPs', 'GT:PI:GQ:PG:PM:PW:AD:PL:DP:PB:PC', './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', '1/1:.:6:1/1:.:1/1:0,2:49,6,0,.,.,.:2:.:.', '0/0:.:78:0/0:.:0/0:29,0,0:0,78,1170,78,1170,1170:29:.:.', '0/0:.:9:0/0:.:0/0:3,0,0:0,9,112,9,112,112:3:.:.', '0/0:.:99:0/0:.:0/0:40,0,0:0,105,1575,105,1575,1575:40:.:.']
>>> from vcfparser.record_parser import Record
>>> rec_obj = Record(rec_values, rec_keys)
>>> rec_obj.has_unphased(tag="GT", bases="iupac")
{'ms01e': './.', 'ms02g': './.', 'ms03g': './.', 'ms04h': 'A/A', 'MA611': 'G/G', 'MA605': 'G/G', 'MA622': 'G/G'}
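In VCF, ‘|’ separates phased alleles and ‘/’ separates unphased ones, so the two checks reduce to looking for the separator. The sketch below uses the numeric GT values from the example record; the function name is an assumption and no IUPAC conversion is done.

```python
def split_by_phase(genotypes):
    """Separate phased ("|") from unphased ("/") genotypes (hypothetical helper)."""
    phased = {sample: gt for sample, gt in genotypes.items() if "|" in gt}
    unphased = {sample: gt for sample, gt in genotypes.items() if "/" in gt}
    return phased, unphased


gt_values = {
    "ms01e": "./.", "ms02g": "./.", "ms03g": "./.", "ms04h": "1/1",
    "MA611": "0/0", "MA605": "0/0", "MA622": "0/0",
}
phased, unphased = split_by_phase(gt_values)
print(phased)
print(sorted(unphased))
```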
hasnoVAR(tag='GT')[source]

Returns samples with empty genotype

isHETVAR(tag='GT', bases='numeric')[source]
Parameters:
  • tag (str) – format tags of interest (default = ‘GT’)
  • bases (str) – iupac or numeric (default = ‘numeric’)
Returns:

dict of samples with heterozygous variant (hetvar) values

Return type:

dict

Examples

>>> record.isHETVAR(tag="GT", bases="numeric")
{}
isHOMREF(tag='GT', bases='numeric')[source]
Parameters:
  • tag (str) – format tags of interest (default = ‘GT’)
  • bases (str) – iupac or numeric (default = ‘numeric’)
Returns:

dict of samples with homozygous reference (homref) values

Return type:

dict

Examples

>>> rec_keys = ['CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO', 'FORMAT', 'ms01e', 'ms02g', 'ms03g', 'ms04h', 'MA611', 'MA605', 'MA622']
>>> rec_values = ['2', '15881018', '.', 'G', 'A,C', '5082.45', 'PASS', 'AC=2,0;AF=1.00;AN=8;BaseQRankSum=-7.710e-01;ClippingRankSum=0.00;DP=902;ExcessHet=0.0050;FS=0.000;InbreedingCoeff=0.8004;MLEAC=12,1;MLEAF=0.462,0.038;MQ=60.29;MQRankSum=0.00;QD=33.99;ReadPosRankSum=0.260;SF=0,1,2,3,4,5,6;SOR=0.657;set=HignConfSNPs', 'GT:PI:GQ:PG:PM:PW:AD:PL:DP:PB:PC', './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', '1/1:.:6:1/1:.:1/1:0,2:49,6,0,.,.,.:2:.:.', '0/0:.:78:0/0:.:0/0:29,0,0:0,78,1170,78,1170,1170:29:.:.', '0/0:.:9:0/0:.:0/0:3,0,0:0,9,112,9,112,112:3:.:.', '0/0:.:99:0/0:.:0/0:40,0,0:0,105,1575,105,1575,1575:40:.:.']
>>> from vcfparser.record_parser import Record
>>> rec_obj = Record(rec_values, rec_keys)
>>> rec_obj.isHOMREF(tag="GT", bases="iupac")
{'MA611': 'G/G', 'MA605': 'G/G', 'MA622': 'G/G'}
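With bases="iupac", the numeric allele indices are shown as bases: index 0 is REF and indices 1, 2, ... follow the order of ALT. A minimal sketch of that conversion, under the assumption that missing alleles (‘.’) are left unchanged, is given below; numeric_to_iupac is a hypothetical name.

```python
import re


def numeric_to_iupac(ref, alt, genotype):
    """Replace allele indices in a GT string with bases (hypothetical helper)."""
    alleles = [ref] + alt.split(",")
    # The capturing group keeps the separators ("/" or "|") in the result.
    parts = re.split(r"([/|])", genotype)
    return "".join(
        part if part in ("/", "|", ".") else alleles[int(part)]
        for part in parts
    )


# REF and ALT taken from the example record: REF='G', ALT='A,C'.
for gt in ("0/0", "1/1", "./."):
    print(gt, numeric_to_iupac("G", "A,C", gt))
```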
isHOMVAR(tag='GT', bases='numeric')[source]
Parameters:
  • tag (str) – format tags of interest (default = ‘GT’)
  • bases (str) – iupac or numeric (default = ‘numeric’)
Returns:

dict of samples with homozygous variant (homvar) values

Return type:

dict

Examples

>>> record.isHOMVAR(tag="GT", bases="iupac")
{'ms01e': './.', 'ms02g': './.', 'ms03g': './.', 'ms04h': 'A/A', 'MA611': 'G/G', 'MA605': 'G/G', 'MA622': 'G/G'}
isMissing(tag='GT')[source]
Parameters:tag (str) – format tags of interest (default = ‘GT’)
Returns:dict of samples with missing values for the given tag
Return type:dict

Examples

>>> record.isMissing(tag='PI')
{'ms01e': '.', 'ms02g': '.', 'ms03g': '.', 'ms04h': '.', 'MA611': '.', 'MA605': '.', 'MA622': '.'}
iupac_to_numeric(ref_alt, genotype_in_iupac)[source]
mapped_rec_to_str(mapped_sample_dict)[source]
static split_genotype_tags()[source]
unmap_fmt_samples_dict(mapped_dict)[source]

Converts mapped dict again into string to write into the file.
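Going back from the mapped form to text means rebuilding the FORMAT column and one colon-joined value per sample. The sketch below assumes every sample carries the same tags in the same order and joins the columns with tabs; the function name and the exact output layout are assumptions, not the library's behaviour.

```python
def unmap_samples(mapped):
    """Rebuild 'FORMAT<TAB>sample1<TAB>sample2...' text (hypothetical helper)."""
    sample_names = list(mapped)
    # Assumption: all samples share the tag order of the first sample.
    tags = list(mapped[sample_names[0]])
    format_str = ":".join(tags)
    sample_strs = [":".join(mapped[name][tag] for tag in tags) for name in sample_names]
    return "\t".join([format_str] + sample_strs)


mapped_dict = {
    "ms01e": {"GT": "./.", "PI": "."},
    "MA622": {"GT": "0/0", "PI": "."},
}
print(unmap_samples(mapped_dict))
```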

vcfparser.vcf_parser module

class vcfparser.vcf_parser.VcfParser(filename)[source]

Bases: object

A class to parse the metadata information and yield records from the input VCF.

__init__(filename)[source]
Parameters:filename (file) – input VCF file to parse. Bgzipped files are also supported.
Returns:VCF object for iterating and querying.
Return type:Object
parse_metadata()[source]

Parses the metadata information from the VCF header.

Returns:MetaDataParser object for iterating and querying the metadata information.
Return type:Object

Uses

MetaDataParser class to create MetaData object

parse_records(chrom=None, pos_range=None, no_processors=1)[source]

Parse records and yield it.

Parameters:
  • chrom (str) – chromosome name or number. Default = None
  • pos_range (tuple) – genomic position of interest, e.g: (5, 15). Both upper and lower limits are inclusive. Default = None
  • no_of_recs (int) – number of records to process

Uses

Record module to create a Record object

Yields:Record object for iterating and querying the record information.
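The chrom and pos_range filters can be pictured as a generator over the record lines, with both ends of pos_range inclusive as stated above. This is only a sketch on raw text lines: filter_records and example_lines are made up for the illustration, and it yields split fields rather than Record objects.

```python
def filter_records(lines, chrom=None, pos_range=None):
    """Yield the fields of record lines that match the filters (hypothetical helper)."""
    for line in lines:
        if line.startswith("#"):
            continue  # skip metadata lines and the #CHROM header line
        fields = line.rstrip("\n").split("\t")
        if chrom is not None and fields[0] != str(chrom):
            continue
        if pos_range is not None:
            low, high = pos_range
            # Both limits are inclusive.
            if not low <= int(fields[1]) <= high:
                continue
        yield fields


example_lines = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\n",
    "2\t4\t.\tG\tA\n",
    "2\t5\t.\tG\tA\n",
    "2\t15\t.\tT\tC\n",
    "3\t10\t.\tA\tG\n",
]
for fields in filter_records(example_lines, chrom="2", pos_range=(5, 15)):
    print(fields[0], fields[1])
```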

vcfparser.vcf_writer module

class vcfparser.vcf_writer.VCFWriter(filename)[source]

Bases: object

A VCF writer that writes the header line and data lines to a new file.

add_contig(id, length, key='contig')[source]
add_filter(id, desc='', key='FILTER')[source]
add_filter_long(id, num='.', type='.', desc='', key='FILTER')[source]
add_format(id, num='.', type='.', desc='', key='FORMAT')[source]
add_header_line(record_keys)[source]
add_info(id, num='.', type='.', desc='', key='INFO')[source]
add_normal_metadata(key, value)[source]

This is used to add plain key-value metadata such as fileformat, filedate and reference.
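For orientation, VCF meta lines come in two shapes: plain ‘##key=value’ lines and structured lines such as ‘##INFO=<ID=...,Number=...,Type=...,Description="...">’. The helpers below only format such strings; their names and parameters are assumptions for this sketch and do not reflect how VCFWriter builds or writes its lines.

```python
def format_simple_meta(key, value):
    """Format a plain key-value meta line (hypothetical helper)."""
    return "##{}={}".format(key, value)


def format_structured_meta(key, id_, num=".", type_=".", desc=""):
    """Format an INFO/FORMAT style meta line (hypothetical helper)."""
    return '##{}=<ID={},Number={},Type={},Description="{}">'.format(
        key, id_, num, type_, desc
    )


print(format_simple_meta("fileformat", "VCFv4.2"))
print(format_structured_meta("INFO", "DP", num="1", type_="Integer", desc="Total Depth"))
```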

add_record_value(preheader, info, format_, sample_str)[source]

Module contents