Welcome to Vcfparser’s documentation!¶
vcfparser¶
Python (version >=3.6) package for parsing the genomics and transcriptomics VCF data.
- Free software: MIT license
- Documentation: https://vcfparser.readthedocs.io.
Features¶
- No external dependency except python (version >=3.6).
- Minimalistic in nature.
- Provides a lot of features to API users.
- Cython compiling is provided to optimize performance.
Installation¶
This is only preferred while developing/optimizing VcfSimplify along with vcfparser.
Navigate to the VCFsimplify directory -> activate python -> call the ‘vcfparser’ package.
$ C:\Users\>cd VCF-Simplify
$ C:\Users\>cd VCF-Simplify>dir
Volume in drive C is StorageDrive
Volume Serial Number is .........
Directory of C:\Users\VCF-Simplify
07/12/2020 10:14 AM <DIR> .
07/12/2020 10:14 AM <DIR> ..
07/12/2020 08:55 AM <DIR> .github
............................
............................
07/12/2020 10:42 AM <DIR> vcfparser
07/12/2020 08:55 AM 1,494 VcfSimplify.py
11 File(s) 20,873,992 bytes
13 Dir(s) 241,211,793,408 bytes free
$ C:\Users\VCF-Simplify>python
Python 3.8.1 (tags/v3.8.1:1b293b6, Dec 18 2019, 22:39:24) [MSC v.1916 (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from vcfparser import VcfParser
>>>
$ pip install vcfparser
Cythonize (optional but helpful)¶
The installed “vcfparser” package can be cythonized to optimize performance. Cythonizing the package can increase the speed of the parser by about x.x - y.y (?) times.
TODO: Bhuwan - add required cython method in here
Usage¶
>>> from vcfparser import VcfParser
>>> vcf_obj = VcfParser('input_test.vcf')
Get metadata information from the vcf file¶
>>> metainfo = vcf_obj.parse_metadata()
>>> metainfo.fileformat
'VCFv4.2'
>>> metainfo.filters_
[{'ID': 'LowQual', 'Description': 'Low quality'}, {'ID': 'my_indel_filter', 'Description': 'QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0'}, {'ID': 'my_snp_filter', 'Description': 'QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0'}]
>>> metainfo.alt_
[{'ID': 'NON_REF', 'Description': 'Represents any possible alternative allele at this location'}]
>>> metainfo.sample_names
['ms01e', 'ms02g', 'ms03g', 'ms04h', 'MA611', 'MA605', 'MA622']
>>> metainfo.record_keys
['CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO', 'FORMAT', 'ms01e', 'ms02g', 'ms03g', 'ms04h', 'MA611', 'MA605', 'MA622']
Get Records from the vcf file¶
>>> records = vcf_obj.parse_records()
# Note: Records are returned as generator.
>>> first_record = next(records)
>>> first_record.CHROM
'2'
>>> first_record.POS
'15881018'
>>> first_record.REF
'G'
>>> first_record.ALT
'A,C'
>>> first_record.QUAL
'5082.45'
>>> first_record.FILTER
['PASS']
>>> first_record.get_mapped_samples()
{'ms01e': {'GT': './.', 'PI': '.', 'GQ': '.', 'PG': './.', 'PM': '.', 'PW': './.', 'AD': '0,0', 'PL': '0,0,0,.,.,.', 'DP': '0', 'PB': '.', 'PC': '.'}, 'ms02g': {'GT': './.', 'PI': '.', 'GQ': '.', 'PG': './.', 'PM': '.', 'PW': './.', 'AD': '0,0', 'PL': '0,0,0,.,.,.', 'DP': '0', 'PB': '.', 'PC': '.'}, 'ms03g': {'GT': './.', 'PI': '.', 'GQ': '.', 'PG': './.', 'PM': '.', 'PW': './.', 'AD': '0,0', 'PL': '0,0,0,.,.,.', 'DP': '0', 'PB': '.', 'PC': '.'}, 'ms04h': {'GT': '1/1', 'PI': '.', 'GQ': '6', 'PG': '1/1', 'PM': '.', 'PW': '1/1', 'AD': '0,2', 'PL': '49,6,0,.,.,.', 'DP': '2', 'PB': '.', 'PC': '.'}, 'MA611': {'GT': '0/0', 'PI': '.', 'GQ': '78', 'PG': '0/0', 'PM': '.', 'PW': '0/0', 'AD': '29,0,0', 'PL': '0,78,1170,78,1170,1170', 'DP': '29', 'PB': '.', 'PC': '.'}, 'MA605': {'GT': '0/0', 'PI': '.', 'GQ': '9', 'PG': '0/0', 'PM': '.', 'PW': '0/0', 'AD': '3,0,0', 'PL': '0,9,112,9,112,112', 'DP': '3', 'PB': '.', 'PC': '.'}, 'MA622': {'GT': '0/0', 'PI': '.', 'GQ': '99', 'PG': '0/0', 'PM': '.', 'PW': '0/0', 'AD': '40,0,0', 'PL': '0,105,1575,105,1575,1575', 'DP': '40', 'PB': '.', 'PC': '.\n'}}
TODO: Bhuwan (priority - high) The very last example “first_record.get_mapped_samples()” is returning the value of the last sample/key with “n”. i.e: ‘PC’: ‘.n’ Please fix that issue - strip(‘n’) in the line before parsing.
Alternately, we can loop over each record by using a for-loop:
for record in records:
chrom = record.CHROM
pos = record.POS
id = record.ID
ref = record.REF
alt = record.ALT
qual = record.QUAL
filter = record.FILTER
format_ = record.format_
infos = record.get_info_dict()
mapped_sample = record.get_mapped_samples()
Installation¶
Stable release¶
To install vcfparser, run this command in your terminal:
$ pip install vcfparser
This is the preferred method to install vcfparser, as it will always install the most recent stable release.
If you don’t have pip installed, use this Python installation guide.
From sources¶
The sources for vcfparser can be downloaded from the Github repo.
You can either clone the public repository:
$ git clone git://github.com/everestial/vcfparser
Or download the tarball:
$ curl -OL https://github.com/everestial/vcfparser/tarball/master
Once you have a copy of the source, you can install it with:
$ python setup.py install
vcfparser¶
vcfparser package¶
Submodules¶
vcfparser.meta_header_parser module¶
vcfparser.metaviewer module¶
vcfparser.record_parser module¶
-
class
vcfparser.record_parser.
Record
(record_values, record_keys)[source]¶ Bases:
object
A class that converts the record lines from input VCF into accessible record object.
-
__init__
(record_values, record_keys)[source]¶ Initializes the class with header keys and record values.
Parameters: - record_keys (list) –
- list of record keys generated for the record values
- generated from string in the VCF that starts with #CHROM
- stays the same for a particular VCF file
- record_values (list) –
- list of record values generated from the VCF record line
- genrated from the lines below # CHROM in VCF file
- values are dynamically updated in each for-loop
- record_keys (list) –
-
get_format_to_sample_map
(sample_names=None, formats=None, convert_to_iupac=None)[source]¶ Parameters: - sample_names (list) – list of sample names that needs to be processed (default = all samples are processed)
- formats (list) – list of format tags that needs to be processed (default = all format tags are processed)
- convert_to_iupac (list) – list of tags (from FORMAT) that needs to be converted into iupac bases (default tag = ‘GT’, default output = numeric bases)
Returns: dict of filtered sample names along with filtered format “tags:values”
Return type: dict
Examples
>>> import vcfparser.vcf_parser as vcfparse >>> myvcf = vcfparse.VcfParser("input_test.vcf") >>> records = myvcf.parse_records() >>> record = first(record) >>> record.mapped_format_to_sample = {'ms01e': {'GT': './.','PI': '.', 'PC': '.'}, 'MA622': {'GT': '0/0', 'PI': '.', 'PC': '.'}, 'MA611': {'GT': '0/0', 'PI': '.', 'PC': '.'}} >>> record.get_format_to_sample_map(self, sample_names= ['ms01e', 'MA611'], formats= ['GT', 'PC']) {'ms01e': {'GT': './.', 'PC': '.'}, 'MA611': {'GT': '0/0', 'PC': '.'}}
>>> record.get_format_to_sample_map(self, sample_names= ['ms01e', 'MA611'], formats= ['GT', 'PC', 'PG'], convert_to_iupac= ['GT', 'PG']) {'ms01e': {'GT': './.', 'PC': '.'}, 'MA611': {'GT': '0/0', 'PC': '.'}}
-
get_full_record_map
(convert_to_iupac=None)[source]¶ Maps record values with record keys.
Parameters: convert_to_iupac (list) – list of genotpye tags that needs to be converted into iupac bases (default tag = ‘GT’, default output = numeric bases) Returns: - dict – dict with key value pair with sample and infos modified
- TODO (Done (Gopal) Add example input and output)
Examples
>>> record.get_full_record_map() {'CHROM': '2', 'POS': '15881018', 'ID': '.', 'REF': 'G', 'ALT': 'A,C', 'QUAL': '5082.45', 'FILTER': 'PASS', 'INFO': {'AC': '2,0', 'AF': '1.00', 'AN': '8', 'BaseQRankSum': '-7.710e-01', 'ClippingRankSum': '0.00', 'DP': '902', 'ExcessHet': '0.0050', 'FS': '0.000', 'InbreedingCoeff': '0.8004', 'MLEAC': '12,1', 'MLEAF': '0.462,0.038', 'MQ': '60.29', 'MQRankSum': '0.00', 'QD': '33.99', 'ReadPosRankSum': '0.260', 'SF': '0,1,2,3,4,5,6', 'SOR': '0.657', 'set': 'HignConfSNPs'}, 'FORMAT': 'GT:PI:GQ:PG:PM:PW:AD:PL:DP:PB:PC', 'ms01e': './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', 'ms02g': './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', 'ms03g': './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', 'ms04h': '1/1:.:6:1/1:.:1/1:0,2:49,6,0,.,.,.:2:.:.', 'MA611': '0/0:.:78:0/0:.:0/0:29,0,0:0,78,1170,78,1170,1170:29:.:.', 'MA605': '0/0:.:9:0/0:.:0/0:3,0,0:0,9,112,9,112,112:3:.:.', 'MA622': '0/0:.:99:0/0:.:0/0:40,0,0:0,105,1575,105,1575,1575:40:.:.', 'samples': {'ms01e': {'GT': './.', 'PI': '.', 'GQ': '.', 'PG': './.', 'PM': '.', 'PW': './.', 'AD': '0,0', 'PL': '0,0,0,.,.,.', 'DP': '0', 'PB': '.', 'PC': '.'}, 'ms02g': {'GT': './.', 'PI': '.', 'GQ': '.', 'PG': './.', 'PM': '.', 'PW': './.', 'AD': '0,0', 'PL': '0,0,0,.,.,.', 'DP': '0', 'PB': '.', 'PC': '.'}, 'ms03g': {'GT': './.', 'PI': '.', 'GQ': '.', 'PG': './.', 'PM': '.', 'PW': './.', 'AD': '0,0', 'PL': '0,0,0,.,.,.', 'DP': '0', 'PB': '.', 'PC': '.'}, 'ms04h': {'GT': '1/1', 'PI': '.', 'GQ': '6', 'PG': '1/1', 'PM': '.', 'PW': '1/1', 'AD': '0,2', 'PL': '49,6,0,.,.,.', 'DP': '2', 'PB': '.', 'PC': '.'}, 'MA611': {'GT': '0/0', 'PI': '.', 'GQ': '78', 'PG': '0/0', 'PM': '.', 'PW': '0/0', 'AD': '29,0,0', 'PL': '0,78,1170,78,1170,1170', 'DP': '29', 'PB': '.', 'PC': '.'}, 'MA605': {'GT': '0/0', 'PI': '.', 'GQ': '9', 'PG': '0/0', 'PM': '.', 'PW': '0/0', 'AD': '3,0,0', 'PL': '0,9,112,9,112,112', 'DP': '3', 'PB': '.', 'PC': '.'}, 'MA622': {'GT': '0/0', 'PI': '.', 'GQ': '99', 'PG': '0/0', 'PM': '.', 'PW': '0/0', 'AD': '40,0,0', 'PL': '0,105,1575,105,1575,1575', 'DP': '40', 'PB': '.', 'PC': '.'}}}
-
get_info_as_dict
(info_keys=None)[source]¶ Convert Info to dict for required keys
Parameters: info_keys (list) – Keys of interest (default = all keys will be mapped) Returns: key: value pair of only required keys Return type: dict Notes
If ‘=’ isn’t present then it will return its value as ‘.’.
Examples
>>> info_str = 'AC=2,0;AF=1.00;AN=8;BaseQRankSum' >>> info_keys= ['AC', 'BaseQRankSum'] >>> record.get_info_as_dict(info_keys) {'AC': '2,0', 'BaseQRankSum': '-7.710e-01'}
-
static
get_tag_values_from_samples
(order_mapped_samples, tag, sample_names, split_at=None)[source]¶ Splits the tags of given samples from order_dict of mapped_samples
Parameters: - order_mapped_samples (OrderedDict) – Ordered dictionary of FORMAT tags mapped to SAMPLE values.
- tag (str) – One of the FORMAT tag.
- sample_names (list) – Name of the samples to extract the values from.
- split_at (str) – Character to split the value string at. e.g “|”, “/”, “,” etc.
Returns: List of list containing SAMPLE value for the FORMAT tag
Return type: list of list
Examples
>>> order_mapped_samples = OrderedDict([('ms01e',{'GT': './.', 'PI': '.'}), ('MA622', {'GT': '0/0','PI': '.'})]) >>> tag = 'GT' >>> sample_names = ['ms01e', 'MA622'] >>> record.get_tag_values_from_samples(order_mapped_samples, tag, sample_names) [['./.'], ['0/0']] >>> # using "/|" # to split at GT values at both | and / >>> get_tag_values_from_samples(order_mapped_samples, tag, sample_names, split_at= "/|") [['.', '.'], ['0', '0']]
-
hasAllele
(allele='0', tag='GT', bases='numeric')[source]¶ Parameters: - allele (str) – allele to check if it is present in given samples(default = ‘0’)
- tag (str) – format tags of interest (default = ‘GT’)
- bases (str) – iupac or numeric (default = ‘numeric’)
Returns: dict of sample with values having given allele
Return type: dict
Example
>>> record.hasAllele(allele='0', tag='GT', bases='numeric') {'MA611': '0/0', 'MA605': '0/0', 'MA622': '0/0'}
-
hasVAR
(genotype='0/0', tag='GT', bases='numeric')[source]¶ Parameters: - genotype (str) – genotype to check if it is present in given samples(default = ‘0/0’)
- tag (str) – format tags of interest (default = ‘GT’)
- bases (str) – iupac or numeric (default = ‘numeric’)
Returns: dict of sample with values having given genotype
Return type: dict
Example
>>> record.hasVAR(genotype='0/0')
{‘MA611’: ‘0/0’, ‘MA605’: ‘0/0’, ‘MA622’: ‘0/0’}
-
has_phased
(tag='GT', bases='numeric')[source]¶ Parameters: - tag (str) – format tags of interest (default = ‘GT’)
- bases (str) – iupac or numeric (default = ‘numeric’)
Returns: dict of sample with values having ‘/’ in samples formats
Return type: dict
Examples
>>> rec_keys = ['CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO', 'FORMAT', 'ms01e', 'ms02g', 'ms03g', 'ms04h', 'MA611', 'MA605', 'MA622'] >>> rec_values = ['2', '15881018', '.', 'G', 'A,C', '5082.45', 'PASS', 'AC=2,0;AF=1.00;AN=8;BaseQRankSum=-7.710e-01;ClippingRankSum=0.00;DP=902;ExcessHet=0.0050;FS=0.000;InbreedingCoeff=0.8004;MLEAC=12,1;MLEAF=0.462,0.038;MQ=60.29;MQRankSum=0.00;QD=33.99;ReadPosRankSum=0.260;SF=0,1,2,3,4,5,6;SOR=0.657;set=HignConfSNPs', 'GT:PI:GQ:PG:PM:PW:AD:PL:DP:PB:PC', './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', '1/1:.:6:1/1:.:1/1:0,2:49,6,0,.,.,.:2:.:.', '0/0:.:78:0/0:.:0/0:29,0,0:0,78,1170,78,1170,1170:29:.:.', '0/0:.:9:0/0:.:0/0:3,0,0:0,9,112,9,112,112:3:.:.', '0/0:.:99:0/0:.:0/0:40,0,0:0,105,1575,105,1575,1575:40:.:.'] >>> from vcfpaser.record_parser import Record >>> rec_obj = Record(rec_values, rec_keys) >>> rec_obj.has_phased(tag="GT", bases="iupac") {}
-
has_unphased
(tag='GT', bases='numeric')[source]¶ Parameters: - tag (str) – format tags of interest (default = ‘GT’)
- bases (str) – iupac or numeric (default = ‘numeric’)
Returns: dict of sample with values having ‘/’ in samples formats
Return type: dict
Examples
>>> rec_keys = ['CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO', 'FORMAT', 'ms01e', 'ms02g', 'ms03g', 'ms04h', 'MA611', 'MA605', 'MA622'] >>> rec_values = ['2', '15881018', '.', 'G', 'A,C', '5082.45', 'PASS', 'AC=2,0;AF=1.00;AN=8;BaseQRankSum=-7.710e-01;ClippingRankSum=0.00;DP=902;ExcessHet=0.0050;FS=0.000;InbreedingCoeff=0.8004;MLEAC=12,1;MLEAF=0.462,0.038;MQ=60.29;MQRankSum=0.00;QD=33.99;ReadPosRankSum=0.260;SF=0,1,2,3,4,5,6;SOR=0.657;set=HignConfSNPs', 'GT:PI:GQ:PG:PM:PW:AD:PL:DP:PB:PC', './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', '1/1:.:6:1/1:.:1/1:0,2:49,6,0,.,.,.:2:.:.', '0/0:.:78:0/0:.:0/0:29,0,0:0,78,1170,78,1170,1170:29:.:.', '0/0:.:9:0/0:.:0/0:3,0,0:0,9,112,9,112,112:3:.:.', '0/0:.:99:0/0:.:0/0:40,0,0:0,105,1575,105,1575,1575:40:.:.'] >>> from vcfparser.record_parser import Record >>> rec_obj = Record(rec_values, rec_keys) >>> rec_obj.has_unphased(tag="GT", bases="iupac") {'ms01e': './.', 'ms02g': './.', 'ms03g': './.', 'ms04h': 'A/A', 'MA611': 'G/G', 'MA605': 'G/G', 'MA622': 'G/G'}
-
isHETVAR
(tag='GT', bases='numeric')[source]¶ Parameters: - tag (str) – format tags of interest (default = ‘GT’)
- bases (str) – iupac or numeric (default = ‘numeric’)
Returns: dict of sample with values having homoref
Return type: dict
Examples
>>> record.isHETVAR(tag="GT", bases="numeric") {}
-
isHOMREF
(tag='GT', bases='numeric')[source]¶ Parameters: - tag (str) – format tags of interest (default = ‘GT’)
- bases (str) – iupac or numeric (default = ‘numeric’)
Returns: dict of sample with values having homoref
Return type: dict
Examples
>>> rec_keys = ['CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO', 'FORMAT', 'ms01e', 'ms02g', 'ms03g', 'ms04h', 'MA611', 'MA605', 'MA622'] >>> rec_values = ['2', '15881018', '.', 'G', 'A,C', '5082.45', 'PASS', 'AC=2,0;AF=1.00;AN=8;BaseQRankSum=-7.710e-01;ClippingRankSum=0.00;DP=902;ExcessHet=0.0050;FS=0.000;InbreedingCoeff=0.8004;MLEAC=12,1;MLEAF=0.462,0.038;MQ=60.29;MQRankSum=0.00;QD=33.99;ReadPosRankSum=0.260;SF=0,1,2,3,4,5,6;SOR=0.657;set=HignConfSNPs', 'GT:PI:GQ:PG:PM:PW:AD:PL:DP:PB:PC', './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', '1/1:.:6:1/1:.:1/1:0,2:49,6,0,.,.,.:2:.:.', '0/0:.:78:0/0:.:0/0:29,0,0:0,78,1170,78,1170,1170:29:.:.', '0/0:.:9:0/0:.:0/0:3,0,0:0,9,112,9,112,112:3:.:.', '0/0:.:99:0/0:.:0/0:40,0,0:0,105,1575,105,1575,1575:40:.:.'] >>> from vcfparser.record_parser import Record >>> rec_obj = Record(rec_values, rec_keys) >>> rec_obj.isHOMREF(tag="GT", bases="iupac") {'MA611': 'G/G', 'MA605': 'G/G', 'MA622': 'G/G'}
-
isHOMVAR
(tag='GT', bases='numeric')[source]¶ Parameters: - tag (str) – format tags of interest (default = ‘GT’)
- bases (str) – iupac or numeric (default = ‘numeric’)
Returns: dict of sample with values having homoref
Return type: dict
Examples
>>> record.isHOMVAR(tag="GT", bases="iupac") {'ms01e': './.', 'ms02g': './.', 'ms03g': './.', 'ms04h': 'A/A', 'MA611': 'G/G', 'MA605': 'G/G', 'MA622': 'G/G'}
-
isMissing
(tag='GT')[source]¶ Parameters: tag (str) – format tags of interest (default = ‘GT’) Returns: dict of sample with values having homoref Return type: dict Examples
>>> record.isMissing(tag='PI') {'ms01e': '.', 'ms02g': '.', 'ms03g': '.', 'ms04h': '.', 'MA611': '.', 'MA605': '.', 'MA622': '.'}
-
vcfparser.vcf_parser module¶
-
class
vcfparser.vcf_parser.
VcfParser
(filename)[source]¶ Bases:
object
A class to parse the metadata information and yield records from the input VCF.
-
__init__
(filename)[source]¶ Parameters: filename (file) – input vcf file that needs to be parsed. bgzipped files are also supported. Returns: VCF object for iterating and querying. Return type: Object
-
parse_metadata
()[source] function to parse the metadata information from VCF header.
Returns: MetaDataParser object for iterating and querying the metadata information. Return type: Object Uses
MetaDataParser class to create MetaData object
-
parse_records
(chrom=None, pos_range=None, no_processors=1)[source] Parse records and yield it.
Parameters: - chrom (str) – chormosome name or number. Default = None
- pos_range (tuple) – genomic position of interest, e.g: (5, 15). Both upper and lower limits are inclusive. Default = None
- no_of_recs (int) – number of records to process
Uses
Record module to create a Record object
Yields: Record object for interating and quering the record information.
-
vcfparser.vcf_writer module¶
Module contents¶
Tutorial on MetaData¶
Advanced tutorial on vcf parser module showing available methods for parsing metadata.
First import VcfParser
module and instantiate an vcf object by
passing vcf file as an argument.
Initial setup:¶
>>> from vcfparser import VcfParser
>>> vcf_obj = VcfParser('input_test.vcf')
We can also pass gzipped vcf file as an argument.
>>> vcf_obj = VcfParser('input_test.vcf.gz')
VcfParser
module has two main methods:
- parse_metadata: It contains methods for extracting information related to the metadata header.
- parse_records: It contains methods for retrieving the record values from the vcf file.
Parsing VCF metadata:¶
To parse the metadata information we can call parse_metadata()
:
>>> # pass the VCF object to the 'parse_metadata()' function
>>> metainfo = vcf_obj.parse_metadata()
- Metainfo provides several attributes and objects that helps in extracting specific metadata information.
- These informations are reported as a list or dictionary.
To list all the available attributes within metainfo do:
>>> metainfo.__dir__()
['header_file', 'infos_', 'filters_', 'contig', 'format_', 'alt_', 'other_lines', 'fileformat', 'reference', 'sample_names', 'is_gvcf', 'gvcf_blocks', 'record_keys', 'VCFspec', 'gatk_commands', 'raw_meta_data', '_format_pattern', '_meta_pattern', 'sample_with_pos', '__module__', '__doc__', '__init__', '_parse_gvcf_block', '_parse_gatk_commands', 'parse_lines', '__dict__', '__weakref__', '__repr__', '__hash__', '__str__', '__getattribute__', '__setattr__', '__delattr__', '__lt__', '__le__', '__eq__', '__ne__', '__gt__', '__ge__', '__new__', '__reduce_ex__', '__reduce__', '__subclasshook__', '__init_subclass__', '__format__', '__sizeof__', '__dir__', '__class__']
>>> # or
>>> dir(metainfo)
['VCFspec', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_format_pattern', '_meta_pattern', '_parse_gatk_commands', '_parse_gvcf_block', 'alt_', 'contig', 'fileformat', 'filters_', 'format_', 'gatk_commands', 'gvcf_blocks', 'header_file', 'infos_', 'is_gvcf', 'other_lines', 'parse_lines', 'raw_meta_data', 'record_keys', 'reference', 'sample_names', 'sample_with_pos', 'testA']
To call the specific objects and attributes do:
>>> metainfo.VCFspec
[{'fileformat': 'VCFv4.2'}, {'GVCF': True}]
>>> metainfo.infos_
[{'ID': 'AF', 'Number': 'A', 'Type': 'Float', 'Description': 'Allele Frequency, for each ALT allele, in the same order as listed'}, {'ID': 'BaseQRankSum', 'Number': '1', 'Type': 'Float', 'Description': 'Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities'}, {'ID': 'ClippingRankSum', 'Number': '1', 'Type': 'Float', 'Description': 'Z-score From Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases'}, {'ID': 'DP', 'Number': '1', 'Type': 'Integer', 'Description': 'Approximate read depth; some reads may have been filtered'},
{'ID': 'DS', 'Number': '0', 'Type': 'Flag', 'Description': 'Were any of the samples downsampled?'}, {'ID': 'END', 'Number': '1', 'Type': 'Integer', 'Description': 'Stop position of the interval'}, {'ID': 'ExcessHet', 'Number': '1', 'Type': 'Float', 'Description': 'Phred-scaled p-value for exact test of excess heterozygosity'}, {'ID': 'FS', 'Number': '1', 'Type': 'Float', 'Description': "Phred-scaled p-value using Fisher's exact test to detect strand bias"}, {'ID': 'HaplotypeScore', 'Number': '1', 'Type': 'Float', 'Description': 'Consistency of the site with at most two segregating haplotypes'}, {'ID': 'InbreedingCoeff', 'Number': '1', 'Type': 'Float', 'Description': 'Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation'}, {'ID': 'MLEAC', 'Number': 'A', 'Type': 'Integer', 'Description': 'Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed'}, {'ID': 'MLEAF', 'Number': 'A', 'Type': 'Float', 'Description': 'Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed'}, {'ID': 'MQ', 'Number': '1', 'Type': 'Float', 'Description': 'RMS Mapping Quality'}, {'ID': 'MQRankSum', 'Number': '1', 'Type': 'Float', 'Description': 'Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities'}, {'ID': 'QD', 'Number': '1', 'Type': 'Float', 'Description': 'Variant Confidence/Quality by Depth'}, {'ID': 'RAW_MQ', 'Number': '1', 'Type': 'Float', 'Description': 'Raw data for RMS Mapping
Quality'}, {'ID': 'ReadPosRankSum', 'Number': '1', 'Type': 'Float', 'Description': 'Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias'}, {'ID': 'SOR', 'Number': '1', 'Type': 'Float', 'Description': 'Symmetric Odds Ratio of 2x2 contingency table to detect strand bias'}, {'ID': 'set', 'Number': '1', 'Type': 'String', 'Description': 'Source VCF for the merged record in CombineVariants'}, {'ID': 'SF', 'Number': '.', 'Type': 'String', 'Description': 'Source File (index to sourceFiles, f when filtered)'}, {'ID': 'AC', 'Number': '.',
'Type': 'Integer', 'Description': 'Allele count in genotypes'}, {'ID': 'AN', 'Number': '1', 'Type': 'Integer', 'Description': 'Total number of alleles in called genotypes'}, {'ID': 'TS', 'Type': 'Test', 'Description': 'Allele count in genotypes'}]
>>> metainfo.record_keys
['CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO', 'FORMAT', 'ms01e', 'ms02g', 'ms03g', 'ms04h', 'MA611', 'MA605', 'MA622']
Tutorial on record parser¶
Advanced tutorial on vcf parser module showing available methods for parsing records.
First import VcfParser
module and instantiate an vcf object by
passing vcf file as an argument.
Initial setup:¶
>>> from vcfparser import VcfParser
>>> vcf_obj = VcfParser('input_test.vcf')
>>> vcf_obj = VcfParser('input_test.vcf.gz')
VcfParser
module has two main methods:- parse_metadata: to extract the metadata information from VCF metadata header.
- parse_records: to retrieve the record values from the VCF record lines.
Accessing VCF records:¶
Step 01:
>>> # pass the VCF object to the 'parse_records()' function
>>> records = vcf_obj.parse_records()
Step 02:
Yield record values - Method A: using next()
- records is an generator object. Therefore, applying
next(records)
yields the very first record as Record object.- Subsequent
next(records)
will yield subsequent records after that first record from the VCF.parse_records()
uses theRecord
class which can be used directly ifrecord_keys
andrecord_vals
are handy.
For more info about Record visit Record
.
>>> first_record = next(records)
>>> print(first_record)
2 15881224 . T G 143.24 PASS AC=0;AF=0.036;AN=12;BaseQRankSum=1.75;ClippingRankSum=0.00;DP=591;ExcessHet=3.0103;FS=3.522;InbreedingCoeff=-0.1072;MLEAC=1;MLEAF=0.036;MQ=41.48;MQRankSum=0.366;QD=15.92;ReadPosRankSum=0.345;SF=0,1,2,3,4,5,6;SOR=2.712;set=HignConfSNPs GT:PM:PG:GQ:AD:PW:PI:PL:PC:PB:DP ./.:.:./.:.:0:./.:.:.,.,.:.:.:0 0/0:.:0/0:3:1:0/0:.:.,.,.:.:.:1 0/0:.:0/0:12:4:0/0:.:.,.,.:.:.:4 0/0:.:0/0:3:4:0/0:.:.,.,.:.:.:4 0/0:.:0/0:30:17,0:0/0:.:0,30,450:.:.:17 0/0:.:0/0:15:7,0:0/0:.:0,15,225:.:.:7 0/0:.:0/0:39:25,0:0/0:.:0,39,585:.:.:25
Yield record values - Method B: using for-loop
Each record in the VCF can also be accessed on a for-loop
>>> for record in records:
... print(record)
... record.POS
... break
...
2 15881018 . G A,C 5082.45 PASS AC=2,0;AF=1.00;AN=8;BaseQRankSum=-7.710e-01;ClippingRankSum=0.00;DP=902;ExcessHet=0.0050;FS=0.000;InbreedingCoeff=0.8004;MLEAC=12,1;MLEAF=0.462,0.038;MQ=60.29;MQRankSum=0.00;QD=33.99;ReadPosRankSum=0.260;SF=0,1,2,3,4,5,6;SOR=0.657;set=HignConfSNPs
GT:PI:GQ:PG:PM:PW:AD:PL:DP:PB:PC 0/1:5:.:0|1:.:./.:0,0:0,0,0,.,.,.:0:.:. ./.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:. ./.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:. 1/1:.:6:1/1:.:1/1:0,2:49,6,0,.,.,.:2:.:. 0/0:.:78:0/0:.:0/0:29,0,0:0,78,1170,78,1170,1170:29:.:. 0/0:.:9:0/0:.:0/0:3,0,0:0,9,112,9,112,112:3:.:. 0/0:.:99:0/0:.:0/0:40,0,0:0,105,1575,105,1575,1575:40:.:.
'15881018'
Step 03: Extract data using Record object attribute and methods
Record object also has several attributes and methods which allows us to extract the record values as list or dictionary.
>>> " list of available attribute and methods "
>>> dir(first_record) # or print(dir(record)) on a for loop
['ALT', 'CHROM', 'FILTER', 'ID', 'POS', 'QUAL', 'REF', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_map_fmt_to_samples', '_to_iupac', 'deletion_overlapping_variant', 'format_', 'get_info_as_dict', 'get_mapped_samples', 'get_mapped_tag_list', 'hasAllele',
'hasINDEL', 'hasSNP', 'hasVAR', 'has_phased', 'has_unphased', 'hasnoVAR', 'info_str', 'isHETVAR', 'isHOMREF', 'isHOMVAR', 'isMissing', 'iupac_to_numeric', 'map_records_long', 'mapped_format_to_sample', 'rec_line', 'record_keys', 'record_vals', 'ref_alt', 'sample_names', 'sample_vals', 'get_tag_values_from_samples', 'unmap_fmt_samples_dict', 'vTest']
Attributes
>>> # available attributes in the "record" object are:
CHROM, POS, REF, ALT, ref_alt, QUAL, FILTER, info_str, format_, sample_names, sample_vals, mapped_format_to_sample
>>> "Access simple position level attribute values as"
>>> first_record.CHROM
'2'
>>> first_record.POS
'15881018'
>>> first_record.REF, first_record.ALT, first_record.QUAL, first_record.FILTER
('G', ['A', 'C'], '5082.45', ['PASS'])
>>> first_record.ref_alt # call REF and ALT allele together
['C', 'CA']
>>> # keys represented in the "CHROM" line of the VCF
>>> first_record.record_keys
['CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO', 'FORMAT', 'ms01e', 'ms02g', 'ms03g', 'ms04h', 'MA611', 'MA605', 'MA622']
>>> # Note: "record_keys" available within record object are same as the one from metainfo object.
>>> metainfo.record_keys # from "parse_metadata()"
['CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO', 'FORMAT', 'ms01e', 'ms02g', 'ms03g', 'ms04h', 'MA611', 'MA605', 'MA622']
>>>
>>> first_record.record_values # record values as list
['2', '15881018', '.', 'G', 'A,C', '5082.45', 'PASS', 'AC=2,0;AF=1.00;AN=8;BaseQRankSum=-7.710e-01;ClippingRankSum=0.00;DP=902;ExcessHet=0.0050;FS=0.000;InbreedingCoeff=0.8004;MLEAC=12,1;MLEAF=0.462,0.038;MQ=60.29;MQRankSum=0.00;QD=33.99;ReadPosRankSum=0.260;SF=0,1,2,3,4,5,6;SOR=0.657;set=HignConfSNPs', 'GT:PI:GQ:PG:PM:PW:AD:PL:DP:PB:PC', './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', '1/1:.:6:1/1:.:1/1:0,2:49,6,0,.,.,.:2:.:.', '0/0:.:78:0/0:.:0/0:29,0,0:0,78,1170,78,1170,1170:29:.:.', '0/0:.:9:0/0:.:0/0:3,0,0:0,9,112,9,112,112:3:.:.', '0/0:.:99:0/0:.:0/0:40,0,0:0,105,1575,105,1575,1575:40:.:.']
>>> "Population level information is provided by the INFO key"
>>> # accessed using 'info_str'
>>> first_record.info_str # info values as string
'AC=2,0;AF=1.00;AN=8;BaseQRankSum=-7.710e-01;ClippingRankSum=0.00;DP=902;ExcessHet=0.0050;FS=0.000;InbreedingCoeff=0.8004;MLEAC=12,1;MLEAF=0.462,0.038;MQ=60.29;MQRankSum=0.00;QD=33.99;ReadPosRankSum=0.260;SF=0,1,2,3,4,5,6;SOR=0.657;set=HignConfSNPs'
>>> "Sample level infomation are extracted by matching the FORMAT tags with their corresponding values in the SAMPLE"
>>> first_record.format_ # available tags in FORMAT
['GT', 'PI', 'GQ', 'PG', 'PM', 'PW', 'AD', 'PL', 'DP', 'PB', 'PC']
>>> first_record.sample_names # sample names
['ms01e', 'ms02g', 'ms03g', 'ms04h', 'MA611', 'MA605', 'MA622']
>>> first_record.sample_vals # sample values as list
['./.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', '1/1:.:6:1/1:.:1/1:0,2:49,6,0,.,.,.:2:.:.', '0/0:.:78:0/0:.:0/0:29,0,0:0,78,1170,78,1170,1170:29:.:.', '0/0:.:9:0/0:.:0/0:3,0,0:0,9,112,9,112,112:3:.:.', '0/0:.:99:0/0:.:0/0:40,0,0:0,105,1575,105,1575,1575:40:.:.']
>>> # a default full map of the FORMAT tags to SAMPLE values
>>> first_record.mapped_format_to_sample
OrderedDict([('ms01e', {'GT': '.', 'AD': '.', 'PI': '.', 'PW': '.', 'PG': '.',
'PM': '.', 'GQ': '.', 'DP': '.', 'PB': '.', 'PC': '.', 'PL': '.'}), ('ms02g', {'GT': '.', 'AD': '.', 'PI': '.', 'PW': '.', 'PG': '.', 'PM': '.', 'GQ': '.', 'DP': '.', 'PB': '.', 'PC': '.', 'PL': '.'}), ('ms03g', {'GT': '.', 'AD': '.', 'PI': '.', 'PW': '.', 'PG': '.', 'PM': '.', 'GQ': '.', 'DP': '.', 'PB': '.', 'PC': '.', 'PL': '.'}), ('ms04h', {'GT': '.', 'AD': '.', 'PI': '.', 'PW': '.', 'PG': '.', 'PM': '.', 'GQ': '.', 'DP': '.', 'PB': '.', 'PC': '.', 'PL': '.'}), ('MA611', {'GT': '0/0', 'AD': '20,0', 'PI': '.', 'PW': '0/0', 'PG': '0/0', 'PM': '.', 'GQ': '54', 'DP': '20', 'PB': '.', 'PC': '.', 'PL': '0,54,810'}), ('MA605',
{'GT': '0/0', 'AD': '6,0', 'PI': '.', 'PW': '0/0', 'PG': '0/0', 'PM': '.', 'GQ': '18', 'DP': '6', 'PB': '.', 'PC': '.', 'PL': '0,18,206'}), ('MA622', {'GT': '0/0', 'AD': '27,0', 'PI': '.', 'PW': '0/0', 'PG': '0/0', 'PM': '.', 'GQ': '72', 'DP': '27', 'PB': '.', 'PC': '.', 'PL': '0,72,1080'})])
Methods on record object
Very specific parsing of the record object can be done using the provided methods.
These methods take several args and kwargs to narrow down the information available in the Record
object.
>>> "Parse the INFO string data using get_info_as_dict()"
>>> first_record.info_str # the original info values as string
'AC=2,0;AF=1.00;AN=8;BaseQRankSum=-7.710e-01;ClippingRankSum=0.00;DP=902;ExcessHet=0.0050;FS=0.000;InbreedingCoeff=0.8004;MLEAC=12,1;MLEAF=0.462,0.038;MQ=60.29;MQRankSum=0.00;QD=33.99;ReadPosRankSum=0.260;SF=0,1,2,3,4,5,6;SOR=0.657;set=HignConfSNPs'
>>> first_record.get_info_as_dict() # info values as dictionary
{'AC': '2,0', 'AF': '1.00', 'AN': '8', 'BaseQRankSum': '-7.710e-01', 'ClippingRankSum': '0.00', 'DP': '902', 'ExcessHet': '0.0050', 'FS': '0.000', 'InbreedingCoeff': '0.8004', 'MLEAC': '12,1', 'MLEAF': '0.462,0.038', 'MQ': '60.29', 'MQRankSum': '0.00', 'QD': '33.99', 'ReadPosRankSum': '0.260', 'SF': '0,1,2,3,4,5,6', 'SOR': '0.657', 'set': 'HignConfSNPs'}
>>> # info_keys can be provided extract specific keys:value
>>> first_record.get_info_as_dict(info_keys= ['AC', 'AF'])
{'AC': '2,0', 'AF': '1.00'}
>>> "More controlled FORMAT tag to SAMPLE value mapping can be done using get_format_to_sample_map()"
>>> # it helps to extract specific FORMAT tag values from specific SAMPLE
>>> first_record.get_format_to_sample_map(sample_names= ['ms01e', 'MA611'], formats= ['GT', 'PC'])
{'ms01e': {'GT': './.', 'PC': '.'}, 'MA611': {'GT': '0/0', 'PC': '.'}}
>>> "the mapped genotype values can be converted to IUPAC bases using the convert_to_iupac flag"
>>> first_record.get_format_to_sample_map(sample_names= ['ms01e', 'MA611'], formats= ['GT', 'PC'], convert_to_iupac=['GT'])
{'ms01e': {'GT': './.', 'PC': '.', 'GT_iupac': './.'}, 'MA611': {'GT': '0/0', 'PC': '.', 'GT_iupac': 'G/G'}}
>>> first_record.get_format_to_sample_map(sample_names= ['ms01e', 'MA611'], formats= ['GT', 'PC'], convert_to_iupac=['GT', 'PG'])
{'ms01e': {'GT': './.', 'PC': '.', 'GT_iupac': './.', 'PG_iupac': './.'}, 'MA611': {'GT': '0/0', 'PC': '.', 'GT_iupac': 'G/G', 'PG_iupac': 'G/G'}}
>>> # get a full mapping for all the record_keys and FORMAT within SAMPLE
>>> # Note: This mapping is only activated when called with lazy instantiation
>>> first_record.get_full_record_map()
{'CHROM': '2', 'POS': '15881018', 'ID': '.', 'REF': 'G', 'ALT': 'A,C', 'QUAL': '5082.45', 'FILTER': 'PASS', 'INFO': {'AC': '2,0', 'AF': '1.00', 'AN': '8', 'BaseQRankSum': '-7.710e-01', 'ClippingRankSum': '0.00', 'DP': '902', 'ExcessHet': '0.0050', 'FS': '0.000', 'InbreedingCoeff': '0.8004', 'MLEAC': '12,1', 'MLEAF': '0.462,0.038', 'MQ': '60.29', 'MQRankSum': '0.00', 'QD': '33.99', 'ReadPosRankSum': '0.260', 'SF': '0,1,2,3,4,5,6', 'SOR': '0.657', 'set': 'HignConfSNPs'}, 'FORMAT': 'GT:PI:GQ:PG:PM:PW:AD:PL:DP:PB:PC', 'ms01e': './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', 'ms02g': './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', 'ms03g': './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', 'ms04h': '1/1:.:6:1/1:.:1/1:0,2:49,6,0,.,.,.:2:.:.', 'MA611': '0/0:.:78:0/0:.:0/0:29,0,0:0,78,1170,78,1170,1170:29:.:.', 'MA605': '0/0:.:9:0/0:.:0/0:3,0,0:0,9,112,9,112,112:3:.:.', 'MA622': '0/0:.:99:0/0:.:0/0:40,0,0:0,105,1575,105,1575,1575:40:.:.', 'samples': {'ms01e': {'GT': './.', 'PI': '.', 'GQ': '.', 'PG': './.', 'PM': '.', 'PW': './.', 'AD': '0,0', 'PL': '0,0,0,.,.,.', 'DP': '0', 'PB': '.', 'PC': '.'}, 'ms02g': {'GT': './.', 'PI': '.', 'GQ': '.', 'PG': './.', 'PM': '.', 'PW': './.', 'AD': '0,0', 'PL': '0,0,0,.,.,.', 'DP': '0', 'PB': '.', 'PC': '.'}, 'ms03g': {'GT': './.', 'PI': '.', 'GQ': '.', 'PG': './.', 'PM': '.', 'PW': './.', 'AD': '0,0', 'PL': '0,0,0,.,.,.', 'DP': '0', 'PB': '.', 'PC': '.'}, 'ms04h': {'GT': '1/1', 'PI': '.', 'GQ': '6', 'PG': '1/1', 'PM': '.', 'PW': '1/1', 'AD': '0,2', 'PL': '49,6,0,.,.,.', 'DP': '2', 'PB': '.', 'PC': '.'}, 'MA611': {'GT': '0/0', 'PI': '.', 'GQ': '78', 'PG': '0/0', 'PM': '.', 'PW': '0/0', 'AD': '29,0,0', 'PL': '0,78,1170,78,1170,1170', 'DP': '29', 'PB': '.', 'PC': '.'}, 'MA605': {'GT': '0/0', 'PI': '.', 'GQ': '9', 'PG': '0/0', 'PM': '.', 'PW': '0/0', 'AD': '3,0,0', 'PL': '0,9,112,9,112,112', 'DP': '3', 'PB': '.', 'PC': '.'}, 'MA622': {'GT': '0/0', 'PI': '.', 'GQ': '99', 'PG': '0/0', 'PM': '.', 'PW': '0/0', 'AD': '40,0,0', 'PL': '0,105,1575,105,1575,1575', 'DP': '40', 'PB': '.', 'PC': '.'}}}
>>> # full mapping has the option to convert genotype bases to IUPAC
>>> first_record.get_full_record_map(convert_to_iupac= ['GT'])
{'CHROM': '2', 'POS': '15881018', 'ID': '.', 'REF': 'G', 'ALT': 'A,C', 'QUAL': '5082.45', 'FILTER': 'PASS', 'INFO': {'AC': '2,0', 'AF': '1.00', 'AN': '8', 'BaseQRankSum': '-7.710e-01', 'ClippingRankSum': '0.00', 'DP': '902', 'ExcessHet': '0.0050', 'FS': '0.000', 'InbreedingCoeff': '0.8004', 'MLEAC': '12,1', 'MLEAF': '0.462,0.038', 'MQ': '60.29', 'MQRankSum': '0.00', 'QD': '33.99', 'ReadPosRankSum': '0.260', 'SF': '0,1,2,3,4,5,6', 'SOR': '0.657', 'set': 'HignConfSNPs'}, 'FORMAT': 'GT:PI:GQ:PG:PM:PW:AD:PL:DP:PB:PC', 'ms01e': './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', 'ms02g': './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', 'ms03g': './.:.:.:./.:.:./.:0,0:0,0,0,.,.,.:0:.:.', 'ms04h': '1/1:.:6:1/1:.:1/1:0,2:49,6,0,.,.,.:2:.:.', 'MA611': '0/0:.:78:0/0:.:0/0:29,0,0:0,78,1170,78,1170,1170:29:.:.', 'MA605': '0/0:.:9:0/0:.:0/0:3,0,0:0,9,112,9,112,112:3:.:.', 'MA622': '0/0:.:99:0/0:.:0/0:40,0,0:0,105,1575,105,1575,1575:40:.:.', 'samples': {'ms01e': {'GT': './.', 'PI': '.', 'GQ': '.', 'PG': './.', 'PM': '.', 'PW': './.', 'AD': '0,0', 'PL': '0,0,0,.,.,.', 'DP': '0', 'PB': '.', 'PC': '.', 'GT_iupac': './.'}, 'ms02g': {'GT': './.', 'PI': '.', 'GQ': '.', 'PG': './.', 'PM': '.', 'PW': './.', 'AD': '0,0', 'PL': '0,0,0,.,.,.', 'DP': '0', 'PB': '.', 'PC': '.', 'GT_iupac': './.'}, 'ms03g': {'GT': './.', 'PI': '.', 'GQ': '.', 'PG': './.', 'PM': '.', 'PW': './.', 'AD': '0,0', 'PL': '0,0,0,.,.,.', 'DP': '0', 'PB': '.', 'PC': '.', 'GT_iupac': './.'}, 'ms04h': {'GT': '1/1', 'PI': '.', 'GQ': '6', 'PG': '1/1', 'PM': '.', 'PW': '1/1', 'AD': '0,2', 'PL': '49,6,0,.,.,.', 'DP': '2', 'PB': '.', 'PC': '.', 'GT_iupac': 'A/A'}, 'MA611': {'GT': '0/0', 'PI': '.', 'GQ': '78', 'PG': '0/0', 'PM': '.', 'PW': '0/0', 'AD': '29,0,0', 'PL': '0,78,1170,78,1170,1170', 'DP': '29', 'PB': '.', 'PC': '.', 'GT_iupac': 'G/G'}, 'MA605': {'GT': '0/0', 'PI': '.', 'GQ': '9', 'PG': '0/0', 'PM': '.', 'PW': '0/0', 'AD': '3,0,0', 'PL': '0,9,112,9,112,112', 'DP': '3', 'PB': '.', 'PC': '.', 'GT_iupac': 'G/G'}, 'MA622': {'GT': '0/0', 'PI': '.', 'GQ': '99', 'PG': '0/0', 'PM': '.', 'PW': '0/0', 'AD': '40,0,0', 'PL': '0,105,1575,105,1575,1575', 'DP': '40', 'PB': '.', 'PC': '.', 'GT_iupac': 'G/G'}}}
>>> # Note: "convert_to_iupac" will add the genotype tag with suffix "_iupac" to show the genotype in IUPAC bases.
Genotype parsing
Genotype checks and parsing are one of most important use case of VCF data.
VcfParser
provides several methods to do those checks and extract data.
- Check samples that have alleles of your interest.
>>> first_record.hasAllele(allele='1', tag= 'GT', bases = 'iupac')
{'ms04h': 'A/A'}
>>> first_record.hasAllele(allele='1', tag= 'GT', bases = 'numeric')
{'ms04h': '1/1'}
>>> first_record.hasAllele(allele='1', tag= 'PG', bases = 'numeric')
{'ms04h': '1/1'}
>>> first_record.hasAllele(allele='0', tag= 'PG', bases = 'numeric')
{'MA611': '0/0', 'MA605': '0/0', 'MA622': '0/0'}
>>> first_record.hasAllele(allele='0', tag= 'PG', bases = 'iupac')
{'MA611': 'G/G', 'MA605': 'G/G', 'MA622': 'G/G'}
- Check samples with specific genotype. Both numeric and iupac checks are available.
>>> first_record.hasVAR(genotype='0/0', tag= 'PG', bases = 'numeric')
{'MA611': '0/0', 'MA605': '0/0', 'MA622': '0/0'}
>>> first_record.hasVAR(genotype='G/G', tag= 'PG', bases = 'iupac')
{'MA611': 'G/G', 'MA605': 'G/G', 'MA622': 'G/G'}
>>> first_record.hasVAR(genotype='1/1', tag= 'PG', bases = 'numeric')
{'ms04h': '1/1'}
>>> first_record.hasVAR(genotype='A/A', tag= 'PG', bases = 'iupac')
{'ms04h': 'A/A'}
>>> # genotypes can be checked in phased state
>>> first_record.hasVAR(genotype='0|0', tag='GT', bases='numeric')
{}
- Check phased vs unphased genotype. Specific genotype tag can be checked; default is ‘GT’.
>>> first_record.has_phased()
{}
>>> first_record.has_unphased()
{'ms01e': './.', 'ms02g': './.', 'ms03g': './.', 'ms04h': '1/1', 'MA611': '0/0', 'MA605': '0/0', 'MA622': '0/0'}
>>> first_record.has_unphased(tag= 'PG')
{'ms01e': './.', 'ms02g': './.', 'ms03g': './.', 'ms04h': '1/1', 'MA611': '0/0', 'MA605': '0/0', 'MA622': '0/0'}
>>> first_record.has_unphased(tag='PG', bases='numeric')
{'MA611': '0/0', 'MA605': '0/0', 'MA622': '0/0'}
>>> first_record.has_unphased(tag= 'PG', bases = 'iupac')
{'ms01e': './.', 'ms02g': './.', 'ms03g': './.', 'ms04h': 'A/A', 'MA611': 'G/G', 'MA605': 'G/G', 'MA622': 'G/G'}
- Return samples with no variants (i.e. contains ‘./.’, ‘.|.’, ‘.’)
>>> first_record.hasnoVAR()
{'ms01e': './.', 'ms02g': './.', 'ms03g': './.'}
>>> first_record.hasnoVAR(tag='GT')
{'ms01e': '.', 'ms02g': '.', 'ms03g': '.', 'ms04h': '.'}
>>> first_record.hasnoVAR(tag= 'PG')
{'ms01e': './.', 'ms02g': './.', 'ms03g': './.'}
- Samples with homozygous reference genotypes can be retrieved as.
>>> first_record.isHOMREF(tag='GT', bases='numeric')
{'MA611': '0/0', 'MA605': '0/0', 'MA622': '0/0'}
>>> first_record.isHOMREF(tag='GT', bases='iupac')
{'MA611': 'C/C', 'MA605': 'C/C', 'MA622': 'C/C'}
>>> #if another FORMAT tag also represents a genotype, specific the FORMAT tag
>>> first_record.isHOMREF(tag='PG', bases='numeric')
{'MA611': '0/0', 'MA605': '0/0', 'MA622': '0/0'}
>>> first_record.isHOMREF(tag='PG', bases='iupac')
{'MA611': 'C/C', 'MA605': 'C/C', 'MA622': 'C/C'}
- Similarly, samples with homozygous variant genotypes can also be retrieved.
>>> first_record.isHOMVAR()
{'ms04h': '1/1'}
>>> first_record.isHOMVAR(tag= 'PG', bases= 'iupac')
{'ms04h': 'A/A'}
- Samples with heterozygous variant genotypes in given record”
>>> first_record.isHETVAR()
{}
- This returns samples with missing variants for certain FORMAT tags(i.e. contains ‘./.’, ‘.|.’, ‘.’). Currently we used ‘GT’ tag as default.
>>> first_record.isMissing()
{'ms01e': './.', 'ms02g': './.', 'ms03g': './.'}
>>> # missing checks can be applied to other FORMAT tags too.
>>> first_record.isMissing(tag = 'PI')
{'ms01e': '.', 'ms02g': '.', 'ms03g': '.', 'ms04h': '.', 'MA611': '.', 'MA605': '.', 'MA622': '.'}
>>> first_record.isMissing(tag='GQ')
{'ms01e': '.', 'ms02g': '.', 'ms03g': '.', 'ms04h': '.'}
Credits¶
Development Lead¶
- Bishwa K. Giri <kirannbishwa01@gmail.com, bkgiri@uncg.edu>
Contributors¶
TODO: Done Add email address?
- Bhuwan Aryal <Bhuwanaryal19@gmail.com>
- Gopal Kisi <Gkisi2772@gmail.com>