API¶
Utility Functions for Manipulating DNA sequences.¶
This module contains utility functions for manipulating DNA sequences.
- samwell.dnautils.has_long_homopolymer(bases: str, max_hp_length: int) bool ¶
Returns true if the given bases has a homopolymer length longer than the given length.
- Parameters
bases – the bases to examine.
max_hp_length – the maximum homopolymer length to allow.
- samwell.dnautils.mask_long_homopolymers(bases: str, min_long_hp_length: int, mask_base: str = 'N') str ¶
Returns the bases masked for regions with long homopolymers
- Parameters
bases – the bases to mask.
min_long_hp_length – the minimum homopolymer length (inclusive) to mask.
mask_base – the base to use when masking
Functions for Creating Useful Iterators¶
This module contains classes and functions for creating useful iterators.
Examples of a “Peekable” Iterator¶
“Peekable” iterators are useful to “peek” at the next item in an iterator without consuming it.
For example, this is useful when consuming items in iterator while a predicate is true, and not
consuming the first element where the element is not true. See the
takewhile()
and
dropwhile()
methods.
An empty peekable iterator throws StopIteration:
>>> from samwell.itertools import peekable
>>> piter = peekable(iter([]))
>>> piter.peek()
StopIteration
A peekable iterator will return the next item before consuming it.
>>> piter = peekable(iter([1, 2, 3]))
>>> piter.peek()
1
>>> next(piter)
1
>>> [j for j in piter]
[2, 3]
The can_peek() function can be used to determine if the iterator can be peeked without StopIteration being thrown:
>>> piter = peekable([1])
>>> piter.peek() if piter.can_peek() else -1
1
>>> next(piter)
1
>>> piter.peek() if piter.can_peek() else -1
-1
>>> next(piter)
StopIteration
The peekable() function should be preferred to calling PeekableIterator’s constructor directly as it supports creation from iterable objects as well as iterators, while the constructor requires an iterator.
Examples of a “Merging” Iterator¶
A “merging” iterator can merge two iterators in order based on a given ordering function. This is useful for merging two iterators that are already in order.
>>> from samwell.itertools import MergingIterator
>>> even = iter([2, 4, 6, 8])
>>> odd = iter([1, 3, 5, 9])
>>> merging = MergingIterator(even, odd, lambda x: x)
>>> list(merging)
[1, 2, 3, 4, 5, 6, 7, 8, 9]
Module Contents¶
The module contains the following public classes:
PeekableIterator
– Iterator that allows you to peek at thenext value before calling next
samwell.itertools.MergingIterator
– Iterator that allows merging of twoiterator using a keyfunc to decide from which iterator to draw the next item
The module contains the following methods:
peekable()
– Creates an iterator that allows you to peek atthe next value before calling next
- class samwell.itertools.MergingIterator(iter1: Iterator[IterType], iter2: Iterator[IterType], keyfunc: Callable[IterType, Any])¶
An iterator that merges two iterators; if they are sorted and keyfunc is passed, yields results in order.
- Parameters
iter1 – an iterator
iter2 – an iterator
keyfunc – a function that extracts a key from an item that is used to order items
- class samwell.itertools.PeekableIterator(source: Iterator[IterType])¶
A peekable iterator wrapping an iterable.
This allows returning the next item without consuming it.
- Parameters
source – an iterator over the objects
- dropwhile(pred: Callable[IterType, bool]) samwell.itertools.PeekableIterator[IterType] ¶
Drops elements from the iterator while the predicate is true.
Updates the iterator to point at the first non-matching element, or exhausts the iterator if all elements match the predicate.
- Parameters
pred (Callable[[V], bool]) – a function that takes a value from the iterator
false. (and returns true or) –
- Returns
a reference to this iterator, so calls can be chained
- Return type
- maybe_peek() Optional[IterType] ¶
Returns the next element without consuming it, or None otherwise.
- peek() IterType ¶
Returns the next element without consuming it, or StopIteration otherwise.
- takewhile(pred: Callable[IterType, bool]) List[IterType] ¶
Consumes from the iterator while pred is true, and returns the result as a List.
The iterator is left pointing at the first non-matching item, or if all items match then the iterator will be exhausted.
- Parameters
pred – a function that takes the next value from the iterator and returns true or false.
- Returns
A list of the values from the iterator, in order, up until and excluding the first value that does not match the predicate.
- Return type
List[V]
- samwell.itertools.peekable(source: Union[Iterator[IterType], Iterable[IterType]]) samwell.itertools.PeekableIterator[IterType] ¶
Creates a peekable iterator that allows you to peek at the next value before calling next
The peek method will return the next element without consuming it, otherwise StopIteration.
- Parameters
source – either an iterator over the objects, or a callable that is called until it returns the sentinel.
- Returns
Utility Classes for Querying Overlaps with Genomic Regions¶
DEPRECATED - if you have the option use ~pybedlite.overlap_detector in favor of this.
Examples of Detecting Overlaps¶
>>> from samwell.overlap_detector import Interval, OverlapDetector
>>> detector = OverlapDetector()
>>> query = Interval("chr1", 2, 20)
>>> detector.overlaps_any(query)
False
>>> detector.add(Interval("chr2", 1, 100))
>>> detector.add(Interval("chr1", 21, 100))
>>> detector.overlaps_any(query)
False
>>> detector.add(Interval("chr1", 1, 1))
>>> detector.overlaps_any(query)
True
>>> detector.get_overlaps(query)
[Interval("chr1", 1, 1)]
>>> detector.add(Interval("chr1", 3, 10))
>>> detector.overlaps_any(query)
True
>>> detector.get_overlaps(query)
[Interval("chr1", 1, 1), interval("chr1", 3, 10)]
Module Contents¶
The module contains the following public classes:
Interval
– Represents a region mapping to the genomethat is 0-based and open-ended
OverlapDetector
– Detects and returns overlaps betweena set of genomic regions and another genomic region
- class samwell.overlap_detector.Interval(refname: str, start: int, end: int, negative: bool = False, name: Optional[str] = None)¶
A region mapping to the genome that is 0-based and open-ended
- overlap(other: pybedlite.overlap_detector.Interval) int ¶
Returns the overlap between this interval and the other, or zero if there is none.
- Parameters
other (Interval) – the other interval to find the overlap with
- class samwell.overlap_detector.OverlapDetector¶
Detects and returns overlaps between a set of genomic regions and another genomic region.
If two intervals have the same coordinates, only the first that was added will be kept.
Since
Interval
objects are used both to populate the overlap detector and to query it, the coordinate system in use is also 0-based open-ended.- add(interval: pybedlite.overlap_detector.Interval) None ¶
Adds a interval to this detector.
- Parameters
interval – the interval to add to this detector
- add_all(intervals: Iterable[pybedlite.overlap_detector.Interval]) None ¶
Adds one or more intervals to this detector.
- Parameters
intervals – the intervals to add to this detector
- classmethod from_bed(path: pathlib.Path) pybedlite.overlap_detector.OverlapDetector ¶
Builds an
OverlapDetector
from a BED file.- Parameters
path – the path to the BED file
- Returns
An overlap detector for the regions in the BED file.
- get_enclosed(interval: pybedlite.overlap_detector.Interval) List[pybedlite.overlap_detector.Interval] ¶
Returns the set of intervals in this detector that are enclosed by the query interval. I.e. target.start >= query.start and target.end <= query.end.
- Args:
interval: the query interval
- Returns:
The list of intervals in this detector that are enclosed within the query interval. The intervals will be return in ascending genomic order.
- get_enclosing_intervals(interval: pybedlite.overlap_detector.Interval) List[pybedlite.overlap_detector.Interval] ¶
Returns the set of intervals in this detector that wholly enclose the query interval. i.e. query.start >= target.start and query.end <= target.end.
- Args:
interval: the query interval
- Returns:
The list of intervals in this detector that enclose the query interval. The intervals will be returned in ascending genomic order.
- get_overlaps(interval: pybedlite.overlap_detector.Interval) List[pybedlite.overlap_detector.Interval] ¶
Returns any intervals in this detector that overlap the given interval.
- Parameters
interval – the interval to check
- Returns
The list of intervals in this detector that overlap the given interval, or the empty list if no overlaps exist. The intervals will be return in ascending genomic order.
- overlaps_any(interval: pybedlite.overlap_detector.Interval) bool ¶
Determines whether the given interval overlaps any interval in this detector.
- Parameters
interval – the interval to check
- Returns
True if and only if the given interval overlaps with any interval in this detector.
Utility Classes and Methods for SAM/BAM¶
This module contains utility classes for reading and writing SAM/BAM files, as well as for
manipulating Cigars. It is recommended to use the reader()
and
writer()
methods rather than pysam.AlignmentFile
directly (see
below for motivation).
Motivation for Reader and Writer methods¶
The following are the reasons for choosing to implement methods to open a SAM/BAM file for
reading and writing, rather than relying on pysam.AlignmentFile
directly:
Provides a centralized place for the implementation of opening a SAM/BAM for reading and writing. This is useful if any additional parameters are added, or changes to standards or defaults are made.
Makes the requirement to provide a header when opening a file for writing more explicit.
Adds support for
Path
.Remove the reliance on specifying the mode correctly, including specifying the file type (i.e. SAM, BAM, or CRAM), as well as additional options (ex. compression level). This makes the code more explicit and easier to read.
An explicit check is performed to ensure the file type is specified when writing using a file-like object rather than a path to a file.
Examples of Opening a SAM/BAM for Reading or Writing¶
Opening a SAM/BAM file for reading, auto-recognizing the file-type by the file extension. See
SamFileType
for the supported file types.
>>> from samwell.sam import reader
>>> with reader("/path/to/sample.sam") as fh:
... for record in fh:
... print(record.name) # do something
>>> with reader("/path/to/sample.bam") as fh:
... for record in fh:
... print(record.name) # do something
Opening a SAM/BAM file for reading, explicitly passing the file type.
>>> from samwell.sam import SamFileType
>>> with reader(path="/path/to/sample.ext1", file_type=SamFileType.SAM) as fh:
... for record in fh:
... print(record.name) # do something
>>> with reader(path="/path/to/sample.ext2", file_type=SamFileType.BAM) as fh:
... for record in fh:
... print(record.name) # do something
Opening a SAM/BAM file for reading, using an existing file-like object
>>> with open("/path/to/sample.sam", "rb") as file_object:
... with reader(path=file_object, file_type=SamFileType.BAM) as fh:
... for record in fh:
... print(record.name) # do something
Opening a SAM/BAM file for writing follows similar to the reader()
method,
but the SAM file header object is required.
>>> from samwell.sam import writer
>>> header: Dict[str, Any] = {
... "HD": {"VN": "1.5", "SO": "coordinate"},
... "RG": [{"ID": "1", "SM": "1_AAAAAA", "LB": "lib", "PL": "ILLUMINA", "PU": "xxx.1"}],
... "SQ": [
... {"SN": "chr1", "LN": 249250621},
... {"SN": "chr2", "LN": 243199373}
... ]
... }
>>> with writer(path="/path/to/sample.bam", header=header) as fh:
... pass # do something
Examples of Manipulating Cigars¶
Creating a Cigar
from a pysam.AlignedSegment
.
>>> from samwell.sam import Cigar
>>> with reader("/path/to/sample.sam") as fh:
... record = next(fh)
... cigar = Cigar.from_cigartuples(record.cigartuples)
... print(str(cigar))
50M2D5M10S
>>> cigar = Cigar.from_cigarstring("50M2D5M10S")
>>> print(str(cigar))
50M2D5M10S
If the cigar string is invalid, the exception message will show you the problem character(s) in square brackets.
>>> cigar = Cigar.from_cigarstring("10M5U")
... CigarException("Malformed cigar: 10M5[U]")
The cigar contains a tuple of CigarOp
) and associated operator length. A number of
useful methods are part of both classes.
The number of bases aligned on the query (i.e. the number of bases consumed by the cigar from the query):
>>> cigar = Cigar.from_cigarstring("50M2D5M2I10S")
>>> [e.length_on_query for e in cigar.elements]
[50, 0, 5, 2, 10]
>>> [e.length_on_target for e in cigar.elements]
[50, 2, 5, 0, 0]
>>> [e.operator.is_indel for e in cigar.elements]
[False, True, False, True, False]
Module Contents¶
The module contains the following public classes:
SamFileType
– Enumeration of valid SAM/BAM/CRAM file types.
SamOrder
– Enumeration of possible SAM/BAM/CRAM sort orders.
CigarOp
– Enumeration of operators that can appear in a Cigar string.
CigarElement
– Class representing an element in a Cigar string.
CigarParsingException
– The exception raised specific to parsing acigar
Cigar
– Class representing a cigar string.
The module contains the following methods:
reader()
– opens a SAM/BAM/CRAM file for reading.
writer()
– opens a SAM/BAM/CRAM file for writing
set_qc_fail()
– sets the QC fail flag in a
pysam.AlignedSegment
record and sets additional SAM tags giving the tool name and reason for why the QC fail flag was set.
get_qc_fail()
– gets the tool name and reason for why the QC fail flagwas set, or None if it is not set.
- class samwell.sam.Cigar(elements: Tuple[samwell.sam.CigarElement, ...] = ())¶
Class representing a cigar string.
- - elements
zero or more cigar elements
- Type
Tuple[CigarElement, …]
- coalesce() samwell.sam.Cigar ¶
Returns a copy of the cigar adjacent operators of the same type coalesced into single operators.
- classmethod from_cigarstring(cigarstring: str) samwell.sam.Cigar ¶
Constructs a Cigar from a string returned by pysam.
If “*” is given, returns an empty Cigar.
- classmethod from_cigartuples(cigartuples: Optional[List[Tuple[int, int]]]) samwell.sam.Cigar ¶
Returns a Cigar from a list of tuples returned by pysam.
Each tuple denotes the operation and length. See
CigarOp
for more information on the various operators. If None is given, returns an empty Cigar.
- reversed() samwell.sam.Cigar ¶
Returns a copy of the Cigar with the elements in reverse order.
- class samwell.sam.CigarElement(length: int, operator: samwell.sam.CigarOp)¶
Represents an element in a Cigar
- - length
the length of the element
- Type
- - operator
the operator of the element
- Type
- class samwell.sam.CigarOp(value)¶
Enumeration of operators that can appear in a Cigar string.
- D = (2, 'D', False, True)¶
Deletion versus the reference
- EQ = (7, '=', True, True)¶
Matches the reference
- H = (5, 'H', False, False)¶
Hard clip
- I = (1, 'I', True, False)¶
E741
- Type
Insertion versus the reference # noqa
- M = (0, 'M', True, True)¶
Match or Mismatch the reference
- N = (3, 'N', False, True)¶
Skipped region from the reference
- P = (6, 'P', False, False)¶
Padding
- S = (4, 'S', True, False)¶
Soft clip
- X = (8, 'X', True, True)¶
Mismatches the reference
- static from_character(character: str) samwell.sam.CigarOp ¶
Returns the operator from the single character.
- static from_code(code: int) samwell.sam.CigarOp ¶
Returns the operator from the given operator code.
Note: this is mainly used to get the operator from
pysam
.
- exception samwell.sam.CigarParsingException¶
The exception raised specific to parsing a cigar.
- samwell.sam.NO_REF_NAME: str = '*'¶
The reference position to use to indicate no position in SAM/BAM.
- class samwell.sam.SamFileType(value)¶
Enumeration of valid SAM/BAM/CRAM file types.
- classmethod from_path(path: Union[pathlib.Path, str]) samwell.sam.SamFileType ¶
Infers the file type based on the file extension.
- Parameters
path – the path to the SAM/BAM/CRAM to read or write.
- class samwell.sam.SamOrder(value)¶
Enumerations of possible sort orders for a SAM file.
- Coordinate = 'coordinate'¶
coordinate sorted
- QueryName = 'queryname'¶
queryname sorted
- Unsorted = 'unsorted'¶
the SAM / BAM / CRAM is unsorted
- samwell.sam.SamPath = typing.Union[typing.IO[typing.Any], pathlib.Path, str]¶
The reference index to use to indicate no reference in SAM/BAM.
- samwell.sam.get_qc_fail(rec: pysam.libcalignedsegment.AlignedSegment) Optional[Tuple[str, str]] ¶
Gets the tool and reason for why the QC fail flag is set, otherwise None if not set.
If the QC fail flag is set, but the tool and filter reason SAM tags are not set, None will be returned. Use pysam.AlignedSegment.is_qcfail() to check if the record is simply QC failed.
- Parameters
rec – the record to fail
- samwell.sam.get_qc_fail_by_tool(rec: pysam.libcalignedsegment.AlignedSegment, tool: Optional[Callable[..., Any]] = None) Optional[Tuple[str, str]] ¶
Gets the tool and reason for why the QC fail flag if the flag was set by the passed tool.
- None will be returned in the following cases:
The QC fail flag is not set
The QC fail flag isset, but the tool and filter reason SAM tags are not set
The tool and filter reason SAM tags were set by a different tool
Use pysam.AlignedSegment.is_qcfail() to check if the record is simply QC failed.
- Parameters
rec – the record to fail
tool – the tool that must have set the QC fail flag
- samwell.sam.isize(r1: pysam.libcalignedsegment.AlignedSegment, r2: pysam.libcalignedsegment.AlignedSegment) int ¶
Computes the insert size for a pair of records.
- samwell.sam.reader(path: Union[IO[Any], pathlib.Path, str], file_type: Optional[samwell.sam.SamFileType] = None) pysam.libcalignmentfile.AlignmentFile ¶
Opens a SAM/BAM/CRAM for reading.
- Parameters
path – a file handle or path to the SAM/BAM/CRAM to read or write.
file_type – the file type to assume when opening the file. If None, then the file type will be auto-detected.
- samwell.sam.set_pair_info(r1: pysam.libcalignedsegment.AlignedSegment, r2: pysam.libcalignedsegment.AlignedSegment, proper_pair: bool = True) None ¶
Resets mate pair information between reads in a pair. Requires that both r1 and r2 are mapped. Can be handed reads that already have pairing flags setup or independent R1 and R2 records that are currently flagged as SE reads.
- Parameters
r1 – read 1
r2 – read 2 with the same queryname as r1
- samwell.sam.set_qc_fail(rec: pysam.libcalignedsegment.AlignedSegment, tool: Callable[..., Any], reason: str) None ¶
Sets the QC fail flag, and adds tags containing the tool name and reason for failing. :param rec: the record to fail :param tool: the tool (as a callable) that failed this record :param reason: the reason for failing
- samwell.sam.writer(path: Union[IO[Any], pathlib.Path, str], header: Union[str, Dict[str, Any], pysam.libcalignmentfile.AlignmentHeader], file_type: Optional[samwell.sam.SamFileType] = None) pysam.libcalignmentfile.AlignmentFile ¶
Opens a SAM/BAM/CRAM for writing.
- Parameters
path – a file handle or path to the SAM/BAM/CRAM to read or write.
header – Either a string to use for the header or a multi-level dictionary. The multi-level dictionary should be given as follows. The first level are the four types (‘HD’, ‘SQ’, …). The second level are a list of lines, with each line being a list of tag-value pairs. The header is constructed first from all the defined fields, followed by user tags in alphabetical order.
file_type – the file type to assume when opening the file. If None, then the filetype will be auto-detected and must be a path-like object.
Utility methods for running BWA¶
- This module contains methods for running BWA. Currently only the “mem” algorithm is supported:
align()
– Aligns the given reads with BWA mem.
- The options for running BWA can be customized via the three Options classes:
AlgorithmOptions
– Bwa mem algorithm options.ScoringOptions
– Bwa mem scoring optionsInputOutputOptions
– Bwa mem input and output options
An input read for alignment must be minimally transformed into a FASTQ-like record:
FastqRecord
– Fastq record used as input to alignment.
Implementation¶
Alignment of reads is performed asynchronously.
- This is achieved by creating three sub-processes in
align()
: 1. A process to consume the input iterable of reads and write them (a) to the stdin of BWA mem, and (b) to a queue of reads that are awaiting alignment results from BWA mem. 2. A process to run BWA mem, where FASTQ records are written to the process’ stdin, alignment results are returned to stdout, and any error/logging information from BWA mem is returned to stderr. 3. A process to route the stderr of BWA mem to the given stderr handle (stderr_out).
Then align()
method consumes the stdout of the BWA mem process and collates
that with the queue of reads that have been written/given to BWA mem (from process 1). For each
input read, one or more alignments is expected to be returned by BWA mem. The order in which
alignments of reads are returned by BWA mem is the same order as the order of reads given to BWA
mem. The align()
method then returns an iterable over the alignment
results.
Exceptions may occur in the thread to input FASTQ records to BWA mem, which are propagated to the caller. Furthermore, an exception is returned if the # of reads given to BWA mem is not the same as the # of reads returned by BWA mem.
Some specific handling occurs around reading the BWA mem output with pysam
, since the
latter blocks waiting for at least some reads from BWA mem, which may not happen if there was an
issue in the various upstream processes (input to BWA mem or BWA mem itself). This would have
caused a deadlock.
Examples
Typically, we have AlignedSegment
records obtained from reading from a SAM or BAM
file. The first must be converted into FastqRecord
objects.
>>> from samwell.sam.bwa_mem import FastqRecord
>>> from samwell.sam import reader
>>> reads = reader("/path/to/sample.sam")
>>> fastq_reads = map(lambda read: FastqRecord.build(read), reads)
Next, those align_mem
method.
>>> from samwell.sam.bwa_mem import align
>>> results = map(lambda read: align(read), fastq_reads)
This returns an iterable over the alignment results. An alignment result is a tuple
consisting of the original FastqRecord
and an iterator over the
alignments (see AlignedSegment
).
>>> result = next(result)
>>> fastq_read, alignments = result
>>> str(fastq_read)
@name
GATTACA
+
HIJKLKM
>>> len(alignments)
2
>>> alignment = str(next(alignments))
>>> alignment.query_name
name
>>> type(alignment)
<class 'pysam.libcalignedsegment.AlignedSegment'>
- class samwell.sam.bwa_mem.AlgorithmOptions(threads: Optional[int] = None, min_seed_len: Optional[int] = None, band_width: Optional[int] = None, off_diagonal_dropoff: Optional[int] = None, internal_seeds_length_factor: Optional[float] = None, max_third_seed_occurrence: Optional[int] = None, max_seed_occurrence: Optional[int] = None, drop_ratio: Optional[float] = None, min_chain_weight: Optional[int] = None, max_mate_rescue_rounds: Optional[int] = None, skip_mate_rescue: Optional[bool] = None, skip_pairing: Optional[bool] = None)¶
Bwa mem algorithm options
- internal_seeds_length_factor¶
look for internal seeds inside a seed longer than min_seed_len * internal_seeds_length_factor
- Type
Optional[float]
- class samwell.sam.bwa_mem.FastqRecord(name: str, bases: str, quals: str, source: Optional[FastqRecordSourceType] = None, needs_alignment: bool = True, read_number: Optional[int] = None)¶
Fastq record used as input to alignment.
- source¶
optionally the
AlignedSegment
from which this was built- Type
Optional[FastqRecordSourceType]
- read_number¶
optionally the read number; should be set to 1 or 2 for paired end reads.
- Type
Optional[int]
- classmethod build(read: pysam.libcalignedsegment.AlignedSegment, needs_alignment: bool = True, aligned_bases_only: bool = False, clip_three_prime: int = 0) samwell.sam.bwa_mem.FastqRecord ¶
Builds a
FastqRecord
from aAlignedSegment
- Parameters
read – the read to convert
needs_alignment – True if the read should be aligned, False otherwise
aligned_bases_only – only align the aligned bases (excludes soft-clipped bases)
clip_three_prime – the number of bases to clip on the three-prime end of the read relative to the original direction of sequencing. This will be applied after extracting the bases based on
aligned_bases_only
.
- class samwell.sam.bwa_mem.InputOutputOptions(interleaved_pairs: Optional[bool] = None, read_group: Optional[str] = None, header_insert: Optional[Union[str, pathlib.Path]] = None, alts_as_primary: Optional[bool] = None, verbosity: Optional[int] = None, min_alignment_score: Optional[int] = None, max_hits_within_max_score: Optional[Union[int, Tuple[int, int]]] = None, all_alignments: Optional[bool] = None, append_fastq_comment: Optional[bool] = None, add_fasta_header_to_xr: Optional[bool] = None, softclip_supplementary: Optional[bool] = None, split_hits_are_secondary: Optional[bool] = None, insert_size_params: Optional[Union[float, Tuple[float, float], Tuple[float, float, int], Tuple[float, float, int, int]]] = None, bases_per_batch: Optional[int] = 115000)¶
Bwa mem input and output options
- interleaved_pairs¶
read pairs are consecutive (r1 then r2), otherwise fragment reads
- Type
Optional[bool]
- header_insert¶
insert STR to header if it starts with @; or insert lines in FILE
- Type
Optional[Union[str, pathlib.Path]]
- alts_as_primary¶
treat ALT contigs as part of the primary assembly (i.e. ignore <idxbase>.alt file)
- Type
Optional[bool]
- max_hits_within_max_score¶
if there are <INT hits with score >80% of the max score, output all in XA
- insert_size_params¶
specify the mean, standard deviation (10% of the mean if absent), max (4 sigma from the mean if absent) and min of the insert size distribution. FR orientation only.
- class samwell.sam.bwa_mem.ReadType(value)¶
The read type for BWA mem.
- class samwell.sam.bwa_mem.ScoringOptions(match_score: Optional[int] = None, mismatch_score: Optional[int] = None, gap_open: Optional[Union[int, Tuple[int, int]]] = None, gap_extend: Optional[Union[int, Tuple[int, int]]] = None, clipping_penalty: Optional[Union[int, Tuple[int, int]]] = None, unpaired_penalty: Optional[int] = None, read_type: Optional[samwell.sam.bwa_mem.ReadType] = None)¶
Bwa mem scoring options
- match_score¶
the score for a sequence match, which scales options -TdBOELU unless overridden
- Type
Optional[int]
- gap_open¶
gap open penalties for deletions and insertions (single value to use the same for both)
- gap_extend¶
gap extension penalty; a gap of size k cost ‘{-O} + {-E}*k’ (single value to use the same for both)
- read_type¶
read type. Setting -x changes multiple parameters unless overriden: pacbio: -k17 -W40 -r10 -A1 -B1 -O1 -E1 -L0 (PacBio reads to ref) ont2d: -k14 -W20 -r10 -A1 -B1 -O1 -E1 -L0 (Oxford Nanopore 2D-reads to ref) intractg: -B9 -O16 -L5 (intra-species contigs to ref)
- Type
Optional[samwell.sam.bwa_mem.ReadType]
- samwell.sam.bwa_mem.align(reads: Iterable[samwell.sam.bwa_mem.FastqRecord], idxbase: pathlib.Path, executable_path: pathlib.Path = PosixPath('bwa'), algo_opts: Optional[samwell.sam.bwa_mem.AlgorithmOptions] = None, scoring_opts: Optional[samwell.sam.bwa_mem.ScoringOptions] = None, io_opts: Optional[samwell.sam.bwa_mem.InputOutputOptions] = None, suppress_secondaries: bool = False, stderr_out: Any = <_io.TextIOWrapper name='<stderr>' mode='w' encoding='UTF-8'>) Iterable[Tuple[samwell.sam.bwa_mem.FastqRecord, List[pysam.libcalignedsegment.AlignedSegment]]] ¶
Aligns the given reads with BWA mem.
See
bwa_mem
for a detailed explanation for the implementation approach.- Parameters
reads – the reads to align
idxbase – the path prefix for all the BWA-specific index files
executable_path – the path to the BWA executable
algo_opts – the algorithm options
scoring_opts – the scoring options
io_opts – the input and output options
suppress_secondaries – true to discard all secondary alignments, false otherwise
- Returns
An iterable over the alignment results. An alignment result is a tuple consisting of the original
FastqRecord
and an iterator over the alignments (seeAlignedSegment
)
Utility Functions for Soft-Clipping records in SAM/BAM Files¶
This module contains utility functions for soft-clipping reads. There are four variants that support clipping the beginnings and ends of reads, and specifying the amount to be clipped in terms of query bases or reference bases:
softclip_start_of_alignment_by_query()
clips the start of the alignment in terms of query bases
softclip_end_of_alignment_by_query()
clips the end of the alignment in terms of query bases
softclip_start_of_alignment_by_ref()
clips the start of the alignment in terms of reference bases
softclip_end_of_alignment_by_ref()
clips the end of the alignment in terms of reference bases
The difference between query and reference based versions is apparent only when there are insertions or deletions in the read as indels have lengths on either the query (insertions) or reference (deletions) but not both.
Upon clipping a set of additional SAM tags are removed from reads as they are likely invalid.
For example, to clip the last 10 query bases of all records and reduce the qualities to Q2:
>>> from samwell.sam import reader, clipping
>>> with reader("/path/to/sample.sam") as fh:
... for rec in fh:
... clipping.softclip_end_of_alignment_by_query(rec, 10, 2)
... print(rec.cigarstring)
It should be noted that any clipping potentially makes the common SAM tags NM, MD and UQ invalid, as well as potentially other alignment based SAM tags. Any clipping added to the start of an alignment changes the position (reference_start) of the record. Any reads that have no aligned bases after clipping are set to be unmapped. If writing the clipped reads back to a BAM it should be noted that:
Mate pairs may have incorrect information about their mate’s positions
Even if the input was coordinate sorted, the output may be out of order
To rectify these problems it is necessary to do the equivalent of:
cat clipped.bam | samtools sort -n | samtools fixmate | samtools sort | samtools calmd
- class samwell.sam.clipping.ClippingInfo(query_bases_clipped: int, ref_bases_clipped: int)¶
Named tuple holding the number of bases clipped on the query and reference respectively.
- property query_bases_clipped¶
Alias for field number 0
- property ref_bases_clipped¶
Alias for field number 1
- samwell.sam.clipping.softclip_end_of_alignment_by_query(rec: pysam.libcalignedsegment.AlignedSegment, bases_to_clip: int, clipped_base_quality: Optional[int] = None, tags_to_invalidate: Iterable[str] = ('MD', 'NM', 'UQ')) samwell.sam.clipping.ClippingInfo ¶
Adds soft-clipping to the end of a read’s alignment.
Clipping is applied before any existing hard or soft clipping. E.g. a read with cigar 100M5S that is clipped with bases_to_clip=10 will yield a cigar of 90M15S.
If the read is unmapped or bases_to_clip < 1 then nothing is done.
If the read has fewer clippable bases than requested the read will be unmapped.
- Parameters
rec – the BAM record to clip
bases_to_clip – the number of additional bases of clipping desired in the read/query
clipped_base_quality – if not None, set bases in the clipped region to this quality
tags_to_invalidate – the set of extended attributes to remove upon clipping
- Returns
- a named tuple containing the number of query/read bases and the number
of target/reference bases clipped.
- Return type
- samwell.sam.clipping.softclip_end_of_alignment_by_ref(rec: pysam.libcalignedsegment.AlignedSegment, bases_to_clip: int, clipped_base_quality: Optional[int] = None, tags_to_invalidate: Iterable[str] = ('MD', 'NM', 'UQ')) samwell.sam.clipping.ClippingInfo ¶
Soft-clips the end of an alignment by bases_to_clip bases on the reference.
Clipping is applied beforeany existing hard or soft clipping. E.g. a read with cigar 100M5S that is clipped with bases_to_clip=10 will yield a cigar of 90M15S.
If the read is unmapped or bases_to_clip < 1 then nothing is done.
If the read has fewer clippable bases than requested the read will be unmapped.
- Parameters
rec – the BAM record to clip
bases_to_clip – the number of additional bases of clipping desired on the reference
clipped_base_quality – if not None, set bases in the clipped region to this quality
tags_to_invalidate – the set of extended attributes to remove upon clipping
- Returns
- a named tuple containing the number of query/read bases and the number
of target/reference bases clipped.
- Return type
- samwell.sam.clipping.softclip_start_of_alignment_by_query(rec: pysam.libcalignedsegment.AlignedSegment, bases_to_clip: int, clipped_base_quality: Optional[int] = None, tags_to_invalidate: Iterable[str] = ('MD', 'NM', 'UQ')) samwell.sam.clipping.ClippingInfo ¶
Adds soft-clipping to the start of a read’s alignment.
Clipping is applied after any existing hard or soft clipping. E.g. a read with cigar 5S100M that is clipped with bases_to_clip=10 will yield a cigar of 15S90M.
If the read is unmapped or bases_to_clip < 1 then nothing is done.
If the read has fewer clippable bases than requested the read will be unmapped.
- Parameters
rec – the BAM record to clip
bases_to_clip – the number of additional bases of clipping desired in the read/query
clipped_base_quality – if not None, set bases in the clipped region to this quality
tags_to_invalidate – the set of extended attributes to remove upon clipping
- Returns
- a named tuple containing the number of query/read bases and the number
of target/reference bases clipped.
- Return type
- samwell.sam.clipping.softclip_start_of_alignment_by_ref(rec: pysam.libcalignedsegment.AlignedSegment, bases_to_clip: int, clipped_base_quality: Optional[int] = None, tags_to_invalidate: Iterable[str] = ('MD', 'NM', 'UQ')) samwell.sam.clipping.ClippingInfo ¶
Soft-clips the start of an alignment by bases_to_clip bases on the reference.
Clipping is applied after any existing hard or soft clipping. E.g. a read with cigar 5S100M that is clipped with bases_to_clip=10 will yield a cigar of 15S90M.
If the read is unmapped or bases_to_clip < 1 then nothing is done.
If the read has fewer clippable bases than requested the read will be unmapped.
- Parameters
rec – the BAM record to clip
bases_to_clip – the number of additional bases of clipping desired on the reference
clipped_base_quality – if not None, set bases in the clipped region to this quality
tags_to_invalidate – the set of extended attributes to remove upon clipping
- Returns
- a named tuple containing the number of query/read bases and the number
of target/reference bases clipped.
- Return type
Classes for generating SAM and BAM files and records for testing¶
This module contains utility classes for the generation of SAM and BAM files and alignment records, for use in testing.
The module contains the following public classes:
SamBuilder
– A builder class that allows the accumulationof alignment records and access as a list and writing to file.
- class samwell.sam.sambuilder.SamBuilder(r1_len: Optional[int] = None, r2_len: Optional[int] = None, base_quality: int = 30, mapping_quality: int = 60, sd: Optional[List[Dict[str, Any]]] = None, rg: Optional[Dict[str, str]] = None, extra_header: Optional[Dict[str, Any]] = None, seed: int = 42, sort_order: Optional[samwell.sam.SamOrder] = SamOrder.Coordinate)¶
Builder for constructing one or more sam records (`AlignmentSegment`s in pysam terms).
Provides the ability to manufacture records from minimal arguments, while generating any remaining attributes to ensure a valid record.
A builder is constructed with a handful of defaults including lengths for generated R1s and R2s, the default base quality score to use, a sequence dictionary and a single read group.
Records are then added using the
add_pair()
method. Once accumulated the records can be accessed in the order in which they were created through theto_unsorted_list()
function, or in a list sorted by coordinate order viato_sorted_list()
. The latter creates a temporary file to do the sorting and is somewhat slower as a result.Records can be further modified after being returned from
add_pair()
,to_unsorted_list()
, orto_sorted_list()
by directly accessing their attributes through the [AlignedSegment](https://pysam.readthedocs.io/en/latest/api.html#pysam.AlignedSegment) API.Lastly, the records can be written to a temporary file using
to_path()
.- add_pair(*, name: Optional[str] = None, bases1: Optional[str] = None, bases2: Optional[str] = None, quals1: Optional[List[int]] = None, quals2: Optional[List[int]] = None, chrom: str = '*', start1: int = - 1, start2: int = - 1, cigar1: Optional[str] = None, cigar2: Optional[str] = None, strand1: str = '+', strand2: str = '-', attrs: Optional[Dict[str, Any]] = None) Tuple[pysam.libcalignedsegment.AlignedSegment, pysam.libcalignedsegment.AlignedSegment] ¶
Generates a new pair of reads, adds them to the internal collection, and returns them.
Most fields are optional.
An unmapped pair can be created by calling the method with no parameters (specifically, not setting chrom, start1 or start2). If either cigar is provided, it will be ignored.
A pair with only one of the two reads mapped is created by setting e.g. chrom and start1. The values will be automaticaly transferred to the unmapped mate, and flags set correctly.
A mapped pair is created by providing all three of chrom, start1 and start2.
For a given read (i.e. R1 or R2) the length of the read is determined based on the presence or absence of bases, quals, and cigar. If values are provided for one or more of these parameters, the lengths must match, and the length will be used to generate any unsupplied values. If none of bases, quals, and cigar are provided, all three will be synthesized based on either the r1_len or r2_len stored on the class as appropriate.
When synthesizing, bases are always a random sequence of bases, quals are all the default base quality (supplied when constructing a SamBuilder) and the cigar is always a single M operator of the read length.
Alignment attributes not exposed through the method parameters can be modified directly on the returned AlignedSegment objects. Modifications will be reflected when records are written to a temporary file with
to_path()
.- Parameters
name – The name of the template. If None is given a unique name will be auto-generated.
bases1 – The bases for R1. If None is given a random sequence is generated.
bases2 – The bases for R2. If None is given a random sequence is generated.
quals1 – The list of int qualities for R1. If None, the default base quality is used.
quals2 – The list of int qualities for R2. If None, the default base quality is used.
chrom – The chromosome to which both reads are mapped. Defaults to the unmapped value.
start1 – The start position of R1. Defaults to the unmapped value.
start2 – The start position of R2. Defaults to the unmapped value.
cigar1 – The cigar string for R1. Defaults to None for unmapped reads, otherwise all M.
cigar2 – The cigar string for R2. Defaults to None for unmapped reads, otherwise all M.
strand1 – The strand for R1, either “+” or “-”. Defaults to “+”.
strand2 – The strand for R2, either “+” or “-”. Defaults to “-“.
attrs – An optional dictionary of SAM attribute to place on both R1 and R2.
- Raises
ValueError – if either strand field is not “+” or “-”
ValueError – if bases/quals/cigar are set in a way that is not self-consistent
- Returns
The pair of records created, R1 then R2.
- Return type
Tuple[AlignedSegment, AlignedSegment]
- static default_rg() Dict[str, str] ¶
Returns the default read group used by the SamBuilder, as a dictionary.
- static default_sd() List[Dict[str, Any]] ¶
Generates the sequence dictionary that is used by default by SamBuilder.
Matches the names and lengths of the HG19 reference in use in production.
- Returns
A new copy of the sequence dictionary as a list of dictionaries, one per chromosome.
- property header: pysam.libcalignmentfile.AlignmentHeader¶
Returns a copy of the alignmentt header used by this builder
- to_path(path: Optional[pathlib.Path] = None, index: bool = True, pred: Callable[pysam.libcalignedsegment.AlignedSegment, bool] = <function SamBuilder.<lambda>>) pathlib.Path ¶
Write the accumulated records to a file, sorts & indexes it, and returns the Path. If a path is provided, it will be written to, otherwise a temporary file is created and returned.
- Parameters
path – a path at which to write the file, otherwise a temp file is used.
index – if True and sort_order is Coordinate index is generated, otherwise not.
pred – optional predicate to specify which reads should be output
- Returns
The path to the sorted (and possibly indexed) file.
- Return type
Path
- to_sorted_list() List[pysam.libcalignedsegment.AlignedSegment] ¶
Returns the accumulated records in coordinate order.
- to_unsorted_list() List[pysam.libcalignedsegment.AlignedSegment] ¶
Returns the accumulated records in the order they were created.