API


Utility Functions for Manipulating DNA sequences.

This module contains utility functions for manipulating DNA sequences.

samwell.dnautils.has_long_homopolymer(bases: str, max_hp_length: int) bool

Returns true if the given bases has a homopolymer length longer than the given length.

Parameters
  • bases – the bases to examine.

  • max_hp_length – the maximum homopolymer length to allow.

samwell.dnautils.mask_long_homopolymers(bases: str, min_long_hp_length: int, mask_base: str = 'N') str

Returns the bases masked for regions with long homopolymers

Parameters
  • bases – the bases to mask.

  • min_long_hp_length – the minimum homopolymer length (inclusive) to mask.

  • mask_base – the base to use when masking

samwell.dnautils.reverse_complement(bases: str) str

Reverse complements a base sequence.

Parameters

bases – the bases to be reverse complemented.

Returns

the reverse complement of the provided base string

Functions for Creating Useful Iterators

This module contains classes and functions for creating useful iterators.

Examples of a “Peekable” Iterator

“Peekable” iterators are useful to “peek” at the next item in an iterator without consuming it. For example, this is useful when consuming items in iterator while a predicate is true, and not consuming the first element where the element is not true. See the takewhile() and dropwhile() methods.

An empty peekable iterator throws StopIteration:

>>> from samwell.itertools import peekable
>>> piter = peekable(iter([]))
>>> piter.peek()
StopIteration

A peekable iterator will return the next item before consuming it.

>>> piter = peekable(iter([1, 2, 3]))
>>> piter.peek()
1
>>> next(piter)
1
>>> [j for j in piter]
[2, 3]

The can_peek() function can be used to determine if the iterator can be peeked without StopIteration being thrown:

>>> piter = peekable([1])
>>> piter.peek() if piter.can_peek() else -1
1
>>> next(piter)
1
>>> piter.peek() if piter.can_peek() else -1
-1
>>> next(piter)
StopIteration

The peekable() function should be preferred to calling PeekableIterator’s constructor directly as it supports creation from iterable objects as well as iterators, while the constructor requires an iterator.

Examples of a “Merging” Iterator

A “merging” iterator can merge two iterators in order based on a given ordering function. This is useful for merging two iterators that are already in order.

>>> from samwell.itertools import MergingIterator
>>> even = iter([2, 4, 6, 8])
>>> odd = iter([1, 3, 5, 9])
>>> merging = MergingIterator(even, odd, lambda x: x)
>>> list(merging)
[1, 2, 3, 4, 5, 6, 7, 8, 9]

Module Contents

The module contains the following public classes:

The module contains the following methods:

  • peekable() – Creates an iterator that allows you to peek at

    the next value before calling next

class samwell.itertools.MergingIterator(iter1: Iterator[IterType], iter2: Iterator[IterType], keyfunc: Callable[IterType, Any])

An iterator that merges two iterators; if they are sorted and keyfunc is passed, yields results in order.

Parameters
  • iter1 – an iterator

  • iter2 – an iterator

  • keyfunc – a function that extracts a key from an item that is used to order items

class samwell.itertools.PeekableIterator(source: Iterator[IterType])

A peekable iterator wrapping an iterable.

This allows returning the next item without consuming it.

Parameters

source – an iterator over the objects

can_peek() bool

Returns true if there is a value that can be peeked at, false otherwise.

dropwhile(pred: Callable[IterType, bool]) samwell.itertools.PeekableIterator[IterType]

Drops elements from the iterator while the predicate is true.

Updates the iterator to point at the first non-matching element, or exhausts the iterator if all elements match the predicate.

Parameters
  • pred (Callable[[V], bool]) – a function that takes a value from the iterator

  • false. (and returns true or) –

Returns

a reference to this iterator, so calls can be chained

Return type

PeekableIterator[V]

maybe_peek() Optional[IterType]

Returns the next element without consuming it, or None otherwise.

peek() IterType

Returns the next element without consuming it, or StopIteration otherwise.

takewhile(pred: Callable[IterType, bool]) List[IterType]

Consumes from the iterator while pred is true, and returns the result as a List.

The iterator is left pointing at the first non-matching item, or if all items match then the iterator will be exhausted.

Parameters

pred – a function that takes the next value from the iterator and returns true or false.

Returns

A list of the values from the iterator, in order, up until and excluding the first value that does not match the predicate.

Return type

List[V]

samwell.itertools.peekable(source: Union[Iterator[IterType], Iterable[IterType]]) samwell.itertools.PeekableIterator[IterType]

Creates a peekable iterator that allows you to peek at the next value before calling next

The peek method will return the next element without consuming it, otherwise StopIteration.

Parameters

source – either an iterator over the objects, or a callable that is called until it returns the sentinel.

Returns

a PeekableIterator

Utility Classes for Querying Overlaps with Genomic Regions

DEPRECATED - if you have the option use ~pybedlite.overlap_detector in favor of this.

Examples of Detecting Overlaps

>>> from samwell.overlap_detector import Interval, OverlapDetector
>>> detector = OverlapDetector()
>>> query = Interval("chr1", 2, 20)
>>> detector.overlaps_any(query)
False
>>> detector.add(Interval("chr2", 1, 100))
>>> detector.add(Interval("chr1", 21, 100))
>>> detector.overlaps_any(query)
False
>>> detector.add(Interval("chr1", 1, 1))
>>> detector.overlaps_any(query)
True
>>> detector.get_overlaps(query)
[Interval("chr1", 1, 1)]
>>> detector.add(Interval("chr1", 3, 10))
>>> detector.overlaps_any(query)
True
>>> detector.get_overlaps(query)
[Interval("chr1", 1, 1), interval("chr1", 3, 10)]

Module Contents

The module contains the following public classes:

  • Interval – Represents a region mapping to the genome

    that is 0-based and open-ended

  • OverlapDetector – Detects and returns overlaps between

    a set of genomic regions and another genomic region

class samwell.overlap_detector.Interval(refname: str, start: int, end: int, negative: bool = False, name: Optional[str] = None)

A region mapping to the genome that is 0-based and open-ended

refname

the refname (or chromosome)

Type

str

start

the 0-based start position

Type

int

end

the 0-based end position (exclusive)

Type

int

negative

true if the interval is on the negative strand, false otherwise

Type

bool

name

an optional name assigned to the interval

Type

Optional[str]

length() int

Returns the length of the interval.

overlap(other: pybedlite.overlap_detector.Interval) int

Returns the overlap between this interval and the other, or zero if there is none.

Parameters

other (Interval) – the other interval to find the overlap with

class samwell.overlap_detector.OverlapDetector

Detects and returns overlaps between a set of genomic regions and another genomic region.

If two intervals have the same coordinates, only the first that was added will be kept.

Since Interval objects are used both to populate the overlap detector and to query it, the coordinate system in use is also 0-based open-ended.

add(interval: pybedlite.overlap_detector.Interval) None

Adds a interval to this detector.

Parameters

interval – the interval to add to this detector

add_all(intervals: Iterable[pybedlite.overlap_detector.Interval]) None

Adds one or more intervals to this detector.

Parameters

intervals – the intervals to add to this detector

classmethod from_bed(path: pathlib.Path) pybedlite.overlap_detector.OverlapDetector

Builds an OverlapDetector from a BED file.

Parameters

path – the path to the BED file

Returns

An overlap detector for the regions in the BED file.

get_enclosed(interval: pybedlite.overlap_detector.Interval) List[pybedlite.overlap_detector.Interval]

Returns the set of intervals in this detector that are enclosed by the query interval. I.e. target.start >= query.start and target.end <= query.end.

Args:

interval: the query interval

Returns:

The list of intervals in this detector that are enclosed within the query interval. The intervals will be return in ascending genomic order.

get_enclosing_intervals(interval: pybedlite.overlap_detector.Interval) List[pybedlite.overlap_detector.Interval]

Returns the set of intervals in this detector that wholly enclose the query interval. i.e. query.start >= target.start and query.end <= target.end.

Args:

interval: the query interval

Returns:

The list of intervals in this detector that enclose the query interval. The intervals will be returned in ascending genomic order.

get_overlaps(interval: pybedlite.overlap_detector.Interval) List[pybedlite.overlap_detector.Interval]

Returns any intervals in this detector that overlap the given interval.

Parameters

interval – the interval to check

Returns

The list of intervals in this detector that overlap the given interval, or the empty list if no overlaps exist. The intervals will be return in ascending genomic order.

overlaps_any(interval: pybedlite.overlap_detector.Interval) bool

Determines whether the given interval overlaps any interval in this detector.

Parameters

interval – the interval to check

Returns

True if and only if the given interval overlaps with any interval in this detector.

Utility Classes and Methods for SAM/BAM

This module contains utility classes for reading and writing SAM/BAM files, as well as for manipulating Cigars. It is recommended to use the reader() and writer() methods rather than pysam.AlignmentFile directly (see below for motivation).

Motivation for Reader and Writer methods

The following are the reasons for choosing to implement methods to open a SAM/BAM file for reading and writing, rather than relying on pysam.AlignmentFile directly:

  1. Provides a centralized place for the implementation of opening a SAM/BAM for reading and writing. This is useful if any additional parameters are added, or changes to standards or defaults are made.

  2. Makes the requirement to provide a header when opening a file for writing more explicit.

  3. Adds support for Path.

  4. Remove the reliance on specifying the mode correctly, including specifying the file type (i.e. SAM, BAM, or CRAM), as well as additional options (ex. compression level). This makes the code more explicit and easier to read.

  5. An explicit check is performed to ensure the file type is specified when writing using a file-like object rather than a path to a file.

Examples of Opening a SAM/BAM for Reading or Writing

Opening a SAM/BAM file for reading, auto-recognizing the file-type by the file extension. See SamFileType for the supported file types.

>>> from samwell.sam import reader
>>> with reader("/path/to/sample.sam") as fh:
...     for record in fh:
...         print(record.name)  # do something
>>> with reader("/path/to/sample.bam") as fh:
...     for record in fh:
...         print(record.name)  # do something

Opening a SAM/BAM file for reading, explicitly passing the file type.

>>> from samwell.sam import SamFileType
>>> with reader(path="/path/to/sample.ext1", file_type=SamFileType.SAM) as fh:
...     for record in fh:
...         print(record.name)  # do something
>>> with reader(path="/path/to/sample.ext2", file_type=SamFileType.BAM) as fh:
...     for record in fh:
...         print(record.name)  # do something

Opening a SAM/BAM file for reading, using an existing file-like object

>>> with open("/path/to/sample.sam", "rb") as file_object:
...     with reader(path=file_object, file_type=SamFileType.BAM) as fh:
...         for record in fh:
...             print(record.name)  # do something

Opening a SAM/BAM file for writing follows similar to the reader() method, but the SAM file header object is required.

>>> from samwell.sam import writer
>>> header: Dict[str, Any] = {
...     "HD": {"VN": "1.5", "SO": "coordinate"},
...     "RG": [{"ID": "1", "SM": "1_AAAAAA", "LB": "lib", "PL": "ILLUMINA", "PU": "xxx.1"}],
...     "SQ":  [
...         {"SN": "chr1", "LN": 249250621},
...         {"SN": "chr2", "LN": 243199373}
...     ]
... }
>>> with writer(path="/path/to/sample.bam", header=header) as fh:
...     pass  # do something

Examples of Manipulating Cigars

Creating a Cigar from a pysam.AlignedSegment.

>>> from samwell.sam import Cigar
>>> with reader("/path/to/sample.sam") as fh:
...     record = next(fh)
...     cigar = Cigar.from_cigartuples(record.cigartuples)
...     print(str(cigar))
50M2D5M10S

Creating a Cigar from a str.

>>> cigar = Cigar.from_cigarstring("50M2D5M10S")
>>> print(str(cigar))
50M2D5M10S

If the cigar string is invalid, the exception message will show you the problem character(s) in square brackets.

>>> cigar = Cigar.from_cigarstring("10M5U")
... CigarException("Malformed cigar: 10M5[U]")

The cigar contains a tuple of CigarOp) and associated operator length. A number of useful methods are part of both classes.

The number of bases aligned on the query (i.e. the number of bases consumed by the cigar from the query):

>>> cigar = Cigar.from_cigarstring("50M2D5M2I10S")
>>> [e.length_on_query for e in cigar.elements]
[50, 0, 5, 2, 10]
>>> [e.length_on_target for e in cigar.elements]
[50, 2, 5, 0, 0]
>>> [e.operator.is_indel for e in cigar.elements]
[False, True, False, True, False]

Module Contents

The module contains the following public classes:

  • SamFileType – Enumeration of valid SAM/BAM/CRAM file types.

  • SamOrder – Enumeration of possible SAM/BAM/CRAM sort orders.

  • CigarOp – Enumeration of operators that can appear in a Cigar string.

  • CigarElement – Class representing an element in a Cigar string.

  • CigarParsingException – The exception raised specific to parsing a

    cigar

  • Cigar – Class representing a cigar string.

The module contains the following methods:

  • reader() – opens a SAM/BAM/CRAM file for reading.

  • writer() – opens a SAM/BAM/CRAM file for writing

  • set_qc_fail() – sets the QC fail flag in a

    pysam.AlignedSegment record and sets additional SAM tags giving the tool name and reason for why the QC fail flag was set.

  • get_qc_fail() – gets the tool name and reason for why the QC fail flag

    was set, or None if it is not set.

class samwell.sam.Cigar(elements: Tuple[samwell.sam.CigarElement, ...] = ())

Class representing a cigar string.

- elements

zero or more cigar elements

Type

Tuple[CigarElement, …]

coalesce() samwell.sam.Cigar

Returns a copy of the cigar adjacent operators of the same type coalesced into single operators.

classmethod from_cigarstring(cigarstring: str) samwell.sam.Cigar

Constructs a Cigar from a string returned by pysam.

If “*” is given, returns an empty Cigar.

classmethod from_cigartuples(cigartuples: Optional[List[Tuple[int, int]]]) samwell.sam.Cigar

Returns a Cigar from a list of tuples returned by pysam.

Each tuple denotes the operation and length. See CigarOp for more information on the various operators. If None is given, returns an empty Cigar.

length_on_query() int

Returns the length of the alignment on the query sequence.

length_on_target() int

Returns the length of the alignment on the target sequence.

reversed() samwell.sam.Cigar

Returns a copy of the Cigar with the elements in reverse order.

class samwell.sam.CigarElement(length: int, operator: samwell.sam.CigarOp)

Represents an element in a Cigar

- length

the length of the element

Type

int

- operator

the operator of the element

Type

CigarOp

property length_on_query: int

Returns the length of the element on the query sequence.

property length_on_target: int

Returns the length of the element on the target (often reference) sequence.

class samwell.sam.CigarOp(value)

Enumeration of operators that can appear in a Cigar string.

code

The pysam cigar operator code.

Type

int

character

The single character cigar operator.

Type

int

consumes_query

True if this operator consumes query bases, False otherwise.

Type

bool

consumes_target

True if this operator consumes target bases, False otherwise.

Type

bool

D = (2, 'D', False, True)

Deletion versus the reference

EQ = (7, '=', True, True)

Matches the reference

H = (5, 'H', False, False)

Hard clip

I = (1, 'I', True, False)

E741

Type

Insertion versus the reference # noqa

M = (0, 'M', True, True)

Match or Mismatch the reference

N = (3, 'N', False, True)

Skipped region from the reference

P = (6, 'P', False, False)

Padding

S = (4, 'S', True, False)

Soft clip

X = (8, 'X', True, True)

Mismatches the reference

static from_character(character: str) samwell.sam.CigarOp

Returns the operator from the single character.

static from_code(code: int) samwell.sam.CigarOp

Returns the operator from the given operator code.

Note: this is mainly used to get the operator from pysam.

property is_indel: bool

Returns true if the operator is an indel, false otherwise.

exception samwell.sam.CigarParsingException

The exception raised specific to parsing a cigar.

samwell.sam.NO_REF_INDEX: int = -1

The reference name to use to indicate no reference in SAM/BAM.

samwell.sam.NO_REF_NAME: str = '*'

The reference position to use to indicate no position in SAM/BAM.

class samwell.sam.SamFileType(value)

Enumeration of valid SAM/BAM/CRAM file types.

mode

The additional mode character to add when opening this file type.

Type

str

ext

The standard file extension for this file type.

Type

str

classmethod from_path(path: Union[pathlib.Path, str]) samwell.sam.SamFileType

Infers the file type based on the file extension.

Parameters

path – the path to the SAM/BAM/CRAM to read or write.

class samwell.sam.SamOrder(value)

Enumerations of possible sort orders for a SAM file.

Coordinate = 'coordinate'

coordinate sorted

QueryName = 'queryname'

queryname sorted

Unsorted = 'unsorted'

the SAM / BAM / CRAM is unsorted

samwell.sam.SamPath = typing.Union[typing.IO[typing.Any], pathlib.Path, str]

The reference index to use to indicate no reference in SAM/BAM.

samwell.sam.get_qc_fail(rec: pysam.libcalignedsegment.AlignedSegment) Optional[Tuple[str, str]]

Gets the tool and reason for why the QC fail flag is set, otherwise None if not set.

If the QC fail flag is set, but the tool and filter reason SAM tags are not set, None will be returned. Use pysam.AlignedSegment.is_qcfail() to check if the record is simply QC failed.

Parameters

rec – the record to fail

samwell.sam.get_qc_fail_by_tool(rec: pysam.libcalignedsegment.AlignedSegment, tool: Optional[Callable[..., Any]] = None) Optional[Tuple[str, str]]

Gets the tool and reason for why the QC fail flag if the flag was set by the passed tool.

None will be returned in the following cases:
  • The QC fail flag is not set

  • The QC fail flag isset, but the tool and filter reason SAM tags are not set

  • The tool and filter reason SAM tags were set by a different tool

Use pysam.AlignedSegment.is_qcfail() to check if the record is simply QC failed.

Parameters
  • rec – the record to fail

  • tool – the tool that must have set the QC fail flag

samwell.sam.isize(r1: pysam.libcalignedsegment.AlignedSegment, r2: pysam.libcalignedsegment.AlignedSegment) int

Computes the insert size for a pair of records.

samwell.sam.reader(path: Union[IO[Any], pathlib.Path, str], file_type: Optional[samwell.sam.SamFileType] = None) pysam.libcalignmentfile.AlignmentFile

Opens a SAM/BAM/CRAM for reading.

Parameters
  • path – a file handle or path to the SAM/BAM/CRAM to read or write.

  • file_type – the file type to assume when opening the file. If None, then the file type will be auto-detected.

samwell.sam.set_pair_info(r1: pysam.libcalignedsegment.AlignedSegment, r2: pysam.libcalignedsegment.AlignedSegment, proper_pair: bool = True) None

Resets mate pair information between reads in a pair. Requires that both r1 and r2 are mapped. Can be handed reads that already have pairing flags setup or independent R1 and R2 records that are currently flagged as SE reads.

Parameters
  • r1 – read 1

  • r2 – read 2 with the same queryname as r1

samwell.sam.set_qc_fail(rec: pysam.libcalignedsegment.AlignedSegment, tool: Callable[..., Any], reason: str) None

Sets the QC fail flag, and adds tags containing the tool name and reason for failing. :param rec: the record to fail :param tool: the tool (as a callable) that failed this record :param reason: the reason for failing

samwell.sam.writer(path: Union[IO[Any], pathlib.Path, str], header: Union[str, Dict[str, Any], pysam.libcalignmentfile.AlignmentHeader], file_type: Optional[samwell.sam.SamFileType] = None) pysam.libcalignmentfile.AlignmentFile

Opens a SAM/BAM/CRAM for writing.

Parameters
  • path – a file handle or path to the SAM/BAM/CRAM to read or write.

  • header – Either a string to use for the header or a multi-level dictionary. The multi-level dictionary should be given as follows. The first level are the four types (‘HD’, ‘SQ’, …). The second level are a list of lines, with each line being a list of tag-value pairs. The header is constructed first from all the defined fields, followed by user tags in alphabetical order.

  • file_type – the file type to assume when opening the file. If None, then the filetype will be auto-detected and must be a path-like object.

Utility methods for running BWA

This module contains methods for running BWA. Currently only the “mem” algorithm is supported:
  • align() – Aligns the given reads with BWA mem.

The options for running BWA can be customized via the three Options classes:
  • AlgorithmOptions – Bwa mem algorithm options.

  • ScoringOptions – Bwa mem scoring options

  • InputOutputOptions – Bwa mem input and output options

An input read for alignment must be minimally transformed into a FASTQ-like record:

  • FastqRecord – Fastq record used as input to alignment.

Implementation

Alignment of reads is performed asynchronously.

This is achieved by creating three sub-processes in align():

1. A process to consume the input iterable of reads and write them (a) to the stdin of BWA mem, and (b) to a queue of reads that are awaiting alignment results from BWA mem. 2. A process to run BWA mem, where FASTQ records are written to the process’ stdin, alignment results are returned to stdout, and any error/logging information from BWA mem is returned to stderr. 3. A process to route the stderr of BWA mem to the given stderr handle (stderr_out).

Then align() method consumes the stdout of the BWA mem process and collates that with the queue of reads that have been written/given to BWA mem (from process 1). For each input read, one or more alignments is expected to be returned by BWA mem. The order in which alignments of reads are returned by BWA mem is the same order as the order of reads given to BWA mem. The align() method then returns an iterable over the alignment results.

Exceptions may occur in the thread to input FASTQ records to BWA mem, which are propagated to the caller. Furthermore, an exception is returned if the # of reads given to BWA mem is not the same as the # of reads returned by BWA mem.

Some specific handling occurs around reading the BWA mem output with pysam, since the latter blocks waiting for at least some reads from BWA mem, which may not happen if there was an issue in the various upstream processes (input to BWA mem or BWA mem itself). This would have caused a deadlock.

Examples

Typically, we have AlignedSegment records obtained from reading from a SAM or BAM file. The first must be converted into FastqRecord objects.

>>> from samwell.sam.bwa_mem import FastqRecord
>>> from samwell.sam import reader
>>> reads = reader("/path/to/sample.sam")
>>> fastq_reads = map(lambda read: FastqRecord.build(read), reads)

Next, those align_mem method.

>>> from samwell.sam.bwa_mem import align
>>> results = map(lambda read: align(read), fastq_reads)

This returns an iterable over the alignment results. An alignment result is a tuple consisting of the original FastqRecord and an iterator over the alignments (see AlignedSegment).

>>> result = next(result)
>>> fastq_read, alignments = result
>>> str(fastq_read)
@name
GATTACA
+
HIJKLKM
>>> len(alignments)
2
>>> alignment = str(next(alignments))
>>> alignment.query_name
name
>>> type(alignment)
<class 'pysam.libcalignedsegment.AlignedSegment'>
class samwell.sam.bwa_mem.AlgorithmOptions(threads: Optional[int] = None, min_seed_len: Optional[int] = None, band_width: Optional[int] = None, off_diagonal_dropoff: Optional[int] = None, internal_seeds_length_factor: Optional[float] = None, max_third_seed_occurrence: Optional[int] = None, max_seed_occurrence: Optional[int] = None, drop_ratio: Optional[float] = None, min_chain_weight: Optional[int] = None, max_mate_rescue_rounds: Optional[int] = None, skip_mate_rescue: Optional[bool] = None, skip_pairing: Optional[bool] = None)

Bwa mem algorithm options

threads

number of threads

Type

Optional[int]

min_seed_len

minimum seed length

Type

Optional[int]

band_width

band width for banded alignment

Type

Optional[int]

off_diagonal_dropoff

off-diagonal X-dropoff

Type

Optional[int]

internal_seeds_length_factor

look for internal seeds inside a seed longer than min_seed_len * internal_seeds_length_factor

Type

Optional[float]

max_third_seed_occurrence

seed occurrence for the 3rd round seeding

Type

Optional[int]

max_seed_occurrence

skip seeds with more than INT occurrences

Type

Optional[int]

drop_ratio

drop chains shorter than this fraction of the longest overlapping chain

Type

Optional[float]

min_chain_weight

discard a chain if seeded bases shorter than this value

Type

Optional[int]

max_mate_rescue_rounds

perform at most INT rounds of mate rescues for each read

Type

Optional[int]

skip_mate_rescue

skip mate rescue

Type

Optional[bool]

skip_pairing

skip pairing; mate rescue performed unless skip_mate_rescue also in use

Type

Optional[bool]

class samwell.sam.bwa_mem.FastqRecord(name: str, bases: str, quals: str, source: Optional[FastqRecordSourceType] = None, needs_alignment: bool = True, read_number: Optional[int] = None)

Fastq record used as input to alignment.

name

the name of the read

Type

str

bases

the read bases

Type

str

quals

the base qualities

Type

str

source

optionally the AlignedSegment from which this was built

Type

Optional[FastqRecordSourceType]

needs_alignment

True if the read needs alignment, False otherwise

Type

bool

read_number

optionally the read number; should be set to 1 or 2 for paired end reads.

Type

Optional[int]

classmethod build(read: pysam.libcalignedsegment.AlignedSegment, needs_alignment: bool = True, aligned_bases_only: bool = False, clip_three_prime: int = 0) samwell.sam.bwa_mem.FastqRecord

Builds a FastqRecord from a AlignedSegment

Parameters
  • read – the read to convert

  • needs_alignment – True if the read should be aligned, False otherwise

  • aligned_bases_only – only align the aligned bases (excludes soft-clipped bases)

  • clip_three_prime – the number of bases to clip on the three-prime end of the read relative to the original direction of sequencing. This will be applied after extracting the bases based on aligned_bases_only.

str_with_read_number() str

Returns the record in FASTQ format, with the read number appended (colon delimited).

class samwell.sam.bwa_mem.InputOutputOptions(interleaved_pairs: Optional[bool] = None, read_group: Optional[str] = None, header_insert: Optional[Union[str, pathlib.Path]] = None, alts_as_primary: Optional[bool] = None, verbosity: Optional[int] = None, min_alignment_score: Optional[int] = None, max_hits_within_max_score: Optional[Union[int, Tuple[int, int]]] = None, all_alignments: Optional[bool] = None, append_fastq_comment: Optional[bool] = None, add_fasta_header_to_xr: Optional[bool] = None, softclip_supplementary: Optional[bool] = None, split_hits_are_secondary: Optional[bool] = None, insert_size_params: Optional[Union[float, Tuple[float, float], Tuple[float, float, int], Tuple[float, float, int, int]]] = None, bases_per_batch: Optional[int] = 115000)

Bwa mem input and output options

interleaved_pairs

read pairs are consecutive (r1 then r2), otherwise fragment reads

Type

Optional[bool]

read_group

read group header line such as @RG ID:foo SM:bar

Type

Optional[str]

header_insert

insert STR to header if it starts with @; or insert lines in FILE

Type

Optional[Union[str, pathlib.Path]]

alts_as_primary

treat ALT contigs as part of the primary assembly (i.e. ignore <idxbase>.alt file)

Type

Optional[bool]

verbosity

verbose level: 1=error, 2=warning, 3=message, 4+=debuggin

Type

Optional[int]

min_alignment_score

minimum score to output

Type

Optional[int]

max_hits_within_max_score

if there are <INT hits with score >80% of the max score, output all in XA

Type

Optional[Union[int, Tuple[int, int]]]

all_alignments

output all alignments for SE or unpaired PE

Type

Optional[bool]

append_fastq_comment

append FASTA/FASTQ comment to SAM output

Type

Optional[bool]

add_fasta_header_to_xr

output the reference FASTA header in the XR tag

Type

Optional[bool]

softclip_supplementary

use soft clipping for supplementary alignments

Type

Optional[bool]

split_hits_are_secondary

mark shorter split hits as secondary

Type

Optional[bool]

insert_size_params

specify the mean, standard deviation (10% of the mean if absent), max (4 sigma from the mean if absent) and min of the insert size distribution. FR orientation only.

Type

Optional[Union[float, Tuple[float, float], Tuple[float, float, int], Tuple[float, float, int, int]]]

bases_per_batch

how many bases of sequence data bwa should read from the input before triggering a batch of alignments.

Type

Optional[int]

class samwell.sam.bwa_mem.ReadType(value)

The read type for BWA mem.

class samwell.sam.bwa_mem.ScoringOptions(match_score: Optional[int] = None, mismatch_score: Optional[int] = None, gap_open: Optional[Union[int, Tuple[int, int]]] = None, gap_extend: Optional[Union[int, Tuple[int, int]]] = None, clipping_penalty: Optional[Union[int, Tuple[int, int]]] = None, unpaired_penalty: Optional[int] = None, read_type: Optional[samwell.sam.bwa_mem.ReadType] = None)

Bwa mem scoring options

match_score

the score for a sequence match, which scales options -TdBOELU unless overridden

Type

Optional[int]

mismatch_score

penalty for a mismatch

Type

Optional[int]

gap_open

gap open penalties for deletions and insertions (single value to use the same for both)

Type

Optional[Union[int, Tuple[int, int]]]

gap_extend

gap extension penalty; a gap of size k cost ‘{-O} + {-E}*k’ (single value to use the same for both)

Type

Optional[Union[int, Tuple[int, int]]]

clipping_penalty

penalty for 5’- and 3’-end clipping

Type

Optional[Union[int, Tuple[int, int]]]

unpaired_penalty

penalty for an unpaired read pair

Type

Optional[int]

read_type

read type. Setting -x changes multiple parameters unless overriden: pacbio: -k17 -W40 -r10 -A1 -B1 -O1 -E1 -L0 (PacBio reads to ref) ont2d: -k14 -W20 -r10 -A1 -B1 -O1 -E1 -L0 (Oxford Nanopore 2D-reads to ref) intractg: -B9 -O16 -L5 (intra-species contigs to ref)

Type

Optional[samwell.sam.bwa_mem.ReadType]

samwell.sam.bwa_mem.align(reads: Iterable[samwell.sam.bwa_mem.FastqRecord], idxbase: pathlib.Path, executable_path: pathlib.Path = PosixPath('bwa'), algo_opts: Optional[samwell.sam.bwa_mem.AlgorithmOptions] = None, scoring_opts: Optional[samwell.sam.bwa_mem.ScoringOptions] = None, io_opts: Optional[samwell.sam.bwa_mem.InputOutputOptions] = None, suppress_secondaries: bool = False, stderr_out: Any = <_io.TextIOWrapper name='<stderr>' mode='w' encoding='UTF-8'>) Iterable[Tuple[samwell.sam.bwa_mem.FastqRecord, List[pysam.libcalignedsegment.AlignedSegment]]]

Aligns the given reads with BWA mem.

See bwa_mem for a detailed explanation for the implementation approach.

Parameters
  • reads – the reads to align

  • idxbase – the path prefix for all the BWA-specific index files

  • executable_path – the path to the BWA executable

  • algo_opts – the algorithm options

  • scoring_opts – the scoring options

  • io_opts – the input and output options

  • suppress_secondaries – true to discard all secondary alignments, false otherwise

Returns

An iterable over the alignment results. An alignment result is a tuple consisting of the original FastqRecord and an iterator over the alignments (see AlignedSegment)

Utility Functions for Soft-Clipping records in SAM/BAM Files

This module contains utility functions for soft-clipping reads. There are four variants that support clipping the beginnings and ends of reads, and specifying the amount to be clipped in terms of query bases or reference bases:

  • softclip_start_of_alignment_by_query() clips the start of the alignment in terms of query bases

  • softclip_end_of_alignment_by_query() clips the end of the alignment in terms of query bases

  • softclip_start_of_alignment_by_ref() clips the start of the alignment in terms of reference bases

  • softclip_end_of_alignment_by_ref() clips the end of the alignment in terms of reference bases

The difference between query and reference based versions is apparent only when there are insertions or deletions in the read as indels have lengths on either the query (insertions) or reference (deletions) but not both.

Upon clipping a set of additional SAM tags are removed from reads as they are likely invalid.

For example, to clip the last 10 query bases of all records and reduce the qualities to Q2:

>>> from samwell.sam import reader, clipping
>>> with reader("/path/to/sample.sam") as fh:
...     for rec in fh:
...         clipping.softclip_end_of_alignment_by_query(rec, 10, 2)
...         print(rec.cigarstring)

It should be noted that any clipping potentially makes the common SAM tags NM, MD and UQ invalid, as well as potentially other alignment based SAM tags. Any clipping added to the start of an alignment changes the position (reference_start) of the record. Any reads that have no aligned bases after clipping are set to be unmapped. If writing the clipped reads back to a BAM it should be noted that:

  • Mate pairs may have incorrect information about their mate’s positions

  • Even if the input was coordinate sorted, the output may be out of order

To rectify these problems it is necessary to do the equivalent of:

cat clipped.bam | samtools sort -n | samtools fixmate | samtools sort | samtools calmd
class samwell.sam.clipping.ClippingInfo(query_bases_clipped: int, ref_bases_clipped: int)

Named tuple holding the number of bases clipped on the query and reference respectively.

query_bases_clipped

the number of query bases in the alignment that were clipped.

Type

int

ref_bases_clipped

the number of reference bases in the alignment that were clipped.

Type

int

property query_bases_clipped

Alias for field number 0

property ref_bases_clipped

Alias for field number 1

samwell.sam.clipping.softclip_end_of_alignment_by_query(rec: pysam.libcalignedsegment.AlignedSegment, bases_to_clip: int, clipped_base_quality: Optional[int] = None, tags_to_invalidate: Iterable[str] = ('MD', 'NM', 'UQ')) samwell.sam.clipping.ClippingInfo

Adds soft-clipping to the end of a read’s alignment.

Clipping is applied before any existing hard or soft clipping. E.g. a read with cigar 100M5S that is clipped with bases_to_clip=10 will yield a cigar of 90M15S.

If the read is unmapped or bases_to_clip < 1 then nothing is done.

If the read has fewer clippable bases than requested the read will be unmapped.

Parameters
  • rec – the BAM record to clip

  • bases_to_clip – the number of additional bases of clipping desired in the read/query

  • clipped_base_quality – if not None, set bases in the clipped region to this quality

  • tags_to_invalidate – the set of extended attributes to remove upon clipping

Returns

a named tuple containing the number of query/read bases and the number

of target/reference bases clipped.

Return type

ClippingInfo

samwell.sam.clipping.softclip_end_of_alignment_by_ref(rec: pysam.libcalignedsegment.AlignedSegment, bases_to_clip: int, clipped_base_quality: Optional[int] = None, tags_to_invalidate: Iterable[str] = ('MD', 'NM', 'UQ')) samwell.sam.clipping.ClippingInfo

Soft-clips the end of an alignment by bases_to_clip bases on the reference.

Clipping is applied beforeany existing hard or soft clipping. E.g. a read with cigar 100M5S that is clipped with bases_to_clip=10 will yield a cigar of 90M15S.

If the read is unmapped or bases_to_clip < 1 then nothing is done.

If the read has fewer clippable bases than requested the read will be unmapped.

Parameters
  • rec – the BAM record to clip

  • bases_to_clip – the number of additional bases of clipping desired on the reference

  • clipped_base_quality – if not None, set bases in the clipped region to this quality

  • tags_to_invalidate – the set of extended attributes to remove upon clipping

Returns

a named tuple containing the number of query/read bases and the number

of target/reference bases clipped.

Return type

ClippingInfo

samwell.sam.clipping.softclip_start_of_alignment_by_query(rec: pysam.libcalignedsegment.AlignedSegment, bases_to_clip: int, clipped_base_quality: Optional[int] = None, tags_to_invalidate: Iterable[str] = ('MD', 'NM', 'UQ')) samwell.sam.clipping.ClippingInfo

Adds soft-clipping to the start of a read’s alignment.

Clipping is applied after any existing hard or soft clipping. E.g. a read with cigar 5S100M that is clipped with bases_to_clip=10 will yield a cigar of 15S90M.

If the read is unmapped or bases_to_clip < 1 then nothing is done.

If the read has fewer clippable bases than requested the read will be unmapped.

Parameters
  • rec – the BAM record to clip

  • bases_to_clip – the number of additional bases of clipping desired in the read/query

  • clipped_base_quality – if not None, set bases in the clipped region to this quality

  • tags_to_invalidate – the set of extended attributes to remove upon clipping

Returns

a named tuple containing the number of query/read bases and the number

of target/reference bases clipped.

Return type

ClippingInfo

samwell.sam.clipping.softclip_start_of_alignment_by_ref(rec: pysam.libcalignedsegment.AlignedSegment, bases_to_clip: int, clipped_base_quality: Optional[int] = None, tags_to_invalidate: Iterable[str] = ('MD', 'NM', 'UQ')) samwell.sam.clipping.ClippingInfo

Soft-clips the start of an alignment by bases_to_clip bases on the reference.

Clipping is applied after any existing hard or soft clipping. E.g. a read with cigar 5S100M that is clipped with bases_to_clip=10 will yield a cigar of 15S90M.

If the read is unmapped or bases_to_clip < 1 then nothing is done.

If the read has fewer clippable bases than requested the read will be unmapped.

Parameters
  • rec – the BAM record to clip

  • bases_to_clip – the number of additional bases of clipping desired on the reference

  • clipped_base_quality – if not None, set bases in the clipped region to this quality

  • tags_to_invalidate – the set of extended attributes to remove upon clipping

Returns

a named tuple containing the number of query/read bases and the number

of target/reference bases clipped.

Return type

ClippingInfo

Classes for generating SAM and BAM files and records for testing

This module contains utility classes for the generation of SAM and BAM files and alignment records, for use in testing.

The module contains the following public classes:

  • SamBuilder – A builder class that allows the accumulation

    of alignment records and access as a list and writing to file.

class samwell.sam.sambuilder.SamBuilder(r1_len: Optional[int] = None, r2_len: Optional[int] = None, base_quality: int = 30, mapping_quality: int = 60, sd: Optional[List[Dict[str, Any]]] = None, rg: Optional[Dict[str, str]] = None, extra_header: Optional[Dict[str, Any]] = None, seed: int = 42, sort_order: Optional[samwell.sam.SamOrder] = SamOrder.Coordinate)

Builder for constructing one or more sam records (`AlignmentSegment`s in pysam terms).

Provides the ability to manufacture records from minimal arguments, while generating any remaining attributes to ensure a valid record.

A builder is constructed with a handful of defaults including lengths for generated R1s and R2s, the default base quality score to use, a sequence dictionary and a single read group.

Records are then added using the add_pair() method. Once accumulated the records can be accessed in the order in which they were created through the to_unsorted_list() function, or in a list sorted by coordinate order via to_sorted_list(). The latter creates a temporary file to do the sorting and is somewhat slower as a result.

Records can be further modified after being returned from add_pair(), to_unsorted_list(), or to_sorted_list() by directly accessing their attributes through the [AlignedSegment](https://pysam.readthedocs.io/en/latest/api.html#pysam.AlignedSegment) API.

Lastly, the records can be written to a temporary file using to_path().

add_pair(*, name: Optional[str] = None, bases1: Optional[str] = None, bases2: Optional[str] = None, quals1: Optional[List[int]] = None, quals2: Optional[List[int]] = None, chrom: str = '*', start1: int = - 1, start2: int = - 1, cigar1: Optional[str] = None, cigar2: Optional[str] = None, strand1: str = '+', strand2: str = '-', attrs: Optional[Dict[str, Any]] = None) Tuple[pysam.libcalignedsegment.AlignedSegment, pysam.libcalignedsegment.AlignedSegment]

Generates a new pair of reads, adds them to the internal collection, and returns them.

Most fields are optional.

An unmapped pair can be created by calling the method with no parameters (specifically, not setting chrom, start1 or start2). If either cigar is provided, it will be ignored.

A pair with only one of the two reads mapped is created by setting e.g. chrom and start1. The values will be automaticaly transferred to the unmapped mate, and flags set correctly.

A mapped pair is created by providing all three of chrom, start1 and start2.

For a given read (i.e. R1 or R2) the length of the read is determined based on the presence or absence of bases, quals, and cigar. If values are provided for one or more of these parameters, the lengths must match, and the length will be used to generate any unsupplied values. If none of bases, quals, and cigar are provided, all three will be synthesized based on either the r1_len or r2_len stored on the class as appropriate.

When synthesizing, bases are always a random sequence of bases, quals are all the default base quality (supplied when constructing a SamBuilder) and the cigar is always a single M operator of the read length.

Alignment attributes not exposed through the method parameters can be modified directly on the returned AlignedSegment objects. Modifications will be reflected when records are written to a temporary file with to_path().

Parameters
  • name – The name of the template. If None is given a unique name will be auto-generated.

  • bases1 – The bases for R1. If None is given a random sequence is generated.

  • bases2 – The bases for R2. If None is given a random sequence is generated.

  • quals1 – The list of int qualities for R1. If None, the default base quality is used.

  • quals2 – The list of int qualities for R2. If None, the default base quality is used.

  • chrom – The chromosome to which both reads are mapped. Defaults to the unmapped value.

  • start1 – The start position of R1. Defaults to the unmapped value.

  • start2 – The start position of R2. Defaults to the unmapped value.

  • cigar1 – The cigar string for R1. Defaults to None for unmapped reads, otherwise all M.

  • cigar2 – The cigar string for R2. Defaults to None for unmapped reads, otherwise all M.

  • strand1 – The strand for R1, either “+” or “-”. Defaults to “+”.

  • strand2 – The strand for R2, either “+” or “-”. Defaults to “-“.

  • attrs – An optional dictionary of SAM attribute to place on both R1 and R2.

Raises
  • ValueError – if either strand field is not “+” or “-”

  • ValueError – if bases/quals/cigar are set in a way that is not self-consistent

Returns

The pair of records created, R1 then R2.

Return type

Tuple[AlignedSegment, AlignedSegment]

static default_rg() Dict[str, str]

Returns the default read group used by the SamBuilder, as a dictionary.

static default_sd() List[Dict[str, Any]]

Generates the sequence dictionary that is used by default by SamBuilder.

Matches the names and lengths of the HG19 reference in use in production.

Returns

A new copy of the sequence dictionary as a list of dictionaries, one per chromosome.

property header: pysam.libcalignmentfile.AlignmentHeader

Returns a copy of the alignmentt header used by this builder

property rg: Dict[str, Any]

Returns the single read group that is defined in the header.

property rg_id: str

Returns the ID of the single read group that is defined in the header.

to_path(path: Optional[pathlib.Path] = None, index: bool = True, pred: Callable[pysam.libcalignedsegment.AlignedSegment, bool] = <function SamBuilder.<lambda>>) pathlib.Path

Write the accumulated records to a file, sorts & indexes it, and returns the Path. If a path is provided, it will be written to, otherwise a temporary file is created and returned.

Parameters
  • path – a path at which to write the file, otherwise a temp file is used.

  • index – if True and sort_order is Coordinate index is generated, otherwise not.

  • pred – optional predicate to specify which reads should be output

Returns

The path to the sorted (and possibly indexed) file.

Return type

Path

to_sorted_list() List[pysam.libcalignedsegment.AlignedSegment]

Returns the accumulated records in coordinate order.

to_unsorted_list() List[pysam.libcalignedsegment.AlignedSegment]

Returns the accumulated records in the order they were created.