Find homopolymeric tracts in a FASTA genome

Assuming a standard single-record FASTA file, this Bash one-liner finds homopolymeric tracts (HTs: stretches of the genome where a single nucleotide is repeated many times, e.g. AAAAA or TTTTTTT) and outputs their coordinates. Such regions are prone to sequencing errors, but they are also mutational hotspots, as they are susceptible to slippage errors during replication and transcription. Some evidence suggests that HTs may have a regulatory role in prokaryotes.

tail -n+2 GENOME.fa | tr -d '\n' | grep -ob -E "(\w)\1{4,}" | sed 's/:/\t/g' | awk '{print $1+1"\t"$1+length($2)"\t"substr($2,1,1)"\t"length($2); }' | sort -k1n

# strip the FASTA header (first line only, so this assumes a single record)
tail -n+2 GENOME.fa
# remove newlines
tr -d '\n'
# grep >4 (i.e. 5 or more) of the same character, output the match and its 0-based byte offset
grep -ob -E "(\w)\1{4,}"
# replace the ":" added by grep with a tab
sed 's/:/\t/g'
# print the 1-based start and end of the HT, the nucleotide (ACGT) and the length of the tract
# (substr is 1-indexed in awk; a start of 0 returns an empty string in some implementations)
awk '{print $1+1"\t"$1+length($2)"\t"substr($2,1,1)"\t"length($2); }'
# sort numerically by start position in the genome
sort -k1n
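
To make the offset arithmetic concrete, here is a small sketch that runs the same stages on a made-up record. The toy sequence and the function names (toy_fasta, toy_offsets, toy_table) are invented for illustration, and the commands assume GNU grep and GNU sed, as the one-liner does.

```shell
# Toy two-line FASTA record (hypothetical data); the sequence is
# ACGTAAAAACCCCCCG once the newlines are removed.
toy_fasta() { printf '>toy\nACGTAAAAAC\nCCCCCG\n'; }

# Stage 1: what grep -ob reports, i.e. 0-based byte offset, a colon, then the tract.
toy_offsets() {
    toy_fasta | tail -n+2 | tr -d '\n' | grep -ob -E "(\w)\1{4,}"
}

# Stage 2: convert each "offset:tract" line to 1-based start, end, nucleotide, length.
toy_table() {
    toy_offsets | sed 's/:/\t/g' |
        awk '{print $1+1"\t"$1+length($2)"\t"substr($2,1,1)"\t"length($2); }'
}
```

With GNU tools, toy_offsets prints 4:AAAAA and 9:CCCCCC, and toy_table turns these into the rows 5 9 A 5 and 10 15 C 6 (tab-separated). The second tract spans the original line break, which is why the newlines have to be removed before grep runs.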

Example output (Pseudomonas fluorescens Pf0-1 NC_007492.2); the columns are start, end, nucleotide and tract length:
35 39 C 5
157 162 A 6
374 378 C 5
440 444 T 5
529 533 T 5
1432 1436 T 5
3304 3308 C 5
3310 3315 C 6
3626 3630 G 5
4063 4067 G 5
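
The one-liner drops only the first line, so it expects a single-record FASTA file, and \w also matches runs of N or other non-ACGT characters. Below is a rough sketch of a variant for multi-record files, written in awk. The function name find_hts, the optional minimum-length argument and the extra leading column with the record name are assumptions added here, not part of the original. Unlike the one-liner, it treats lowercase (soft-masked) bases as their uppercase equivalents.

```shell
# Sketch: report homopolymeric tracts of A, C, G or T for every FASTA record.
# Usage (hypothetical): find_hts GENOME.fa [min_length]   (min_length defaults to 5)
# Output columns: record name, start, end, nucleotide, tract length (1-based, tab-separated).
find_hts() {
    awk -v min="${2:-5}" '
        # Walk the sequence once, tracking the current run of identical characters.
        function scan(id, s,    n, i, c, prev, start, run) {
            n = length(s); prev = ""; run = 0; start = 0
            for (i = 1; i <= n + 1; i++) {
                # Position n + 1 is a sentinel that flushes the last run.
                c = (i <= n) ? toupper(substr(s, i, 1)) : ""
                if (c == prev) { run++; continue }
                if (run >= min && prev ~ /^[ACGT]$/)
                    printf "%s\t%d\t%d\t%s\t%d\n", id, start, i - 1, prev, run
                prev = c; start = i; run = 1
            }
        }
        # Header line: flush the previous record, then start a new one.
        /^>/ { if (seen) scan(name, seq)
               name = substr($1, 2); seq = ""; seen = 1; next }
        # Sequence line: append, so tracts spanning line breaks stay intact.
        { seq = seq $0 }
        END { if (seen) scan(name, seq) }
    ' "$1"
}
```

Positions are counted per record, so the start and end columns match the one-liner's output whenever the file holds a single sequence.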


I wrote a quick Perl script to visualize SNPs in a gene, using sequencing data from experimental evolution. It is useful for making figures when one gene is hit by mutations in multiple lineages. It outputs an SVG file showing the reference sequence and the changes.

For the moment, it only visualizes substitutions, not insertions/deletions or anything more exotic.  More to come.

Example: SNPs found in Pseudomonas aeruginosa gene PA2449 (converted to PNG)


(Thanks to Sofia Robb for teaching Perl as part of Programming for Evolutionary Biology!)

Download SNPsvg here.


Turdus merula

The Common Blackbird (Turdus merula) is a species of true thrush. It is also called Eurasian Blackbird (especially in North America, to distinguish it from the unrelated New World blackbirds), or simply Blackbird, where this does not lead to confusion with a similar-looking local species. It breeds in Europe, Asia, and North Africa, and has been introduced to Australia (where it is considered a pest) and New Zealand. It has a number of subspecies across its large range; a few of the Asian subspecies are sometimes considered to be full species. Depending on latitude, the Common Blackbird may be resident, partially migratory or fully migratory.

More from Wikipedia

Regex to change bracket citations into bibtex keys

Useful for converting in-text citations (e.g. from Word) into citation commands in a LaTeX document. It converts a parenthetical citation into a key made of the first 3 chars of the first author’s last name (or 2 chars, if the name is only 2 chars long) plus the two-digit year: e.g. (Bobby 2009) becomes \citep{Bob09}. See also my post on how to insert bibtex references into Word.

Search for:

\(([A-Za-z]{2,3})[A-Za-z -.]* (18|19|20)([0-9a-z]{2,3})\)

Replace with:


Test citations:

(Bobby 2009)
(Bobby and Jonny 1909)
(Bobby et al. 1923)

(Maynard Smith 1989)
(Maynard Smith and Haigh 1974)
(Maynard Smith and Bobby-Jonny 1974)
(Maynard Smith and Maynard Smith 1974)
(Maynard Smith et al. 1993)

(Maisnier-Patin 1900)
(Maisnier-Patin and Bobby-Jonny 1956)
(Maisnier-Patin et al. 2002)

(Aa 1932)
(Aaa 1932)
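
For batch conversion outside an editor, the same search pattern can be applied with sed. This is only a sketch: the function name cite_to_bibtex is invented here, and the replacement string \citep{\1\3} (first and third capture groups) is an assumption inferred from the (Bobby 2009) to \citep{Bob09} example above. LC_ALL=C is set so that the " -." range in the character class is interpreted by byte value.

```shell
# Sketch: rewrite (Author ... YYYY) citations on stdin as \citep{KeyYY} on stdout.
# Group 1 = first 2-3 letters of the first author, group 3 = last two digits of the
# year plus an optional suffix letter. The replacement string is an assumption.
cite_to_bibtex() {
    LC_ALL=C sed -E 's/\(([A-Za-z]{2,3})[A-Za-z -.]* (18|19|20)([0-9a-z]{2,3})\)/\\citep{\1\3}/g'
}
```

For example, piping the test citations through cite_to_bibtex turns (Maynard Smith and Haigh 1974) into \citep{May74} and (Aa 1932) into \citep{Aa32}. Because the century is dropped, (Bobby 2009) and (Bobby and Jonny 1909) both map to \citep{Bob09}, so keys generated this way may need checking for collisions.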