Note: We now have an API which can also perform many of these functions.
As the number of bioinformaticians has grown since the inception of the UCSC Genome Browser in 2000, there has been an increased need for programmatic access to the data and tools hosted at UCSC. Although there is no true API developed by UCSC (yet), there are a number of ways to interface with the UCSC Genome Browser, some more efficient than others. The intention of this blog post series is to explain some of the preferred ways to access the commonly requested Genome Browser data and tools, and to add a bit of explanation of the architecture of the UCSC Genome Browser in general. The three most common requests are 1) how to download a single stretch of sequence in FASTA format, 2) how to download multiple ranges of sequence, and 3) how to get basic statistics on the nucleotides in a sequence. If you want the in-depth examples and explanation, skip down, but if you’re crunched for time, all you really need to know is the following three Q&As:
Q: How do I extract some sequence?
A: The best choice is to use the twoBitToFa command, available for your system here (Windows 10 users can use the linux.x86_64/ binaries in the Windows Subsystem for Linux). Here’s an example:
$ twoBitToFa http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.2bit:chr1:100100-100200 stdout
>chr1:100100-100200
gcctagtacagactctccctgcagatgaaattatatgggatgctaaatta
taatgagaacaatgtttggtgagccaaaactacaacaagggaagctaatt
Q: What if I have a list of coordinates?
A: Again use twoBitToFa, this time with the -bed option (also check out the post on coordinate systems):
$ cat input.bed
chr1 4150100 4150200 seq1
chr1 4150300 4150400 seq2
$ twoBitToFa http://hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips/mm10.2bit -bed=input.bed stdout
>seq1
gcatcccagtcctgatactggaaaattcatttagtgacaagcgagggcca
cttgggattctctcacccccatatttaggagaccttattagggtcacctt
>seq2
tatccccttccctccccaccagatactacaattcacatcatactctgtcc
cccagtctacccataaaatctattctatttacctctccaaacgaagatct
Q: How do I count A, C, G, T?
A: twoBitToFa followed by faCount (available from the same location as twoBitToFa):
$ twoBitToFa http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.2bit:chr1:100100-100200 stdout | faCount stdin
#seq                len  A   C   G   T   N  cpg
chr1:100100-100200  100  37  17  21  25  0  0
total               100  37  17  21  25  0  0
Run twoBitToFa or faCount with no arguments to get a usage message and view all of their options:
$ faCount
faCount - count base statistics and CpGs in FA files.
...
The most efficient way to get sequence from the UCSC Genome Browser
The most common data request we receive is a request for FASTA sequence or sequences, making it a fitting subject for part 1 of this blog series about programmatic access to the Genome Browser. If you are browsing a region in the genome browser and you want to get a FASTA sequence for just the region you are browsing, using the keyboard shortcut ‘vd’ (v then d for view DNA) is probably the easiest way. But what about when you want to get sequences for a list of regions? What about if you need your web application to download the sequence? You could download sequence interactively with the Table Browser, although the solution is somewhat cumbersome: first you must make a custom track of the region(s) you would like sequence for, and then use the “output format: sequence” option with your custom track selected as the primary track. Fortunately, there is a much easier approach – downloading the 2bit file for your organism of interest and then using the twoBitToFa command on it like so:
$ wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.2bit
$ twoBitToFa hg38.2bit:chr1:100100-100200 stdout
>chr1:100100-100200
gcctagtacagactctccctgcagatgaaattatatgggatgctaaatta
taatgagaacaatgtttggtgagccaaaactacaacaagggaagctaatt
The twoBitToFa command is available from the list of public utilities, in the directory appropriate to your operating system. twoBitToFa even accepts a URL to our downloads server as the 2bit argument, so if you wanted to grab some mm10 sequence, or even a list of sequences, you can just query the downloads server directly like so:
$ cat input.bed
chr1 4150100 4150200 seq1
chr1 4150300 4150400 seq2
$ twoBitToFa http://hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips/mm10.2bit -bed=input.bed stdout
>seq1
gcatcccagtcctgatactggaaaattcatttagtgacaagcgagggcca
cttgggattctctcacccccatatttaggagaccttattagggtcacctt
>seq2
tatccccttccctccccaccagatactacaattcacatcatactctgtcc
cccagtctacccataaaatctattctatttacctctccaaacgaagatct
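The ranges in input.bed above use BED coordinates (0-based start, exclusive end), whereas a position copied from the Genome Browser position box is 1-based and inclusive, so the start has to be lowered by one when building a BED file by hand. The following is a minimal sketch of that conversion, assuming positions arrive as chrom:start-end text, one per line. The function name pos_to_bed and the seqN labels are hypothetical and are not part of any UCSC tool, and chromosome names containing ":" or "-" are not handled.

```shell
# pos_to_bed: read 1-based, inclusive positions such as "chr1:4,150,101-4,150,200"
# from stdin and write tab-separated BED lines (0-based start, exclusive end).
# Hypothetical helper for illustration only.
pos_to_bed() {
  awk '
    NF == 0 { next }                     # ignore blank lines
    {
      gsub(/,/, "")                      # drop thousands separators
      n = split($1, part, /[:-]/)        # part[1]=chrom, part[2]=start, part[3]=end
      if (n != 3) {
        printf "skipping malformed position: %s\n", $0 > "/dev/stderr"
        next
      }
      # Only the start shifts; the inclusive 1-based end equals the exclusive 0-based end.
      printf "%s\t%d\t%d\tseq%d\n", part[1], part[2] - 1, part[3], ++count
    }
  '
}

# Demo: these two browser-style positions correspond to the two ranges in input.bed.
printf '%s\n' 'chr1:4,150,101-4,150,200' 'chr1:4150301-4150400' | pos_to_bed
```

Redirecting the output to a file gives something that can be passed to twoBitToFa with -bed= as in the command above.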
Note that “stdout” in the above commands is a special argument (along with the corresponding “stdin”) that tells most UCSC commands to write to /dev/stdout or read from /dev/stdin in place of the otherwise required file name. This makes it easy to chain commands, as in the following common way of generating some quick statistics on a region like chr1:100100-100200:
$ twoBitToFa http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.2bit:chr1:100100-100200 stdout | faCount stdin
#seq                len  A   C   G   T   N  cpg
chr1:100100-100200  100  37  17  21  25  0  0
total               100  37  17  21  25  0  0
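As a sanity check on the totals above, the same tally can be approximated with plain awk. This is only a sketch for cross-checking, not a replacement for faCount: it reports a single total line, does not count CpGs, and the function name count_bases is hypothetical. It assumes, as the faCount output above suggests, that lowercase (soft-masked) bases should be counted along with uppercase ones.

```shell
# count_bases: tally A, C, G, T and N over all sequence lines of a FASTA stream on stdin.
# Hypothetical helper; prints one total line and no per-record rows or CpG counts.
count_bases() {
  awk '
    /^>/ { next }                        # skip FASTA header lines
    {
      seq = toupper($0)                  # count lowercase bases as well
      len += length(seq)
      a += gsub(/A/, "", seq)            # gsub returns the number of matches it replaced
      c += gsub(/C/, "", seq)
      g += gsub(/G/, "", seq)
      t += gsub(/T/, "", seq)
      n += gsub(/N/, "", seq)
    }
    END { printf "len=%d A=%d C=%d G=%d T=%d N=%d\n", len, a, c, g, t, n }
  '
}

# Demo on the chr1:100100-100200 sequence shown earlier in this post.
count_bases <<'EOF'
>chr1:100100-100200
gcctagtacagactctccctgcagatgaaattatatgggatgctaaatta
taatgagaacaatgtttggtgagccaaaactacaacaagggaagctaatt
EOF
# prints: len=100 A=37 C=17 G=21 T=25 N=0
```

The function reads standard input, so it can sit at the end of a twoBitToFa pipeline in the same position as faCount stdin.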
The twoBitToFa and URL to hgdownload 2bit combo is important because our downloads server is significantly more robust than our DAS CGI, can support more requests, and won’t slow the main site down for other users. We’ve also noticed that our DAS server often receives many requests for the same sequence, so for those of you providing software where the same query will be made multiple times, consider whether it would be more efficient to download an entire 2bit file to your local disk, rather than send the same query thousands of times to our servers.
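For software that queries the same assembly repeatedly, one way to follow this advice is to download the 2bit file once and reuse the local copy afterwards. The sketch below is a minimal illustration under stated assumptions: the function name ensure_2bit and the cache directory are hypothetical, and the URL pattern is inferred from the hg38 and mm10 links used in this post, so it may not hold for every assembly.

```shell
# ensure_2bit: print the path of a local <db>.2bit, downloading it only if it is missing.
# Hypothetical helper; the URL pattern is an assumption based on the hg38/mm10 examples.
ensure_2bit() {
  local db="$1"
  local dir="${2:-.}"
  local file="$dir/$db.2bit"
  if [ ! -s "$file" ]; then
    # Download to a temporary name so an interrupted transfer is never mistaken
    # for a complete file on the next call.
    mkdir -p "$dir" &&
      wget -q -O "$file.part" "http://hgdownload.soe.ucsc.edu/goldenPath/$db/bigZips/$db.2bit" &&
      mv "$file.part" "$file" || { rm -f "$file.part"; return 1; }
  fi
  printf '%s\n' "$file"
}

# Example use (not run here because it would trigger a large download):
#   twoBitToFa "$(ensure_2bit hg38 "$HOME/genomes")":chr1:100100-100200 stdout
```

Only the first call for a given assembly contacts the downloads server; later calls find the file on disk and return its path immediately.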
Summary
twoBitToFa and faCount are two of the hundreds of utilities available from UCSC, and together they cover the most common sequence extraction requests. Although working with locally downloaded files is preferable, twoBitToFa can also work with URLs to 2bit files, such as those on the UCSC Genome Browser download site. Stay tuned for part 2 of this programmatic access series: Using the Genome Browser public MySQL server and gbdb.
If after reading this blog post you have any public questions, please email genome@soe.ucsc.edu. All messages sent to that address are archived on a publicly accessible forum. If your question includes sensitive data, you may send it instead to genome-www@soe.ucsc.edu.