{"id":601,"date":"2018-06-27T14:14:59","date_gmt":"2018-06-27T21:14:59","guid":{"rendered":"http:\/\/genome.ucsc.edu\/blog\/?p=601"},"modified":"2021-10-30T22:14:28","modified_gmt":"2021-10-30T22:14:28","slug":"accessing-the-genome-browser-programmatically-part-1-how-to-get-sequence-from-the-ucsc-genome-browser","status":"publish","type":"post","link":"https:\/\/genome-blog.gi.ucsc.edu\/blog\/2018\/06\/27\/accessing-the-genome-browser-programmatically-part-1-how-to-get-sequence-from-the-ucsc-genome-browser\/","title":{"rendered":"Accessing the Genome Browser Programmatically Part 1 &#8211; How to get sequence from the UCSC Genome Browser"},"content":{"rendered":"<p><strong>Note:&nbsp;<\/strong>We now have an <a href=\"http:\/\/genome.ucsc.edu\/goldenPath\/help\/api.html\">API<\/a> which can also perform many of these functions.<\/p>\n<p>As the number of bioinformaticians has&nbsp;grown since the inception of the UCSC Genome Browser in 2000, there has been an increased need for programmatic access to the data and tools hosted at UCSC. Although there is no true API developed by UCSC (yet), there are a number of ways to interface with the UCSC Genome Browser, some more efficient than others. This blog post series explains some of the preferred ways to access commonly requested Genome Browser data and tools, and adds&nbsp;a bit of explanation of the architecture of the UCSC Genome Browser in general. The three most common requests are 1) how to download a single stretch of&nbsp;sequence in FASTA format, 2) how to download multiple ranges of sequence, and 3) how to get basic statistics on the nucleotides in a sequence. 
If you want the in-depth examples and explanation, <a href=\"#explanation\">skip down<\/a>, but if you&#8217;re crunched for time, all you really need to know is the following three Q&amp;As:<\/p>\n<p>Q: How do I extract some sequence?<br \/>\nA: The best choice is to use the <strong>twoBitToFa<\/strong> command, available for your system <a href=\"http:\/\/hgdownload.soe.ucsc.edu\/admin\/exe\">here<\/a> (Windows 10 users can use the linux.x86_64\/ binaries in the <a href=\"https:\/\/docs.microsoft.com\/en-us\/windows\/wsl\/install-win10\">Windows Subsystem for Linux<\/a>). Here&#8217;s an example:<\/p>\n<pre>$ twoBitToFa http:\/\/hgdownload.soe.ucsc.edu\/goldenPath\/hg38\/bigZips\/hg38.2bit:chr1:100100-100200 stdout\n&gt;chr1:100100-100200\ngcctagtacagactctccctgcagatgaaattatatgggatgctaaatta\ntaatgagaacaatgtttggtgagccaaaactacaacaagggaagctaatt\n<\/pre>\n<p>Q: What if I have a list of coordinates?<br \/>\nA: Again use <strong>twoBitToFa<\/strong>, this time with the <strong>-bed<\/strong> option (also check out the post on <a href=\".\/the-ucsc-genome-browser-coordinate-counting-systems\">coordinate systems<\/a>):<\/p>\n<pre>$ cat input.bed\nchr1 4150100 4150200 seq1\nchr1 4150300 4150400 seq2\n$ twoBitToFa http:\/\/hgdownload.soe.ucsc.edu\/goldenPath\/mm10\/bigZips\/mm10.2bit -bed=input.bed stdout\n&gt;seq1\ngcatcccagtcctgatactggaaaattcatttagtgacaagcgagggcca\ncttgggattctctcacccccatatttaggagaccttattagggtcacctt\n&gt;seq2\ntatccccttccctccccaccagatactacaattcacatcatactctgtcc\ncccagtctacccataaaatctattctatttacctctccaaacgaagatct\n<\/pre>\n<p>Q: How do I count A, C, G, T?<br \/>\nA: <strong>twoBitToFa<\/strong> followed by <strong>faCount<\/strong> (available from the same location as twoBitToFa):<\/p>\n<pre>$ twoBitToFa http:\/\/hgdownload.soe.ucsc.edu\/goldenPath\/hg38\/bigZips\/hg38.2bit:chr1:100100-100200 stdout | faCount stdin\n#seq    len     A       C       G       T       N       cpg\nchr1:100100-100200      100     37      17      21      25      0       0\ntotal   
100     37      17      21      25      0       0\n<\/pre>\n<p>Run <code>twoBitToFa<\/code> or <code>faCount<\/code> with no arguments to get a usage message and view all of their options:<\/p>\n<pre>$ faCount\nfaCount - count base statistics and CpGs in FA files.\n...\n<\/pre>\n<p><a name=\"explanation\"><\/a><br \/>\n<strong>The most efficient way to get sequence from UCSC Genome Browser<\/strong><\/p>\n<p>The most common data request we receive is a request for FASTA sequence or sequences, making it a fitting subject for part 1 of this blog series about programmatic access to the Genome Browser. If you are browsing a region in the genome browser and you want to get a FASTA sequence for just the region you are browsing, using the keyboard shortcut &#8216;vd&#8217; (v then d for view DNA) is probably the easiest way. But what about when you want to get sequences for a list of regions? What about if you need your web application to download the sequence? You <em>could<\/em> download sequence interactively with the Table Browser, although the solution is somewhat cumbersome: first you must make a custom track of the region(s) you would like sequence for, and then use the &#8220;output format: sequence&#8221; option with your custom track selected as the primary track. Fortunately, there is a much easier approach &#8211; downloading the 2bit file for your organism of interest and then using the <a href=\"http:\/\/hgdownload.soe.ucsc.edu\/admin\/exe\">twoBitToFa<\/a> command on it like so:<\/p>\n<pre>$ wget http:\/\/hgdownload.soe.ucsc.edu\/goldenPath\/hg38\/bigZips\/hg38.2bit\n$ twoBitToFa hg38.2bit:chr1:100100-100200 stdout\n&gt;chr1:100100-100200\ngcctagtacagactctccctgcagatgaaattatatgggatgctaaatta\ntaatgagaacaatgtttggtgagccaaaactacaacaagggaagctaatt\n<\/pre>\n<p>The twoBitToFa command is available from the list of public <a href=\"http:\/\/hgdownload.soe.ucsc.edu\/admin\/exe\">utilities<\/a>, in the directory appropriate to your operating system. 
twoBitToFa even accepts a URL to our downloads server as the 2bit argument, so if you want to grab some mm10 sequence, or even a list of sequences, you can just query the downloads server directly like so:<\/p>\n<pre>$ cat input.bed\nchr1 4150100 4150200 seq1\nchr1 4150300 4150400 seq2\n$ twoBitToFa http:\/\/hgdownload.soe.ucsc.edu\/goldenPath\/mm10\/bigZips\/mm10.2bit -bed=input.bed stdout\n&gt;seq1\ngcatcccagtcctgatactggaaaattcatttagtgacaagcgagggcca\ncttgggattctctcacccccatatttaggagaccttattagggtcacctt\n&gt;seq2\ntatccccttccctccccaccagatactacaattcacatcatactctgtcc\ncccagtctacccataaaatctattctatttacctctccaaacgaagatct\n<\/pre>\n<p>Note that \u201cstdout\u201d in the above commands is a special argument (along with the corresponding \u201cstdin\u201d) that most UCSC commands accept in place of a required filename, telling them to write to <code>\/dev\/stdout<\/code> or read from <code>\/dev\/stdin<\/code>. A common use is piping sequence into another tool to generate some quick statistics on a region like chr1:100100-100200:<\/p>\n<pre>$ twoBitToFa http:\/\/hgdownload.soe.ucsc.edu\/goldenPath\/hg38\/bigZips\/hg38.2bit:chr1:100100-100200 stdout | faCount stdin\n#seq    len     A       C       G       T       N       cpg\nchr1:100100-100200      100     37      17      21      25      0       0\ntotal   100     37      17      21      25      0       0\n<\/pre>\n<p>Combining twoBitToFa with a URL to a 2bit file on hgdownload is important because our downloads server is significantly more robust than our DAS CGI, can support more requests, and won\u2019t slow the main site down for other users. 
We\u2019ve also noticed that our DAS server often receives many requests for the same sequence, so if you provide software that makes the same query many times, consider whether it would be more efficient to download an entire 2bit file to your local disk rather than send the same query thousands of times to our servers.<\/p>\n<p><strong>Summary<\/strong><br \/>\n<strong>twoBitToFa<\/strong> and <strong>faCount<\/strong> are two of the hundreds of utilities available, and both are useful for extracting and summarizing sequence data. Although working with locally downloaded files is preferable, twoBitToFa can also work with URLs to 2bit files, such as those on the UCSC Genome Browser <a href=\"http:\/\/hgdownload.soe.ucsc.edu\/\">download site<\/a>. Stay tuned for part 2 of this programmatic access series &#8212; Using the Genome Browser public MySQL server and gbdb.<\/p>\n<hr>\n<p>If you have any public questions after reading this blog post, please email <a href=\"mailto:genome@soe.ucsc.edu\" target=\"_blank\" rel=\"noopener\">genome@soe.ucsc.edu<\/a>. All messages sent to that address are archived on a <a href=\"https:\/\/groups.google.com\/a\/soe.ucsc.edu\/forum\/#!forum\/genome\">publicly accessible forum<\/a>. If your question includes sensitive data, you may send it instead to&nbsp;<a href=\"mailto:genome-www@soe.ucsc.edu\" target=\"_blank\" rel=\"noopener\">genome-www@soe.ucsc.edu<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Note:&nbsp;We now have an API which can also perform many of these functions. As the number of bioinformaticians has&nbsp;grown since the inception of the UCSC Genome Browser in 2000, there has been an increased need for programmatic access to the data and tools hosted at UCSC. 
Although there is no true API developed by UCSC [&hellip;]<\/p>\n","protected":false},"author":15,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[6,15,16,5],"class_list":["post-601","post","type-post","status-publish","format-standard","hentry","category-uncategorized","tag-browser","tag-downloads","tag-fasta","tag-genome"],"jetpack_featured_media_url":"","_links":{"self":[{"href":"https:\/\/genome-blog.gi.ucsc.edu\/blog\/wp-json\/wp\/v2\/posts\/601","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/genome-blog.gi.ucsc.edu\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/genome-blog.gi.ucsc.edu\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/genome-blog.gi.ucsc.edu\/blog\/wp-json\/wp\/v2\/users\/15"}],"replies":[{"embeddable":true,"href":"https:\/\/genome-blog.gi.ucsc.edu\/blog\/wp-json\/wp\/v2\/comments?post=601"}],"version-history":[{"count":30,"href":"https:\/\/genome-blog.gi.ucsc.edu\/blog\/wp-json\/wp\/v2\/posts\/601\/revisions"}],"predecessor-version":[{"id":941,"href":"https:\/\/genome-blog.gi.ucsc.edu\/blog\/wp-json\/wp\/v2\/posts\/601\/revisions\/941"}],"wp:attachment":[{"href":"https:\/\/genome-blog.gi.ucsc.edu\/blog\/wp-json\/wp\/v2\/media?parent=601"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/genome-blog.gi.ucsc.edu\/blog\/wp-json\/wp\/v2\/categories?post=601"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/genome-blog.gi.ucsc.edu\/blog\/wp-json\/wp\/v2\/tags?post=601"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}