{"id":632,"date":"2018-07-20T09:32:10","date_gmt":"2018-07-20T16:32:10","guid":{"rendered":"http:\/\/genome.ucsc.edu\/blog\/?p=632"},"modified":"2022-05-19T20:38:02","modified_gmt":"2022-05-19T20:38:02","slug":"accessing-the-genome-browser-programmatically-part-2-using-the-public-mysql-server-and-gbdb-system","status":"publish","type":"post","link":"https:\/\/genome-blog.gi.ucsc.edu\/blog\/2018\/07\/20\/accessing-the-genome-browser-programmatically-part-2-using-the-public-mysql-server-and-gbdb-system\/","title":{"rendered":"Accessing the Genome Browser Programmatically Part 2 &#8211; Using the Public MySQL Server and gbdb System"},"content":{"rendered":"<p>If you missed part 1 about obtaining sequence data, you can catch up <a href=\"http:\/\/genome.ucsc.edu\/blog\/accessing-the-genome-browser-programmatically-part-1-how-to-get-sequence-from-the-ucsc-genome-browser\/\">here<\/a>. <strong>Note:\u00a0<\/strong>We now have an <a href=\"http:\/\/genome.ucsc.edu\/goldenPath\/help\/api.html\">API<\/a> which can also perform many of these functions.<\/p>\n<p>The UCSC Genome Browser is a large repository of data from multiple sources, and if you want to query that annotation data, the easiest way to get started is via the <a href=\"..\/cgi-bin\/hgTables\">Table Browser<\/a>. Choose the assembly and track of interest and click the &#8220;describe table schema&#8221; button, which will show the MySQL database name, the primary table name, the fields of the table and their descriptions. If the track is stored not in MySQL but as a binary file (like bigBed or bigWig) in <code>\/gbdb<\/code>, it will show a file name, e.g. <code>\"Big Bed File: \/gbdb\/dm6\/ncbiRefSeq\/ncbiRefSeqOther.bb\"<\/code>. If this is the case, skip directly to the <a href=\"#gbdb\">Accessing the gbdb directory system<\/a> section below. Otherwise, the track data is either a single MySQL table or a set of related tables, which you can either download as gzipped text files from the &#8220;Annotation Database&#8221; section on our downloads page (<a href=\"http:\/\/hgdownload.soe.ucsc.edu\/goldenPath\/hg19\/database\/\">here&#8217;s the GRCh37\/hg19 listing<\/a>) and work on them locally, or use the public MySQL server and issue MySQL queries remotely. Generally speaking, the format for most of our tables is similar to the formats described <a href=\"..\/FAQ\/FAQformat.html\">here<\/a>, e.g., in bed (&#8220;chrom chromStart chromEnd&#8221;) format, and we do not store any sequence or contigs in our databases, which means you&#8217;ll need to use the instructions in <a href=\"http:\/\/genome.ucsc.edu\/blog\/accessing-the-genome-browser-programmatically-part-1-how-to-get-sequence-from-the-ucsc-genome-browser\/\">Part 1<\/a> of this blog series in order to get any raw sequence data.<\/p>\n<p><b>Accessing the public MySQL server<\/b><br \/>\nThe best way to showcase the public MySQL server is to show some examples &#8212; here are a few to get you started:<br \/>\n1. If you want to download some transcripts from the new <a href=\"http:\/\/genome.ucsc.edu\/blog\/the-new-ncbi-refseq-tracks-and-you\/\">NCBI RefSeq Genes track<\/a>, you can use the following command:<\/p>\n<pre>$ mysql -h genome-mysql.soe.ucsc.edu -ugenome -A -e \"select * from ncbiRefSeq limit 2\" hg38\n+-----+-------------+-------+--------+---------+-------+----------+--------+-----------+--------------------------------------------------------------------+--------------------------------------------------------------------+-------+---------+--------------+------------+-----------------------------------+\n| bin | name        | chrom | strand | txStart | txEnd | cdsStart | cdsEnd | exonCount | exonStarts                                                         | exonEnds                                                           | score | name2   | cdsStartStat | cdsEndStat | exonFrames                        |\n+-----+-------------+-------+--------+---------+-------+----------+--------+-----------+--------------------------------------------------------------------+--------------------------------------------------------------------+-------+---------+--------------+------------+-----------------------------------+\n| 585 | NR_046018.2 | chr1  | +      |   11873 | 14409 |    14409 |  14409 |         3 | 11873,12612,13220,                                                 | 12227,12721,14409,                                                 |     0 | DDX11L1 | none         | none       | -1,-1,-1,                         |\n| 585 | NR_024540.1 | chr1  | -      |   14361 | 29370 |    29370 |  29370 |        11 | 14361,14969,15795,16606,16857,17232,17605,17914,18267,24737,29320, | 14829,15038,15947,16765,17055,17368,17742,18061,18366,24891,29370, |     0 | WASH7P  | none         | none       | -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1, |\n+-----+-------------+-------+--------+---------+-------+----------+--------+-----------+--------------------------------------------------------------------+--------------------------------------------------------------------+-------+---------+--------------+------------+-----------------------------------+\n<\/pre>\n<p>2. If you are interested in a particular enhancer region, for instance \u201cchr1:166,167,154-166,167,602\u201d, and want to find the nearest genes within a 10kb range, then the following query will do the job:<\/p>\n<pre>$ chrom=\"chr1\"\n$ chromStart=\"166167154\"\n$ chromEnd=\"166167602\"\n$ mysql -h genome-mysql.soe.ucsc.edu -ugenome -A -e \"select \\\n   e.chrom, e.txStart, e.txEnd, e.strand, e.name, j.name as geneSymbol from ncbiRefSeqCurated e,\\\n   ncbiRefSeqLink j where e.name = j.id AND e.chrom='${chrom}' AND \\\n      ((e.txStart &gt;= ${chromStart} - 10000 AND e.txStart &lt;= ${chromEnd} + 10000) OR \\ (e.txEnd &gt;= ${chromStart} - 10000 AND e.txEnd &lt;= ${chromEnd} + 10000)) \\\norder by e.txEnd desc \" hg38\n+-------+-----------+-----------+--------+----------------+------------+\n| chrom | txStart   | txEnd     | strand | name           | geneSymbol |\n+-------+-----------+-----------+--------+----------------+------------+\n| chr1  | 166055917 | 166166755 | -      | NR_135199.1    | FAM78B     |\n| chr1  | 166055917 | 166166755 | -      | NM_001320302.1 | FAM78B     |\n| chr1  | 166069298 | 166166755 | -      | NM_001017961.4 | FAM78B     |\n+-------+-----------+-----------+--------+----------------+------------+\n<\/pre>\n<p>3. If you need to get gene names and their lengths for RNA-seq read normalization, you can use the following query:<\/p>\n<pre>$ mysql -h genome-mysql.soe.ucsc.edu -u genome -A -e \u201c \\\n  select l.name, kr.value, psl.qEnd - psl.qStart as length \\\n  from   refGene r, hgFixed.refLink l, knownToRefSeq kr, knownCanonical kc, refSeqAli psl \\\n  where  r.name = l.mrnaAcc and r.name = kr.value and kr.name = kc.transcript \\\n         and r.name = psl.qName group by kr.value limit 3\u201d hg38\n+-------+-----------+--------+\n| name  | value     | length |\n+-------+-----------+--------+\n| A2M   | NM_000014 |   4920 |\n| NAT2  | NM_000015 |   1317 |\n| ACADM | NM_000016 |   2622 |\n+-------+-----------+--------+\n<\/pre>\n<p>In addition to our download site and public MySQL server hosted here in California, we have also recently added support for a download site (<a href=\"http:\/\/hgdownload-euro.soe.ucsc.edu\">http:\/\/hgdownload-euro.soe.ucsc.edu<\/a>) and public MySQL server (genome-euro-mysql.soe.ucsc.edu) hosted in Europe, which will speed up downloads for many of our users.<\/p>\n<p>Please follow the <a href=\"http:\/\/genome.ucsc.edu\/conditions.html\">Conditions for Use<\/a> when querying the public MySQL servers.<\/p>\n<p>Many of the command line utilities available on our <a href=\"http:\/\/hgdownload.soe.ucsc.edu\/admin\/exe\">utilities downloads server<\/a> are also able to interact with our databases or download files, like <code>mafFetch<\/code> (as long as your <code>~\/.hg.conf<\/code> file is present as discussed below):<\/p>\n<pre>$ mafFetch xenTro9 multiz11way region.bed stdout\n##maf version=1\n##maf version=1 scoring=blastz\na score=0.000000\ns xenTro9.chr9     15946024 497 +  80437102 ACTAT...\ne galGal5.chr14     1678315   0 -  15595052 I\ne xenLae2.chr9_10L 13130032 2034 - 117834370 I\n\na score=2992.000000\ns xenTro9.chr9     15946521 145 +  80437102 TCATC...\ns xenLae2.chr9_10L 13132066 148 - 117834370 TTATC...\n<\/pre>\n<p>Note: Only the first 5 bases on each line and only the first 10 lines are shown for brevity.<\/p>\n<p>Here we are directly querying the mutliz11way table for the <em>Xenopus tropicalis<\/em> xenTro9 assembly, no need to download the entire alignment file to the local disk and query manually. Commands of this nature usually require a special private <code>.hg.conf<\/code> file in the user\u2019s home directory (note the leading dot). This configuration file contains a couple key=value lines that most of our programs can parse and then use to access the public MySQL server. <a href=\"http:\/\/genome.ucsc.edu\/goldenPath\/help\/mysql.html\">This page<\/a> contains a sample <code>.hg.conf<\/code> file that can be used by most of the command line utilities to direct them to access either our US MySQL server or our European MySQL server. That sample <code>.hg.conf<\/code> is certainly enough to get started, but for more information about the various Genome Browser configuration options, please see the comments in the <a href=\"https:\/\/github.com\/ucscGenomeBrowser\/kent\/blob\/master\/src\/product\/ex.hg.conf\">ex.hg.conf<\/a> and <a href=\"https:\/\/github.com\/ucscGenomeBrowser\/kent\/blob\/master\/src\/product\/minimal.hg.conf\">minimal.hg.conf<\/a> files.<\/p>\n<p><a name=\"gbdb\"><\/a><br \/>\n<b>Accessing the gbdb directory system<\/b><br \/>\nThe third method of grabbing our data is via the <code>\/gbdb\/<\/code> directory system. This location, browsable <a href=\"http:\/\/hgdownload.soe.ucsc.edu\/gbdb\/\">here<\/a>, holds most of the bigBed, bigWig, and other large data files that we do not keep directly in MySQL databases\/tables. <a href=\"http:\/\/hgdownload.soe.ucsc.edu\/admin\/exe\">There are many utilities<\/a> available for manipulating these files, and most of them are able to work on remote files, for example:<\/p>\n<pre>$ bigBedToBed -chrom=chr1 -start=5563837 -end=5564370 http:\/\/hgdownload.soe.ucsc.edu\/gbdb\/hg38\/crispr\/crispr.bb stdout \nchr1    5563870    5563893        55    +    5563870    5563890    0,200,0    255,255,0    128,128,0    CAAGTGGAATCAGGATGCCT    GGG    55    72% (57)    52% (46)    10    60    MIT Spec. Score: 55, Doench 2016: 72%, Moreno-Mateos: 52%    3345002138\nchr1    5563878    5563901        59    +    5563878    5563898    0,200,0    0,200,0    128,128,0    ATCAGGATGCCTGGGATATG    TGG    59    63% (54)    61% (50)    6    63    MIT Spec. Score: 59, Doench 2016: 63%, Moreno-Mateos: 61%    22777603204\n<\/pre>\n<p>Also note that we have all of this data available via <code>rsync<\/code> as well, so the following command will work to download the <code>crispr.bb<\/code> file referenced above:<\/p>\n<pre>$ rsync -vh hgdownload.soe.ucsc.edu::gbdb\/hg38\/crispr\/crispr.bb\n-rw-rw-r--  1466266135 2017\/03\/30 14:31:48 crispr.bb\n\nsent 33 bytes  received 70 bytes  206.00 bytes\/sec\ntotal size is 1.47G  speedup is 14235593.54\n<\/pre>\n<p>If you are interested in say, Human GRCh37\/hg19 gbdb data, then all you have to do is change the &#8220;hg38&#8221; at the end of the template <code>http:\/\/hgdwonload.soe.ucsc.edu\/gbdb\/hg38<\/code> url to &#8220;hg19&#8221;, resulting in <code>http:\/\/hgdwonload.soe.ucsc.edu\/gbdb\/hg19<\/code>. This holds for all databases at UCSC, like mm10 or bosTau8.<\/p>\n<p><b>Summary<\/b><br \/>\nJust as in <a href=\"http:\/\/genome.ucsc.edu\/blog\/accessing-the-genome-browser-programmatically-part-1-how-to-get-sequence-from-the-ucsc-genome-browser\/\">part 1<\/a>, if you are going to continually request parts of the same files or table over and over again, it is best to download the file from our downloads server and operate on it locally. All of our track data, including MySQL tables and bigBed\/Wig\/BAM files are hosted on our downloads server at http:\/\/hgdownload.soe.ucsc.edu. Generally speaking bigBeds\/bigWigs\/BAMs and other binary files are located in the hgdownload.soe.ucsc.edu\/gbdb\/ location discussed earlier, while MySQL table data in gzipped plain text format can be found at http:\/\/hgdownload.soe.ucsc.edu\/goldenPath\/$db (where $db is a database name like hg19 or hg38) or via queries against the public MySQL server directly.<\/p>\n<p>Stay tuned for part 3 of this programmatic access series &#8212; controlling the Genome Browser image!<\/p>\n<hr \/>\n<p>If after reading this blog post you have any public questions, please email <a href=\"mailto:genome@soe.ucsc.edu\" target=\"_blank\" rel=\"noopener\">genome@soe.ucsc.edu<\/a>. All messages sent to that address are archived on a <a href=\"https:\/\/groups.google.com\/a\/soe.ucsc.edu\/forum\/#!forum\/genome\">publicly accessible forum<\/a>. If your question includes sensitive data, you may send it instead to\u00a0<a href=\"mailto:genome-www@soe.ucsc.edu\" target=\"_blank\" rel=\"noopener\">genome-www@soe.ucsc.edu<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>If you missed part 1 about obtaining sequence data, you can catch up here. Note:\u00a0We now have an API which can also perform many of these functions. The UCSC Genome Browser is a large repository of data from multiple sources, and if you want to query that annotation data, the easiest way to get started [&hellip;]<\/p>\n","protected":false},"author":15,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[6,15,18,17],"class_list":["post-632","post","type-post","status-publish","format-standard","hentry","category-uncategorized","tag-browser","tag-downloads","tag-gbdb","tag-mysql"],"jetpack_featured_media_url":"","_links":{"self":[{"href":"https:\/\/genome-blog.gi.ucsc.edu\/blog\/wp-json\/wp\/v2\/posts\/632","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/genome-blog.gi.ucsc.edu\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/genome-blog.gi.ucsc.edu\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/genome-blog.gi.ucsc.edu\/blog\/wp-json\/wp\/v2\/users\/15"}],"replies":[{"embeddable":true,"href":"https:\/\/genome-blog.gi.ucsc.edu\/blog\/wp-json\/wp\/v2\/comments?post=632"}],"version-history":[{"count":47,"href":"https:\/\/genome-blog.gi.ucsc.edu\/blog\/wp-json\/wp\/v2\/posts\/632\/revisions"}],"predecessor-version":[{"id":1028,"href":"https:\/\/genome-blog.gi.ucsc.edu\/blog\/wp-json\/wp\/v2\/posts\/632\/revisions\/1028"}],"wp:attachment":[{"href":"https:\/\/genome-blog.gi.ucsc.edu\/blog\/wp-json\/wp\/v2\/media?parent=632"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/genome-blog.gi.ucsc.edu\/blog\/wp-json\/wp\/v2\/categories?post=632"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/genome-blog.gi.ucsc.edu\/blog\/wp-json\/wp\/v2\/tags?post=632"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}