{"id":766,"date":"2019-02-22T16:47:12","date_gmt":"2019-02-23T00:47:12","guid":{"rendered":"http:\/\/genome.ucsc.edu\/blog\/?p=766"},"modified":"2021-10-30T15:28:26","modified_gmt":"2021-10-30T15:28:26","slug":"patches","status":"publish","type":"post","link":"https:\/\/genome-blog.gi.ucsc.edu\/blog\/2019\/02\/22\/patches\/","title":{"rendered":"Patching up the Genome"},"content":{"rendered":"<p style=\"text-align: justify;\"><span style=\"font-weight: 400;\">From biologists to computer scientists, the human genome has presented a grand puzzle. With regard to UCSC, the story began in 1985 when our chancellor, molecular biologist Robert Sinsheimer, proposed a bold endeavor \u2013 <a href=\"https:\/\/www.sciencedirect.com\/science\/article\/pii\/0888754389901420\">sequence the complete human genome<\/a>. Five years later, the international Human Genome Project was launched. The next chapter took place in 1999 when computer science professor David Haussler was asked to join the project. Haussler, in turn, enlisted then-graduate student Jim Kent to help with assembling the genome. This collaboration culminated on July 7, 2000, when the first human genome assembly was made available on the UCSC servers. Over 500 GB were downloaded worldwide in 24 hours. 
&nbsp;(Hey, back in 2000, that was a lot!)<br \/>\n<\/span><\/p>\n<div id=\"attachment_774\" style=\"width: 488px\" class=\"wp-caption aligncenter\"><a href=\"http:\/\/genome.ucsc.edu\/blog\/wp-content\/uploads\/2019\/02\/UCSCReleaseDownloads.png\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-774\" class=\"wp-image-774 size-full aligncenter\" src=\"http:\/\/genome.ucsc.edu\/blog\/wp-content\/uploads\/2019\/02\/UCSCReleaseDownloads.png\" alt=\"UCSCReleaseDownloads\" width=\"478\" height=\"147\" srcset=\"https:\/\/genome-blog.gi.ucsc.edu\/blog\/wp-content\/uploads\/2019\/02\/UCSCReleaseDownloads.png 478w, https:\/\/genome-blog.gi.ucsc.edu\/blog\/wp-content\/uploads\/2019\/02\/UCSCReleaseDownloads-300x92.png 300w\" sizes=\"auto, (max-width: 478px) 100vw, 478px\" \/><\/a><p id=\"caption-attachment-774\" class=\"wp-caption-text\">Total web traffic at the University of California Santa Cruz in 2000. When the genome became available online, all other web activity at the university shrank into the background.<\/p><\/div>\n<p style=\"text-align: justify;\"><span style=\"font-weight: 400;\">Three months later, the UCSC Genome Browser came online as a resource to distribute and visualize the genome. The first ten releases, hg1-hg10, were assembled at UCSC, after which the task was taken over by NCBI. As NCBI incremented the official releases and changed the naming scheme, UCSC released browsers at a slower rate, continuing to increment the hg* nomenclature. When NCBI released NCBI33 in 2003, UCSC released it as hg15. 
After releasing so many browsers in under three years, the pace slowed, with each assembly taking around one year longer than the previous.<\/span><\/p>\n<p><a name=\"patches\"><\/a><\/p>\n<h2 style=\"text-align: justify;\">Patches: What are they and why are they important?<\/h2>\n<div id=\"attachment_757\" style=\"width: 163px\" class=\"wp-caption alignright\"><a href=\"http:\/\/genome.ucsc.edu\/blog\/wp-content\/uploads\/2019\/02\/Blog_table.png\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-757\" class=\"wp-image-757 alignleft\" src=\"http:\/\/genome.ucsc.edu\/blog\/wp-content\/uploads\/2019\/02\/Blog_table.png\" alt=\"Blog_table\" width=\"153\" height=\"174\"><\/a><p id=\"caption-attachment-757\" class=\"wp-caption-text\"><span style=\"font-weight: 400;\"><strong>Note<\/strong>: hg38 follows hg19. The UCSC nomenclature was changed to match the <\/span><a href=\"https:\/\/www.ncbi.nlm.nih.gov\/grc\"><span style=\"font-weight: 400;\">Genome Reference Consortium (GRC)<\/span><\/a><span style=\"font-weight: 400;\">\u2019s GRCh release number.<\/span><\/p><\/div>\n<p style=\"text-align: justify;\"><span style=\"font-weight: 400;\">The early genome assemblies were largely aiming to increase the fidelity of the reference. However, with each release, research progress was temporarily hampered as scientists adjusted to sequence changes and shifted coordinates. This has often led to scientists continuing to use an older release as it may be better annotated and established. This is evident in the Genome Browser as a majority of our users continue to work on GRCh37\/hg19 in spite of GRCh38\/hg38\u2019s release more than 4 years ago.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"font-weight: 400;\">Looking at the numbers, however, we can see that GRCh38 is the most accurate human genome to date. With these benchmarks in accuracy, the GRC has shifted focus beyond fidelity to inclusion. 
The GRC now strives to capture more of the genetic diversity present in the human population. The initial release of GRCh38\/hg38 included 261 alternate haplotype sequences, nearly a 30-fold increase over GRCh37\/hg19.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"font-weight: 400;\">UCSC builds a new assembly database for each full release of a genome assembly, but the GRC also releases \u201cpatch\u201d updates for genome assemblies. Through patch releases, the GRC adds new alternate haplotype sequences, as well as corrected sequences, without changing the sequences or coordinate system of the initial assembly release.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"font-weight: 400;\">To quote&nbsp;<\/span><a href=\"https:\/\/www.ncbi.nlm.nih.gov\/grc\/help\/patches\/\"><span style=\"font-weight: 400;\">directly from the GRC<\/span><\/a><span style=\"font-weight: 400;\">:<\/span><\/p>\n<blockquote>\n<p style=\"text-align: justify;\"><span style=\"font-weight: 400;\">Patches are accessioned scaffold sequences that represent assembly updates. They add information to the assembly without disrupting the chromosome coordinates. Patches are given chromosome context via alignment to the current assembly. Together, the scaffold sequence and alignment define the patch.<\/span><\/p>\n<\/blockquote>\n<p style=\"text-align: justify;\"><span style=\"font-weight: 400;\">These patch sequences are more important now than ever before as the GRC has decided to <a href=\"https:\/\/www.ncbi.nlm.nih.gov\/grc\/human\">indefinitely postpone<\/a> the release of the next coordinate-changing assembly (which would have been GRCh39\/hg39), instead opting for additional patches to GRCh38\/hg38. 
There are two kinds of patch sequences:<\/span><\/p>\n<p style=\"text-align: justify;\"><b>Novel patches (alternative haplotypes):<\/b><span style=\"font-weight: 400;\"> Chromosomal regions of the genome that exhibit sufficient variability to prevent adequate representation by a single sequence. Also referred to as <a href=\"https:\/\/www.ncbi.nlm.nih.gov\/grc\/help\/definitions\/\">alternate loci<\/a>. UCSC labels these haplotype sequences by appending &#8220;_alt&#8221; to their names.<\/span><\/p>\n<p style=\"text-align: justify;\"><b>Fix patches:<\/b><span style=\"font-weight: 400;\"> Error corrections (addressed by approaches such as base changes, component replacements\/updates, switch-point updates or tiling-path changes) or assembly improvements, such as the extension of sequence into gaps. UCSC labels these fix sequences by appending &#8220;_fix&#8221; to their names.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"font-weight: 400;\">These patch sequences, especially novel patches, have been increasing in number and will continue to do so.<\/span><\/p>\n<div id=\"attachment_791\" style=\"width: 615px\" class=\"wp-caption alignnone\"><a href=\"http:\/\/genome.ucsc.edu\/blog\/wp-content\/uploads\/2019\/02\/patches.jpg\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-791\" class=\"wp-image-791 size-full\" src=\"http:\/\/genome.ucsc.edu\/blog\/wp-content\/uploads\/2019\/02\/patches.jpg\" alt=\"patches\" width=\"605\" height=\"340\" srcset=\"https:\/\/genome-blog.gi.ucsc.edu\/blog\/wp-content\/uploads\/2019\/02\/patches.jpg 605w, https:\/\/genome-blog.gi.ucsc.edu\/blog\/wp-content\/uploads\/2019\/02\/patches-300x169.jpg 300w\" sizes=\"auto, (max-width: 605px) 100vw, 605px\" \/><\/a><p id=\"caption-attachment-791\" class=\"wp-caption-text\">The number of human assembly patch sequences is quickly growing. 
This is primarily due to alternative haplotype (_alt) sequences, though fix sequences (_fix) are also being introduced. The fix patch count resets from GRCh37.p13 to GRCh38 because the fixes were integrated into the assembly.<\/p><\/div>\n<p><a name=\"approach\"><\/a><\/p>\n<h2 style=\"text-align: justify;\">A better approach to patches<\/h2>\n<p style=\"text-align: justify;\"><span style=\"font-weight: 400;\">Our approach thus far in the Genome Browser has been to make data tracks indicating the locations of these patch releases along the initial assembly chromosomes. While these are useful, they provide little in the way of annotations and are largely underutilized by users. With the increase in these patches and the postponement of GRCh39, however, we have decided to change our approach and add the new sequences, and annotations on the new sequences, to the UCSC hg38 database. This will allow patches to be visualized on the Browser as standalone reference sequences, not unlike a regular chromosome or the alternate haplotype sequences that were included in the initial assembly release. BLAT results may also include alignments to these sequences.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"font-weight: 400;\">The addition of new genomic sequences to an existing UCSC database is a departure from our longstanding practice of building a new database every time we import a new genome assembly release. To minimize disruption to pipelines that use our download files, especially those in the <\/span><a href=\"http:\/\/hgdownload.soe.ucsc.edu\/goldenPath\/hg38\/bigZips\/\"><span style=\"font-weight: 400;\">bigZips<\/span><\/a><span style=\"font-weight: 400;\"> directory, we will leave the original bigZips\/hg38.* files unchanged, and add a subdirectory when we incorporate sequences from a patch release; for example, bigZips\/p12\/ for patch release GRCh38.p12. 
&nbsp;We will also add bigZips\/latest\/ which will link to the most recent patch release subdirectory, so that pipelines may stay up to date with UCSC\u2019s patch sequence annotations if desired. In other words, the bigZips downloads will be \u201copt-in\u201d for patch sequences.<\/span><\/p>\n<p><a name=\"changes\"><\/a><\/p>\n<h2 style=\"text-align: justify;\">Changes and improvements to hg38<\/h2>\n<p><span style=\"font-weight: 400;\">Currently, we are in the process of adding these sequences to the GRCh38\/hg38 genome database with the potential to do the same for GRCh37\/hg19 and GRCm38\/mm10 at a future date. Changes that users may see are as follows:<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">BLAT\/In-Silico PCR &#8211; Additional hits on _alt and _fix sequences<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Position searches in the hg38 browser may lead to _alt and _fix sequences in addition to or instead of initial assembly chromosomes<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Replacing the \u2018GRC Patch Release\u2019 and \u2018Alt Map\u2019 tracks with \u2018Fix Patches\u2019 and \u2018Alt Haplotypes\u2019 tracks which include alignments to alts\/fixes with details pages and links to jump between main chromosomes and alts\/fixes<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">New subdirectories of <\/span><a href=\"http:\/\/hgdownload.soe.ucsc.edu\/goldenPath\/hg38\/bigZips\/\"><span style=\"font-weight: 400;\">bigZips<\/span><\/a><span style=\"font-weight: 400;\"> download directory (initial, p12, latest)<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">New sequences\/annotations in <\/span><a href=\"http:\/\/hgdownload.soe.ucsc.edu\/gbdb\/hg38\/\"><span style=\"font-weight: 400;\">\/gbdb\/hg38<\/span><\/a><span style=\"font-weight: 400;\"> download files (same file 
names, extended contents)<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">SQL queries to genome-mysql.soe.ucsc.edu may include new results on _alt and _fix sequences<\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">It is also worth noting what will not change. Existing sequences, and annotations on existing sequences, will not change. Download files in the <\/span><a href=\"http:\/\/hgdownload.soe.ucsc.edu\/goldenPath\/hg38\/bigZips\/\"><span style=\"font-weight: 400;\">bigZips<\/span><\/a><span style=\"font-weight: 400;\"> directory, such as bigZips\/hg38.2bit and bigZips\/hg38.fa.masked.gz, will not change. <\/span><\/p>\n<p><span style=\"font-weight: 400;\">So what kind of annotations can be found on these hg38 patch sequences?<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Annotations generated by UCSC such as RepeatMasker, CpG Islands, AUGUSTUS, Human mRNAs and Pfam<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">NCBI\u2019s sequence alignments of patch sequences to chromosomes: Fix Patches, Alt Haplotypes<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">External annotation sources such as RefSeq and GENCODE that include annotations on patch sequences (up to this point we have ignored those patch annotations)<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Select tracks have been lifted from main chromosomes onto the patches using NCBI\u2019s alignments, most notably GTEx Gene and ENCODE Regulation<\/span><\/li>\n<\/ul>\n<p style=\"text-align: justify;\"><span style=\"font-weight: 400;\">For additional information on these patch sequences, and a full list of sequences in hg38, you may visit the <\/span><a href=\"http:\/\/genome.ucsc.edu\/cgi-bin\/hgGateway?db=hg38\"><span style=\"font-weight: 400;\">hg38 Genome Browser Gateway page<\/span><\/a><span style=\"font-weight: 
400;\">.<\/span><\/p>\n<p style=\"text-align: left;\"><span style=\"font-weight: 400;\">We are always receptive to our users and their needs. If there are any specific track annotations you would like to see on these patches or if you have any questions regarding this implementation and how it may affect you, please write to our public mailing list (<\/span><a href=\"mailto:genome@soe.ucsc.edu\"><span style=\"font-weight: 400;\">genome@soe.ucsc.edu<\/span><\/a><span style=\"font-weight: 400;\">) or, if your message includes sensitive data, to our private mailing list (<\/span><a href=\"mailto:genome-www@soe.ucsc.edu\"><span style=\"font-weight: 400;\">genome-www@soe.ucsc.edu<\/span><\/a><span style=\"font-weight: 400;\">).<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>From biologists to computer scientists, the human genome has presented a grand puzzle. With regard to UCSC, the story began in 1985 when our chancellor, molecular biologist Robert Sinsheimer, proposed a bold endeavor \u2013 sequence the complete human genome. Five years later, the international Human Genome Project was launched. 
The next chapter took place in 1999 [&hellip;]<\/p>\n","protected":false},"author":18,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-766","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"jetpack_featured_media_url":"","_links":{"self":[{"href":"https:\/\/genome-blog.gi.ucsc.edu\/blog\/wp-json\/wp\/v2\/posts\/766","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/genome-blog.gi.ucsc.edu\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/genome-blog.gi.ucsc.edu\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/genome-blog.gi.ucsc.edu\/blog\/wp-json\/wp\/v2\/users\/18"}],"replies":[{"embeddable":true,"href":"https:\/\/genome-blog.gi.ucsc.edu\/blog\/wp-json\/wp\/v2\/comments?post=766"}],"version-history":[{"count":25,"href":"https:\/\/genome-blog.gi.ucsc.edu\/blog\/wp-json\/wp\/v2\/posts\/766\/revisions"}],"predecessor-version":[{"id":1042,"href":"https:\/\/genome-blog.gi.ucsc.edu\/blog\/wp-json\/wp\/v2\/posts\/766\/revisions\/1042"}],"wp:attachment":[{"href":"https:\/\/genome-blog.gi.ucsc.edu\/blog\/wp-json\/wp\/v2\/media?parent=766"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/genome-blog.gi.ucsc.edu\/blog\/wp-json\/wp\/v2\/categories?post=766"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/genome-blog.gi.ucsc.edu\/blog\/wp-json\/wp\/v2\/tags?post=766"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}