If you think dogs can’t count, try putting three dog biscuits in your pocket and then giving Fido only two of them.
~Phil Pastoret
“Counting is easy. Right?”
I say this with my hand out, my thumb and 4 fingers spread out. With my other hand’s pointer finger, I simply count each digit, “one, two, three, four, five.” Easy.
But what happens when you start counting at 0 instead of 1? You can see that you have 5 digits (4 fingers and a thumb), but how do you calculate the size of your range?
With your hand in mind as an example, let’s look at counting conventions as they relate to bioinformatics and the UCSC Genome Browser genomic coordinate systems.
The UCSC Genome Browser uses two different systems:
Table 1. UCSC Genome Browser coordinate systems summary
0-start, half-open (0-based) | 1-start, fully-closed (1-based) |
“BED” format (Browser Extensible Data): chr1 127140000 127140001 | “Position” format: chr1:127140001-127140001 |
Note: Spaces, not punctuation | Note: Punctuation used, no spaces |
When using BED format, browser & utilities assume coords are 0-start, half-open. | When using “position” format, browser & utilities assume coords are 1-start, fully-closed. |
Stored in UCSC Genome Browser tables | Positioned in UCSC Genome Browser web interface |
To convert to 1-start, fully-closed: add 1 to start, end = same | To convert to 0-start, half-open: subtract 1 from start, end = same |
Section 1: Interval types
0-start vs. 1-start : Does counting start at 0 or 1?
Synonyms:
Sometimes referred to as “0-based” vs. “1-based” or “0-relative” vs. “1-relative.”
Interval Types
For a counted range, is the specified interval fully-open, fully-closed, or a hybrid-interval (e.g., half-open)?
Ok, time to flashback to math class!
You might recall that specifying an interval type as open, closed (or a combination, e.g., “half-open”) refers to whether or not the endpoints of the interval are included in the set. For further explanation, see the interval math terminology wiki article. Figure 1 below describes various interval types.
Figure 1. Description of interval types.
Section 2: Interval types in the UCSC Genome Browser
UCSC Genome Browser web interface = “1-start, fully-closed”
A common counting convention is a system that we all used when we first learned to count the fingers on our hands; this is referred to as the “one-based, fully-closed” system (Figure 2, below). Note that an extra step is needed to calculate the range total (5).
The “1-start, fully-closed” system is what you SEE when using the UCSC Genome Browser web interface. However, all positional data that are stored in database tables use a different system.
Figure 2. 1-start, fully-closed interval. Most common counting convention. Used within the UCSC Genome Browser web interface (but not used in UCSC Genome Browser databases/tables). We calculate that we have 5 digits because 5 (pinky finger, range end) – 1 (the thumb, range start) = 4. We then need to add one to calculate the correct range; 4+1= 5.
UCSC Genome Browser tables = “0-start, half-open”
While the commonly-used “one-start, fully-closed” system is more intuitive, it is not always the most efficient method for performing calculations in bioinformatic systems, because an additional step is required to calculate the size of the base-pair (bp) range.
To increase efficiency, the UCSC Genome Browser uses a “hybrid-interval” coordinate system for storing coordinates in databases/tables that is referred to as “0-start, half-open” (see Figure 3, below).
Although coordinates in the web browser are converted to the more human-readable “1-start, fully-closed” system, coordinates are stored in database tables as “0-start, half-open.” You may have heard various terms to express this 0-start system:
Synonyms for “0-start, half-open”
- 0-based, half-open
- 0-based start, 1-based end
- Note: This is not technically accurate, but conceptually helpful. A “1-based end” refers to the end of the range being included, as in the common “1-based, fully-closed” system.
- 0-start, hybrid-interval (interval type is: start-included, end-excluded)
Figure 3. The UCSC Genome Browser coordinate system for databases/tables (not the web interface) is “0-start, half-open” where start is included (closed-interval), and stop is excluded (open-interval). We calculate that we have 5 digits because 5 (range end after pinky finger) – 0 (the thumb, range start) = 5.
Another example which compares 0-start and 1-start systems is seen below, in Figure 4. This figure describes the differences in defining and calculating the range for a specified sequence highlighted in yellow, “T, C, G, A.”
Figure 4. Calculation of genomic range for comparing “1-start, fully-closed” vs. “0-start, half-open” counting systems.
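The two size calculations from the hand example can be sketched in a few lines of Python. This is only an illustration, not UCSC code, and the function names are made up for this example.

```python
def size_fully_closed(start, end):
    """Size of a 1-start, fully-closed range: both endpoints are counted."""
    return end - start + 1


def size_half_open(start, end):
    """Size of a 0-start, half-open range: the end is excluded, so no +1."""
    return end - start


# The hand example: thumb through pinky finger.
print(size_fully_closed(1, 5))  # 5
print(size_half_open(0, 5))     # 5

# Python slices happen to follow the same 0-start, half-open convention.
fingers = ["thumb", "index", "middle", "ring", "pinky"]
print(len(fingers[0:5]))        # 5
```

The half-open version needs only one subtraction, which is the efficiency argument made above.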
Section 3: Formatting
Coordinate formatting indicates interval type
The UCSC Genome Browser and many of its related command-line utilities distinguish two types of formatted coordinates and make assumptions of each type.
The “Position” format (referring to the “1-start, fully-closed” system as coordinates are “positioned” in the browser)
- Written as: chr1:127140001-127140001
- No spaces.
- Includes punctuation: a colon after the chromosome, and a dash between the start and end coordinates.
- When in this format, the assumption is that the coordinate is 1-start, fully-closed.
The “BED” format (referring to the “0-start, half-open” system)
- Written as: chr1 127140000 127140001
- Spaces between chromosome, start coordinate, and end coordinate.
- No punctuation.
- When in this format, the assumption is that the coordinates are 0-start, half-open.
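To make the two formats concrete, here is a minimal Python sketch that recognizes each format and converts between them using the rule from Table 1 (shift the start by 1, leave the end alone). The function names and the regular expression are assumptions made for this example; they are not part of any UCSC utility.

```python
import re

# "Position" format: chrom, colon, start, dash, end, with no spaces.
POSITION_RE = re.compile(r"^([^:\s]+):(\d+)-(\d+)$")


def parse_position(text):
    """Parse e.g. 'chr1:127140001-127140001' (1-start, fully-closed)."""
    match = POSITION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not position format: {text!r}")
    chrom, start, end = match.groups()
    return chrom, int(start), int(end)


def parse_bed(text):
    """Parse e.g. 'chr1 127140000 127140001' (0-start, half-open)."""
    fields = text.split()  # accepts spaces or tabs between fields
    if len(fields) < 3:
        raise ValueError(f"not BED format: {text!r}")
    return fields[0], int(fields[1]), int(fields[2])


def position_to_bed(text):
    """1-start, fully-closed -> 0-start, half-open: subtract 1 from start."""
    chrom, start, end = parse_position(text)
    return f"{chrom} {start - 1} {end}"


def bed_to_position(text):
    """0-start, half-open -> 1-start, fully-closed: add 1 to start."""
    chrom, start, end = parse_bed(text)
    return f"{chrom}:{start + 1}-{end}"


print(position_to_bed("chr1:127140001-127140001"))  # chr1 127140000 127140001
print(bed_to_position("chr1 127140000 127140001"))  # chr1:127140001-127140001
```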
Section 4: Examples
SNP example
What we SEE in the Genome Browser interface itself is the “1-start, fully-closed” system. However, these data are not STORED in the UCSC Genome Browser databases and tables in the same way. The UCSC Genome Browser databases store coordinates in the “0-start, half-open” coordinate system.
Table 2. SNP coordinates in web browser (1-start) vs table (0-start)
rs782519173 (hg38) | Start | End |
Positioned in web browser: 1-start, fully-closed | 133255708 | 133255708 |
Stored in table: 0-start, half-open | 133255707 | 133255708 |
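A common place to trip over this difference is when checking whether a position seen in the browser falls inside an interval stored in a table. The sketch below shows one way to write that test, using the Table 2 coordinates; the function name is hypothetical.

```python
def bed_contains_position(bed_start, bed_end, position):
    """True if a 1-start position lies inside a 0-start, half-open interval.

    The stored interval covers 1-start positions bed_start + 1 through
    bed_end, so the start comparison is strict and the end one is not.
    """
    return bed_start < position <= bed_end


# rs782519173 (hg38): stored as 133255707-133255708, shown at 133255708.
print(bed_contains_position(133255707, 133255708, 133255708))  # True
print(bed_contains_position(133255707, 133255708, 133255707))  # False
```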
LiftOver examples and coordinate formatting
Let’s take a look at the two types of coordinate formatting (“BED” and “position”) when using the UCSC Genome Browser web-based and command-line utility liftOver tools.
1) Web-based LiftOver example
Below is an example from the UCSC Genome Browser’s web-based LiftOver tool (Home > Tools > LiftOver). Depending on how input coordinates are formatted, web-based LiftOver will assume the associated coordinate system and output the results in the same format.
Table 3. UCSC Genome Browser web-based LiftOver and “position” coordinate formatting
Input: | Assembly = panTro3 chr1:127140001-127140001 |
Output: | Lifts to this position in hg19: chr1:110255313-110255313 |
Notes: | If your input is entered with the “position” formatted coords (1-start, fully-closed), the browser will also output the same “position” format. (Note positional format includes “:” and “-” and no spaces.) |
Table 4. UCSC Genome Browser web-based LiftOver and “BED” coordinate formatting
Input: | Assembly = panTro3 chr1 127140000 127140001 |
Output: | Lifts to this position in hg19: chr1 110255312 110255313 |
Notes: | If your input is entered with the “BED” formatted coords (0-start, half-open), the browser will also output the same “BED” format. (Note BED format contains no punctuation and includes spaces.) |
* Note that the web-based output file extension is misleading in this case; while titled “*.bed” the positional output is not actually in “0-start, half-open” BED format, because the 1-start, fully-closed “positional” format was used for input.
2) Command-line liftOver utility example
When using the command-line liftOver utility, understanding coordinate formatting is also important. Just like the web-based tool, coordinate formatting specifies either the “0-start, half-open” or the “1-start, fully-closed” convention. For example, if you have a list of 1-start “position” formatted coordinates and you want to use the command-line liftOver utility, your command must tell liftOver that the input uses “position” formatted coordinates.
To view the liftOver utility usage statement and options, enter “liftOver” on your command-line (with no other arguments, and without the quotes).
Table 5. UCSC Genome Browser command-line liftOver and “position” coordinate formatting
Input: (panTro3.txt) |
chr1:127140001-127140001 |
Command: | liftOver -positions panTro3.txt liftOver/panTro3ToHg19.over.chain.gz mapped unMapped |
Output: | chr1:110255313-110255313 via “mapped” file for hg19 |
Notes: | Note: Must specify “-positions” for 1-start “position” format in command-line liftOver |
Table 6. UCSC Genome Browser command-line liftOver and “BED” coordinate formatting
Input: (panTro3.bed) |
chr1 127140000 127140001 |
Command: | liftOver panTro3.bed liftOver/panTro3ToHg19.over.chain.gz mapped unMapped |
Output: | chr1 110255312 110255313 via “mapped” file for hg19 |
Notes: | Note: No special argument needed, 0-start “BED” formatted coordinates are default. |
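If you drive liftOver from a script, the only difference between the two cases in Tables 5 and 6 is the “-positions” argument. The Python sketch below assembles the argument list for either case without running anything; the helper name and its parameters are made up for this example, and the file names are the ones used in the tables.

```python
import shlex


def build_liftover_command(input_path, chain_path, mapped_path,
                           unmapped_path, position_format=False):
    """Return the liftOver argument list for BED or 'position' input."""
    command = ["liftOver"]
    if position_format:
        # Needed only for 1-start "position" formatted input (Table 5).
        command.append("-positions")
    command += [input_path, chain_path, mapped_path, unmapped_path]
    return command


chain = "liftOver/panTro3ToHg19.over.chain.gz"
print(shlex.join(build_liftover_command(
    "panTro3.txt", chain, "mapped", "unMapped", position_format=True)))
print(shlex.join(build_liftover_command(
    "panTro3.bed", chain, "mapped", "unMapped")))
```

The returned list can be handed to `subprocess.run(command, check=True)` on a machine where the liftOver utility and the chain file are available.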
Wiggle Files
The wiggle (WIG) format is used for dense, continuous data that is displayed as a graph in the browser. Wiggle files of variableStep or fixedStep data use “1-start, fully-closed” coordinates. Like all other UCSC Genome Browser data, these coordinates are positioned in the browser as “1-start, fully-closed.”
Note: Many other formats outside of the UCSC Genome Browser use 1-start coordinate systems, such as GTF/GFF.
Table 7. UCSC Genome Browser wiggle files & coordinate systems
File Type | Coordinate system in file | Coordinate system as positioned in UCSC Genome Browser |
bedGraph -> bigWig | 0-start, half-open | 1-start, fully-closed |
wiggle variableStep -> bigWig | 1-start, fully-closed | 1-start, fully-closed |
wiggle fixedStep -> bigWig | 1-start, fully-closed | 1-start, fully-closed |
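Because wiggle and bedGraph sit on opposite sides of Table 7, moving data from one to the other needs the same start shift as before. The Python sketch below converts variableStep and fixedStep lines into bedGraph-style rows. It assumes the declaration-line syntax described on the wiggle format help page (chrom=, start=, step=, optional span=); the function name and the sample data are invented for illustration.

```python
def wig_to_bedgraph(lines):
    """Convert wiggle lines (1-start) to bedGraph rows (0-start, half-open)."""
    rows = []
    mode = chrom = next_start = step = None
    span = 1
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        if fields[0] in ("variableStep", "fixedStep"):
            # Declaration line: key=value pairs after the keyword.
            mode = fields[0]
            params = dict(item.split("=", 1) for item in fields[1:])
            chrom = params["chrom"]
            span = int(params.get("span", 1))
            if mode == "fixedStep":
                next_start = int(params["start"])
                step = int(params["step"])
            continue
        if mode == "variableStep":
            position, value = int(fields[0]), float(fields[1])
        elif mode == "fixedStep":
            position, value = next_start, float(fields[0])
            next_start += step
        else:
            raise ValueError("data line found before a declaration line")
        # 1-start position -> 0-start start; the end is start + span.
        rows.append((chrom, position - 1, position - 1 + span, value))
    return rows


example = [
    "variableStep chrom=chr1 span=2",
    "101 1.5",
    "fixedStep chrom=chr2 start=11 step=5",
    "3.0",
    "4.0",
]
for row in wig_to_bedgraph(example):
    print(*row)
```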
Section 5: Resources
- Sequence Coordinates: 0- vs 1-base, Bob Milius, PhD [pdf]
- Cheat Sheet For One-Based Vs Zero-Based Coordinate Systems [Biostars Forum]
- Database/browser start coordinates differ by 1 base [UCSC Genome Browser, FAQ]
- Genome wiki: Coordinate Transforms [UCSC Genome Browser Wiki: “genomewiki”]
- UCSC Genome Browser: wiggle format help page
- Kent WJ, Zweig AS, Barber G, Hinrichs AS, Karolchik D. BigWig and BigBed: enabling browsing of large distributed data sets. Bioinformatics. 2010 Sep 1;26(17):2204-7. Epub 2010 Jul 17.
If after reading this blog post you have any public questions, please email genome@soe.ucsc.edu. All messages sent to that address are archived on a publicly accessible forum. If your question includes sensitive data, you may send it instead to genome-www@soe.ucsc.edu.
Thank you very much for your nice illustration. It really answers my question about the bed file format.
Here’s what looks like a counter-example to the instructions given for converting 1-based to 0-based. The SNP rs575272151 is at position chr1:11008, as can be seen clearly in the browser. Its entry in the downloaded SNPdb151 track is:
chr1 11007 11008 rs575272151 + C C/T single by-frequency,by-1000genomes 0.160609 0.233472 near-gene-5 InconsistentAlleles C,G, 0.911941,0.088059
According to the bed file format, this would place the SNP at chr1:11007 because “required BED fields are…. chromEnd – The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99” , as explained here
https://genome.ucsc.edu/FAQ/FAQformat.html
So in bed file format, position chr1:11008 would be
chr1 11008 11009
And therefore to convert from the coordinates of the UCSC track to bed file format, one has to add 1 to both coordinates, whereas the instructions in your post say to subtract 1 from the start and leave the end the same. Perhaps I am missing something? Please let me know — thanks!
Dear Genya,
Thank you for using the UCSC Genome Browser and for your question about BED notation. Please know you can write questions either to our public mailing list at genome@soe.ucsc.edu or directly to our internal private list at genome-www@soe.ucsc.edu.
Here is a link that will load a view of the Browser on the hg19 database with a parameter to highlight the SNP rs575272151 mentioned, navigating to the position chr1:11000-11015:
http://genome.ucsc.edu/cgi-bin/hgTracks?db=hg19&hideTracks=1&snp151=pack&position=chr1:11000-11015&hgFind.matches=rs575272151
One item to note immediately is that the position range chr1:11000-11015 represents 16 base pairs (not 15 base pairs as one might first think). The Browser would represent this span in BED notation as “chr1 10999 11015” (subtracting 1 from the first coordinate to provide a 0-based chromStart). If you paste the BED notation “chr1 10999 11015” into the Browser, you will return to the same spot, chr1:11000-11015, as in the above link. One reason the internal Browser files use this BED notation is the quicker coordinate arithmetic it provides (http://genome.ucsc.edu/FAQ/FAQtracks#tracks1): one can subtract the chromStart from the chromEnd and get the total number of bases: 11015-10999 = 16.
Now enter chr1:11008 or chr1:11008-11008; these position format coordinates both define only the one base where this SNP is located. If you enter the BED notation you described, “chr1 11008 11009”, you will move over to the next base, chr1:11009. This is because the BED chromStart is 0-based and therefore 1 less than the position coordinate, just as 10999 represented a span starting at the nucleotide with coordinate position 11000. Now enter instead “chr1 11007 11008” and you will end up at chr1:11008, where this SNP rs575272151 is located. This explains why the entry in the snp151 table is “chr1 11007 11008 rs575272151”.
You bring up a good point about the confusing language describing chromEnd. To illustrate the referenced chromStart=0, chromEnd=100 example, enter these BED coordinates into the Browser: “chr1 11000 11010”, which will include the referenced SNP. You can think of these as analogous to chromStart=0 chromEnd=10, which span the first 10 bases of a region. While the browser software will think of these bases as numbered 0-9 in the drawing code, in position format they represent coordinates 1-10.
Thank you again for using the UCSC Genome Browser!
Hello UCSC Genome Browser team,
I have a question about the identifier tag of the annotation present in UCSC table browser.
This is a snapshot of annotation file that I have.
```
chr1 1046829 1047018 NM_001077977_utr3_2_0_chr1_1046830_f 0 +
chr1 1099124 1099325 NM_001077124_utr3_0_0_chr1_1099125_r 0 -
```
I am not able to understand the annotation in column 4.
I figured that NM_001077977 is the NCBI gene ID and utr3 is the 3’ UTR.
I also understand that the later part, chr1_1046830_f, means it’s on chr1 at position 1046830, and f means it’s on the forward (+) strand.
What has been bothering me are the two numbers in the middle. In the above examples: _2_0_ in the first one and _0_0_ in the second one. I am not able to figure out what they mean.
Please help me understand the numbers in the middle.
Thank you,
Suraj
Dear Suraj,
Thank you for using the UCSC Genome Browser and your question about Table Browser output. Please know it is best to directly email our help mailing list at genome@soe.ucsc.edu where questions are publicly archived and also can be searched: https://groups.google.com/a/soe.ucsc.edu/forum/#!forum/genome
The Table Browser will attempt to include information in the name column in the BED output. Please see this FAQ about the name column: http://genome.ucsc.edu/FAQ/FAQdownloads.html#download34
The two numbers you asked about carry additional information: the exon count, and whether additional padding was included when requesting output from the Table Browser.
Thank you again for your inquiry and using the UCSC Genome Browser. If you have any further public questions, please email genome@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible forum. If your question includes sensitive data, you may send it instead to genome-www@soe.ucsc.edu.
All the best,
Brian Lee
UC Santa Cruz Genomics Institute