history


History of Bioinformatics/Biological Databases


The first bioinformatic/biological databases were constructed a few years after the first protein sequences became available. The first protein sequence reported was that of bovine insulin in 1956, consisting of 51 residues. Nearly a decade later, the first nucleic acid sequence was reported, that of yeast alanine tRNA with 77 bases. Just a year later, Dayhoff gathered all the available sequence data to create the first bioinformatic database. The Protein DataBank followed in 1972 with a collection of ten X-ray crystallographic protein structures, and the SWISSPROT protein sequence database began in 1987. A huge variety of data resources of different types and sizes is now available, either in the public domain or, more recently, from commercial third parties. All of the original databases were organised in a very simple way, with data entries stored in flat files, either one per entry or as a single large text file. Later on, lookup indexes were added to allow convenient keyword searching of header information.
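The flat-file-plus-index arrangement described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two-letter line codes (`ID`, `DE`, `SQ`), the `//` entry terminator, the toy entries and their placeholder sequences, and the function names are hypothetical and are not taken from any particular historical database.

```python
from collections import defaultdict

# Hypothetical flat file: one text blob, entries terminated by "//".
# Entry names and sequences are invented placeholders, not real data.
FLAT_FILE = """\
ID   TOY0001
DE   hypothetical insulin-like peptide
SQ   ACDEFGHIK
//
ID   TOY0002
DE   hypothetical transfer RNA
SQ   GGGCGUGUA
//
ID   TOY0003
DE   hypothetical insulin receptor fragment
SQ   LMNPQRSTV
//
"""


def parse_entries(text):
    """Split flat-file text into a list of dicts keyed by line code."""
    entries = []
    current = {}
    for line in text.splitlines():
        if line.startswith("//"):
            # End of entry: store it and start a fresh one.
            if current:
                entries.append(current)
            current = {}
        elif line.strip():
            code, _, value = line.partition(" ")
            current[code] = value.strip()
    return entries


def build_keyword_index(entries):
    """Map each lower-cased word of the DE header line to entry positions."""
    index = defaultdict(set)
    for position, entry in enumerate(entries):
        description = entry.get("DE", "").lower().replace("-", " ")
        for word in description.split():
            index[word].add(position)
    return index


def search(index, entries, keyword):
    """Return the IDs of entries whose header contains the keyword."""
    positions = sorted(index.get(keyword.lower(), ()))
    return [entries[i]["ID"] for i in positions]


if __name__ == "__main__":
    entries = parse_entries(FLAT_FILE)
    index = build_keyword_index(entries)
    print(search(index, entries, "insulin"))
```

The index only covers header text, so a keyword lookup avoids rescanning the whole file, but searching the sequences themselves still needs a separate method.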


History of Tools


After the formation of the databases, tools became available to search sequence databases, at first in a very simple way, looking for keyword matches and short sequence words, and later with more sophisticated pattern-matching and alignment-based methods. The rapid but less rigorous BLAST algorithm has been the mainstay of sequence database searching since its introduction a decade ago, complemented by the more rigorous but slower FASTA and Smith-Waterman algorithms. Suites of analysis algorithms, written by leading academic researchers at Stanford, CA, Cambridge, UK and Madison, WI for their in-house projects, began to become more widely available for basic sequence analysis. These algorithms were typically single-function black boxes that took input and produced output in the form of formatted files. UNIX-style commands were used to operate the algorithms, with some suites having hundreds of possible commands, each taking different command options and input formats. Since these early efforts, significant advances have been made in automating the collection of sequence information. Rapid innovation in biochemistry and instrumentation has brought us to the point where the entire genomic sequences of at least 20 organisms, mainly microbial pathogens, are known, and projects to elucidate at least 100 more prokaryotic and eukaryotic genomes are currently under way. Groups are now even competing to finish the sequence of the entire human genome. With new technologies we can directly examine the changes in expression levels of both mRNA and proteins in living cells, either in a disease state or following an external challenge. We can go on to identify patterns of response in cells that lead us to an understanding of the mechanism of action of an agent on a tissue.
The volume of data arising from projects of this nature is unprecedented in the pharma industry, and will have a profound effect on the ways in which data are used and experiments performed in drug discovery and development projects. This is true not least because, with much of the interesting data in the hands of commercial genomics companies, pharma companies are unable to get exclusive access to many gene sequences or their expression profiles. The competition between co-licensees of a genomic database is effectively a race to establish a mechanistic role or other utility for a gene in a disease state in order to secure a patent position on that gene. Much of this work is carried out with informatics tools. Despite the huge progress in sequencing and expression analysis technologies, and the correspondingly greater volume of data held in public, private and commercial databases, the tools used for storage, retrieval, analysis and dissemination of data in bioinformatics are still very similar to the original systems gathered together by researchers 15-20 years ago. Many are simple extensions of the original academic systems, which have served the needs of both academic and commercial users for many years. These systems are now beginning to fall behind as they struggle to keep up with the pace of change in the pharma industry. Databases are still gathered, organised, disseminated and searched using flat files. Relational databases are still few and far between, and object-relational or fully object-oriented systems are rarer still in mainstream applications. Interfaces still rely on command lines, on fat clients that must be installed on every desktop, or on HTML/CGI forms. While the tools were in the hands of bioinformatics specialists, pharma companies were relatively undemanding of them. Now that the problems have expanded to cover the mainstream discovery process, much more flexible and scalable solutions are needed to serve pharma R&D informatics requirements.
There are different views of the origin of bioinformatics. From T. K. Attwood and D. J. Parry-Smith's "Introduction to Bioinformatics", Prentice-Hall 1999 [Longman Higher Education; ISBN 0582327881]: "The term bioinformatics is used to encompass almost all computer applications in biological sciences, but was originally coined in the mid-1980s for the analysis of biological sequence data." From Mark S. Boguski's article in the "Trends Guide to Bioinformatics", Elsevier, Trends Supplement 1998, p. 1: "The term 'bioinformatics' is a relatively recent invention, not appearing in the literature until 1991 and then only in the context of the emergence of electronic publishing. The National Center for Biotechnology Information (NCBI) is celebrating its 10th anniversary this year, having been written into existence by US Congressman Claude Pepper and President Ronald Reagan in 1988. So bioinformatics has, in fact, been in existence for more than 30 years and is now middle-aged."

A chronological history of Bioinformatics


  • 1951 - Pauling and Corey propose the structure for the alpha-helix and beta-sheet (Proc. Natl. Acad. Sci. USA, 37: 205-211, 1951; Proc. Natl. Acad. Sci. USA, 37: 729-740, 1951).
  • 1953 - Watson & Crick propose the double helix model for DNA based on X-ray data obtained by Franklin & Wilkins (Nature, 171: 737-738, 1953).
  • 1954 - Perutz's group develop heavy atom methods to solve the phase problem in protein crystallography.
  • 1955 - The sequence of the first protein to be analysed, bovine insulin, is announced by F. Sanger.
  • 1958 - The first integrated circuit is constructed by Jack Kilby at Texas Instruments. The Advanced Research Projects Agency (ARPA) is formed in the US.
  • 1962 - Pauling's theory of molecular evolution
  • 1965 - Margaret Dayhoff's Atlas of Protein Sequences
  • 1968 - Packet-switching network protocols are presented to ARPA
  • 1969 - The ARPANET is created by linking computers at Stanford, UCSB, The University of Utah and UCLA.
  • 1970 - The details of the Needleman-Wunsch algorithm for sequence comparison are published.
  • 1971 - Ray Tomlinson (BBN) invents the email program.
  • 1972 - The first recombinant DNA molecule is created by Paul Berg and his group.
  • 1973 - The Brookhaven Protein DataBank is announced (Acta Cryst. B, 1973, 29: 1764). Robert Metcalfe receives his Ph.D. from Harvard University. His thesis describes Ethernet.
  • 1974 - Vint Cerf and Robert Kahn develop the concept of connecting networks of computers into an "internet" and develop the Transmission Control Protocol (TCP).
  • 1975 - Microsoft Corporation is founded by Bill Gates and Paul Allen. Two-dimensional electrophoresis, where separation of proteins on SDS polyacrylamide gel is combined with separation according to isoelectric points, is announced by P. H. O'Farrell (J. Biol. Chem., 250: 4007-4021, 1975).
  • 1976 - The Unix-To-Unix Copy Protocol (UUCP) is developed at Bell Labs. E. M. Southern publishes the experimental details of the Southern blot technique for detecting specific sequences of DNA (J. Mol. Biol., 98: 503-517, 1975).
  • 1977 - The full description of the Brookhaven PDB (http://www.pdb.bnl.gov) is published (Bernstein, F.C.; Koetzle, T.F.; Williams, G.J.B.; Meyer, E.F.; Brice, M.D.; Rodgers, J.R.; Kennard, O.; Shimanouchi, T.; Tasumi, M.; J. Mol. Biol., 1977, 112: 535). Allan Maxam and Walter Gilbert (Harvard) and Frederick Sanger (U.K. Medical Research Council) report methods for sequencing DNA. Software to analyze DNA sequences is developed (Staden).
  • 1978 - The first Usenet connection is established between Duke and the University of North Carolina at Chapel Hill by Tom Truscott, Jim Ellis and Steve Bellovin.
  • 1980 - The first complete genome sequence for an organism (ΦX174) is published. The genome consists of 5,386 base pairs which code for nine proteins. Wüthrich et al. publish a paper detailing the use of multi-dimensional NMR for protein structure determination (Kumar, A.; Ernst, R.R.; Wüthrich, K.; Biochem. Biophys. Res. Comm., 1980, 95: 1). IntelliGenetics, Inc. is founded in California. Its primary product is the IntelliGenetics Suite of programs for DNA and protein sequence analysis.
  • 1981 - The Smith-Waterman algorithm for sequence alignment is published. IBM introduces its Personal Computer to the market. The concept of a sequence motif is introduced (Doolittle).
  • 1982 - Genetics Computer Group (GCG) is created as a part of the University of Wisconsin Biotechnology Center. Its primary product is the Wisconsin Suite of molecular biology tools. GenBank Release 3 is made public. The phage lambda genome is sequenced.
  • 1983 - The Compact Disc (CD) is launched. Name servers are developed at the University of Wisconsin. A sequence database searching algorithm is published (Wilbur-Lipman). LANL (Los Alamos National Laboratory) and LLNL (Lawrence Livermore National Laboratory) begin production of DNA clone (cosmid) libraries representing single chromosomes. DNA analysis becomes viable with the discovery of the polymerase chain reaction (PCR), which allows small samples of DNA to be multiplied to produce a sample large enough to analyse.
  • 1984 - Jon Postel's Domain Name System (DNS) is placed on-line. The Macintosh is announced by Apple Computer.
  • 1985 - The FASTP/FASTN algorithm is published. Robert Sinsheimer holds a meeting on human genome sequencing at the University of California, Santa Cruz. At OHER, Charles DeLisi and David A. Smith commission the first Santa Fe conference to assess the feasibility of a Human Genome Initiative.
  • 1986 - Following the Santa Fe conference, DOE OHER announces the Human Genome Initiative. With $5.3 million, pilot projects begin at DOE national laboratories to develop critical resources and technologies. The term "Genomics" appears for the first time to describe the scientific discipline of mapping, sequencing, and analyzing genes. The term was coined by Thomas Roderick as a name for the new journal. Amoco Technology Corporation acquires IntelliGenetics. The SWISS-PROT database is created by the Department of Medical Biochemistry of the University of Geneva and the European Molecular Biology Laboratory (EMBL). The PCR reaction is described by Kary Mullis and co-workers.
  • 1987 - The use of yeast artificial chromosomes (YACs) is described (David T. Burke et al., Science, 236: 806-812). The physical map of E. coli is published (Y. Kohara et al., Cell, 51: 319-337). Perl (Practical Extraction and Report Language) is released by Larry Wall. The congressionally chartered DOE advisory committee, HERAC, recommends a 15-year, multidisciplinary, scientific, and technological undertaking to map and sequence the human genome. DOE designates multidisciplinary human genome centers. NIH NIGMS begins funding of genome projects.
  • 1988 - The National Center for Biotechnology Information (NCBI) is created at NIH/NLM. The EMBnet network for database distribution is established. The Human Genome Initiative is started (Commission on Life Sciences, National Research Council. Mapping and Sequencing the Human Genome, National Academy Press: Washington, D.C., 1988). The FASTA algorithm for sequence comparison is published by Pearson and Lipman. An Internet computer virus designed by a student infects 6,000 military computers in the US. Reports by congressional OTA and NAS NRC committees recommend a concerted genome research program. HUGO is founded by scientists to coordinate efforts internationally. The first annual Cold Spring Harbor Laboratory meeting on human genome mapping and sequencing is held. DOE and NIH sign an MOU outlining plans for cooperation on genome research. A telomere (chromosome end) sequence having implications for aging and cancer research is identified at LANL.
  • 1989 - The Genetics Computer Group (GCG) becomes a private company. Oxford Molecular Group, Ltd. (OMG) is founded in the UK by Anthony Marchington, David Ricketts, James Hiddleston, Anthony Rees, and W. Graham Richards. Primary products: Anaconda, Asp, Cameleon and others (molecular modeling, drug design, protein design). DNA STSs are recommended to correlate diverse types of DNA clones. DOE and NIH establish the Joint ELSI Working Group.
  • 1990 - The BLAST program (Altschul et al.) is implemented. Molecular Applications Group is founded in California by Michael Levitt and Chris Lee. Their primary products are Look and SegMod, which are used for molecular modeling and protein design. InforMax is founded in Bethesda, MD. The company's products address sequence analysis, database and data management, searching, publication graphics, clone construction, mapping and primer design. DOE and NIH present a joint 5-year U.S. HGP plan to Congress. The 15-year project formally begins. Projects begin to mark gene sites on chromosome maps as sites of mRNA expression. Research and development begin for efficient production of more stable, large-insert BACs.
  • 1991 - The research institute in Geneva (CERN) announces the creation of the protocols which make up the World Wide Web. The creation and use of expressed sequence tags (ESTs) is described. Incyte Pharmaceuticals, a genomics company headquartered in Palo Alto, California, is formed. Myriad Genetics, Inc. is founded in Utah. The company's goal is to lead in the discovery of major common human disease genes and their related pathways. The company has discovered and sequenced, with its academic collaborators, the following major genes: BRCA1, BRCA2, CHD1, MMAC1, MMSC1, MMSC2, CtIP, p16, p19 and MTS2. The human chromosome mapping data repository, GDB, is established.
  • 1992 - A low-resolution genetic linkage map of the entire human genome is published. Guidelines for data release and resource sharing are announced by DOE and NIH.
  • 1993 - The Sanger Centre is established in Hinxton, UK. CuraGen Corporation is formed in New Haven, CT. Affymetrix begins independent operations in Santa Clara, California. The International IMAGE Consortium is established to coordinate efficient mapping and sequencing of gene-representing cDNAs. The DOE-NIH ELSI Working Group's Task Force on Genetic and Insurance Information releases recommendations. DOE and NIH revise 5-year goals [Science 262, 43-46 (Oct. 1, 1993)]. IOM releases the U.S. HGP-funded report "Assessing Genetic Risks". LBNL implements a novel transposon-mediated chromosome-sequencing system. The GRAIL sequence-interpretation service provides Internet access at ORNL.
  • 1994 - Netscape Communications Corporation is founded and releases Navigator, the commercial version of NCSA's Mosaic.