
Why Is It Necessary?
We are concerned with the publication of the Bible in many languages. We believe it is God's word, for all people in all generations. Huge amounts of effort have been applied to the translation and publication of the Scriptures. The United Bible Societies state that the Bible (or parts of it) now exists in over 2,200 languages. But how long will this work be available? For it to be accessible to future generations, we need to work strenuously and proactively now to preserve it in digital form. Unfortunately, there are technical and attitude problems that militate against the preservation of Scriptures for the future.
Every year the computer industry offers new products to the public. This is the leading edge. However, at the trailing edge, computer hardware (especially drives), software, operating systems and media are becoming obsolescent and unsupported. DOS programs, old tape formats, 5.25-inch disks, punch cards, paper tapes and older hard disk drives (such as MFM and ESDI) are all disappearing. Consequently, the computer data stored in those forms is also becoming unreadable and, therefore, inaccessible. Digital data that has the potential to last for centuries may thus be lost or irrecoverable after only ten or fifteen years. (See "Avoiding Technological Quicksand".)
The media on which data are stored are also fragile. Magnetic media, whether tape or disk, are not reliable even for the medium term. Ambient magnetic fields can erase data on disks. The media can also be damaged by breakage or fire, whether deliberate or accidental. Overseas, there are problems with dust, and with fungus and moulds growing on the media. The quality of data on magnetic media left on a shelf cannot be guaranteed. CDs are also vulnerable to scratches, light exposure, and other damage from mishandling.
There is a worrying attitude of complacency about the need to preserve data. People are not aware of its fragility. Consequently, there is a need for consciousness-raising regarding the transience of the materials we hold. When people upgrade their computers, they frequently do not transfer all their data to the new environment. They often discard old hard disk drives containing both data and programs. Similarly, floppy disks are often thrown away without transferring the data. It is often just too time-consuming to evaluate and then convert the data to new formats.
In addition, people involved in the production of published materials often do not have a long-term preservation mentality or strategy. The data they handle is just "the current job", to be completed and then put to one side. They are not thinking about the long-term preservation of data because that is not their task.
What Needs To Be Done?
First of all, we must focus attention on the problem. People are busy with their own agendas, so little time and energy is left for urgent matters such as data rescue and preservation. Currently, not enough people are aware of the problem. Those who are aware know its seriousness and the need to tackle it vigorously and urgently. We should be doing all we can to raise awareness of the problem and to consider solutions.
A number of alternative strategies have already been proposed. (See "Preserving Digital Information".) They include the following measures.
It is probably wise for IT managers to retain some old machinery, particularly old drives, in order to read and transfer data that is still in the hands of workers. At least one machine with a 5.25-inch drive would be useful, for example. However, maintaining a full range of obsolete machines and their drives in working order, plus their accompanying software packages and the means of transferring the data to new machines, is not a viable possibility for most organisations.
An alternative suggestion is to develop "emulators". Emulators have already been created for hobby use. For example, you may have seen emulators for the old Sinclair Spectrum, which allow people to play old games using old software on new machines. This is a possible approach to long-term preservation and has been used in a pioneering project being run by the CAMiLEON group at Leeds University. They have recently produced an emulator for the BBC Micro, together with special drives that reproduce the "look and feel" of a 1980s school project called the "Domesday Project". This has been a successful venture and their BBC Micro emulator works.
However, their objective was to produce a "look and feel" device that would accurately reproduce the original early multimedia project, including maps and photos as well as text. Emulators need to address many technical questions. They also need to be continually updated as new technology develops.
I do not believe that such an investment of time, effort, and energy is required for the text of the Bible. For the majority of the work we wish to preserve, we are interested in the preservation of the content of a Bible translation text and not its original "look and feel". What if some agencies deem it essential that certain documents also contain the "look and feel" of the original document? If this is the case, then other economic factors will need to be considered. Do they want the "look and feel" of this document to be preserved so much that they will make serious investment into emulators or similar methods?
Raymond Lorie of IBM has suggested (in a project on "Preservation of Digital Data") the development of a universal virtual computer (UVC) that will be able to read and recreate obsolescent data. This is another major development concept that will require large financial resources. If a UVC such as this, or emulators, are developed by other agencies, be they academic organisations or software developers, old data that is still on readable media and in readable form could still be transferred to them. We could all buy into it later. However, until those developments are complete, alternative methods need to be applied right now.
The Wycliffe Associates "Bible For The Future" Project is taking the first necessary steps to long-term preservation of Bible text data streams. We are using a very straightforward strategy of collecting and migrating data. We realise that many organisations and agencies do not have the time, expertise or personnel to devote to what may seem a mundane task. However, we are enthusiastic about data preservation. We would like to be considered as an outsourcing group for other agencies and organisations.
What Needs To Be Preserved?
There are, essentially, two elements that require our attention: data and metadata.
Data that is being kept for future generations must be in a machine-readable form. Hence the need for planned migration to newer media as these become widely accepted and used, and as old media become obsolescent. The data must also be stored in a form that is not software-dependent, or at least in forms that are public standards. So, for example, it would be less than ideal to store material in Ventura Publisher (proprietary software), better to store it in Acrobat or "Rich Text" formats (open formats), and better still to store data as text files. If data is submitted in a "less than ideal" form, then, where possible, the Project will convert the data to a more readable, open and less software-dependent format. However, the original data would also be retained in its original form. Any such conversions would be recorded in the metadata to provide a full audit trail.
Additional conversions may be necessary to move away from the ad hoc methods of displaying fonts that were developed in the past, into the new universal standard of Unicode. Again, conversions would all be tracked in the metadata, and the originals would be retained as well as the converted forms. In addition, conversions may be made from Standard Format into XML, OSIS, and such like. In the conversion process, checksum software is used to ensure the faithful transmission of the data.
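The checksum step mentioned above can be illustrated with a short sketch. It is not the Project's actual software: the choice of SHA-256 and the function names sha256_of and migrate_file are assumptions made for the example. The idea is simply to compute a checksum before and after a file is moved to new media and to refuse to continue if the two differ.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def migrate_file(source: Path, destination: Path) -> str:
    """Copy a file (hypothetical helper) and confirm the copy is bit-identical."""
    before = sha256_of(source)
    shutil.copyfile(source, destination)
    after = sha256_of(destination)
    if before != after:
        raise IOError(f"checksum mismatch while copying {source}")
    # The digest can be written into the metadata as part of the audit trail.
    return after
```

A checksum of this kind shows that a copy is faithful to its source. It does not show that a format conversion (for example, from an ad hoc font encoding to Unicode) is correct, so converted files would still need to be checked against the retained originals.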
Metadata is the additional information that we must supply to future generations for them to interpret and read what we have stored. It would normally include such things as the language name, information about fonts, copyright and intellectual property (IP) information, significant dates, access permissions and the like.
We also plan to include scanned images of sample pages of the original text in TIFF format, to give future users additional clues as to the appearance of the language, and to help them decipher the text. Such scanned images would be part of the metadata. The metadata and the data will form a unified package.
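As an illustration of what such a metadata package might contain, here is a minimal sketch of a record that could be stored alongside the data. The field names, the JSON layout and the class names LanguageRecord and Conversion are all assumptions made for the example, not the Project's actual schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Conversion:
    """One audit-trail entry for a format conversion (hypothetical fields)."""
    date: str
    from_format: str
    to_format: str


@dataclass
class LanguageRecord:
    """Metadata for one language package (hypothetical fields)."""
    language_name: str
    ethnologue_code: str
    copyright_holder: str
    access: str
    fonts: List[str] = field(default_factory=list)
    sample_page_images: List[str] = field(default_factory=list)
    conversions: List[Conversion] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialise the record as plain, human-readable text."""
        return json.dumps(asdict(self), indent=2, ensure_ascii=False)
```

Writing the record out as plain text keeps the metadata itself free of dependence on any one software package, in line with the preference for text files stated above.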
How Is Data Stored In The "Bible For The Future" Project?
We save all the data submitted to us on CDs (at present). Each language will have its own CD, so there is plenty of storage space for the Scripture data and all the accompanying metadata. We will migrate all the data to newer media as these become the de facto norm for storage.
The CDs are stored in a dark room, within a temperature range suggested by the CD makers.
Copies of the CDs are placed in at least two separate geographical sites, including the client's own site. Mirror copies of the CDs are provided to the client submitting the data. The original data on the media submitted by the client are retained. Periodically the media are checked to ensure that they are still readable.
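One way to carry out the periodic readability check is to keep a manifest of checksums for each CD and compare the disc against it at intervals. The sketch below assumes that approach; the function names build_manifest and check_media are hypothetical, and SHA-256 is chosen only for illustration.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def build_manifest(root: Path) -> Dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            name = path.relative_to(root).as_posix()
            manifest[name] = hashlib.sha256(path.read_bytes()).hexdigest()
    return manifest


def check_media(root: Path, manifest: Dict[str, str]) -> List[str]:
    """Return the files that are missing, unreadable, or no longer match."""
    problems = []
    for name, expected in manifest.items():
        try:
            actual = hashlib.sha256((root / name).read_bytes()).hexdigest()
        except OSError:
            # The file has vanished or the medium can no longer be read.
            problems.append(name)
            continue
        if actual != expected:
            problems.append(name)
    return problems
```

The manifest would be built once, when the CD is written, and kept with the metadata. An empty list from check_media means every file is still readable and unchanged; any name in the list signals that the copy at another site should be used to make a replacement.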
Copyright And Intellectual Property Rights
How can submitting organisations ensure that copyright and intellectual property rights are not compromised? The Project guarantees that the materials in its care will not be issued to anyone during the period of active copyright, except with the permission of the copyright holder. It will, however, make copies available to all authorised users on request. The ownership of the data will remain with the copyright holder. The archive will retain the right to publish lists of materials held.
What Are The Preferred Submission Criteria?
There are three areas to consider.
Ideally, data should be submitted on current widespread media, such as CD or 3.5-inch floppy. If data is submitted on other media types, it may be necessary to make a charge for conversion to an acceptable form. (This only applies if the Archive does not have the necessary conversion equipment and has to use an outside file conversion agency.)
The data are stored on single CDs, one CD per language. Each CD will contain its own metadata as part of the full package.
CDs are physically labelled to show things such as their contents, restrictions on access and the name of the language used.
Ideally, files should be submitted as text, or in open standards rather than in proprietary software formats. However, it is more important to preserve the documents in any form than to reject data because it is not in the preferred format. If the name of the generating software package is known, this will help.
We insist on some basic metadata, such as the name of the language, the 3-letter code (from Ethnologue), details of the copyright holder, and access criteria. However, it is more important to preserve the documents than to insist on complete metadata prior to reception of the document.
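A check of the basic metadata could be sketched as follows. The field names and the function metadata_gaps are hypothetical, and the test for the code is only that it consists of three letters, not that it appears in Ethnologue. In keeping with the policy above, the function reports what is missing instead of rejecting the submission.

```python
import re
from typing import Dict, List

# Hypothetical field names for the basic metadata listed above.
REQUIRED_FIELDS = (
    "language_name",
    "ethnologue_code",
    "copyright_holder",
    "access_criteria",
)


def metadata_gaps(record: Dict[str, str]) -> List[str]:
    """List the basic fields that are missing or malformed."""
    gaps = [name for name in REQUIRED_FIELDS if not record.get(name, "").strip()]
    code = record.get("ethnologue_code", "").strip()
    # Assumption: a valid code is exactly three letters, in either case.
    if code and not re.fullmatch(r"[A-Za-z]{3}", code):
        gaps.append("ethnologue_code")
    return gaps
```

The list of gaps can then be sent back to the submitting organisation as a request for further details, while the documents themselves are accepted and preserved in the meantime.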
A Licence Agreement, mainly concerned with degrees of access, is provided.