Stage III: Database Loading


The database is loaded only once the data files are thought to be correct, at least in format; Stage II is therefore a repeating loop. The loading stage, by contrast, is a single process made up of a number of phases. The first is the creation of data files that correspond to the tables in the database. This is done by the same programs that do the checking, run with a flag set to indicate that database data files are to be produced. It is at this stage, however, that the relationship processing takes place, because it is very time-consuming for the computer, taking nearly half an hour of CPU time for Volume 2. This involved linking over 40,000 records, among which about 17,500 known relationships were established (accounting for some 35,000 records). The remaining 5,000 or so records could not be linked, occasionally because of spelling mistakes that could be corrected later, or because of fragmentary names. (Note 4)

Once all the loading data files have been created, they are copied into the database. Loading the database is a four-step process. The first step modifies all the tables to have a heap structure, which ensures that the second step, actually copying the data into the tables, is completed as quickly as possible. The third step modifies the tables back to a binary tree structure, and the fourth creates the extra indexes that make data retrieval quicker.
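The load sequence described above can be illustrated with a minimal sketch. It uses Python's built-in `sqlite3` module purely as a stand-in, which is an assumption: the source does not name the database system in this section, and SQLite has no equivalent of modifying a table between heap and binary tree structures. The sketch therefore mirrors only the general idea (remove the access structures, copy the data in bulk, then rebuild the indexes). The `person` table, its columns, and the index names are hypothetical.

```python
import sqlite3

# Hypothetical table and index definitions, for illustration only.
SCHEMA = (
    "CREATE TABLE IF NOT EXISTS person "
    "(id INTEGER PRIMARY KEY, name TEXT, place TEXT)"
)
INDEXES = {
    "idx_person_name": "CREATE INDEX idx_person_name ON person (name)",
    "idx_person_place": "CREATE INDEX idx_person_place ON person (place)",
}


def bulk_load(conn, rows):
    """Drop the indexes, copy the rows in one batch, then rebuild the indexes."""
    cur = conn.cursor()
    cur.execute(SCHEMA)
    # Analogue of step 1: remove index structures so inserts are cheap.
    for index_name in INDEXES:
        cur.execute(f"DROP INDEX IF EXISTS {index_name}")
    # Analogue of step 2: copy the data into the table.
    cur.executemany(
        "INSERT INTO person (id, name, place) VALUES (?, ?, ?)", rows
    )
    # Analogue of steps 3 and 4: rebuild the structures used for retrieval.
    for statement in INDEXES.values():
        cur.execute(statement)
    conn.commit()
    return cur.execute("SELECT COUNT(*) FROM person").fetchone()[0]


conn = sqlite3.connect(":memory:")
loaded = bulk_load(
    conn,
    [(1, "Alexandros", "Athens"), (2, "Philippos", "Athens")],
)
print(loaded)
```

Building the indexes after the copy, rather than maintaining them row by row during it, is the design choice the four steps share with this sketch.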

For Volume 2 of the LGPN this process took less than a day. In fact, the database was eventually loaded five times: corrections to the data were necessary, and since loading took very little time (within the overall process), it was decided that it would be quicker to modify the original data files and re-load than to modify the data as it existed in the database.

Before the fourth stage, a check is made on the actual content of the data. The data is extracted from the database in a form identical to the data files used by the original loading program. The old and new files are then compared by a comparison program, which writes any differences to a file. Differences will be due to one of three problems. The first is a mistake in the original data file that got past the checking programs. Given the nature of the data, it is very difficult to ensure that every minor mistake is located; this is especially true of the reference field (Note 5) and most particularly of the relationship field. For the relationship field, the program compares the name in the first field with that stored in the final bracket. If the name in the final bracket has an accent code missing or misplaced, or a secondary name is used, no match is made. A worse situation occurs when two individuals bear the same name, each has a relation with the same name, and all come from the same place: the program could quite easily match the wrong parent and child, depending on the order in which it found the entries. (Note 6) As a consequence it was decided to suspend the relationship checking for Volume 2, but it has been reinstated for Volume 3.
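The ambiguity in relationship matching can be made concrete with a small sketch. The source says only that the name in the first field is compared with the name in the final bracket; the record layout, the field names (`name`, `place`, `relation`), the `find_candidates` function, and the sample data below are all hypothetical. Rather than taking the first match found, the sketch collects every candidate so that ambiguous cases can be flagged.

```python
from collections import defaultdict


def find_candidates(records):
    """Map each record id to the ids of records that could be its named relation."""
    by_name_place = defaultdict(list)
    for rec in records:
        by_name_place[(rec["name"], rec["place"])].append(rec["id"])

    candidates = {}
    for rec in records:
        if rec["relation"] is None:
            continue
        key = (rec["relation"], rec["place"])
        # Every record with the relation's name and the same place is a candidate.
        candidates[rec["id"]] = [
            other for other in by_name_place.get(key, []) if other != rec["id"]
        ]
    return candidates


# Invented sample data: two homonymous individuals from the same place,
# each with a relation of the same name.
records = [
    {"id": 1, "name": "Nikias", "place": "Athens", "relation": "Nikeratos"},
    {"id": 2, "name": "Nikias", "place": "Athens", "relation": "Nikeratos"},
    {"id": 3, "name": "Nikeratos", "place": "Athens", "relation": None},
    {"id": 4, "name": "Nikeratos", "place": "Athens", "relation": None},
]

candidates = find_candidates(records)
# Records with more than one candidate cannot be linked safely by name alone.
ambiguous = {rid for rid, ids in candidates.items() if len(ids) > 1}
print(sorted(ambiguous))
```

A matcher that simply took the first candidate would link records 1 and 2 to whichever homonym appeared first, which is the order-dependent mismatch described above.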

The second possible source of differences is a mistake in the loading program, and the third is a mistake in the unloading program. Creating all the comparison data files took about 7 days, although the comparison program ran through each set of files in a matter of minutes.
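The comparison of original and unloaded data files could look something like the following sketch. It uses `difflib.unified_diff` from Python's standard library, which is an assumption for illustration and not the comparison program the project used. The pipe-delimited record format and the sample lines are also hypothetical.

```python
import difflib


def diff_lines(original, unloaded, name="data file"):
    """Return unified-diff lines describing how the unloaded data differs."""
    return list(
        difflib.unified_diff(
            original,
            unloaded,
            fromfile=f"{name} (original)",
            tofile=f"{name} (unloaded)",
            lineterm="",
        )
    )


# Invented sample records: the second line was altered somewhere in the round trip.
original = ["1|Alexandros|Athens", "2|Philippos|Athens"]
unloaded = ["1|Alexandros|Athens", "2|Filippos|Athens"]

differences = diff_lines(original, unloaded)
for line in differences:
    print(line)
```

An empty result means the round trip reproduced the file exactly; any non-empty result then has to be traced to the data, the loading program, or the unloading program.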
