Stage II: Data Validation
Although the data could be stored in a single file, at around 62,000 entries per volume such a file would be too cumbersome for editing purposes. For Volume 1, which covered the area of the Aegean Sea, the entries were grouped by island, with each island stored in a separate data file. In the case of Volume 2 the entries were split into regions. Because the whole of Volume 2 only covers the region around Athens, the data was split into eleven files, divided by the initial letter of the names. However, this still meant that there were some 11,500 entries in the alpha file, but only about 750 entries in the beta and gamma file. Ultimately, for editing purposes, the alpha file was split into four separate segments of approximately the same number of lines.
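The two kinds of split described above (by initial letter, then into segments of roughly equal length) can be illustrated with a minimal Python sketch. It is not the project's own code: the function names are hypothetical, the names are assumed to be available as Unicode strings, and each initial letter is given its own group, whereas the project combined some letters to arrive at eleven files in a way not specified here.

```python
import unicodedata
from collections import defaultdict


def initial_letter(name):
    """Return the base initial of a non-empty name, ignoring accents and breathings."""
    # Assumption: names are Unicode strings. NFD decomposition separates a
    # leading breathing mark or accent from the base letter.
    return unicodedata.normalize("NFD", name)[0].upper()


def group_by_initial(names):
    """Group names by initial letter, keeping the original order within each group."""
    groups = defaultdict(list)
    for name in names:
        groups[initial_letter(name)].append(name)
    return dict(groups)


def split_evenly(lines, parts):
    """Split lines into contiguous segments whose lengths differ by at most one."""
    base, extra = divmod(len(lines), parts)
    segments = []
    start = 0
    for i in range(parts):
        # The first `extra` segments each take one additional line.
        size = base + (1 if i < extra else 0)
        segments.append(lines[start:start + size])
        start += size
    return segments


# Illustrative data only: eleven placeholder lines split into four segments.
alpha = [f"entry {n}" for n in range(11)]
sizes = [len(segment) for segment in split_evenly(alpha, 4)]  # [3, 3, 3, 2]
```

Keeping the segments contiguous means they can be concatenated back into the original file order once editing is finished.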
The second stage is by far the most time-consuming part of the overall process. All the details need to be checked, and this is accomplished partly by manually reading through a printout and partly by electronic means. Not only the content but also the format needs to be verified, and so a number of checking programs have been written. There are five programs, each of which corresponds to one of the fields in the data file (Table 11 lists the programs and describes their function). The checking programs use data stored in a second database, LEXCHECK, although for speed of operation this data is unloaded to data files. There are five separate programs because the amount of information to be processed is too great for a single program, which would take too long to run (the combined program source files amount to about 9,000 lines of code, and a number of large arrays are also used). Also, for the convenience of the editing process, work is usually carried out on one field at a time; for instance, all places will be checked and edited before moving on to dates. A VAX DCL script is used to control this process, presenting the user with a menu of options (Fig. 12) and allowing them to specify a number of other criteria, such as whether to run the program interactively or in a batch queue. There are several other options; in particular, programs can be run to compile lists of names and places. The resultant lists are used to check consistency within the data files.
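The general idea of checking one field against reference data and compiling lists for consistency checks can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the project's C programs: the record layout (a dictionary per entry), the field names `name` and `place`, and all function names are hypothetical, and the reference data is assumed to have been unloaded to one accepted value per line.

```python
from collections import Counter


def load_reference(lines):
    """Build a set of accepted values from unloaded reference lines, skipping blanks."""
    return {line.strip() for line in lines if line.strip()}


def check_field(records, field, accepted):
    """Return (record number, value) for each record whose field value is not accepted."""
    problems = []
    for number, record in enumerate(records, start=1):
        value = record.get(field, "")
        if value not in accepted:
            problems.append((number, value))
    return problems


def compile_list(records, field):
    """List each distinct value of a field with its frequency, in alphabetical order."""
    return sorted(Counter(record[field] for record in records).items())


# Illustrative data only; the real field layout is described in Table 11.
records = [
    {"name": "Ariston", "place": "Athens"},
    {"name": "Bion", "place": "Athenai"},
    {"name": "Ariston", "place": "Athens"},
]
accepted_places = load_reference(["Athens\n", "Eleusis\n", "\n"])

bad_places = check_field(records, "place", accepted_places)  # [(2, "Athenai")]
place_list = compile_list(records, "place")  # [("Athenai", 1), ("Athens", 2)]
```

Running one check per field mirrors the working practice described above, in which a single field is checked and edited across the data before the next is started. A compiled list with frequencies makes a rare variant spelling stand out next to the common form.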
The original checking programs, which ran on the ICL, were written in Spitbol, a dialect of Snobol. The new programs have been written in C, which makes the code more portable in the event of any future changes in technology. Since there are three more volumes still to be published, it is very likely that some hardware changes will take place before completion of the project.
The data for Volume 2 arrived from Australia in a fairly thoroughly edited form, and so the validation tended to concentrate on checking that the conventions used by the LGPN project had been applied correctly in the data files. This process nonetheless took several months, although corrections and additions to the content of the data were also made during that time, in collaboration with the Australian project staff.