Panlingua, by Chaumont Devin, May 10, 1998. Chapter 9, C Implementation.
Many of you may have been wondering just how one might implement a system of links and nodes on a computer, and this is not a trivial question. One might begin by analyzing the information that is needed in order to create a standardized link for the system. Since nodes generally carry less information, even in the computational implementation, they can generally use exactly the same number of bytes as the links and have room to spare. In this way all the links and nodes of a link-node system can be thought of as the elements of an array. So the first step is to determine the sizes and identities of the elements required for each link. Once this has been done, it will be a simple matter to define the data elements needed for the nodes within the same number of bytes.
As an example, suppose we intend to create a simpleminded ontology capable of holding 65535 semnods. Recall that the links of the ontology are semlinks, which link one semnod to another within the semantic plane. To create a system that can learn, the semlinks should be weighted, meaning that each link should hold statistical information that tells the system how much it is used. Each link will also have a type and a destination semnod identifier so the system can know what kind of a link it is and where it links to. Thus for each link we will need a weight byte, a type byte, and two bytes for a semnod identifier. So the type of structure we will need, in C, should look something like this:
typedef struct { /* semlink structure */ unsigned char wgt; /* weight byte */ unsigned char typ; /* semlink type */ unsigned tag; /* target semnod identifier */ } sem_t;With this type definition we might then create an array containing space for 10,000 links and nodes as follows:
sem_t ont[10000];
Now recall that although we have defined the semlink structure with its various elements, or fields, we have still to consider what the semnod structure should be like. For this we will need a value that tells the system how many elements of the array are taken up by the semnod and its pointers and an identifying integer value for the semnod. For the number of elements used by the semnod and its pointers a byte value should do, so we can just use the first byte field of the semlink array, which was "wgt." Then for the integer identifier we already have the "tag" field which serves the same purpose in the semlink structure, so we can use it. Then for a particular semnod entry in the array we might have:
5 elements, semnod #21935
The "5 elements" part tells us that this semnod element will be followed immediately by four semlink elements in the array. Then for these we might have:
wgt=255 typ=isa tag=20029
wgt=33 typ=has tag=15293
wgt=55 typ=has tag=995
wgt=92 typ=can tag=11504
And these four semlinks would then be followed immediately by the next semnod in the array, and so on until a semnod was reached containing all zeros, meaning that we had gotten past the last semnod and the last semlink in the array.
Since semnod tag values should increment consecutively, a check of the integrity of the data in the array can easily be made by skipping from semnod to semnod in the array to see if the numbers follow one another consecutively until the zero at the end of the data has been found.
To add a semlink to a semnod, the next semnod and everything after it will have to be shifted one position to the right so the semlink can be put in its place. Then the "wgt" byte of the semnod to which the new semlink has been added will need to be incremented to show one element more.
To remove a semlink from a semnod, simply shift everything to the right of the semlink one position to the left and decrement the "wgt" value of the semnod to reflect one less element in use.
An arbitrary decision will have to be made regarding semlink direction. In my systems the semlinks point from the current semnod to the semnods identified in their tag fields. But it might be just as valid to choose the opposite direction and assume that all the semlinks were pointing to the semnod to which they belong. Then their tag values would not indicate where they were pointing TO, but instead where they were pointing FROM.
So using a digital computer it is a very easy matter to model systems of links and nodes. It is easy to traverse all the semnods in the example system above just by adding the "wgt" value to the current index at every semnod and jumping to the next semnod. It is easy to visit all the semlinks pertaining to one semnod just by finding the semnod, then using the "wgt" value of the semnod to increment the index until the next semnod has been reached. Etc.
But what about data structures like Panlingua, which are not simple systems of links and nodes? Yes, it would be possible to model Panlingua using linked lists, but this would not ve very easy to do, and it would be slow and expensive in terms of computational resources. simpleminded structure for the Panlingua atom would be as follows:
typedef struct { /* the Panlingua atom */ unsigned char sib; /* number of atoms in subtree */ unsigned char lex; /* lexlink type */ unsigned char syn; /* synlink type */ unsigned tag; /* semnod identifier */ } atm_t;A Panlingua system using this atom structure could handle up to 65535 unique words and 65535 unique semnods, yet every atom would only be five bytes long. Thus an array containing a single megabyte of these structures could hold 69,905 three-word assertions or 104,857 two-word assertions in a form that could be traversed at very high speeds.
But how can Panlingua representations, which are known to be three-dimensional structures of links and nodes, be held in such an array? In my implementation of Panlingua I use a trick. It works like this:
Every atom that has a sib value of 0 is the terminator atom of an array. Every atom that has no dependents will have a sib value of 1. Every atom that has one or more dependents will have a sib value greater than 1. For every atom that has one or more dependents, its first dependent will be the next atom in the array. If an atom has a sibling to the right, that sibling will be located at the address of the current atom plus sib. Thus if the 5th atom of an array has a sib value of 4, and if that 5th atom has a sibling, then its sibling will be the 9th atom in the array. Every atom located at the location of the current atom plus its sib value is either the sibling of the current atom or else an atom of higher rank that is not the regent of the current atom.
Thus, ignoring synlink and lexlink types, we might represent the following sentence int the following way:
John loves Mary
3 loves (LOVES heads a subtree containing three atoms)
1 John (JOHN is an atom with no dependents and IS a dependent of loves)
1 Mary (MARY is an atom with no dependents and a right sibling of JOHN)
Notice that even though MARY has no siblings to the right, it still has a sib value of 1. Because sib values indicate the number of atoms in the subtree and not the location of the next sibling, it is necessary to take appropriate programming precautions to make sure that, say, the next atom to the right of Mary will not be construed as a right sibling. This is done using an "upper limit index," which always points tot he next atom past the current subtree. Thus if a process is traversing a subtree at a lower level than its head atom, and the index of any atom plus its sib value equals the upper limit index, then the atom at that calculated position is not a sibling to the right of the current atom.
All this may seem rather awkward, but by arranging Panlingua atoms according to dependency in this way great savings can be achieved in terms of computational resources because it is not necessary to store an identifying integer for the regent or left sibling of the current atom in the Panlingua atom structure. Recall that each Panlingua atom, or word, employs two links. The target of the lexlink is identified by the tag field. Lexlink type is given by the lex field. And synlink type is given by the syn field, but no synlink target need be identified because that target is just the regent or left sibling of the current atom.
Thus each atom in a Panlingua array points just beyond the remaining atoms in its subtree by means of its sib value.
This arrangement lends itself readily to fast serch, match, swap, move, and other routines, many of which have been made the property of the public domain. At first the programmer may be bewildered by the seeming complexity involved, but after some familiarization it is not difficult to gain a reliable feel for where things should be when atoms have been arranged in this way. Mastering this system is crucial for those wishing to write parsers, text generators, and other algorithms for Panlingua.
Unfortunately, real-life implementations of Panlingua cannot use an atom structure as simple as the one given above. Here is the structure I have used for Panlingua in a system currently under development in Honolulu:
typedef struct { /* The Panlingua atom/word structure */ unsigned char sib; unsigned neg: 1; /* begin lexlink type */ unsigned ppt: 1; /* noun: property, verb: stative */ unsigned aux: 2; unsigned lex: 4; /* end lexlink type definition fields */ unsigned char syn; /* synlink type */ unsigned long tag; /* semnod identifier */ } atm_t;In the first place the tag value has been expanded to a long, which makes it possible to create literally billions of semnods and have literally billions of words in the lexicon. But even with this improvement, the tradeoff in size is only two bytes, in other words the length of the atom has been expanded from five to seven bytes. I could have used a three-byte integer, which would probably have allowed more than enough words and semnods to cover any language in the world, but I decided to go ahead and use a long so as to avoid any possible difficulties that might arise from an oddball length integer in times to come.
The sib and syn fields have been left as they were, but the lex field has undergone many changes. This has been done because many components must be packed into lexlink type. For example, is the semnod being accessed negatively, as in "unhappy" versus "happy." If so the neg field will be YES. The ppt field is YES for property nouns and stative verbs. The aux field serves many purposes. For nouns it tells number. For verbs it tells tense. For adjectives it tells comparative or superlative. General lexlink type is thus confined to only four bits, but as it turns out these are sufficient for aspect and the other mutually exclusive types that occur.
As you can see, this whole chapter has only been about computer implementation, and none of the things I have written here really say anything about the theoretical framework of Panlingua, or how it can or cannot be used, etc. I have only included this chapter to help those who find the computer implementation to be difficult, and in order to keep people writing computer programs for Panlingua from running into compatibility problems later on. If there are better ways of doing some of these things, then we should discuss and implement them together. Please remember that compatibility between systems is crucial if we are to create systems that can share.