![](https://crypto4nerd.com/wp-content/uploads/1oZffpu8VTmcIphBFRdBX3A.png)
In my last post, I talked about how to convert a sentence into tokens ready for further analysis.
Now I am going to talk about how to convert those tokens into a graph.
To do so, I need the following helper functions.
This post is a bit long because of the helper functions, so I will go through them one by one.
A. Tokeniser
I have two different tokeniser functions. One is ‘spanner’, which you will find in my last post. The other, which I present here, is called ‘dynamicTokenizer’. The difference between them rests on how they recombine raw tokens into meaningful entities.
‘Spanner’ uses the direct output of the NER processor, which is ‘bionlp13cg’. ‘dynamicTokenizer’, on the other hand, uses my custom-built function ‘consolidatedpassV2’. This function performs the same task as the Stanza built-in NER, but it relies on a set of rules instead of a neural network-based processor. The rules are common rules on how English constructs a meaningful phrase (e.g. a noun phrase or an adjectival phrase), so the function is deterministic. It has to be used together with ‘dynamicTokenizer’ because at each pass the part-of-speech (POS) tags of the tokens change, which can trigger another recombination. The process continues until it reaches a steady state where the number of tokens no longer changes.
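The fixed-point loop can be sketched as follows. This is a minimal illustration, not the actual implementation: `consolidate_pass` below is a hypothetical stand-in for ‘consolidatedpassV2’ that applies just one rule (merge an adjective into the noun that follows it), but it shows why repeated passes are needed until the token count stabilises.

```python
def consolidate_pass(tokens):
    """One rule pass: merge (ADJ, NOUN) pairs into a single NOUN token.
    A stand-in for the author's 'consolidatedpassV2' rule set."""
    merged, i = [], 0
    while i < len(tokens):
        text, pos = tokens[i]
        if pos == "ADJ" and i + 1 < len(tokens) and tokens[i + 1][1] == "NOUN":
            # The merged token is re-tagged as NOUN, so the *next* pass
            # may merge a preceding adjective into it as well.
            merged.append((text + " " + tokens[i + 1][0], "NOUN"))
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def dynamic_tokenize(tokens):
    """Re-run the rule pass until the number of tokens stops changing."""
    while True:
        new_tokens = consolidate_pass(tokens)
        if len(new_tokens) == len(tokens):  # steady state reached
            return new_tokens
        tokens = new_tokens

tokens = [("acute", "ADJ"), ("myeloid", "ADJ"),
          ("leukemia", "NOUN"), ("progresses", "VERB")]
print(dynamic_tokenize(tokens))
# → [('acute myeloid leukemia', 'NOUN'), ('progresses', 'VERB')]
```

Note that a single pass would only produce ‘myeloid leukemia’; it takes a second pass, after the merged token is re-tagged, to absorb ‘acute’ as well. That is the steady-state behaviour described above.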
The ‘sandbox’ function is another helper that displays the tokenisation process of ‘dynamicTokenizer’ pass by pass, along with its final output. It helps me visualise the result and adjust the functions to fit my purpose if I don’t like what I see.
B. Rules to connect tokens
Now we have a tokenised sentence. We need some rules to guide us in constructing a network that reflects the underlying syntax. There are two guiding principles I employ when constructing the graph.
1. The linear order of the words
2. The dependency between two words
The first is pretty straightforward. It is just the order of the words in an English sentence, reading from left to right. Nothing fancy.
The second relies heavily on dependency grammar [1]. Dependency-based parsing differs from constituency-based parsing [2], in which each token, as a node, is classified as either terminal or non-terminal. Dependency-based parsing does not differentiate nodes in this way; the structure is determined by the relation between a word (a head) and its dependents. Dependency structures are flatter than phrase structures, in part because they lack a finite verb phrase constituent. That is why I choose the output of dependency-based parsing to construct the graph: I want to provide a way for one token to jump to another token while ignoring the order of words in the sentence. I borrow this idea from the PageRank algorithm [3].
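The two principles above can be sketched as edge-construction rules. The dependency parse here is hard-coded for illustration; in practice the head–dependent pairs would come from a dependency parser such as Stanza’s ‘depparse’ processor, and the actual graph-building code is the subject of the next post.

```python
words = ["TP53", "suppresses", "tumour", "growth"]
# Hard-coded (head index, dependent index) pairs, 0-based;
# 'suppresses' is the root of this toy parse.
dependencies = [(1, 0), (1, 3), (3, 2)]

edges = set()

# Rule 1: linear word order, reading left to right.
for i in range(len(words) - 1):
    edges.add((words[i], words[i + 1]))

# Rule 2: head -> dependent links from the dependency parse,
# letting a token "jump" to a related token regardless of position.
for head, dep in dependencies:
    edges.add((words[head], words[dep]))

for src, dst in sorted(edges):
    print(f"{src} -> {dst}")
```

The dependency edges are what make the graph more than a chain: ‘suppresses’ links directly to ‘growth’ even though two words separate them in the sentence, which is the jump behaviour borrowed from the PageRank intuition.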
Intermission
I have covered in detail the steps to break a sentence down into tokens ready for the construction of a graph.
In the next post, I will cover the graph constructed from the rules and processes set out here.
Ciao for now and stay tuned.