Natural Language Processing, simply

Natural Language Processing (NLP) got a lot of attention in 2023. OpenAI's large language model (LLM) managed to make a revolution of sorts. But what is NLP?

In this article we are going to look into the basic concepts behind NLP in the simplest of terms and reflect on the similarities and differences between reading by humans and by machines. We shall focus on the problem setting rather than the solutions for the common NLP stages: sentences, tokens, named entities, normalisation, linking and negation of entities in a typical NLP pipeline.

What is text for machines

The natural first question to ask oneself is -- what is text? In other words, if we want a machine to solve something, we need to be able to precisely define what it is to solve. The question is actually quite deep, the kind we don't usually ask ourselves because we assume the answer to be self-explanatory. We know that we use text to pass information between humans. We can tell that it is a sort of graphical encoding of information from which other humans can reconstruct that information. We also know that for most modern human languages the written form clearly implies a phonetic/audio signal (the graphics of the text we see can be converted into the information we hear, which in its turn can be related to what was actually encoded in the sound). But I would like to stop here, hopefully leaving my dear readers with the unsatisfied desire to understand what information we are actually passing human to human, and how it happens that we understand each other.

Instead, let's have a look at how we store that graphical encoding in computers. To simplify things, let's focus on so-called text formats (even though we cannot clearly say what text or a word is yet) and skip the actual graphic formats (like jpeg, png, pdf). After all, to work with image formats we would first need to extract the text from those files.


Hopefully it is no secret that at the very bottom, computing is all about contrasts between high and low voltages; voltages being equated with binary numbers; binary numbers being convertible to other formats like decimal. To simplify the computer's life, we don't bother it with how a given letter looks (not in the text anyway). We simply differentiate between characters by associating each with a number: at the computing bottom the number can be turned into binary voltages, and at the upper side of what we see on the screen we can associate a constellation of pixels with that number.

The table associating characters with numbers is called an encoding, and ASCII is the simplest of them. We are going to use it for our examples below. The important thing to make a mental note of is that when we talk about text inside a program, the text has no visual or audio connotations. The computer sees text as binary numbers it can distinguish and compare. So let's have a look, in the next section on sentences, at how little information that actually is.
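
To make that concrete, here is a minimal sketch in plain Python (the sample string is made up for illustration) of how a piece of text looks once each character is replaced by its ASCII number:

```python
# A minimal sketch: each character of the text is mapped to its ASCII number.
# This list of numbers is all the machine ever "sees".
text = "A smith forged a blade.\r\nMerchants bought the ore"

codes = [ord(ch) for ch in text]
print(codes[:8])   # [65, 32, 115, 109, 105, 116, 104, 32] -- 'A', space, 's', 'm', ...

# The mapping also covers the invisible characters: '.' is 46,
# the line break is the pair 13 (carriage return) and 10 (line feed).
print(ord("."), ord("\r"), ord("\n"))   # 46 13 10

# Going back to "graphics" is just the reverse lookup.
print("".join(chr(n) for n in codes))
```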


Sentences

As we have mentioned earlier, for humans text is a means to encode information and pass it through the human-to-human network in its encoded form. Naturally, the encoding has certain rules so that each node of this distributed human network can process that encoded information. We know those rules as the grammar of the language, but we rarely think about why it exists in the first place. Fortunately, we don't need to think about it now -- just make a mental note that as humans we assume the naturalness of grammar in the language and of the rules around communication.

Sentences are a means to cluster words in a particular way, so that the human receiver can reconstruct the information from that cluster and fit it into the wider body of decoded information from the overall text. Therefore what humans think of as sentences is dictated by a variety of things -- from physiology and physical limits on the brain's compute capacity all the way up to the cultural environment forming the way individuals in a given society are supposed to behave. Those are all very curious topics, but the only thing I would like to underline here is that human sentences do not care about what is good or easy for machines.

And it happens that machines have physical limitations which we can measure, simply because they are built completely mechanically. As we are talking data science here, essentially we are generating a program consisting of if/else trees which can be roughly imagined as the graph above. To be efficient, we want to put those models into a parallel computing system like a GPU or another specialised unit. As the number of dedicated parallel computing units is finite, we need to limit the size of the chunk of information we give to them. In the case of NLP, some of the large language model frameworks put a limit on the number of words in a sentence -- just to make it super clear that there is no free lunch.

In other words, the reason we split things into sentences for machines is the same as why we do it as humans -- the limitations of the hardware/brainware in keeping and processing all of the information at once in working memory. However, the differences between human and machine computing technology mean that there are edge cases in what a sentence means for each.
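
As a minimal sketch of that hard limit (the cap of 8 words is an arbitrary number picked purely for illustration, not a value from any particular framework):

```python
# A minimal sketch of a hard limit on chunk size, of the kind parallel
# hardware forces on real frameworks. max_words=8 is an arbitrary value.
def clip_to_limit(words, max_words=8):
    return words[:max_words]   # everything past the limit is simply dropped

sentence = "the smith forged a blade for the merchant who paid in ore".split()
print(clip_to_limit(sentence))
# ['the', 'smith', 'forged', 'a', 'blade', 'for', 'the', 'merchant']
```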

When humans read text, we decode the information automatically, and it is hard for us to differentiate between the steps we take to do it. We have punctuation rules about dots, which say that sentences should be separated by dots. But if a dot is not present, we automatically fall back to the grammar of the sentence to place the border between sentences virtually in our mind. We are also used to the "up" and "down" dimensions when reading text (the sentence above, the sentence to the right).

For the machine, text is a one-dimensional chain of numbers. In the example above you can see a written text and red lines showing how we as humans split it into sentences. On the right-hand side you can see the exact same text in ASCII encoding as it is seen by the machine, with the blue lines indicating where we expect it to finish sentences. While in that example the numbers sit inside a two-dimensional box, the machine can only move left or right in it.

So in this example, numbers "13" and "10" stand for the end-of-line and beginning-of-line characters in the text. As I was fitting the text on the left into the box, I needed to hit enter a few times -- this is how it ends up being encoded in computer-text. Number "46" stands for the dot.

Note that the dot after the second sentence is missing, and we as humans have no problem reading it. However, it leaves the machine with an ill-defined problem to solve: in the first sentence there is an end-of-line character in the middle just because some meat-bag wished so; in the second sentence the machine is expected to guess that the sentence ends even though there is no dot. In other words, the machine is expected to resolve a duality where in one case end-of-line means the end of a sentence and in the other it does not. It is one of the reasons why so many NLP applications of the past stayed inside academia's walls -- perfect humans exist only in books, and real-life applications are not book-like.

So the sentence layer of NLP applications is expected to create boundaries between chains of numbers without understanding what those numbers represent at this stage. Under the correct-input assumption, this stage can be solved by a very simple rule of adding boundaries after specific characters. The further we go into the wild of scanned documents, chat-bots and voice assistants, the more edge cases the ML model is expected to solve. However, from the information theory perspective there is an obvious limit on how well such an ML model can do: after all, we as humans also need to fall back to grammar and meaning to draw boundaries between sentences when what we see makes little sense.
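
Under the correct-input assumption, that "very simple rule" is literally a few lines. A minimal sketch (plain Python regular expressions, with a made-up sample string):

```python
import re

# A minimal sketch of the "correct input" rule: collapse line breaks
# (the 13/10 characters) into plain whitespace, then put a sentence
# boundary after '.', '!' or '?'.
def split_sentences(text):
    flat = re.sub(r"\s+", " ", text).strip()
    return [s for s in re.split(r"(?<=[.!?])\s+", flat) if s]

sample = "The smith forged\r\na blade. Merchants bought the ore"
print(split_sentences(sample))
# ['The smith forged a blade.', 'Merchants bought the ore']
# The second sentence is recovered only because it is the last one --
# a missing dot in the middle of the text would slip through this rule.
```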

Tokens

So let's say we have our machine sentences: chains of numbers which we assume to have a meaning. We still don't have the thing that we as humans call "words". So what is a word exactly? Oh, forget it. Let's not be exact -- we cannot even tell what text is. Let's just say that a word is a permutation of characters from the alphabet. And that just like there is a way for humans to agree on the writing and sound of a single character, there is some sort of agreement on those for words.

With tokens, we focus on that character-permutation part of the meaning of the word. We want to be able to split the chain of sentence-numbers into chains of token-numbers. Just like wecanreadthis, we split the whole text using white spaces, punctuation marks and the dictionary of words in our memory. In the example above, the number "32" is the encoding of the whitespace character. It, as well as the other characters not belonging to tokens, gets removed at this stage.

Essentially, the goal of this stage is to extract a collection of "clean" character-chains which can be recognised later -- justlikewesplitthis first, and then read the already split words to get the meaning. Sounds simple enough, one might think: just split by punctuation marks and whitespace, then check the dictionary. The thing is that language evolves, new words appear, old ones disappear; and there is the wild of typos and mis-scanned characters. Dost thou heed?

Just like with sentences, on the cleanest input text we can use two or three simple rules to get high-quality tokens. And the more we get into the realm of scanned documents and live text, the more we are exposed to the raw data -- the more we see things like wo rds split in halv-es in weeeeird ways, yet we still expect machines to be able to read it just because we can. So it goes.

The last thing to mention is that punctuation marks like dots, commas and brackets are typically extracted as tokens as well. They can be used later as contextual clues within the sentence, among other things. Importantly, we don't want them glued to word-tokens, because that would inflate the number of distinct word-tokens -- something we would like to reduce and avoid if possible, as you will see below.
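
A minimal sketch of such a tokeniser, using nothing but a regular expression (the sample sentence is made up): word characters stick together, each punctuation mark becomes a token of its own, and whitespace disappears.

```python
import re

# A minimal sketch of tokenisation: runs of word characters become tokens,
# every punctuation mark becomes its own token, whitespace is dropped.
def tokenize(sentence):
    return re.findall(r"\w+|[^\w\s]", sentence)

print(tokenize("The smith forged a blade, didn't he?"))
# ['The', 'smith', 'forged', 'a', 'blade', ',', 'didn', "'", 't', 'he', '?']
# Note how "didn't" already produces an awkward split -- real tokenisers
# carry a long list of such special cases.
```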

Named Entity Recognition (NER)

Probably the most hyped and the most "magical" step of NLP, named entity recognition does what its name suggests -- it recognises entities. But what are the entities it is supposed to recognise?

It might seem counter-intuitive at first, but for many applications we don't really need to understand the meaning of a word completely. Even more so for the majority of business applications. Let's have a look at a hypothetical example: some company designs a physical device and makes everybody addicted to using it; as people carry this device all the time, the device has access to an audio feed, from which the company gets a stream of text. The idea of the company would be to use this text to identify what the users of the device desire. To simplify, we can assume that the company has some global shop selling almost everything. So in this purely hypothetical example, the company wants to recognise the categories of goods to feed to its users based on what appears in their discussions. For example, the device would overhear that somebody plans to go to the gym after Christmas -- so the company can advertise gym clothes to the user of the device. In other words, our NLP process needs to identify categories in the text feed that match the shopping categories.

Our text example might not be the most representative, but it should do just fine. Let's say that we want to recognise two categories: things related to smithery and things related to humans. A little more formally, we want to colour the tokens we received in the previous stage purple for the smithery category and green for the human category. In doing this, we assume that our context is limited to that of the sentence.

Probably the biggest mental leap to make is that at this point, in practical terms, characters stop existing from the perspective of the machine. For this stage, tokens are the alphabet. If we go back to our illustration, the token "." is seen as "8". Each token retains its sequential position in the sentence, but otherwise identical token-characters are seen as the same number. So our dot appears as the token "8" at the end of three sentences. To explain how this works, let me show you some magic.

If I asked you "what does dughfsdkjh mean?", you would hopefully answer that the word is some nonsense.
But if I first give you the following text:
"Jack turned to John: so have you tried the fresh dughfsdkjh? They are next to bananas on the stand over there, arrived together with Thai coconuts. You really should try it, so juicy and delicious!"
You cannot help but have an intuitive guess that it is some sort of tropical fruit you haven't heard of yet. We can do this exercise for any type of word -- nouns, verbs, adjectives, you name it. It is somewhat scary to admit though, as it somehow dissolves the mist of illusion that words are well-defined. But we promised not to go there in this article, so let's get back to our focus topic.

Hopefully at this point it is visible why the tokenisation step is essential: in the second sentence of our example there is a typo ("tex" instead of "text"). The typo leads to the word being identified as a different token ("text" is encoded as 7, "tex" as 9). Therefore the model will treat it as a separate entity. True enough, with a proper amount of data the model should be able to see that the two are essentially the same. It is just that those proper amounts of data are pretty much never there.
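
Here is a minimal sketch of that token-to-number step (the sample sentences and the resulting ids are made up; the ids in the illustration above come from a different vocabulary):

```python
# A minimal sketch: after tokenisation each distinct token string gets an id,
# and from here on the model only ever sees those ids.
def build_vocab(sentences):
    vocab = {}
    for sentence in sentences:
        for token in sentence:
            vocab.setdefault(token, len(vocab))
    return vocab

sentences = [
    ["the", "smith", "read", "the", "text", "."],
    ["the", "merchant", "read", "the", "tex", "."],   # typo: "tex" instead of "text"
]
vocab = build_vocab(sentences)
print([[vocab[t] for t in s] for s in sentences])
# "text" and "tex" receive two different ids -- to the model they are
# completely unrelated tokens until enough data teaches it otherwise.
```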

NER models receive ordered collections of tokens, one collection per sentence. These models typically learn from a labelled dataset which entities we want to recognise and which category each should be classified into. Model training generates some sort of a graph which ends up classifying tokens based on their context in the training data. There is much more to it, but hopefully we can now see how entities get recognised and tokens get coloured. As output we have collections of tokens matched to the categories we wanted.
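
To show just the shape of the input and output, here is a toy stand-in where a hard-coded lookup plays the role of the trained model (the word lists are made up; real NER models classify each token from its context, not from a word list):

```python
# A toy stand-in for a trained NER model: one label per token, sentence by
# sentence. A hard-coded lookup replaces the learned classifier here purely
# to show what goes in and what comes out.
SMITHERY = {"anvil", "ore", "blade", "forge"}     # the "purple" category
HUMAN = {"smith", "merchant", "jack", "john"}     # the "green" category

def toy_ner(tokens):
    labels = []
    for token in tokens:
        word = token.lower()
        if word in SMITHERY:
            labels.append("SMITHERY")
        elif word in HUMAN:
            labels.append("HUMAN")
        else:
            labels.append("O")                    # not an entity we care about
    return labels

tokens = ["The", "smith", "sold", "the", "blade", "to", "a", "merchant", "."]
print(list(zip(tokens, toy_ner(tokens))))
```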

Negation (recognising that an entity is negated, as in "I don't want to exercise after Christmas" or "I can't stand the smell of a gym ever in my life") can be seen as a special case of NER. Indeed, negation in whatever form it takes is context for the words we attempt to recognise. One simple way to think about negation is that we can create an independent "negated"/"non-negated" category. So our NER model would first colour tokens purple and green (identify that somebody talks about the gym), our negation model would identify which tokens were negated -- and finally we can see all the gym tokens being negated. At least that's the basic idea.
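
A minimal sketch of that layering (the cue list, the category names and the pretend NER output are all made up for illustration):

```python
# A minimal sketch of negation as an independent layer of labels: one model
# colours entities, another marks tokens inside a negated span, and the two
# outputs are simply combined token by token.
def toy_negation(tokens, cues=("don't", "can't", "not", "never")):
    negated, flags = False, []
    for token in tokens:
        if token.lower() in cues:
            negated = True          # crude: everything after a cue counts as negated
        flags.append(negated)
    return flags

tokens = ["I", "don't", "want", "to", "exercise", "after", "Christmas"]
ner_labels = ["O", "O", "O", "O", "GYM", "O", "DATE"]   # pretend NER output
for token, label, negated in zip(tokens, ner_labels, toy_negation(tokens)):
    print(token, label, "NEGATED" if negated else "-")
```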

Normalisation

This is probably one of the more technical steps, the kind for which drawing an illustration is complicated. After the NER step above, one might expect that we have successfully recognised an entity as belonging to some concrete category; that recognising the category was what we were after; that we have found our grail and the search is now complete.

The practical reality is that everybody would like to extract as much information from the text as possible, but it is difficult to have both: extracting all of the data we want and keeping good model accuracy. Typically, there are limits on how many categories our NER model can recognise well for the provided amount of training data. There are also performance and model size limitations on top of that.

So what happens is that we divide and conquer: the NER model colours tokens in the few colours we asked for. A normalisation model follows, loads the coloured tokens and further classifies them into something more specific. To give an example: the word "mine" classified as "ownership" normalises differently from "mine" classified as "extraction". Or another example: the NER model segregates things into types of trash (food, recyclable, glass); normalisation at the glass recycling station divides the glass into white and coloured.

Typically the task of the normalisation step is to relate recognised entity tokens to some ontology. The NER model recognises the category, like "clothing" or "restaurants". The normalisation model, based on the category and the token text, can then further specify it as "pajamas" or "british"/"french".
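
A minimal sketch of that step, with a tiny made-up ontology (real systems use fuzzy matching or another trained model instead of a plain dictionary):

```python
# A minimal sketch of normalisation: given the NER category and the token
# text, map the entity onto a (made-up) ontology identifier.
ONTOLOGY = {
    ("clothing", "pajamas"): "clothing/sleepwear/pajamas",
    ("clothing", "pyjamas"): "clothing/sleepwear/pajamas",   # spelling variant, same concept
    ("restaurants", "bistro"): "restaurants/cuisine/french",
}

def normalise(category, token):
    return ONTOLOGY.get((category, token.lower()), "unknown")

print(normalise("clothing", "Pyjamas"))     # -> clothing/sleepwear/pajamas
print(normalise("restaurants", "bistro"))   # -> restaurants/cuisine/french
```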

Concept linking

Last but not least comes the linking of concepts. Just like recognition of concepts was counter-intuitive (we recognise and classify everything we read automatically), linking is just as counter-intuitive. The very language we speak follows some grammar, and this grammar implies links between concepts. In English sentences, as the baseline, we recognise "who does what", with the obvious dependency between the two. Beyond sentences we follow the context of the written text; for example, while reading this sentence we keep in the back of our minds that it is about concept linking, NLP and so on.

Hopefully it comes as no surprise that it is very difficult to track all of a text's dependencies in a structured way. And that it is difficult to keep accuracy reasonable. And that it takes a lot of compute and memory to process all of that.

Hence, many business applications focus their attention on linking only what is needed. What might one need to link, you may ask? Well, the very hypothetical company listening to the feed of your conversation on a hypothetical device might be interested in the date/time references in your speech. If it heard a sentence like "hey Jack, let's eat something downtown tonight around 8pm", we might assume that this company also has a hypothetical map service; and that when you use navigation that night, the company has a chance of earning advertisement money by showing downtown restaurants on the map. A purely hypothetical situation.

In this example, we need to recognise with NER: (a) "restaurants" in the entertainment category, (b) "tonight" in the relative time-reference category, (c) "8pm" in the time category, (d) "downtown" in the relative location category. Normalisation would then convert these to machine-friendly formats. Finally, we link it all together and can pass it, in a structured format, together with metadata from the audio sample (GPS location, time, average spending) to the recommendation sub-system of our hypothetical company. This way we have stripped raw human speech of personality, feelings, hopes and aspirations, down to the only thing which matters -- how much, when and where they are ready to pay for what.
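
A minimal sketch of what that linked, structured output might look like (all field names and the time-resolution rule are invented for illustration):

```python
from datetime import datetime

# A minimal sketch of the linked, structured record for the "dinner downtown
# tonight around 8pm" example. Field names and resolution rules are made up --
# the point is only that free speech ends up as a machine-friendly record.
def resolve_relative_time(reference, now):
    if reference == "tonight":
        return now.replace(hour=20, minute=0, second=0)   # "around 8pm"
    return now

now = datetime(2024, 1, 5, 17, 30)
record = {
    "category": "restaurants",
    "location": "downtown",        # relative; resolved later against GPS metadata
    "time": resolve_relative_time("tonight", now).isoformat(),
    "source_metadata": {"gps": None, "avg_spending": None},
}
print(record)
```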


Now that it is clear what linking solves, we can return to our original example. Since we don't need that many links, most of the tokens are not linked to anything. There are techniques using sparse matrices (matrices mostly filled with zeros) which allow relatively efficient processing and storage of the links in the text by the model. For instance, a link from "merchants" to "ore" is encoded as an 'x' from green token 17 to token 10.
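
A minimal sketch with scipy's sparse matrices, assuming a toy text of 20 tokens so that the indices 17 and 10 from the illustration fit (only that one cell is non-zero):

```python
import numpy as np
from scipy.sparse import lil_matrix

# A minimal sketch of a sparse link matrix: one row and one column per token,
# a non-zero cell at (i, j) means "token i links to token j".
n_tokens = 20                                   # toy size, just to fit the example
links = lil_matrix((n_tokens, n_tokens), dtype=np.int8)
links[17, 10] = 1                               # the "merchants" -> "ore" link

print(links.nnz)                                # 1 stored value out of 400 cells
coo = links.tocoo()
print([(int(r), int(c)) for r, c in zip(coo.row, coo.col)])   # [(17, 10)]
```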

Typically the model receives a human-labelled dataset to train on. Once trained, it can output the sparse links we are interested in for the processed text. The combined structured output of normalised entities linked together is then ready for consumption by other systems.

Afterword

We have barely scratched the surface of NLP, but hopefully this gives a taste of the problems it actually aims to solve. Rule-based semi-automation of text reading has existed for a long time; however, it is the development of deep learning, dedicated ML chipsets and GPUs that opened the door to the ongoing boom of ML/AI.

In NLP, we are faced with the ongoing evolution of language (new words appearing, words shifting their meaning and so on), as well as the differences in how information is encoded in different languages. In other words, to stay valuable NLP systems need to evolve together with the language and the concepts they are working on. Hence one of the most basic problems: having sufficient amounts of relevant, clean data to train models. Shakespeare wrote many fine works, but a model trained that way might have difficulty understanding today's screenager.

The stages described above need to be solved by any language processing system (including ourselves); only the form and tradeoffs of the solution might differ. NLP pipelines like the one above have the benefits of transparency and accountability (it is easy to show each model in isolation, what data it was trained on, what bias it had, how it performed, and which stage suffers the most). But the very manual division we introduce as NLP pipeline designers already limits the pipeline's capabilities.

LLMs, on the other hand, can take advantage of vast, highly parallel compute resources (beyond single compute nodes with GPUs) and learn deeper. However, we encounter a problem similar to dealing with humans -- how do we know whether the speaker knows what they are talking about, or is just very good at saying the words we want to hear?

If you are interested in NLP, proofs of concept are fortunately very simple to make for ML/AI problems. For the most part it is a matter of selecting the framework which will do the magic for you and boilerplating it all together. OpenNLP offers most of the steps discussed above, together with things like sentiment recognition. You can ask for your face to be hugged on Hugging Face with their NLP solutions, although be prepared that model sizes are bigger with BERT. You can check xgboost for solving the sparse-matrix problems on the concept linking side. This article is very likely to be outdated by the time you read it, though.
