The next series in my data visualization project was on how computers understand text. As humans, we read text as words and groups of words that we call phrases. In English and many other languages, these are made up of characters that we call letters. Computers see text as individual characters, each of which is actually just represented by a number. For this series, I wanted to illustrate the difference in how we interpret language and show the benefits of reading it like computers do.
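To make the "characters are just numbers" point concrete, here is a minimal Python sketch. It is an illustration only, not the code behind the images in this post, and the helper name char_codes is made up for the example.

```python
def char_codes(text):
    """Return the numeric code point a computer stores for each character."""
    return [ord(ch) for ch in text]

# The word "language" as a computer sees it.
print(char_codes("language"))
# → [108, 97, 110, 103, 117, 97, 103, 101]
```

Note that both 'a' characters map to the same number, 97, which is the smallest value in the list.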
Basic Visualization Explanation
In this visualization, each word is represented by a curving line. As the line travels along the image, it flows vertically based on the letters in the word. For example, here is the word “language.”
The alphabet is organized with ‘a’ at the bottom and ‘z’ at the top. The letters are at the flat sections or “plateaus” on the line. In this visualization, you can clearly see the two ‘a’ characters as the lowest two points in the diagram.
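One way the line described above could be built is sketched below. This is an assumption-laden reconstruction, not the original drawing code: the function names, the plateau and ramp widths, and the cosine easing between letters are all choices made for this example.

```python
import math

def letter_heights(word):
    """Map each letter to a row: 'a' -> 0 (bottom) up to 'z' -> 25 (top)."""
    return [ord(ch) - ord("a") for ch in word.lower() if "a" <= ch <= "z"]

def curve_points(word, plateau=1.0, ramp=1.0, steps=8):
    """Return (x, y) points: a flat plateau per letter, eased ramps between letters."""
    heights = letter_heights(word)
    points = []
    x = 0.0
    for i, h in enumerate(heights):
        # Flat section where the letter "sits".
        points.append((x, float(h)))
        x += plateau
        points.append((x, float(h)))
        # Smooth transition to the next letter (assumed cosine easing).
        if i + 1 < len(heights):
            nxt = heights[i + 1]
            for s in range(1, steps):
                t = s / steps
                ease = (1 - math.cos(math.pi * t)) / 2
                points.append((x + ramp * t, h + (nxt - h) * ease))
            x += ramp
    return points

print(letter_heights("language"))
# → [11, 0, 13, 6, 20, 0, 6, 4]
```

The two zeros in the output are the two 'a' plateaus at the bottom of the diagram. The points from curve_points could be handed to any plotting library to draw the line.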
The English Dictionary
The first set of words I visualized was the English dictionary. I used the Merriam-Webster dictionary and set a maximum word length of 28 characters (the length of the word antidisestablishmentarianism).
This is the exact same format as the first image. I made each line a bit transparent so that you can see them stack up along popular letters. This is why it also appears to fade out. If you click on the image and zoom in, you can see the longer words almost all on their own on the right side of the image.
There are some interesting patterns that we can see in this image. Below, I have highlighted the positions showing vowels with red lines and the position of ‘s’ with a green line.
We can see that these letters are popular based on the concentration of lines going to and from them. Another very interesting phenomenon is that of the letter ‘q’. The middle letter below is ‘q’, and we can see very clearly that almost every time it is used, it is followed by ‘u’.
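The same "q is followed by u" pattern can be checked numerically by counting which letter comes right after a given letter. The sketch below is illustrative: following_letters is a made-up helper, and the tiny word list stands in for a real dictionary.

```python
from collections import Counter

def following_letters(words, letter):
    """Count which characters appear immediately after `letter` across all words."""
    counts = Counter()
    for word in words:
        w = word.lower()
        # Walk over adjacent character pairs (a, b) in the word.
        for a, b in zip(w, w[1:]):
            if a == letter:
                counts[b] += 1
    return counts

# Tiny sample list; a real run would use the full dictionary.
sample = ["queen", "quick", "Iraq", "qat"]
print(following_letters(sample, "q"))
```

A word-final 'q' (as in "Iraq") has no following letter, so it does not add to any count.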
The Declaration of Independence
The next set of words was the Declaration of Independence. This is where we get into the actual usage of the English language.
This is at a different scale than the above visualizations because we don’t actually use 28-letter words in normal writing or talking. Another difference is an added row at the top of the image that denotes punctuation. With this row, we can see things like the length of the last word of each sentence. In the Declaration of Independence, sentences tend to end with 5 to 8 letter words. There are also no words that start with x, y, or z. The letter ‘q’ is almost never used and neither is the letter ‘j’. Words actually end in the letter ‘y’ fairly often.
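The "length of the last word of each sentence" observation can also be measured directly from the text. This is a rough sketch under simple assumptions: sentences end at '.', '!' or '?', and a word is a run of unaccented ASCII letters. The function name last_word_lengths is hypothetical.

```python
import re

def last_word_lengths(text):
    """Return the length of the final word in each sentence of `text`."""
    lengths = []
    # Assumption: '.', '!' and '?' are the only sentence terminators.
    for sentence in re.split(r"[.!?]+", text):
        words = re.findall(r"[A-Za-z]+", sentence)
        if words:
            lengths.append(len(words[-1]))
    return lengths

print(last_word_lengths("We hold these truths. All men are created equal!"))
# → [6, 5]
```

Trailing text without a terminator is treated as a sentence too, and abbreviations containing periods would be split incorrectly, so a real document may need a more careful splitter.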
We can also compare different sets of words. Here is the Declaration of Independence (white lines) and the English dictionary (black lines):
From this, we can see that we don’t usually use the last quarter (top) of the alphabet much and that, in actual writing, we don’t often use longer words.
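To put a number on how little the top of the alphabet is used, one option is to compute the share of letters in a text that fall in a chosen set. In the sketch below, treating 't' through 'z' as the "last quarter" is an assumption made for the example, and letter_share is a made-up helper.

```python
from collections import Counter

def letter_share(text, letters):
    """Fraction of the a-z letters in `text` that belong to `letters`."""
    # Only unaccented a-z is counted; accented letters are ignored here.
    counts = Counter(ch for ch in text.lower() if "a" <= ch <= "z")
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(counts[ch] for ch in letters) / total

# Assumed definition of the "last quarter" of the alphabet.
LAST_QUARTER = "tuvwxyz"
print(letter_share("zebra", LAST_QUARTER))
# → 0.2
```

Running this on the dictionary word list and on the Declaration text would give two shares to compare side by side.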
With comparison of different texts, the next step is comparing languages. I used the Declaration of Independence (English: blue) as my reference document and translated it into Somali (green) and French (red).
The first pattern I noticed in this was the French use of an apostrophe after the first letter of a word. If you look at the second position (second plateau) and look at the top of the image, there is a collection of red lines going to the row that represents punctuation. This is the visual representation of French words like d’Amerique, l’abolir, and l’ont. This punctuation pattern is uncommon in many languages, so it stands out in this visualization.
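A quick way to find the words behind that cluster of red lines is to look for words whose second character is not a letter or digit. The helper below is a hypothetical sketch; it treats any non-alphanumeric character as punctuation, which covers both the straight and the curly apostrophe.

```python
def words_with_punctuation_at(words, index):
    """Return the words whose character at `index` is not a letter or digit."""
    return [w for w in words if len(w) > index and not w[index].isalnum()]

# Index 1 is the second position, i.e. the second plateau.
sample = ["d’Amerique", "l'abolir", "les", "a"]
print(words_with_punctuation_at(sample, 1))
```

Words shorter than the requested position are skipped rather than raising an error.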
Next, if you look at the bottom of the visualization, you can see that there is a heavy concentration of green lines leading to ‘a’. This is the visual representation of the commonality of the letter ‘a’ in the Somali language.
One final pattern is related once again to the letter ‘q’. In the Somali language, the letter ‘q’ is not always followed by ‘u’. In the image below, the blue and red lines (English and French) follow the “qu” pattern but the green line (Somali) does not.
I wrote this visualization to show how computers look at language, but I have realized that it can also help us understand our language as a whole and its relation to other languages. Using patterns found in different languages, and even in the writing of different authors, we can use this sort of visualization to recognize writing styles and languages visually.
Just for fun, here is a visualization of this blog post:
(Guess what the longest word is...)
Comment below if you see any other interesting patterns that I didn’t point out!