Tuesday, November 29, 2016

The Critical Importance of Simplicity

This is a post by Luigi Muzii that was initially triggered by this post and this one, but I think it has grown into a broader comment on a key issue related to the successful professional use of MT i.e. the assessment of MT quality and the extent, scope, and management of the post-editing effort. Being able to get a quick and accurate assessment of the specific quality at any given time in a production use scenario is critical, but the assessment process itself cannot be so cumbersome and so complicated a process that the measurement effort becomes a new problem in itself.

While we see that industry leaders and academics continue to develop well meaning but very difficult to deploy (efficiently and cost-effectively) metrics like MQM and DQF, most practitioners are left with BLEU and TER as the only viable and cost-effective measures. However, these easy-to-do metrics have well-known bias issues with RbMT and now with NMT. And given that this estimation issue is the “crux of the biscuit” as Zappa would say, it is worth ongoing consideration and review as doing this correctly is where MT success is hidden. 

Luigi's insistence on keeping this measurement simple sometimes makes him unpopular with academics and industry "experts", but I believe that this issue is so often at the heart of a successful and unsuccessful MT deployment that it bears repeated exposure and frequent re-examination as we inch our way to more practical and useful measurement procedures than BLEU, which continues to confound discussions of real progress in improving MT quality.

“KISS - Keep it simple, stupid” is a design principle noted by the U.S. Navy in 1960 stating that most systems work best if they are kept simple rather than made complicated.

The most profound technologies are those that disappear.
Mark Weiser
The Computer for the Twenty-First Century, Scientific American, 1991, pp. 66–75

The best way to understand the powers and limitations of a technology is to use it.

This can be easily shown for any general-purpose technology, and machine translation can now be considered as such. In fact, the major accomplishment we can acknowledge to Google Translate is that of having popularized widespread translation activity using machine translation, something most celebrated academics, and supposedly influential professional bodies have not been able to achieve after decades of trying.

The translation quality assessment debacle is emblematic i.e. the translation quality issue is, in many ways, representative of the whole translation community.  It has been debated for centuries, mostly at conferences where insiders — always the same people — talk amongst themselves. And the people attending conferences of one kind do not talk with people attending conferences of another kind.

This ill-conceived approach to quality assessment has claimed victims even among scientists working on automatic evaluation methods. Just recently, the nonsensical notion of a “perfect translation” regained momentum. Everybody even fleetingly involved in translation should know that there is nothing easier to show as flawed than the notion of a “perfect translation”. At least according to current assessment practices, in a typical confirmation bias pattern. There is a difference between “acceptable” and “usable” translation and any notion of a perfect translation.

On the verge of the ultimate disruption, translation orthodoxy still dominates even the technology landscape by eradicating the key principle of innovation, simplicity. 
The expected, and yet the overwhelming growth of content has long been going hand in hand with a demand for faster translation in an ever-growing number of language pairs, with machine translation being suggested as the one solution.

The issue remains unsolved, though, of providing buyers with an easy way to know whether the game is worth the candle. Instead, the translation community has been unable so far to provide unknowing buyers with anything but an intricate maze of categories and error typologies, weights and parameters, where even an experienced linguist can have a hard time finding their way around.

The still largely widespread claim that the industry should “educate the client” is the concrete manifestation of the typical information asymmetry affecting the translation sector. By inadvertently keeping the customer in the dark, translation academics, pundits, and providers cuddle the silly illusion of gaining respect and consideration for their roles, while they are simply shooting themselves in the feet.

When KantanMT’s Poulomi Choudhury highlights the importance of the central role that the Multidimensional Quality Metrics (MQM) is supposed to play, in all likelihood she is talking to her fellow linguists. However, typical customers simply want to know whether they have to spend further to refine a translation and — possibly — understand how much. Typical customers who are ignorant of the translation production process are not interested in the kind of KPIs that Poulomi describes, while they could be interested in a totally different set of KPIs, to assess the reliability of a prospective partner.

Possibly, the perverse complexity of unnecessarily intricate metrics for translation quality assessment is meant to hide the uncertainty and resulting ambiguity of theorists and the inability and failure of theory rather than to reassure customers and provide them with usable tools.

In fact, every time you try to question the cumbersome and flawed mechanism behind such metrics, the academic community closes like a clam.

In her post, Poulomi Choudhury suggests setting exact parameters for reviewers. Unfortunately, the inception of the countless fights between translators and reviewers, between translators and reviewers and terminologists, and between translators and reviewers and terminologists and subject experts and in-country reviewers gets lost in the mist of time.

Not only are reviewing and post-editing (PEMT) instructions a rare commodity, the same translation pundits who tirelessly flood the industry with pointless standards and intricate metrics — possibly without having spent a single hour in their life negotiating with customers — have not produced even a guideline skeleton to help practitioners develop such procedural overviews.

Just as implementing a machine translation platform is no stroll for DIY ramblers, writing PEMT guidelines is not straightforward either: it requires specific know-how and understanding, which recalls the rationale for hiring a consultant when working with MT.

For example, although writing instructions for post-editors is a once-only task, different engines, domains, and language pairs require different instructions to meet the needs of different PEMT efforts. Once written, these instructions must then be kept up-to-date as new engines, language pairs, or domains are implemented so they vary continuously. Also, to help project managers assess the PEMT effort, these instructions should address the quality issue with guidelines and thresholds and scores for raw translation. Obviously, they should be clear and concise, and this might very well be the hardest part.

As well as being related to the quality of the raw output, the PEMT effort is a measure that any customer should be able to easily understand as a direct indicator of the potential expenditures to achieve business goals. In this respect, it should be properly described and we should go with tools that help the customer financially estimate the amount of work required to achieve the desired quality level from a machine translation output.

Indeed, the PEMT effort depends on diverse factors such as the volume of content to process, the turnaround time, and the quality expectations for the finalized output. Most importantly, it depends on the suitability of source data and input for (machine) translation.

Therefore, however assessable through automatic measurements, PEMT effort can only be loosely estimated and projected. In this respect, KantanMT is offering the finest tool combination for accurate estimates.
On the other hand, a downstream measurement of the PEMT effort by comparing the final post-edited translation with the raw machine translation output is reactive (just like the typical translation quality assessment practice) rather than predictive (that is business-oriented).

Also, a downstream compensation model requires an accurate measurement of the actual work performed to infer the percentage on the hourly rate from the edit distance, as no positive correlation exists between edit distance and actual throughput.
Nonetheless, tracking the PEMT effort can be useful if the resulting data is compared with estimates to derive a historical series. After all, that’s how data empowers us.
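
The downstream comparison of raw and post-edited output mentioned above can be sketched as a word-level edit distance, normalized by the length of the post-edited segment. This is only a rough, TER-like illustration and not the official TER algorithm (it ignores block shifts); the function names and the sample sentences are made up for the example.

```python
def word_edit_distance(raw_mt: str, post_edited: str) -> int:
    """Levenshtein distance counted in words (insert, delete, substitute)."""
    source = raw_mt.split()
    target = post_edited.split()
    # previous[j] is the distance between the source words processed so far
    # and the first j words of the target.
    previous = list(range(len(target) + 1))
    for i, src_word in enumerate(source, start=1):
        current = [i]
        for j, tgt_word in enumerate(target, start=1):
            cost = 0 if src_word == tgt_word else 1
            current.append(min(
                previous[j] + 1,         # delete src_word
                current[j - 1] + 1,      # insert tgt_word
                previous[j - 1] + cost,  # keep or substitute
            ))
        previous = current
    return previous[-1]


def edit_rate(raw_mt: str, post_edited: str) -> float:
    """Edits per post-edited word; 0.0 means the segment was left untouched."""
    reference_length = len(post_edited.split())
    if reference_length == 0:
        return 0.0 if not raw_mt.split() else 1.0
    return word_edit_distance(raw_mt, post_edited) / reference_length


raw = "the cat sat in the mat"
edited = "the cat sat on the mat"
print(word_edit_distance(raw, edited), round(edit_rate(raw, edited), 3))
```

A figure like this says how much text changed, not how long the change took, which is why it is better kept as a historical series to compare against estimates than used as a direct measure of throughput.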

Predictability is a major driver in any business, and it should come as no surprise, then, that translation buyers have no interest in dealing with the intricacy of quality metrics that are irredeemably prone to subjectivity, ambiguity, and misinterpretation, and, most importantly, are irrelevant to them. When it comes to business — and real money — gambling is never the first option, but the last resort. (KV: Predictability here would mean a defined effort ($ & time) that would result in a defined outcome (sample of acceptable output)).
On the other hand, more than a quarter of a century has passed since the introduction of CAT tools in the professional field, many books and papers have been written about them, and yet many still feel the urge to explain what they are. Maybe this might make sense for the few customers who are entirely new to translation, even though what they could be interested to know is just that providers would use some tool of the trade and spare them some money. And yet, quality would remain a major concern, as a recent SDL study showed.

An introduction to CAT tools is at the very least curious when recipients are translation professionals or translation students about to graduate. Even debunking some still popular myths about CAT is just as curious, unless considering the number of preachers thundering from their virtual pulpits against the hazards of these instruments of the devil.

In this apocalyptic scenario, even a significant leap forward could go almost unnoticed. Lilt is an innovative translation tool, with some fabulous features, especially for professional translators. As Kirti Vashee points out, it is a virtual translator assistant. It also presents a few drawbacks, though.
Post-editing is the ferry to the singularity. It could be run interactively, or downstream on an entire corpus of machine translation output.

When fed with properly arranged linguistic data from existing translation memories, Lilt could be an extraordinary post-editing tool also on bilingual files. Unfortunately, the edits made by a single user only affect the dataset associated with that account and the task that is underway. In other words, Lilt is by no means a collaborative translation environment. Yet.

This means that, for Lilt to be effective with typically large PEMT jobs involving teams, accurate PEMT instructions are essential, and, most importantly, post-editors should strictly follow them. This is a serious issue. Computers never break rules, while free will enables humans to deviate from them.

Finally, although cloud computing is now usual in business, Lilt can still present a major problem to many translation industry players for being only available in the cloud, due to the need for a fast Internet connection, or to the vexed — although repeatedly demystified — question of data protection for IP reasons, and despite the computing resources to process the vast amount of data that would hardly make sense for a typical SME to have.

In conclusion, when you start a business, it is usually to make money, and money is not necessarily bad if you do no evil, pecunia non olet. And money usually comes from buyers, whose prime requirement can be summarized as “Give me something I can understand.”

My ignorance will excuse me.



Luigi Muzii has been in the "translation business" since 1982 and has been a business consultant since 2002, in the translation and localization industry through his firm . He focuses on helping customers choose and implement best-suited technologies and redesign their business processes for the greatest effectiveness of translation and localization related work.

This link provides access to his other blog posts.

Thursday, November 24, 2016

The Thanksgiving Myth

Thanksgiving is fundamentally about giving thanks. Though, according to Wikipedia and what we are generally told in the US, it has associations with Pilgrims, Puritans and being a harvest festival in the US. For Native Americans, the story of Thanksgiving is not a very happy one.

“Thanksgiving” has become a time of mourning for many Native People. It serves as a period of remembering how a gift of generosity was rewarded by theft of land and seed corn, extermination of many Native people from disease, and near total elimination of many more from forced assimilation. As celebrated in America “Thanksgiving” is a reminder of 500 years of betrayal. To many Native Americans, the Thanksgiving Myth amounts to the settlers’ justification for the genocide of Indigenous peoples. Native Americans think of this official U.S. holiday as a celebration of the survival of early arrivals in a European invasion that culminated in the death of 10+ million native people. Here is one Native American's view of the holiday; she provides some background on the source of this darker view and also shares why she has chosen to see it another way, in a spirit of forgiveness.

Thanksgiving is also associated with hard core shopping in the U.S. with something called Black Friday.  However, in the modern era, where few are aware of the damage to the native cultures by the original settlers and broken treaties, it is essentially about feasting, football, shopping and expressing gratitude. This is what most of my personal experience has been, football, shopping and turkey (apparently 45 million will die). 

While I have never resonated with the commercialism of the event, I have always felt that the celebration of gratitude is wonderful. Gratitude is an emotion expressing appreciation for what one has — as opposed to, for example, a consumer-driven emphasis on what one wants. Gratitude is getting a great deal of attention as a facet of positive psychology: Studies show that we can deliberately cultivate gratitude, and can increase our well-being and happiness by doing so. In addition, gratefulness—and especially expression of it to others -- is associated with increased energy, optimism, and empathy.

What Is Gratitude?

Robert Emmons, perhaps the world’s leading scientific expert on gratitude, argues that gratitude has two key components, which he describes in a Greater Good essay, “Why Gratitude Is Good.”

“First,” he writes, “it’s an affirmation of goodness. We affirm that there are good things in the world, gifts and benefits we’ve received.”

In the second part of gratitude, he explains, “we recognize that the sources of this goodness are outside of ourselves. … We acknowledge that other people -- or even higher powers, if you’re of a spiritual mindset—gave us many gifts, big and small, to help us achieve the goodness in our lives.”

Emmons and other researchers see the social dimension as being especially important to gratitude. “I see it as a relationship-strengthening emotion,“ writes Emmons, “because it requires us to see how we’ve been supported and affirmed by other people.”
Because gratitude encourages us not only to appreciate gifts but to repay them (or pay them forward), the sociologist Georg Simmel called it “the moral memory of mankind.”

As an immigrant to America I have always felt that the Thanksgiving story I was told about Pilgrims and "Indians" holding hands and smiling, was at least a little bit shaky based on my very limited knowledge of American history. However, it just never rang true to my mind. And while I feel that any day when a family and a community gather to give thanks, is special and worthy of celebration, I think we should also acknowledge that the history we are told is suspect, as often, history is written by the victors and not by men of even and truthful temperance. Part of giving thanks, it seems to me is to also acknowledge the sacrifices of our ancestors who may have made one’s plenitude possible. This would include the Native Americans if you live in North America, as they have always regarded themselves as caretakers of the land rather than owners of it. The following statement is something that you will hear from many Native Americans about their ethos.

As America’s Host People, Native Americans are the keepers of the land, that is our sacred duty. Our responsibilities include bringing the land, the people, and the rest of creation back into harmony.
On this particular Thanksgiving, near the Standing Rock Sioux Reservation, we have yet another example of Native Americans standing up for what they believe is a sacred trust: to prevent the desecration of land they consider holy, and to protect the largest drinking water supply in the region from potential damage. This is yet another example of the betrayal of a treaty with the US government, as many believe that this should have been prevented by treaties already in place. From one perspective the issues are complex as described here and in looking at the oil price economics driving the project. The world has been electrified by protests against the Dakota Access Pipeline. Is this a new civil rights movement where environmental and human rights meet?

For the Elders leading the protest there are 3 clear reasons to try and stop this:
  1. Prevent desecration of sacred burial grounds and what is considered “holy” land,
  2. Protect a major supply of natural drinking water from potential oil spill accidents,
  3. They have a treaty in place with the US government that was supposed to protect against commercial exploitation of protected land.

Those who think that the oil spill potential is overstated should take a look at how frequently these accidents do happen, and what happens when they do. Galveston Bay, a hub for oil traffic, for example, averages close to 300 oil spills of various sizes each year. As you may have guessed, Galveston is not known for its wonderful beach experience. The Exxon Valdez spill still has a negative impact 25 years later, and the environment and wildlife have yet to fully recover from the accident. The impact of the Deepwater Horizon spill, examined 5 years later, shows that while nature does have a recovery process, it can take decades or longer even to understand the damage, let alone recover from it.

Here is a video of a 90,000 gallon spill in May 2016 that did not even make the daily news since these kinds of spills are so common.

So on this Thanksgiving, I also give thanks to those who oppose this pipeline and make a valiant attempt to stop the potential destruction of one of the largest natural drinking water supplies in the US. The Native American ethos also has a very unique view on death in such a battle. When one battles and fights for the community well being, and for the land, it is considered to be a noble death since it is a sacrifice for the well being of others.  Robbie Robertson (of The Band) captures the emotion that these “water protectors” must feel at Standing Rock, wonderfully in this live rendition of “It is a good day to die”, a quote attributed to Crazy Horse. This translation is the English bastardization of a common Sioux battle-cry of, "Nake nula wauŋ welo!" This phrase really means, "I am ready for whatever comes." It was meant to show the warriors were not afraid of the battle or dying in it. So... Crazy Horse probably shouted, "Hokahey! Nake nula wauŋ welo!"

I wish you all a warm and loving Thanksgiving as you express your gratitude for your plenitude.


Tuesday, November 22, 2016

Understanding Your Data Using Corpus Analysis

If you were surprised by the outcome of the recent US Presidential elections, you can imagine the surprise of the “expert” pollsters whose alleged expertise it is, to predict these events. These predictions were based on an understanding of the population (the data), which in this case meant predicting how 120 million people would vote based on a sample of 25,000 or maybe 100,000 people who are assumed to be a representative sample. They were all wrong because the sample was simply not representative of the actual voting population. So it goes. It is very easy to go wrong with big data even when you have deep expertise.

This is not so different from Google claiming “human quality MT” based on a sample of 500 sentences. Unfortunately for them, it is just not true once you step away from these 500 sentences. The real world is much more unpredictable.

This is a guest post by Juan Rowda about Corpus Analysis, which is a technical way of saying it is about understanding your data in a serious MT use case scenario. Juan has special expertise with corpus analysis as he is involved in a translation project that is attempting to translate billions of words, where this type of analysis is critical. And, unlike the boys at Google, who shout hooray with 500 sentences that look good, the linguists at eBay judge themselves on how well they are doing with 60 million product listings across 12,000 categories and so are less likely to be shouting hooray until many millions of sentences are looking good. They also have millions of people looking at the translations, so I am sure they get regular feedback when the MT is not good. This post is more demanding at a technical level than many of the posts on eMpTy Pages, but I hope that you will find it useful.

Statistical Machine Translation (SMT) needs considerably big amounts of text data to produce good translations. We are talking about millions of words. But it’s not simply any text data – it’s good data that will produce good translations. Does “garbage in, garbage out” ring a bell?

In this scenario, and speaking mainly from a linguist’s perspective, the challenge is how to make sense of all of these millions of words? What to do to find out whether the quality of a corpus is good enough to be used in your MT system? How do you know what to improve if you realize a corpus is not good? How to know what your corpus is about?

It’s simply unrealistic to try to understand your corpus by reading every single line or word in it.


Corpus analysis can help you find answers to these questions. It can also help you understand how your MT system is performing and why. It can even help you understand how your post-editors are performing.

In this article, I will cover some analysis techniques and tips that I believe are useful and effective to understand your corpus better. Please note that, to keep things simple, I will use the word corpus for any text sample, whether it is used to produce translations or is the result of a translation-related process.

 The Tools

I’m going to cover two tools: AntConc and Python. The first one is a corpus analysis tool exclusively. The latter is a programming language (linguists, please, don't panic!), but I’m going to show you how you can use a natural language processing module (NLTK) to dig into your corpora, and provide snippets of code for you to try.

AntConc and Python can be used in Windows, Mac and Linux.

As defined by its website, AntConc is a freeware corpus analysis toolkit for concordancing and text analysis. It’s really simple to use, it contains 7 main tools for analysis and has several interesting features. We will take a closer look at the details and how the tool can be used with the following examples.

Getting a Word List
A great way to know more about your corpus is getting a list of all the words that appear in it. AntConc can easily create a list with all the words that appear in your corpus, and show important additional information about them, such as how many tokens there are and the frequency of each. Knowing which words appear in your corpus can help you identify what it is about; the frequency can help you determine which are the most important words.

You can also see how many tokens (individual words) and word types (unique words) there are in a corpus. This is important to determine how varied (how many different words) your text is.
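
If you want to try the same idea in Python, here is a minimal sketch of a word list with token and type counts. It uses only the standard library rather than NLTK, and the tokenizer (lowercased runs of letters, digits and apostrophes) is a simplifying assumption, not the way AntConc splits text. The sample text is made up.

```python
import re
from collections import Counter


def tokenize(text: str) -> list:
    """Lowercase the text and keep runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_list(text: str) -> Counter:
    """Map each word type to the number of times it occurs."""
    return Counter(tokenize(text))


sample = "New leather clutch bag. Vintage leather bag, new with tags."
frequencies = word_list(sample)
token_count = sum(frequencies.values())  # every running word
type_count = len(frequencies)            # every distinct word

print(f"tokens: {token_count}, types: {type_count}")
for word, count in frequencies.most_common(3):
    print(word, count)
```

For a real corpus, you would read the text from your file(s) instead of the `sample` string.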

To create a Word List, after loading your corpus file(s), click the Word List tab and click Start. You’ll see a list of words sorted by frequency by default. You can change the sorting order in the Sort by drop down. Besides frequency, you can sort alphabetically and by word ending.

Frequency is often a good indicator of important words - it makes sense to assume that tokens that appear many times have a more relevant role in the text.

But what about prepositions or determiners and other words that don’t really add any meaning to the analysis? You can define a word list range, i.e., you can add stopwords (words you want to exclude from your analysis) individually or entire lists.
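
The same filtering is easy to sketch in Python. The stopword list below is a tiny hand-picked set used purely for illustration; real lists are much longer (NLTK, for instance, ships ready-made ones per language), and the function name is hypothetical.

```python
import re
from collections import Counter

# A tiny, hand-picked stopword list for illustration only.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "the", "to", "with"}


def content_word_frequencies(text: str, stopwords: set) -> Counter:
    """Count words after dropping the ones listed as stopwords."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(token for token in tokens if token not in stopwords)


sample = "The case for the phone and the charger for the car"
frequencies = content_word_frequencies(sample, STOPWORDS)
print(frequencies.most_common())
```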

Word lists are also a very good resource to create glossaries. You can either use the frequency to identify key words or just go through the list to identify words that may be difficult to translate.

Keyword Lists
This feature allows you to compare a reference corpus and a target corpus, and calculate words that are unusually frequent or infrequent. What’s the use for this? Well, this can help you get a better insight on post-editing changes, for example, and try to identify words and phrases that were consistently changed by post-editors. It’s safe to assume that the MT system is not producing a correct translation for such words and phrases. You can add these to any blacklists, QA checks, or automated post-editing rules you may be using.

A typical scenario would be this: you use your MT output as target corpus, and post-edited/human translation (for the same source text, of course) as reference corpus; the comparison will tell you which words are frequent in the MT output that are not so frequent in the PE/HT content.

Vintage here is at the top of the list. In my file with MT output segments, it occurs 705 times. If I do the same with the post-edited content, there are 0 occurrences. This means post-editors have consistently changed “vintage” to something else. It’s safe to add this word to my blacklist then, as I’m sure I don’t want to see it in my translated content. If I know how it should be translated, it could be part of an automated post-processing rule. Of course, if you are training your engine with the post-edited content, “vintage” should become less common in the output.
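
To make the comparison concrete, here is a rough Python sketch that scores words by log-likelihood, one common keyness statistic. Whether this matches the statistic and settings selected in your AntConc installation is an assumption, and the two short "corpora" are invented to mirror the vintage example.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list:
    return re.findall(r"\w+", text.lower())


def log_likelihood(freq_target: int, size_target: int,
                   freq_reference: int, size_reference: int) -> float:
    """Log-likelihood keyness of one word across two corpora."""
    total = freq_target + freq_reference
    combined_size = size_target + size_reference
    expected_target = size_target * total / combined_size
    expected_reference = size_reference * total / combined_size
    score = 0.0
    if freq_target:
        score += freq_target * math.log(freq_target / expected_target)
    if freq_reference:
        score += freq_reference * math.log(freq_reference / expected_reference)
    return 2 * score


def keywords(target_text: str, reference_text: str) -> list:
    """Words over-represented in the target corpus, strongest first."""
    target = Counter(tokenize(target_text))
    reference = Counter(tokenize(reference_text))
    size_target = sum(target.values())
    size_reference = sum(reference.values())
    scored = []
    for word, freq in target.items():
        ref_freq = reference[word]
        # Keep only words relatively more frequent in the target corpus.
        if freq / size_target > ref_freq / size_reference:
            score = log_likelihood(freq, size_target, ref_freq, size_reference)
            scored.append((word, score))
    return sorted(scored, key=lambda item: item[1], reverse=True)


mt_output = "vintage reloj de pulsera vintage reloj de bolsillo vintage"
post_edited = "reloj de pulsera antiguo reloj de bolsillo antiguo clásico"
for word, score in keywords(mt_output, post_edited)[:3]:
    print(word, round(score, 2))
```

Words that post-editors consistently removed float to the top of the list, which is exactly the kind of candidate you would consider for a blacklist or a post-processing rule.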

To add a reference corpus, in the Tool Preferences menu, select Add Directory or Add Files to choose your corpus file(s). Click the Load button after adding your files.


Collocates are simply words that occur together. This feature allows you to search for a word in a corpus and get a list of results that show other words that appear next to the search term.  You can see how frequent a collocate is and also choose if results should include collocates appearing to the right of the term, to the left, or both. What’s really interesting about this is that it can help you find occurrences of words that occur near your search term, and not necessarily next to it. For example, in eBay’s listing titles, the word clutch can be sometimes mistranslated. It’s a polysemous word and it can be either a small purse or an auto part. I can do some analysis on the collocate results for clutch (auto parts) and see if terms like bag, leather, purse, etc., occur near it.

You can also select how frequent a collocate needs to be in order to be included in the results.

This is very useful to spot unusual combinations of words, as well. It obviously depends on the language, but a clear example could be a preposition followed by another preposition.

To use this feature, load your corpus files, and click the Collocates tab. Select the From and To ranges - values here contain a number and a letter: L(eft)/R(ight). The number indicates how many words away from the search terms should be included in the results, and L/R indicates the direction in which collocates must appear. You can also select a frequency value. Enter a search term and click Start.
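
The From/To window can be imitated in a few lines of Python. This is a simplified sketch: it only counts neighbours (it does not compute any association statistic), it treats each line as a separate listing title, and the `collocates` function and sample titles are hypothetical.

```python
import re
from collections import Counter


def collocates(lines, term, left=3, right=3):
    """Count words within `left`/`right` positions of each hit of `term`."""
    counts = Counter()
    for line in lines:
        tokens = re.findall(r"\w+", line.lower())
        for index, token in enumerate(tokens):
            if token != term:
                continue
            window_start = max(0, index - left)
            window_end = index + right + 1
            # Neighbours on both sides, excluding the search term itself.
            counts.update(tokens[window_start:index])
            counts.update(tokens[index + 1:window_end])
    return counts


titles = [
    "black leather clutch bag for women",
    "clutch kit for honda civic",
    "evening clutch purse with chain",
]
print(collocates(titles, "clutch", left=2, right=2).most_common(5))
```

Setting `left=0` or `right=0` restricts the results to one side of the search term, much like choosing only R or only L values in the range.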

All the results obtained with any of the tools AntConc provides can be exported into several formats. This allows you to take your data and process it in any other tool.

The Clusters/N-Grams tool is perhaps one of the most useful features in AntConc. Why? Because it allows you to find patterns. Remember that, when working with MT output, most of the time it’s not realistic to try to find and/or fix every single issue. There may be tons of errors with varying levels of severity in the MT output (especially considering the volumes of content processed by MT), so it does make sense to focus first on those that occur more frequently or that have a higher severity.

Here's a simple example: let’s assume that by looking at your MT output you realize that your MT system is translating the word “inches” into “centimeters”, without making any changes to the numbers that usually precede that word, i.e., 10 inches is being consistently translated as 10 centimeters. You could try to find and fix 1 centimeter, 2 centimeters, 3 centimeters, etc. Rather, a much better choice would be to identify a pattern: “any number” followed by the word “centimeter” should be instead “any number” “inches”. This is an oversimplification, but the point is that identifying an error pattern is a much better approach than fixing individual errors.

Once you have identified a pattern, the next step is to figure out how you can create some sort of rule to find/fix such a pattern. Simple patterns made of words or phrases are pretty straightforward - find all instances of “red dress” and replace with “blue dress”, for example. Now, you can take this to the next level by using regular expressions. Going back to the inches example, you could easily find all instances of “any number” followed by centimeters with a simple regex like \d+ centimeters, where \d stands for any digit and the + sign stands for 1 or more (digits).
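
As a hedged illustration of such a rule, the Python sketch below finds a number followed by "centimeter(s)" and puts back "inch(es)" while keeping the number. It follows the same oversimplification as the example above (it assumes every such match is an error), and the function name is made up.

```python
import re

# One or more digits, a space, then "centimeter" or "centimeters".
# The digits are captured so they can be reused in the replacement.
PATTERN = re.compile(r"\b(\d+) centimeters?\b")


def restore_inches(segment: str) -> str:
    """Put back the unit the MT system wrongly replaced, keeping the number."""
    def fix(match):
        number = match.group(1)
        unit = "inch" if number == "1" else "inches"
        return f"{number} {unit}"
    return PATTERN.sub(fix, segment)


print(restore_inches("Tablet case for 10 centimeters screens"))
```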

Using the Clusters/N-Grams tool helps you find strings of text based on their length (number of tokens or words), frequency, and even the occurrence of any specific word. Once you open your corpus, AntConc can find a word or a pattern in it and cluster the results in a list. If you search for a word in your corpus, you can opt to see words that precede or follow the word you searched for.

Results can be sorted:
  • by frequency (ideal to find recurring patterns - the more frequent a pattern is, the more relevant it might be),
  • by word (ideal to see how your MT system is dealing with the translation of a particular term),
  • by word end (sorted alphabetically based off the last word in the string),
  • by range (if your corpus is composed of more than one file, in how many of those files the search term appears), and
  • by transitional probability (how likely it is that word2 will occur after word1; e.g., the probability of “Am” occurring after “I” is much higher than “dishwasher” occurring after “I”.).
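
To make the last option concrete: transitional probability can be estimated as the number of times word1 is followed by word2, divided by the number of times word1 is followed by any word. The Python sketch below does exactly that; whether AntConc computes its figure in precisely this way is an assumption, and note that this toy version lets word pairs cross sentence boundaries.

```python
import re
from collections import Counter


def transitional_probability(text: str, word1: str, word2: str) -> float:
    """Estimate P(word2 | word1) from adjacent word pairs in the text."""
    tokens = re.findall(r"\w+", text.lower())
    bigrams = Counter(zip(tokens, tokens[1:]))
    # Occurrences of word1 that are followed by some other token.
    word1_count = sum(
        count for (first, _), count in bigrams.items() if first == word1
    )
    if word1_count == 0:
        return 0.0
    return bigrams[(word1, word2)] / word1_count


sample = "I am here. I am late. I was there."
print(transitional_probability(sample, "i", "am"))
print(transitional_probability(sample, "i", "dishwasher"))
```
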
Let’s see how the Clusters tool can be used. I’ve loaded my corpus in AntConc and I want to see how my system is dealing with the word case. Under the Cluster/Ngrams tab, let’s check the box Word, as I want to enter a specific search term. I want to see clusters that are 3 to 4 words long. And very important here, the Search Term Position option: if you select Left, your search term will be the first word in the cluster; if you select Right, it’ll be the last one instead. Notice in the screenshots, how the Left/Right option selection affects the results.

On Left

On Right

We can also use regular expressions here for cases in which we need more powerful searches. Remember the example about numbers and inches above? Well, numbers, words, spaces, letters, punctuation - all these can be covered with regular expressions.

Let’s take a look at a few examples:

Here, I want to see all 2-word clusters that start with the word “original”, so I’m going to use a boundary (\b) before “original”. I don’t know the second word (it’s actually what I want to find out), so I’m going to use \w, which stands for any word character; in standard regex syntax you would write \w+ to match a whole word. All my results will then have the following form: original+word.

Now, I want to see all clusters, regardless of their frequency, that contain the words “price” OR “quality”. So, in addition to adding the boundaries, I’m going to separate these words with | that simply stands for “or”.

This is really useful when you want to check how the system is dealing with certain words - there’s no need to run separate searches since you can combine any number of words with | between them. Check the Global Settings menu for reference.
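
If you prefer to try these two searches outside AntConc, here is a rough equivalent with Python's standard re module. It uses standard regex syntax (\b, \w+ and |), which may not match AntConc's operators exactly; the sample segments and function names are invented for the example.

```python
import re
from collections import Counter

# Made-up segments standing in for a corpus.
SEGMENTS = [
    "The original box is included.",
    "Great price and original packaging.",
    "Quality is fine, but the original box was damaged.",
]

def clusters_starting_with(word, segments):
    """Count two-word clusters whose first word is `word`."""
    pattern = re.compile(r"\b" + re.escape(word) + r" \w+", re.IGNORECASE)
    counts = Counter()
    for segment in segments:
        counts.update(match.lower() for match in pattern.findall(segment))
    return counts

def segments_containing_any(words, segments):
    """Return segments that contain at least one of `words` as a whole word."""
    alternatives = "|".join(re.escape(w) for w in words)
    pattern = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
    return [s for s in segments if pattern.search(s)]

print(clusters_starting_with("original", SEGMENTS))
print(segments_containing_any(["price", "quality"], SEGMENTS))
```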

For seasoned regex users, note that regex capabilities in AntConc are pretty modest and that some operators are not standard.

If you are not familiar with the term n-gram: in a nutshell, an n-gram is any word or sequence of words of any size; a 1-gram is composed of one element, a 2-gram is composed of 2 elements, etc. It’s a term that defines the length of a string rather than its content.

What’s great about this feature is that you can find recurring phrases without specifying any search terms. That is, you can easily obtain a list of, for example, all the 6-grams to 3-grams that occur more than 10 times in your corpus. Remember that clusters work in the opposite way - you find words that surround a specific search term.

The n-gram search is definitely an advantage when you don’t know your corpus very well and you still don’t know what kind of issues to expect. It’s usually a good choice if it’s the first time you are analyzing a corpus - it finds patterns for you: common expressions, repeated phrases, etc.

When working with n-grams, it’s really important to consider frequency. You want to focus your analysis on n-grams that occur frequently first, so you can cover a higher number of issues.
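
The same idea (list every n-gram in a size range and keep only the frequent ones) can be sketched in plain Python. This is not how AntConc does it internally; the function name, the sample text, and the thresholds are assumptions chosen for illustration.

```python
from collections import Counter

def frequent_ngrams(tokens, min_n, max_n, min_count):
    """Return n-grams of min_n..max_n tokens seen at least min_count times."""
    counts = Counter()
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    frequent = [(gram, c) for gram, c in counts.items() if c >= min_count]
    # Most frequent first; ties broken alphabetically so the order is stable.
    return sorted(frequent, key=lambda item: (-item[1], item[0]))

# Made-up sample text, split on whitespace.
TOKENS = "free shipping on all orders free shipping on all items".split()
for gram, count in frequent_ngrams(TOKENS, 2, 3, 2):
    print(count, " ".join(gram))
```

Sorting by frequency first mirrors the advice above: the most repeated n-grams surface at the top, so you can review them before anything else.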

What can you do with your findings, besides the obvious fact of knowing your corpus better? You can find recurring issues and create automated post-editing rules. Automated post-editing is a technique that consists of applying search and replace operations to the MT output. For instance, going back to our initial inches vs. centimeters example, you could create a rule that replaces all instances of number+centimeters with number+inches. Using regular expressions, you can create very powerful, flexible rules. Even though this technique was particularly effective when working with RBMT, it’s still pretty useful for SMT between training cycles (the process in which you feed new data to your system so it learns to produce better translations).
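
As a rough sketch of such a rule, the number+centimeters example could look like this in Python. Note that, like the rule described above, it only swaps the unit label and does not convert the value; the function name and sample segment are made up.

```python
import re

def apply_unit_rule(segment):
    """Replace 'number centimeters' with 'number inches', keeping the number."""
    # (\d+) captures the number so \1 can reuse it in the replacement.
    return re.sub(r"(\d+) centimeters\b", r"\1 inches", segment)

print(apply_unit_rule("A 13 centimeters screen and a 2 centimeters bezel."))
```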

You can also create blacklists with issues found in your MT output. A blacklist is simply a list of terms that you don’t want to see in your target. For example, if your system is consistently mistranslating the word “case” as a legal case instead of a protective case, you can add the incorrect terms to the blacklist and easily detect when they occur in your output. In the same way, you can create QA checks to run in tools like Checkmate or Xbench.
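
A blacklist check can be as simple as the sketch below. The Spanish terms and segments are invented for the example (they are not taken from a real system), and the function name is hypothetical.

```python
import re

# Hypothetical blacklist: target terms that signal a known mistranslation.
BLACKLIST = ["caso judicial", "pleito"]

def find_blacklisted(segments, blacklist):
    """Return (segment index, term) pairs for every blacklisted term found."""
    hits = []
    for index, segment in enumerate(segments):
        for term in blacklist:
            # Whole-word, case-insensitive match of the blacklisted term.
            if re.search(r"\b" + re.escape(term) + r"\b", segment, re.IGNORECASE):
                hits.append((index, term))
    return hits

# Made-up MT output segments.
OUTPUT = [
    "Funda protectora para iPhone",
    "Caso judicial de cuero para iPhone",
]
print(find_blacklisted(OUTPUT, BLACKLIST))
```

Returning the segment index along with the term makes it easy to jump back to the offending segment for review.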


For those of you not familiar with it, Python is a programming language that has been gaining more and more popularity for several reasons: it's easy to learn and easy to read, it can be run in different environments and operating systems, and there is a significant number of modules that can be imported and used.

Modules are files that contain classes, functions, and other data. Without getting too technical, a module is code already written that you can reuse, without having to write it yourself from scratch. Quick example: if you want to write a program that will use regular expressions, you can simply import the re module and Python will learn how to deal with them thanks to the data in the module.
I'm not a Python expert myself, and I apologize if the terminology I use here is not what experts use.

Enter the Natural Language Toolkit
Modules are the perfect segue to introduce the Natural Language Toolkit (NLTK). Let me just steal the definition from their site: NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries[...].
Using Python and NLTK, there are quite a few interesting things you can do to learn more about your corpora. I have to assume you are somewhat familiar with Python (not an expert!), as a full tutorial would simply exceed the purpose of this post. If you want to learn more about it, there are really good courses on Coursera, Udemy, YouTube, etc. I personally like Codecademy's hands-on approach.
The Mise en Place
To follow these examples, you'll need the following installed:
  • Python 3.5 (version 2.7 works too, but some of these examples may need tweaking)
  • NLTK
  • Numpy (optional)
To get corpora, you can follow these steps:
You have two options here: you can choose to use corpora provided by NLTK (ideal if you just want to try these examples, see how Python works, etc.) or you can use your own files. Let me walk you through both cases.
If you want to use corpora from NLTK, open your Python's IDLE, import the nltk module (you'll do this every time you want to use nltk) and then download the corpora:

>>> import nltk
>>> nltk.download()

A new window will open, and you'll be able to download one or more corpora, as well as other packages. You can find the entire list here.

 When working in Python, you can import (a) all available corpora at the same time or (b) a single corpus. Notice that (a) will import books (like Moby Dick, The Book of Genesis, etc...)

a       >>> from nltk.book import *
b       >>> from nltk.corpus import brown

If you want to use your own files, you'll have to tell Python where they are, so it can read them. Follow these steps if you want to work with one file (remember to import nltk!):

>>> f = open(r'c:\reviews.txt','r')
>>> raw = f.read()
>>> tokens = nltk.word_tokenize(raw)
>>> text = nltk.Text(tokens)

Basically, I'm telling Python to open my file called reviews.txt saved in C: (the "r" in front of the path marks it as a raw string, so Python doesn't treat the backslash as an escape character). I'm also telling Python I want to read, not write on, this file.

Then, I'm asking Python to read the contents of my file and store them in a variable called "raw"; to tokenize the content ("identify" the words in it) and store those tokens in a variable called "tokens"; and finally to wrap the tokens in an NLTK Text object stored in a variable called "text". Don't get scared by the technical lingo at this point: a variable is just a name that we assign to a bucket where we store information, so we can later make reference to it.

What if you have more than one file? You can use the Plain Text Corpus Reader to deal with several plaintext documents. Note that, if you follow the example below, you'll need to replace the path and the file pattern with the relevant information for your own files, like your folder, your file extension, etc.

>>> from nltk.corpus import PlaintextCorpusReader
>>> files = r".*\.txt"
>>> corpus0 = PlaintextCorpusReader(r"C:/corpus", files)
>>> corpus  = nltk.Text(corpus0.words())

Here, I'm asking Python to import PlaintextCorpusReader, telling it that my files have the TXT extension and where they are stored, and asking it to store the data from my files in a variable called corpus.

You can test if your data was correctly read just by typing the name of the variable containing it:

>>> corpus
<Text: This black and silver Toshiba Excite is a...>
>>> text
<Text:`` It 's a Motorola StarTac , there...>

corpus and text are the variables I used to store data in the examples above.

Analyzing (finally!)
Now that we are all set up, and have our corpora imported, let's see some of the things we can do to analyze it:

We can get a wordcount using the "len" function. It is important to know the size of our corpus, basically to understand what we are dealing with. What we'll obtain is a count of all words and symbols, repeated words included.

>>> len(text)
>>> len(corpus)

If we wanted to count unique tokens, excluding repeated elements:
>>> len(set(corpus))

 With the "set" function, we can get a list of all the words used in our corpus, i.e., a vocabulary:
>>> set(corpus)
{'knowledge', 'Lord', 'stolen', 'one', ':', 'threat', 'PEN', 'gunslingers', 'missions', 'extracting', 'ensuring',
 'Players', 'player', 'must', 'constantly', 'except', 'Domino', 'odds', 'Core', 'SuperSponge', etc..

A list of words is definitely useful, but it's usually better to have them alphabetically sorted. We can also do that easily:

>>> sorted(set(corpus))
["'", '(', ').', ',', '-', '--', '.', '3', '98', ':', 'Ancaria', 'Apocalypse', 'Ashen'
'Barnacle', 'Bikini', 'Black', 'Bond', 'Bottom', 'Boy', 'Core', 'Croft', 'Croy', 'D'
'Dalmatian', 'Domino', 'Egyptian', etc...

Note that Python will put capitalized words at the beginning of your list.

We can check how many times a word is used on average, which is one way to gauge lexical richness. From a corpus analysis perspective, it's good that a corpus is lexically rich as, theoretically, the MT system will "learn" how to deal with a broader range of words. This indicator can be obtained by dividing the total number of words by the number of unique words (the lower the result, the more varied the vocabulary):

>>> len(text)/len(set(text))
>>> len(corpus)/len(set(corpus))
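
Since this count is case-sensitive ("The" and "the" are two different tokens), it can be worth comparing the ratio with and without lowercasing. Here is a small plain-Python sketch of that comparison; the function name and the sample sentence are made up, and it works on any list of tokens.

```python
def average_uses_per_word(tokens, ignore_case=False):
    """Total tokens divided by unique tokens."""
    if ignore_case:
        tokens = [t.lower() for t in tokens]
    if not tokens:
        return 0.0
    return len(tokens) / len(set(tokens))

# Made-up sample, split on whitespace.
TOKENS = "The case is red and the box is red".split()
print(average_uses_per_word(TOKENS))
print(average_uses_per_word(TOKENS, ignore_case=True))
```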

 If you need to find out how many times a word occurs in your corpus, you can try the following (notice this is case-sensitive):
>>> text.count("leave")
>>> text.count("Leave")

 One key piece of information you probably want to get is the number of occurrences of each token or vocabulary item. As we mentioned previously, frequent words say a lot about your corpus, they can be used to create glossaries, etc. One way to do this is using frequency distributions. You can also use this method to find how many times a certain word occurs.
>>> from nltk import FreqDist
>>> fdistcorpus = FreqDist(corpus)
>>> fdistcorpus
FreqDist({',': 33, 'the': 27, 'and': 24, '.': 20, 'a': 20, 'of': 17, 'to': 16, '-': 12,
 'in': 8, 'is': 8, ...})
>>> fdistcorpus['a']

 A similar way to do this is using the vocab function:
>>> text.vocab()
FreqDist({',': 2094, '.': 1919, 'the': 1735, 'a': 1009, 'of': 978, 'and': 912, 'to': 896, 'is': 597
'in': 543, 'that': 518, ...})

Conversely, if you wanted to see the words that only appear one time:
>>> fdistcorpus.hapaxes()
['knowledge', 'opening', 'mystical', 'return', 'bound']

 If you only want to see, for example, the 10 most common tokens from your corpus, there's a function for that:
>>> fdistcorpus.most_common(10)
[(',', 33), ('the', 27), ('and', 24), ('.', 20), ('a', 20), ('of', 17), ('to', 16), ('-', 12), ('in', 8), 
('is', 8)]
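
If you don't have NLTK at hand, Python's built-in collections.Counter gives you a similar frequency count. This is only a standard-library sketch on a made-up token list, not a replacement for FreqDist:

```python
from collections import Counter

# Made-up token list for illustration.
TOKENS = "the case , the box and the cable".split()
counts = Counter(TOKENS)

# Tokens that occur exactly once, similar in spirit to hapaxes().
hapaxes = [token for token, count in counts.items() if count == 1]

print(counts.most_common(1))
print(hapaxes)
```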

We can have the frequency distributions results presented in many ways:
  • one column:
>>> for sample in fdistcorpus:
...     print(sample)

Here, I'm using a for loop. Loops are typically used when you want to repeat or iterate an action. In this case, I'm asking Python, for each token or sample in my corpus, to print said sample. The loop will perform the same action for all the tokens, one at a time, and stop when it has covered every single one of them.
  • tab-separated:
>>> fdistcorpus.tabulate()
               ,              the              and                .
                a               of               to 

  •  chart
 >>> fdistcorpus.plot()
Let’s say now that you want to obtain ngrams. There are many methods, but here is a very simple one:
>>> from nltk import ngrams
>>> mytext = "I’m selling these fine leather jackets"
>>> n = 3
>>> trigrams = ngrams(mytext.split(),n)
>>> for grams in trigrams:
...     print(grams)
...
('I’m', 'selling', 'these')
('selling', 'these', 'fine')
('these', 'fine', 'leather')
('fine', 'leather', 'jackets')
In this example, I’m using a sentence (mytext) for clarity purposes, but you can use a longer text or a file. What I’m doing is splitting the text into words and then grouping those words into sequences whose length is the value of the n variable, in this case, 3. This way, I can print all the trigrams in my text.

If I wanted to get bigrams instead, I could simply change the value of n:
>>> n = 2
>>> bigrams = ngrams(mytext.split(),n)
>>> for grams in bigrams:
...     print(grams)
...
('I’m', 'selling')
('selling', 'these')
('these', 'fine')
('fine', 'leather')
('leather', 'jackets')

TIP: use this formula to easily calculate the number of ngrams in a sentence: 
 number of words – ngram + 1

Example: I’m selling these fine leather jackets
6 (words) – 3 (trigrams) + 1 = 4 trigrams

1 - ('I’m', 'selling', 'these')
2 - ('selling', 'these', 'fine')
3 - ('these', 'fine', 'leather')
4 - ('fine', 'leather', 'jackets')
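
You can double-check the formula with a few lines of plain Python. The sketch below builds the n-grams by hand (the make_ngrams helper is made up for this example and is not part of NLTK) and prints the actual count next to the value the formula predicts.

```python
def make_ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = "I'm selling these fine leather jackets".split()
for n in (2, 3):
    grams = make_ngrams(words, n)
    # Actual count vs. the formula: number of words - n + 1
    print(n, len(grams), len(words) - n + 1)
```

Keep in mind that the formula only makes sense when the sentence has at least n words; for shorter sentences there are simply no n-grams.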

To close this post, let’s take a look at how you can filter out stopwords. These are very common, high-frequency words in any language, like prepositions or possessive pronouns. For example, if you want to see a list of the most “important” words in your corpus, or if you are trying to create a glossary, words like “the” or “my” barely have any lexical value. You don't need them.

NLTK includes a corpus with more than 2,400 stopwords for 11 languages. You can import this corpus and see its content:
>>> from nltk.corpus import stopwords
>>> stopwords.words('spanish')
['de', 'la', 'que', 'el', 'en', 'y', 'a', 'los', 'del', 'se', 'las', 'por', 'un', 'para', 'con', 'no',

>>> stopwords.words('italian')
['ad', 'al', 'allo', 'ai', 'agli', 'all', 'agl', 'alla', 'alle', 'con', 'col', 'coi'...
Let’s try this on our own corpus. Again, I’m going to use a sentence, but you can try this on files too:
>>> from nltk.corpus import stopwords
>>> from nltk.tokenize import word_tokenize
>>> mytext="Shepard was born on April 11, 2154,[1] is a graduate of the Systems Alliance N7 special forces program (service no. 5923-AC-2826), a veteran of the Skyllian Blitz, and is initially assigned to the SSV Normandy in 2183 as Executive Officer."

>>> stop_words = set(stopwords.words('english'))
>>> word_tokens = word_tokenize(mytext)
>>> filtered_sentence = [w for w in word_tokens if w not in stop_words]

>>> print(filtered_sentence)
['Shepard', 'born', 'April', '11', ',', '2154', ',', '[', '1', ']', 'graduate', 'Systems', 'Alliance', 'N7', 'special', 'forces', 'program', '(', 'service', '.', '5923-AC-2826', ')', ',', 'veteran', 'Skyllian', 'Blitz', ',', 'initially', 'assigned', 'SSV', 'Normandy', '2183', 'Executive', 'Officer', '.']

>>> print(word_tokens)
['Shepard', 'was', 'born', 'on', 'April', '11', ',', '2154', ',', '[', '1', ']', 'is', 'a', 'graduate', 'of', 'the', 'Systems', 'Alliance', 'N7', 'special', 'forces', 'program', '(', 'service', 'no', '.', '5923-AC-2826', ')', ',', 'a', 'veteran', 'of', 'the', 'Skyllian', 'Blitz', ',', 'and', 'is', 'initially', 'assigned', 'to', 'the', 'SSV', 'Normandy', 'in', '2183', 'as', 'Executive', 'Officer', '.']

One of the benefits of filtering out stopwords is that you will reduce the size of your corpus, making it easier to work with. In the example above, we reduced our sentence from 51 to 35 tokens.
>>> len(filtered_sentence)
>>> len(word_tokens)
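
Notice that the filter above is case-sensitive and keeps punctuation. If you also want to ignore case and drop punctuation-only tokens, a variation could look like the sketch below. The tiny stopword set and the function name are made up for illustration; with NLTK you would pass in stopwords.words('english') instead.

```python
# A tiny, made-up stopword set for illustration; NLTK's lists are much longer.
STOP_WORDS = {"the", "a", "of", "is", "was", "on", "to", "in", "and", "as"}

def remove_stopwords(tokens, stop_words):
    """Drop stopwords (case-insensitively) and punctuation-only tokens."""
    return [
        t for t in tokens
        if t.lower() not in stop_words and any(ch.isalnum() for ch in t)
    ]

tokens = ["The", "Normandy", "is", "a", "ship", ",", "assigned", "to", "Shepard", "."]
print(remove_stopwords(tokens, STOP_WORDS))
```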
 There is so much you can do with Python and NLTK that it would be impossible to cover everything in this post. If you are interested in learning more, I encourage you to check out the book “Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit”.


Juan Rowda
Staff MT Language Specialist, eBay

Juan is a certified localization professional who has worked in the localization industry since 2003. He joined eBay in 2014. Before that, he worked as a translator/editor for several years, managed and trained a team of more than 10 translators specialized in IT, and also worked as a localization engineer for some time. He first started working with MT in 2006. Juan helped to localize quite a few major video games, as well.
He was also a professional CAT tool trainer and taught courses on localization.
Juan holds a BA in technical, scientific, legal, and literary translation.