MT as a walled garden
Don’t get me wrong, there is a lot of value in MT. And with advances in NMT there may be a point where certain types of translation reach human quality without humans. For now, though, human quality requires humans as translators, reviewers, post-editors, or in some other role in the translation process. But this post is less about the merits of MT and more about the logistics of using MT at an enterprise. MT is the new walled garden. It replaces TMS products as a new form of customer lock-in, and the beneficiaries are no longer just translation agencies but also big corporations. MT has also eroded the once sacred rule that a customer’s data is private, because that data is abstracted by the MT process. But that is a problem for another post.
The productization of MT
Over the last ten years, machine translation has gone from a rules-based to a data-based problem. Until a few years ago, generic SMT produced humorous results that were mocked by the localization industry. Customized SMT performed much better and was therefore adopted to improve enterprise translation quality, tone, and style while reducing costs and timelines. Now NMT’s smaller data requirements and its ability to improve colloquial translation and fluency are making real inroads into content with short half-lives. However, there is a problem: NMT lacks an exchange format for the training process. Most enterprise organizations do not think about MT training processes, because they currently buy translations from translation agencies as post-edited content. But as NMT becomes more prevalent, more enterprise organizations will either hire machine learning scientists or demand that they can easily move their process to a new vendor.
NMT tools are not the same
Both SMT and NMT require tools to train and retrain their engines. Even when these tools are open-sourced, there is no easy way for a customer to move from one form of MT to another without redoing some of their previous training work. Whether NMT engines are trained with Amazon’s Sockeye, Google’s TensorFlow, OpenNMT, or some other toolkit, transitioning between them still takes work. The resulting translated content is not exactly reproducible unless the same training toolset is used. Put another way, there is no good way to transfer the work performed in one training tool to a new training tool. Perhaps this is not a major issue, because as long as the same training data is used across toolsets the work can be redone in a new training environment. But the results may differ. So does an enterprise redo all of its machine translation if it chooses to move tools, or does it accept the inconsistencies as a byproduct of moving its content?
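In practice, the only artifact that reliably carries over between toolkits is the parallel training data itself; the vocabularies, subword models, and checkpoints built on top of it are toolkit-specific. As a rough illustration, here is a minimal sketch (assuming a hypothetical TM export named enterprise_tm.tmx with en-US and de-DE segments) that pulls plain bitext out of a TMX file so it can be re-preprocessed and retrained in whichever toolkit comes next. Note that this only moves the data; it does nothing to recover the training work already invested in the previous environment.

```python
# Minimal sketch: extract parallel segments from a TMX export so they can be
# fed into a new NMT toolkit's own preprocessing and training pipeline.
# The file name and language codes below are hypothetical examples.
import xml.etree.ElementTree as ET

# TMX 1.4 marks languages with xml:lang; older files may use a plain lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tmx_to_bitext(tmx_path, src_lang, tgt_lang):
    """Yield (source, target) sentence pairs from a TMX file."""
    tree = ET.parse(tmx_path)
    for tu in tree.iter("tu"):
        segments = {}
        for tuv in tu.iter("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None and seg.text:
                segments[lang] = seg.text.strip()
        if src_lang in segments and tgt_lang in segments:
            yield segments[src_lang], segments[tgt_lang]


if __name__ == "__main__":
    # Write plain-text bitext; every toolkit's preprocessing starts over from here.
    with open("train.en", "w", encoding="utf-8") as src_out, \
         open("train.de", "w", encoding="utf-8") as tgt_out:
        for src, tgt in tmx_to_bitext("enterprise_tm.tmx", "en-us", "de-de"):
            src_out.write(src + "\n")
            tgt_out.write(tgt + "\n")
```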
How do we solve this issue?
It is much easier to identify a problem than it is to solve it, and unfortunately I don’t have a ready-made solution for this one. What comes to mind is encouraging organizations like TAUS to take up the mantle and drive the issue. TAUS has done great work gathering training data and designing quality tools for use across the industry. Perhaps Jaap and his team can push for the creation of a Rosetta stone for MT training that allows enterprises to move easily from one training process to another.
Amazon and Microsoft’s Gluon goes some way toward solving this problem, but its real focus is expanding the use of ML by making its integration and interfaces feel more like other development work.
And to be honest, only outside forces will push for interoperability. Custom offerings and tooling have often been used to lock in customers and ensure ongoing revenue (think TMS products before XLIFF, TMX, and TBX). As the use of NMT increases, customers will clamor for interoperability. And if that proves impossible, then price pressure on training and maintenance fees will drive competition in this space. And the medium to large players will spin up their own internal teams to own their own destiny and lower costs.