How can open data help bridge the digital language divide?

Over half of the landing pages of the web’s most visited sites are written in English. In light of ‘information inequality’ online, Alexander Leon explores how open data can help improve the web for non-English speakers.

A tower of world flags from the 2012 Summer Olympics - image: Karen Roe CC-BY-2.0

The word data, coming from Latin, literally means 'something given'. For a word with such an altruistic etymology, it is ironic that data finds itself, in the internet age, facing problems of exclusivity and restriction. We are familiar with the issue of important, valuable data being inaccessible, but beyond this lies another issue, seldom discussed but of supreme importance to the international community: why is so much of our data and online content published solely in English, and how can we work towards creating a more inclusive digital space?

Historically, the English language has dominated cyberspace. In 1997, David Crystal estimated that over 80% of online content was in English. Although this share has declined steadily as internet access has spread worldwide, a study in March 2015 estimated that the homepages of just under 55% of the most visited websites are in English. When we compare these figures to the number of native-level English speakers (around 20% of the world population), it becomes clear that if you do not speak one of the top 10 most common languages on the internet, the internet is of relatively little use to you. In other words, there's data out there, something to be given, but we're not making it easy for speakers of Tswana, Shan or Aymara to access it. This type of ‘information inequality’ is a result of the wider problem of the ‘digital language divide’, that is, the gap between regions that do and don't have access to information and communication technologies in their native language.

It is at this point that data, this time open data, re-enters our narrative. One of the most poetic qualities of data is that while it can unearth a problem, it can, upon being opened up, also be the driving force behind the solution. In the case of empowering minority language speakers, open data plays a pivotal role in opening up the digital world.

One approach that leverages open data starts from accepting globalisation: since English seems to be going nowhere, let's make it easier for minority language speakers to adopt it. But beyond our perception of English as the current lingua franca, do we have any proof that its dominance on the web will persist beyond the next few decades? Data visualisation projects, such as those of the Global Language Network, harness open data to quantify the lasting global importance of the English language. Through analysis of long-term translation trends across Twitter, Wikipedia and digitised books, researchers were able to show that the vast majority of text-based content on the web is written in English, or ends up translated into English eventually. Even languages which trend away from English for demographic or cultural reasons (for example, minority Chinese dialects being translated into Mandarin) are pulled back to English further down the line. Essentially, when it comes to languages both online and offline, all roads lead to, well, English.

So how can we draw upon open data to make learning English more feasible? FLAX (Flexible Language Acquisition) is an award-winning language learning tool which combines publicly available digital libraries (such as the British National Corpus and its American equivalent) with powerful word analysis software, automatically generating learning exercises tailored to any given text. Thanks to the near-endless supply of content available, FLAX allows students of English to learn with texts that are of particular interest to them. Naturally, it's free to use.

Open data also plays a supporting role in the movement to preserve and promote linguistic diversity, especially through the documentation of minority languages and the creation of translation and learning platforms. The Endangered Languages Project maps out languages across the world that are at risk, endangered or dead, providing users with information and learning resources to help revive them. The movement to increase native content for minority language speakers is relatively new, so the simple act of documenting the thousands of languages that exist equips researchers with a strong foundation for further work.

Going one step further, Openwords, a US-based startup, aims to create a free smartphone app which uses open language data to provide learning modules in over 1,000 languages, including many minority languages that suffer from a scarcity of learning resources. Using open lexical and educational data, Openwords collects words from a variety of languages and builds lessons around them, allowing users to choose the languages they are interested in. With a range of APIs, Openwords provides a free, open alternative to costly and inflexible language learning resources.

Whether it's facilitating the learning of English or creating a larger profile for minority languages, the onus is on the English-speaking world, which enjoys relative freedom of data access, to attempt to balance the scales.

Through the application and advocacy of open data, we can have reasonable hope for a future in which people everywhere, no matter their chosen language, have equal access to the online world.

Alex Leon is a Junior Consultant at the ODI. Follow @alxndrleon on Twitter.
