Vancouver Open Data language census

The City of Vancouver, B.C. has an Open Data portal with a rich collection of datasets. This page lists those datasets and characterises their human language use.

Our vision is that apps based on Vancouver open data should be localised into all the languages Vancouver residents want them in. This language census is a building block towards that vision. It is a project of Vancouver Open Data Day 2017 and Vancouver Open Data Day 2013.

Internationalisation issues in Open Data feeds
In order for an app or web site based on open data to present information in a user's preferred language, the app needs to be localised into that language. This is a task for the developer. Internationalisation (i18n) is a set of design and implementation techniques to make it cheaper and easier to localise an app or web site.

At some point, an app or web site presents data sourced from an open data dataset. In order for the complete user experience to be localised, the dataset also needs to be localised. In general, the localisation of the dataset will happen after the core of the dataset is originally created. A different set of people from the original creators may do the localisation. In some cases, the dataset won't be localised into the target language, so the app will have to present the data in the best available language, even though it's not the target.
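The "best available language" fallback described above can be sketched as follows. This is a minimal illustration, not part of any Vancouver dataset or API: the function name `best_available`, the per-record dictionary of translations keyed by language code, and the assumption that English is the source language are all hypothetical.

```python
def best_available(translations, preferred, source_lang="en"):
    """Pick the text to display for one localisable value.

    translations: dict mapping a language code to the text in that language
                  (hypothetical structure; real datasets may differ).
    preferred:    the user's languages, most wanted first.
    source_lang:  the language the dataset was originally written in
                  (assumed to be English here).

    Returns a (language, text) pair, or None if no text exists at all.
    """
    # Try each of the user's languages in order, then the source language.
    for lang in list(preferred) + [source_lang]:
        text = translations.get(lang)
        if text:
            return lang, text
    return None


if __name__ == "__main__":
    record = {"en": "Community Centre", "fr": "Centre communautaire"}
    # No Chinese text exists for this record, so French is used instead.
    print(best_available(record, ["zh", "fr"]))
```

Returning the language along with the text lets the app tell the user (or a screen reader) that the displayed text is not in the language they asked for.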

A challenge in enabling localisation of apps sourced from open data is to set up formats, social structures, and incentive structures that make it easier for datasets to get localised into the languages that matter to the end users.

In broad strokes, we can look at different types of data in a dataset:
 * Numbers, which are pure data and have no localisation implications. (Note that display of numbers should follow language conventions, but that doesn't affect how the app parses the number from the data feeds. Also, the dataset format description should be clear about how numbers are formatted.) Latitude and longitude coordinates are a variant of numbers.
 * Dates, which are fairly pure data, but can appear in a few different formats where language conventions can help disambiguate. For example, "1/2/13" could be "January 2 2013" or "February 1 2013". (If the dataset format description is comprehensive, it will clarify which order dates follow.) Only strings with date or time values are marked as Dates. If the component parts of a date or time are each assigned to their own column, then there is no language convention ambiguity, so these are Numbers, not Dates, for localisation purposes.
 * Controlled Vocabulary, which is human-readable text but expressed in a limited set of words controlled by the data originator. Where variations in spelling or wording exist, these are errors to be corrected rather than valid variations. We can think of controlled vocabulary phrases as machine-readable identifiers, which can be translated through localisation into terminology a user would understand. For instance, the controlled vocabulary entry "Theft From Auto Under $5000" might be localised into English as "Theft from cars". The good news is that the localisation is potentially a one-time task, not a continuing one.
 * Keywords, which are free text in controlled vocabulary clothing. The set of keywords is open to extension at data entry time, instead of being limited by the data originator. Keywords usually have the same localisation implications as free text.
 * Free text, which is human-readable text without particular limitations on content. The localisation implication of free text is that localisation is a continuing task, with each new record representing a new work item.
 * Metadata, which is data about the dataset. The most common example of this is the labels in the column headers of a dataset file. Dataset format descriptions, and the dataset descriptions in the dataset catalogue also are metadata. Metadata is frequently in a human language, but the target audience is app developers and data consumers, rather than the end users of an app or website. Thus, to a first approximation, metadata does not have localisation implications.
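Two of the data types above, Dates and Controlled Vocabulary, can be illustrated with a short sketch. The function names, the `CRIME_TYPE_LABELS` table, and the two-digit-year slash format are assumptions made for illustration; only the "1/2/13" example and the "Theft From Auto Under $5000" label come from the list above.

```python
from datetime import datetime


def possible_dates(text):
    """Return every calendar date a string like "1/2/13" could mean.

    Tries the month-first and day-first readings. More than one result
    means the dataset format description is needed to disambiguate.
    """
    candidates = set()
    for fmt in ("%m/%d/%y", "%d/%m/%y"):
        try:
            candidates.add(datetime.strptime(text, fmt).date())
        except ValueError:
            # The string is not a valid date under this reading.
            pass
    return candidates


# Hypothetical one-time translation table for a controlled vocabulary,
# keyed by language code, then by the dataset's own identifier phrase.
CRIME_TYPE_LABELS = {
    "en": {"Theft From Auto Under $5000": "Theft from cars"},
}


def localise_term(term, lang, labels=CRIME_TYPE_LABELS):
    """Translate a controlled vocabulary entry into user-facing wording.

    Falls back to the raw dataset phrase when no translation exists.
    """
    return labels.get(lang, {}).get(term, term)


if __name__ == "__main__":
    print(sorted(possible_dates("1/2/13")))   # two readings: ambiguous
    print(sorted(possible_dates("13/2/13")))  # one reading: day-first only
    print(localise_term("Theft From Auto Under $5000", "en"))
```

Because a controlled vocabulary is a closed set, a table like this only needs updating when the data originator adds a term. Free text and keywords offer no such table, since each new record can introduce new wording.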

Analysis of Vancouver datasets
Here are some general observations about language use in Vancouver Open Data datasets.

In general, Metadata for the City of Vancouver Open Data datasets is in English. This isn't tracked in this census.

See also the Vancouver Open Data Catalogue dataset. This provides both a more usable list of datasets as a starting point for this census, and a data structure which might be able to hold the results of the census.