More and more companies and organizations are recognizing the importance of data and information quality. Issues and initiatives such as the value of a single customer view, data integration, fraud prevention, customer relationship management, operational risk management, compliance and anti-terrorism have become boardroom themes. As a result, high-quality customer data have become the prerequisite for successful business decisions. To reach the intended level of data quality, a lot of money is being invested in solutions for input control, file merging, data enrichment and duplicate identification.
But do these investments guarantee high quality data and information? For example, are the data quality tools and processes equipped for the inevitable internationalization of our business community?
How do we know that "William Johnson International Logistics Ltd" and "W. Johnson Int. Transport Co." are probably different appellations for the same company? How do we determine that "Leonard" is a given name in "Leonard Peters" and a surname in "Leonard & Peters"? Without being all that aware of it, we are using methods such as pattern recognition, context analysis and other linguistic considerations.
Data quality has several strong country-specific levels of complexity. Therefore, correct automated processing of data requires specific expertise.
It starts with the country-specific interpretation of customer data. Human Inference has chosen a natural language processing approach to be able to satisfy the principle “think global, act local”.
To answer the question “what’s in a name?”, people use their knowledge of language and culture to interpret the data they encounter in daily life.
Correct, automated interpretation of personal and company names needs to imitate this natural language processing ability. This means that an extensive knowledge dictionary must be built, containing relevant information on the components names consist of. Furthermore, a grammar is needed to take care of issues such as context rules, ambiguity checks, structure recognition, semantic associations and probability estimates. Using the knowledge and the grammar, the software solution decides what is the most probable signification of a word in a name. This is the basis for all data quality processes.
Signification ambiguity, illustrated by the word “Art” in three names:
Paul Simon & Art Garfunkel
Johnson’s Art Gallery
ART Abbot Regency Theatre
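The three “Art” names above can be used to sketch how a knowledge dictionary and a few context rules might pick the most probable signification. This is a toy illustration only, not Human Inference's implementation: the function name `classify_token`, the word list `BUSINESS_WORDS` and the three rules are assumptions made for this example, and a real grammar would also weigh probabilities rather than apply fixed rules.

```python
# Hypothetical mini "knowledge dictionary": words that signal a business name.
BUSINESS_WORDS = {"gallery", "theatre", "logistics", "transport"}


def classify_token(name, word):
    """Guess the role of `word` inside `name` from its context.

    Returns "acronym", "business term", "given name", or None when the
    word does not occur in the name. The rules are illustrative only.
    """
    tokens = name.split()
    positions = [i for i, t in enumerate(tokens) if t.casefold() == word.casefold()]
    if not positions:
        return None
    index = positions[0]
    token = tokens[index]
    following = tokens[index + 1:]

    # Rule 1: an all-caps token spelled out by the initials of the words
    # that follow it is most likely an acronym.
    initials = "".join(w[0] for w in following).upper()
    if token.isupper() and initials.startswith(token):
        return "acronym"

    # Rule 2: a token directly followed by a known business word is
    # most likely part of the business description.
    if following and following[0].casefold() in BUSINESS_WORDS:
        return "business term"

    # Fallback: treat the token as a given name.
    return "given name"


for example in ("Paul Simon & Art Garfunkel",
                "Johnson’s Art Gallery",
                "ART Abbot Regency Theatre"):
    print(example, "->", classify_token(example, "Art"))
```

Even this small sketch shows why the dictionary and the grammar have to work together: the same three letters are resolved only by looking at the surrounding tokens.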
Data Quality Across the Border
Having established an approach in which the knowledge base and the grammar are combined with mathematical and statistical methods, it’s time to take a look across the border. Many companies doing business abroad seem to forget that they are dealing with a large variety of languages, names, address conventions and other culturally embedded business rules and habits. Naturally, data quality issues go beyond relationship data (product data and quality processes, for example, are also important themes in the total data quality proposition), but in this article I will focus briefly on European name specifics, address variety and languages.
The names Haddad, Hernández, Le Fèvre, Smid, Ferreiro, Schmidt, Kuznetsov and Kovács are illustrative of the variety of names in Europe. These names all mean “Smith” in different countries. Naturally, there is a large variety of names in the US as well, but the rules and habits concerning structure, storage, exchange and representation are far more intricate in the various European countries. Here are some examples:
Signification of name components
Due to divergent naming conventions, there is great variety in the storage, exchange, representation and signification of names. For example, the first name Joan is male in Spain and female in Belgium. The representation is also exactly reversed: given name first in the Spanish example, family name first in the Belgian one. The form of address ‘Señor’ is the male equivalent of ‘Mevrouw’. In Spain: Señor Joan Martinez Fonseca Andrade. In Belgium: Mevrouw Vandenwalle, Joan.
A name like Van Buren would be sorted under ‘V’ in the US. In the Netherlands, for example, that name will always be found under the letter ‘B’. Additionally, the spelling (initial capital or not?) of prefixes differs per situation, per country.
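The sorting difference can be sketched as a country-dependent sort key. This is a minimal sketch under stated assumptions: the prefix list `DUTCH_PREFIXES` is a small illustrative sample, the country codes and the function name `surname_sort_key` are invented for this example, and real prefix handling needs a far larger knowledge base.

```python
# Illustrative sample of Dutch surname prefixes (not exhaustive).
DUTCH_PREFIXES = ("van der", "van den", "van de", "van", "de", "den", "ter", "ten")


def surname_sort_key(surname, country):
    """Return a sort key for `surname` following a simplified national rule.

    "US": sort on the full surname, prefix included.
    "NL": sort on the main part of the name, prefix as a tie-breaker.
    """
    lowered = surname.casefold()
    if country == "NL":
        # Try the longest prefixes first so "van der" wins over "van".
        for prefix in sorted(DUTCH_PREFIXES, key=len, reverse=True):
            if lowered.startswith(prefix + " "):
                return (lowered[len(prefix) + 1:], prefix)
    return (lowered, "")


names = ["Van Buren", "Adams", "Carter", "Williams"]
print(sorted(names, key=lambda n: surname_sort_key(n, "US")))
print(sorted(names, key=lambda n: surname_sort_key(n, "NL")))
```

With the US rule, Van Buren lands between Carter and Williams; with the Dutch rule it moves up to sit between Adams and Carter.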
The use of patronymics (names derived from the father’s first name) is highly country-specific. A Russian man whose father’s first name is Ivan will add the patronymic Ivanovich to his name, whereas his sister will use Ivanovna: Sergei Ivanovich Golubev and Olga Ivanovna Golubeva. In Iceland, it is impossible to establish family relationships through analysis of family names, because the patronymic serves as the family name itself: the son and daughter of Björn Thorgeirson will be called Nils Björnson and Anna Björnsdottir, respectively.
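For the Russian example, a crude sketch of how gendered patronymics and surnames could be reduced to a common form before comparison is given below. The suffix lists and the function names are assumptions made for illustration; they cover only the regular endings seen in the example names and would need extensive linguistic knowledge to be reliable.

```python
# Regular masculine and feminine patronymic endings (simplified assumption).
PATRONYMIC_SUFFIXES = ("ovich", "evich", "ovna", "evna")


def father_name(patronymic):
    """Strip a regular patronymic ending; return None if none is found."""
    lowered = patronymic.casefold()
    for suffix in PATRONYMIC_SUFFIXES:
        if lowered.endswith(suffix):
            return lowered[: -len(suffix)]
    return None


def family_stem(surname):
    """Reduce a feminine surname such as Golubeva to the form Golubev."""
    lowered = surname.casefold()
    if lowered.endswith(("ova", "eva", "ina")):
        return lowered[:-1]
    return lowered


def may_be_siblings(name_a, name_b):
    """Compare two 'Given Patronymic Family' names on father and family."""
    _, patronymic_a, family_a = name_a.split()
    _, patronymic_b, family_b = name_b.split()
    father_a = father_name(patronymic_a)
    return (
        father_a is not None
        and father_a == father_name(patronymic_b)
        and family_stem(family_a) == family_stem(family_b)
    )


print(may_be_siblings("Sergei Ivanovich Golubev", "Olga Ivanovna Golubeva"))
```

The same comparison would be meaningless for the Icelandic names, where the last name changes with every generation.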
An even greater challenge lies in the interpretation and processing of European postal addresses. And this is not only because there are so many variations of the postcode. In Europe, there are countries that do not even have a postcode. If we compare three regular addresses of three European countries, we will find that the initial analysis is only the tip of the iceberg. Imagine the variety if you were to compare all the European countries.
| France | Spain | UK |
| Mme. Eva Riebe | Pilar Gonzales | |
| 38b, rue de Benfeld | Paseo de Gracia 202, 2°1a | 15 Cooper Cres |
A few observations:
- Different components with different significations
- Order of the components
- Number of address lines
- Format of the postcode
- Different significations of the addition to the house number
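The observations on component order and house-number additions can be made concrete with one pattern per country for the street line. This is a hedged sketch: the regular expressions, the country codes and the function name `parse_street_line` are assumptions that fit only the three sample lines above, not general national address rules.

```python
import re

# One illustrative pattern per country; each fits only the sample address.
STREET_LINE_PATTERNS = {
    # France: number (+ optional letter), optional comma, then the street.
    "FR": re.compile(r"^(?P<number>\d+)(?P<addition>[A-Za-z]?),?\s+(?P<street>.+)$"),
    # UK: number (+ optional letter), then the street.
    "UK": re.compile(r"^(?P<number>\d+)(?P<addition>[A-Za-z]?)\s+(?P<street>.+)$"),
    # Spain: street first, then number, then an optional addition after a
    # comma (conventionally floor and door).
    "ES": re.compile(r"^(?P<street>.+?)\s+(?P<number>\d+)(?:,\s*(?P<addition>.+))?$"),
}


def parse_street_line(line, country):
    """Split a street line into street, number and addition, or return None."""
    match = STREET_LINE_PATTERNS[country].match(line.strip())
    if match is None:
        return None
    # Unmatched optional groups come back as None; normalise them to "".
    return {key: (value or "") for key, value in match.groupdict().items()}


print(parse_street_line("38b, rue de Benfeld", "FR"))
print(parse_street_line("Paseo de Gracia 202, 2°1a", "ES"))
print(parse_street_line("15 Cooper Cres", "UK"))
```

Note that the "addition" field ends up holding different kinds of information per country, which is exactly why a single generic parser falls short.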
Companies selling products or services in Europe will have to cope with the language (or languages!) of the particular country they are doing business in. Identifying customers, prospects and product descriptions is very hard when there is little or no knowledge of the local language.
For example, when comparing product or customer data in an automated system, a soundexing mechanism will not be able to distinguish successfully between all the European phonological variations, especially when diacritics are involved. There is a huge difference in sound between the product name Süßtrink and a similarly spelled name, and in Scandinavia nobody would consider linking the family name Hällström to a look-alike. In Europe, a robust phrase converter is needed to process phonological similarity and variety. Soundexing is too crude an instrument in a multilingual environment.
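To see why Soundex is so crude, consider a minimal sketch of the classic American Soundex code with diacritics folded to plain letters first. The folding step is an assumption about how such a mechanism is typically fed; the plain spellings "Sustrink" and "Hallstrom" are hypothetical comparison values chosen for this example, not names taken from the article.

```python
import unicodedata

# Classic Soundex digit groups.
_CODES = {}
for _digit, _letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                         "4": "l", "5": "mn", "6": "r"}.items():
    for _ch in _letters:
        _CODES[_ch] = _digit


def fold_to_ascii(text):
    """Lower-case, expand ß to ss, and drop diacritics and non-letters."""
    decomposed = unicodedata.normalize("NFKD", text.casefold())
    return "".join(ch for ch in decomposed if "a" <= ch <= "z")


def soundex(name):
    """Return the four-character American Soundex code of `name`."""
    letters = fold_to_ascii(name)
    if not letters:
        return ""
    result = letters[0].upper()
    previous = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != previous:
            result += code
        previous = code  # a vowel resets this to "", allowing a repeat
        if len(result) == 4:
            break
    return result.ljust(4, "0")


for a, b in (("Süßtrink", "Sustrink"), ("Hällström", "Hallstrom")):
    print(a, soundex(a), b, soundex(b))
```

Both pairs collapse to a single code, so every phonological distinction carried by the diacritics is lost before matching even starts.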
Naturally, there are many more data and information quality aspects to consider when crossing the border. Think of multiple character sets, privacy issues, and different currency and date notation.
This article is not intended to supply an exhaustive list of these aspects; it aims to give an impulse towards new approaches to solving data quality problems.
Companies working with international data are highly dependent on understanding name specifics, address conventions, languages, codepages, culture, habits, business rules and legislation.
Knowledge is the key for successful cross-border business initiatives.
Copyright © 2010 Human Inference Enterprise B.V.
SOURCE: Knowledge is the Key to International Data Quality