Data consistency is everything

I have recently been working with some social scientists on estimating the prevalence of certain disabilities in the UK regions using census data, the Health Survey for England and R. One of the aims is to show the stats on a map, and Edina’s thematic mapping service helps here. However, I was having difficulty with inconsistencies in the UK district classifications. Why is the source data for each of the countries in the UK (available through Edina again) published in a slightly different format?

Scotland:

"COUNCIL_AREA","NAME","ONS_CODE"

"01","Aberdeen City","QA"

England/Wales:

COUNTY_CODE_2001 DISTRICT_CODE_2001 DISTRICT_NAME_2001

00 AA City of London

Northern Ireland:

District,District Labels

95AA,Antrim

So we have two files in CSV and one tab-delimited; one that puts quotes around fields and headers; one that joins the codes into a four-character string; and a different column order in each file. How much time could we all save if the creators of such data talked to one another?
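
For what it is worth, reconciling the three layouts by hand looks something like the sketch below, written in Python rather than R. It uses only the sample rows quoted above. The tab delimiter in the England/Wales file, the choice to join the county and district codes so they resemble the Northern Ireland style, and the names `read_rows` and `normalise` are my own assumptions, not anything the data providers specify.

```python
import csv
import io

# Sample rows copied from the post. The tab delimiter for England/Wales
# is an assumption based on the description "one in tab".
SCOTLAND = '"COUNCIL_AREA","NAME","ONS_CODE"\n"01","Aberdeen City","QA"\n'
ENGLAND_WALES = (
    "COUNTY_CODE_2001\tDISTRICT_CODE_2001\tDISTRICT_NAME_2001\n"
    "00\tAA\tCity of London\n"
)
NORTHERN_IRELAND = "District,District Labels\n95AA,Antrim\n"


def read_rows(text, delimiter):
    """Parse delimited text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


def normalise():
    """Map all three layouts onto one (country, code, name) layout."""
    districts = []
    for row in read_rows(SCOTLAND, ","):
        # The Scottish ONS code is kept exactly as published.
        districts.append(
            {"country": "Scotland", "code": row["ONS_CODE"], "name": row["NAME"]}
        )
    for row in read_rows(ENGLAND_WALES, "\t"):
        # Assumption: county + district code forms the joined code.
        code = row["COUNTY_CODE_2001"] + row["DISTRICT_CODE_2001"]
        districts.append(
            {"country": "England/Wales", "code": code, "name": row["DISTRICT_NAME_2001"]}
        )
    for row in read_rows(NORTHERN_IRELAND, ","):
        districts.append(
            {"country": "Northern Ireland", "code": row["District"], "name": row["District Labels"]}
        )
    return districts


for district in normalise():
    print(district)
```

Every file needs its own delimiter, its own column names and its own rule for building the code, which is exactly the per-source fiddling that a shared format would make unnecessary.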