Data consistency is everything

I have recently been working with some social scientists on estimating the prevalence of certain disabilities in the UK regions, using census data, the Health Survey for England and R. One of the aims is to show the statistics on a map, and Edina’s thematic mapping service helps here. However, I was having difficulty with inconsistencies in the UK district classifications. Why is the source data for each of the countries in the UK (available through Edina again) published in a slightly different format? Some sample lines:



Scotland:

"01","Aberdeen City","QA"



England:

00 AA City of London

Northern Ireland:

District,District Labels


So we have two files in CSV and one tab-delimited, one that puts quotes around fields and headers, one that joins the codes into a four-letter string, and a different column order in each file. How much time could we all save if the creators of such data talked to one another?
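In the meantime, the differences can be papered over with a small reader that is told, per file, which delimiter to expect and which columns hold the code and the name. The following is only a rough sketch in Python, not what I actually ran: the function name `read_districts` and its parameters are made up for illustration, the England sample is assumed to be tab-separated, and the column positions are guessed from the sample lines above.

```python
import csv
import io


def read_districts(text, delimiter, code_cols, name_col, has_header=False):
    """Return a list of (code, name) tuples from one delimited district file.

    code_cols lists the column indices whose values are joined to form the
    district code; name_col is the index of the district name. Both are
    assumptions the caller supplies for each file layout.
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    if has_header:
        next(reader, None)  # skip the header row if the file has one
    rows = []
    for fields in reader:
        if not fields:
            continue  # ignore blank lines
        code = "".join(fields[i].strip() for i in code_cols)
        rows.append((code, fields[name_col].strip()))
    return rows


# Quoted, comma-separated sample (layout assumed: number, name, code).
scotland = read_districts('"01","Aberdeen City","QA"\n', ",",
                          code_cols=[2], name_col=1)

# Tab-separated sample (layout assumed: two code parts, then name).
england = read_districts("00\tAA\tCity of London\n", "\t",
                         code_cols=[0, 1], name_col=2)

# One list in one shape, whatever the source format was.
districts = scotland + england
```

Each file still needs its own line of configuration, which is exactly the time that a shared format would have saved.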