On Sep 2, 2012, at 9:54 AM, Kingsley Idehen wrote:
Have the columns you're interested in been
profiled?
What does that mean?
Excuse me... should have fully spelled it out... "data
profiling"
I first stepped in this particular cow-pie in 1976 when
working with the Massachusetts AFDC (Aid For Dependent
Children) masterfile. Turn on your wayback machine...
"database" really wasn't a word in widespread use, and
databases were even less in actual use. Flat files ruled.
Long story short, egg on my face, turns out there
were—surprise, surprise!!—7 different values in the gender
code field. Who would have thought? Sure wish someone had
told me. I might have looked, but did not have access to the
live data.
To the best of my knowledge, this practice continues today
(I've even done it to myself)... when you look into a
field/column you're likely to find pretty much anything.
I don't remember precisely, but I've been told the Canadian
health care "standard" for gender code is something like 14 or
17 values.
Data profiling is a non-trivial exercise where one examines
in painful statistical detail the actual—as opposed to the
expected/believed—contents of fields/columns. Tends to get
ugly very quickly.
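To make that concrete, here is a minimal sketch in Python of the
simplest profiling step: counting what values actually occur in one
field. The function name profile_column, the "gender" field, and the
sample rows are all made up for illustration; real profiling would go
on to look at lengths, patterns, nulls, and cross-field checks.

```python
from collections import Counter


def profile_column(rows, column):
    """Count how often each distinct value actually occurs in one column."""
    counts = Counter(row.get(column) for row in rows)
    total = sum(counts.values())
    # Most frequent first; each entry is (value, count, share of rows).
    return [(value, n, n / total) for value, n in counts.most_common()]


# Hypothetical sample records; the values are invented for illustration.
rows = [
    {"id": 1, "gender": "M"},
    {"id": 2, "gender": "F"},
    {"id": 3, "gender": "m"},
    {"id": 4, "gender": ""},
    {"id": 5, "gender": "F"},
    {"id": 6, "gender": None},
    {"id": 7, "gender": "9"},
]

profile = profile_column(rows, "gender")
for value, n, share in profile:
    # repr() keeps empty strings and None visible in the report.
    print(f"{value!r:>6}  {n:>3}  {share:.0%}")
```

Even on seven invented rows, a field expected to hold two codes turns
up six distinct values, which is the kind of surprise described above.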
If you're tossing a bunch of RDBMS data into BAGs, I'm a
little surprised you've not encountered this issue.