Thursday, September 11, 2008
The New Big Problem
posted by barsoomcore
At a table during last week's Business of Software conference, I met a fellow whose name escapes me now. Hardly a new occurrence for me, but I do recall the topic we were discussing.
His company does big data migrations for big government agencies. So, for example, if you're in charge of running a standardized health insurance program across Ohio, you need data from every hospital in the state. And of course, each of these hospitals has been storing its data in some insane scheme that was cobbled together by the Pascal programmer who wrote the original record-keeping software back in 1975. It probably wasn't insane, actually. It probably made perfect sense. At the time.
The real problem isn't that this particular Pascal programmer was insane. The problem is that there was more than one Pascal programmer in Ohio in 1975 (I'm guessing, here, but give me a pass on that one). And so the solution that Pascal Programmer A chose for Hospital I was different (perhaps very different, perhaps a little different, doesn't matter) than the solution that Pascal Programmer B chose for Hospital II. Even if the SAME PATIENT was recorded in both hospitals, the data would look subtly different. Pascal Programmer A recorded First_Name and Last_Name as separate columns in a Patient_Data table, while Programmer B made sure NameF and NameL were stored as columns in a Personal_Info table. That's a trivial problem to solve, of course, if you're only worrying about these two databases. It gets more substantial if you're dealing with thousands of databases.
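To make that concrete, here's a minimal sketch of what one of those hand-written mappings might look like. The table and column names come from the made-up example above, and everything else (the common field names, the sample rows) is my own invention for illustration; the real thing would be far messier than this.

```python
# Hypothetical sketch: two hospitals store the same patient in structurally
# different ways, and a hand-written mapping per source schema translates
# each one into a single common shape.

# Hospital I: a Patient_Data table with First_Name / Last_Name columns
hospital_one_row = {"First_Name": "Alice", "Last_Name": "Nguyen"}

# Hospital II: a Personal_Info table with NameF / NameL columns
hospital_two_row = {"NameF": "Alice", "NameL": "Nguyen"}

# One mapping per source table: source column -> common field name
MAPPINGS = {
    "Patient_Data": {"First_Name": "first_name", "Last_Name": "last_name"},
    "Personal_Info": {"NameF": "first_name", "NameL": "last_name"},
}

def normalize(row, source_table):
    """Translate a row from a known source schema into the common schema."""
    mapping = MAPPINGS[source_table]
    return {common: row[src] for src, common in mapping.items()}

print(normalize(hospital_one_row, "Patient_Data"))   # {'first_name': 'Alice', 'last_name': 'Nguyen'}
print(normalize(hospital_two_row, "Personal_Info"))  # {'first_name': 'Alice', 'last_name': 'Nguyen'}
```

Trivial, as I said, for two databases. The catch is that somebody has to write one of those little mapping tables, by hand, for every source schema there is.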
And not all problems in this space are trivial. The same data (or the same types of data) can be stored in an effectively unlimited number of ways, and given the oft-frustrating creativity and ingenuity of people, you can bet that there will be as many variations as there were people involved in the varying.
So this gentleman I met obviously has his work cut out for him, and from what I could tell, he appears to be doing a roaring business handling this sort of thing. Because if you do want this sort of work done, there's really no solution other than to get a bunch of people to go through all your varied databases, manually map one set of data to the other, and then perform whatever complicated trickery your chosen conversion requires, so that at the end of the whole process you have ONE dataset that includes everything all the originating databases contained.
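Strung together, that manual process ends up looking something like the sketch below: one hand-written mapping per source, every record pushed through its mapping, and the results poured into one consolidated dataset. Again, the source names, mappings, and records here are all hypothetical; this is an illustration of the shape of the work, not of how his company actually does it.

```python
# Hypothetical sketch of the consolidation step: apply a per-source mapping
# to every record from every source database and collect one unified dataset.

SOURCES = {
    # source name -> (mapping to the common schema, that source's records)
    "Hospital I": (
        {"First_Name": "first_name", "Last_Name": "last_name"},
        [{"First_Name": "Alice", "Last_Name": "Nguyen"}],
    ),
    "Hospital II": (
        {"NameF": "first_name", "NameL": "last_name"},
        [{"NameF": "Bob", "NameL": "Okafor"}],
    ),
}

def consolidate(sources):
    """Merge every source's records into one dataset in the common schema."""
    unified = []
    for source_name, (mapping, records) in sources.items():
        for row in records:
            normalized = {common: row[src] for src, common in mapping.items()}
            normalized["source"] = source_name  # keep provenance for later cleanup
            unified.append(normalized)
    return unified

for record in consolidate(SOURCES):
    print(record)
```

The code is the easy part. The mappings themselves are the product of all those people going through all those databases, one column at a time.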
And I thought to myself, "Wouldn't it be amazing if somebody figured out how to automate this process? If all the data in the world could just be connected to all the other data in the world? You'd never have to transfer information around, you wouldn't have to keep track of things in various places and forget to update it here after you've updated it there. It would be a transformative thing for the world, if every piece of information we had could talk to every other piece."
I used to think the big problem was latency: that the next big innovation would be some way to accelerate our ability to transfer information from one place to the next. But data normalization on a global scale is BIGGER.
I don't know how it can be done. Just thinking about it, it seems like the most tedious manual process of trivial decision-making imaginable, and yet, how could you devise a process that can look through the masses of databases around the world and understand how they connect to each other? But once you'd done it, well, that would be a world-sized oyster facing you, right there.
I wish I were smarter.
And that's hardly a new occurrence for me, either.
Photo: "network spheres" by gerard79.
Labels: Thinking