Filling in the Empty Space – The Personal Data Store
I said here that there are very few genuine VRM tools available for use right now, and that the main reason is that the underlying plumbing is not yet in place at any kind of scale.
By ‘plumbing’, I mean that ‘personal data stores’ and all that they imply are not as yet deployed en masse, or with any degree of robust functionality.
Before we get into what it will take to change that, let’s take a look at what I mean by the term ‘personal data store’, because obviously that is open to interpretation, and indeed this has been the subject of much debate in the Project VRM community. To get to the heart of that, I think it is useful to draw a parallel to the deployment of data warehouses within organisations, a process which began some 30 years back, and continues to evolve and extend today. The raison d’être for a data warehouse within an organisation is normally to pull together the data from multiple operational sources (silos), organise that data, enhance it and make it available for use, whether that be for analysis within the warehouse, or via applications that will tap into it. Pulling data in from multiple operational systems is the key, because what is being acknowledged is that no one operational system can pull together a data set that is sufficiently rich, deep and broad to enable all of the functions required to run the organisation. That is to say, we need to distinguish between systems that are there to fulfil a specific task (an operational system such as a CRM application, an ERP instance, or a web site), and those whose main purpose is to generate knowledge, enable understanding and enable the sharing of information built across multiple business functions.
A further defining characteristic of the data warehouse is that it runs on ‘atomic level’ data, that is to say data that is stored at the lowest level of detail available from the feeder system (e.g. line item of a receipt). When data is stored in this way, it can be aggregated and summarised where appropriate or necessary for use. This matters because of a further defining characteristic of a data warehouse: one cannot predict in advance all of the uses to which the data might be put, and storing it only at an aggregated level would limit those uses. The same will apply in the personal data store.
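To make the atomic-level point concrete, here is a minimal Python sketch. The line items, field names and the `summarise` helper are all hypothetical, made up for illustration. It only shows that detail records can be rolled up in ways nobody planned for in advance, whereas a stored total could not be broken back down.

```python
from collections import defaultdict

# Hypothetical atomic-level data: one record per receipt line item.
line_items = [
    {"date": "2011-03-01", "shop": "Grocer", "category": "food", "amount": 4.50},
    {"date": "2011-03-01", "shop": "Grocer", "category": "household", "amount": 2.25},
    {"date": "2011-03-09", "shop": "Pharmacy", "category": "health", "amount": 7.00},
    {"date": "2011-04-02", "shop": "Grocer", "category": "food", "amount": 3.50},
]


def summarise(items, key):
    """Roll atomic records up by any grouping, chosen at query time."""
    totals = defaultdict(float)
    for item in items:
        totals[key(item)] += item["amount"]
    return dict(totals)


# Two different roll-ups of the same detail data.
by_category = summarise(line_items, lambda item: item["category"])
by_month = summarise(line_items, lambda item: item["date"][:7])
```

Had only the monthly totals been stored, the per-category view could not be derived afterwards.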
So what else is involved in data warehousing that might inform our thinking about personal data stores?
Firstly, I’d suggest there is a (mainly manual) ‘discovery’ phase in both that is about identifying and engaging with valid data sources (i.e. inputs to the store). In practice the data to be sourced is driven by the prioritised functionality sought by the user. For example, if my main purpose for the personal data store is to help me manage my health, clearly I’m going to need my health and my health care supplier data, or links to it, in the store.
Next, we need to consider the personal equivalent of the ETL processes and tools deployed in data warehouses; ETL is short for Extract, Transform and Load. In recognising the likely need for ETL equivalents, we imply that:
a) the personal data store will have its own target data schema (design), with greater or lesser degrees of flexibility built in, dependent on technical choices. I think there will necessarily be open standards around personal data store design. That’s not the case in the data warehousing world (Oracle, SAP, Teradata, IBM are all largely proprietary), but I don’t think that approach is sustainable for the personal variant, which needs to run at greater scale and much lower cost.
b) most/ all of the data sources will not hold data in precisely the same format/ design as the target data schema.
Extract, Transform and Load usually consists of a set up phase, and then automation; many ETL tools exist in the data warehouse world and it is reasonable to assume that the same will emerge in the individual space (indeed they already are, tactically, with data exchange formats like OFX in the banking world for moving transaction data around). Note that ETL may only be a precursor to a direct feed from a source system into the warehouse, whether that be a batch, trickle or real time feed.
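As a rough illustration of what a personal ETL step might look like, here is a minimal Python sketch using only the standard library. Everything specific in it is an assumption: the CSV layout stands in for some bank export (it is not OFX, which has its own format), and the `transactions` table is an invented target schema, not a proposed standard.

```python
import csv
import io
import sqlite3

# Hypothetical export from a bank; a real feed (e.g. OFX) would look different.
SOURCE_CSV = """Date,Description,Debit,Credit
01/03/2011,GROCER LTD,12.50,
05/03/2011,SALARY,,1500.00
"""


def extract(text):
    """Extract: read the source rows as dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(row):
    """Transform: map one source row onto the (assumed) target schema."""
    day, month, year = row["Date"].split("/")
    debit = float(row["Debit"] or "0")
    credit = float(row["Credit"] or "0")
    # Store a signed amount in pence as an integer.
    amount_pence = round((credit - debit) * 100)
    return (f"{year}-{month}-{day}", row["Description"].title(), amount_pence)


def load(conn, records):
    """Load: write the transformed records into the store."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS transactions "
        "(date TEXT, payee TEXT, amount_pence INTEGER)"
    )
    conn.executemany("INSERT INTO transactions VALUES (?, ?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, [transform(row) for row in extract(SOURCE_CSV)])
rows = conn.execute(
    "SELECT date, payee, amount_pence FROM transactions ORDER BY date"
).fetchall()
```

The set up work is in writing `transform` once per source; after that the same three steps can be re-run automatically whenever a new export arrives.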
Now that we have data in the warehouse/ store, the task lies in organising the data and preparing it for use; there are a range of technology candidates in this area, from standard RDBMS to NoSQL databases. At this point, it may be worth diverting briefly to a harsh reality, because it is pretty certain that this same reality will apply within personal data stores. This reality is that many data warehouses actually become ‘digital dumping grounds’ into which data is put ‘in case we need it later’ (note the clash with data minimisation principles in privacy law), and/ or in which the data is not organised/ optimised for use. That does not necessarily make them a complete waste of time, it just means that they are not providing maximum value; the well worn phrase ‘Garbage In, Garbage Out’ springs to mind. My colleague John McKean tells this story much more eloquently in his first book, The Information Masters, which dates back to 1999 but is as valid today as it was back then. His research amongst the 30 or so ‘Information Master’ organisations sets out what differentiates the tiny percentage of organisations that get mega-returns on their information investments from those that just plod along or suffer regular failure to get a return on investment (hint… the masters don’t regard the issue as something that ‘the IT folks do’).
The further functions of the data warehouse/ personal data store beyond getting data in, and organising it are:
– Data maintenance, i.e. refreshing data as appropriate, and having processes to keep it up to date, whether that be static data, dynamic data, or reference data.
– Data enhancement, either through combining existing data via queries into new attributes or meta data, or by bringing in further external data (e.g. my credit rating, or verification via a third party that a data attribute is accurate at that point in time, or otherwise). This verification piece is a key issue. If I can prove, for example, that I am a gold level flyer on British Airways, or that I’ve not had any speeding tickets in the last 5 years, or that I do have a specific illness to manage, then that ultimately takes a vast amount of guesswork and waste out of the current modus operandi.
– Make available for use; i.e. providing a data access layer that enables the data to flow onward to those entitled to it, in the way that they wish to receive it.
– Archive; there comes a time in the life-cycle of a data attribute when it is no longer useful. This situation, which will certainly apply in a personal data store, can lead the database manager to either physically move the data elsewhere/ onto back up media (usually after building summary histories that do remain), or just leave it within the warehouse on the basis that the storage cost may be less than the removal cost.
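The archive step in the list above, moving old detail out while keeping a summary history behind, might be sketched as follows. This is only an illustration under assumed names: the record layout, the `archive_old` function and the choice of a per-year summary are all hypothetical.

```python
from datetime import date


def archive_old(records, cutoff):
    """Split records into live and archived sets, and build a per-year
    summary of the archived ones that can stay in the store."""
    live, archived, summary = [], [], {}
    for rec in records:
        if rec["date"] < cutoff:
            archived.append(rec)
            entry = summary.setdefault(rec["date"].year, {"count": 0, "total": 0})
            entry["count"] += 1
            entry["total"] += rec["amount_pence"]
        else:
            live.append(rec)
    return live, archived, summary


# Hypothetical records; amounts are in pence.
records = [
    {"date": date(2005, 6, 1), "amount_pence": 1000},
    {"date": date(2005, 9, 15), "amount_pence": 250},
    {"date": date(2010, 1, 20), "amount_pence": 400},
]
live, archived, summary = archive_old(records, cutoff=date(2008, 1, 1))
```

The `archived` list is what would be moved elsewhere or onto back up media; `summary` is the history that remains.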
Two other aspects of data warehousing are probably worth noting:
– whilst initially, a data warehouse was most likely to be a single computer (perhaps costing £1m upwards to buy and install), these days the concept of a virtual warehouse is also a perfectly viable option, with data stored physically in different places and brought together as and when required.
– the concept of a data mart has emerged, which means the carving off of a specific set of data to support a subsidiary warehouse tuned to a particular task (e.g. a retailer may choose to set up a mart for the team managing the loyalty scheme). Typically the link to the main warehouse remains in place for maintenance and update purposes, but the mart acts more independently in terms of access and use.
So what does all of that mean for the ‘personal data store’ then?
Firstly, I would contend that there is a terminology point to be taken on board. The data warehouse is a short, fairly well understood term (perhaps because it is 30 years old). But it actually covers a lot of ground, and is much more than just a storage facility. It covers ‘identify relevant data types and sources, enable processes for bringing that data into the storage facility, keep it clean and up to date by looking back to the source and other cross-reference files, aggregate and summarise data where appropriate, enhance and add meta data where useful, and make available for use in a controlled, auditable manner via a range of output mechanisms and formats’. That’s a lot of functionality to pack into two words… I think that a personal data store will do pretty much all of those same functions, so the users of the term should ideally be aligned with that description, or seek to agree different terms for each of the system components and functions.
Secondly, there should be a recognition that functionality will continue to emerge and evolve over time, rather than all turn up in one big bang deployment. That said, there is clearly a huge upside to deploying with the technology we have available now, rather than that of 30 years ago. The cost of storage and back up is very low, connectivity is solid, access routes/ devices are many, and a wide range of things will be enabled by them, with the internet/ mobile internet as the main place where this user managed information will be deployed.
Third, my working assumption is that there will be both self managed and hosted options, and that people will choose the options that best suit them and their likely uses. It is probable that stand alone personal data stores might not be that common as the market evolves, and that the individual will instead buy into a wider set of personal information management capabilities (e.g. a personal data store, a set of key applications, and a hosting/ back up service).
So, after all that, here’s my working definition of a personal data store:
A personal data store helps me gather, manage, enhance and use information from across multiple aspects of my life, and share that information under my control with other individuals, organisations, or with applications or subsidiary data stores that I wish to enable.
The key, as per above, is that this is a multi-life aspect data management platform that is infinitely extensible, and not constrained by the need to operate within a silo-ed context.
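The ‘share that information under my control’ part of the definition could be sketched, very roughly, as a store that only releases an attribute to a recipient I have explicitly granted it to, and that records every request. The class below is a toy, in-memory illustration in Python; the class name, method names and example attributes are all hypothetical and not drawn from any existing product or standard.

```python
class PersonalDataStore:
    """Toy in-memory store with per-recipient grants and an audit trail."""

    def __init__(self):
        self._data = {}      # attribute name -> value
        self._grants = {}    # recipient -> set of attribute names they may read
        self.audit_log = []  # (recipient, attribute, allowed) for every request

    def put(self, attribute, value):
        self._data[attribute] = value

    def grant(self, recipient, attribute):
        self._grants.setdefault(recipient, set()).add(attribute)

    def revoke(self, recipient, attribute):
        self._grants.get(recipient, set()).discard(attribute)

    def read(self, recipient, attribute):
        allowed = attribute in self._grants.get(recipient, set())
        self.audit_log.append((recipient, attribute, allowed))
        if not allowed:
            raise PermissionError(f"{recipient} may not read {attribute}")
        return self._data[attribute]


store = PersonalDataStore()
store.put("airline_tier", "gold")
store.put("diagnosis", "asthma")
store.grant("travel_agent", "airline_tier")

tier = store.read("travel_agent", "airline_tier")

denied = False
try:
    store.read("travel_agent", "diagnosis")  # never granted, so refused
except PermissionError:
    denied = True
```

A real store would also need identity, verification and persistence; the sketch only shows the control point sitting with the individual rather than with each silo.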
Here’s a diagram that seeks to illustrate the personal data store that I think will emerge over time.
One of the big issues around data warehousing is ‘the business case’ for what is typically regarded as a behind the scenes, not very sexy investment. I think the same will apply to the personal data store, but I’ll save that post for another day…