Real-time or near-zero latency has been a hot topic in data warehousing since 2002. In our industry, that’s a rather long time to be a hot topic. According to the annual predictions, by now we should all be refreshing some part of our data warehouse as often as we possibly can, but few of us do.
I know many of us have dismissed real-time data warehousing (RTDW) as vendor hype – a technology solution in search of a business problem. Certainly when the term was first coined, extract, transform and load (ETL) tools were busily re-branded as data integration platforms; perhaps vendors were fearful that we were running out of data for them to move. After all, the much-anticipated wave of clickstream data warehouses had failed to arrive after the dotcom bubble burst. They need not have worried; data volumes have grown unabated and hitherto unimaginable sources of ‘sub-transactional’ data have sprung up to challenge us. So much so that most organisations are pleased if their ‘static’ data warehouses give them anything like a complete picture of what happened yesterday, never mind the last hour.
Yet talk of RTDW just won’t go away, despite the low number of early adopters. Possibly this is because it is another example of our natural progression from batch to online – inevitable evolution if you like. Once we did all analytical reporting overnight in batch directly from operational systems, the only time when it was safe to do so. We moved on to the daylight querying of haphazard decision support extracts and then eventually to data warehouses to remove that technology-imposed limitation. Initially data warehouses provided online access to historical summary data extracted on a monthly or weekly basis for management reporting only. Increasingly burdensome operational reporting was to be handled by operational data stores. It wasn’t long before this technology-imposed duality was called into question, as daily extracts of atomic-level detail transactions became the data-warehousing norm. Why maintain two sets of reporting structures and make users choose which they query? Why not one database for all reporting, especially as the distinction between analytical and operational reporting has blurred? We’ve learned that when business users cannot access information when and where they want it, IT practices eventually change, once change becomes economically possible and operationally imperative.
I believe we may be reaching the RTDW tipping point now. I have one broadcasting client who has had a very specialised RTDW in production for four years. That’s an anomaly, but I have another client about to embark on a far more ambitious operational data warehouse to match the wholesale integration of their production systems using service-oriented architecture (SOA) technology.
Just as I’m glad we didn’t all attempt to build data warehouses in the early 1990s before we knew enough about dimensional modelling, I’m relieved my clients have waited until now for RTDW. Today we do have enabling integration technology that has been production tested but, more importantly, we have had repeatable success at ‘static’ data warehousing. We’ve learned to walk (almost gracefully) before attempting to run. That’s a good thing because RTDW can be something of a marathon. So how are we going to shape up for it and what must we do to succeed at RTDW?
Realise that RTDW is not just faster ETL. A simple definition of RTDW is the ability to query data on a refresh cycle more frequent than current ETL processes can cope with. That is why real-time ETL is going to double your development time and budget, but it is no use spending millions making data available in minutes if it will not be consumed for hours. RTDW requires end-to-end process re-engineering: not just stretching back to the source systems so they can cope with accelerated extraction (it probably won’t be extraction), but forward to the business intelligence infrastructure to support more active analysis and decision-making. Even dashboards are going to be too passive.
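To make that last point concrete, here is a minimal sketch (in Python, with purely illustrative names and thresholds) of what ‘more active’ consumption might look like: a rule evaluated as each fresh fact arrives, pushing a notification out rather than waiting for someone to open a dashboard.

```python
from dataclasses import dataclass

@dataclass
class SalesFact:
    store_id: str
    amount: float

def notify(message: str) -> None:
    # Stand-in for whatever alerting channel already exists (e-mail, SMS, queue).
    print(f"ALERT: {message}")

def on_new_fact(fact: SalesFact, threshold: float = 10_000.0) -> None:
    """Called by the real-time load process for every arriving fact row."""
    if fact.amount > threshold:
        notify(f"Store {fact.store_id} just posted a sale of {fact.amount:,.2f}")

# The real-time loader would invoke this as rows land.
on_new_fact(SalesFact(store_id="S042", amount=12_500.0))
```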
Make real-time operational sources more co-operative. If they can’t cope with invasive data extraction, they are going to have to push changes to the warehouse or at least notify ETL processes when data are ready, so we don’t needlessly burn cycles trying to extract what isn’t there yet.
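As a sketch of that hand-shake (Python, with an in-process queue standing in for real messaging middleware and all names invented for illustration): the source publishes a small ‘data ready’ notice when changes are committed, and the warehouse side waits on that notice instead of polling tables that may still be empty.

```python
import queue

data_ready = queue.Queue()

def source_commits_changes(batch_id: str, row_count: int) -> None:
    """Source system side: called once a unit of work has been committed."""
    data_ready.put({"batch_id": batch_id, "rows": row_count})

def etl_listener() -> None:
    """Warehouse side: wake up only when the source says there is work to do."""
    notice = data_ready.get()  # blocks until a notification arrives
    print(f"Extracting batch {notice['batch_id']} ({notice['rows']} rows)")

source_commits_changes("batch-2008-05-19-1015", 42)
etl_listener()
```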
Don’t assume data flow is a one-way street. Even if your current data warehouse information is consumed entirely by business intelligence users, real-time questions and answers may have a completely different audience. You’re probably going to have to close the loop and propagate answers back to the operational systems as frequently as you extract from them.
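A minimal sketch of closing that loop, assuming a hypothetical customer-value segment derived in the warehouse and a publish() call standing in for whatever outbound channel (message bus, service call, direct write-back) carries answers back to the operational side:

```python
def derive_segment(lifetime_value: float) -> str:
    # Illustrative business rule computed from warehouse history.
    return "gold" if lifetime_value > 5_000 else "standard"

def publish(target: str, payload: dict) -> None:
    # Stand-in for the outbound channel back to the operational systems.
    print(f"-> {target}: {payload}")

def close_the_loop(customer_id: str, lifetime_value: float) -> None:
    publish("crm.customer_segment",
            {"customer_id": customer_id, "segment": derive_segment(lifetime_value)})

close_the_loop("C1001", 7_250.0)
```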
Adapt our existing ETL techniques. Don’t assume that change data capture and enterprise application integration (EAI) technology alone will allow us to simply bolt on real time. EAI may be appropriate for trickling small to medium volumes of data to and from the warehouse, but messaging technology may be overwhelmed by large bursts of data better suited to our current ETL processing run as micro-batches. EAI also represents a huge financial and technical overhead if only a small number of systems must be integrated.
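One way to adapt, sketched below with illustrative thresholds and names: let trickled change records accumulate in a buffer and flush them as a micro-batch when either a row count or an age limit is reached, so bursts are handled by set-based loading rather than one message at a time.

```python
import time

class MicroBatcher:
    def __init__(self, max_rows: int = 500, max_age_seconds: float = 60.0):
        self.max_rows = max_rows
        self.max_age = max_age_seconds
        self.buffer = []
        self.first_arrival = None

    def add(self, change_record: dict) -> None:
        if not self.buffer:
            self.first_arrival = time.monotonic()
        self.buffer.append(change_record)
        if (len(self.buffer) >= self.max_rows or
                time.monotonic() - self.first_arrival >= self.max_age):
            self.flush()

    def flush(self) -> None:
        if self.buffer:
            # Stand-in for a bulk insert into the warehouse's real-time tables.
            print(f"Bulk-loading {len(self.buffer)} rows into the warehouse")
            self.buffer = []

batcher = MicroBatcher(max_rows=3)
for i in range(7):          # a small burst of change records
    batcher.add({"order_id": i})
batcher.flush()             # drain whatever remains
```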
Be honest about how data quality is going to affect your RTDW. You can’t be as thorough in running your suite of data quality tests in ‘real time’ as you would in overnight batch.
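One pragmatic response, sketched here with invented checks, is a two-tier approach: a handful of cheap per-row tests run in the real-time path, with the heavier, set-based reconciliation deferred to the overnight batch.

```python
def realtime_checks(row: dict) -> bool:
    """Cheap, per-row checks only: fail fast, never look anything up."""
    return row.get("customer_id") is not None and row.get("amount", 0) >= 0

def overnight_checks(rows: list) -> list:
    """Heavier, set-based checks: duplicates, referential integrity, totals."""
    seen, problems = set(), []
    for row in rows:
        if row["order_id"] in seen:
            problems.append(("duplicate", row["order_id"]))
        seen.add(row["order_id"])
    return problems

rows = [{"order_id": 1, "customer_id": "C1", "amount": 10.0},
        {"order_id": 1, "customer_id": "C2", "amount": 5.0}]
loaded = [r for r in rows if realtime_checks(r)]   # real-time path
print(overnight_checks(loaded))                    # batch path finds the duplicate
```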
Continue to use dimensional techniques. Normalised models may be tempting as they can closely match the sources of real-time data and cut down the amount of transformation needed to the point where it will fit into our acceptable refresh cycle. However, these structures are notoriously query-unfriendly and typically rely on a succession of dependent data marts downstream for business intelligence performance. Our calculation of latency has to include the time taken to ready the data for query, and typical query processing itself, not just the time to store it. For near-zero latency, the tables/cubes we update in real time are going to have to be the ones we query.
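A small worked example of that latency arithmetic, with purely illustrative numbers: even if trickle-feeding a normalised store takes only a couple of minutes, the dependent data mart refresh and the query itself dominate what the business user actually experiences.

```python
latency_minutes = {
    "capture and load into normalised store": 2,
    "refresh of dependent data mart": 30,
    "query processing": 1,
}
print(f"Effective latency: {sum(latency_minutes.values())} minutes")  # 33 minutes, not 2
```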
Use Enterprise Information Integration (EII) technology to prototype RTDW or add a lightweight RTDW component to your existing warehouse. EII’s virtual data warehousing could never replace real data warehouses, but as an adjunct its user-triggered ETL could provide real-time answers where data volumes are low and usage is light or as yet unproven.
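A minimal sketch of that virtual, user-triggered approach (every function here is a stand-in for a real source or warehouse connection): the ‘real-time’ answer is assembled only when somebody asks, by combining what the warehouse already holds with a small on-demand pull from the operational source.

```python
def query_warehouse(customer_id: str) -> list:
    # Stand-in for a query against the existing static warehouse.
    return [{"customer_id": customer_id, "amount": 120.0, "source": "warehouse"}]

def query_operational_source(customer_id: str) -> list:
    # User-triggered: runs only when someone asks, so it suits low volumes.
    return [{"customer_id": customer_id, "amount": 15.0, "source": "live"}]

def virtual_view(customer_id: str) -> list:
    return query_warehouse(customer_id) + query_operational_source(customer_id)

print(virtual_view("C1001"))
```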
Develop real-time partitions for our RTDW fact tables. These may not necessarily be native database partitions, as our real-time structures will likely need a more lightweight indexing strategy, possibly not based on surrogate keys. That last point may sound like heresy to any dimensional modeller who knows me, but we don’t have the luxury of our usual batch windows to do that level of housekeeping. But don’t worry. We’ll promise to clean up overnight, sweeping the real-time data into the static partitions where we can fully support slowly changing dimensions and data quality improvements.
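A minimal sketch of such a real-time partition and its overnight sweep, using an in-memory SQLite database as a stand-in and invented table names: the real-time table carries no surrogate keys and only a light index, and the nightly job moves its rows into the static fact table where the usual housekeeping can happen.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE sales_fact_realtime (order_ref TEXT, sale_date TEXT, amount REAL);
    CREATE INDEX ix_rt_date ON sales_fact_realtime (sale_date);
    CREATE TABLE sales_fact (order_ref TEXT, date_key INTEGER, amount REAL);
""")

# Trickle-feed during the day: cheap inserts, no surrogate-key lookups.
con.execute("INSERT INTO sales_fact_realtime VALUES ('A-17', '2008-05-19', 99.50)")

def nightly_sweep(con: sqlite3.Connection) -> None:
    """Move the day's rows into the static partition, assigning keys on the way."""
    con.execute("""
        INSERT INTO sales_fact (order_ref, date_key, amount)
        SELECT order_ref,
               CAST(REPLACE(sale_date, '-', '') AS INTEGER),  -- simple date key
               amount
        FROM sales_fact_realtime
    """)
    con.execute("DELETE FROM sales_fact_realtime")
    con.commit()

nightly_sweep(con)
print(con.execute("SELECT * FROM sales_fact").fetchall())
```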
You still don’t believe? Almost every data-warehousing project I have been associated with that began with strategic/managerial analysis goals has been hijacked by operational requirements. When your operational processes develop a need to compare the most recent transactions to the historic detail in your warehouse, you’ll receive your RTDW orders. Interesting times lie ahead.
Based in the UK, Lawrence Corr is a leading data warehouse design specialist and highly experienced trainer with over 20 years of industry experience in decision support and data analysis. He will be presenting a four-day Data Warehouse Design and Development Master Class for Effective Business Intelligence in Cape Town from 19 to 22 May 2008. This technology-neutral event is being hosted by Alicornio Africa, a premier partner of Pitney Bowes in South Africa.
World-class software solutions from Pitney Bowes – Sagent Data Flow and Customer Data Quality – are available to help South Africa embrace the international trend towards RTDW. They offer the local market a technologically advanced platform for data integration, data quality management, data profiling, data warehousing, business intelligence or any other data-related project.