Data Driven: Creating a Data Culture (Free Book)

It’s been a while (almost two years), but I’m back and ready to bring you the best information and resources out there on Information Management.

We’ll start 2015 with a giveaway: a free book just published by thought leaders on building a great data culture in organizations.

It’s an area I’d like to focus on over the next few years, now that we have some excellent tools that allow true self-service Business Intelligence.

Get the book here: http://www.oreilly.com/data/free/data-driven.csp

Best Wishes for 2015 🙂

Data Driven
Creating a Data Culture
Publisher: O’Reilly
Released: January 2015

Succeeding with data isn’t just a matter of putting Hadoop in your machine room, or hiring some physicists with crazy math skills. It requires you to develop a data culture that involves people throughout the organization. In this O’Reilly report, DJ Patil and Hilary Mason outline the steps you need to take if your company is to be truly data-driven—including the questions you should ask and the methods you should adopt.

You’ll not only see examples of how Google, LinkedIn, and Facebook use their data, but also learn how Walmart, UPS, and other organizations took advantage of this resource long before the advent of Big Data. No matter how you approach it, building a data culture is the key to success in the 21st century.

You’ll explore:

  • Data scientist skills—and why every company needs a Spock
  • How the benefits of giving company-wide access to data outweigh the costs
  • Why data-driven organizations use the scientific method to explore and solve data problems
  • Key questions to help you develop a research-specific process for tackling important issues
  • What to consider when assembling your data team
  • Developing processes to keep your data team (and company) engaged
  • Choosing technologies that are powerful, support teamwork, and are easy to use and learn

Known for coining the term Data Scientist, DJ Patil has held a variety of roles in academia, industry, and government, including VP of Product at RelateIQ (acquired by Salesforce).

Hilary Mason is Founder and CEO of Fast Forward Labs, an independent data technology research lab, and Data Scientist in Residence at Accel Partners. Previously, she was Chief Scientist at bitly.

Amazon Redshift – Review and First Impressions

Guest Blog: Jeremy Winters, www.full360.com

As an AWS Premier partner, and the only partner with expertise in columnar databases in the cloud, Full360 was given early preview access to Amazon Redshift, the new cloud-based columnar data warehouse offering announced at the re:Invent conference in November. Here are some of my initial impressions and experiences from working with it over the past month.

Setup and Configuration

Setting up a Redshift cluster is a quick and easy process. You just define a few high-level parameters, such as the cluster name, database name, and master user name, pick the size and number of nodes you want, and AWS does the rest. Within a few minutes your cluster will be built and ready for you to configure and build the database.

Cluster and database configuration is performed through the web UI or via SQL statements over a JDBC/ODBC connection. You do not have access to the command line on any of the Redshift servers. This approach forces users to treat the database as a service (which is Amazon’s intent) and a data source, rather than mucking around with OS-level software, settings, and related issues.
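
For example, everyday administration that you might otherwise do at the OS level is instead handled with plain SQL over that same connection. A minimal sketch (the user name, password and schema are hypothetical):

    -- Create a reporting user and grant it read access, entirely in SQL
    CREATE USER report_user PASSWORD 'Str0ngPassw0rd1';
    GRANT USAGE ON SCHEMA public TO report_user;
    GRANT SELECT ON ALL TABLES IN SCHEMA public TO report_user;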

Client Drivers

I had some issues with the JDBC driver in the first round, as it didn’t work with all of my usual client software. Everything works fine if you follow AWS’s recommendation and use SQL Workbench/J. The driver also worked fine in Talend Open Studio, but only if I used the generic JDBC components rather than the Postgres components.

If I were developing an application on top of Redshift right now, I would use APIs based on generic JDBC or ODBC drivers (such as DBI) rather than something like a Postgres-specific Ruby gem. I suspect that Amazon will iron this out in the future, easing the transition of existing Postgres-based applications to Redshift.

Web UI

The AWS console portion of Redshift is very impressive. One nice feature of these screens is the way queries are graphed alongside the performance metrics. You can hover over a specific query and the metrics for that query’s duration are automatically highlighted.

[Screenshot: Redshift UI query hover]

Another tab in the UI allows you to browse a list of all queries executed against the database. Any time you see a query ID in the interface, it is a hyperlink that will take you to the detail page for that query. In addition to the basic query info and related performance stats, this page shows you the explain plan for the query. If the SQL statement is a data load, it will even provide a list of the file names that were loaded during the query execution!

[Screenshot: Redshift query detail page]
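
If you prefer SQL to the console, the load details are also exposed through Redshift’s system tables. A rough sketch of the kind of query I mean (based on my reading of the stl_load_commits table, so treat the column names as an assumption to verify against the documentation):

    -- List recently committed load files and how many lines each contributed
    SELECT query, TRIM(filename) AS loaded_file, lines_scanned, status
    FROM   stl_load_commits
    ORDER  BY curtime DESC
    LIMIT  20;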

Loading Data

Not having direct access to the server back-end means that bulk data loads cannot be performed from the local file system. Redshift handles this by providing direct access to files stored in S3. Even better, if you have your data set broken down into smaller files, they will automatically be loaded in parallel. Loading my single-file, 1B-row fact table into Redshift took 4 hours 35 minutes, but breaking the table into 40 smaller files resulted in a 7-minute load, which is nearly a linear improvement at 38X!
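
For reference, a bulk load is a single COPY statement pointed at an S3 key prefix; if that prefix matches many split files, Redshift spreads the load across the nodes. A minimal sketch (the table name, bucket, delimiter and credential placeholders are mine, not from the original demo):

    -- Load every file whose key starts with the given prefix, in parallel
    COPY fact_sales
    FROM 's3://my-bucket/fact_sales/part_'
    CREDENTIALS 'aws_access_key_id=<access-key>;aws_secret_access_key=<secret-key>'
    DELIMITER '|'
    GZIP;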

While developing my Redshift demo, I used a repeatable SQL script which rebuilt the entire database each time I ran it. This script involves DDL, data loading, and some data transformation with large insert/select queries. The end-to-end script execution typically takes around 25 minutes to complete.

Architecture and Tuning

Redshift, like other columnar databases, optimizes disk storage for query performance in the following ways:

  • Each column is stored in a separate set of files.
  • Columns are compressed with the encoding scheme appropriate to the nature of the data stored.
  • Sort order for the columns allows for faster predicate lookups, merge joins and aggregate grouping.
  • Records in a single table are distributed across nodes. Fact and large dimension tables can be distributed on the foreign key fields in order to keep the joins local to the node.

In Redshift, you have to specify your encoding (compression) types, sort keys, and distribution keys in the CREATE TABLE statement. While Redshift does provide a function that allows you to analyze a table and determine the best encoding scheme (and this can be performed automatically upon the first data load), it does not currently have any mechanism that will define your sort keys or distribution keys for you.
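
To give a flavour of what that looks like, here is a minimal sketch of such a table definition (the table, columns and specific encodings are hypothetical, chosen only to illustrate the syntax):

    -- Column-level encodings plus table-level distribution and sort keys
    CREATE TABLE fact_sales (
        sale_date    DATE          NOT NULL ENCODE delta,
        store_id     INTEGER       NOT NULL ENCODE bytedict,  -- low-cardinality dimension key
        customer_id  INTEGER       NOT NULL ENCODE raw,       -- distribution key, left uncompressed
        sale_amount  DECIMAL(12,2)          ENCODE mostly16
    )
    DISTKEY (customer_id)          -- keeps joins on customer_id local to a node
    SORTKEY (sale_date, store_id); -- matches the most common predicate and grouping columns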

A limitation here is that a single encoding/sorting/distribution scheme will not be optimal for all cases. Other columnar databases provide the ability to have multiple projections of a single table, each with different encodings and sort orders, but Redshift requires you to do this in a separate table.

After a few rounds of tweaking the encoding and sort orders on my fact table, I was able to get 1-2 second query times for the use case I optimized for. In other columnar databases, I was able to get sub-second query times for the same data set. In the case of BI reporting, I think that 1-2 seconds is pretty reasonable… especially with a billion-row fact table!

In my tests, I purposely ran queries that fall outside of the sort order and encoding scheme to see how the database handled them. These queries tend to run in the 10-60 second range… which is comparable to (and even a bit better than) the other columnar options I have tested.

Note that maintaining good performance requires ongoing maintenance, such as updating the statistics and running the VACUUM statement, which ensures that your newly loaded data is fully optimized for querying. These are simple tasks to perform, but at this point you have to set up the automation yourself. A good time to run them is immediately after a data load has completed.
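
In practice that post-load housekeeping is just two statements, which you can script against the (hypothetical) fact table from earlier:

    VACUUM fact_sales;    -- re-sorts rows and reclaims space left by loads and deletes
    ANALYZE fact_sales;   -- refreshes the statistics the query planner relies on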

Conclusion?

I expect to see a lot of companies jumping straight into Redshift as soon as it’s released. The price is unbeatable, and getting started is easy. Companies that are currently throwing away detailed data will now be able to retain it inexpensively in S3 and report on it for as long as they desire.

BI and general reporting use cases should work well with Redshift, though I have yet to see it deliver the sub-second performance that may be critical for some analytic applications. Also, the fact that the entire cluster becomes unavailable if any single node fails may be cause for some concern. To be fair, AWS has indicated that it will only take a few minutes to automatically replace a failed node, but this may not be acceptable for some customer-facing applications. The same concern applies to the mandatory maintenance window you are required to schedule each week.

The thing I suspect will be the real hurdle for most companies is database design and optimization. If you’re not familiar with columnar databases, you have a definite learning curve to get past. Approaches for optimizing row-based systems, such as normalization and indexing, do not really apply to columnar databases. You really need to understand your data and the way it is going to be used if you want to squeeze the magic out of the system.

2012 in review

The WordPress.com stats helper monkeys prepared a 2012 annual report for this blog.

Here’s an excerpt:

600 people reached the top of Mt. Everest in 2012. This blog got about 9,700 views in 2012. If every person who reached the top of Mt. Everest viewed this blog, it would have taken 16 years to get that many views.

Click here to see the complete report.

SAS Business Analytics – Visualisation, Mobility and Reporting

I recently attended a SAS Business Analytics seminar in Melbourne. The session provided insights from SAS, addressing the following questions:

  • How business analytics can integrate data from across an organisation to deliver self-service reporting and analysis.
  • How the power of business visualisation can transform how we see, discover and share insights hidden in our data.
  • What pressures and market conditions are driving us to adopt mobile analytic reporting.
  • How and why SAS can help us move from insight to performance.

The session was well attended with a number of different industries and organisations represented.

The key takeaways for me:

  1. ‘Business Analytics’ is about discovering why an event occurred, and not simply reporting on it, which falls more into the ‘Business Intelligence’ way of thinking. Business Analytics helps predict what will happen in the future. (http://www.sas.com/businessanalytics/)
  2. If an organisation is looking to compete in their marketplace with ‘Competitive Advantage’, Business Analytics is a key enabler.
  3. SAS have developed a great Business Analytics value chain: Analysis -> Forecasting -> Predictive Modelling -> Optimisation (of business processes).
  4. There are challenges to be faced and resolved on the Business Analytics journey.

a. Data Governance and Data Quality – As with any data project, if data quality is an issue, the business insights you generate will be at best low value and at worst wrong. A Business Analytics project, unsurprisingly, needs good quality data.

b. One version of the truth – Integrated data ensures consistency of the insights generated and provides an easy access path to data.

c. Operationalisation of intelligence – Reducing the risk of having business insight locked up in individual resources, operationalisation of intelligence ensures insight generated can be used across the organisation.

d. Big Data!

There was a great case study, which for me really highlighted the power and competitive advantage of Business Analytics.

State Fleet of New South Wales, Australia, have been able to accurately set the lease price of the 12,000 cars per year they lease to NSW public sector workers, by using ‘Predictive Analytics’ to accurately forecast the residual/sale value of each car at the end of its lease period. With this capability they are saving millions of dollars in potential losses: with a fleet of over 25,000 cars, every 1% error in the calculated vs. actual end-of-lease sale value of their fleet will cost State Fleet over $3 million.
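
As a rough sanity check on that figure (the per-car value here is my own back-of-envelope assumption, not a number from the presentation): a 1% valuation error costing $3 million across 25,000 cars implies an average end-of-lease value of roughly $12,000 per car, since

    25,000 cars × $12,000 average residual value × 1% = $3,000,000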

No seminar/presentation would be complete without a section and discussion on Big Data, and SAS gave us their take, including what SAS see as the fourth V of big data – Value. When you think about it, this really is far more important than Velocity, Variety and Volume. Big Data ‘Value’ for SAS means focusing effort on analysing data where high-value insights can be generated. SAS further define Big Data as being able to perform analytics in a much shorter timescale than previously possible – ‘High Performance Analytics’ – so it’s not all about how big a data set you have for your Big Data initiatives. You can still be doing ‘Big Data’ with a small amount of data, where the data set contains a large amount of untapped, high-value business insight but requires a high level of processing to unlock that value.

We had some practical demonstrations of current and new SAS tools that support Business Analytics – visualisation, mobility and reporting.

  •  SAS Mobile Business Intelligence (http://www.sas.com/technologies/bi/mobile/index.html ) – via Roambi™ ES for SAS, organizations can deliver real-time analytics to Apple iPhones and iPads, empowering users to monitor key metrics and make informed decisions wherever they are. A great move by SAS in partnering with Roambi (http://www.Roambi.com) to provide a visually appealing and market leading MobileBI experience.
  •  SAS Social Media Analytics (http://www.sas.com/software/customer-intelligence/social-media-analytics/index.html) – integrates, archives, analyses and enables organizations to act on intelligence gleaned from online conversations on professional and consumer-generated media sites. It enables an organisation to attribute online conversations to specific parts of the business, allowing accelerated responses to marketplace shifts. ‘Sentiment Analysis’ made simple.
  •  SAS Office Analytics (http://www.sas.com/technologies/bi/office-analytics.html) – connects analytics data with Microsoft Office products (Excel, PowerPoint, Word and Outlook) to produce consistent views of data, automate reporting and add analytical insights while keeping information consumers within their interface comfort zone. The ability to directly access and view analytics from within Outlook looked very good, and providing business users with the option to remain within a tool they are already familiar with, like Excel, can only further drive analytics uptake in the organisation.

Altis has been actively engaged in a number of successful Business Analytics projects over the last few years, and this seminar has strengthened my understanding and belief that successfully establishing and embedding Business Analytics within an organisation can generate massive competitive advantage.

I look forward to sharing our success stories and blogging further about our thoughts on how to deliver insights via Business Analytics. Altis expects 2013 to be another growth year for Business Analytics, with the smart companies using it to their advantage.

Data Quality – Go for Gold

July 16, 2012, by Simon McAlister (Altis Consulting)

From time to time, ETL will highlight data quality issues.  There is often a choice between fixing the issue at the source or in the ETL processes.

In this blog I argue that the issue should be fixed at the source, but if that isn’t practical, I provide some guidelines to ensure the best outcome.

Correcting the issue at the source is the gold standard, as this will not only correct the data itself but may also address the root cause of the underlying problem (be it process, standards or a technical error). It’s always worthwhile aiming for the problem to be fixed at the source, for the following reasons. In other words … Go for Gold!

It stays fixed

If the issue is fixed at the source, it can prevent similar or related data quality issues from re-occurring. A one-off correction made within the receiving system is no guarantee that the data won’t be reloaded or re-submitted incorrectly in the future, in which case the issue will re-appear. Correct it at the source and it stays fixed.

It stays fixed for everyone

If the issue is fixed at the source, not only is it fixed for the warehouse, it is also fixed for the source system and any other users of the data. Never underestimate the breadth of impact a data quality issue may have. Imagine a spider’s web with lots of tangled threads radiating out from the centre, and you’re looking at how data quality issues can thread through an organisation. Enforcing a solution at the source means everyone benefits from a consistent and correct view of the data.

It’s usually cheaper

Even though it may appear easier to just patch the data in the warehouse, appearances are often deceptive. It’s worthwhile keeping the following saying in mind:

“If you want quick and dirty, we can guarantee the dirty but not the quick”.  One example of hidden costs is reconciling the source and the target.
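
To make that concrete, even a ‘simple’ patch usually brings an ongoing reconciliation check with it, along the lines of this sketch (table and column names are hypothetical):

    -- Compare row counts and totals between the source extract and the warehouse
    SELECT 'source' AS side, COUNT(*) AS row_count, SUM(amount) AS total_amount
    FROM   stg_sales_extract
    UNION ALL
    SELECT 'warehouse' AS side, COUNT(*) AS row_count, SUM(amount) AS total_amount
    FROM   fact_sales;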

The reality is, though, that sometimes we are forced to apply a patch in the data warehouse. The root cause may involve external systems over which we have little control, and the fix may be complex and take a long time to co-ordinate; meanwhile, we need to do something to keep the data flowing.

In this case we need to settle (at least temporarily) for silver, but here are some steps to ensure that the fix is as effective as possible:

Raise the Issue

Ensure the issue is communicated so that other users of the data are aware of it. Those responsible for the root cause should know about it so that, at the very least, its impact doesn’t grow. Highlighting the impact of a data quality issue may be all that’s needed to get the ball rolling towards addressing it.

Isolate the fix

Minimize the flow-on effect of the issue. In other words, avoid becoming part of the problem. Once the root cause is fixed, it should be easy to remove the temporary fix. Don’t compromise on your standards; keep aiming for the best data quality possible. Remember – go for gold!
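
As an illustration, an isolated and clearly labelled temporary fix might look like the sketch below (the table, values and issue reference are hypothetical). Keeping the correction in its own step, tagged with the issue number, makes it easy to find and remove once the source is fixed:

    -- DQ-123 temporary patch: source system currently sends ISO-3 country codes.
    -- Remove this step once the source extract is corrected.
    UPDATE stg_customer
    SET    country_code = 'AU'
    WHERE  country_code = 'AUS';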

Revisit the Issue

Keep an eye on progress towards fixing the root cause.  Don’t drop the issue because the symptoms have disappeared for the moment.

Keep aiming for gold in data quality!

Simon McAlister

Crunch time for big data

Source: http://www.uk.capgemini.com/news-centre/news/crunch-time-for-big-data/

In a global survey of 600 executives this month by Capgemini and the Economist Intelligence Unit, nine out of 10 respondents identified data as being the fourth factor of production – as fundamental to business as land, labour and capital.

18 June 2012
Author: Paul Taylor
Publication: Financial Times

Companies are awash with data, some generated by their customers or systems, some by third parties. These data are growing so fast – by about 2.5 exabytes a day – that 90 per cent of the stored data in the world today has been created in just the past two years, earning it the geeky moniker “big data”.

Whether big data becomes an organisation’s greatest asset or one of its gravest liabilities depends on the strategies and solutions it puts in place to deal with the epic growth in data volumes, complexity, diversity and velocity.

This message seems to be getting through. Among the survey’s other findings, respondents said the use of big data has improved businesses’ performance, on average, by 26 per cent and that the impact will grow to 41 per cent over the next three years.

Almost 60 per cent of companies said they planned to make a bigger investment in big data over the next three years, suggesting that the era of big data and big data analytics has already arrived.

To read the full article on FT.com, please click here: Crunch time for big data

Big Data – the end of Data Warehousing?

There is a massive amount of hype and buzz in the Data Warehousing and Business Intelligence marketplace surrounding the term ‘Big Data’. Recently we have even seen talk of Big Data as a replacement for Data Warehousing. I believe this is a misunderstanding of what Big Data is; in fact, Big Data strategies only work if they co-exist with a well thought-out and supported Enterprise Data Warehouse. So I don’t believe we are witnessing the end of Data Warehousing – and here’s why.

First, what is Big Data? In John Bantleman’s recent blog Raw is More, he defines Big Data using the criteria of volume, velocity, variety and value.  This is a great definition and captures exactly why the hype, buzz and excitement around Big Data will be with us for some time – businesses now have the means to collect, store and analyse huge volumes of data, from varied sources, at high frequency, in a very cost efficient manner – and this hasn’t been possible before.

I recall the days of the first dot-com boom, when trying to capture and store all the detailed data generated by people browsing a website – every click, interaction and page viewed – over a period of more than a month was nigh-on impossible. A client providing share trading services couldn’t hold more than 14 days’ worth of detailed browsing data – so think how difficult it was to generate insights into user behaviour. With the arrival of Big Data this problem goes away; it’s possible to keep the detailed data for much longer.

So where does an Enterprise Data Warehouse (EDW) fit into the picture? Are we now witnessing the demise of the EDW, to be replaced by ‘Big Data’ systems? In short … no. For an organisation to get value out of its data, it must be able to generate insights quickly, effectively and for as many user groups as possible. For this you need a well-structured Data Warehouse.

In a recent Australian CIO article, ‘Five things CIOs should know about Big Data’, the misinformed idea is presented that ‘Big Data’ somehow allows an organisation to forget all the hard work and thinking that goes into creating a well-constructed EDW. The article suggests that a Big Data implementation will enable:

  1. Access to data by more than just a handful of highly paid and hard-to-find Data Scientists. Untrue – you will need even more sophisticated data analysis skills if your data is not structured in a logical way, and those are skills most people in the organisation do not have.
  2. Support for all the business questions that can be thrown at it, unlike an EDW, and without the need for any structure. Untrue – a well-designed dimensional data model at the core of the EDW supports a wide variety of business questions, and the data model doesn’t prevent, limit or second-guess those questions. Structure actually makes it easier to navigate the data and generate insight. Good luck navigating your unstructured Big Data store without your expert guide available!
  3. As much detail data as the underlying infrastructure can support. True, but you still have to have the means and capability to access that data.

The article goes even further, suggesting that ‘You can use a [Big Data repository] as a dumping ground, and run the analysis on top of it, and discover the relationships later.’ I’ve seen ‘data dumps’ and they are not fun for anyone to use. They typically suffer from extremely poor data quality, poor performance and a lack of control – which is exactly why we’ve spent 20 years refining the approach to supporting businesses in generating insight from Data Warehouses!

We believe that both ‘Big Data’ and Enterprise Data Warehousing need to co-exist, supporting the need for organisations to generate insight from all of their data. Big Data provides the deep analytical capability to generate insight from huge volumes of data and transactions that you just wouldn’t need to make available to everybody on expensive hardware, whereas an Enterprise Data Warehouse brings insight from data to as many business users as possible, in a structured and planned way.

Is there a meeting point in the future? We believe there could be – a ‘Big Data 2.0’ where an Enterprise Data Warehouse can take advantage of the infrastructure approach that ‘Big Data’ uses. In the meantime, if your ‘Big Data’ vendor tells you that you don’t need that Data Warehouse any more, come and talk to us at Altis Consulting for a more rounded and balanced view.
