How to Transform into a Data-Driven Organization?

By Konain Qurban, on DATAVERSITY. It is a journey to ensure the alignment of analytics initiatives with organizational objectives, combined with consistent and effective coordination of activities across all business units. The road from a pile of raw data to insights, and from insights to action, is paved with strategic goals.


How has Covid-19 impacted investments in data?


A new IAB study states that COVID-19 may cause cuts of roughly 20% in data spending in Q3 2020. Survey respondents expect the short-term impact of COVID-19 to require budget cuts and a reallocation of investments; there is less consensus on the outlook for Q4 2020 and beyond. Here are a few key stats from the study (each figure is the share of respondents):

  • 64.2% — decrease in Q2/Q3 2020 marketing spend
  • 49.1% — change messaging and content priorities
  • 26.4% — de-emphasize certain products or business lines
  • 22.6% — shift budget from brand initiatives to performance-oriented marketing
  • 20.8% — decrease Q4 2020 and beyond marketing spend
  • 17.0% — increase Q4 2020 and beyond marketing spend

The report focuses on U.S. advertiser and marketer spend on third-party audience data, as well as spend on the services, technologies, and hybrid activation solutions that support the use of data across consumer and B2B marketing. It outlines the major demand drivers, operational challenges, and other trends affecting investment in data.

Check out the article on MediaPost https://www.mediapost.com/publications/article/353961/iab-the-impact-covid-has-on-data-spending.html for more details.

Time based heatmaps in R


Tutorial Scenario

In this tutorial, we will build heatmaps of Seattle 911 calls, sliced by various time periods and by type of incident.  This awesome dataset is available as part of the data.gov open data project.

 

Steps

The code below walks through 6 main steps:

  1. Install and load packages
  2. Load data files
  3. Add time variables
  4. Create summary table
  5. Create heatmap
  6. Celebrate

 

 

Code

 

#################### Install and Load Packages ####################

# Install once, then comment these lines out on subsequent runs
install.packages("plyr")
install.packages("lubridate")
install.packages("ggplot2")
install.packages("dplyr")

library(plyr)
library(lubridate)
library(ggplot2)
library(dplyr)

#################### Set Variables and Import Data ####################

# Dataset: https://catalog.data.gov/dataset/seattle-police-department-911-incident-response-52779
incidents <- read.table("https://data.seattle.gov/api/views/3k2p-39jp/rows.csv?accessType=DOWNLOAD",
                        header = TRUE, sep = ",", fill = TRUE, stringsAsFactors = FALSE)

# Colors for the low and high ends of the heatmap gradient
col1 <- "#d8e1cf"
col2 <- "#438484"

# Quick sanity checks on the imported data
head(incidents)
str(incidents)

 

#################### Transform ####################

# Convert the clearance date with lubridate and derive time components
incidents$ymd   <- mdy_hms(incidents$Event.Clearance.Date)
incidents$month <- month(incidents$ymd, label = TRUE)
incidents$year  <- year(incidents$ymd)
incidents$wday  <- wday(incidents$ymd, label = TRUE)
incidents$hour  <- hour(incidents$ymd)

head(incidents)

 

#################### Heatmap: Incidents per Hour and Day of Week ####################

# Create the summary table for the heatmap -- incident counts by hour and day of week
dayHour <- ddply(incidents, c("hour", "wday"), summarise,
                 N = length(ymd))

# Reverse the weekday factor so the days read top-to-bottom on the y-axis
dayHour$wday <- factor(dayHour$wday, levels = rev(levels(dayHour$wday)))

# Plot the day-of-week by hour heatmap
ggplot(dayHour, aes(hour, wday)) +
  geom_tile(aes(fill = N), colour = "white", na.rm = TRUE) +
  scale_fill_gradient(low = col1, high = col2) +
  guides(fill = guide_legend(title = "Total Incidents")) +
  theme_bw() + theme_minimal() +
  labs(title = "Heatmap of Seattle Incidents by Day of Week and Hour",
       x = "Hour of Day", y = "Day of Week") +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())

 

 

 

#################### Heatmap: Incidents per Year and Month ####################

# Create the summary table for the heatmap -- incident counts by year and month
yearMonth <- ddply(incidents, c("year", "month"), summarise,
                   N = length(ymd))

# Reverse the month factor so the months read top-to-bottom on the y-axis
yearMonth$month <- factor(yearMonth$month, levels = rev(levels(yearMonth$month)))

# Plot the year by month heatmap
ggplot(yearMonth, aes(year, month)) +
  geom_tile(aes(fill = N), colour = "white") +
  scale_fill_gradient(low = col1, high = col2) +
  guides(fill = guide_legend(title = "Total Incidents")) +
  labs(title = "Heatmap of Seattle Incidents by Year and Month",
       x = "Year", y = "Month") +
  theme_bw() + theme_minimal() +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())

 

 

#################### Heatmap: Incidents per Hour by Incident Group ####################

# Create the summary table for the heatmap -- incident counts by event clearance group and hour
groupSummary <- ddply(incidents, c("Event.Clearance.Group", "hour"), summarise,
                      N = length(ymd))

# Plot the event group by hour heatmap
ggplot(groupSummary, aes(hour, Event.Clearance.Group)) +
  geom_tile(aes(fill = N), colour = "white") +
  scale_fill_gradient(low = col1, high = col2) +
  guides(fill = guide_legend(title = "Total Incidents")) +
  labs(title = "Heatmap of Seattle Incidents by Event and Hour",
       x = "Hour", y = "Event") +
  theme_bw() + theme_minimal() +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())

 

Please see the original post, Time based heatmaps in R, for the full tutorial and steps.

ETL vs ELT: Considering the Advancement of Data Warehouses


ETL stands for Extract, Transform, Load. It has been the traditional way to manage analytics pipelines for decades. With the advent of modern cloud-based data warehouses such as BigQuery and Redshift, the traditional concept of ETL is shifting towards ELT, where transformations run right in the data warehouse. Let’s see why this is happening, what it means to choose ETL vs ELT, and what we can expect in the future.

ETL is hard and outdated

ETL arose to solve the problem of providing businesses with clean, ready-to-analyze data: we remove dirty and irrelevant data, then transform, enrich, and reshape the rest. One example is sessionization, the process of building sessions out of raw pageviews and user events.

ETL is complicated, especially the transformation part. It requires at least several months for a small-sized (less than 500 employees) company to get up and running. Once you have the initial transform jobs implemented, never-ending changes and updates will begin, because data always evolves with business.

The other problem with ETL is that during the transformation we reshape data into some specific form. This form usually loses some of the data’s resolution and omits data that seems useless at the time or for that particular task. Often, though, “useless” data later becomes “useful.” For example, if business users request daily data instead of weekly, you will have to fix your transformation process, reshape the data, and reload it, which can take a few more weeks.

The more data you have, the longer the transformation process is.

The transformation rules are very complex, and even if you have only a few terabytes of data, loading can take hours. Given the time it takes to transform and load a big dataset, your end users will almost never see fresh data: “today” really means today as of last week, and “yesterday” means a week ago yesterday. Sometimes it takes several weeks or months for a new update to make it into the rollups.

To summarize some of the cons of ETL:

  • ETL is expensive to implement, especially for small and medium businesses.
  • ETL is expensive to maintain.
  • ETL eliminates access to raw data.
  • ETL is time-consuming. Users have to wait for the transformation to finish.

Why have we been doing it for decades?

Prior generations of data warehouses weren’t able to handle the size and complexity of raw data, so we had to transform data before loading and querying it.

The latest advances in database technology have made warehouses much faster and cheaper. Storage keeps getting more affordable, and some data warehouses even price storage and computation separately. BigQuery’s storage pricing, for example, is cheap enough that you can simply keep all your raw data there.

These are the most important recent changes to data warehouses, the ones that make the ETL vs ELT comparison possible:

  • Optimized for analytical operations. Modern analytical warehouses tend to be columnar and optimized for aggregating and processing huge datasets.
  • Cheap storage. There is no need to agonize over what to store; you can just dump all your raw data into the warehouse.
  • Cloud based. It scales infinitely and on-demand, so you can get the performance you need the moment you need it.

ETL vs ELT: running transformations in a data warehouse

What exactly happens when we switch the “L” and the “T”? With new, fast data warehouses, some transformations can be done at query time. But there are still many cases where heavy calculations would take too long at query time. Instead, you can perform them in the warehouse, in the background, after the data has been loaded.

Once raw data is loaded into the warehouse, heavy transformations can be performed there. It makes sense to have both real-time and background transformations in the BI platform: users query data at the level of business definitions, while the BI layer either performs the transformation on the fly or reads data that has already been transformed in the background.
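To make the background “T” step concrete, here is a minimal R sketch under stated assumptions: an in-memory SQLite database (via DBI and RSQLite) stands in for the cloud warehouse, and the events table, its columns, and the daily_sessions rollup are hypothetical names used only for illustration.

library(DBI)

# RSQLite stands in for a real warehouse; in practice `con` would point at
# BigQuery, Redshift, etc. All table and column names below are hypothetical.
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# "EL": raw, untransformed events have already been loaded into the warehouse
dbWriteTable(con, "events", data.frame(
  user_id    = c(1, 1, 2),
  session_id = c("a", "a", "b"),
  event_date = c("2020-06-01", "2020-06-01", "2020-06-02")
))

# "T": the heavy transformation runs inside the warehouse as plain SQL,
# in the background, without touching the raw table
dbExecute(con, "
  CREATE TABLE daily_sessions AS
  SELECT user_id,
         event_date,
         COUNT(*)                   AS events,
         COUNT(DISTINCT session_id) AS sessions
  FROM events
  GROUP BY user_id, event_date
")

dbGetQuery(con, "SELECT * FROM daily_sessions")
dbDisconnect(con)

Because the transformation is just SQL running where the data lives, moving it later from a background job to a query-time view is a small change.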

This approach gives flexibility and agility for development of a transformation layer.

Software engineers nowadays deploy several times a day and embrace continuous delivery. The same principle should be adopted for how we approach transformations: if a metric definition changes or some new data is required, the change can be made in hours, not weeks or months. This is especially valuable for fast-growing startups, where changes happen daily and data teams have to stay flexible and agile to keep up with product development and business needs.

As data warehouses continue to advance, I’m sure query-time transformations will eventually replace background transformations entirely. Until that happens, we can run some transformations in the background with ELT. Since they are already SQL-based and run inside the data warehouse, the final switch should be easy and painless.




How is IoT changing our world?


The Internet of Things connects all kinds of intelligent devices, such as mobile devices, sensors, machines, and vehicles, with each other and with the cloud. Analyzing IoT data offers many opportunities for companies: they can make faster decisions, optimize business processes, develop new applications, and even restructure business models. This enormous potential of the Internet of Things is present in virtually every industry, including energy, retail, healthcare, financial services, transportation, and manufacturing.

The Internet of Things has changed the dimensions of traditional business IT. To tap its potential, companies need a highly scalable and reliable IT infrastructure. It should be built on standardized components and open protocols and span three layers: devices, controllers, and the data center or cloud.

This growth is a positive sign for every industry, but we cannot ignore the sheer size and public nature of the Internet of Things, which bring great challenges. Network and system architects need to optimize the IT infrastructure to meet IoT’s higher requirements for scalability, reliability, and security. IoT-based applications and automated business processes place high demands on system availability; many intelligent systems serve mission-critical applications where downtime leads to lost productivity.

Intelligent IT solutions, such as Red Hat technologies, are designed to meet the scalability, reliability, and security requirements of IoT-based systems. These solutions follow a hierarchical model with a device layer (edge nodes), a control layer (controller gateways), and a data center or cloud layer, with standardized protocols and components used throughout.

The device layer comprises a variety of intelligent endpoints, including mobile devices, wearable gadgets, sensors, control and regulation devices, autonomous machines, appliances, and so on. Communication between the devices and the control points uses standard network protocols, either wired or wireless, while the forwarding of raw data and the exchange of control information rely on open messaging standards.



Decision-driven before data-driven: A roadmap for data-driven organizations


A MANAGER MUST MAKE DECISIONS

Some thirty years back, when I started my career as a management trainee with an aerospace company, the 18-month training program included several courses in management.

I still remember the very first classroom session as if it were yesterday. A distinguished-looking retired professor from IIM Calcutta was introduced by the principal of the staff college, and as he addressed the class he declared, “I am a Jew! …hope none of you have a problem?” (This was a time when the Indian government very strongly identified with the Palestinian cause.)

Having assured himself that we had no problem whatsoever, he proceeded to the next question: “Who is a good manager?” A few of us were bold enough to attempt a response. The professor faithfully wrote down everything we said on the board: a wide variety of juvenile definitions, ranging from “someone who always gets work done” to “someone who gets work done more efficiently,” all to be expected from a set of wet-behind-the-ears freshers.

“How many of you have heard of Peter Drucker?” …

Fortunately, a good many of us did. (I came to know much later that Peter Drucker was also of Jewish descent. His ancestors were Jewish, but his parents had converted to Lutheranism)

“Well, glad you do… Peter Drucker defines a good manager as someone who takes three good decisions out of ten”.

Not surprisingly, we were all completely lost… It made no sense… Just 30%? …One fails to clear one’s exam at 30%…

The professor then related how he had asked the very same question when he met Peter Drucker at a seminar. Apparently, Drucker told him, “A manager must take decisions, even if only three out of ten turn out to be good decisions.”

As they say, that is one premise I have kept unchallenged ever since. A manager must take decisions. Must take them in time, and as many good ones as he can.

Over time, I joined a business school to do my MBA, but I guess Peter Drucker was passé by then; I don’t remember any professor specifically talking about the importance of taking the right decisions at the right time. I have since been through eight different companies in multiple roles, set up and scaled businesses, consulted for several companies across industry sectors for over two decades, and set up multiple ODCs and shared services for customers…

But I doubt I ever paused to think whether I was taking the right decisions at the right time. Ever?

DECISION-DRIVEN BEFORE DATA-DRIVEN

However, all that changed when I was trying to create a business case for setting up an analytics lab in my last assignment, a kind of internal centre of excellence for analytics. After a gruelling effort to get advice and help from different consulting firms and to search through all the published sources, we discovered there was little to nothing describing a process for creating an organization-wide roadmap for analytics. Given the near-complete absence of published sources (three notable exceptions, including one from Bain consulting, will be explained in the next article), we were forced to come up with a process of our own.

The most disturbing discovery was that an organization necessarily needs to be decision-driven before it can be data-driven.

We noticed that the decision-making process across organizations (not just ours) was informal and, more often than not, without an audit trail. No manager seems to list the decisions he takes, let alone record the reasons. The only notable exceptions relate to capital expenditure, where most organizations mandate a set process and an audit trail for decisions beyond a certain threshold value.

However, there are hundreds of repetitive operational decisions, usually made by everyday managers, that cumulatively account for far more value than all the capital-expenditure decisions taken in the company. For example, the quantum of buffer stock for each SKU may individually be a very small decision, but by annual cumulative value across SKUs it may be the most important decision influencing the profitability and performance of the company.
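As a toy illustration, the short R sketch below shows how thousands of individually small buffer-stock decisions add up; the catalogue size, buffer levels, unit costs, and carrying-cost rate are entirely hypothetical figures invented for this example.

# Hypothetical numbers for illustration only
set.seed(42)
n_sku         <- 5000                                  # SKUs in the catalogue (assumed)
buffer_units  <- sample(5:50, n_sku, replace = TRUE)   # buffer stock held per SKU (assumed)
unit_cost     <- runif(n_sku, min = 10, max = 200)     # cost per unit (assumed)
carrying_rate <- 0.25                                  # annual carrying-cost rate (assumed)

# Each SKU's buffer decision is small, but the annual cumulative value is large
annual_carrying_cost <- sum(buffer_units * unit_cost * carrying_rate)
annual_carrying_cost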

A decision-driven organization is expected to be process-driven, with documented GSOPs (Global Standard Operating Procedures) for each decision recognized as important, specifically for the 10% of decisions that influence 90% of business outcomes.

In our experience, we have yet to come across an organization that meticulously identifies its critical decisions and creates GSOPs for each of them.

Even more surprisingly, a good many of them have a budget for big data and analytics, and all of them aspire to be data-driven…
