How to Transform into a Data-Driven Organization?

By Konain Qurban. It is a journey to ensure the alignment of analytics initiatives to organizational objectives, combined with consistent and effective coordination of activities across all business units. The road from a pile of raw data to insights and from insights to action is paved with strategic goals. More […]

The post How to Transform into a Data-Driven Organization? appeared first on DATAVERSITY.


K-Anonymization: An Introduction for First Graders

By John Murray. Now that privacy-enhancing technologies (PETs) have become a subject of dinner table conversations, our research team continues to field questions on these complex topics, which can be difficult to explain. As part of this series, I will attempt to explain another PET, K-anonymization, in a first-grade context: […]

The post K-Anonymization: An Introduction for First Graders appeared first on DATAVERSITY.


It’s Time to Admit It: Backup is Your Job

By W. Curtis Preston. Salesforce finally made it official: Backup is the customer’s responsibility. To be clear, it always was the customer’s responsibility, but Salesforce used to provide a safety net that some used as an excuse not to back up their Salesforce account. As of July 31, 2020, that […]

The post It’s Time to Admit It: Backup is Your Job appeared first on DATAVERSITY.


Can Our Future Networks and Connected Society Survive Without Automation?

By Ilan Sade. Are we at a watershed moment for network automation? COVID-19 has pushed a significant number of consumers and businesses into a digital-first, connectivity-centric ecosystem beyond anything we’ve seen before. The pandemic has made it clear we can’t afford gaps and broken areas within our network infrastructure that […]

The post Can Our Future Networks and Connected Society Survive Without Automation? appeared first on DATAVERSITY.


Time based heatmaps in R


Tutorial Scenario

In this tutorial, we are going to look at heatmaps of Seattle 911 calls broken down by various time periods and by type of incident. This awesome dataset is available as part of the data.gov open data project.

 

Steps

The code below walks through 6 main steps:

  1. Install and load packages
  2. Load data files
  3. Add time variables
  4. Create summary table
  5. Create heatmap
  6. Celebrate

 

 

Code

 

#################### Install and Load Packages ####################

# Install once, then load for each session
install.packages("plyr")
install.packages("lubridate")
install.packages("ggplot2")
install.packages("dplyr")

library(plyr)
library(lubridate)
library(ggplot2)
library(dplyr)

#################### Set Variables and Import Data ####################

# Dataset: https://catalog.data.gov/dataset/seattle-police-department-911-incident-response-52779
incidents <- read.table("https://data.seattle.gov/api/views/3k2p-39jp/rows.csv?accessType=DOWNLOAD",
                        header = TRUE, sep = ",", fill = TRUE, stringsAsFactors = FALSE)

# Low and high colours for the heatmap gradient
col1 <- "#d8e1cf"
col2 <- "#438484"

head(incidents)
str(incidents)

 

#################### Transform ####################

# Parse the event clearance timestamp with lubridate, then derive the time parts
incidents$ymd   <- mdy_hms(incidents$Event.Clearance.Date)
incidents$month <- month(incidents$ymd, label = TRUE)
incidents$year  <- year(incidents$ymd)
incidents$wday  <- wday(incidents$ymd, label = TRUE)
incidents$hour  <- hour(incidents$ymd)

head(incidents)

 

#################### Heatmap Incidents Per Hour ####################

# Create summary table for the heatmap: incident counts per day of week and hour
dayHour <- ddply(incidents, c("hour", "wday"), plyr::summarise,
                 N = length(ymd))

# Reverse the level order so the y-axis reads from top to bottom
dayHour$wday <- factor(dayHour$wday, levels = rev(levels(dayHour$wday)))

ggplot(dayHour, aes(hour, wday)) +
  geom_tile(aes(fill = N), colour = "white", na.rm = TRUE) +
  scale_fill_gradient(low = col1, high = col2) +
  guides(fill = guide_legend(title = "Total Incidents")) +
  theme_bw() + theme_minimal() +
  labs(title = "Heatmap of Seattle Incidents by Day of Week and Hour",
       x = "Hour of Day", y = "Day of Week") +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())

 

 

 

#################### Heatmap Incidents Year and Month ####################

# Create summary table for the heatmap: incident counts per year and month
yearMonth <- ddply(incidents, c("year", "month"), plyr::summarise,
                   N = length(ymd))

# Reverse the month order so the y-axis reads from top to bottom
yearMonth$month <- factor(yearMonth$month, levels = rev(levels(yearMonth$month)))

ggplot(yearMonth, aes(year, month)) +
  geom_tile(aes(fill = N), colour = "white") +
  scale_fill_gradient(low = col1, high = col2) +
  guides(fill = guide_legend(title = "Total Incidents")) +
  labs(title = "Heatmap of Seattle Incidents by Year and Month",
       x = "Year", y = "Month") +
  theme_bw() + theme_minimal() +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())

 

 

#################### Heatmap Incidents Per Hour by Incident Group ####################

# Create summary table for the heatmap: incident counts per event group and hour
groupSummary <- ddply(incidents, c("Event.Clearance.Group", "hour"), plyr::summarise,
                      N = length(ymd))

ggplot(groupSummary, aes(hour, Event.Clearance.Group)) +
  geom_tile(aes(fill = N), colour = "white") +
  scale_fill_gradient(low = col1, high = col2) +
  guides(fill = guide_legend(title = "Total Incidents")) +
  labs(title = "Heatmap of Seattle Incidents by Event and Hour",
       x = "Hour", y = "Event") +
  theme_bw() + theme_minimal() +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())

 

Please see the full tutorial and steps here: Time based heatmaps in R

ETL vs ELT: Considering the Advancement of Data Warehouses


ETL stands for Extract, Transform, Load. It has been the traditional way to manage analytics pipelines for decades. With the advent of modern cloud-based data warehouses, such as BigQuery or Redshift, the traditional concept of ETL is shifting towards ELT, where transformations run directly in the data warehouse. Let’s see why this is happening, what the ETL vs. ELT choice means, and what we can expect in the future.

ETL is hard and outdated

ETL arose to solve the problem of providing businesses with clean, ready-to-analyze data: we remove dirty and irrelevant data and transform, enrich, and reshape the rest. One example is sessionization, the process of grouping raw pageviews and user events into sessions.
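To make that concrete, here is a minimal sketch of sessionization in R with dplyr and lubridate. The events data frame, its column names, and the 30-minute inactivity threshold are assumptions for the example, not something the article specifies.

library(dplyr)
library(lubridate)

# Hypothetical raw event data: one row per pageview
events <- data.frame(
  user_id   = c("u1", "u1", "u1", "u2", "u2"),
  timestamp = ymd_hms(c("2020-08-01 10:00:00", "2020-08-01 10:10:00",
                        "2020-08-01 11:30:00", "2020-08-01 09:00:00",
                        "2020-08-01 09:05:00")),
  stringsAsFactors = FALSE
)

# A new session starts after 30 minutes of inactivity (assumed threshold)
sessionized <- events %>%
  arrange(user_id, timestamp) %>%
  group_by(user_id) %>%
  mutate(
    gap_minutes = as.numeric(difftime(timestamp, lag(timestamp), units = "mins")),
    new_session = is.na(gap_minutes) | gap_minutes > 30,
    session_id  = paste(user_id, cumsum(new_session), sep = "_")
  ) %>%
  ungroup()

In a classic ETL setup this kind of logic runs in a separate pipeline before loading; under ELT the same grouping could be expressed as SQL inside the warehouse itself.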

ETL is complicated, especially the transformation part. It can take a small company (fewer than 500 employees) several months just to get up and running. Once the initial transform jobs are implemented, a never-ending stream of changes and updates begins, because data always evolves with the business.

The other problem with ETL is that during transformation we reshape data into one specific form. That form usually loses some of the data’s resolution and leaves out whatever looks useless at the time or for that particular task. Often, though, “useless” data later becomes “useful.” For example, if business users start asking for daily data instead of weekly, you have to fix the transformation process, reshape the data, and reload it, which can take a few more weeks.

The more data you have, the longer the transformation process is.

Transformation rules are often complex, and even with only a few terabytes of data, transforming and loading can take hours. Given how long it takes to process a big dataset, end users almost never see fresh data: “today” actually means today as of last week’s load, and “yesterday” a week ago yesterday. Sometimes it takes several weeks or months for a new change to reach the rollup tables.

To summarize some of the cons of ETL:

  • ETL is expensive to implement, especially for small and medium businesses.
  • ETL is expensive to maintain.
  • ETL eliminates access to raw data.
  • ETL is time-consuming: users have to wait for transformations to finish.

Why have we been doing it for decades?

The prior generation of data warehouses couldn’t handle the size and complexity of raw data, so we had to transform data before loading and querying it.

The latest advances in database technology have made warehouses much faster and cheaper. Storage keeps getting more affordable, and some data warehouses even price storage and computation separately. BigQuery’s storage pricing, for example, is cheap enough that you can simply keep all your raw data there.

The most important changes to data warehouses in recent years, which make ELT a viable alternative to ETL:

  • Optimized for analytical operations. Modern analytical warehouses tend to be columnar and optimized for aggregating and processing huge datasets.
  • Cheap storage. There is no need to agonize over what to keep; you can just dump all your raw data into the warehouse.
  • Cloud-based. They scale elastically and on demand, so you can get the performance you need the moment you need it.

ETL vs ELT: running transformations in a data warehouse

What exactly happens when we switch the “L” and the “T”? With new, fast data warehouses, some of the transformation can be done at query time. But there are still plenty of cases where heavy calculations would take too long at query time, so instead you perform them in the warehouse in the background, after the data is loaded.

Once the raw data is loaded into the warehouse, heavy transformations can be performed there. It makes sense to have both real-time and background transformations in the BI platform: users query data at the level of business definitions, and the BI layer either performs the transformation on the fly or reads data that has already been transformed in the background.
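As a hedged illustration of that background step, the sketch below uses R’s DBI package to run a transformation inside the warehouse after the raw data has been loaded. The project, dataset, table names, and SQL are placeholders for the example rather than anything from the article.

library(DBI)

# Assumed connection to a BigQuery dataset; project and dataset names are placeholders
con <- dbConnect(bigrquery::bigquery(), project = "my-project", dataset = "analytics")

# The raw events are already loaded (the "L" in ELT); the transformation ("T")
# now runs inside the warehouse as a background job.
dbExecute(con, "
  CREATE OR REPLACE TABLE analytics.daily_incident_counts AS
  SELECT DATE(event_time) AS day,
         event_group,
         COUNT(*) AS incidents
  FROM analytics.raw_events
  GROUP BY day, event_group
")

dbDisconnect(con)

Because the aggregation runs where the data already lives, re-running it after a definition change is a matter of editing the SQL and re-executing, rather than rebuilding an external pipeline.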

This approach brings flexibility and agility to the development of the transformation layer.

Software engineers nowadays deploy several times a day and embrace continuous delivery. The same principle should be adopted for how we approach transformations. If a metric definition changes or new data is required, the change can be made in hours, not weeks or months. This is especially valuable for fast-growing startups, where changes happen daily and data teams have to stay flexible and agile to keep up with product development and business needs.

As data warehouses continue to advance, I’m sure query-time transformations will eventually replace background transformations entirely. Until that happens, we can run some transformations in the background with ELT. Since they are already SQL-based and run in the data warehouse, the final switch will be easy and painless.


