Seven Common Misconceptions Businesses Have About Big Data and Artificial Intelligence

By Irfan Ak. Artificial intelligence and big data are two of the hottest and most discussed topics in tech circles. Despite this, there are many misconceptions surrounding both big data and artificial intelligence. There is a lot of hype around both of these topics as well, which can sometimes lead […]

The post Seven Common Misconceptions Businesses Have About Big Data and Artificial Intelligence appeared first on DATAVERSITY.


How to Transform into a Data-Driven Organization?

By Konain Qurban. Becoming data-driven is a journey: it requires aligning analytics initiatives with organizational objectives, combined with consistent and effective coordination of activities across all business units. The road from a pile of raw data to insights, and from insights to action, is paved with strategic goals. More […]

The post How to Transform into a Data-Driven Organization? appeared first on DATAVERSITY.


How has Covid-19 impacted investments in data?


A new IAB study states that COVID-19 may cause data spending cuts of about 20% in Q3 2020. Survey respondents expect the short-term impact of COVID-19 to require budgetary cuts and reallocation of investments. There is less consensus on the outlook for Q4 2020 and beyond. Here are a few key stats from the study:

  • 64.2% — decrease in Q2/Q3 2020 marketing spend
  • 49.1% — change messaging and content priorities
  • 26.4% — de-emphasize certain products or business lines
  • 22.6% — shift budget from brand initiatives to performance-oriented marketing
  • 20.8% — decrease Q4 2020 and beyond marketing spend
  • 17.0% — increase Q4 2020 and beyond marketing spend

The report focuses on U.S. advertiser and marketer spend on third-party audience data, as well as spend on the services, technologies, and hybrid activation solutions that support the use of data across consumer and B2B marketing. It outlines the major demand drivers, operational challenges, and other trends impacting investment in data.

Check out the article on MediaPost https://www.mediapost.com/publications/article/353961/iab-the-impact-covid-has-on-data-spending.html for more details.

Time based heatmaps in R


Tutorial Scenario

In this tutorial, we are going to be looking at heatmaps of Seattle 911 calls by various time periods and by type of incident.  This awesome dataset is available as part of the data.gov open data project.  

 

Steps

The code below walks through 6 main steps:

  1. Install and load packages
  2. Load data files
  3. Add time variables
  4. Create summary table
  5. Create heatmap
  6. Celebrate

 

 

Code

 

#################### Import and Install Packages ####################

# Install the required packages (only needed once)
install.packages("plyr")
install.packages("lubridate")
install.packages("ggplot2")
install.packages("dplyr")

library(plyr)
library(lubridate)
library(ggplot2)
library(dplyr)

#################### Set Variables and Import Data ####################

# Dataset: https://catalog.data.gov/dataset/seattle-police-department-911-incident-response-52779

incidents <- read.table("https://data.seattle.gov/api/views/3k2p-39jp/rows.csv?accessType=DOWNLOAD",
                        header = TRUE, sep = ",", fill = TRUE, stringsAsFactors = FALSE)

# Heatmap color range (low and high ends of the gradient)
col1 <- "#d8e1cf"
col2 <- "#438484"

# Inspect the data
head(incidents)
attach(incidents)
str(incidents)

 

#################### Transform ####################

# Convert dates using lubridate and derive month, year, weekday, and hour
incidents$ymd   <- mdy_hms(incidents$Event.Clearance.Date)
incidents$month <- month(incidents$ymd, label = TRUE)
incidents$year  <- year(incidents$ymd)
incidents$wday  <- wday(incidents$ymd, label = TRUE)
incidents$hour  <- hour(incidents$ymd)

attach(incidents)
head(incidents)

 

#################### Heatmap Incidents Per Hour ####################

# Create summary table for heatmap - day/hour specific
dayHour <- ddply(incidents, c("hour", "wday"), summarise,
                 N = length(ymd))

# Reverse the day ordering so Sunday appears at the top of the plot
dayHour$wday <- factor(dayHour$wday, levels = rev(levels(dayHour$wday)))

attach(dayHour)

# Overall summary heatmap: day of week vs. hour of day
ggplot(dayHour, aes(hour, wday)) +
  geom_tile(aes(fill = N), colour = "white", na.rm = TRUE) +
  scale_fill_gradient(low = col1, high = col2) +
  guides(fill = guide_legend(title = "Total Incidents")) +
  theme_bw() + theme_minimal() +
  labs(title = "Heatmap of Seattle Incidents by Day of Week and Hour",
       x = "Hour of Day", y = "Day of Week") +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())

 

 

 

#################### Heatmap Incidents Year and Month ####################

# Create summary table for heatmap - month/year specific
yearMonth <- ddply(incidents, c("year", "month"), summarise,
                   N = length(ymd))

# Reverse the month ordering so January appears at the top of the plot
yearMonth$month <- factor(yearMonth$month, levels = rev(levels(yearMonth$month)))

attach(yearMonth)

# Overall summary heatmap: year vs. month
ggplot(yearMonth, aes(year, month)) +
  geom_tile(aes(fill = N), colour = "white") +
  scale_fill_gradient(low = col1, high = col2) +
  guides(fill = guide_legend(title = "Total Incidents")) +
  labs(title = "Heatmap of Seattle Incidents by Year and Month",
       x = "Year", y = "Month") +
  theme_bw() + theme_minimal() +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())

 

 

#################### Heatmap Incidents Per Hour by Incident Group ####################

# Create summary table for heatmap - incident group specific
groupSummary <- ddply(incidents, c("Event.Clearance.Group", "hour"), summarise,
                      N = length(ymd))

# Overall summary heatmap: incident group vs. hour of day
ggplot(groupSummary, aes(hour, Event.Clearance.Group)) +
  geom_tile(aes(fill = N), colour = "white") +
  scale_fill_gradient(low = col1, high = col2) +
  guides(fill = guide_legend(title = "Total Incidents")) +
  labs(title = "Heatmap of Seattle Incidents by Event and Hour",
       x = "Hour", y = "Event") +
  theme_bw() + theme_minimal() +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())

 

Please see the original post, Time based heatmaps in R, for the full tutorial and steps.

ETL vs ELT: Considering the Advancement of Data Warehouses


ETL stands for Extract, Transform, Load, and it has been the traditional way to manage analytics pipelines for decades. With the advent of modern cloud-based data warehouses such as BigQuery and Redshift, the traditional concept of ETL is shifting towards ELT, where transformations run right in the data warehouse. Let's look at why this is happening, what ETL vs ELT means in practice, and what we can expect in the future.

ETL is hard and outdated

ETL arose to solve the problem of providing businesses with clean, ready-to-analyze data. We remove dirty and irrelevant data and transform, enrich, and reshape the rest. An example of this is sessionization: building sessions out of raw pageviews and user events.
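To make that concrete, here is a minimal sessionization sketch in R with dplyr (the language used elsewhere in this digest). The pageviews data frame, its columns, and the 30-minute inactivity threshold are illustrative assumptions, not details from the original article.

library(dplyr)

# Hypothetical raw pageview data: one row per user event.
pageviews <- data.frame(
  user_id    = c(1, 1, 1, 2, 2),
  event_time = as.POSIXct(c("2020-01-01 10:00:00", "2020-01-01 10:10:00",
                            "2020-01-01 11:30:00", "2020-01-01 09:00:00",
                            "2020-01-01 09:05:00"))
)

# Start a new session whenever a user has been inactive for more than 30 minutes.
sessions <- pageviews %>%
  arrange(user_id, event_time) %>%
  group_by(user_id) %>%
  mutate(
    gap_minutes = as.numeric(difftime(event_time, lag(event_time), units = "mins")),
    new_session = is.na(gap_minutes) | gap_minutes > 30,
    session_id  = cumsum(new_session)
  ) %>%
  ungroup()

head(sessions)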

ETL is complicated, especially the transformation part. It takes at least several months for a small company (fewer than 500 employees) to get up and running. And once the initial transform jobs are implemented, never-ending changes and updates begin, because data always evolves with the business.

The other problem with ETL is that during the transformation we reshape data into some specific form. This form usually loses some of the data's resolution and omits data that seems useless at the time or for that particular task. Often, "useless" data later becomes useful. For example, if business users request daily data instead of weekly, you will have to fix your transformation process, reshape the data, and reload it. That can take a few more weeks.
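As a small illustration of the granularity problem, the sketch below aggregates hypothetical raw events to weekly and daily rollups with dplyr and lubridate; the events data frame and its column names are made up for the example. Under classic ETL, only the chosen rollup survives, so switching from weekly to daily means rewriting the transform and reloading.

library(dplyr)
library(lubridate)

# Hypothetical raw events with a timestamp column (last 30 days, random times).
events <- data.frame(event_time = Sys.time() - runif(1000, 0, 60 * 60 * 24 * 30))

# Weekly rollup: the only granularity the original transform job kept.
weekly <- events %>%
  mutate(week = floor_date(event_time, "week")) %>%
  count(week)

# Daily rollup: producing this after the fact requires the raw events,
# which a classic ETL pipeline may have already discarded.
daily <- events %>%
  mutate(day = floor_date(event_time, "day")) %>%
  count(day)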

The more data you have, the longer the transformation process is.

Transformation rules can be very complex, and even with only a few terabytes of data, loading can take hours. Given the time it takes to transform and load a big dataset, end users almost never get fresh data: "today" means today as of last week, and "yesterday" means a week ago yesterday. Sometimes it takes several weeks or months for a new update to reach the rollups.

To summarize some of the cons of ETL:

  • ETL is expensive to implement, especially for small and medium businesses.
  • ETL is expensive to maintain.
  • ETL eliminates access to raw data.
  • ETL is time consuming: users have to wait for transformations to finish.

Why have we been doing it for decades?

Prior generations of data warehouses were not able to handle the size and complexity of raw data, so we had to transform data before loading and querying it.

The latest advances in database technologies have made warehouses much faster and cheaper. Storage is becoming more affordable all the time, and some data warehouses even price storage and computation separately. BigQuery's storage pricing, for example, is quite cheap, so you can simply keep all your raw data there.

The most important recent changes to data warehouses, which make the ETL vs ELT comparison possible, are:

  • Optimized for analytical operations. Modern analytical warehouses tend to be columnar and are optimized for aggregating and processing huge datasets.
  • Cheap storage. There is no need to worry about what to store; you can dump all your raw data into the warehouse.
  • Cloud based. They scale elastically and on demand, so you get the performance you need the moment you need it.

ETL vs ELT: running transformations in a data warehouse

What exactly happens when we switch the "L" and the "T"? With new, fast data warehouses, some of the transformation can be done at query time. But there are still plenty of cases where huge calculations would take quite a long time. So instead of doing those transformations at query time, you can perform them in the warehouse, in the background, after loading the data.

Once raw data is loaded into a warehouse, heavy transformations can be performed there. It makes sense to have both real-time and background transformations in the BI platform: users consume and operate at the level of business definitions when querying data, and the BI layer either performs the transformation on the fly or queries data that has already been transformed in the background.
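As a rough sketch of what such a background transformation can look like, the snippet below sends a CREATE TABLE ... AS SELECT statement to the warehouse from R via DBI. The connection details, table names, and columns are hypothetical, and the Postgres driver simply stands in for whatever warehouse driver you actually use; this is an illustration, not the article's prescribed setup.

library(DBI)

# Hypothetical warehouse connection; swap in the driver for your warehouse
# (for example bigrquery for BigQuery, or RPostgres for a Redshift-style warehouse).
con <- dbConnect(RPostgres::Postgres(), dbname = "warehouse")

# Background transformation: aggregate raw events into a daily rollup table
# inside the warehouse itself, after the raw data has already been loaded.
dbExecute(con, "
  CREATE TABLE IF NOT EXISTS daily_incident_counts AS
  SELECT
    DATE(event_time) AS event_date,
    event_group,
    COUNT(*)         AS n_incidents
  FROM raw_events
  GROUP BY DATE(event_time), event_group
")

dbDisconnect(con)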

This approach makes the development of the transformation layer flexible and agile.

Software engineers nowadays deploy several times a day and praise continuous delivery. The same principle should be adopted for how we approach transformations: if a metric definition changes or new data is required, the changes can be made in hours, not weeks or months. That is especially valuable for fast-growing startups, where changes happen daily and data teams have to be flexible and agile to keep up with product development and business needs.

As data warehouses advance more and more, I'm sure we will see query-time transformations entirely replace background transformations. Until that happens, we can run some transformations in the background with ELT. Since they are already SQL-based and run in the data warehouse, the final switch should be easy and painless.




How is IoT changing our world?


The Internet of Things connects all kinds of intelligent devices, such as mobile devices, sensors, machines, and vehicles, with each other and with the cloud. Analysis of IoT data offers many opportunities for companies: they can make faster decisions, optimize business processes, develop new applications, and even restructure business models. This enormous potential of the Internet of Things is present in virtually every industry, including energy, retail, healthcare, financial services, transportation, and manufacturing.

The Internet of Things has changed the dimensions of traditional business IT. To tap this potential, companies need a highly scalable and reliable IT infrastructure, built on standardized components and open protocols and comprising three layers: devices, controllers, and the data center or cloud.

This growth is a positive sign for the industry; however, we cannot ignore the sheer size and public nature of the Internet of Things, which bring great challenges. Network and system architects need to optimize IT infrastructure to meet the higher requirements of the IoT in terms of scalability, reliability, and security. IoT-based applications and automated business processes place high demands on system availability, and many intelligent systems are used for mission-critical applications where downtime leads to lost productivity.

Intelligent IT solutions, such as Red Hat technologies, are designed to meet the requirements of IoT-based systems for scalability, reliability, and security. These solutions follow a hierarchical model with a device layer (edge nodes), a control layer (controller gateways), and a data center tier or cloud layer, using standardized protocols and components throughout.

The device layer involves a variety of intelligent endpoints, including mobile devices, wearable gadgets, sensors, control and regulation devices, autonomous machines, appliances, and so on. Communication between the devices and the control points is based on standard network protocols, either wired or wireless. The forwarding of raw data and the exchange of control information rely on open messaging standards.
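As a loose illustration of that last point, here is a minimal sketch in R of an edge device forwarding a raw sensor reading to a gateway as JSON. The endpoint URL and payload fields are made up, and plain HTTP is used here only as a stand-in for whatever open messaging protocol (MQTT, AMQP, and so on) a real deployment would choose.

library(httr)
library(jsonlite)

# Hypothetical raw sensor reading captured at the device layer.
reading <- list(
  device_id   = "sensor-042",
  metric      = "temperature_c",
  value       = 21.7,
  recorded_at = format(Sys.time(), "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
)

# Forward the reading to a (hypothetical) controller/gateway endpoint as JSON.
response <- POST(
  url  = "https://gateway.example.com/ingest",
  body = toJSON(reading, auto_unbox = TRUE),
  content_type_json()
)

status_code(response)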

