The Changing Landscape of Device Tracking in AdTech
Google may have been slow to sunset cookies, but it now wants to join the effort to phase out the use of IP addresses in AdTech. In the fast-paced world of digital advertising, the tides are turning once again. This shift will have far-reaching consequences for the digital advertising industry.
Google just announced that it plans to roll out a feature that shields users’ IP addresses within its browser. Keep in mind that Apple has been doing this in its products for some time.
This shielding feature will limit IP tracking by third parties. IP addresses are numeric identifiers that can point to a particular device. This helps advertisers create a pseudo-unique identifier that is used to track users’ movement and activity.
What this means is that IP tracking by third parties, a key practice for advertisers to create unique identifiers and track user behavior, is about to become considerably more challenging.
This technology will initially be offered as an opt-in feature for users in the US, with plans for a phased global rollout. It will eventually become a core element of Chrome, Google’s immensely popular web browser. While Google spokesperson Scott Westover confirmed the company’s intent to experiment with this feature, details remain scarce at this stage.
If this proposal becomes a reality, it would signify a pivotal moment for AdTech. Two of the largest web browsers, Chrome and Apple’s Safari, will limit a signal that has been frequently exploited by AdTech companies to target advertisements, manage ad frequency, and detect ad fraud. Chrome boasts a 46% market share in the U.S., while Safari commands 43%, making their combined impact on the industry undeniable.
Crucially, IP protection will primarily block third-party access, allowing publishers to retain access to users’ IP addresses. This allows them to continue understanding their readers and viewers. However, programmatic partners may lose this vital access, creating a potential seismic shift in the AdTech landscape. This shift could be especially disruptive for AdTech companies that rely on IP addresses to create identifiers and target households.
The impending crackdown on IP address tracking dovetails with Google’s plan to eliminate third-party cookies in the second half of next year. Currently, IP addresses are considered less useful than third-party cookies. But in a world without cookies, many industry experts see IP addresses as the next best alternative. Paul Bannister, Chief Strategy Officer at Raptive, a publishing tech company, explains, “If cookies go away, more and more programmatic advertising companies will start relying on IP addresses more heavily.”
If both cookies and IP addresses are no longer viable options, the AdTech industry will face a colossal transformation. Andrew Casale, President and CEO of the SSP Index Exchange, underscores the magnitude of the task: “It’s a gargantuan task, and it’s going to amount to a significant amount of work.”
Moreover, the reduced availability of data may lead to reduced investments in the AdTech space. Addressability and signal strength are crucial for advertisers. As Andrew Casale explains, “When addressability diminishes, when signal diminishes, fewer advertisers bid on media, and when less advertisers bid, prices drop.”
However, it’s worth noting that potential price drops did not materialize when Safari began hiding users’ IP addresses. According to Bannister, this change had “no material impact on monetization.” So, there’s no need to sound the alarm just yet.
What’s clear is that Google’s and Apple’s strategic moves suggest that the IP address may not be the panacea that some thought it could be in a post-cookie world. In this ever-evolving landscape, it’s a cat-and-mouse game, with tech giants aiming to close as many mouse holes as possible before the mice get out.
The implications of this shift in device tracking are profound, with AdTech companies now facing a crucial decision: adapt to new methodologies, or risk falling behind in an industry that’s evolving at breakneck speed. As we navigate this transition, one thing remains certain: the AdTech space is in for some exciting, albeit challenging, times ahead.
Last Sunday at Trivadis Tech Event, I talked about R for Hackers. It was the first session slot on Sunday morning, it was a crazy, nerdy topic, and yet there were, like, 30 people attending! An emphatic thank you to everyone who came!
R, a crazy, nerdy topic? Why is that, you’ll be asking. What’s so nerdy about using R?
Well, it was about R. But it was neither an introduction (“how to get things done quickly with R”), nor was it even about data science. True, you do get things done super efficiently with R, and true, R is great for data science – but this time, it really was about R as a language!
Because as a language, too, R is cool. In contrast to most object-oriented languages, it (at least in its most widely used version, S3) uses generic-function OO, not message-passing OO (ok, I don’t know if this is cool, but it’s really instructive to see how little you need to implement an OO system!).
What definitely is cool though is how R is, quite a bit, a functional programming language! Even using base R, you can write in a functional style, and then there’s Hadley Wickham’s purrr that implements things like function composition or partial application.
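The three ideas mentioned above (generic-function dispatch, function composition, partial application) are not specific to R. The following is only a rough illustration in Python, not code from the talk; the names `describe`, `compose`, `scale` and `double_then_increment` are made up for this sketch.

```python
from functools import partial, reduce, singledispatch


# Generic-function OO: the function owns the dispatch, not the object.
# This loosely mirrors how an S3 generic picks a method from its argument.
@singledispatch
def describe(x):
    return f"something else: {x!r}"


@describe.register
def _(x: int):
    return f"an integer: {x}"


@describe.register
def _(x: list):
    return f"a list of {len(x)} items"


def compose(*funcs):
    """Compose left to right: compose(f, g)(x) == g(f(x))."""
    return lambda x: reduce(lambda acc, f: f(acc), funcs, x)


def scale(factor, x):
    return factor * x


# Partial application: fix the first argument of scale.
double = partial(scale, 2)

# Function composition: double first, then add one.
double_then_increment = compose(double, lambda x: x + 1)
```

Calling `describe(3)` and `describe([1, 2])` selects a different implementation each time, without either value carrying a method of its own.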
Finally, the talk goes into base object internals – closures, builtins, specials… and it ends with a promise … 😉 So, here’s the talk: rpubs, pdf, github. Enjoy!
Interview with Two Women Data Scientists
Genevera I. Allen is a professor in the Departments of Statistics and of Electrical and Computer Engineering at Rice University. Corinne Cath is a doctoral student at the Alan Turing Institute, the national institute for data science in the UK. Below are extracts of recent interviews that are most relevant to our audience. Links to full interviews are provided.
Genevera, what do you think of the shift from “Statistics” to “Statistical Learning and Data Science” in the statistics community (the “Data vs Math” question)?
I am a huge proponent of data science. I think it’s very critical to scientific discussions, and especially important that statisticians are involved. As a discipline (especially in academic circles) we place a preponderance of emphasis on mathematical / statistical rigor and theory; and this often comes at the expense of focusing on the intricacies of the data itself. Data should be paramount and the focus in everything that we do. Rich mathematics and theory are important in statistics, but it should be put in the context of supporting our understanding of data. Specifically, I think that the really important questions in science (that need to be addressed by data) are not addressed by methods that were developed from small, clean, toy data sets, that lend themselves to the assumptions of clean statistical theory. The data sets that are most important, the ones we need to be working with, are big, messy, and complex; they present so many initial statistical challenges that you don’t know where to start. Just because the dataset doesn’t lend itself to beautiful, clean math to describe the problem, doesn’t mean that the dataset should be ignored. Instead, statisticians should feel encouraged to tackle these complex problems, put data challenges first, and the nice mathematics can follow in its own time. Also, practical, effective, or heuristic methods should not be discarded just because we don’t yet understand all of their statistical properties. If a method is effective, there’s generally a very good mathematical and scientific reason that it is. In this respect, we could be a lot more like engineers and computer scientists because they go out and do things that are useful. I’m a very big proponent of “do something”. Then, we can go back and study the mathematics / theoretical statistical properties of the practical, effective methods.
Corinne, few young women take up technology subjects and careers; just 16% of the graduates in computer studies are women and the figure is 14% for engineering and technology*. Why do you think this is the case?
I think one of the main problems is – to stay within Internet parlance – that there are very strong institutionalized “memes” about which type of people are expected to excel in what kind of professions. Unfortunately, these memes are often based on flawed stereotypes and prejudices. This holds not just for gender, but also for class, race, and a host of other issues. In practice, it means that, for instance, girls are often not encouraged to the same extent as boys to pursue certain interests or careers at a young age. Over the years this adds up. Not just on a personal level, but on an institutional level as well.
The challenges and obstacles that women face to access scientific fields vary. Like with most sectors, sexism remains a serious obstacle. But we need to be careful about defining gender equality as the barriers that women face. In my opinion gender inequality can only be tackled when we understand gender as a broad term that includes a variety of gender identities and expressions, and when such gender-based inequality is addressed in tandem with the other forms of structural discrimination like racism, classism, or ableism.
In this article, I present a few modern techniques that have been used in various business contexts, comparing performance with traditional methods. The advanced techniques in question are math-free, innovative, efficiently process large amounts of unstructured data, and are robust and scalable. Implementations in Python, R, Julia and Perl are provided, but here we focus on an Excel version that does not even require any Excel macros, coding, plug-ins, or anything other than the most basic version of Excel. It is actually easily implemented in standard, basic SQL too, and we invite readers to work on an SQL version.
In short, we offer here an Excel template for machine learning and statistical computing, and it is quite powerful for an Excel spreadsheet. The techniques have been used by the author in automated data science frameworks (AI to automate content production, selection and scheduling for digital publishers) but also in the following contexts:
click, website, and keyword scoring (assigning a commercial value to a keyword, group of keywords, or content category)
credit card fraud detection
botnet detection
predicting blog popularity
The technique blends multiple algorithms that at first glance look traditional and math-heavy, such as decision trees, regression (logistic or linear) and confidence intervals. But they are radically different, can fit in a small spreadsheet (though the Python version is more powerful, flexible, and efficient), and do not involve math beyond high-school level. In particular, no matrix algebra is required to understand the methodology.
The methodology presented here is the result of 20 years’ worth of applied research on various large industrial data sets, where the author tried for years (eventually with success) to build a system that is simple and works. Most everyone else believed, or made people believe, that only complex systems work, and have spent their time complexifying algorithms rather than simplifying them (partly for job security purposes).
Node table (extract, from spreadsheet)
Who should use the spreadsheet?
First, the spreadsheet (as well as the Python, R, Perl or Julia version) is free to use and modify, even for commercial purposes, or to make a product out of it and sell it. It is part of my concept of open patent, in which I share all my intellectual property publicly and for free.
The spreadsheet is designed as a tutorial, though it processes the same data set as the one used for the Python version. It is aimed at people who are not professional coders, people who manage data scientists, BI experts, MBA professionals, and people from other fields, with an interest in understanding the mechanics of some state-of-the-art machine learning techniques, without having to spend months or years learning mathematics, programming, and computer science. A few hours are needed to understand the details. This spreadsheet can be the first step to help you transition to a new, more analytical career path, or to better understand the data scientists that you manage or interact with. Or to spark a career in data science. Or even to teach machine learning concepts to high school students.
The spreadsheet also features a traditional technique (linear regression) for comparison purposes.
2. Description of the techniques used
Here we explain the differences between the standard and the Excel versions, and we provide an overview, at a high level, of the techniques being used, as well as why they are better in pretty much all applications, especially with unstructured and large data sets. Detailed descriptions are available in the articles referenced in this section.
Spreadsheet versus Python version
The Python version (also available in R, Perl and Julia) of the core technique is described here. Python / Perl offer the following advantages over Excel:
It easily handles version 2.0 of HDT including overlapping nodes
It easily handles big datasets, even in a distributed environment if needed
It easily handles a large number of nodes
Of course, it is much faster for large data sets
The Excel version has the advantage of being interactive, and you can share it with people who are not data scientists.
But Excel (at least the template provided here) is mostly limited to nodes that form a partition of the feature space, that is, to non-overlapping nodes: see HDT version 1.0. So even if we have two nodes, one for the keyword “data” and one for the keyword “data science”, in version 1.0 they do not overlap: a text bucket contains either “data” but not “data science”, or “data science”. In version 2.0, we no longer have this restriction. Note that nodes can be a combination of any number of keyword values or any other variables (called features in machine learning), and these variables can be quantitative or not.
For those familiar with computer science, nodes, in both the Excel and the Python versions, are represented here as key-value pairs, typically stored as hash tables in Perl or Python, and as concatenated strings in Excel. For statisticians, nodes are just nodes of decision trees, though no tree structure is used (or built) in my methodology, which is why it is sometimes referred to as hidden decision trees (HDT). But you don’t need to understand this to use the methodology or understand how the spreadsheet works.
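To make the key-value representation concrete, here is a minimal sketch in Python. It is not the author's published code: the feature names, the three-keyword list, the `N-` key layout and the helper names (`title_features`, `node_key`, `build_nodes`) are assumptions chosen for illustration. It builds non-overlapping nodes in the spirit of version 1.0, where a title falls into the “data but not data science” bucket or the “data science” bucket, never both.

```python
from collections import defaultdict


def title_features(title):
    """Binary keyword features (hypothetical list).

    The two 'data' buckets are mutually exclusive, as in HDT 1.0.
    Plain substring matching is used, so 'database' also counts as 'data'.
    """
    t = title.lower()
    has_data_science = "data science" in t
    return {
        "data_science": int(has_data_science),
        "data_only": int("data" in t and not has_data_science),
        "python": int("python" in t),
    }


def node_key(features):
    """Concatenate feature values (in sorted name order) into a string key."""
    return "N-" + "".join(str(features[name]) for name in sorted(features))


def build_nodes(articles):
    """Hash table mapping node key -> list of responses (e.g. log page views)."""
    nodes = defaultdict(list)
    for title, response in articles:
        nodes[node_key(title_features(title))].append(response)
    return nodes


articles = [
    ("Intro to Data Science with Python", 7.1),
    ("Big data tips", 6.0),
    ("More data tricks", 6.4),
]
nodes = build_nodes(articles)
```

Here the string key plays the role of the concatenated string used in Excel, and the dictionary plays the role of the hash table used in Perl or Python.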
What is it about? What kind of algorithms are offered?
The methodology features a hybrid algorithm with essentially two components:
Data aggregation into bins, based on sound feature selection, binning continuous and even discrete features, and metric design, not unlike decision trees. However, no tree is actually built, and the nodes may belong to several overlapping small decision trees, each one corresponding to a case or cluster easy to interpret. This is particularly true in HDT 2.0. I will call this the pseudo decision tree algorithm.
A regression algorithm called Jackknife regression (see also here), with far fewer and more meaningful parameters than classical regression models, to avoid over-fitting and to cope with cross-correlated features, while at the same time offering a simple interpretation. In the application discussed in the spreadsheet, one could argue that the Jackknife regression used is closer to logistic than linear regression, as data is transformed using a log mapping, and we are trying to predict, for an article, the odds of being popular.
Data points belonging to a small node (say n < 10 observations) have the estimated / predicted response computed using the Jackknife regression (algorithm #2 above); the remaining points get scored using the pseudo decision tree algorithm (algorithm #1 above).
A lot of intelligence and creativity is put into creating great predictors (the features) and then performing sound feature selection. However, the features used in the spreadsheet and in the previous article on HDT (dealing with the same data set) will apply to all NLP (natural language processing) systems in numerous contexts.
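The routing rule between the two components can be sketched as follows. This is an illustration under assumptions, not the author's implementation: the node score is taken to be the average response in the node, and `fallback_model` is a hypothetical stand-in for the Jackknife regression, which is not implemented here.

```python
from statistics import mean

MIN_NODE_SIZE = 10  # threshold suggested in the text ("say n < 10 observations")


def score(key, nodes, features, fallback_model):
    """Predict the response for one data point.

    key:            node key of the data point
    nodes:          dict mapping node key -> list of observed responses
    features:       feature values of the data point
    fallback_model: callable(features) -> prediction; placeholder for the
                    Jackknife regression (assumption, not implemented here)
    """
    responses = nodes.get(key, [])
    if len(responses) < MIN_NODE_SIZE:
        # Small (or unseen) node: too few observations to trust the node.
        return fallback_model(features)
    # Large node: use the node itself (assumed here: average response).
    return mean(responses)
```

A data point whose node was never seen in the control data is treated like a small node, which is a design choice of this sketch rather than something stated in the text.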
In addition, while not incorporated in the spreadsheet, confidence intervals can be computed for each node with at least n observations (say n = 10) using percentiles for the response, computed for all data points (in this case, representing articles) in the node in question, see example at the bottom of section 3. This percentile function is even available in Excel. Then, data points in a node with too large a confidence interval are scored using the Jackknife regression rather than the pseudo decision tree. By scoring, I mean having the response estimated or predicted. By response, I mean the variable that we are trying to predict: in this case the page views number attached to an article (indeed, its logarithm, to smooth out big spikes due to external factors, or the fact that older articles have by definition more page views — see page view decay prediction for details.)
So no statistical theory is used anywhere in the methodology, not even to compute confidence intervals.
Why a brand new set of machine learning tools?
The HDT methodology offers the following advantages:
The loss of accuracy, compared with standard procedures, is so small in the control data set that it is negligible and much smaller than the inherent noise present in the data. This has been illustrated before on a different data set, and is confirmed again here (see next section).
The accuracy is much higher in the test data set, in a cross-validation framework where HDT is performed on a control data set, and performance measured on a different data set called test data set. So the methodology simply works better in the real world. This will be illustrated in an upcoming article, though the reason is easy to understand: HDT was designed as a robust method, to avoid over-fitting and issues caused by outliers, as well as to withstand model failures, messy data, and violations of assumptions.
In addition, HDT offers the following benefits:
Easy interpretation of the results
Simplicity, scalability, easy to implement in a distributed environment, and tested on unstructured big data
No need to know statistical or mathematical theory to understand its inner workings
Great to use as a machine learning tutorial for people who do not code, or who are interested in learning more about machine learning and come from a different field (software engineering, management consulting, bioinformatics, econometrics, journalism, and so on).
Could be used in STEM programs in high schools, to give kids the chance to work on real machine learning problems using modern techniques.
Few parameters to deal with: this is essentially a non-parametric, data-driven (as opposed to model-driven) technique.
Since most companies use standard tools and software, using HDT can give you a competitive advantage (if you are allowed to choose your own method), and the learning curve is minimal.
Another way to highlight the benefits is to compare with Naive Bayes. Naive Bayes assumes that the features are independent. It is the workhorse of spam detection, and we all know how badly it performs. For instance, a message containing the keyword “breast cancer” is flagged because it contains the keyword “breast”, and Naive Bayes erroneously assumes that “breast” and “cancer” are independent. Not true with HDT.
Classical decision trees, especially the large ones with millions of nodes from just one single decision tree and involving more than 5 or 6 features at each final node, suffer from similar issues: over-fitting, artificial feature selection resulting in difficulties interpreting the results, maintenance challenges, over-parameterization making it more difficult to fine-tune, and most importantly, lack of robustness.
3. The Spreadsheet
The data set and features used in this analysis are described here. The spreadsheet only uses a subset of the original features, as it is provided mostly as a template and for tutorial purposes. Yet even with this restricted set of features, it reveals interesting insights about some keywords (Python, R, data, data science) associated with popularity (Python being more popular than R), and some keywords that, surprisingly, are not (keywords containing “analy”, such as analytic). Besides keywords found in the title, other features are used, such as time of publication, and have also been binarized to increase stability and avoid an explosion in the number of nodes. Note that HDT 2.0 can easily handle a large number of nodes, and even HDT 1.0 (used in the spreadsheet) easily handles non-binary features.
There are 2,616 observations (articles) and 74 nodes. By grouping all nodes with fewer than 10 observations into one node, we get down to 24 nodes. Interestingly, these small nodes perform much better than the average node. The correlations between the features and the response are very low, mostly because the keyword-like features trigger very few observations: very few articles contain the keyword R in the title (less than 3%). As a result, the correlation between the response and predicted response is not high, around 0.33 regardless of the model. The solution is of course to add many more keywords to cover a much larger proportion of articles.
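The grouping of small nodes described above can be sketched in a few lines of Python. This is a hedged illustration rather than the spreadsheet's logic: the function name `merge_small_nodes` and the key `N-small` are hypothetical, and nodes are assumed to be stored as a dictionary mapping a node key to its list of responses.

```python
def merge_small_nodes(nodes, min_size=10, merged_key="N-small"):
    """Group every node with fewer than min_size observations into one node.

    nodes: dict mapping node key -> list of responses.
    Returns a new dict; the input is left unchanged.
    """
    merged = {}
    small = []
    for key, responses in nodes.items():
        if len(responses) < min_size:
            small.extend(responses)  # pool the observations of small nodes
        else:
            merged[key] = list(responses)
    if small:
        merged[merged_key] = small
    return merged
```

Applied to the spreadsheet's data, a rule of this kind is what takes the node count from 74 down to 24, as reported in the text.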
In the Python version, keyword detection / selection (to create features) is part of the process, and included in the source code. Here, the keywords used as features are assumed to be pre-selected.
Page view index (see spreadsheet) is a much better performance indicator than R-squared or correlation with response, to measure the predictive power of a feature. This is clearly the case with the feature “Python”.
The Excel version is slightly different from the Python version, from a methodological point of view, as described in section 2.
The goodness-of-fit values for Jackknife and linear regression are very close, despite the fact that Jackknife is a very rough (but robust) approximation of linear regression.
Jackknife has been used in its most elementary version, with only one M. When the cross-correlation structure is more complex, I recommend using Jackknife with two M’s as described here.
Some of the features are correlated, for instance “being a blog” with “being a forum question”, or “containing data but not data science” with “containing data science”.
When combining Jackknife with the pseudo-decision trees (applying Jackknife to small nodes) we get a result that is better than Jackknife, pseudo-decision trees, or linear regression taken separately.
For much larger data sets that include all sorts (categories) of articles (not just about data science), I recommend creating and adding a feature called category. Such a feature can be built using an indexation algorithm.
Side Note: Confidence intervals for response (example)
Node N-100-000000 in the spreadsheet has an average pv of 5.85 (pv is the response), and consists of the following pv values: 5.10, 6.80, 5.56, 5.66, 6.19, 6.01, 5.56, 5.10, 6.80, 5.69. The 10th and 90th percentiles for pv are respectively 5.10 and 6.80, so [5.10, 6.80] is our confidence interval (CI) for this node. This computation of CI is similar to the methodology discussed here. This particular CI is well below the average pv: even the upper bound, 6.80, is below the overall average pv of 6.83. In fact this node corresponds to articles posted after 2014, not a blog or forum question (it could be a video or event announcement), and with a title containing none of the keywords from the keyword feature list. The business question is: should we continue to accept and promote such poorly performing content? The answer is yes, but not as much as we used to. Competitors are also dropping this kind of content for the same reasons, so, ironically, this is an opportunity to build a monopoly. Also, variety is critical, and only promoting blogs that work well today is a recipe for long-term failure, though it works well in the short term.
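The percentile computation for this node can be reproduced with a short Python sketch. The interpolation rule is an assumption: the helper below uses linear interpolation between order statistics, the same convention as Excel's PERCENTILE function; the helper name `percentile` is made up for this example.

```python
def percentile(values, p):
    """p-th percentile (0 <= p <= 1) with linear interpolation
    between order statistics."""
    xs = sorted(values)
    rank = p * (len(xs) - 1)
    lo = int(rank)
    hi = min(lo + 1, len(xs) - 1)
    frac = rank - lo
    return xs[lo] + frac * (xs[hi] - xs[lo])


# pv values of node N-100-000000, as listed in the text
pv = [5.10, 6.80, 5.56, 5.66, 6.19, 6.01, 5.56, 5.10, 6.80, 5.69]

node_mean = sum(pv) / len(pv)                       # about 5.85
ci = (percentile(pv, 0.10), percentile(pv, 0.90))   # (5.10, 6.80)
ci_width = ci[1] - ci[0]
```

The width of the interval is the quantity one would compare against a cut-off to decide whether a node is too noisy to be scored by the pseudo decision tree; the cut-off itself is not specified in the text.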