Choosing a high school is one of the first big decisions in life. With over 400 public high schools in New York City, students and their families can be overwhelmed by the long list of schools, each of which promises a secondary education with curricula ranging from biology to engineering to musical theater. The options seem endless. Using the public high school data sets released by the NYC Department of Education, I made this Shiny app to visualize the information they contain.
There are 5 data sets available from the NYC OpenData portal: the NYC school district map, the 2016 Department of Education (DOE) high school directory, SAT results from 2010 and 2012, and DOE high school survey results from 2011. Even though the school survey and the SAT are conducted every year, summaries of more recent SAT and survey results are not yet available on the website.
The app’s user interface contains 3 tabs: interactive map, school info and school district comparison. The map tab (figure below) shows the map of New York City, overlaid with the school district boundaries and public high school locations. The markers representing schools are color-coded in 3 colors: blue for schools that give priority to students from their own district, green for schools open to all New York City students, and red for special high schools, usually international high schools for new residents of NYC. The panel on the right displays 3 histograms: the top one is the distribution of average SAT scores for NYC high schools in 2010 and 2012, the middle one shows the distribution of 2011 high school survey ratings, and the bottom one presents the number of students enrolled in NYC high schools in 2016.
When a school on the map is selected via mouse click, a pop-up displaying the name and address of the selected school appears at the school’s location on the map. The school’s name, address and contact info, along with its SAT scores, survey rating and student count, are also shown on the information panel on the right. The histograms on the panel are updated so that the school’s standing among NYC public schools can be seen on the graphs (figure below).
The school info tab (figure below) provides detailed information on individual schools. The drop-down menus at the top, along with the search box, can be used to filter the list of schools by borough and neighborhood. Once a school is selected in the data table, details of the school, such as contact information, nearby transit and school programs, are displayed below.
The School District Comparison tab (figure below) allows users to compare school districts. The drop-down menu contains NYC’s 32 school districts, and selecting one filters the school list. The NYC DOE has designated 9 public high schools as “Gifted and Talented” to meet the needs of gifted students; these 9 schools are open to all NYC students while being the most selective public high schools. They are added to the selection list as a “Gifted and Talented” group so they can be compared with district schools. There are 3 measurements available for comparing the selected school districts: 2012 SAT score, 2010 SAT score and the 2011 school survey. If SAT scores are picked, 4 box plots compare the cumulative, reading, math and writing scores of the schools in the 2 chosen districts; if the school survey is picked, the 4 box plots compare the districts on 4 aspects of the survey: safety and respect, communication, engagement and academic expectation.
Making the app left me with a question: the data contain 2 ways to measure a school’s academic performance, the SAT score and the academic expectation aspect of the school survey, so which one should you trust more? Why do some schools with average SAT scores receive very good academic ratings in the survey? Let’s look at the plot below. In this plot of the 2011 survey academic expectation rating versus the 2012 SAT score, a linear trend line is added along with 2 parameters that evaluate the fit: the very small p-value of the coefficient indicates that there is a positive relationship between survey rating and SAT score, while the small adjusted R-squared value indicates that the linear fit does not capture the full relationship between them. The SAT score is an objective evaluation of schools, while the school survey is a subjective review reflecting how well a school’s academic performance matches the reviewer’s expectations.
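As an aside, the kind of fit diagnostics described above can be sketched in a few lines. The snippet below uses entirely synthetic numbers in Python (the app itself was built in R, and the sample size, slope and noise level here are invented to reproduce the pattern, not taken from the NYC data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical stand-ins for average 2012 SAT score and the survey's
# academic-expectation rating (synthetic values, not the real data).
sat = rng.uniform(1000, 2100, size=300)
rating = 5.0 + 0.002 * sat + rng.normal(0, 1.0, size=300)  # weak positive link

fit = stats.linregress(sat, rating)
print(f"slope p-value: {fit.pvalue:.1e}")       # very small: positive relationship
print(f"R-squared:     {fit.rvalue ** 2:.2f}")  # modest: far from the full story
```

The combination of a tiny p-value with a small R-squared is exactly the situation in the plot: a real but weak linear relationship.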
Visualizing the Game Style and Shooting Performance among Superstars via NBA Shot-log
Contributed by Xinyuan Wu.
In the NBA, a top player takes around a thousand shots during the entire regular season. A question worth asking is: what information can we get by looking at these shots? As a basketball fan of more than 10 years, I am particularly interested in discovering facts that cannot be directly seen on live TV. When I was surfing the web last week, I found a data set called the NBA shot-log on Kaggle. It records every shot taken by each player during the games of the 14/15 regular season, along with a variety of features. I decided to perform an exploratory visualization with this data. Now let’s dive into the shot-log and see what interesting information we can discover about the game styles and shooting performance of NBA players. I focused this analysis on Stephen Curry, James Harden, LeBron James and Russell Westbrook, who ranked 1-4 in the MVP balloting for the 2014-2015 season and are undoubtedly superstars in the league.
Data Acquisition and Processing
Data cleaning, feature creation and graph generation were performed in R. The package used for the graphs is ggplot2. The R code for data cleaning and feature creation can be found here.
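Since the cleaning code itself lives behind the link, here is a minimal sketch of the kind of feature creation involved, written in Python/pandas rather than the R actually used; the column names, rows and zone boundaries are assumptions for illustration, not the shot-log’s real schema:

```python
import pandas as pd

# A tiny stand-in for the shot-log (hypothetical columns and rows).
shots = pd.DataFrame({
    "player_name": ["stephen curry", "stephen curry", "russell westbrook"],
    "SHOT_DIST":   [24.7, 3.1, 16.8],           # feet
    "SHOT_RESULT": ["made", "missed", "made"],
    "TOUCH_TIME":  [0.9, 2.4, 5.6],             # seconds
})

# Feature creation: a numeric made/missed flag and a coarse shot-zone label.
shots["made"] = (shots["SHOT_RESULT"] == "made").astype(int)
shots["zone"] = pd.cut(shots["SHOT_DIST"], bins=[0, 8, 22, 40],
                       labels=["paint", "mid-range", "three"])
print(shots[["player_name", "zone", "made"]])
```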
Figure 1. Shot density plot with respect to shot distance. The graph above shows the distribution of shot attempts by each player versus shot distance. All four players have local maxima centered at around 5 feet and 25 feet, corresponding to the lay-up region and the three-point region. Curry’s shot density leans towards the three-point zone, while James took more shots in the paint, indicating the different play styles of the two players. It can also be seen that Westbrook uses the two-point jumper frequently, as suggested by the peak at around 17 feet.
Figure 2. Violin plot that summarizes shot accuracy for each player.
The above violin plot summarizes the shot accuracy for each player throughout the season. Based on visual inspection of this plot, Curry and James have relatively stable shot accuracy compared to Harden and Westbrook (whose wider shapes suggest more variability).
Figure 3. Boxplot that describes the shot accuracy with respect to match result.
After seeing the summaries of shot attempts and shot accuracy, let’s explore how these values behave when other factors are taken into account. First, let’s split shot accuracy according to the match result. From the plot, Curry, James and Westbrook display a large accuracy gap between won games and lost games. In contrast, Harden shows a relatively small gap.
Figure 4. The shot number and shot accuracy with respect to date.
Next, let’s look at how shot number and accuracy change over the season. Westbrook tended to take more shots at the end of the season, when the Oklahoma City Thunder were fighting for the last playoff spot. From the graph on the right, Curry and James have relatively stable shot accuracy throughout the timeline, while the accuracy of Harden and Westbrook shows greater variance.
Figure 5. Number of shots with respect to touch time.
Now let’s see the number of shots plotted against touch time. Curry took more shots with a very short touch time, indicating his catch-and-release shooting style. In contrast, Westbrook tends to hold the ball for a few seconds before taking his shot.
Figure 6. Shot accuracy with respect to shot distance.
An interesting phenomenon appears when plotting shot accuracy against shot distance. As shown above, shot accuracy decreases from the lay-up region out to around 10 feet. For Curry, James and Westbrook, although their accuracy values differ, all three have a local maximum at around 14 feet. Let’s call this region the comfortable zone. Harden’s accuracy peak, on the other hand, extends out past the three-point line, which sets him apart from the others. Beyond the comfortable zone, accuracy decreases monotonically for all players.
Figure 7. Density plot with respect to shot distance and closest defender distance.
When defender distance is combined with figure 1, we get a contour plot that gives a general feel for each player’s play style. From the plot on the left, it can be seen that in the lay-up region, the contour plot for Westbrook lies below the one for Curry, meaning that Westbrook tends to take more tough lay-ups than Curry. To my surprise, Westbrook is even more aggressive at the rim than LeBron James.

Figure 8. Shot number and shot accuracy with respect to opponent and player. From the heat map above, we can view the number of shots and the shot accuracy against each opponent. For example, Westbrook took more shots when playing against the New Orleans Pelicans and the Portland Trail Blazers, and Harden had poor accuracy against the Boston Celtics.

Figure 9. Shot accuracy after made shots. The top graph combines all shots, while the bottom graph considers only three-point shots. Some people believe that making one shot affects the accuracy of the next shot. With the shot-log, we can actually explore this effect. For each player in the plots, the leftmost red bar represents the accuracy of all shots taken right after missing a shot. The green, blue and purple bars represent the accuracy after making 1, 2 and 3 consecutive shots. It is interesting to note that, for almost all of the players under study, having made one shot seems to have a negative effect on the following shot, and the more consecutive shots made, the lower the accuracy of the next shot. When only three-point shots are considered, this trend still holds for Curry and LeBron James.
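The “accuracy after a streak” calculation behind Figure 9 can be sketched as follows; this is my own reconstruction on a made-up 0/1 shot sequence, not the post’s actual R code:

```python
def accuracy_after_streak(made, k):
    """Accuracy of shots that immediately follow exactly k consecutive
    makes (k = 0 means the previous shot was a miss)."""
    hits = total = 0
    streak = 0  # consecutive makes ending at the previous shot
    for i, shot in enumerate(made):
        if i > 0 and streak == k:
            total += 1
            hits += shot
        streak = streak + 1 if shot else 0
    return hits / total if total else float("nan")

seq = [1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1]  # hypothetical made/missed sequence
print(accuracy_after_streak(seq, 0))  # → 0.75 (accuracy after a miss)
print(accuracy_after_streak(seq, 1))  # lower after one make, matching Figure 9
```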
Takeaways and Future Direction
From these graphs, we can see that the four stars have dramatically different play styles. For example, Stephen Curry tends to catch and quickly release, while Russell Westbrook prefers to attack the rim with the ball in hand. In terms of shot accuracy, Stephen Curry and LeBron James perform more consistently than Harden and Westbrook. Interestingly, in most cases, hitting one shot tends to have a negative effect on the next shot; a deeper exploration is needed to understand this phenomenon in more detail. As future directions, focusing on the defender side of the data is a potentially interesting extension. Furthermore, we could also apply machine learning techniques to predict the probability of hitting a shot.
Alternatives to algebraic modeling for complex data: topological modeling, by Gunnar Carlsson
For many, mathematical modeling is exclusively about algebraic models, based on one form or another of regression or on differential equation modeling in the case of dynamical systems.
However, this is too restrictive a point of view. For example, a clustering algorithm can be regarded as a modeling mechanism applicable to data where linear regression simply isn’t applicable. Hierarchical clustering can also be regarded as a modeling mechanism, where the output is a dendrogram and contains information about the behavior of clusters at different levels of resolution. Kohonen self-organizing maps can similarly be regarded in this way.
Topological data analysis is also a non-algebraic approach to modeling. In fact, one way to think about topological data analysis is as a new modeling methodology for point cloud data sets. As such, it is a very natural extension of what we usually think of as mathematical modeling.
The goal of this post is to discuss what we mean by mathematical modeling, and see how topological modeling fits with other modeling methodologies. In order to do this, let’s first observe that there are at least three important goals of a modeling system.
Compression: Any method of modeling should produce a very compressed representation of the data set, which is small enough for a user to understand. For example, linear regression takes a set of data points and represents it by two numbers, the slope and the y-intercept. The point is that most data sets contain too much information, and we need to strip away detail in order to obtain insight about the set.
Functionality: It should also allow the user to take actions based on the structure of the model. For example, linear regression permits the user to produce predictions of one or more outcome variables based on the values of the independent variables.
Explanations: The model should have the ability to explain its structure. For example, a clustering-based model should offer explanations of what distinguishes one cluster from another.

Let’s discuss how these ideas work out for some different modeling methodologies.
By algebraic modeling we mean any method that approximates a data set by algebraic equations. It builds on the idea of analytic geometry, introduced by Descartes, in which compression of geometric objects is achieved by recognizing that many geometric objects can be represented as the solution sets of equations. Approximating data sets by such geometric objects is what constitutes the regression procedure. The output of such a model is an equation or a system of equations. That there is compression is clear, since a geometric object consisting of infinitely many points can be represented by finitely many coefficients in an equation. The model clearly provides functionality, in the form of predictive capability. Finally, it supplies explanatory information, for example through the coefficients of the variables, which describe the dependence of an outcome variable on each individual independent variable.
While this case demonstrates that algebraic models can accommodate the goals of a modeling system, it is also important to note that algebraic models are often difficult to use for many kinds of data.
For example, consider the object below.
Representing this object algebraically requires equations of very high degree, which means that the degree of compression achieved is considerably smaller, since there are now many more coefficients that can vary. This illustrates the point that, to be very informative, algebraic models need to be of relatively low degree, for example linear or quadratic. To be sure, one can do regression using families of functions other than polynomials, but that requires choosing such a family, and it is rarely clear which family of functions would be appropriate.
Cluster analysis is not often thought of as modeling, but in my view it should be. The output of a cluster analysis is a partition of the data into disjoint groups. In this case, the compression is achieved by providing the collection of clusters, which is typically small. This kind of compression often allows one to obtain understanding of phenomena that naturally fit in this paradigm, rather than the algebraic one.
For example, the division of diabetics into type 1 and type 2 cohorts, or the division of voters into liberal, conservative or independent, is very useful conceptually, and this kind of behavior is common in a large number of data sets. One can attach functionality to such a model by constructing a classifier that operates on a new data point and gives its best guess about the cluster to which the point belongs. In many cases, explanations can also be achieved. For example, if a data set consists of vectors with several entries or features, one can determine which variables best distinguish a given cluster from another one, or from the entire data set, based on statistical characterizations of the degree to which a variable separates the clusters.
Hierarchical clustering is a particularly interesting methodology, since its output is a dendrogram, which is a subtler object than an algebraic equation or a partition. This modeling method was created to address the fact that many clustering algorithms require a choice of threshold, and it is often difficult to determine the right choice (if there is one) of the threshold. In this case, there is again compression, since one typically studies the high levels of the dendrogram, which produces a small number of clusters. In this case, there is also functionality in the model, since it can create clusters at every level, as well as determine the clusters at a higher level that contain a given cluster at a lower level. For example, in a clustering describing a taxonomy of animals, one can have a picture that looks as follows:
This is clearly an improvement over clustering at one or the other of the levels, since it encodes the fact that, say, Carnivora, Cetacea, and Chiroptera all belong to the higher-level group of mammals.
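The multi-resolution behavior described above is easy to see with standard tools. Below is a small sketch using SciPy’s hierarchical clustering on toy 2-D points (the data and the linkage method are my own choices, not from the post): cutting the same dendrogram at two heights yields clusterings at two resolutions, with the coarse clusters containing the fine ones.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six toy 2-D points forming two well-separated groups.
pts = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                [5.0, 5.0], [5.1, 5.2], [5.2, 4.9]])

Z = linkage(pts, method="average")  # the dendrogram, encoded as a merge table

# "Cut" the dendrogram at two different levels of resolution.
coarse = fcluster(Z, t=2, criterion="maxclust")  # near the top: 2 clusters
fine   = fcluster(Z, t=4, criterion="maxclust")  # lower down: more clusters
print(coarse, fine)
```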
Topological modeling is the methodology developed by Ayasdi. It produces as output a topological network instead of algebraic equations, partitions or dendrograms. It provides a great deal of compression of the data, since each node of the network corresponds to a (possibly large) group of data points, and the edges correspond to intersections of these groups. It also provides a great deal of functionality, including the following possibilities:
Layout of the network on the screen for visualization and unsupervised analysis.
The ability to select subgroups of the data set by selecting groups of nodes, using standard gestural techniques such as lassoing.
Given a group or pair of groups, determining the “explanatory variables” for the group vs. the rest of the data set or vs. the second group.
The ability to color the network by a quantity of interest, by coloring each node by the average value of the quantity on the points in the collection corresponding to the node.
Given a group defined externally, the ability to color each node by the fraction of the members of the corresponding group which belong to the externally defined group.
Creation of new features associated to the network structure. This can include analogues of “bump functions” based on groups, centrality measures in the graph distance, and others.
Within this topological model, four distinct leukemia subgroups are visible (colored orange, green, yellow and blue-green). The various leukemia types are numbered, and the topological model subgroups are colored by the average value of the leukemia types in each group. From left to right, the orange group primarily contains ALL and B-ALL types, the green group is almost pure CLL, the yellow group is AML and T-ALL derivatives, and the blue-green group is MDS and CML derivatives.
These primitive capabilities can be used in conjunction with each other to create methods to:
improve predictive models
deliver principled dimensionality reduction based on the data set’s network model structure
develop hot spot analysis for understanding the distribution of a quantity of interest or a predefined group within the data set.
As for explanations, the ability to color the network by a quantity of interest, described above, permits the explanation of various groups, and the distinctions between groups, in terms of the most distinguishing variables.
The main point to be made here is that mathematical modeling is not restricted to a small number of mathematical techniques, but should instead be viewed as the science (and art) of representing data by informative mathematical models of various kinds. Virtually all disciplines in mathematics contain kinds of objects that can serve as useful models in various domains. The key point about any methodology is the amount of insight it yields, and the extent to which it enables and speeds up the work of data scientists trying to extract knowledge from the data.
In 1945, Count Richard Taaffe*, a Dublin gem collector, was sorting through a set of spinel gems that he had bought and found one that refracted light differently: instead of simply bending light rays, it split them into two rays (“double refraction”). The anomalous gem was named after him and earned a place on the “world’s rarest gems” list. In analytics, it is sometimes not the rule (i.e. the model) that is of interest, but rather the exception.
Detecting anomalous cases in large datasets is critical in conducting surveillance, countering credit-card fraud, protecting against network hacking, combating insurance fraud, and many more applications in government, business and healthcare.
The techniques of anomaly detection are not new to the era of Big Data. Nitin Indurkhya, who teaches an Anomaly Detection course, told me of an interesting application of anomaly detection to data that long pre-dates the era of Big Data. A student in his course specialized in researching the records of civil service exams in Korea, dating back to medieval times. These challenging, high-stakes exams controlled access to jobs in the state bureaucracy. The student found that one particular patrilineal kin group in Korea, the Andong Kim, stood apart from the other 700: its members were much younger when they passed the exam than members of all the other groups.
Sometimes, the analyst has a set of known anomalies (labeled data), and identifying similar anomalies in the future can be handled as a supervised learning task (a classification model). More often, though, little or no such “training” data are available. In such cases, the goal is to identify cases that are very different from the norm.
Careful thought and analysis may be needed to define what is “normal.” Distance metrics (e.g. Euclidean Distance) are often used to define groups (clusters) of records that are close to one another and, by exclusion, anomalies. Applying this metric to stock returns, though, has limited utility. Stocks that have returns that are far from the crowd may not be anomalies – they may be members of a group of stocks that experience extreme variability in return. Using this distance metric, they are not only far from the crowd but probably far from each other, and yet they belong to a cluster that has its own (different) definition of normality. In this case, using other attributes of stocks in the analysis can help tease out true anomalies.
Other challenges pointed out by Chandola, Banerjee and Kumar (in “Anomaly Detection: A Survey,” ACM Computing Surveys, July 2009) include:
Noisy data, where the noise (which is uninteresting) may be similar to the truly interesting anomalies.
An evolving “normal,” where what is normal tomorrow may be different from what is normal today.
Some of the same statistical and machine-learning methods used for prediction and exploration are also used for anomaly detection, and many rely on some form of distance measure.
One is k-nearest neighbors, in which the metric used is “distance to the kth nearest neighbor.” The idea is to identify areas of density, where records are close to one another. Distance to the kth nearest neighbor will, in such areas, be relatively low, and a record that is quite distant from its kth nearest neighbor is anomalous.
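This score is simple enough to write down directly. The sketch below computes it with plain NumPy on synthetic data containing one planted outlier (the cluster parameters and the choice k = 3 are illustrative assumptions):

```python
import numpy as np

def knn_anomaly_scores(X, k):
    """Score each record by the distance to its k-th nearest neighbor:
    records in dense regions score low; isolated records score high."""
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))  # pairwise Euclidean distances
    dists.sort(axis=1)                          # column 0 is the self-distance, 0
    return dists[:, k]                          # distance to the k-th neighbor

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, size=(20, 2)),  # a tight cluster...
               [[8.0, 8.0]]])                     # ...plus one planted anomaly
scores = knn_anomaly_scores(X, k=3)
print(int(scores.argmax()))  # → 20, the index of the planted anomaly
```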
A related method is cluster analysis, in which distance between records is used to define clusters of records that are close to one another. Records outside of the clusters can then be classed as anomalous. (Note that both k-means and hierarchical clustering may produce both compact clusters, as well as spread-out clusters composed of leftovers; those leftover clusters may contain anomalous records.)
In using distance measures, one issue to be alert to is high dimensionality. When many variables are used, distances between all records end up being similar, and it is difficult to identify true outliers. Therefore, it is often best to reduce dimensionality first (e.g. via principal components analysis).
Compared to supervised methods for predictive modeling (i.e. training models based on labeled data), anomaly detection is more open-ended, and is a bit of an art. Effective application may require broad familiarity with multiple areas of statistics and data science, as well as domain knowledge. The most effective approach in given circumstances may depend on the nature of the domain, as well as characteristics of any expected anomalies.
For example, one application of anomaly detection is in alerting public health officials to outbreaks of new diseases. For starters, the detection algorithm needs to rule out attributes that only appear after an epidemic has blossomed to a dangerous state – early detection is the key. Next, looking just at Emergency Room admissions, there are several types of data that might be involved in detection:
Demographic data on the patients
Time series data
A detection algorithm might therefore take an ensemble approach to defining what is “normal” and, by extension, what is abnormal. Having existing data on unusual disease outbreaks would be useful, but perhaps too useful: excessive reliance on what will necessarily be very limited cases of actual outbreaks could drift into the equivalent of what, in supervised learning, would be termed overfitting.
Such a complex scenario encourages the development of high-cost vertical solutions that tend to be push-button and opaque to the user. But a data scientist wants to know what’s going on, and wants a variety of arrows in the quiver to meet different circumstances.
In addition to traditional statistical arrows like cluster analysis, principal components, and k-nearest neighbor, Indurkhya, in his course, also teaches non-standard methods. One such method is based in information theory and relies on measures of data complexity. The idea is to isolate the records that add the most complexity to the data. An intuitive way to think about and measure this, without getting too deep into information theory, is to consider data compression algorithms. The simpler a dataset is, the more it can be compressed (i.e. the less space it occupies once compressed). Thus, the ratio of the compressed size to the uncompressed size is a good proxy for the complexity of the data.
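The compression-ratio proxy is easy to demonstrate with a general-purpose compressor. The snippet below is a sketch of the idea only, not the specific method taught in the course: repetitive data compresses to a tiny fraction of its size, while pseudo-random data barely compresses at all.

```python
import random
import zlib

def complexity(data: bytes) -> float:
    """Compressed size over original size: a rough proxy for complexity."""
    return len(zlib.compress(data)) / len(data)

simple = b"abc" * 10_000                 # highly repetitive, low complexity
random.seed(42)                          # deterministic pseudo-random bytes
noisy = bytes(random.randrange(256) for _ in range(30_000))

print(f"{complexity(simple):.3f}")  # near 0: compresses extremely well
print(f"{complexity(noisy):.3f}")   # near 1: almost no structure to exploit
```

A record-level variant of this idea scores each record by how much the compressed size of the dataset grows when that record is included.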
The bottom line is that anomaly detection is a fertile and engaging area for data scientists – it leverages a broad range of knowledge, and requires you to get acquainted with the domain in question (often the most fun part of a project).
*Taaffe, who was Austrian, had an interesting heritage. His father held two aristocratic titles, both of which the family lost at the end of World War I: he was a Count in the Austro-Hungarian Empire and a Viscount in the Irish Peerage. The former title was lost when Austria abolished titles at the conclusion of the war, and the latter as a consequence of fighting on the wrong side of it.
When I tell people that I work at an AI company, they often follow up with “So what kind of machine learning/deep learning do you do?” This isn’t surprising, as most of the market attention (and hype) in and around AI has centered on Machine Learning and its high-profile subset, Deep Learning, and on Natural Language Processing, with the rise of chatbots and virtual assistants. But while machine learning is a core component of artificial intelligence, AI is in fact more than just ML. So what does it really mean for an application to be “intelligent”? What does it take to create a system that is “artificially intelligent”?

In Blade Runner, Harrison Ford and co. used the Voight-Kampff machine to determine whether a suspect was a replicant. In the real world, the late, great Alan Turing came up with a test to measure whether a machine can exhibit behaviour equivalent to that of a human, aptly known as the Turing Test. “A computer would deserve to be called intelligent if it could deceive a human into believing that it was human,” said Turing. Since its advent in the 1950s, it has been the most common litmus test employed in AI. The test is rudimentary, but it presents a baseline that a machine needs to reach before it can be considered intelligent.
Fooling Turing — Components of an AI system
Now, let’s look behind the curtain to see how we can create a system that can pass the Turing Test.
According to Norvig and Russell:
In order to pass the Turing Test, the computer would need to possess the following capabilities:
– natural language processing
– knowledge representation,
– automated reasoning, and
– machine learning
Now, I assume you already trust that I’m quoting an authoritative source, but in case there’s any doubt (or your memory needs jogging), Norvig and Russell are AI heavyweights. Dr. Norvig is currently Director of Research at Google, and Dr. Russell is a Professor of Computer Science at UC Berkeley. They wrote the textbook on artificial intelligence, used in more than 1,100 universities around the world.
What intelligent systems need to possess
Each component outlined above serves a different purpose in creating a system that mimics human-like intelligence. Natural Language Processing allows a machine to communicate and receive information in an organic, human form rather than as unwieldy lines of code. Machine Learning gives it the ability to adapt to new circumstances and to detect new patterns and facts.
Knowledge Representation and Automated Reasoning
As the lesser-known components of AI, Knowledge Representation and Automated Reasoning aren’t as commonly spoken about in the press but nonetheless play a key role in the creation of intelligent systems.
Knowledge Representation allows us to make sense of the complexity of information. I like to think of it as a “mental map” of how things relate to each other in the real world and in context.
In the world of bubbly alcoholic drinks, Champagne is a type of sparkling wine, but it is also a region that spans several departments in France. You only have to look at the number of disambiguation pages on Wikipedia to realize how many items share the same name but represent totally different concepts, which is something a human recognizes naturally but that needs to be made explicit to a machine. As another example, consider “face”: this could be, among other things, a clock face, a person’s face, the side of a cliff, a magazine, an album, or the verb describing the direction something is pointing.

So now that we have a model of how the world works, what is Automated Reasoning? Similar to the Type 2 thinking that Kahneman describes, automated reasoning allows a system to “fill in the blanks,” since there is no such thing as complete information or data with no gaps. So when tasked with finding out in what country Dom Perignon is made, the system would be capable of automatically inferring that it is France. The logic chain here is: Dom Perignon is a type of champagne; champagne is produced only in the region of Champagne; and Champagne lies in the country of France.
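The inference chain just described can be mimicked with a toy triple store; the relation names, the facts, and the single-path lookup below are illustrative assumptions, not a real knowledge-base implementation:

```python
# Toy knowledge base of (subject, relation, object) facts.
facts = {
    ("Dom Perignon", "type_of", "champagne"),
    ("champagne", "produced_in", "Champagne (region)"),
    ("Champagne (region)", "located_in", "France"),
}

FOLLOW = ("type_of", "produced_in", "located_in")

def infer_origin(entity):
    """Follow the chain of facts until no further link exists."""
    current = entity
    while True:
        nxt = next((o for s, r, o in facts
                    if s == current and r in FOLLOW), None)
        if nxt is None:
            return current
        current = nxt

print(infer_origin("Dom Perignon"))  # → France
```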
Knowledge Representation + Automated Reasoning = Knowledge Base
Just as behind every winning trivia/pub quiz team lies a knowledge base composed of multiple human brains, behind an intelligent system lies a synthetic knowledge base. It allows complex information to be stored, queries to be answered, and new facts to be inferred from existing data. This doesn’t mean that you don’t need NLP and ML capabilities to communicate effectively and to ingest new information; these are crucial elements of any intelligent system. However, an intelligent storage layer, where knowledge can be stored and retrieved efficiently and systematically, is also fundamental.