13 Great Articles About K-Nearest-Neighbors And Related Algorithms

This resource is part of a series on specific topics related to data science: regression, clustering, neural networks, deep learning, Hadoop, decision trees, ensembles, correlation, outliers, Python, R, Tensorflow, SVM, data reduction, feature selection, experimental design, time series, cross-validation, model fitting, dataviz, AI and many more. To keep receiving these articles, sign up on DSC.

Ux + AI = Cognitive Ergonomics

Summary:  The addition of AI capabilities to our personal devices, applications, and even self-driving cars has caused us to take a much deeper look at what we call ‘User Experience’ (Ux).  A more analytical framework identified as Cognitive Ergonomics is becoming an important field for data scientists to understand and implement.


I for one tend to be skeptical when I see a new term used in place of an existing data science term.  Is this new descriptor really needed to describe some new dimension or fact that we currently describe with a comfortable, commonly used phrase?  Or is this just some gimmick to try to differentiate the writer’s opinion?

That comfortable and commonly used phrase in this case is ‘User Experience’ (the ubiquitous abbreviation Ux).  Ux describes the totality of the end user’s experience with our app, personal device, IoT enabled system, wearable, or similar.  So far this has meant taking a great deal of care in designing screens, icons, buttons, menus, and data displays so they are as easy and as satisfying for the end user as possible.  It also means a design that as much as possible eliminates common user errors in interacting with our system.

Recently I’ve been seeing ‘Cognitive Ergonomics’ used where we used to see ‘Ux’ to describe this design process.  Ux was such a compelling shorthand.  Is there some substance in ‘Cognitive Ergonomics’ that might make us want to switch?

Gartner goes so far as to say that whether you are working on autonomous vehicles, smart vision systems, virtual customer assistants, smart (personal) agents, or natural-language processing, within three years (by 2020) “organizations using cognitive ergonomics and system design in new AI projects will achieve long-term success four times more often than others”.


What’s Changed?

What’s changed is the inclusion of AI in all these devices over just the last year or two.  The 2014 prediction of Kevin Kelly, Senior Editor at Wired, has come true:

“The business plans of the next 10,000 startups are easy to forecast: Take X and add AI.”

I personally dislike generalizations that make broad reference to “AI,” so let’s be specific.  The deep learning functionalities of image, video, text, and speech processing that rely on Convolutional Neural Nets (CNNs) and Recurrent Neural Nets (RNNs) are the portions of AI that are ready for prime time.  AI is much broader than this, but its other capabilities are still too developmental to be market ready.

Since CNNs and RNNs are basically the eyes, ears, and mouth of our ‘robot,’ some folks like IBM have taken to calling this cognitive computing.  This ties in nicely with what Cognitive Ergonomics is trying to communicate, since the AI we’re talking about comprises cognitive capabilities corresponding to eyes, ears, and mouth.

We no longer have just our fingers to poke at screens; we can now use voice and a variety of image recognition functionalities.  These have the capability to significantly improve our Ux but it seems correct that we should focus on all these new types of input and output by using the more specific name Cognitive Ergonomics.


What Exactly Is Cognitive Ergonomics?

The International Ergonomics Association says Cognitive Ergonomics has the goals of:

  • shortening the time to accomplish tasks
  • reducing the number of mistakes made
  • reducing learning time
  • improving people’s satisfaction with a system.

So it broadly shares with Ux the goals of making the application or product ‘sticky’ and ensuring that use of the product does not produce any unintended bad result (the way texting while driving causes accidents).

Where Cognitive Ergonomics exceeds our old concepts of Ux is that it expands our ability to interact with the system beyond our fingers to include our eyes, ears, and mouths.  This multidimensional expansion of interaction now more fully involves “mental processes, such as perception, memory, reasoning, and motor response, as they affect interactions among humans and other elements of a system”.

Cognitive Ergonomics is not new, but you have probably encountered it more in the design of complex airplane cockpits or nuclear reactor control rooms.  Projecting it onto our apps, personal devices, wearables, or even self-driving cars requires not only knowledge of the physical situation in which the device or app is used, but more particularly the end user’s unconscious strategies for performing the required cognitive tasks, and the limits of human cognition.


Some Examples

Nothing beats a good example or two to illustrate how our ability to interact with our devices is changing.


Natural Language Processing (NLP)

Voice search and voice response are major developments in our ability to interact not only with personal devices like our phones but also with consumer and business apps.

Voice search enables commands like ‘how do I get to the nearest hospital’ but also ‘will it rain at 5:00 today’ or ‘give me the best country songs of 2010’.

In terms of improving our satisfaction with apps, one of the strongest uses is to eliminate visual, multi-level menus.  Who hasn’t drilled down through too many levels of menus on their phone to turn a feature on or off?  Being able to simply say ‘turn off X application’ eliminates mistakes and frustration, and speeds results.

Commands no longer need to be one-way conversations.  Chatbots enable a dialogue with the app if the action is unclear.  This might be a device operating question, but it could as easily be a customer service issue or even personal psychological counseling.  (Andrew Ng recently announced Woebot, a chatbot that gives one-on-one psychological counseling for depression through Facebook Messenger, and by all reports it does a pretty good job.)

Perhaps the most important improvement is one impacting safety.  The ability of your app to ‘read’ texts aloud or to allow you to respond to a text by voice keeps your eyes on the road while you’re driving.


Facial and Image Recognition

It might seem that text/voice AI is the most prevalent advancement, but facial and image recognition is not far behind.  Of course cameras can face outward to guide self-driving cars, and they can capture street or business signs in foreign languages and offer audio translations.

Importantly, they can also face inward to detect many things about the user.  Facial recognition cues can trigger algorithms that work for our well-being or safety:

  • Face detection and localization
  • Facial expression
  • Assessment of head turns and viewing direction
  • Person tracking and localization
  • Articulated body tracking
  • Gesture recognition
  • Audio-visual speech recognition
  • Multi-camera environments


Other Haptic Inputs and Outputs

Not yet as common, but coming shortly, is a wide variety of other sensors beyond voice, text, and image.  One widely written about in partially self-driving vehicles is a camera that scans the driver to ensure their eyes are on the road, so that they are ready to take over from the AV at a moment’s notice.  When the scan detects that the driver is visually distracted, it can provide haptic feedback by vibrating the steering wheel.

Apple recently introduced a texting feature that automatically suspends notifications and ring tones when it detects you are driving, presumably using a combination of GPS and accelerometers to measure movement.

Other types of feedback sensors are already in use in some games, which can measure heart rate, tightness of grip on the controller, or even galvanic skin resistance to detect sweat.  Combined, these can be used as guides to how excited or frustrated you are with the game.  Based on these inputs, the game can then become either easier or more challenging to keep you engaged.

Another soon-to-be-major input/output device will be the Augmented Reality headset with even greater implications for this category.


How Will This Change the Life of the Data Scientist?

Ux these days is still considered largely an art form, but given the need for multi-dimensional inputs and outputs, Cognitive Ergonomics will require more analytics in design and objective testing.

Gartner and others already see this in an uptick in the need for data scientists who can work with unstructured text data.  While this will often happen through easier-to-use packages or the expansion of current platforms to include this feature, working with unstructured text will be a new skill for many data scientists used to supervised modeling.

CrowdFlower recently released its survey of 170 data scientists in mid-size and large organizations; more than half already report that a significant portion of their work involves unstructured data.  The majority of that unstructured data is text, with almost half also reporting work with image and video.

While not all chatbot or image recognition projects are driven by the need to improve Ux, a significant majority are, which foreshadows the importance of Cognitive Ergonomics for data scientists.



About the author:  Bill Vorhies is Editorial Director for Data Science Central and has practiced as a data scientist since 2001.  He can be reached at:



Top 5 Virtual Reality Companies Looking To Change The World

Truly a game-changer, virtual reality technology aims to make our interaction with technology more personal. In simple words, VR technologies build a virtual environment in which you can interact with 3D computer simulations in some physical way. Wearable equipment such as gloves or headsets lets you feel the interaction physically. VR banks on two of our most important senses: sight and sound. Though it is largely used in the online gaming world, VR technology has applications in other industries as well, including education, retail, medicine, travel, and e-commerce.

The year 2016 witnessed a lofty peak in attention surrounding virtual reality technology, with some of the biggest tech players investing in it. Facebook acquired Oculus VR (maker of the Rift headset for VR gaming) in a whopping $2 billion deal, the biggest acquisition in the VR field to date.

The post below is a brief look at 5 companies planning to make it big in the virtual reality world.


When it comes to cultivating the latest technologies, Microsoft definitely ranks in the top 5. The company has already filed 365 patents for virtual reality technologies. As per the reports, it has released HoloLens, a headset that enables interaction with holograms. The headset targets professionals such as engineers, architects, designers, and construction workers, who can use it in the field to bring ideas and plans to life.


One of the biggest investors in virtual reality, Qualcomm has joined Google and other VR patrons to fund Florida-based Magic Leap in a $542 million deal. Magic Leap is experimenting with VR technology, and Qualcomm has some really high hopes for it. The company has already filed patents under the name of “Data Processing,” an advanced computing method designed for commercial, financial, administrative, supervisory, forecasting, or managerial purposes.


Samsung’s first notable brush with VR technology carries an adorable touch: the company used VR streaming to live-stream a childbirth to a father’s headset, as the father was unable to be at the hospital on the special day. The company joined hands with Vortex to create a $100 mobile virtual reality headset that streams the display of an LG G3 Quad HD smartphone. As per the reports, Samsung’s VR efforts seem to be greater in quality than Microsoft’s, even though the latter has more patents.


The discussion would be incomplete without Google, which has been surprising the tech world with interesting VR accessories for some time now. In 2014, the company launched “Cardboard,” a cardboard viewer that lets users employ their mobile handsets as an AR/VR platform. News has it that the company plans to come up with its “true” virtual reality headset this year, which could pose serious competition to Samsung’s Gear VR.


Microgaming is all set to infuse VR technology into the online gambling world in a big way with its first ever VR roulette game. The company’s CryoLab revealed a VR video at ICE 2016 introducing VR Roulette, played with a Leap Motion 3D controller and an Oculus Rift DK2 headset. The game is meant to study analog communication across virtual and real environments, tracking the gamer’s hand movements and then projecting the data into a 3D world.


Logistic Map, Chaos, Randomness and Quantum Algorithms

The logistic map is the most basic recurrence formula exhibiting various levels of chaos depending on its parameter. It has been used in population demographics to model chaotic behavior. Here we explore this model in the context of randomness simulation, and revisit a bizarre non-periodic random number generator discovered 70 years ago, based on the logistic map equation. We then discuss flaws and strengths in widely used random number generators, as well as how to reverse-engineer such algorithms. Finally, we discuss quantum algorithms, as they are appropriate in our context.

Logistic map

The logistic map is defined by the following recursion

X(k) = r X(k-1) (1 – X(k-1))

with a positive parameter r less than or equal to 4. The starting value X(0), called the seed, must be in [0, 1]. The higher r, the more chaotic the behavior. The onset of chaos is at r = 3.56995… From that point on, starting from almost all seeds, we no longer see oscillations of finite period: slight variations in the initial population yield dramatically different results over time, a prime characteristic of chaos.
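To see the sensitivity to initial conditions concretely, here is a minimal Python sketch (function name and seed values are illustrative, not from the article) that iterates the recursion at r = 4 from two nearly identical seeds:

```python
# Iterate the logistic map X(k) = r * X(k-1) * (1 - X(k-1))
def logistic_orbit(r, x0, n):
    """Return the orbit [X(0), X(1), ..., X(n)] of the logistic map."""
    xs = [x0]
    for _ in range(n):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs

# Two seeds differing by 1e-9 diverge quickly in the chaotic regime (r = 4)
a = logistic_orbit(4.0, 0.3, 50)
b = logistic_orbit(4.0, 0.3 + 1e-9, 50)
print(abs(a[50] - b[50]))   # the two orbits have completely decorrelated by now
```

The initial gap of 10^-9 roughly doubles at each iteration (the Lyapunov exponent at r = 4 is log 2), so by iteration 40 or so the two trajectories bear no resemblance to each other.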

When r = 4, an exact solution is known, see here. In that case, the explicit formula is

X(k) = sin^2( 2^k arcsin( sqrt( X(0) ) ) )

The case r = 4 was used to build a random number generator decades ago. The k-th number Z(k) produced by the random number generator in question is equal to

Z(k) = (2 / pi) arcsin( sqrt( X(k) ) )

The numbers Z(k) are uniformly distributed on [0, 1]. I checked whether they were correlated, and could not find any statistically significant auto-correlations. Interestingly, I initially found all this information in a military document published in 1992, still hosted here on the .mil domain, and of course not classified (not sure if it was ever classified). The original work is by S. M. Ulam and J. von Neumann, and was published in 1947. Very few seeds result in periodicity (an infinite number of them actually, but they are extremely rare, just like rational numbers among all real numbers).
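The generator fits in a few lines of Python. The transformation Z(k) = (2/pi) arcsin(sqrt(X(k))) is the classical Ulam–von Neumann construction; the clamp guarding against floating-point rounding is my addition, not part of the original scheme:

```python
from math import asin, pi, sqrt

def ulam_von_neumann(seed, n):
    """Logistic-map (r = 4) random number generator of Ulam & von Neumann.
    Iterates X(k) = 4 X(k-1)(1 - X(k-1)) and maps each iterate to
    Z(k) = (2/pi) * arcsin(sqrt(X(k))), which is uniform on [0, 1]."""
    x = seed
    out = []
    for _ in range(n):
        x = 4.0 * x * (1.0 - x)
        x = min(max(x, 0.0), 1.0)   # guard against tiny floating-point overshoot
        out.append((2.0 / pi) * asin(sqrt(x)))
    return out

z = ulam_von_neumann(0.123456789, 100000)
print(sum(z) / len(z))   # close to 0.5, as expected for uniform deviates
```

The raw X(k) follow the arcsine density, which is why the arcsin transform (the CDF of that density) is needed to obtain uniform deviates.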

The logistic map has been generalized, for instance in two dimensions; see here for an application to population dynamics. The 2-dimensional case has also been used in image encryption, see here. Also, in 2012, Liu Yan and Tong Xiao-Yung published a paper entitled A new pseudorandom number generator based on complex number chaotic equation. However, the simplest non-periodic good random number generator might just be defined in the complex plane by

Exact solution and recursion can easily be obtained for the real and imaginary parts separately (both yield real-valued random number generators) using complex exponentiation. Just like with the logistic map generator, a transformation must be applied to map the non-uniform deviate X(k) to a uniform deviate Z(k) on [0, 1] x [0, 1]. Note that the complex norm ||X(k)|| is equal to 1. The case b = 2 is similar to the logistic map generator. Such home-made generators are free from NSA backdoors, in contrast, for instance, with the Dual Elliptic Curve Deterministic Random Bit Generator.

Big Flaws in Popular Random Number Generators

Many applications require very good random number generators: for instance, to produce secure cryptographic systems, or when simulating or testing a large number of stochastic processes to check whether they fit particular statistical distributions, as in my recent article on self-correcting random walks. In another article, about the six degrees of separation problem, I needed a random number generator capable of producing millions of distinct integer values, and that is how I discovered the flaws in the Perl rand() function still in use today.

This generator can only produce 32,768 distinct values, and when you multiply any generated value by 32,768, you obtain an integer. These values are supposed to simulate uniform deviates on [0, 1].  Applications designed using this generator might not be secure. Indeed, the Perl documentation states that

“Rand() is not cryptographically secure. You should not rely on it in security-sensitive situations. As of this writing, a number of third-party CPAN modules offer random number generators intended by their authors to be cryptographically secure, including: Data::Entropy, Crypt::Random, Math::Random::Secure, and Math::TrulyRandom.”

Chances are that similar issues can be found in many random number generators still in use today. I also tested the Excel rand() function, but could not replicate these issues. It looks like Microsoft fixed glitches found in previous versions of Excel, as documented in this report. The Microsoft document in question provides details about the Excel generator: essentially, it is equivalent to the sum of three linear congruential generators with periods equal to 30,269, 30,307, and 30,323 respectively, so its period is at best the product of these three numbers.
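The three-generator scheme with those moduli is the classic Wichmann-Hill algorithm; assuming that identification is correct, a Python sketch looks like this (the seed values below are arbitrary nonzero integers, each below its modulus):

```python
def wichmann_hill(s1=171, s2=172, s3=170):
    """Generate uniform deviates by summing three small linear congruential
    generators modulo 1 (the Wichmann-Hill scheme with moduli 30269, 30307, 30323)."""
    while True:
        s1 = (171 * s1) % 30269
        s2 = (172 * s2) % 30307
        s3 = (170 * s3) % 30323
        yield (s1 / 30269 + s2 / 30307 + s3 / 30323) % 1.0

gen = wichmann_hill()
sample = [next(gen) for _ in range(5)]
print(sample)
```

Each component generator is weak on its own, but summing the three modulo 1 stretches the overall period to (at best) the product of the three component periods, on the order of 10^13.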

To test these flaws and indeed reverse-engineer a random number generator producing deviates on [0, 1], one can proceed as follows.

  • Generate a few deviates. Multiply the deviates by N, testing all N between 1 and 1,000,000,000, to see if some N always results in an integer. Note that if you try too large a value of N, say N = 10^14, you will always get an integer, but in that case it is due to machine precision, not the random number generator.
  • Generate millions of deviates. When you find one identical to the first deviate, check whether you have reached periodicity, with all subsequent values repeating the numbers found at the beginning. If so, you have discovered the period of your random number generator.
  • Generate millions of deviates. Check how many distinct values you get. For instance, if you generate 10,000,000 deviates and find only 32,768 distinct values, your random number generator has a severe issue.
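The first and third tests above can be sketched as follows, run here against Python's built-in Mersenne twister and against a simulated 15-bit generator of the kind described earlier (the helper names are mine):

```python
import random

def find_integer_multiplier(deviates, max_n=100000):
    """Test 1: look for an N such that N * deviate is (numerically) an integer
    for every deviate; a hit exposes a low-resolution generator."""
    for n in range(2, max_n + 1):
        if all(abs(d * n - round(d * n)) < 1e-9 for d in deviates):
            return n
    return None

def count_distinct(num_draws, rng=random.random):
    """Test 3: count distinct values among num_draws deviates;
    a small count exposes a weak generator."""
    return len({rng() for _ in range(num_draws)})

random.seed(42)
deviates = [random.random() for _ in range(100)]
m = find_integer_multiplier(deviates)   # None: Mersenne twister deviates pass this test
d = count_distinct(100000)              # essentially all 100,000 draws are distinct

# A simulated weak generator producing only multiples of 1/32768, like Perl's rand()
weak = [random.randrange(32768) / 32768.0 for _ in range(100)]
m_weak = find_integer_multiplier(weak)  # 32768: the common denominator is exposed
print(m, d, m_weak)
```

The 1e-9 tolerance matters: for a weak generator the distance from N times a deviate to the nearest integer is either exactly zero or at least 1/32768, so genuine hits are cleanly separated from floating-point noise.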

Interestingly, unlike the random number generator produced by the logistic map and discovered in 1947 (described above in this article), or generators based on hardware or on the time function of your computer, many random number generators used today (for instance the one used in Java) are based on linear congruence and are thus periodic, and poor. But a few are excellent, with such a large period that they are unbreakable for all practical purposes, such as the Mersenne twister, with a period of 2^19937 - 1.

An advantage of the logistic map generator over hardware-based generators is that you can reproduce the same sequence of random numbers over and over, in any programming language and on any computer — a critical issue in scientific research, with many tests published in scientific journals not being reproducible. Other non-periodic, high-quality random number generators offering the same reproducibility include those based on irrational numbers. For instance, a basic recursion no more complicated than the logistic map formula produces all the binary digits of SQRT(2)/2, one per iteration, and these digits are known to behave randomly according to standard tests. See also here for an application about building a lottery that would not be illegal. For a fast random number generator based on the decimals of Pi, click here.

Note that the Perl random number generator has a period larger than 2^23 according to my computations, despite producing only 32,768 distinct values. It could even be non-periodic; after all, the binary digits of SQRT(2) produce only two distinct values, 0 and 1, yet they are not periodic, otherwise the number would be rational.

Finally, the standard Diehard tests used to assess the randomness of these generators should be updated. The suite was published in 1995, when big data was not widespread, and I am not sure whether these tests can detect departures from randomness in generators used in sensitive applications requiring extremely large quantities of pseudo-random numbers.

Quantum algorithms

Quantum algorithms work well for applications that require performing a large number of repetitive, simple computations for which no shortcut seems available. The most famous one today is probably Shor’s algorithm, which factors a product of integers. Traditional encryption keys use a product of two very large prime numbers, large enough that no computer is able to factor them. Factoring a number is still considered an intractable problem, requiring one to test a bunch of numbers as potential factors, without any obvious pattern that could accelerate the search. It is relevant to our context, as much of our concern here is with cryptographic applications and security. Also, these encryption keys should themselves appear as random as possible, and products of large primes mimic randomness well enough. Maybe one day a classical (non-quantum) but efficient factoring algorithm will be found (see my articles on the subject), but for now, quantum computing with Shor’s algorithm seems promising, and could theoretically solve this problem very efficiently, thus jeopardizing the security of systems ranging from credit card payments to national security. This is why there is so much hype about this topic. Yet in 2014, the largest integer factored with quantum computers was only 56,153, a big increase over the record of 143 achieved in 2012. Because of these security concerns, new post-quantum cryptographic systems are being designed.

Another algorithm, mentioned in the previous section, is the detection of a period in long sequences that may or may not be periodic and, if periodic, may have a very large period. This is a version of the element distinctness problem. For instance, as discussed earlier, in my six degrees of separation problem I had to make sure that my random number generator was able to produce a very large number of distinct integers, each one representing a human being. Distinctness had to be tested. Again, this is an example of an algorithm where you need to test many large blocks of numbers, independent from each other, without an obvious shortcut to increase efficiency. Quantum algorithms can solve it far more efficiently than classical algorithms. As an alternative, the logistic map is known to produce infinite sequences of pseudo-random numbers that are all distinct, so no test is needed. The related problem of counting the number of distinct values in a large set of integers, also mentioned in the previous section, can probably benefit as well from quantum architectures.

For further reading, I suggest searching for Simulation of a Quantum Algorithm on a Classical Computer. Since quantum algorithms are more complex than classical ones, for quantum computing to become popular, one will have to develop a high-level programming or algorithmic language that can translate classic tasks into quantum code.

For related articles from the same author, click here or visit www.VincentGranville.com. Follow me on Twitter at @GranvilleDSC or on LinkedIn.


Searching and Smelling by Objects to Find Profiles and Settings

A person will ordinarily search the contents of a database using matching keywords and tags.  Sophisticated databases might allow for filtering:  for example using NOT, AND, OR on a number of keyword strings such as both titles and product descriptions.  It is not normally possible to submit, say, a personality profile to a database – or a personality profile and a setting.  Searching for “serial murders subway terminals” might lead to event information about precisely this, apparent serial murders occurring in subway terminals.  There is no “personality” per se in a “setting.”  By setting I mean, for example, a “dimly lit underground enclosure” that might manifest itself as a “subway terminal.”  By personality, I might actually be looking for an “aggressive individual exhibiting pathological indifference to the needs of others.”  A database intended for general public use is naturally optimized for headlines and buzzwords.  For professional applications, I consider it worthwhile to have a database system that can respond to the submission of complex data objects.

I maintain three databases:  1) holding character data; 2) holding settings; and 3) holding narratives.  The narrative database actually contains all available data including characters and settings.  Therefore, if I want to find a particular character in a specific setting, I submit the code for these objects to the narrative database; and the database grades all of the cases.  This is not to say that the character and setting ever came together in real-life.  I might just want to find out what sort of stories “seem” to extend from a particular character and/or setting.  The narrative database contains additional data that the other databases don’t have.  It contains “circumstantial” data.  All of the circumstantial data is lumped together; and the likelihood of finding close matches is quite poor given the large number of circumstances likely associated with any storyline.

My databases interact with data objects through the use of “thematic openings.”  This is an interesting concept.  A thematic opening is like a listing of tags meant to mean something.  For example, “snowstorm,” “poor visibility,” “closed highways,” point to “mobility impairment”; this is assuming there isn’t a hierarchical structure that has to be taken into account.  A listing of thematic openings might be “mobility impairment,” “zombie attack,” “cannibalism,” “family dinner party.”  Maybe I am being a bit loose with my example.  The general purpose of thematic openings is to establish feel or atmosphere – or the exact expression I use is “to give scent.”  On the other hand, the submission of a data object leads to “thematic placements” – the odour being sought.  The program that I use to determine resemblance to odours is called Bloodhound.

Along with the idea of a thematic opening come the concepts of thematic analogs and pathologs.  An analog, as it pertains to themes, is meant to add information that is possibly absent from the case in question.  For example, “employee stress” is not clearly connected to “marital problems.”  But if employee stress and marital problems exist in the same theme, the database cannot help but infer a possible connection.  In fact, if the theme is called “poor workplace performance,” the presence of employee stress “might” also mean that marital problems are involved.  An analog therefore represents an assertion outside the confines of the case.  A patholog is an analog meant to reveal potential pathologies in the character that might not be indicated by the data available from the case itself.  In this way, especially in the absence of data, there is a framework to pursue “leads.”

Those unfamiliar with my “Animal Spirit Model” might not recognize the spreadsheet below; it contains both (N) normal and (P) pathological settings (indicated by the position of the “<” characters).  Although a character might not appear to have any pathological traits on the surface, the spirit object asserts potential connections.  Obviously, there should be conversation here on the validity of my pathologs.  I have to point out however that at least a conversation can take place.  The settings can be discussed and changed.  This isn’t frothy discourse but something directly influencing database search patterns.  I also want to emphasize how in most cases there might be an absence of “individual” data – particularly psychological test data – especially in relation to behaviours outside a controlled environment.  The patholog below is called “Maximale” – meant to draw connections between aggressive personalities and potential pathologies.


Next I provide an explanation – using some rather peculiar examples – of my conceptualization of pathologies.  The underlying data object is called “Melting Vulcan,” which registers in the areas of “I’m Peter Pan,” “If I blow hard enough,” and “Mirror Mirror.”  I constructed this data object from what information was available about Stephen Paddock – the Vegas shooter.  Pathology for me relates to the augmentation of reality through the insertion of fabricated reference points.  The closest common example that some of us might encounter in routine life relates to alcoholism.  I admit however that apart from the materials I reviewed during my graduate studies, I am no expert on alcoholism or pathologies for that matter.  I am however an authority on the construction of pathologs since I invented them.  Or at least maybe I invented them.  I’m sure readers will correct me.  I find that titles emerge on Google after I make my blog submissions.  At the moment, I’m not seeing any use of the term “pathalog.”

The flatness of themes ignores the fact that hierarchies might exist.  For example, in an assertion of “spontaneous,” the presence of a number of tags might not be enough to validate the resemblance.  The tags might have to be present in a certain way.  It is therefore necessary to maintain testable hierarchical constructs.  Again, although these constructs are debatable, at least I have a setting where they can be scrutinized and challenged.  Hierarchical constructs are important in relation to pathologs since there might be contradictions between intent, appearance of intent, and actions.  For instance, a person might be extremely forceful and yet be quite indifferent about the subject matter.  (A “normal” person is normally passionate about something he believes in.)  Or this individual might draw a lot of conclusions while exhibiting an inability to take the available information into account.  (A “normal” person tends to draw conclusions related to the information available — unless of course most of his conclusions are socially constructed, in which case these aren’t conclusions at all but preconceptions.)

Using a conventional search engine to search by keywords is fine in most cases, especially since it eliminates contentious arguments relating to psychological models.  However, the absence of debate does not bring a person any closer to the truth.  I believe that things withstanding the test of time tend to change with the times.  My database system is “model friendly.”  I fully expect to continue developing new models, applying them, and testing their effectiveness.  Theory is fine.  Having the ability to sift through the data in order to test those theories is better.  I don’t believe that complex ideas can be tested without the ability to analyze sophisticated data objects.  Using quantitative analysis on its own hardly seems reasonable, especially if the outcome is some kind of disassociated metric.  We have to “unplug” (unshackle) the analysis.  We need to embrace the quotidian details of existence.

Searching and Smelling by Objects to Find Profiles and Settings

Bitcoin Price Forecasting Using Model with Experts Opinions

One of the main goals of Bitcoin analytics is price forecasting. Many factors influence the price dynamics. The most important are the interaction between supply and demand, attractiveness to investors, financial and macroeconomic indicators, and technical indicators such as mining difficulty and how many blocks were created recently. Trends in social networks and search engines also have a very important impact on the cryptocurrency price. Using these factors, one can create a regression model that fits the bitcoin price well on the historical data. To forecast the price, however, we need the values of these factors in the future. Here we can use quantified opinions of experts, for whom predicting these factors is a simpler problem than predicting the target variable, the bitcoin price, directly. To build the regression model for the bitcoin price, we loaded the appropriate time series from the Quandl, Yahoo Finance and Google Trends services. For the modeling, we used Python with the packages pandas, numpy, scipy, matplotlib, seaborn, sklearn, quandl and pystan. Using linear regression with Lasso regularization and feature normalization, we found the important factors that form the bitcoin price. The following picture shows the bitcoin price and the predicted values on the historical data:
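The Lasso fit described above can be sketched as follows. This is a minimal illustration on synthetic data: the factor names and the generated series are placeholders, not the article's Quandl/Yahoo/Google Trends data.

```python
# Sketch: Lasso regression of price on candidate factors with feature
# normalization. Synthetic data stands in for the real time series.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
factors = ["difficulty", "blocks_per_day", "google_trend", "sp500"]
X = rng.normal(size=(n, len(factors)))
# true model: only "difficulty" and "google_trend" matter
price = 3.0 * X[:, 0] + 1.5 * X[:, 2] + rng.normal(scale=0.5, size=n)

X_scaled = StandardScaler().fit_transform(X)   # normalize features
model = Lasso(alpha=0.1).fit(X_scaled, price)  # L1 penalty zeroes weak factors

for name, coef in zip(factors, model.coef_):
    print(f"{name}: {coef:.3f}")
```

The L1 penalty shrinks the coefficients of irrelevant factors toward zero, which is what makes the fitted coefficients usable as a measure of factor importance.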

Here are the coefficients of the factors in the regression model:

Here are some time series for factors:

For a probabilistic approach, which makes it possible to obtain risk assessments, one can use Bayesian inference. To take extreme values into account, we can describe the bitcoin price using distributions with fat tails, e.g. Student's t-distribution. For the Bayesian modeling, we used the Stan software with the pystan Python package. The following figure shows the probability density functions of the regression coefficients for the important factors:
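The article fits this model with Stan/pystan; as a numpy-only sketch of the same idea, the snippet below samples the posterior of a regression with a Student's t likelihood (fat tails) using a simple random-walk Metropolis algorithm on synthetic data. All numbers here are illustrative.

```python
# Sketch: Bayesian regression with a fat-tailed Student-t likelihood,
# sampled with random-walk Metropolis (in place of Stan's NUTS sampler).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
# heavy-tailed noise: Student-t with 3 degrees of freedom
y = 2.0 * x + 1.0 + stats.t.rvs(df=3, scale=0.5, size=n, random_state=rng)

def log_post(theta):
    slope, intercept = theta
    # flat priors; t likelihood down-weights extreme residuals
    return stats.t.logpdf(y - (slope * x + intercept), df=3, scale=0.5).sum()

theta = np.array([0.0, 0.0])
lp = log_post(theta)
samples = []
for _ in range(5000):
    prop = theta + rng.normal(scale=0.1, size=2)
    lp_prop = log_post(prop)
    if np.log(rng.uniform()) < lp_prop - lp:   # Metropolis accept/reject
        theta, lp = prop, lp_prop
    samples.append(theta)
samples = np.array(samples)[1000:]  # drop burn-in

print("posterior mean slope:", samples[:, 0].mean())
```

The resulting posterior samples play the role of the coefficient PDFs shown in the figure: instead of a single point estimate per factor, we get a full distribution from which risk quantities can be read off.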

Let us consider forecasting the bitcoin price using experts' opinions. Suppose we have two experts with opinions about the factors Difficulty and Google Trends, and we assume the other factors stay the same as at the end of the historical data. Suppose each expert can predict the most probable, minimum and maximum percent change for the factor under study. An expert's opinion can then be approximated by a triangular distribution. Suppose the experts made the following predictions:

Difficulty:

Expert 1: min=10%; mode=15%; max=30%. Expert 2: min=15%; mode=30%; max=50%.

Google Trend:

Expert 1: min=0%; mode=10%; max=25%. Expert 2: min=-25%; mode=45%; max=55%.

These experts’ opinions can be described by the following probability density functions:
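Each (min, mode, max) forecast above maps directly onto a triangular distribution, e.g. with numpy's generator. The snippet uses the Difficulty numbers from the example; the sample size is arbitrary.

```python
# Sketch: an expert's (min, mode, max) percent-change forecast as a
# triangular distribution (Difficulty values from the example above).
import numpy as np

rng = np.random.default_rng(2)

diff_e1 = rng.triangular(left=0.10, mode=0.15, right=0.30, size=100_000)
diff_e2 = rng.triangular(left=0.15, mode=0.30, right=0.50, size=100_000)

# the mean of a triangular distribution is (min + mode + max) / 3
print("Expert 1 mean change: %.3f" % diff_e1.mean())
print("Expert 2 mean change: %.3f" % diff_e2.mean())
```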

Using a Monte-Carlo approach, we can also take into account the confidence we place in each expert. Suppose Expert 1 is more precise in 70% of predictions and Expert 2 in 30%. These confidence values can be estimated on a validation set if we have multiple past expert predictions, or they can be set by an experts' moderator. Having the probability density functions for the regression coefficients and the density functions of the experts' opinions for the impact factors, we can obtain a probability density function (PDF) for the forecasted bitcoin price, which is shown in the figure:
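The confidence weighting can be sketched as follows: for each simulated scenario we first pick which expert to trust (with probability 0.7 / 0.3) and then draw the factor changes from that expert's triangular distribution. The regression coefficients and the noise scale below are made-up placeholders; in the article they come from the fitted Lasso/Bayesian model.

```python
# Sketch: Monte-Carlo mixture of two experts' triangular forecasts,
# weighted by confidence, propagated through hypothetical coefficients.
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

# choose which expert's forecast to use for each simulated scenario
use_e1 = rng.uniform(size=n) < 0.7

difficulty = np.where(use_e1,
                      rng.triangular(0.10, 0.15, 0.30, n),
                      rng.triangular(0.15, 0.30, 0.50, n))
trend = np.where(use_e1,
                 rng.triangular(0.00, 0.10, 0.25, n),
                 rng.triangular(-0.25, 0.45, 0.55, n))

# hypothetical regression coefficients and residual noise
beta_difficulty, beta_trend = 0.8, 0.5
price_change = (beta_difficulty * difficulty + beta_trend * trend
                + rng.normal(scale=0.05, size=n))

print("most probable change (median): %.3f" % np.median(price_change))
```

The array `price_change` is an empirical version of the forecast PDF: any summary of interest (mode, quantiles, tail probabilities) can be computed from it directly.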

Having the PDF for the bitcoin price, we can find the most probable value of the forecasted price and the Value at Risk (VaR), which is important for the assessment of different types of risk.
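Given Monte-Carlo samples from the forecast PDF, both quantities are one-liners; `price_samples` below is a synthetic stand-in for the model's output, and 95% is an illustrative confidence level.

```python
# Sketch: most probable price and Value at Risk from forecast samples.
import numpy as np

rng = np.random.default_rng(4)
price_samples = rng.normal(loc=10_000, scale=800, size=100_000)

most_probable = np.median(price_samples)
var_95 = np.percentile(price_samples, 5)  # 95% VaR: 5th percentile of price

print(f"most probable price: {most_probable:.0f}")
print(f"95% VaR level: {var_95:.0f}")
```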
