Posts

Symmetric and Asymmetric Encryption

What are Encryption and Cryptographic Keys?

Encryption is an age-old practice dating back to the times of the Roman general Julius Caesar, who encrypted his messages using a Caesar cipher. The practice can be viewed as a transformation of information: the sender starts with plain text, which is encoded into cipher text so that an eavesdropper cannot recover the original message. On receiving the encoded message, the intended receiver decrypts it to obtain the original plain text. Once data is encrypted, it can only be decrypted using the appropriate key, called a cryptographic key. A cryptographic key is a piece of information, like a password, used to encrypt and decrypt data. There are two types of cryptographic keys, and correspondingly two types of encryption: symmetric and asymmetric.

What is Symmetric Encryption?

Symmetric encryption, also called secret key cryptography, employs the same secret key...
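As a quick illustration of the cipher described above, here is a minimal Caesar-cipher sketch in Python; the shift value and function names are illustrative, not taken from the post:

    # Minimal sketch of a Caesar cipher: shift each letter by a fixed key.
    def caesar_encrypt(plain_text: str, shift: int = 3) -> str:
        result = []
        for ch in plain_text:
            if ch.isalpha():
                base = ord('A') if ch.isupper() else ord('a')
                result.append(chr((ord(ch) - base + shift) % 26 + base))
            else:
                result.append(ch)  # leave spaces/punctuation unchanged
        return ''.join(result)

    def caesar_decrypt(cipher_text: str, shift: int = 3) -> str:
        # Decryption is encryption with the opposite shift (same shared key).
        return caesar_encrypt(cipher_text, -shift)

    print(caesar_encrypt("attack at dawn"))  # dwwdfn dw gdzq
    print(caesar_decrypt("dwwdfn dw gdzq"))  # attack at dawn

Note that sender and receiver use the same key (the shift), which is exactly the symmetric-key idea the post goes on to describe.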

Optimizations of Gradient Descent

Introduction

Gradient Descent is one of the most popular techniques for optimizing machine learning algorithms. We've already discussed Gradient Descent in the Gradient Descent with Python article and gave some intuitions toward its behaviour. We've also made an overview of choosing the learning rate hyper-parameter for the algorithm in the hyperparameter optimization article. So by now you should have a fair understanding of how it works. Today we'll discuss different ways to optimize the performance of the algorithm itself.

Gradient Descent Variants

We've already covered three variants of Gradient Descent in the Gradient Descent with Python article: Batch Gradient Descent, Stochastic Gradient Descent and Mini-Batch Gradient Descent. What we haven't discussed are the problems that arise when using these techniques. Choosing a proper learning rate is difficult: too small a learning rate leads to tremendously slow convergence, while a very large learning rate that ca...
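To make the learning-rate point concrete, here is a minimal sketch of gradient descent on a toy quadratic objective; the objective and the rates are made up for illustration:

    # Toy objective: f(w) = (w - 3)^2, with gradient f'(w) = 2 * (w - 3).
    def grad(w):
        return 2.0 * (w - 3.0)

    def gradient_descent(lr=0.1, steps=50):
        w = 0.0
        for _ in range(steps):
            w -= lr * grad(w)  # step in the direction opposite the gradient
        return w

    print(gradient_descent(lr=0.1))   # converges near the minimum w = 3
    print(gradient_descent(lr=1.1))   # too large a rate overshoots and diverges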

Overview of Machine Learning Metrics

Introduction

One of the core tasks in building a machine learning model is to evaluate its performance. The usual data science pipeline consists of prototyping a model on some historical data, reaching a satisfying model and deploying it into production, where it will go through further testing on live data. These stages are usually called offline and online evaluation, where the former analyses the prototyped model on historical data and the latter the deployed model on live data. Surprisingly to some, evaluation is really hard, as good measurements are often vague or infeasible. Also, statistical models generally assume that the distribution of data stays the same over time. But in practice the distribution of data changes constantly, sometimes drastically. This is called distribution drift. One way to detect distribution drift is to keep tracking the model's performance on the validation metric on live data. That's why any data science project cannot just end after the model i...
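As a sketch of the drift-tracking idea above, here is a hypothetical helper that flags live-metric windows falling too far below the offline baseline; all names and numbers are made up:

    # Hypothetical sketch: watch a live validation metric for distribution
    # drift by comparing each window's score against the offline baseline.
    def detect_drift(live_scores, baseline=0.92, tolerance=0.05):
        """Return indices of windows whose metric drops more than
        `tolerance` below the offline baseline (illustrative values)."""
        return [i for i, score in enumerate(live_scores)
                if baseline - score > tolerance]

    weekly_auc = [0.91, 0.90, 0.89, 0.84, 0.80]  # made-up live metrics
    print(detect_drift(weekly_auc))              # -> [3, 4]: drift suspected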

Markov chain Monte Carlo with PyMC

Markov Chain Monte Carlo (MCMC) is a technique for generating a sample from a distribution, and it works even if all you have is a non-normalized representation of the distribution. Why does a data scientist care about this? Well, in a Bayesian analysis a non-normalized form of the posterior distribution is super easy to come by, being just the product of likelihood and prior, so MCMC can be used to sample from (essentially simulate) a Bayesian posterior. In Python one of the most widely used packages for doing exactly this is called PyMC.

What is Markov chain Monte Carlo?

A Markov Chain is a sequence of random variables {X_n}, each of which takes an observed value from the state space of possible values, {x}. A Markov Process is a sequence of such RVs where the distribution of the initial RV's value is specified (Π_0), along with a Transition Rule P which gives the probability of transitioning from one state to another, P(i,j) = P(X_{n+1} = x_j | X_n = x_i), for all pairs of states. Notice P doe...
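To make the definitions above concrete, here is a small NumPy sketch of a two-state Markov chain with initial distribution Π_0 and transition matrix P; the numbers are chosen purely for illustration:

    import numpy as np

    # Two-state chain: pi0 is the initial distribution, and
    # P[i, j] = P(X_{n+1} = x_j | X_n = x_i) is the transition rule.
    pi0 = np.array([1.0, 0.0])          # start in state 0 with certainty
    P = np.array([[0.9, 0.1],
                  [0.5, 0.5]])

    rng = np.random.default_rng(0)
    state = rng.choice(2, p=pi0)
    chain = [state]
    for _ in range(10_000):
        state = rng.choice(2, p=P[state])  # apply the transition rule
        chain.append(state)

    # Empirical state frequencies approach the stationary distribution.
    print(np.bincount(chain) / len(chain))  # roughly [0.833, 0.167]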

Natural Language Processing with Python

Introduction

Natural language processing, or NLP, is the process of analyzing text and extracting insights from it. It is used everywhere, from search engines such as Google or Bing, to voice interfaces such as Siri or Cortana. The pipeline usually involves tokenization, replacing and correcting words, part-of-speech tagging, named-entity recognition and classification. In this article we'll describe tokenization, using a full example from a Kaggle notebook. The full code can be found in the GitHub repository.

Installation

For the purposes of NLP we'll be using the NLTK Python library, a leading platform for working with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, as well as wrappers for industrial-strength NLP libraries. Installing the package is easy using the Python p...
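As a minimal taste of the tokenization step described above, here is a short NLTK sketch; the sample sentence is made up, and depending on your NLTK version you may need to download additional tokenizer data:

    import nltk

    nltk.download('punkt')  # tokenizer models (one-time download)

    text = "Natural language processing is fun. Let's tokenize this text!"
    sentences = nltk.sent_tokenize(text)  # split into sentences
    tokens = nltk.word_tokenize(text)     # split into word tokens

    print(sentences)
    # ["Natural language processing is fun.", "Let's tokenize this text!"]
    print(tokens)
    # ['Natural', 'language', 'processing', 'is', 'fun', '.', 'Let', "'s", ...]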

Length extension attack

What is length extension?

When a Merkle–Damgård based hash is misused as a message authentication code with the construction H(secret ‖ message), and the message and the length of the secret are known, a length extension attack allows anyone to append extra information to the end of the message and produce a valid hash without knowing the secret. Quick sidebar, before you freak out: since HMAC does not use this construction, HMAC hashes are not prone to length extension attacks. So, a length extension attack is a type of attack where an attacker can use Hash(message1) and the length of message1 to calculate Hash(message1 ‖ message2) for an attacker-controlled message2, without needing to know the content of message1. Algorithms like MD5, SHA-1 and most of SHA-2 that are based on the Merkle–Damgård construction are susceptible to this kind of attack. Truncated versions of SHA-2, including SHA-384 and SHA-512/256, are not susceptible, nor is the SHA-3 algorithm.

A bit more detail ...
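To make the contrast concrete, here is a short Python sketch of the vulnerable H(secret ‖ message) construction next to HMAC; the key and message values are made up:

    import hashlib
    import hmac

    secret = b"server-side-key"        # illustrative values
    message = b"amount=100&to=alice"

    # Vulnerable construction: H(secret || message). Because SHA-256 is
    # Merkle-Damgard based, an attacker who knows this digest and len(secret)
    # can extend the message without ever learning the secret itself.
    naive_mac = hashlib.sha256(secret + message).hexdigest()

    # HMAC wraps the hash twice with keys derived from the secret, so the
    # final digest is not the raw internal state and length extension
    # does not apply.
    safe_mac = hmac.new(secret, message, hashlib.sha256).hexdigest()

    print(naive_mac)
    print(safe_mac)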

SVM kernel approximation with Python

Introduction

A high-performance SVM classifier will likely need thousands of support vectors, and the resulting high complexity of classification prevents their use in many practical applications with large numbers of training samples or large numbers of features in the input space. To handle this, several approximations to the RBF kernel (and similar kernels) have been devised. Typically, these take the form of a function z that maps a single vector to a vector of higher dimensionality, approximating the kernel: z(x)·z(y) ≈ Φ(x)·Φ(y) = k(x, y), where Φ is the implicit mapping embedded in the RBF kernel.

Implementation

Scikit-learn has already implemented two such techniques, a Fourier transform approximation and the Nystroem approximation, via the RBFSampler and Nystroem classes respectively.

    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn import datasets, svm, pipeline
    from sklearn.kernel_approximation import RBFSampler, Nystroem

    # The digits dataset
    digits = datasets.load_digits(n_class=9)

    # To apply an clas...
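Since the snippet above is cut off, here is a minimal runnable sketch along the same lines, training a linear SVM on RBFSampler and Nystroem features; the parameter values and split are illustrative, not taken from the original post:

    from sklearn import datasets, pipeline
    from sklearn.svm import LinearSVC
    from sklearn.kernel_approximation import RBFSampler, Nystroem

    digits = datasets.load_digits(n_class=9)
    X = digits.data / 16.0   # scale pixel values into [0, 1]
    y = digits.target

    # Approximate the RBF kernel with an explicit feature map z,
    # then train a linear SVM on the transformed features z(X).
    for name, sampler in [("RBFSampler", RBFSampler(gamma=0.2, random_state=1)),
                          ("Nystroem", Nystroem(gamma=0.2, random_state=1))]:
        clf = pipeline.make_pipeline(sampler, LinearSVC())
        clf.fit(X[:1000], y[:1000])
        print(name, "accuracy:", clf.score(X[1000:], y[1000:]))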