## Information Entropy — a Study of Unpredictability

### Introduction

When Mom says your room is messy, have you ever wondered, feeling treated unfairly, how she could make such a judgment? What’s the foundation of her conclusion? Perhaps you have many trinkets in your room, but does that make your room “messy”? Doctors have hundreds of gadgets in their operating rooms, but those rooms are among the tidiest places on earth. What’s the difference between complicated and complex? How exactly do we measure the level of messiness? Well, that’s what we are going to talk about in this article — a concept in information theory that measures the level of randomness — entropy.

### 1. Random Variables

Just like graph algorithms are based on graphs, entropy can only be brought down to earth by the notion of random variables. It’s hard to internalize the idea of entropy without a basic understanding of random variables. If you are already familiar with them, please skip to section 2. If you are not, don’t worry. I will walk you through the concept quickly. Some very basic background knowledge is enough for our discussion of entropy.

#### 1.1 Basic Ideas of Random Variables

You must have seen things like this in high school.

A = “It will rain tomorrow”, P(A) = 30%

The A here is a naive form of a random variable. We now need to define the concept more concretely. At a higher level, random variables are just like graphs: they are abstractions of things. Graphs, for instance, are abstractions of network-like data structures like road maps or social networks. Similarly, random variables are abstractions of uncertain events. For instance, we can represent “the result you get after tossing a fair coin” by a random variable X. We set X = 0 if the coin shows tail and X = 1 otherwise. By common sense, we know that P(X = 0) = 0.5 (the probability of getting a tail on a toss is one half) and that P(X = 1) = 0.5 (the probability of getting a head on a toss is also one half).

Note that we are, in most cases, not very interested in the numeric value of a random variable, because it is just an indicator or representation of an event. We are instead more interested in the probability that a random variable takes on a particular value. This is what distinguishes random variables from normal variables. If a normal variable y = 2, then y IS the integer 2. They are equivalent. There’s nothing more to be said. But for random variables, numeric values are just tokens that indicate a certain event in the sample space. For our random variable X, we don’t care about the number 0 or 1. We can even use different numbers. For example, we can instead set X = 111111 if the coin shows tail and X = 99999 if the coin shows head. It doesn’t change the meaningful properties of X. What we care about is that P(X = 111111) = 0.5 and P(X = 99999) = 0.5, which are the probabilities of events in the sample space.

The coin tossing case is just a very simple example. Random variables can represent more complicated uncertain events such as “the number you get after tossing a fair dice,” or “the number you get by asking a person to choose an integer between 1 and 10000.” Rather than saying, “what’s the probability of getting a six after tossing a fair dice?” We can instead evaluate P(X = 6).


#### 1.2 Continuous Random Variables (Optional)

The random variable X we talked about is discrete, which means it only takes on non-continuous values. For example, it is impossible to get a 5.4542 on a fair dice toss. However, random variables are not limited to the discrete world. They are also good models for random events with continuous outcomes. For instance, we can use a random variable Y to represent the real number we get by tossing a “baseball” onto the x-axis.

In a word, Y is just X’s counterpart in the continuous world. The probability formulas related to them are very similar, except that one involves summation and the other employs integration. I will not go more in-depth on those here because they don’t contribute to an introductory understanding of entropy.

#### 1.3 Probability Distribution

Random variables are good models for uncertain events, and their most useful property is the probability distribution, which shows how the probabilities of different events are distributed across the entire sample space. For example, this diagram shows the distribution of random variable X (the outcome of a fair dice toss), which is uniformly distributed across 1-6.

Notice that the sum of all the probabilities must be one.

### 2. Entropy — a Measurement of Unpredictability

OK, now let’s talk about entropy. In a word, entropy measures the unpredictability of a random variable. It is a ruler of randomness. I don’t want to throw its mathematical form at you right away; I think that leads to passive memorization instead of active exploration. Now that we understand what random variables are, let’s see if we can devise a function to measure their randomness ourselves. Don’t worry. As long as you have taken high school algebra, there is nothing algebraically complicated in the following content.

#### 2.1 Two Dimensions of Randomness

We have many basic functions in our mathematical toolbox. And we want to piece some of them together in a way that can properly measure the level of randomness. But before that, we need to figure out what factors influence randomness. Otherwise, we won’t even be able to know where to begin.

For starters, consider two discrete random variables G and R describing the respective outcomes of tossing two different dice, and think about which one is “less predictable.” This is not hard at all. Please do make sure that you figure this out before you continue reading!

Obviously, the green dice is a fair dice and therefore less predictable, while the red dice is a cheating dice, maybe one of those in the movie Inception. We can predict, with considerable confidence, that 4 will be the outcome of a toss of the red dice. The moral here is that the clustering of probability in a random variable’s distribution (for R, probability clusters around the value 4) causes the level of randomness to decline. In other words, randomness peaks when probabilities are spread out evenly across the range of a random variable.

Now that we have obtained our first revelation, let’s move a step further. Consider random variable G which describes the outcome of tossing a fair six-sided dice and M, which represents the outcome of tossing a fair eighteen-sided dice. Which random variable is less predictable, or more random?

The answer is intuitive: although both random variables have evenly spread probability distributions, M can take on 18 values instead of 6, and is therefore more unpredictable. We can thus obtain our second revelation: the wider the range of a random variable, the higher its level of randomness.

#### 2.2 Devising a Measurement of Randomness

What we just derived are the two dimensions of randomness: uniform distribution or clustered distribution? Wide range or narrow range? And we have figured out their correlations with randomness. Now let’s devise a function that takes these two dimensions into account and outputs a universal measurement of unpredictability. Let’s call this potential function H(X), where X is our random variable. Since we know the range of X positively influences its level of randomness, it seems that we should walk through all possible values k that X can take on and sum something over them. That way, the wider the range, the more terms we sum up, and the bigger H(X) will be. Since we are not sure what that “something” is, let’s just call it f( ). Thus, we know that our function will look something like this: $H(X) = \sum_{k} f(P(X=k))$

By doing the sum, we hooked up our measurement H(X) with the range of X to implement revelation 2. Good! Now let’s figure out what f( ) should be. How should we treat an individual probability? We know the sum of f( ) is supposed to take X’s probability distribution into account: the more evenly spread the distribution is, the bigger the sum should be. But it is a sum of many f( ) terms we are talking about here, which is tricky. Let’s focus on a single k for now. For example, k = 4 in our cheating dice example.

We see that probability clusters around k = 4, and therefore f(P(X=4)) should contribute very little to our total sum, because our first revelation told us that clustering of probability decreases randomness. It seems that the closer the probability of a value k is to 0, the bigger f( ) should be, because if we have tiny probabilities spread everywhere across a wide range of values, our random variable indeed seems pretty “unpredictable.” What function in our mathematical toolbox has this property? Well, there is one that fits perfectly — the logarithmic function. If we negate the logarithmic function, which means adding a negative sign in front of it, we get a function where y increases as x approaches 0, and y = 0 when x = 1. This makes perfect sense, because if the probability of, say, k = 4 equals 1, which means the dice will land on 4 with 100% certainty, there is no randomness at all. Our measurement should indeed output 0 in that case.

Well, are we good? Should we just use -log(x) as f( ) and call it done? Our current candidate is: $H(X) = \sum_{k} -log_{2} (P(X=k))$

Is this going to work? Consider this example below.

In the counter-example above, G is highly predictable, yet our “ruler of randomness” outputs a big value because f(P(X=2)) is so big. Unfortunately, our model fails: -log(x) alone doesn’t work, because it only considers the probability of a single k value instead of the entire distribution. But we are very close to the final answer. There is only one missing piece of the puzzle: if we weight -log(x) by the corresponding probability, the problem is fixed. $H(X) = \sum_{k} (P(X=k)*-log_{2} (P(X=k)))$

In the counter-example above, if we weight -log(P(X=k)) by P(X=k), the problem of a minuscule probability causing a major increase in randomness is resolved: if the probability is very small, the weight P(X=k) will also be very small, which stops -log(P(X=k)) from dominating the sum.

And that last formula is the mathematical definition of entropy in information theory. If we simplify the notation a little bit and use p to represent P(X=k), the formula can also be written as: $H(X) = \sum_{p} -plog_{2} p$

### 3. Another Intuition behind Entropy — a Measurement of Informativity

The unit of entropy is the bit. Sound familiar? Well, this is not a coincidence. In fact, information theory is where the word “bit” and its kin units started to be widely used to measure data size. Not only have we just devised a measurement of randomness, but we have also found a good model for the level of informativity. For example, a fair eight-sided dice has an entropy of 3 bits, which means that if we use the dice as a “telegraph signal source,” each toss is capable of sending out 3 bits of information.

You can think of the clustering of probability as a force which stops our dice signal source from sending out whatever number it wants. Therefore, the clustering of probability will lead to a decreased level of informativity.

If, for example, our dice can only land on the number 3, we can only ever send out one number if we use the dice as a signal source. The receiver learns nothing new from a toss, so its entropy is 0 bits, exactly what the formula gives: $-1 \cdot log_{2} 1 = 0$.

### Summary

The idea of entropy is one of the fundamental concepts in information theory, developed by American mathematician Claude Shannon. Its applications include data compression and machine learning algorithms like decision trees. Entropy is a delicate and informative tool once you can see the logic beneath the formula. If you find anything above hard to understand, please leave a comment.

## Queue Optimized Bellman-Ford Algorithm

### Introduction

Let’s talk about single-source shortest paths. If you are given a weighted graph and a starting node, how do you find the shortest paths to every other node on the graph?

Dijkstra’s algorithm? Yes, but what about graphs with negative edges like the one above? You might remember that Dijkstra’s greedy strategy cannot handle negative edges. In competitive programming, there is another popular algorithm that is slightly slower than Dijkstra’s but is much easier to implement and is capable of handling negative edges — the Bellman-Ford algorithm. In this article, we will talk about the queue optimized Bellman-Ford algorithm, which is also known, to some Chinese scholars, as the Shortest Path Faster Algorithm (SPFA). For simplicity, I will just refer to it as SPFA in the following content. Let’s get started!

### 1. Core Idea

Just like many other shortest path algorithms, SPFA assumes the distances to all nodes to be infinity before doing anything, as if all other nodes were disconnected from the starting node. Then, by exploring the edges, SPFA gradually shortens the distances until there is no room for further optimization. In this distance-updating process, SPFA maintains an examine queue, which is the heart of the algorithm. The queue stores nodes that have to be examined for distance updates, and SPFA terminates when the queue empties, which indicates that no further optimization is possible.

That is just an overview of the algorithm. It’s totally fine if you are still confused. Just like I said, the examine queue is the heart of the algorithm. If you understand the queue, you understand SPFA. Let’s walk through an analogy to see how the examine queue is maintained.

Consider a technician Dave whose job is to collect malfunctioning parts in a batch of old cars. The cars come in a queue. Dave can identify a defective car by running a quick test, but the tests won’t tell him exactly which part of the car is malfunctioning. The only way to do that is to disassemble the car and check each of its parts. If an integrated part contains multiple sub-parts, for example, the engine, he would have to disassemble it as well to locate the real malfunctioning part, which could be as small as a tiny gear inside the engine. Dave is a smart worker. He doesn’t like to work in the orthodox way, which means only moving on to the next car after locating the exact malfunctioning parts of the previous car. Instead, he works like this:

• If the quick test shows that a car/integrated part is defective, Dave disassembles it, puts the sub-parts at the end of the queue, and moves on to the next item.
• If an item is defective and cannot be disassembled into smaller parts, Dave picks it out and moves on to the next item.

As you may have guessed, the queue of cars in our example stands for the examine queue in SPFA. The idea is that we only push a part’s sub-parts to the examine queue if the quick test shows that it is defective. This is the core idea of maintaining the examine queue in SPFA: if we find a shorter path to a node S, the current distance value we have for that node is not good enough, or “defective.” Therefore, after we shorten its distance value, it is necessary to check for distance updates on its adjacent nodes, so we push S to the end of our examine queue.

The queue structure may remind you of Breadth-First Search (BFS). Admittedly, there are some similarities. But SPFA pushes a node into the queue not when it’s discovered, but when its distance is shortened, or, as some may call it, relaxed, and SPFA may push a node into the examine queue multiple times. The commonality is that they both terminate on the exhaustion of the queue.

### 2. Example

One of the problems with many SPFA articles online is that they confuse readers by tracing the algorithm on complicated graphs. It’s really hard to get the intuition of the algorithm when you cannot even tell which node is which. Let’s walk through an example on a small graph to give you the essence of how SPFA works before diving into detailed C++ implementations.

Let’s say Bill just arrived in a new town and decided to explore it. He discovered that from his location, there is a direct road to the theater, which has a big traffic jam, and a roundabout freeway which goes through the farm. He also found that the only way to get to the police station is to get to the theater first. Bill just learned SPFA and wants to find out the fastest way to get to each of these locations in the city. Notice that the edge weights here represent travel time rather than physical distance, but for consistency, let’s just call them distance.

Initially, SPFA assumes that all the nodes are disconnected from the starting node. Then it pushes the starting node into the examine queue to kick start the relaxation process.

After that, it examines the first item in the queue and updates the distances to its adjacent nodes. If a node’s distance is updated, that means our distance record for it was defective, so we also need to examine that node’s adjacent nodes later, which means we push it into the examine queue. You can see here that if a node is in the examine queue, it has the potential to bring distance updates to its adjacent nodes.

SPFA repeats this process until the examine queue is exhausted, which indicates that no node can further shorten our current distance values, meaning we have already found the shortest paths to all the nodes on the graph.

Pause! Take a look at the graph.

You can see that at this point, SPFA found a shorter path to the theatre, which is the path through the farm. We’ve relaxed the distance to the theatre, which means for the theatre’s successor nodes, in this case, the police station, their distances could also be defective, because we used the traffic-jam road to the theatre when calculating their distances! So, we push the theatre to the examine queue and use the shorter path we have found to update its successor nodes later.

And when the queue is empty, the algorithm terminates. We got our answer.

### 3. Detect Negative Cycles

As we mentioned before, SPFA can handle graphs with negative weight edges. However, a graph has no shortest path if it contains a negative cycle, i.e., a cycle whose edge weights sum to a negative number, because the total distance can be made arbitrarily small by looping around the cycle.

In the example above, there is no shortest path to the theatre. The more times Bill walks through the cycle, the smaller his total distance gets. Does this mean that our algorithm may fail? No. We can design a termination condition for our algorithm that detects negative cycles.

Since a negative cycle causes our examine queue never to be exhausted, it results in an infinite loop. Therefore, the first and easiest solution is to set a ceiling on the number of nodes that can be pushed into the queue. For example, if more than $|V| \cdot |E|$ nodes have been pushed into the queue, terminate the algorithm and print “No Solution,” because $|V| \cdot |E|$ (number of vertices times number of edges) is the worst-case number of pushes for the algorithm.

If we want to be more efficient, we can keep track of the number of edges used to get to each node. If any count exceeds |V| - 1, we know the path has looped through some negative cycle. Where does this boundary come from? I don’t want to confuse you with a mathematical proof. Just think about it: if a graph is connected, it is possible to get to any node from a starting point using at most |V| - 1 edges. Think about a linked list! A linked list with three nodes contains two “edges.” If a path contains more than |V| - 1 edges, it must have visited some node twice, which means it contains a cycle. And it cannot be a positive cycle, because positive cycles increase total distance, and our algorithm never chooses longer paths. Therefore, it has to be a negative cycle that decreases the distance.

### 4. Implementation

Let’s talk about how you can implement SPFA at contests. If you are not familiar with competitive programming graph construction techniques, please refer to this article.

First of all, for graphs without negative cycles, you could implement the algorithm like this.



#include <cstdio>
#include <cstring>
#include <iostream>
#include <queue>
#include <vector>
#define maxn 1005
#define maxm 100005

using namespace std;

//struct for directed edge
struct Edge{int to,w;};
//adjacency list: adj[i] stores node i's outgoing edges
vector<Edge> adj[maxn];
//Examine queue
queue<int> q;
int n,m;
//dist[i] stores distance to node i
int dist[maxn];
//inq[i] stores whether node i is in the queue
bool inq[maxn];

//read input data and build graph
void init(){
    scanf("%d%d", &n, &m);
    for(int i = 1; i <= m; i++){
        int x,y,z; scanf("%d%d%d", &x, &y, &z);
        adj[x].push_back({y, z}); //add directed edge x->y with weight z
    }
    //Initialize distance to every node to infinity
    memset(dist, 127, sizeof(dist));
}

void spfa(){
    //Node 1 is the starting node
    inq[1] = true; q.push(1); dist[1] = 0;
    while(!q.empty()){
        int t = q.front(); q.pop(); inq[t] = false;
        for(int i = 0; i < (int)adj[t].size(); ++i){ //traverse all the outgoing edges
            int tn = adj[t][i].to; //the node this edge points to
            if(dist[tn] > dist[t] + adj[t][i].w){ //If there is room for relaxation
                dist[tn] = dist[t] + adj[t][i].w; //relax
                //If the relaxed node is not in the queue, add it to the queue.
                if(!inq[tn]){inq[tn] = true; q.push(tn);}
            }
        }
    }
}

int main(){
    init();
    spfa();

    for(int i = 2; i <= n; i++){
        cout << dist[i] << endl;
    }
    return 0;
}




And for graphs with negative cycles, just add an array cnt, where cnt[i] stores the number of edges used to get to the i-th node.



//Queue optimized bellman-ford with negative cycle detection
#include <cstdio>
#include <cstring>
#include <iostream>
#include <queue>
#include <vector>
#define maxn 1005
#define maxm 100005

using namespace std;

//struct for directed edge
struct Edge{int to,w;};
//adjacency list: adj[i] stores node i's outgoing edges
vector<Edge> adj[maxn];
//Examine queue
queue<int> q;
int n,m;
//dist[i] stores distance to node i
int dist[maxn];
//inq[i] stores whether node i is in the queue
bool inq[maxn];

//cnt[i] stores the number of edges used to get to i
int cnt[maxn];

//read input data and build graph
void init(){
    scanf("%d%d", &n, &m);
    for(int i = 1; i <= m; i++){
        int x,y,z; scanf("%d%d%d", &x, &y, &z);
        adj[x].push_back({y, z}); //add directed edge x->y with weight z
    }
    //Initialize distance to every node to infinity
    memset(dist, 127, sizeof(dist));
}

bool spfa(){
    //Node 1 is the starting node
    inq[1] = true; q.push(1); dist[1] = 0;
    while(!q.empty()){
        int t = q.front(); q.pop(); inq[t] = false;
        for(int i = 0; i < (int)adj[t].size(); ++i){ //traverse all the outgoing edges
            int tn = adj[t][i].to; //the node this edge points to
            if(dist[tn] > dist[t] + adj[t][i].w){ //If there is room for relaxation
                dist[tn] = dist[t] + adj[t][i].w; //relax

                /*Check for negative cycle*/
                cnt[tn] = cnt[t] + 1;
                if(cnt[tn] > n-1) return false;

                //If the relaxed node is not in the queue, add it to the queue.
                if(!inq[tn]){inq[tn] = true; q.push(tn);}
            }
        }
    }
    return true;
}

int main(){
    init();

    if(spfa()){
        for(int i = 2; i <= n; i++){
            cout << dist[i] << endl;
        }
    }else{
        cout << "No Solution";
    }
    return 0;
}



### Summary

The worst-case complexity of SPFA is $O(|V||E|)$, which is dangerous on dense graphs. There are shortest path contest problems whose test data are designed specifically to stop SPFA. It is safer to use Dijkstra’s algorithm when the graph is big and contains no negative weight edges. But the queue optimized Bellman-Ford guarantees quick AC on small graph problems and, most importantly, on graphs with negative edges.

## Quick Graph Construction

### Introduction

The study of graphs makes up a big part of computer science. It plays an important role in competitive programming problems, too. Before delving into graph algorithms, it is important to know how to build effective graphs in the first place. Forget the full-fledged Graph class with add_edge(), delete_edge(), get_weight() and other functions defined in it, because coding it up is too time-consuming. In contests, you need more efficient ways to construct graphs. I will introduce a few classic techniques in this article, which you can use as templates in graph problems. But before you start reading, make sure you are already familiar with coding conventions in competitive programming and the basic properties of graphs.

### 0. Input Data Format

In contest problems, graphs come in different forms. They can be maps, social networks, traffic flows, etc. However, no matter what real-life entities they represent, the formats of the input data are often very similar: usually the size of the graph followed by edge data. Here are some input data patterns of different types of graphs.

Note that real problems might not strictly stick to these formats. But as long as you are familiar with the general pattern, it’s not hard to apply one of the following three methods to construct the corresponding graph.

### 1. Adjacency Matrix with Primitive Arrays

This is probably the easiest way to build a graph. Abstractly, a graph is just a set of pairs. If two nodes are connected by an edge, they form a pair. Therefore, we can describe a graph with N nodes by merely storing all the pairs it contains. If you remember the concept of the Cartesian product, the grid of an N by N array can store all possible pairings among N items, just like the x-y plane gives you all possible (x, y) tuples. This way, graph construction means simply ticking existing pairs in a 2D array.

What about weighted graphs? Just change the ones to edge weights!

For undirected graphs, adjacency matrices are always symmetrical. However, it’s not necessarily so in directed graphs.

Pretty simple, right? However, the adjacency matrix trades efficiency for convenience. If your graph has more than 1,000 nodes, this method might give you an MLE (Memory Limit Exceeded) or TLE (Time Limit Exceeded). Since an entire row has to be traversed to find a node’s outgoing edges, adjacency matrices can compromise the optimal runtime of many graph algorithms. For example, BFS and DFS on adjacency matrices have a runtime of $O(|V|^2)$ rather than $O(|V|+|E|)$, and the Hungarian algorithm with adjacency matrices has a runtime of $O(|V|^3)$ rather than $O(|V||E|)$. Only use an adjacency matrix if you have a small graph. Finally, let’s see the code to construct a graph with an adjacency matrix.

Consider this input.

You can construct the graph like this.



#include <cstdio>
#define maxn 505 //Since n <= 500, 505 should be enough for every test case.
using namespace std;

int n, m;
int g[maxn][maxn]; //g[x][y] stores the weight of edge x->y (0 means no edge)

int main(){
    scanf("%d%d", &n, &m); //Read n and m
    for(int i = 1; i <= m; i++){
        int x,y,z; scanf("%d%d%d", &x, &y, &z);
        g[x][y] = z; //tick the pair (x,y) with its edge weight
    }
    return 0;
}




That concludes our discussion of adjacency matrices. Again, if any style-wise stuff of the above code confuses you, you might find this article about contest coding style useful.

### 2. Adjacency List with Struct and Vector Array

Since adjacency matrices only work for small graphs, we need a technique that can handle graphs with a large number of nodes. This is when the adjacency list comes under the spotlight. Rather than listing all possible pairs and leaving lots of empty memory, an adjacency list only stores the pairs that exist.

How do we store this data structure in C++? Fixed-size arrays would make us fall back to adjacency matrices: the list has to be stored in a resizable container. The STL vector is a good choice. Consider an array of vectors called g.

And there’s not much of a difference for directed graphs, for which the list only stores outgoing edges.

Up to this point, our adjacency list can only store unweighted graphs. Each vector contains a list of adjacent nodes’ indices. What if each edge has a weight? Then we need to encapsulate edge data with the help of struct.



struct Edge{
/*
to: The node to which the edge points
w: Edge weight
*/
int to, w;
};




Now, we can use an array of vector<Edge> to store a weighted graph.

With that in mind, let’s write some C++ code to build a graph with adjacency list. Consider this example.



#include <cstdio>
#include <vector>
#define maxn 10005

using namespace std;
int n,m;
struct Edge{int to, w;};
vector<Edge> g[maxn]; //g[x] stores node x's edges

int main(){
    scanf("%d%d", &n, &m); //Read graph size
    for(int i = 1; i <= m; i++){
        int x,y,z; scanf("%d%d%d", &x, &y, &z);
        g[x].push_back({y, z});
        /*
        If the graph is undirected, we should also add this edge
        to the other node it connects:
        g[y].push_back({x, z});
        */
    }
    return 0;
}




Again, feel free to check out this article if any of the above code confuses you.

### 3. Adjacency List with Primitive Arrays

Welcome to the most sophisticated graph construction method! If you are new to competitive programming, the previous two techniques should be enough for now. This last technique specializes in solving more advanced graph problems such as network flow, where there is a need to access reverse edges or reconstruct paths quickly. Since it only uses primitive arrays, this method is often preferred by some old school contestants.

The basic idea of this technique is to store all the edges in a big array, hook up edges from the same node with linked lists, then use another array called head to keep track of the starting edge of each linked list. I understand this might sound confusing. Let’s walk through the idea step by step.

Firstly, as we read the input data, store the edges inside an array called arr. This array is nothing but a big warehouse of edges. Its purpose is to allow quick access of edge data through simple array indices instead of pointers.

Our second step is to hook up the edges of the same node with a linked list. The name might remind you of complicated, error-prone pointer operations. However, don’t forget that we are talking about contest coding. EVERYTHING HAS TO BE SIMPLE! Here, since we use the “edge warehouse,” we can build a linked list by simply adding a variable called next to our struct declaration, which stores the warehouse index of the next edge.



struct Edge{
    int to, w;
    //This variable stores the index of the next edge in arr (the edge warehouse)
    int next;
};




To indicate the end of a list, we can set the ‘next’ value of its last edge to -1. That way, by starting from a head edge and repeatedly jumping to the edge indicated by the previous edge’s next value until a -1 is found, we can traverse all the outgoing edges of a particular node.

Finally, don’t forget to use a head array to store the starting edges of all the linked lists.

Time for some code. For this input data.



#include <cstdio>
#include <cstring>
#define maxn 10005
#define maxm 100005

using namespace std;
struct Edge{int to, w, next;};
int n, m, cnt; //cnt: number of edges stored so far
Edge arr[maxm]; //the edge warehouse
int head[maxn]; //head[x]: index of the first edge going out of node x

int main(){
    scanf("%d%d", &n, &m);

    //Initialize head array with our "End of List" indicator: -1
    memset(head, -1, sizeof(head));

    for(int i = 1; i <= m; i++){
        int x,y,z; scanf("%d%d%d", &x, &y, &z);

        //Add the edge to the edge warehouse, linking it to the front of x's list
        arr[cnt].to = y; arr[cnt].w = z;
        arr[cnt].next = head[x]; head[x] = cnt++;
    }
    return 0;
}




Notice that the above code is just a basic version of method 3. Contestants sometimes add "plugins" to it to make it suit different graph problems. If you find this last method to be unnecessarily obscure, don’t worry. Just leave it for now and come back when you actually need it for some problems.

### Summary

The three quick graph construction techniques we talked about apply to different problems. Adjacency matrices are super convenient for small graphs, the vector-based adjacency list is less convenient but can handle most graph problems, and the primitive-array adjacency list is the most troublesome but can be critical in some advanced graph algorithms.

## Getting Used to Contest Style

### Introduction

No one could resist the impulse to say WTF when first seeing the code of a programming contest problem solution. Contest style code seems to ignore all the code-writing protocols — self-explanatory variable names, well-defined functions, clear comments, and avoidance of global variables. To give you a taste of it, you might see something like this.



#include <bits/stdc++.h>
using namespace std;
const int M = 1e9;
#define maxs 4096
int m, n, f[13][maxs], F[13], field[13][13];
bool state[maxs];
int main()
{
scanf("%d%d", &m, &n);
for (int i = 1; i <= m; i++)
for (int j = 1; j <= n; j++)
cin >> field[i][j];
for (int i = 1; i <= m; i++)
for (int j = 1; j <= n; j++)
F[i] = (F[i] << 1) + field[i][j];
int MAXSTATE = 1 << n;
for (int i = 0; i < MAXSTATE; i++)
state[i] = ((i&(i<<1))==0) && ((i&(i>>1))==0);
f[0][0] = 1;
for (int i = 1; i <= m; i++)
for (int j = 0; j < MAXSTATE; j++)
if (state[j] && ((j & F[i]) == j))
for (int k = 0; k < MAXSTATE; k++)
if ((k & j) == 0)
f[i][j] = (f[i][j] + f[i-1][k]) % M;
int ans = 0;
for (int i = 0; i < MAXSTATE; i++)
ans += f[m][i], ans %= M;
cout << ans << endl;
return 0;
}



One of the first things I found challenging in competitive programming was this grotesque style of coding, which is the product of participants' insatiable desire to code faster. Even if you are well-versed in C++, it can be torture trying to read the solutions written by experienced contestants. Just figuring out the roles of the variables takes forever, let alone the underlying algorithms.

Perhaps this uncanny coding style is the culprit that scared many people away from competitive programming. Because if you cannot read the code, you will have a hard time understanding the solutions. If you don't fully understand the solutions, it's hard for you to get better. Therefore, in this article, I will unveil the arcane coding style of competitive programming and clear the first obstacle on your way to enjoying programming contests.

Before you continue reading, note that the following contents are in C++. It is the most widely used language in contests like ICPC and OI because it is speedy yet far less cumbersome to write than C. Except for some unique problems — like those which require big integer operations — C++ is almost always your best choice.

### 1. Global Variables — they make your life a lot easier.

The first thing to notice in contest style code is the extensive use of global variables. It is a standard move to declare all the variables you could think of at the beginning of your code. For example:



#include <bits/stdc++.h>
using namespace std;

int x,y,z,a;
bool t,s;

int main(){...}



To understand why people do this in contests, let’s first recall why we hate global variables in the first place. It’s because they can cause a problem called shadowing, which makes our programs error-prone. For example:


#include <iostream>
#include <string>
using namespace std;

string a = "World";
void func(){
    //Local variable shadows the global one
    string a = "Bam! You are shadowed by me!";

    cout << "Hello " << a;
    /*Prints out Hello Bam! You are shadowed by me!*/
}



Avoiding global variables during the development of large projects with thousands of lines of code and potentially years of future maintenance is indeed worthwhile. But things are different during contests. In most cases, a solution to a particular problem won’t exceed 100 lines of code. As long as you keep an eye on your variable names, global variables won’t cause any problems at all.

#### 1.1 The advantages of using global variables.

First of all, since everything is global, you don’t need to pass values among functions anymore. That means there is no need to worry about pointers, references, or reasoning about the scope of variables. Just declare every function as void, and you are good to go. Remember, during contests, we want to spend every minute on thinking, not typing.


#include <iostream>
using namespace std;

int n, m, ans, arr[100005];

void algorithmA(){
    /*Something*/
}

void algorithmB(){
    /*Something*/
}

int main(){
    algorithmA();
    algorithmB();
    cout << ans;
    return 0;
}



Secondly, global variables get a free initialization — 0 for numbers and false for booleans. We know that when we declare local variables without initialization, they store random garbage values (not necessarily 0). If we need base-case values for algorithms like dynamic programming, we have to initialize the arrays with for-loops or the memset function. However, if you declare the variables as global, the initialization is already done for you.

#### 1.2 The naming of global variables

Everyone has their habits in terms of naming global variables, but there is one universal principle: names should be as short as possible. Here are some popular names for different kinds of global variables. Notice that it’s totally OK for you to develop your own naming conventions, but familiarizing yourself with these names can make it much easier for you to read solutions written by others.


/*Common names for input data size*/
int n, m, k;

/*Common names for input data arrays*/
int a[N];
int mat[N][N];
int arr[N];

/*Prefix sum array or other arrays related to sums*/
int s[N];

/*Common names for dynamic programming arrays*/
int f[N];
int dp[N];

/*Starting node for graph problems*/
int s;

/*Common names for the answer*/
int res;
int ans;

/*Counter variables*/
int cnt;
int c;
int tot;

/*Temporary variables*/
int t;
int tmp;



### 2. Primitive arrays — they can represent everything.

We use primitive arrays and some basic STL data structures to store everything. You might be thinking: what if my array size is variable? Luckily, every contest problem specifies the upper bound of the data size and uses the first line of input to tell you how many lines of data are in a particular test case. You can just declare arrays with the upper bounds as their sizes. Contest judges want problems to be about algorithms, not I/O. As an example, suppose a problem gives you an integer n (with n <= 100) followed by n numbers.

Here is how you can store the data in this problem.


#include <bits/stdc++.h>
using namespace std;
int a[105]; /*Since we know n <= 100, 105 should be enough*/
int n;      /*n is used to store input size*/
int main(){
    cin >> n;
    for(int i = 1; i <= n; i++) cin >> a[i];
    /*Solve the problem*/
    /*...*/
    return 0;
}



For now, just remember you have to befriend primitive arrays. In future posts, I will introduce how you can represent more complicated data structures with primitive arrays.

### 3. Structs — all you need for OOP.

Contestants almost never define full-fledged classes to store data. It’s like using a supercomputer to do arithmetic. Forget good old object-oriented programming rules here, because ain’t nobody got time for that. When there is a need to encapsulate data, we turn to super lightweight structs. Just specify the data they store and you are good to go. Here is an example of using a struct to store edges in a directed graph.


struct Edge{
int to, w;  /*The node it points to and the edge weight*/
};



If you want to create an array of edges, just tail the struct definition with an array declaration. If this statement is global, all the member variables are initialized to default values.


struct Edge{
    int to, w;  /*The node it points to and the edge weight*/
}arr[100005]; // Create an array of Edge called arr.



You don’t even need constructors. Here is an example of pushing an Edge to a vector. Notice the statement inside the push_back function creates an Edge object, and the order of data has to match the ones in your struct declaration.


/*Push back an Edge with weight 2 pointing to node 4 to vector 'vec'. */
vec.push_back((Edge){4,2});



### 4. Macros — get rid of meaningless numbers in your code.

In contest problems, there are always some constants. The most frequent ones are the upper bound of data size. Another common one would be infinity. Macros can save you a lot of time when these constants are used very often.


#define N 10005 //Max node count
#define M 200005 //Max edge count
#define INF 1999999999 //Infinity

int arr[M], f[N], s[N];



### 5. Cstdio — it can save your life sometimes.

When it comes to scanning input data, some people prefer cin/cout. I used to be one of them until I ran into problem 919C on Codeforces. Admittedly, cin/cout are pretty smart and user-friendly, but they will get you TLE (Time Limit Exceeded) on some I/O-heavy test cases. Therefore, it’s always good to familiarize yourself with cstdio.


#include <cstdio>



There are many functions in cstdio, but all you need to know are scanf and printf. Look up their reference pages for detailed usage clarifications and examples.

### Summary

Up to this point, we have discussed some basic routines when it comes to writing fast contest code. If you internalize all these habits, it should get you started on reading other people’s solutions quickly. There are, of course, many other techniques. But those are hard to remember without understanding the related algorithms first. I will introduce them in future posts.

## Intuitions behind Logistic Regression

Widely used in machine learning algorithms, the logistic function is often described as a function “with nice properties.” It is smooth and at the same time provides a threshold structure that is useful for problems like binary classification.

Take a look at this guy! $f(x) = \frac{1}{1 + e^{-x}}$

In what situation might this function be useful? Well, imagine that you are a jury member who has to decide whether a person is guilty or not guilty. To reach a conclusion, you must take into account all the evidence and testimonies. After hours of hard thinking, there emerges a “feeling” in your mind, sort of like a measurement of the defendant’s level of innocence. This measurement involves a lot of variables; for example, the suspect’s fingerprint at the crime scene would significantly decrease the innocence level. Finally, if this measurement passes a certain threshold, your decision is guilty; otherwise, not guilty.

The logistic function is such a mechanism. Given an input x, it can decide whether the input falls in class A or class B according to its value.

This is a very useful property for machine learning algorithms. The function can become more and more accurate at classification after training on data to tweak the threshold and change how the input x is derived. Now, let’s not simply accept that the logistic function is just a magical tool with a weird name. Let’s go a step further and explore how it is derived, which is actually pretty straightforward.