Queuing theory is a branch of applied mathematics that deals with the study of queues or waiting lines. Queues are a common phenomenon in everyday life, and we encounter them in various situations, such as waiting for a bus, standing in a line at a grocery store, or waiting for a web page to load. In computer systems, queuing theory helps in analyzing the behavior of computer systems under different load conditions.
A queuing system consists of three basic components: the arriving customers, the queue they wait in, and the service facility.
In a queuing system, customers arrive randomly, and they join the queue if the service facility is busy. When a customer arrives at the service facility, the service facility serves the customer, and the customer leaves the system. The queuing system’s performance can be measured using various metrics, such as the average waiting time, the average queue length, the utilization of the service facility, and the throughput.
Kendall’s notation is the standard shorthand for describing queuing systems. In the form used here, it has four fields: the first describes the arrival process, the second the service process, the third the number of servers, and the fourth the queue discipline.
The following table summarizes the meaning of each letter in Kendall’s notation:
Letter | Meaning |
---|---|
$A$ | Arrival process |
$M$ | Markovian (Poisson arrivals or exponential service times) |
$D$ | Deterministic (constant inter-arrival or service times) |
$G$ | General (arbitrary) distribution |
$C$ | Number of servers, e.g., $M/D/C$ |
$F$ | Queue discipline |
The following are some essential formulas used in queuing theory:
Little’s law states that the average number of customers in a queuing system is equal to the product of the average arrival rate and the average time a customer spends in the system. Mathematically, Little’s law can be expressed as follows:
\[L = \lambda W\]where $L$ is the average number of customers in the system, $λ$ is the arrival rate, and $W$ is the average time a customer spends in the system.
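As a quick sanity check of Little’s law, the sketch below plugs in made-up numbers. The function name and the web-server figures are assumptions chosen purely for illustration.

```python
def average_in_system(arrival_rate: float, time_in_system: float) -> float:
    """Little's law: L = lambda * W."""
    return arrival_rate * time_in_system


# Hypothetical web server: 20 requests/second, each spending 0.25 s in the system.
L = average_in_system(arrival_rate=20.0, time_in_system=0.25)
print(L)  # 5.0 requests in the system on average
```

The law makes no assumption about the arrival or service distributions, which is why a single multiplication is enough here.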
Erlang’s formula is used to calculate the probability of a customer having to wait in the queue before receiving service. It assumes that the arrival process is Poisson, the service process is Markovian, and there are $C$ identical servers. Mathematically, the steady-state probability of finding $n \leq C$ customers in the system, on which the formula builds, can be expressed as follows:
\[P_n = \frac{\frac{(\lambda/\mu)^n}{n!}}{\sum_{i=0}^{C-1} \frac{(\lambda/\mu)^i}{i!} + \frac{(\lambda/\mu)^C}{C!(1-\rho)}}\]where $P_n$ is the probability of $n$ customers in the system, $λ$ is the arrival rate, $μ$ is the service rate, $C$ is the number of servers, and $ρ$ is the utilization of the service facility.
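The formula above can be evaluated directly. The following is a minimal sketch, assuming an M/M/c queue with $ρ = λ/(Cμ) < 1$. The function names are hypothetical, and the waiting probability is obtained from the last term of the denominator (the Erlang C form).

```python
from math import factorial


def mmc_probabilities(lam: float, mu: float, c: int):
    """Return (p0, p_wait) for an M/M/c queue; requires lam < c * mu."""
    a = lam / mu              # offered load, lambda / mu
    rho = a / c               # utilization per server
    if rho >= 1:
        raise ValueError("unstable queue: need lam < c * mu")
    tail = a**c / (factorial(c) * (1 - rho))
    p0 = 1.0 / (sum(a**i / factorial(i) for i in range(c)) + tail)
    p_wait = tail * p0        # probability that an arrival has to wait
    return p0, p_wait


def p_n(lam: float, mu: float, c: int, n: int) -> float:
    """P_n for 0 <= n <= c, the range where the formula above applies."""
    if not 0 <= n <= c:
        raise ValueError("n must satisfy 0 <= n <= c")
    p0, _ = mmc_probabilities(lam, mu, c)
    return (lam / mu) ** n / factorial(n) * p0
```

With a single server the sketch reduces to the familiar M/M/1 results: the system is empty with probability $1 - ρ$ and an arrival waits with probability $ρ$.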
Kendall’s formulas are used to calculate the performance measures of queuing systems. The following are the most commonly used Kendall’s formulas:
where $L_q$ is the average queue length, $ρ$ is the utilization of the service facility, and $C$ is the number of servers.
\[W_q = \frac{L_q}{\lambda}\]where $W_q$ is the average waiting time, $L_q$ is the average queue length, and $λ$ is the arrival rate.
\[\rho = \frac{\lambda}{C\mu}\]where $ρ$ is the utilization of the service facility, $λ$ is the arrival rate, $μ$ is the service rate, and $C$ is the number of servers.
where $X$ is the throughput, $λ$ is the arrival rate, and $P_n$ is the probability of $n$ customers in the system.
Queuing theory has many practical applications in various fields. In telecommunications, queuing theory is used to optimize the performance of call centers and customer service systems. By using queuing theory, call centers can determine the optimal number of agents needed to handle customer demand and minimize waiting times. In the healthcare sector, queuing theory is used to manage patient flows and optimize hospital resources. By understanding the behavior of queues, hospitals can reduce waiting times, improve patient satisfaction, and increase the efficiency of their operations.
In manufacturing, queuing theory is used to optimize production lines by minimizing queue lengths and waiting times. By analyzing the arrival and service rates of a production line, queuing theory can help manufacturers determine the optimal number of servers (e.g., machines) needed to meet customer demand and reduce waiting times. Queuing theory is also used in inventory management to determine the optimal inventory level that minimizes costs while meeting customer demand.
Queuing theory is also useful in traffic engineering, where it is used to optimize the performance of traffic systems. By understanding the behavior of queues, traffic engineers can design traffic systems that minimize congestion and waiting times. For example, queuing theory can help traffic engineers determine the optimal number of traffic lanes needed to handle traffic flows during peak hours and minimize waiting times.
Queuing theory has been used in the entertainment industry to predict demand for rides and attractions at theme parks. By analyzing the arrival rate and service rate, queuing theory can help theme parks determine the optimal number of employees needed to reduce queue lengths and waiting times. Queuing theory has also been applied in the aviation industry to optimize the allocation of gates and reduce waiting times at airports.
While queuing theory is a powerful tool for understanding waiting lines, it has some limitations. To make theoretical analysis feasible, it often relies on strong assumptions about the input distributions, for example that customers arrive randomly and independently of each other, which may not hold in real-world scenarios. For instance, customers may arrive in groups, and their arrival may depend on external factors like weather, time of day, or season. Additionally, many models assume that the service process is independent of the arrival process, which may not always hold true. In some cases, the arrival of customers may depend on the state of the queue or the number of customers present. Finally, many models assume that the queue discipline is fixed, which may not be the case in real-world scenarios where priorities change. For example, in a hospital, the priority of patients may change based on their medical condition.
Queuing theory is a dynamic field that continues to evolve as new applications and challenges arise. With the advent of big data and machine learning, queuing theory is now being used to optimize complex systems that were previously difficult to model. For example, queuing theory is being used to optimize cloud computing systems, where the arrival rate and service rate can vary significantly depending on the workload. Queuing theory is also being used to optimize supply chain management, where the arrival rate and service rate can vary depending on the demand for goods and services.
Another area of interest for future research in queuing theory is the study of the impact of social distancing measures on queues. The COVID-19 pandemic has brought about significant changes in the way we wait in lines. Queuing theory can be used to study the effectiveness of social distancing measures in reducing queue lengths and waiting times.
Furthermore, queuing theory is being applied to study the impact of customer behavior on queuing systems. It is being used to analyze the effect of customer impatience, balking, and jockeying on queuing systems. By understanding the behavior of customers, queuing theory can help organizations design better queue management systems that cater to the needs and preferences of their customers.
This post gives a brief overview of queuing theory, a mathematical framework for understanding waiting lines and improving their performance. By applying queuing theory to real-world problems, we can optimize the performance of queues in various applications, from customer service systems to traffic systems. Queuing theory provides a quantitative framework for evaluating the performance of queues and improving their efficiency, enhancing the experience of customers and users. While queuing theory has its limitations, it remains an essential tool for analyzing and optimizing waiting lines in a variety of fields. As the world becomes more complex, queuing theory will continue to play a critical role in optimizing systems and improving the efficiency of operations.
In addition to the above, queuing theory is also being increasingly applied in the field of e-commerce, where it is used to optimize online shopping experiences. With the exponential growth of online shopping, queuing theory is being used to reduce the waiting times for customers during peak periods, such as Black Friday and Cyber Monday. By analyzing the behavior of online shoppers, queuing theory can help retailers to design better queuing systems that can accommodate the surge in demand during such peak periods.
Another area in which queuing theory is being applied is in the field of public transportation. Queuing theory is used to optimize public transportation systems by minimizing waiting times at bus stops and train stations. By analyzing the arrival and service rates of public transportation systems, queuing theory can help transit agencies to determine the optimal number of buses or trains needed to meet demand and reduce waiting times for passengers.
Queuing theory is also being used to improve the performance of online streaming services. By analyzing the arrival and service rates of streaming services, queuing theory can help streaming providers to determine the optimal number of servers needed to handle the demand for streaming content and reduce buffering times for users.
In conclusion, queuing theory is a powerful tool that has numerous applications in various fields. As the world becomes more complex, the need for efficient queue management systems becomes even more paramount. By understanding the behavior of queues, organizations can optimize their operations, reduce waiting times, and enhance the experience of their customers and users. Queuing theory will continue to play a critical role in optimizing systems and improving the efficiency of operations in various fields.
Moreover, the application of queuing theory has expanded to the field of public health. During the COVID-19 pandemic, queuing theory has been used to model the spread of the virus and to predict the impact of social distancing measures on the spread of the virus. Queuing theory has been used to model the behavior of the virus in different populations and to predict the effectiveness of various interventions, such as lockdowns and vaccination programs. By understanding the behavior of the virus and the effectiveness of interventions, queuing theory can help public health officials to design better policies and strategies to combat the spread of the virus.
In addition to the above-mentioned applications, queuing theory is also being used in the field of finance. Queuing theory is used to optimize the performance of financial markets by minimizing waiting times and reducing transaction costs. By analyzing the arrival and service rates of financial markets, queuing theory can help investors to determine the optimal timing and size of their trades and to minimize their losses due to transaction costs.
Another area of interest for future research in queuing theory is the study of the impact of emerging technologies on queue management systems. With the rapid pace of technological innovation, queuing theory can help organizations to design better queue management systems that can accommodate new technologies such as artificial intelligence, robotics, and the internet of things. By understanding the behavior of queues in the context of emerging technologies, queuing theory can help organizations to optimize their operations and improve the efficiency of their systems.
The survival function is a fundamental concept in survival analysis. It describes the probability that an individual survives beyond a certain time. Mathematically, the survival function is defined as:
\[S(t) = P(T > t)\]where $T$ is the random variable representing the time until the event of interest occurs, and $t$ is a specific time point. The survival function can also be interpreted as the proportion of individuals who survive beyond time $t$.
The hazard function is another important concept in survival analysis. It describes the instantaneous rate of occurrence of the event of interest at time $t$, given that the individual has survived up to time $t$. The hazard function is defined as:
\[h(t) = \lim_{\Delta t \rightarrow 0} \frac{P(t \leq T < t + \Delta t | T \geq t)}{\Delta t}\]The hazard is a rate rather than a probability: for a small interval $\Delta t$, $h(t)\,\Delta t$ approximates the probability that the event of interest occurs in $[t, t + \Delta t)$, given that the individual has survived up to time $t$.
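As a worked connection between the two functions (assuming $T$ is continuous with density $f(t)$ and $S(0) = 1$, which the definitions above do not state explicitly), the hazard can be rewritten in terms of the survival function:

\[h(t) = \frac{f(t)}{S(t)} = -\frac{d}{dt}\log S(t) \quad\Longrightarrow\quad S(t) = \exp\left(-\int_0^t h(u)\,du\right)\]

For example, a constant hazard $h(t) = \lambda$ gives the exponential survival curve $S(t) = e^{-\lambda t}$.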
The Kaplan-Meier estimator is a non-parametric method used to estimate the survival function. It is particularly useful when the distribution of survival times is unknown or non-normal. The estimator is based on the observed survival times of a sample of individuals. Let $t_1, t_2, …, t_n$ be the observed survival times, and let $d_1, d_2, …, d_n$ be the corresponding number of events (i.e., deaths) at each time point. The Kaplan-Meier estimator is defined as:
\[\hat{S}(t) = \prod_{i:t_i \leq t} \left( 1 - \frac{d_i}{n_i} \right)\]where $n_i$ is the number of individuals at risk at time $t_i$. The estimator can be interpreted as the product of the conditional probabilities of surviving each event time up to $t$. The denominator $n_i$ is the number of individuals still at risk just before $t_i$, i.e., the total sample size minus the number of events and censored observations before that time point.
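The product formula can be sketched in a few lines of plain Python. This is an illustrative implementation rather than a replacement for a statistics library; the function name and the toy data are assumptions, and censoring is encoded with a 0/1 event flag.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(t_i, S_hat(t_i))] at each distinct event time.

    times  -- observed times (event or censoring)
    events -- 1 if the event was observed at that time, 0 if censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e == 1)
    survival, curve = 1.0, []
    for t in sorted(deaths):
        n_at_risk = sum(1 for u in times if u >= t)   # n_i
        survival *= 1.0 - deaths[t] / n_at_risk       # factor (1 - d_i / n_i)
        curve.append((t, survival))
    return curve


# Toy data (made up): five subjects, the one observed at t=3 is censored.
curve = kaplan_meier([1, 2, 3, 4, 5], [1, 1, 0, 1, 1])
```

Note how the censored subject never triggers a step in the curve but still counts in the risk sets for the event times 1 and 2.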
The log-rank test is a statistical test used to compare the survival distributions of two or more groups. The null hypothesis is that the survival distributions are the same across all groups. The test is based on the difference between the observed and expected number of events in each group, assuming that the null hypothesis is true. The test statistic is given by:
\[Z = \frac{(O_1 - E_1)^2}{V_1} + \frac{(O_2 - E_2)^2}{V_2} + ... + \frac{(O_k - E_k)^2}{V_k}\]where $O_i$ is the observed number of events in group $i$, $E_i$ is the expected number of events in group $i$ under the null hypothesis, and $V_i$ is the variance of the number of events in group $i$ under the null hypothesis. The test statistic follows a chi-squared distribution with $k-1$ degrees of freedom, where $k$ is the number of groups being compared.
The Cox proportional hazards model is a widely used semi-parametric method in survival analysis. It is used to model the relationship between covariates (i.e., explanatory variables) and the hazard function, while assuming that the hazard ratios are constant over time. The hazard ratio represents the relative risk of the event of interest for individuals with a certain value of the covariate, compared to individuals with a reference value of the covariate. The Cox model assumes that the hazard function is given by:
\[h(t, \boldsymbol{X}) = h_0(t) \exp(\beta_1 X_1 + \beta_2 X_2 + ... + \beta_p X_p)\]where $h_0(t)$ is the baseline hazard function (i.e., the hazard function for individuals with a reference value of all covariates), $\boldsymbol{X}$ is a vector of $p$ covariates, and $\beta_1, \beta_2, …, \beta_p$ are the corresponding regression coefficients. The Cox model does not require any assumptions about the shape of the baseline hazard function, making it more flexible than parametric models.
The partial likelihood function for the Cox model (assuming no tied event times) can be written as:
\[L(\beta_1, \beta_2, ..., \beta_p) = \prod_{i=1}^n \left( \frac{\exp(\beta_1 X_{i1} + \beta_2 X_{i2} + ... + \beta_p X_{ip})}{\sum_{j \in R_i} \exp(\beta_1 X_{j1} + \beta_2 X_{j2} + ... + \beta_p X_{jp})} \right)^{\delta_i}\]where $X_{i1}, X_{i2}, …, X_{ip}$ are the covariate values for individual $i$, $\delta_i$ is the event indicator for individual $i$ (i.e., $\delta_i=1$ if the event of interest occurs for individual $i$, and $\delta_i=0$ otherwise), and $R_i$ is the set of individuals at risk at the time of event for individual $i$. The Cox model estimates the regression coefficients that maximize the likelihood function.
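To make the risk sets $R_i$ concrete, here is a minimal sketch that evaluates the logarithm of the expression above. It assumes no tied event times and uses hypothetical argument names; in practice the coefficients would be found by a numerical optimizer, which is not shown.

```python
from math import exp, log


def cox_partial_log_likelihood(beta, times, events, covariates):
    """Log of the Cox partial likelihood (no tied event times assumed).

    beta       -- list of p coefficients
    covariates -- list of rows, each a list of p covariate values
    """
    scores = [sum(b * x for b, x in zip(beta, row)) for row in covariates]
    total = 0.0
    for i, (t_i, delta_i) in enumerate(zip(times, events)):
        if delta_i == 0:
            continue  # censored subjects contribute only through risk sets
        # R_i: everyone still at risk at the event time of subject i
        risk = sum(exp(scores[j]) for j, t_j in enumerate(times) if t_j >= t_i)
        total += scores[i] - log(risk)
    return total
```

With all coefficients set to zero, each event contributes $-\log |R_i|$, which gives an easy check of the bookkeeping.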
Survival analysis is a powerful tool for analyzing time-to-event data. The survival function and hazard function are important concepts in survival analysis, and the Kaplan-Meier estimator is a useful non-parametric method for estimating the survival function. The log-rank test and Cox proportional hazards model are commonly used statistical methods for comparing survival distributions and modeling the relationship between covariates and the hazard function, respectively. By understanding these concepts and methods, researchers can gain valuable insights into the time-to-event outcomes in their data.
- `/proc` pseudo file system.
- `/proc/[pid]`, where the `pid` is the unique numerical ID for each process.
- The “`pid`” is not just a process ID but could be an ID for either a thread or a process.
- `htop` shows both processes and threads, and doesn’t distinguish them by default.
- `pid`, potentially sharing some resources like virtual memory and file descriptors.

In 2001, Linux 2.4 introduced “thread groups”, which gave rise to threads within a process. From the clone(2) man page:
Thread groups were a feature added in Linux 2.4 to support the POSIX threads notion of a set of threads that share a single PID. Internally, this shared PID is the so-called thread group identifier (TGID) for the thread group. Since Linux 2.4, calls to getpid(2) return the TGID of the caller.
- The values returned by `getpid()` are the same for all of them.
- The values returned by `gettid()` are always unique.
- `/proc/[pid]/task/[tid]` subdirectories where `tid` is the kernel thread ID.
- `tid`s.
- `/proc/[pid]/task/[tid]` shares the same content as `/proc/[pid]/` if `pid==tid`, i.e., it contains the same information describing the same process/thread.
- `/proc/[pid]` directory of a multithreaded process:
- `task/[tid]` subdirectories are all the threads within the same thread group.
- `pid`.
- `task/` directory).
- `fork()` syscall, a thread is created by e.g., `pthread_create()` in C. (Under the hood, they all use the syscall `clone()` but with different parameters)
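The PID/TID distinction can also be observed from a script. Below is a minimal sketch, assuming Python 3.8+ on Linux, where `threading.get_native_id()` reports the kernel thread ID; the helper name `collect_ids` is made up for illustration.

```python
import os
import threading


def collect_ids(num_threads: int = 4):
    """Start a few threads and record (pid, native thread id) seen by each."""
    seen = []
    lock = threading.Lock()
    barrier = threading.Barrier(num_threads)

    def worker():
        barrier.wait()  # keep all threads alive at once so IDs are not reused
        with lock:
            seen.append((os.getpid(), threading.get_native_id()))

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen


ids = collect_ids()
print({pid for pid, _ in ids})         # a single PID (the TGID) for all threads
print(sorted(tid for _, tid in ids))   # one distinct thread ID per thread
```

Every worker reports the same `os.getpid()` value, while the native IDs differ, mirroring the `getpid()`/`gettid()` behaviour described in the man page excerpt.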
`stress-ng` on Linux.

We can run the stress test for memory accesses with the following command, which will spawn a stressor that runs with 5 threads reading and writing to two different mappings of the same underlying physical page.
hy@node-0:~$ stress-ng --mcontend 1 -t 10h
stress-ng: info: [56472] dispatching hogs: 1 mcontend
With `htop`, we can see the process and the threads therein as a hierarchy (the `PGRP` is the process group ID, PGID, and the `PID` is the ID for threads/processes).

`PID` (process/thread ID).

- `56472`: The single-threaded parent process spawned from the bash command.
- `56473`: The multithreaded child process (and the main thread) spawned from the parent `56472`.
- `56473-56477`: The 5 sibling threads created by the main thread `56473`.

Then, by using `pidof`, we get the `pid` of the main thread.
hy@node-0:~$ pidof stress-ng-mcontend
56473
Since the bash command `56472` is the parent process that spawned the child process `56473`, we can examine this relationship by checking:
hy@node-0:~$ cat /proc/56472/task/56472/children
56473
Then, navigate to the `/proc` directory and check the `/proc/[pid]/task/` subdirectories. We get the 5 threads within this process:
hy@node-0:~$ ll /proc/56473/task/
total 0
dr-xr-xr-x 7 hy hy 0 Dec 31 22:21 ./
dr-xr-xr-x 9 hy hy 0 Dec 31 22:21 ../
dr-xr-xr-x 7 hy hy 0 Dec 31 22:21 56473/
dr-xr-xr-x 7 hy hy 0 Dec 31 22:21 56474/
dr-xr-xr-x 7 hy hy 0 Dec 31 22:21 56475/
dr-xr-xr-x 7 hy hy 0 Dec 31 22:21 56476/
dr-xr-xr-x 7 hy hy 0 Dec 31 22:21 56477/
Examining the `task/` directory of a sibling thread gives the same output as above, since all of them belong to the same thread group.
hy@node-0:~$ ll /proc/56476/task/
total 0
dr-xr-xr-x 7 hy hy 0 Dec 31 22:21 ./
dr-xr-xr-x 9 hy hy 0 Dec 31 22:21 ../
dr-xr-xr-x 7 hy hy 0 Dec 31 22:21 56473/
dr-xr-xr-x 7 hy hy 0 Dec 31 22:21 56474/
dr-xr-xr-x 7 hy hy 0 Dec 31 22:21 56475/
dr-xr-xr-x 7 hy hy 0 Dec 31 22:21 56476/
dr-xr-xr-x 7 hy hy 0 Dec 31 22:21 56477/
Note that the parent process `56472` is a single-threaded process, so its `/task` directory contains only itself.
hy@node-0:~$ ll /proc/56472/task/
total 0
dr-xr-xr-x 7 hy hy 0 Dec 31 22:21 ./
dr-xr-xr-x 9 hy hy 0 Dec 31 22:21 ../
dr-xr-xr-x 7 hy hy 0 Dec 31 22:21 56472/
`htop` aggregates all the resource usage into the main thread `56473` by default.

We can check this with `ps` as well. Like `htop`, it aggregates the usage of all threads into the main thread by default:
hy@node-0:~$ ps -p 56473 -o %cpu,%mem,cmd
%CPU %MEM CMD
473 0.0 stress-ng-mcontend
It only works on the main thread but not with the siblings:
hy@node-0:~$ ps -p 56476 -o %cpu,%mem,cmd
%CPU %MEM CMD
To see detailed thread-level information, we can use the `-L` flag on the main thread:
hy@node-0:~$ ps -L 56473 -o %cpu,%mem,cmd
%CPU %MEM CMD
97.3 0.0 stress-ng-mcontend
94.0 0.0 stress-ng-mcontend
94.0 0.0 stress-ng-mcontend
94.0 0.0 stress-ng-mcontend
94.0 0.0 stress-ng-mcontend
With the `-F` option, we can obtain the full glory:
hy@node-0:~$ ps -L 56473 -F
UID PID PPID LWP C NLWP SZ RSS PSR STIME TTY STAT TIME CMD
hy 56473 56472 56473 97 5 22792 2604 13 08:10 pts/2 RLl+ 302:30 stress-ng-mcontend
hy 56473 56472 56474 94 5 22792 2604 7 08:10 pts/2 RLl+ 292:03 stress-ng-mcontend
hy 56473 56472 56475 94 5 22792 2604 31 08:10 pts/2 RLl+ 292:00 stress-ng-mcontend
hy 56473 56472 56476 94 5 22792 2604 15 08:10 pts/2 RLl+ 291:59 stress-ng-mcontend
hy 56473 56472 56477 94 5 22792 2604 0 08:10 pts/2 RLl+ 292:05 stress-ng-mcontend
Note that ALL threads share the same PID but each of them has a unique TID (`LWP`).

We can monitor those threads using `top` with `-H`:
hy@node-0:/proc$ top -H -p 56476
....
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
56473 hy 20 0 91168 2708 2272 R 97.3 0.0 127:24.91 stress-ng-mcont
56474 hy 20 0 91168 2708 2272 R 94.0 0.0 122:55.16 stress-ng-mcont
56475 hy 20 0 91168 2708 2272 R 94.0 0.0 122:54.44 stress-ng-mcont
56476 hy 20 0 91168 2708 2272 R 93.7 0.0 122:55.33 stress-ng-mcont
56477 hy 20 0 91168 2708 2272 R 92.3 0.0 122:56.57 stress-ng-mcont
Without the `-H` flag, however, it aggregates all usages to any of the sibling threads:
hy@node-0:~$ top -p 56476
....
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
56473 hy 20 0 91168 2708 2272 R 476.3 0.0 621:39.36 stress-ng-mcont
Also, we can get a single number out:
# Get the CPU utilization of thread 56476.
# * Aggregated.
hy@node-0:~$ top -b -n 2 -d 0.2 -p 56476 | tail -1 | awk '{print $9}'
465.0
# * Per-thread with -H.
hy@node-0:~$ top -b -H -n 2 -d 0.2 -p 56476 | tail -1 | awk '{print $9}'
75.0
`/proc` pseudo file system (procfs).

The file `/proc/pid/stat` of a thread (whose `TID==pid`) contains the information aggregated from all threads. The file `/proc/pid/task/pid/stat` contains per-thread information:
# * Get total cpu time (user and kernel) of all threads belonging to the same TG as that of 56476.
hy@node-0:~$ cat /proc/56476/stat | awk '{print $14, $15}'
9460932 12361
# * Get the cpu time for only thread 56476.
hy@node-0:~$ cat /proc/56476/task/56476/stat | awk '{print $14, $15}'
1879429 3032
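The two numbers printed above are fields 14 (`utime`) and 15 (`stime`) of the stat line, measured in clock ticks. A small Python sketch of the same lookup, with hypothetical helper names, could look like this. It is Linux-only and assumes the field layout documented in proc(5).

```python
import os


def cpu_ticks_from_stat(stat_line: str):
    """Return (utime, stime) in clock ticks from a /proc/.../stat line."""
    # Field 2 (comm) is wrapped in parentheses and may itself contain spaces,
    # so split after the last ')' instead of splitting the whole line.
    rest = stat_line.rsplit(")", 1)[1].split()
    # `rest` starts at field 3 (state), so fields 14 and 15 sit at 11 and 12.
    return int(rest[11]), int(rest[12])


def thread_cpu_seconds(pid: int, tid: int):
    """Per-thread (user, system) CPU seconds read from /proc (Linux only)."""
    with open(f"/proc/{pid}/task/{tid}/stat") as f:
        utime, stime = cpu_ticks_from_stat(f.read())
    ticks_per_second = os.sysconf("SC_CLK_TCK")
    return utime / ticks_per_second, stime / ticks_per_second
```

Splitting after the closing parenthesis is a deliberate choice: a plain whitespace split, like the `awk` one-liner, shifts the field numbers when the command name contains a space.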
Alternatively, we can use the mighty Python with `psutil`.
>>> import psutil
# The great grandparent process.
>>> tmux_session = psutil.Process(54711)
# * It's spawned from the mother of all processes of PID=1 -- the init(old distros)/systemd(new distros).
>>> tmux_session.ppid()
1
>>> [(child.name(), child.pid) for child in tmux_session.children(recursive=True)]
[('bash', 54712), ('bash', 56236), ('python', 56613), ('stress-ng', 56472), ('stress-ng-mcontend', 56473)]
# The parent process.
>>> parent = psutil.Process(56472)
# * The parent was spawned from one of the above bash sessions (grandparent)
>>> parent.ppid()
54712
# * The sibling threads are NOT children, and are invisible to the parent process.
>>> parent.children(recursive=True)
[psutil.Process(pid=56473, name='stress-ng-mcontend', status='running', started='11:21:57')]
# * The parent is single-threaded.
>>> parent.num_threads()
1
# The child process spawned from the parent process.
>>> child = psutil.Process(56473)
# * The child created 5 threads (including itself).
>>> child.num_threads()
5
>>> [thread.id for thread in child.threads()]
[56473, 56474, 56475, 56476, 56477]
# One of the sibling threads.
>>> sibling = psutil.Process(56476)
# * The sibling thread inherits the parent process of the main thread.
>>> child.ppid()
56472
>>> sibling.ppid()
56472
# Accounting resources.
# * The parent process is in sleep state (S), so it doesn't take any CPU time.
>>> parent.cpu_percent(interval=1)
0.0
# ! psutil **aggregates** all sibling resources to **any** of the siblings.
>>> sibling.cpu_percent(interval=1)
471.5
>>> child.cpu_percent(interval=1)
472.4
# ! Also, its CPU time accounting for children only covers children that have terminated and been waited for, so the still-running stress-ng does not show up ...
>>> tmux_session.cpu_times()
pcputimes(user=7.46, system=3.19, children_user=102.18, children_system=153.15, iowait=0.0)
>>> parent.cpu_times()
pcputimes(user=0.0, system=0.0, children_user=0.0, children_system=0.0, iowait=0.0)
>>> child.cpu_times()
pcputimes(user=45250.11, system=57.79, children_user=0.0, children_system=0.0, iowait=0.0)
>>> sibling.cpu_times()
pcputimes(user=45255.42, system=57.79, children_user=0.0, children_system=0.0, iowait=0.0)
# * Memory usage is accounted for in the same way as CPU.
>>> parent.memory_full_info()
pfullmem(rss=6475776, vms=59777024, shared=6078464, text=1728512, lib=0, data=32018432, dirty=0, uss=3051520, pss=3749888, swap=0)
>>> child.memory_full_info()
pfullmem(rss=2772992, vms=93356032, shared=2326528, text=1728512, lib=0, data=65581056, dirty=0, uss=126976, pss=735232, swap=0)
>>> sibling.memory_full_info()
pfullmem(rss=2772992, vms=93356032, shared=2326528, text=1728512, lib=0, data=65581056, dirty=0, uss=126976, pss=735232, swap=0)
>>> tmux_session.memory_percent()
0.007239506814671662
>>> parent.memory_percent()
0.009602063988251591
>>> child.memory_percent()
0.004111699759675096
>>> sibling.memory_percent()
0.004111699759675096
`psutil` is a convenient tool for sysadmins when scripting in Python.
And it’s time to end our running example:
# ! "Note this will return True also if the process is a zombie (p.status() == psutil.STATUS_ZOMBIE)"
>>> parent.is_running() == child.is_running() == sibling.is_running() == True
True
>>> import signal
>>> sibling.send_signal(signal.SIGINT)
>>> parent.is_running() == child.is_running() == sibling.is_running() == False
True
(It seems that interrupting one thread has a bottom-up cascading effect in `stress-ng` 🥴 )
Happy New Year 🎆 ~
In many cases, it’s desired to trade off a bit larger bias for a much smaller variance for better estimations of the generalisation error. One way to do so is by adding a regularizer. In the case of linear regression, L1 (Lasso) and L2 (Ridge) regularization are two of the most common ones. Lasso suppresses weights of small magnitudes to zero, making the feature space sparse, whilst Ridge “condenses” all weights to smaller values. Both can restrict the norm of the weights and therefore, mitigate overfitting.
However, regularization is only one way to strike the balance. Another way is to introduce more bias into the equation through the Bayesian lens. Specifically, we can impose prior knowledge by adding a prior distribution to constrain the norm of the learnt parameters. For example, if we know the weights are small and centred, then we can set our prior to be $\vec{w} \sim \mathcal{N}(0, \beta\textbf{I})$. Then, by Bayes' rule, we have:
\[\mathbb{P}(\vec{w} | X, \vec{y}) = \cfrac{\mathbb{P}(\vec{w}, X, \vec{y})}{\mathbb{P}(X, \vec{y})} = \cfrac{\mathbb{P}(\vec{w}, \vec{y} | X) {\mathbb{P}(X)}}{\mathbb{P}(\vec{y} | X) {\mathbb{P}(X)}} = \cfrac{\mathbb{P}(\vec{w}, \vec{y} | X)}{\mathbb{P}(\vec{y} | X)}\]Thus, both regularization and Bayesian modelling can achieve the same goal, which begs the question: are they connected?
The answer to the above question turns out to be yes! To illustrate this further, let’s use two common prior distributions, Laplace and Gaussian, as our running examples.
Firstly, we assume the following general setting for regression: ${y} = X{\theta}$ and $f_X = y + \epsilon$ where $\theta \sim \text{Laplace}(0, s)$ with density $\cfrac{1}{2s}\exp(-|\theta| / s)$ and $\epsilon \sim \mathcal{N}(0, \delta^2_\epsilon)$.
Then, we obtain the maximum a posteriori (MAP) estimate as:
\[\begin{align} \arg\max_\theta\mathbb{P}({\theta} | X, {y}) &= \arg\max_\theta\cfrac{\mathbb{P}(y | X, \theta)\mathbb{P}(X | \theta)\mathbb{P}(\theta)} {\mathbb{P}(X, y)} \nonumber\\ &= \arg\max_\theta\mathbb{P}(y |X, \theta)\mathbb{P}(X| \theta)\mathbb{P}(\theta) \nonumber\\ &= \arg\max_\theta\mathbb{P}(y | X, \theta)\mathbb{P}(\theta) \nonumber\\ &= \arg\max_\theta\mathbb{P}(\theta) \prod^n_i \mathbb{P}(y_i | X_i, \theta) \nonumber\\ &= \arg\min_\theta -\log \mathbb{P}(\theta) - \sum_i^n \log \mathbb{P}_\theta(y_i | X_i) \end{align}\]Here the denominator is dropped because it does not depend on $\theta$, and $\mathbb{P}(X | \theta)$ is dropped because the inputs are assumed independent of the parameters. Next, we can substitute both the likelihood and prior into Eq. (1).
\[\begin{align} \arg\min_\theta -\log\cfrac{1}{2s} \exp\left\{-\cfrac{|\theta|}{s}\right\} - \sum^n_i \log \cfrac{1}{Z} \exp\left\{-\cfrac{1}{2}\left(\cfrac{y_i - f_i}{\delta_\epsilon}\right)^2\right\} \end{align}\]where $Z$ is the Gaussian normalising constant. By simplifying Eq. (2), we obtain the following form:
\[\begin{align} & \arg\min_\theta \cfrac{|\theta|}{s} + \cfrac{1}{2\delta^2_\epsilon} \sum^n_i(y_i - f_i)^2 \\ =& \arg\min_\theta \sum^n_i(y_i - f_i)^2 + \cfrac{2\delta^2_\epsilon}{s}||\theta||_1 \end{align}\]Now, we have recovered the exact form of Lasso, where $\lambda = \cfrac{2\delta^2_\epsilon}{s}$ is the coefficient of the L1 regularizer that controls the strength of the constraint.
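To see the Laplace prior at work, here is a minimal sketch for the simplest possible case: a single weight and no intercept, where the L1-penalised least-squares objective $\sum_i (y_i - x_i\theta)^2 + \lambda|\theta|$ has a closed-form minimiser (soft-thresholding). The function names and the one-dimensional setup are assumptions made for illustration.

```python
def lasso_1d(xs, ys, lam):
    """Minimise sum((y - x*theta)^2) + lam*|theta| for a single weight."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    shrunk = max(abs(sxy) - lam / 2.0, 0.0)      # soft-thresholding
    return (shrunk if sxy >= 0 else -shrunk) / sxx


def lam_from_prior(noise_var, s):
    """lambda = 2 * delta_eps^2 / s, the coefficient derived above."""
    return 2.0 * noise_var / s
```

A narrow prior (small $s$) yields a large $\lambda$, and once $\lambda/2$ exceeds $|\sum_i x_i y_i|$ the weight is set exactly to zero, which is the sparsity effect attributed to Lasso.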
Next, let’s play the same trick in the same setting but with a Gaussian prior instead, i.e., $\theta \sim \mathcal{N}(0, \delta_\theta^2)$.
Starting from Eq. (1), we substitute in the likelihood and prior as above:
\[\begin{align} \arg\max_\theta\mathbb{P}({\theta} | X, {y}) &= \arg\min_\theta -\log \mathbb{P}(\theta) - \sum_i^n \log \mathbb{P}_\theta(y_i | X_i) \nonumber \\ &= \arg\min_\theta -\log \cfrac{1}{Z'} \exp\left\{-\cfrac{1}{2}\left(\cfrac{\theta-0}{\delta_\theta}\right)^2 \right\} \\ & \quad - \sum^n_i \log \cfrac{1}{Z} \exp\left\{-\cfrac{1}{2}\left(\cfrac{y_i - f_i}{\delta_\epsilon}\right)^2\right\} \nonumber \end{align}\]Finally, letting go of all the fluff in Eq. (5), we have:
\[\begin{align} & \arg\min_\theta \cfrac{||\theta||_2^2}{2\delta^2_\theta} + \cfrac{1}{2\delta^2_\epsilon} \sum^n_i(y_i - f_i)^2 \nonumber\\ =& \arg\min_\theta \sum^n_i(y_i - f_i)^2 + \cfrac{\delta^2_\epsilon}{\delta^2_\theta}||\theta||_2^2 \end{align}\]By Eq. (6), we have recovered Ridge regression, where the fraction $\cfrac{\delta_{\epsilon}^2}{\delta_{\theta}^2}$ is the regularization constant $\lambda$.
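The equivalence can be checked numerically in the single-weight case. The sketch below, with hypothetical function names and toy data, computes the Ridge solution with $\lambda = \delta^2_\epsilon / \delta^2_\theta$ and evaluates the derivative of the negative log posterior $\theta^2/(2\delta^2_\theta) + \sum_i (y_i - x_i\theta)^2/(2\delta^2_\epsilon)$ at that point, which should vanish.

```python
def ridge_1d(xs, ys, lam):
    """Closed-form minimiser of sum((y - x*theta)^2) + lam*theta^2."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def neg_log_posterior_grad(theta, xs, ys, noise_var, prior_var):
    """Derivative of theta^2/(2*prior_var) + sum((y - x*theta)^2)/(2*noise_var)."""
    data_term = sum(-x * (y - x * theta) for x, y in zip(xs, ys)) / noise_var
    return theta / prior_var + data_term


# Toy data (made up) with noise variance 0.5 and prior variance 2.0.
xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
theta_map = ridge_1d(xs, ys, lam=0.5 / 2.0)
print(neg_log_posterior_grad(theta_map, xs, ys, noise_var=0.5, prior_var=2.0))
```

The printed gradient is zero up to floating-point error, so the Ridge solution is also the MAP estimate under the Gaussian prior.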
By working out the above two examples, we found that regularised regression is nothing but Bayesian modelling in disguise. In fact, imposing various priors has the same effect as using corresponding regularizers. By the same token, choosing different likelihoods gives us different loss functions. In this post, we used Gaussian likelihood in both examples, and, in turn, recovered the square loss.
There are many other options for prior and likelihood functions. For instance, one can use a Student's t distribution as opposed to a Gaussian. Lastly, choosing a conjugate prior can drastically reduce the cost of Bayesian inference.
The true stories behind the stories of success shall be the same and might not be as glorious.
Well, I’m now under the impression that Computer Science is a pseudoscience :}
How can I live out a life so fully that worries couldn’t sneak in? Perhaps most importantly, what’s my deeper justification and higher pursuit thereof?
I finally understand why I didn’t quite understand the movie 😅
Some of my friends from China saw a bunch of blanks here. This is because YouTube videos can’t pass the firewall 🧱.
This TED talk conveys a message very similar to that of the one below. Their main ideas are covered in many other talks as well.
The last point ties in with another TED talk, which will be introduced later.
To be continued 👨💻 …
Use the following command to generate a public (silver) / private (black) RSA key pair under the `~/.ssh` directory. `~/.ssh/id_rsa` is the private key that you keep on your machine, and `~/.ssh/id_rsa.pub` is the public key that you distribute to other machines/platforms in order to achieve automatic login via key-pair verification.
ssh-keygen -t rsa
Note that Git hosting platforms such as GitHub recommend a different type of cryptosystem, namely Ed25519, for which a key pair can be generated using the following command.
ssh-keygen -t ed25519 -C "youremail@yourdomain"
Compared to RSA, it is considered to be faster, safer and more compact (public key length: Ed25519 68 chars, RSA 544 chars), although RSA is more commonly used.
If Alice wants to log in to Server1 shown in the figure, she can run the following command to forward her SSH public key(s) to it. The keys sent will be recorded in `.ssh/authorized_keys` of the host.
ssh-copy-id alice-username@server1.domain-or-ip
Afterwards, Alice should be able to log in to Server1 without being asked for her password. I.e.,
ssh alice-username@server1.domain-or-ip
Welcome to XXX ...
Note that Mac users who do not have `ssh-copy-id` can either install it via `brew` or manually copy these ssh files through `scp`, `rsync` or whatnot. If you go for the latter, one thing you should keep in mind is to set the permission bits correctly as shown below.
chmod 700 ~/.ssh
chmod 600 ~/.ssh/*
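If you copy keys around from a script, the same permission bits can be checked and set programmatically. The following is only a sketch in Python; the helper names `loose_ssh_permissions` and `fix_ssh_permissions` are made up for this post, and, unlike the shell glob above, the loop only touches regular files (including dotfiles).

```python
import os
import stat


def loose_ssh_permissions(ssh_dir):
    """Return the paths whose mode is not 700 (the directory) or 600 (its files)."""
    problems = []
    if stat.S_IMODE(os.stat(ssh_dir).st_mode) != 0o700:
        problems.append(ssh_dir)
    for name in sorted(os.listdir(ssh_dir)):
        path = os.path.join(ssh_dir, name)
        if os.path.isfile(path) and stat.S_IMODE(os.stat(path).st_mode) != 0o600:
            problems.append(path)
    return problems


def fix_ssh_permissions(ssh_dir):
    """Apply the same bits as `chmod 700` on the directory and `chmod 600` on its files."""
    os.chmod(ssh_dir, 0o700)
    for name in os.listdir(ssh_dir):
        path = os.path.join(ssh_dir, name)
        if os.path.isfile(path):
            os.chmod(path, 0o600)
```

A typical call would be `fix_ssh_permissions(os.path.expanduser("~/.ssh"))`, followed by `loose_ssh_permissions(...)` to confirm that nothing is left over.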
In the case that Server1 sits behind a firewall or what have you, Alice has to connect to it via a proxy, say Server2; therein lies the question: how can she access Server1 directly using key-pair authentication?
To tackle this, Alice can first forward her keys to Server2, the proxy, through Step 2. Next, Alice should log in to Server2 to generate a key pair (Step 1), and then send the keys to Server1.
Last, but certainly not least, Alice should set up `ssh` on her local machine. The configuration (`~/.ssh/config`) is along the lines of the following.
Host server2
  HostName server2-domain
  User alice-username
  IdentityFile ~/.ssh/id_rsa
Host server1
  HostName server1-domain
  User alice-username
  ForwardX11Trusted yes
  ForwardAgent yes
  IdentityFile ~/.ssh/id_rsa
  ProxyCommand ssh server2 -W %h:%p
Now, everything should be in place. Alice should be able to log in, forward ports, or whatnot, with automatic key-pair authentication (without having to type her password every single time for every single server along the way!).
🛑 NB: When tunnelling ports, the data exchanged between Server1 and `{hostmachine}` is not encrypted by SSH.
# Log in *Server1* via *Server2*.
ssh server1
# Tunnelling ports to *Server1* via *Server2*.
ssh -NL {listening_port}:{hostmachine}:{host_port} server1
p.s. Wrestling with company proxies during these COVID times we live in at the moment can be devastating 😷
The main idea of autoencoders is to extract latent features that are not easily observable yet play an important role in one or several aspects of the data (e.g., images).
The first step of the process is to compress the observed data vector $\vec x$ into the latent feature vector $\vec z$.
Such a compression process yields two obvious benefits:
The second phase is to try to reproduce the data (the image) from the latent feature vector $\vec z$.
Evidently, since the first step is a “lossy compression”, the reconstructed data $\vec{\hat{x}}$ will not be exactly the same as the original observation. This is where the third phase comes in.
As mentioned above, there is a difference between the observation $\vec{x}$ and the reconstruction $\vec{\hat{x}}$.
From the above picture, we can see clearly that the higher the dimension of the latent feature vector $\vec{z}$, the higher the quality of the reconstruction.
Therefore, constraining the size of the latent space will enforce the “importance” of the extracted features.
Further, we can use a loss function to measure such “importance” of the extracted hidden variables. In this case, we use a simple square loss:
\[\mathcal{L}(x, \hat{x})=\|x-\hat{x}\|^{2}\]Thus, the key power of autoencoders is that
Autoencoders allow us to quantify the latent variables without labels (gold-standard data)!
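The three phases (compress, reconstruct, measure the difference) can be sketched with a toy example in Python. This is not a trained network: the "weights" `w` are picked by hand, and the encoder is just a projection of a 2-D point onto one direction. It only illustrates that the square loss needs nothing but $x$ and $\hat{x}$, i.e. no labels.

```python
import math


def encode(x, w):
    """Compress a 2-D observation x into a 1-D latent z (projection onto the unit vector w)."""
    return x[0] * w[0] + x[1] * w[1]


def decode(z, w):
    """Reconstruct a 2-D vector x_hat from the 1-D latent z."""
    return (z * w[0], z * w[1])


def square_loss(x, x_hat):
    """L(x, x_hat) = ||x - x_hat||^2"""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))


w = (1 / math.sqrt(2), 1 / math.sqrt(2))   # hypothetical, hand-picked weights
on_axis = (2.0, 2.0)    # lies along w, so one latent number captures it fully
off_axis = (2.0, 0.0)   # does not lie along w, so the compression loses information

for x in (on_axis, off_axis):
    x_hat = decode(encode(x, w), w)
    print(x, x_hat, square_loss(x, x_hat))
```

The first point is reconstructed (almost) perfectly while the second is not, which is the "lossy compression" at work: a one-dimensional bottleneck can only keep the feature that `w` points at.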
To summarize,
In a nutshell, variational autoencoders are a probabilistic twist on autoencoders, i.e. they (stochastically) sample from the mean and standard deviation to compute the latent sample, as opposed to deterministically taking the entire latent vector $\vec{z}$. That being said, the main idea of the forward propagation does not change compared to traditional autoencoders.
Then, we could compute the loss as follows
\[\mathcal{L}(\phi, \theta, x)=(\text { reconstruction loss })+(\text { regularization term }),\]where the reconstruction loss is exactly the same as before. It captures the pixel-wise difference between the input and the reconstructed output. This is a metric of how well the network is doing at generating a distribution that is akin to that of the observations.
As to the “regularization term”, since the VAE is producing these probability distributions, we want to place some constraints on how they are computed as well as what that probability distribution resembles as a part of regularizing and training the network.
Hence, we place a prior $p(z)$ on the latent distribution as follows
\[D(p_{\phi}(z|x)\ ||\ p(z)),\]which captures the KL divergence between the inferred latent distribution and this fixed prior, for which a common choice is a standard Gaussian, i.e. we centre it at a mean of 0 with a standard deviation of 1: $\ p(z)=\mathcal{N}\left(\mu=0, \sigma^{2}=1\right)$.
In this way, the network learns to penalise itself when it tries to cheat and cluster points outside of this smooth Gaussian distribution, as would be the case if it were overfitting or trying to memorize particular instances of the input.
Thus, this forces the extracted $\vec z$ to follow the shape of our initial hypothesis about the distribution, smoothing out the latent space and, in turn, helping the network not to over-fit on certain parts of the latent space.
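For a standard Gaussian prior the regularization term has a well-known closed form. The sketch below assumes, as is common for VAEs, that the encoder outputs a mean and a log-variance per latent dimension (a diagonal Gaussian); the function name is made up for illustration.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over the latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))


print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))   # the posterior equals the prior
print(kl_to_standard_normal([1.0, 0.0], [0.0, 0.0]))   # the mean drifts away from 0
print(kl_to_standard_normal([0.0], [math.log(4.0)]))   # the variance drifts away from 1
```

The penalty is zero only when the inferred distribution matches the prior, and it grows as the mean moves away from 0 or the variance moves away from 1, which is exactly the "penalise itself" behaviour described above.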
Unfortunately, due to its stochastic nature, backpropagation cannot pass through the sampling layer, as backpropagation requires deterministic nodes in order to iteratively pass gradients and apply the chain rule.
Instead, we consider the sampled latent vector $\vec z$ as the sum of a fixed mean vector $\vec \mu$ and a fixed standard-deviation vector $\vec \sigma$ that is scaled by a random factor $\epsilon$ drawn from a prior distribution, for example a standard Gaussian, i.e. $\vec z = \vec \mu + \vec \sigma \odot \epsilon$. The key idea here is that we still have a stochastic node, but since we have done this reparametrization with the factor $\epsilon$ drawn from a normal distribution, the stochastic sampling does not occur directly in the bottleneck layer of $\vec z$. This way, we can reparametrize where that sampling is occurring.
Note that this is a really powerful trick as such reparametrization is what allows for VAEs to be trained end-to-end.
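Stripped of any deep-learning framework, the trick for a single latent dimension looks roughly like the sketch below. The function name and the returned tuple are assumptions made for illustration; the point is that all the randomness sits in `eps`, so the derivatives of `z` with respect to `mu` and `sigma` are ordinary numbers.

```python
import random
import statistics


def sample_latent(mu, sigma, rng):
    """Reparameterised sample z = mu + sigma * eps with eps ~ N(0, 1).

    Returns (z, dz_dmu, dz_dsigma). Given eps, z is a deterministic
    function of mu and sigma, so gradients can flow through it.
    """
    eps = rng.gauss(0.0, 1.0)
    z = mu + sigma * eps
    return z, 1.0, eps


rng = random.Random(0)
samples = [sample_latent(2.0, 0.5, rng)[0] for _ in range(20000)]
# The samples still follow N(mu, sigma^2), even though mu and sigma are plain inputs
print(statistics.mean(samples), statistics.stdev(samples))
```

The sample mean and standard deviation should land close to 2.0 and 0.5 respectively, confirming that moving the noise into `eps` does not change the distribution being sampled.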
The following is a vanilla implementation of a VAE encoder in TensorFlow.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import Dense, Input

class Sampling(keras.layers.Layer):
    """Reparametrization trick: z = mean + sigma * epsilon."""
    def call(self, inputs):
        z_mean, z_log_var = inputs
        batch = tf.shape(z_mean)[0]
        dim = tf.shape(z_mean)[1]
        # epsilon ~ N(0, I) carries all of the randomness
        epsilon = tf.random.normal(shape=(batch, dim))
        # exp(0.5 * log_var) is the standard deviation
        return z_mean + tf.exp(0.5 * z_log_var) * epsilon

latent_dim = 2
encoder_inputs = Input(shape=(6,), name="input_layer")
x = Dense(5, activation="relu", name="h1")(encoder_inputs)
x = Dense(5, activation="relu", name="h2")(x)
x = Dense(4, activation="relu", name="h3")(x)
z_mean = Dense(latent_dim, name="z_mean")(x)
z_log_var = Dense(latent_dim, name="z_log_var")(x)
z = Sampling()([z_mean, z_log_var])
encoder = keras.Model(encoder_inputs, [z_mean, z_log_var, z], name="encoder")
keras.utils.plot_model(encoder, show_shapes=True)
`syscall`, but we have never ever seen a system call!
First, let’s recall the UNIX system structure:
A system call is the interface between user mode and kernel mode. For security reasons, it is not necessarily the interface that you want to hand directly to a programmer.
Consequently, system calls are buried in the programming language's runtime library (e.g. `libc.a`), so it is the C library that actually makes the system calls to the operating system for us.
System calls show their power when we are dealing with multiple different devices. When their `syscall`s are similar enough, mounting them becomes easy.
// It is said that the way Linux deals with it is to encompass every system call under the sun from all kinds of different operating systems. Terrific ~
Normally, we would get differently sized chunks of data when reading from different devices but, by virtue of uniformity, we are able to read from and write to a disk drive in exactly the same way as we read from and write to flash memory. This is because the kernel interface is byte-oriented: it reads and writes bytes, so it doesn’t care about the size of the data blocks.
`read` && `write`
- Uniformity
  - `open`, `read`/`write` and `close`.
  - `find | grep | wc …`.
- Open before use
- Byte-oriented
- Kernel buffered reads
- Kernel buffered writes
- Explicit close
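To see the uniformity in action, here is a small sketch that uses Python's thin wrappers around the system calls (`os.open`, `os.read`, `os.write`, `os.close`). The helper `copy_bytes` and the file name are made up for illustration; the point is that the helper neither knows nor cares whether a descriptor refers to a regular file or a pipe.

```python
import os
import tempfile


def copy_bytes(src_fd, dst_fd, chunk=4096):
    """Copy src_fd to dst_fd until end-of-file, using nothing but read and write."""
    total = 0
    while True:
        data = os.read(src_fd, chunk)        # returns b"" at end-of-file
        if not data:
            return total
        while data:                          # write may accept fewer bytes than offered
            written = os.write(dst_fd, data)
            total += written
            data = data[written:]


# A regular file as the source ...
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)   # open before use
os.write(fd, b"hello from a file\n")
os.close(fd)                                          # explicit close

# ... and a pipe as the destination, handled by the very same calls
file_fd = os.open(path, os.O_RDONLY)
pipe_r, pipe_w = os.pipe()
copied = copy_bytes(file_fd, pipe_w)
os.close(file_fd)
os.close(pipe_w)
print(copied, os.read(pipe_r, 100))
os.close(pipe_r)
```

The same byte-oriented loop is what makes pipelines such as `find | grep | wc` composable: each stage just reads bytes from one descriptor and writes bytes to another.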
As we discussed in the training session, reading something off a disk is time-consuming and costly, often taking up to several milliseconds, which is roughly the time of a million instructions. Thus, in order not to waste a million instructions, we had better put the corresponding process to sleep and yield the processor to other tasks.
The same goes for writing: when the system call `write` returns, the data is not necessarily on the disk but buffered in memory (in the kernel), allowing the application to keep going.
In a nutshell, the kernel is doing tons of buffering and virtualization behind the scenes.
(In other words, if your machine crashes at a wrong point in time, you will lose your data permanently …)
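When losing data is not an option, a program can explicitly ask the kernel to flush its buffers. A minimal sketch in Python, where `durable_write` is a made-up helper name: `os.write` returns as soon as the kernel has taken the bytes, whereas `os.fsync` blocks until the kernel has pushed them out to the storage device.

```python
import os


def durable_write(path, payload):
    """Write payload to path and flush the kernel's buffers before returning."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        view = memoryview(payload)
        while view:
            # os.write returns once the kernel has buffered (some of) the bytes
            view = view[os.write(fd, view):]
        # os.fsync returns only after the kernel has flushed them to the device
        os.fsync(fd)
    finally:
        os.close(fd)
```

The price is latency: every `durable_write` pays the several milliseconds that kernel buffering normally hides, so it is usually reserved for data that really must survive a crash.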
p.s. I’ll (hopefully) get to the user-buffered I/O later.
As we mentioned before, a process is an instance of a program executing. Its state and information are all monitored by the OS. Processes can do their work by invoking system calls.
But are there any operations that act on processes themselves?
Yes, a process can create a new process (sometimes called a sub-process) by copying itself!
fork()
In typical UNIX systems (excluding Linux, in the sense that it may somehow augment the child process in the first place), the `fork()` system call (or, more precisely, library function) creates a copy of the calling process.
When I say copy, I mean all of the states of the original process duplicated in both the parent and the child! (Memory, File Descriptors, etc…)
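`fork()` returns 3 kinds of values:
- a positive value, the child's PID, in the parent process;
- `0` in the child process;
- `-1` (as UNIX convention) represents an error.

When calling `fork()`, the original process traps into kernel mode and halts until the call returns. The parent and the child then both return from `fork()` together.

An example: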
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>

int main(void)
{
    pid_t cpid, mypid;
    pid_t pid = getpid();   /* get current process's PID */
    printf("Parent pid: %d\n", (int)pid);
    fflush(stdout);         /* otherwise buffered output is duplicated in the child */
    cpid = fork();
    if (cpid > 0) {
        /* Parent process: fork() returned the child's PID */
        mypid = getpid();
        printf("[%d] parent of [%d]\n", (int)mypid, (int)cpid);
    }
    else if (cpid == 0) {
        /* Child process: fork() returned 0 */
        mypid = getpid();
        printf("[%d] child\n", (int)mypid);
    }
    else {
        /* fork() returned -1: no child was created */
        perror("Fork failed");
        exit(1);
    }
    exit(0);
}
`fork()` is non-blocking, which means the parent process does not automatically sit there and wait for its child processes to finish.
Furthermore, parent and child actually run concurrently: both of them are scheduled independently and take turns on the run queue.
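If the parent does want to know how its child ended, it has to wait for it explicitly. Below is a minimal sketch in Python, whose `os.fork` and `os.waitpid` wrap the corresponding UNIX calls; the helper name `run_child_and_wait` is made up for illustration.

```python
import os


def run_child_and_wait(exit_code):
    """Fork a child that exits with exit_code; return (child_pid, status seen by the parent)."""
    pid = os.fork()
    if pid == 0:
        # Child: leave immediately, without running the parent's cleanup handlers
        os._exit(exit_code)
    # Parent: fork() has already returned and the child runs on its own schedule,
    # so the only way to learn its fate is to wait for it explicitly.
    _, status = os.waitpid(pid, 0)
    return pid, os.WEXITSTATUS(status)
```

Calling `run_child_and_wait(7)` should give back the child's PID together with the exit status 7. Without the `os.waitpid` call the parent would simply carry on, whether or not the child has finished.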
Parent processes are able to control their child processes directly.
// Didn’t expect it’s going to be such a creepy parenting blog …
A shell is a job control system which allows programmers to create and manage a set of programs to do some task —— Berkeley CS162
A shell is a command interpreter which makes key process-management system calls that are dealing with the creation and termination of processes. —— Prof. Andy Tanenbaum
Okay~ how come we wind up with Shell anyway?
- `init` in UNIX;
- `init` is the first process, which spawns all the other child processes, one of which is the shell.
- The shell `fork()`s itself, and the child immediately calls `exec()` to load a new program into its address space. (This is often followed by `wait()`, which blocks the shell until it gets the return value from the child, which also releases the child from being a zombie.)

cc -c sourcefile1.c
cc -c sourcefile2.c
cc -o program sourcefile1.o sourcefile2.o
./program
- `fork`: system call to create a copy of the current process, and start it running.
- UNIX `exec`: system call to change the program being run by the current process (the process stays the same, but its program image is replaced with a brand new one).
- UNIX `wait`: system call to wait for a process to finish.
- `signal`: system call to send a notification to another process. (To me, the `signal` service is just a sort of user-level interrupt that works in any process, regardless of whether it's a parent or a child.)
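Put together, `fork`, `exec` and `wait` are all a shell needs in order to run one command. Here is a hedged sketch in Python using the `os` wrappers around those calls; `run_command` is a made-up name, and the exit status 127 for a command that cannot be executed merely follows the usual shell convention.

```python
import os
import sys


def run_command(argv):
    """Run one command the way a shell does: fork, exec in the child, wait in the parent."""
    pid = os.fork()
    if pid == 0:
        try:
            os.execvp(argv[0], argv)   # replaces the child's program; never returns on success
        except OSError:
            os._exit(127)              # conventional status for "command not found"
    _, status = os.waitpid(pid, 0)     # reap the child so it does not stay a zombie
    return os.WEXITSTATUS(status)
```

For example, `run_command(["ls", "-l"])` runs `ls` in a child process and hands its exit status back to the caller, which is what happens every time you hit enter at a prompt.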
An example of using `signal()`:
#include <stdlib.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>
#include <signal.h>

// this is just a way of changing the original `SIGINT` handler to our self-defined handler
void signal_callback_handler(int signum) {
    // printf() is not async-signal-safe; tolerable in a demo only
    printf("Caught signal %d - phew!\n", signum);
    exit(1);
}

int main() {
    signal(SIGINT, signal_callback_handler);   // install the handler
    while (1) { }                              // spin until Ctrl-C arrives
}
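The same idea can be tried without pressing Ctrl-C by letting a process send a signal to itself. The sketch below uses Python's `signal` and `os` modules; the names `caught`, `on_signal` and `send_to_self` are made up for illustration. Note the division of labour: `signal.signal` only installs the handler, while `os.kill` is the call that actually sends the signal.

```python
import os
import signal
import time

caught = []   # signal numbers seen by the handler


def on_signal(signum, frame):
    """Self-defined handler: record the signal instead of letting it kill the process."""
    caught.append(signum)


def send_to_self(signum, timeout=1.0):
    """Install on_signal for signum, send signum to this very process, restore the old handler."""
    previous = signal.signal(signum, on_signal)   # must be called from the main thread
    try:
        os.kill(os.getpid(), signum)
        deadline = time.monotonic() + timeout
        while signum not in caught and time.monotonic() < deadline:
            time.sleep(0.01)                      # give the interpreter a chance to run the handler
    finally:
        signal.signal(signum, previous)
    return signum in caught
```

`send_to_self(signal.SIGINT)` should return `True` instead of terminating the program, because the default `SIGINT` behaviour has been swapped for our own handler for the duration of the call.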