Lecture 10
First heuristic: how we select the phase
Now: how we select variables. We already discussed DLIS, which picks the literal occurring most often in not-yet-satisfied clauses.
Moskewicz was a Bachelor’s student at the time. He got the task of taking the then state-of-the-art solver, GRASP, and turning it into something simpler. While doing so, he had two insights:
- Watching (the two-watched-literals scheme)
- VSIDS (gave exponential improvement)
Original version: count the occurrences of literals in learned clauses (sometimes, “conflict clauses” is used for learned clauses). So whenever you learn a clause, you increase the counters of the literals in that clause. Every literal has its own counter; whenever it occurs in a learned clause, you increase it by 1. Next step: we need to pick a decision, so pick an unassigned variable with the largest count. This part is very similar to DLIS, with one exception: in DLIS, you would remove the counts coming from already-satisfied clauses. So up to bullet 4, it’s an approximation of DLIS.
Nowadays, we’d say we’re putting an exponential moving average on this, which is a low-pass filter. You can also see it the other way around: we’re trying to focus on recently bumped variables (an increase in score is called “bumping”). How do we do that? We put this filter on top, and the scores of all variables are decreased by a factor $\beta$ every time.
In ML:
- $\alpha$: smoothing (1/30 previously)
- $\beta$: decay, $\beta = 1 - \alpha$. Close to 1. In the slides, $f = \beta$
Modern ML-context explanation: you put an EMA over this bumping
First bullet: the original version of VSIDS
We increment the scores of the variables in the learned clause by one every time: whenever you get a conflict, increase them. That’s cheap because you have to produce this learned clause anyhow. After 256 conflicts, we halve the scores of all variables, and at the same time, he suggested re-sorting the queue. So all literals are on a stack; you sort them with respect to their score, and whenever you do this re-scoring (the decay after 256 conflicts), you re-sort the stack. This way, you keep the stack sorted with respect to the score, but not exactly, because scores change until the next 256 conflicts, so you might not pick the best variable.
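As a rough illustration, here is a minimal C++ sketch of this counting scheme (all names are made up; the periodic re-sorting of the decision order is omitted):

```cpp
#include <cstdint>
#include <vector>

// Sketch of the original VSIDS counting scheme described above.
// Integer counters suffice; every 256 conflicts all scores are halved.
struct OriginalVsids {
    std::vector<std::uint64_t> score;   // one counter per variable
    std::uint64_t conflicts = 0;

    explicit OriginalVsids(std::size_t num_vars) : score(num_vars, 0) {}

    // Called once per conflict with the variables of the learned clause.
    void on_learned_clause(const std::vector<int>& clause_vars) {
        for (int v : clause_vars) score[v] += 1;   // bump by 1
        if (++conflicts % 256 == 0)                // decay: halve all scores
            for (std::uint64_t& s : score) s /= 2;
    }
};
```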
The MiniSAT guys came up with exponential VSIDS.
The original version was costly because you have to touch all variables: with a million variables, you maybe learn some short clauses and increment the scores of the variables in those clauses by 1, but then after 256 conflicts, you have to go through all million variables and halve their scores. And you also need to sort these million variables every time.
MiniSAT had an idea: “Why don’t we turn this around?” Instead of decaying, we do the following: we increase the increment every time. So the increment doesn’t stay at 1; we grow it geometrically by the inverse of $f$. If $f$ is close to 1, e.g., 0.9, then $1/f$ is close to 1.1. The first time you have a conflict, you increase the literals in the learned clause by 1. At the next conflict, you increase every such literal by 1.1, and then by 1.1 × 1.1 = 1.21. So you increase the increment every time by multiplying it by 1.1.
For this to work, you need floating-point scores; for the original version, integers are enough. If you did the original one with floating points and re-scored every time with $f$, the order among the variables would remain the same: sorting the variables with respect to the normalized score, which is between 0 and 1, gives the same order as sorting with respect to the exponential one. Therefore, you can use the exponential one. The benefit is that you don’t have to go through all variables; you only have to go over the variables in the learned clause. This gives you a huge improvement.
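A minimal C++ sketch of this scheme, assuming $f = 0.9$ (names are made up; the rescaling guard anticipates the overflow issue discussed further below):

```cpp
#include <vector>

// Sketch of exponential VSIDS: the increment grows by 1/f per conflict,
// so only the bumped variables are touched. Floating-point scores required.
struct Evsids {
    std::vector<double> score;
    double inc = 1.0;              // current bump amount
    double e = 1.0 / 0.9;          // 1/f, roughly 1.1 for f = 0.9

    explicit Evsids(std::size_t num_vars) : score(num_vars, 0.0) {}

    void on_learned_clause(const std::vector<int>& clause_vars) {
        for (int v : clause_vars) {
            score[v] += inc;               // bump only these variables
            if (score[v] > 1e150) rescale();
        }
        inc *= e;                          // grow the increment geometrically
    }

    void rescale() {                       // scale everything down;
        for (double& s : score) s *= 1e-150;   // relative order is preserved
        inc *= 1e-150;
    }
};
```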
This is just the exponential moving average (EMA) we did previously.
$s’$: the score for an individual variable
So you try to predict the next score with this EMA. If $s$ is close to 1, you’d say this variable will likely be in the learned clause next time. You have a prediction, you record what actually happened, and you move in the direction of the error, but only by a fraction $\alpha$: take the old value and correct it with a smoothing factor, like 5%, in the direction of the error.
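As a formula (restating the description above, with $\delta \in \{0,1\}$ recording whether the variable occurs in the current learned clause):

$$ s' = s + \alpha\,(\delta - s) = (1 - \alpha)\,s + \alpha\,\delta $$

With $\delta = 1$ the score moves a fraction $\alpha$ toward 1; with $\delta = 0$ it decays toward 0 by the factor $\beta = 1 - \alpha$.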
The math shows that if you pick $f$ between 0 and 1, then with this update the score will always stay between 0 and 1.
https://www.youtube.com/watch?v=MOjhFywLre8
This is the normalized VSIDS. This is not how it’s implemented: to get this picture from exponential VSIDS, Biere had to multiply the scores every time by the inverse of the growing increment.
- $x$-axis: number of remaining variables
- On top, there’s a counter which counts the conflicts → time
- $y$-axis: 0-100%, like between 0 and 1
- Green line: $\alpha$
- The colors do not mean anything. They are just there so we can distinguish between the numbers.
The interesting part here is that the score is independent of the actual state of the formula. To compute this, you only take the literals in the learned clause.
It doesn’t look at the state of the current partial assignment; it looks at the history of how the search evolved. It’s also very human in a certain sense: looking at the things that didn’t work. If you try to solve something and get into an issue, you look at the things that failed recently. You forget things that didn’t work 30 minutes ago and only work on the things that are failing at this moment.
Even after 20 years, this heuristic has no theoretical justification; it is backed purely by empirical results and intuition.
320 just got increased by $\alpha$
When the height of the numbers increases sufficiently, the graph jumps up.
Bug in the heuristics: a number went above 100%.
When you do this exponential VSIDS, the increment grows exponentially and overflows roughly after 4000 conflicts if you pick 1.1: it goes over $10^{150}$, and then you need to re-score everything, which is done by dividing all scores and the increment by a huge constant. In Biere’s code, he forgot to divide something somewhere.
Top: the original implementation. They would increment every variable which occurs in a conflict by 1. The counts go up, but that’s not good, as the authors figured out: it’s better to focus on variables that changed recently. The way they did this: every 256 conflicts, you divide the scores by 2. How do you pick a variable? You pick the one which occurs most often. The trick to do that: when you halve the scores, you re-sort the variables. Because some variables went up, you re-sort them so they are in descending order, and then you go forward and search for the first variable in this list which is unassigned and pick it as the decision.
Then there was a big insight by the two Niklases (Eén and Sörensson): the idea of turning these scores around, not incrementing by 1, but by an increasing factor which gets bigger and bigger, exponentially increasing. (Also in the video:) if you normalize these scores that go up, the delta gets bigger and bigger (exponentially), and if you normalize it to between 0 and 1, then you see the video from last time: “normalized VSIDS,” as Biere called it.
You can see here what happens to one count, one variable, and $n$ is the number of conflicts.
You always have to update all normalized scores $s_n$; we need to update them all the time for all variables. That’s costly with, e.g., a million variables: for every conflict, you must go over and update all of them.
Niklas’s idea: use exponential VSIDS, only update the variables involved, and keep all others the same. In the video, you saw that the scores slowly go down; that’s what you have in normalized VSIDS. In exponential VSIDS, you keep all the scores but increase the delta exponentially. This way, you only need to update the variables in the learned clause, and these learned clauses are hopefully way smaller than a million variables, e.g., 20 or 30 literals. The deltas go up.
$S_n$, $s_n$: these two scores are related linearly by the factor $f^{-n}/(1-f)$. This factor changes with $n$, the number of conflicts, but all the normalized scores are just the same factor off from the exponential VSIDS scores. That allows us to compare variables according to normalized VSIDS via the exponential VSIDS scores. $S_n$ is what we keep in the solver, and we use this score to put variables into the priority queue, also an invention of the two Niklases.
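Spelled out, with $n$ the number of conflicts:

$$ S_n = \frac{f^{-n}}{1-f}\, s_n $$

Since the factor is the same for all variables at a given $n$, sorting by $S_n$ gives the same order as sorting by $s_n$.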
Priority queue: instead of re-sorting the list every 256 conflicts, they would have a priority queue, usually implemented with a binary heap, the common way. A binary heap has the effect that you can push and pop things in logarithmic time and, in our case, get the maximum in constant time; we need to get the maximum variable out of the binary heap. A binary heap also allows updating variables: if there’s a variable in the heap and you increase its score by adding the current increment $f^{-n}$, it bubbles up in the binary heap.
A binary heap is really just an array, and it’s only partially sorted. You have an invariant that an element in this heap has (in our case) a higher score than all its children; that’s what you need to maintain. If a variable in the binary heap increases its score, you bubble it up: you compare it to the score of its parent, and if it’s bigger, you swap the two. This needs at most a logarithmic number of swaps, so a binary heap is a good implementation.
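A C++ sketch of such a heap (an illustration under the description above, not MiniSAT’s actual code; the position map `pos` is what lets us find a bumped variable inside the heap):

```cpp
#include <vector>

// Sketch of a binary max-heap over variable scores.
// (push is analogous to bubble_up: append at the back, then bubble up.)
struct VarHeap {
    std::vector<int> heap;        // partially sorted array of variable indices
    std::vector<int> pos;         // pos[v] = index of v in `heap`, or -1
    std::vector<double> score;    // score[v]

    bool less(int u, int v) const { return score[u] < score[v]; }

    // Restore the invariant upward after score[v] increased:
    // at most a logarithmic number of swaps.
    void bubble_up(int v) {
        int i = pos[v];
        while (i > 0 && less(heap[(i - 1) / 2], v)) {
            heap[i] = heap[(i - 1) / 2];   // move the smaller parent down
            pos[heap[i]] = i;
            i = (i - 1) / 2;
        }
        heap[i] = v;
        pos[v] = i;
    }

    int top() const { return heap[0]; }    // max-score variable in O(1)

    void pop() {                           // remove the maximum, O(log n)
        pos[heap[0]] = -1;
        int v = heap.back();
        heap.pop_back();
        if (heap.empty()) return;
        int i = 0, n = (int)heap.size();
        for (;;) {                         // bubble the last element down
            int c = 2 * i + 1;
            if (c >= n) break;
            if (c + 1 < n && less(heap[c], heap[c + 1])) ++c;
            if (!less(v, heap[c])) break;
            heap[i] = heap[c];
            pos[heap[i]] = i;
            i = c;
        }
        heap[i] = v;
        pos[v] = i;
    }
};
```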
The two Niklases: http://minisat.se/
There’s an important trick that Niklas Sörensson invented. When you have a binary heap, you need to get the variable with the maximum score (highest priority) which is unassigned. The natural way of doing this: whenever you assign a variable, you pop it from the heap, but that’s completely redundant. Why don’t we do this lazily? Whenever we have to decide, we go to the binary heap and take the maximum variable; if it’s unassigned, we have a decision. If it’s not, we pop it. The one with the next highest score bubbles up to the top, and we check that one: if it’s unassigned, we’ve found our decision; otherwise, we pop again. The point is that you remove assigned variables from the binary heap lazily, which is a big improvement.
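Continuing the heap sketch above, the lazy decision loop might look like this (`assigned` stands for the solver’s assignment flags and is an assumed name; at least one unassigned variable is assumed to remain):

```cpp
// Lazy removal: assigned variables are only discarded when they
// surface at the top of the heap.
int pick_decision(VarHeap& h, const std::vector<bool>& assigned) {
    while (assigned[h.top()])
        h.pop();                 // lazily remove already-assigned variables
    return h.top();              // highest-score unassigned variable
}
```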
EMA explained last time, so it is omitted here.
Goldberg: came up with BerkMin. Name: Berkeley and Minsk. https://www.sciencedirect.com/science/article/pii/S0166218X06004616
VSIDS was patented in 2000 by Princeton, but not exponential VSIDS. The patent is almost gone now; after 25 years, it vanishes. One company bought it for 1 million dollars. It is questionable whether that was a good decision, though. There are also ways around VSIDS.
Their idea: why don’t we focus on recent clauses instead of recent variables? We learn these clauses anyhow and have them in memory on a stack, so why don’t we go through this stack in reverse order and try to find good decisions there? It worked; however, it’s not as good in Biere’s experiments, so focusing on variables is better.
Variable move-to-front: implemented in the Siege SAT solver. The source code is not available because the solver was sold.
HaifaSAT combines variable move-to-front with BerkMin. What they do: when you have a learned clause, which variables do you bump? VSIDS bumps only the literals in the learned clause. But during conflict analysis, you resolve away variables on the highest decision level; those variables are not bumped, so their score is not increased. That’s not good, because they might be as important as those which do end up in the clause. During this resolution, you also resolve with clauses: remember the reasons and the colorful illustrations in the slides; those reason clauses take part in the resolution. So they had an idea: why don’t we organize the BerkMin clauses, which sit on the stack of learned clauses, in such a way that if a clause is used in this resolution, you take the clause out and put it at the end of the queue? Otherwise, they do the same as BerkMin: start at the end of the queue and pick the first clause which is not yet satisfied. Clauses on the queue cannot be falsified, because then you’d have a conflict; so these clauses are either already satisfied or not yet satisfied because at least two literals are unassigned and the rest are false. Then you pick one of those unassigned literals.
We have this queue, but now the queue consists of variables (variable indices on top in the picture); the value is stored somewhere else (the assignment is shown just for the picture). Every queue element has a timestamp (this will become important). You maintain the invariant that the timestamps are increasing along the queue (Biere’s idea; Ryan didn’t have it). Otherwise, it’s the same idea as explained before.
You have this queue, but now of variables. If a variable is resolved away or occurs in the learned clause, you take it out of the queue and put it at the end. That’s why we have a doubly linked list: you need to do this in constant time. So bumping now means taking variables out of the queue and putting them at the end.
This is way faster than a binary heap, because taking a variable out of a binary heap or increasing its score costs $\log n$. In practice, that is quite a bit: think of a million variables, then it’s not just 10 or 20 bubble operations. With a million variables in big memory, if you take one variable and compare it to its parent, which is far apart in the array, the two will be in two different cache lines, and then you need to swap them. Towards the top the distances get smaller, but you’ll have at least 2 or 3 bad memory accesses. So bubbling up in a binary heap used as a priority queue for a million variables is slow, and this queue is faster.
In particular, variables that were used recently are probably in the cache anyhow. Just take them out and reconnect the links. For instance, if you take out 4 and put it at the end, you touch two links here, so it’s like two local and two non-local pointer operations. Bumping means taking variables out and moving them to the end.
MTF (move-to-front) is a very generic algorithm in computer science. There’s a part of CS that considers online algorithms: you don’t just get the input and run once; the input arrives piece by piece. There, this idea of moving to the front gives you very good accumulated runtimes. Example: looking up strings in a container while adding and removing strings. You could use a linear list, but then you get accumulated quadratic runtime. A way out: whenever you look up an element in the list, move it to the front, with the idea that things that are looked up frequently will be found very fast the next time. This is the idea of move-to-front. It has a very good accumulated runtime, and we use it as motivation: variables recently moved to the front of the queue are more important than variables at the tail.
“Next search” is needed because you don’t want to always start at the end of the queue when searching for the next unassigned variable. Why? Think of a formula with 10000 variables and the queue: you go to the end of the queue and start searching. The first variable is not assigned, so you assign it. Then, maybe because the formula is easy to satisfy, you take the second, the third, etc., up to 10000. The runtime of this? You get on the order of a hundred million traversals, because you always start at the end. If you search this list over and over again, you should cache the search position, which is “next search”: whenever you find variable 2, you put “next search” at 2, and the next time you search, you don’t start at the end but at “next search,” and so on. This way, you go over the list linearly, which is completely fine, and you avoid the accumulated worst case.
This was already in the original Chaff code: they re-sorted the variables every 256 conflicts and then kept the position up to which they had searched before.
However, this argument has one flaw, shown in the scenario on the slide. What if we had searched up to 3 before, and 3 was the decision, so we set it to 1? Now we need to backtrack because there’s a learned clause, and we flip the value of 3. Maybe 3, 4, and 7 remain assigned, but 9 has become unassigned. Then “next search” is not valid anymore, because the invariant should be that everything to the right of “next search” (toward the end) is assigned. The original Chaff code handled this by resetting “next search” to the end of the queue whenever you backtrack in the SAT solver. It’s a simple solution, but not perfect.
Now the idea presented in this paper: we have a timestamp, so whenever we enqueue something at the end of the queue, i.e., whenever we move something, it gets a new timestamp. This is a 64-bit counter, so we never overflow it, and it makes sure that all the elements of the queue are sorted with respect to this timestamp. Whenever you put something at the end, you increase the (global) timestamp. Now, what we do: when we need to unassign 9, we check its timestamp (12 in this case, in constant time). Then we look up the timestamp of the variable the current “next search” points to, which is 8, and if the former is larger than the latter (12 > 8), we need to move the “next search” pointer, because next time we have to restart from the element with timestamp 12. This way, we don’t need to start at the end again; we can be lazy and start from 12.
Down below, bumping is shown. Variable 4 is bumped because it’s resolved away or in the learned clause; we take it out (3 and 9, its neighbors, are reconnected), put 4 at the end, and then increase the timestamp.
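Putting the pieces together, a minimal C++ sketch of such a queue (made-up names; the decision search itself, which walks from “next search” toward the front skipping assigned variables, is omitted):

```cpp
#include <cstdint>
#include <vector>

// Sketch of the VMTF queue: a doubly linked list of variables, a global
// 64-bit timestamp, and the cached `next_search` position.
struct VmtfQueue {
    struct Link { int prev = -1, next = -1; std::uint64_t stamp = 0; };
    std::vector<Link> links;      // one link per variable
    int head = -1, tail = -1;     // front and end (most recently bumped)
    int next_search = -1;         // cached search position
    std::uint64_t stamp = 0;      // global timestamp, increasing along the queue

    explicit VmtfQueue(std::size_t num_vars) : links(num_vars) {
        for (int v = 0; v < (int)num_vars; ++v) enqueue(v);  // initial order
    }

    // Bumping: unlink v and re-enqueue it at the end, all in constant time.
    // (A real solver would also update next_search here if v is unassigned.)
    void bump(int v) { dequeue(v); enqueue(v); }

    void dequeue(int v) {
        Link& l = links[v];
        if (l.prev != -1) links[l.prev].next = l.next; else head = l.next;
        if (l.next != -1) links[l.next].prev = l.prev; else tail = l.prev;
        if (next_search == v) next_search = l.prev;   // keep the cache valid
    }

    void enqueue(int v) {
        Link& l = links[v];
        l.prev = tail; l.next = -1;
        if (tail != -1) links[tail].next = v; else head = v;
        tail = v;
        l.stamp = ++stamp;        // fresh timestamp at the end of the queue
    }

    // Called when v becomes unassigned during backtracking: if v sits closer
    // to the end than the cached position, the search must restart from v.
    void on_unassign(int v) {
        if (next_search == -1 || links[v].stamp > links[next_search].stamp)
            next_search = v;
    }
};
```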
This thing is way easier to implement than a binary heap.
This is a flipped cactus plot, with the $x$ and $y$ axes swapped. In this plot, being further to the right means better performance.
- $x$-axis: Number of solved instances.
- $y$-axis: Time.
The plot’s time limit is 1000 seconds; nowadays 5000 seconds is used, which is roughly the length of a lecture.
These solvers differ only in how they pick the next variable. The solvers evsids, avg, and sc13 perform similarly. Exponential VSIDS is the default approach. The main point is that evsids and avg perform similarly while the others trail behind. The static variable ordering does worst.
The green line is the plain unsmoothed increment strategy (a 64-bit counter per variable). What distinguishes the top performers on the right is the low-pass filter, i.e., the smoothing.
VSIDS and vmtf fall somewhere in the middle. It is preferable to have a smoothed version.
Vmtf needs refinement. When variables are moved to the front, the important ones should be prioritized and placed at the end; they should not be shuffled but should retain their original relative order.
Chanseok Oh’s work proposes switching between vmtf and evsids/avg. Vmtf is highly effective at finding short proofs and requires frequent restarts, while evsids/avg benefit more from smoothing. For evsids/avg, even a very low smoothing factor, such as $\alpha$ set to 0.01 (1%), can significantly improve performance; this corresponds to $f = 0.99$ and $e = 1/f \approx 1.01$, giving a slowly moving score. Consequently, restarts are rarely performed in this case. Chanseok Oh discovered that vmtf is superior for unsatisfiable instances with frequent restarts, while evsids/avg excel on satisfiable instances with fewer restarts. Therefore, he put two phases into the solver, each with its own priority queue: in one phase, vmtf (a very fast-moving EMA) is used, and in the stable phase (as we now call it), a slowly moving average is employed. In Biere’s new solvers, rather than using vsids twice with different alphas, vmtf is used in the mode requiring frequent restarts, while evsids is used in the other mode. Currently, this is considered the most effective approach.
The picture illustrates that vmtf alone is insufficient; its effectiveness varies depending on the benchmark. It is best to switch between modes dynamically during runtime.
How are these strategies described? We will focus on one variable at a time and observe how its score changes. Then, we consistently select the variable with the highest score based on this criterion.
- Static: The score remains unchanged.
- Increment: Only the variables that are bumped are incremented.
In this context, $\beta_i$ indicates bumping: the important, relevant variables that are bumped get $\beta_i = 1$, while the other variables get 0. Previously we referred to it as $\delta$; it is now denoted $\beta$ to stand for bumping.
Why is $s_{i+1} = i$ the same as vmtf (variable move-to-front)? This relates to the timestamps. All variables involved in the $i$-th conflict (where $i$ is the conflict number) are placed at the end of the queue; consequently, they are assigned the score $i$. Each conflict removes the involved variables from wherever they are and moves them to the end.
The sum strategy is quite effective, surpassing the increment strategy. It incorporates a slight degree of smoothing. How is this achieved? The bumps take the conflict index (the number of conflicts encountered so far) into account: later conflicts give larger bumps than earlier conflicts.
VSIDS (Variable State Independent Decaying Sum) involves a decay process similar to normalized VSIDS. All variables experience decay, while bumped variables receive an increase in score.
Additionally, there is the exponential VSIDS approach, where the bump is increased exponentially using the formula $e = 1/f$ ($f$ being a factor like 0.9).
In the earlier experiments, the average strategy performs similarly. It calculates the distance between the current score and the conflict index, then adds half this difference to the current score; the resulting value is the average of the two. This also acts as a form of smoothing.
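Collected as update rules for a single variable’s score (a hedged reconstruction from the descriptions above, with $i$ the conflict index and $\beta_i \in \{0,1\}$ the bump indicator):

$$
\begin{aligned}
\text{static:}\quad & s_{i+1} = s_i\\
\text{inc:}\quad & s_{i+1} = s_i + \beta_i\\
\text{vmtf:}\quad & s_{i+1} = (1-\beta_i)\,s_i + \beta_i\, i\\
\text{sum:}\quad & s_{i+1} = s_i + \beta_i\, i\\
\text{vsids:}\quad & s_{i+1} = f\, s_i + (1-f)\,\beta_i\\
\text{evsids:}\quad & S_{i+1} = S_i + \beta_i\, f^{-i}\\
\text{avg:}\quad & s_{i+1} = s_i + \tfrac{1}{2}\,\beta_i\,(i - s_i)
\end{aligned}
$$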
Check Your Knowledge
- What is Variable State Independent Decaying Sum (VSIDS) and how does it work in a SAT solver?
- What is the importance of variable selection in SAT solving and how does it affect the overall performance of a solver?
- What is the role of timestamps in the revised Chaff algorithm presented in the lecture?
- Explain the concept of “bumping” in the context of variable selection. Why is it crucial?
- How does the implementation of a binary heap compare to the revised Chaff algorithm?
- What is the idea of exponential VSIDS (evsids) and how does it differ from traditional VSIDS?
- Explain the concept of vmtf and how it improves the performance of the solver.
- Discuss the rationale behind switching between vmtf and evsids/avg, as suggested by Chanseok Oh. In what scenarios does each method excel?
- Discuss the proposed strategies for handling variable scores. How do they work and what are their impacts?