Llama 3.1: Same model, different results. The impact of a percentage point.
July 31, 2024 · By Together AI
Llama 3.1 has taken the world by storm as the first open model to rival the best models available today. One of the advantages of open models is that anyone can host them, and, at first glance, all providers serving the same model appear to offer an equivalent service. Deciphering the differences, and properly benchmarking models as powerful as Llama 3.1, is challenging. Once you do, however, you find that inference services are not fungible. Differences in implementation decisions, optimizations, and quality testing processes can amount to a percentage point or more on a benchmark, and to the difference between success and failure for your application.
Recently, there has been considerable discussion on Twitter regarding these differences:
“Wanna know whether different LLM providers serve the same Llama 3.1 70B? I sure did! So I ran a quick eval to get some surprising results + open sourced my code 👇”
“A quick independent evaluation of Llama-3.1-405B-Instruct-Turbo (on @togethercompute) ⬇️ 1️⃣ It ranks 1st on GSM8K! 2️⃣ Its logical reasoning ability on ZebraLogic is quite similar to Sonnet 3.5, and much better than the others. (note that ZebraLogic is a very new dataset). 3️⃣ It…”
In this post, we first look at some of these differences and what they mean for real-world applications. Then, we share details of the quality testing approach we use to try to ensure models are served at their full reference quality.
The impact of a percentage point
As shown above, when tested on challenging benchmarks like AlpacaEval 2.0, quality metrics can vary significantly across providers for the same model. Foundation model creators expend tremendous effort to gain a couple of percentage points on benchmarks like these. But what is the impact of one or two percentage points? Let's look at some examples. For reproducibility, each of these examples used temperature=0 and the same configuration on both services.
Example 1: High school programming test
“Write code for a red black tree class, insert 4 42 45 6 54 66 67 and display the tree with graphviz, also add functions that verify code correctness, and run these functions to illustrate the code is correct.”
Together Turbo Llama 3.1 405B:
Generates working code: the script runs, renders the tree with Graphviz, and passes all verification checks.
Full output:
Below is an implementation of a Red-Black Tree class in Python, along with functions to insert nodes, display the tree using Graphviz, and verify the correctness of the tree. This implementation includes the insertion of the specified values (4, 42, 45, 6, 54, 66, 67) and the display of the resulting tree.
First, ensure you have Graphviz installed in your environment. You can install it via pip if you haven't already:

```bash
pip install graphviz
```
Here's the Python code for the Red-Black Tree:
```python
import graphviz

class Node:
    def __init__(self, data, color, left=None, right=None, parent=None):
        self.data = data
        self.color = color
        self.left = left
        self.right = right
        self.parent = parent

class RedBlackTree:
    def __init__(self):
        self.NIL = Node(None, 'black')
        self.root = self.NIL

    def pre_order(self, node):
        if node != self.NIL:
            print(f"({node.data}, {node.color})")
            self.pre_order(node.left)
            self.pre_order(node.right)

    def in_order(self, node):
        if node != self.NIL:
            self.in_order(node.left)
            print(f"({node.data}, {node.color})")
            self.in_order(node.right)

    def post_order(self, node):
        if node != self.NIL:
            self.post_order(node.left)
            self.post_order(node.right)
            print(f"({node.data}, {node.color})")

    def left_rotate(self, x):
        y = x.right
        x.right = y.left
        if y.left != self.NIL:
            y.left.parent = x
        y.parent = x.parent
        if x.parent == None:
            self.root = y
        elif x == x.parent.left:
            x.parent.left = y
        else:
            x.parent.right = y
        y.left = x
        x.parent = y

    def right_rotate(self, x):
        y = x.left
        x.left = y.right
        if y.right != self.NIL:
            y.right.parent = x
        y.parent = x.parent
        if x.parent == None:
            self.root = y
        elif x == x.parent.right:
            x.parent.right = y
        else:
            x.parent.left = y
        y.right = x
        x.parent = y

    def insert_fix(self, k):
        while k.parent.color == 'red':
            if k.parent == k.parent.parent.right:
                u = k.parent.parent.left
                if u.color == 'red':
                    u.color = 'black'
                    k.parent.color = 'black'
                    k.parent.parent.color = 'red'
                    k = k.parent.parent
                else:
                    if k == k.parent.left:
                        k = k.parent
                        self.right_rotate(k)
                    k.parent.color = 'black'
                    k.parent.parent.color = 'red'
                    self.left_rotate(k.parent.parent)
            else:
                u = k.parent.parent.right
                if u.color == 'red':
                    u.color = 'black'
                    k.parent.color = 'black'
                    k.parent.parent.color = 'red'
                    k = k.parent.parent
                else:
                    if k == k.parent.right:
                        k = k.parent
                        self.left_rotate(k)
                    k.parent.color = 'black'
                    k.parent.parent.color = 'red'
                    self.right_rotate(k.parent.parent)
            if k == self.root:
                break
        self.root.color = 'black'

    def insert_node(self, data):
        node = Node(data, 'red')
        node.parent = None
        node.left = self.NIL
        node.right = self.NIL
        y = None
        x = self.root
        while x != self.NIL:
            y = x
            if node.data < x.data:
                x = x.left
            else:
                x = x.right
        node.parent = y
        if y == None:
            self.root = node
        elif node.data < y.data:
            y.left = node
        else:
            y.right = node
        if node.parent == None:
            node.color = 'black'
            return
        if node.parent.parent == None:
            return
        self.insert_fix(node)

    def visualize(self, dot):
        if self.root != self.NIL:
            self._visualize(dot, self.root, None)

    def _visualize(self, dot, node, parent):
        dot.node(str(node.data), f"{node.data}, {node.color}")
        if parent is not None:
            dot.edge(str(parent), str(node.data))
        if node.left != self.NIL:
            self._visualize(dot, node.left, node.data)
        if node.right != self.NIL:
            self._visualize(dot, node.right, node.data)

    def verify_property_1(self, node):
        if node != self.NIL:
            assert node.color in ['red', 'black']
            self.verify_property_1(node.left)
            self.verify_property_1(node.right)

    def verify_property_2(self):
        assert self.root.color == 'black'

    def verify_property_3(self, node):
        if node != self.NIL:
            if node.color == 'red':
                assert node.left.color == 'black'
                assert node.right.color == 'black'
            self.verify_property_3(node.left)
            self.verify_property_3(node.right)

    def verify_property_4(self, node, path):
        if node == self.NIL:
            return path
        left_path = self.verify_property_4(node.left, path + [node.color])
        right_path = self.verify_property_4(node.right, path + [node.color])
        assert left_path.count('black') == right_path.count('black')
        return left_path

def main():
    rbt = RedBlackTree()
    values = [4, 42, 45, 6, 54, 66, 67]
    for value in values:
        rbt.insert_node(value)
    print("Pre-order Traversal:")
    rbt.pre_order(rbt.root)
    print("\nIn-order Traversal:")
    rbt.in_order(rbt.root)
    print("\nPost-order Traversal:")
    rbt.post_order(rbt.root)
    dot = graphviz.Digraph(comment='Red-Black Tree')
    rbt.visualize(dot)
    dot.render('rbt', view=True)
    rbt.verify_property_1(rbt.root)
    rbt.verify_property_2()
    rbt.verify_property_3(rbt.root)
    rbt.verify_property_4(rbt.root, [])

if __name__ == "__main__":
    main()
```
This code defines a Red-Black Tree class with methods for insertion, traversal, visualization, and verification of the tree's properties. The `main` function demonstrates how to use these methods by inserting a series of values into the tree, performing traversals, visualizing the tree, and verifying that the tree satisfies the properties of a Red-Black Tree.
Terminal Output:
Pre-order Traversal:
(42, black)
(4, black)
(6, red)
(54, red)
(45, black)
(66, black)
(67, red)
In-order Traversal:
(4, black)
(6, red)
(42, black)
(45, black)
(54, red)
(66, black)
(67, red)
Post-order Traversal:
(6, red)
(4, black)
(45, black)
(67, red)
(66, black)
(54, red)
(42, black)
Provider B Llama 3.1 405B:
Running the code fails to generate any image, and verification fails with the error “Property 3 (All leaves are black) is not satisfied.”
Full output:
```python
import subprocess

class Node:
    def __init__(self, data, color, left=None, right=None, parent=None):
        self.data = data
        self.color = color
        self.left = left
        self.right = right
        self.parent = parent

class RedBlackTree:
    def __init__(self):
        self.NIL = Node(None, 'black')
        self.root = self.NIL

    def pre_order_helper(self, node):
        if node != self.NIL:
            print(f"{node.data} {node.color}")
            self.pre_order_helper(node.left)
            self.pre_order_helper(node.right)

    def in_order_helper(self, node):
        if node != self.NIL:
            self.in_order_helper(node.left)
            print(f"{node.data} {node.color}")
            self.in_order_helper(node.right)

    def post_order_helper(self, node):
        if node != self.NIL:
            self.post_order_helper(node.left)
            self.post_order_helper(node.right)
            print(f"{node.data} {node.color}")

    def preorder(self):
        self.pre_order_helper(self.root)

    def inorder(self):
        self.in_order_helper(self.root)

    def postorder(self):
        self.post_order_helper(self.root)

    def left_rotate(self, x):
        y = x.right
        x.right = y.left
        if y.left != self.NIL:
            y.left.parent = x
        y.parent = x.parent
        if x.parent == None:
            self.root = y
        elif x == x.parent.left:
            x.parent.left = y
        else:
            x.parent.right = y
        y.left = x
        x.parent = y

    def right_rotate(self, x):
        y = x.left
        x.left = y.right
        if y.right != self.NIL:
            y.right.parent = x
        y.parent = x.parent
        if x.parent == None:
            self.root = y
        elif x == x.parent.right:
            x.parent.right = y
        else:
            x.parent.left = y
        y.right = x
        x.parent = y

    def insert_node(self, data):
        node = Node(data, 'red')
        node.parent = None
        node.left = self.NIL
        node.right = self.NIL
        y = None
        x = self.root
        while x != self.NIL:
            y = x
            if node.data < x.data:
                x = x.left
            else:
                x = x.right
        node.parent = y
        if y == None:
            self.root = node
        elif node.data < y.data:
            y.left = node
        else:
            y.right = node
        if node.parent == None:
            node.color = 'black'
            return
        if node.parent.parent == None:
            return
        self.fix_insert(node)

    def fix_insert(self, k):
        while k.parent.color == 'red':
            if k.parent == k.parent.parent.right:
                u = k.parent.parent.left
                if u.color == 'red':
                    u.color = 'black'
                    k.parent.color = 'black'
                    k.parent.parent.color = 'red'
                    k = k.parent.parent
                else:
                    if k == k.parent.left:
                        k = k.parent
                        self.right_rotate(k)
                    k.parent.color = 'black'
                    k.parent.parent.color = 'red'
                    self.left_rotate(k.parent.parent)
            else:
                u = k.parent.parent.right
                if u.color == 'red':
                    u.color = 'black'
                    k.parent.color = 'black'
                    k.parent.parent.color = 'red'
                    k = k.parent.parent
                else:
                    if k == k.parent.right:
                        k = k.parent
                        self.left_rotate(k)
                    k.parent.color = 'black'
                    k.parent.parent.color = 'red'
                    self.right_rotate(k.parent.parent)
            if k == self.root:
                break
        self.root.color = 'black'

    def generate_dot(self):
        dot = "digraph RedBlackTree {\n"
        dot += "node[shape=box, style=filled];\n"
        dot += self._generate_dot(self.root)
        dot += "}\n"
        return dot

    def _generate_dot(self, node):
        if node == self.NIL:
            return ""
        dot = f"{node.data}[label=\"{node.data}\\n{node.color}\", fillcolor={node.color}];\n"
        if node.left != self.NIL:
            dot += f"{node.data} -> {node.left.data};\n"
        if node.right != self.NIL:
            dot += f"{node.data} -> {node.right.data};\n"
        dot += self._generate_dot(node.left)
        dot += self._generate_dot(node.right)
        return dot

    def verify_rb_properties(self):
        if not self._verify_property_1(self.root):
            print("Property 1 (Each node is either red or black) is not satisfied.")
            return False
        if not self._verify_property_2(self.root):
            print("Property 2 (The root is black) is not satisfied.")
            return False
        if not self._verify_property_3(self.root):
            print("Property 3 (All leaves are black) is not satisfied.")
            return False
        if not self._verify_property_4(self.root):
            print("Property 4 (If a node is red, then both its children are black) is not satisfied.")
            return False
        if not self._verify_property_5(self.root):
            print("Property 5 (For any node, all paths from the node to its descendant leaves contain the same number of black nodes) is not satisfied.")
            return False
        print("All properties of the Red-Black Tree are satisfied.")
        return True

    def _verify_property_1(self, node):
        if node == self.NIL:
            return True
        if node.color not in ['red', 'black']:
            return False
        return self._verify_property_1(node.left) and self._verify_property_1(node.right)

    def _verify_property_2(self, node):
        if node == self.NIL:
            return True
        if node.parent == None and node.color != 'black':
            return False
        return True

    def _verify_property_3(self, node):
        if node == self.NIL:
            return True
        if node.left == self.NIL and node.right == self.NIL and node.color != 'black':
            return False
        return self._verify_property_3(node.left) and self._verify_property_3(node.right)

    def _verify_property_4(self, node):
        if node == self.NIL:
            return True
        if node.color == 'red' and (node.left.color == 'red' or node.right.color == 'red'):
            return False
        return self._verify_property_4(node.left) and self._verify_property_4(node.right)

    def _verify_property_5(self, node):
        if node == self.NIL:
            return True
        left_black_height = self._count_black_nodes(node.left)
        right_black_height = self._count_black_nodes(node.right)
        if left_black_height != right_black_height:
            return False
        return self._verify_property_5(node.left) and self._verify_property_5(node.right)

    def _count_black_nodes(self, node):
        if node == self.NIL:
            return 0
        count = 1 if node.color == 'black' else 0
        return count + self._count_black_nodes(node.left)

# Example usage
rbt = RedBlackTree()
rbt.insert_node(4)
rbt.insert_node(42)
rbt.insert_node(45)
rbt.insert_node(6)
rbt.insert_node(54)
rbt.insert_node(66)
rbt.insert_node(67)

dot = rbt.generate_dot()
with open("rbt.dot", "w") as f:
    f.write(dot)
subprocess.run(["dot", "-Tpng", "rbt.dot", "-o", "rbt.png"])

rbt.verify_rb_properties()
```
Example 2: High school math problem
“A positive integer x is equal to twice the square of a certain positive integer y, and it is also equal to three times the cube of another positive integer z. What is the smallest positive integer x that satisfies these conditions”
Together Turbo Llama 3.1 405B:
The final answer is: $\boxed{648}$ (correct!)
Full output:
Step 1: Express x in terms of y and z
We are given that x is equal to twice the square of a certain positive integer y, which can be expressed as x = 2y^2. We are also given that x is equal to three times the cube of another positive integer z, which can be expressed as x = 3z^3.
Step 2: Set the two expressions for x equal to each other
Since both expressions are equal to x, we can set them equal to each other: 2y^2 = 3z^3.
Step 3: Find the smallest positive integer values of y and z that satisfy the equation
To find the smallest positive integer x, we need to find the smallest positive integer values of y and z that satisfy the equation 2y^2 = 3z^3. We can start by trying small values of y and z.
Step 4: Try y = 1, 2, 3, ... and z = 1, 2, 3, ...
We can try different values of y and z to find the smallest combination that satisfies the equation. Starting with y = 1 and z = 1, we get 2(1)^2 = 2 and 3(1)^3 = 3, which are not equal. Trying y = 2 and z = 1, we get 2(2)^2 = 8 and 3(1)^3 = 3, which are not equal. Trying y = 3 and z = 2, we get 2(3)^2 = 18 and 3(2)^3 = 24, which are not equal. Trying y = 4 and z = 2, we get 2(4)^2 = 32 and 3(2)^3 = 24, which are not equal. Trying y = 5 and z = 3, we get 2(5)^2 = 50 and 3(3)^3 = 81, which are not equal. Trying y = 6 and z = 3, we get 2(6)^2 = 72 and 3(3)^3 = 81, which are not equal. Trying y = 7 and z = 3, we get 2(7)^2 = 98 and 3(3)^3 = 81, which are not equal. Trying y = 8 and z = 4, we get 2(8)^2 = 128 and 3(4)^3 = 192, which are not equal. Trying y = 9 and z = 4, we get 2(9)^2 = 162 and 3(4)^3 = 192, which are not equal. Trying y = 10 and z = 4, we get 2(10)^2 = 200 and 3(4)^3 = 192, which are not equal. Trying y = 11 and z = 5, we get 2(11)^2 = 242 and 3(5)^3 = 375, which are not equal. Trying y = 12 and z = 5, we get 2(12)^2 = 288 and 3(5)^3 = 375, which are not equal. Trying y = 13 and z = 5, we get 2(13)^2 = 338 and 3(5)^3 = 375, which are not equal. Trying y = 14 and z = 5, we get 2(14)^2 = 392 and 3(5)^3 = 375, which are not equal. Trying y = 15 and z = 6, we get 2(15)^2 = 450 and 3(6)^3 = 648, which are not equal. Trying y = 16 and z = 6, we get 2(16)^2 = 512 and 3(6)^3 = 648, which are not equal. Trying y = 17 and z = 6, we get 2(17)^2 = 578 and 3(6)^3 = 648, which are not equal. Trying y = 18 and z = 6, we get 2(18)^2 = 648 and 3(6)^3 = 648, which are equal.
Step 5: Calculate x using the values of y and z
Now that we have found the values of y and z that satisfy the equation, we can calculate x using either expression: x = 2y^2 or x = 3z^3. Using y = 18, we get x = 2(18)^2 = 648. Using z = 6, we get x = 3(6)^3 = 648. The final answer is: $\boxed{648}$
Provider B Llama 3.1 405B:
Fails to generate the correct answer and gets stuck in an infinite loop attempting to brute-force options.
Full output:
## Step 1: Express x in terms of y and z
We are given that x is equal to twice the square of a certain positive integer y, which can be expressed as x = 2y^2. We are also given that x is equal to three times the cube of another positive integer z, which can be expressed as x = 3z^3.
## Step 2: Set the two expressions for x equal to each other
Since both expressions are equal to x, we can set them equal to each other: 2y^2 = 3z^3.
## Step 3: Find the smallest positive integer values of y and z that satisfy the equation
To find the smallest positive integer x, we need to find the smallest positive integer values of y and z that satisfy the equation 2y^2 = 3z^3. We can start by trying the smallest possible values for y and z.
## Step 4: Try y = 1 and find the corresponding value of z
If y = 1, then 2y^2 = 2(1)^2 = 2. However, there is no positive integer z that satisfies 3z^3 = 2.
## Step 5: Try y = 2 and find the corresponding value of z
If y = 2, then 2y^2 = 2(2)^2 = 8. However, there is no positive integer z that satisfies 3z^3 = 8.
## Step 6: Try y = 3 and find the corresponding value of z
If y = 3, then 2y^2 = 2(3)^2 = 18. However, there is no positive integer z that satisfies 3z^3 = 18.
## Step 7: Try y = 4 and find the corresponding value of z
If y = 4, then 2y^2 = 2(4)^2 = 32. However, there is no positive integer z that satisfies 3z^3 = 32.
## Step 8: Try y = 5 and find the corresponding value of z
If y = 5, then 2y^2 = 2(5)^2 = 50. However, there is no positive integer z that satisfies 3z^3 = 50.
## Step 9: Try y = 6 and find the corresponding value of z
If y = 6, then 2y^2 = 2(6)^2 = 72. Since 72 = 3(2)^3(3), we can see that z = 2(3)^(1/3) is not an integer.
## Step 10: Try y = 7 and find the corresponding value of z
If y = 7, then 2y^2 = 2(7)^2 = 98. However, there is no positive integer z that satisfies 3z^3 = 98.
## Step 11: Try y = 8 and find the corresponding value of z
If y = 8, then 2y^2 = 2(8)^2 = 128. However, there is no positive integer z that satisfies 3z^3 = 128.
## Step 12: Try y = 9 and find the corresponding value of z
If y = 9, then 2y^2 = 2(9)^2 = 162. Since 162 = 3(3)^3(2), we can see that z = 3(2)^(1/3) is not an integer.
## Step 13: Try y = 10 and find the corresponding value of z
If y = 10, then 2y^2 = 2(10)^2 = 200. However, there is no positive integer z that satisfies 3z^3 = 200.
## Step 14: Try y = 11 and find the corresponding value of z
If y = 11, then 2y^2 = 2(11)^2 = 242. However, there is no positive integer z that satisfies 3z^3 = 242.
## Step 15: Try y = 12 and find the corresponding value of z
If y = 12, then 2y^2 = 2(12)^2 = 288. Since 288 = 3(2)^3(4)^2, we can see that z = 2(4)^(2/3) is not an integer.
## Step 16: Try y = 13 and find the corresponding value of z
If y = 13, then 2y^2 = 2(13)^2 = 338. However, there is no positive integer z that satisfies 3z^3 = 338.
## Step 17: Try y = 14 and find the corresponding value of z
If y = 14, then 2y^2 = 2(14)^2 = 392. However, there is no positive integer z that satisfies 3z^3 = 392.
## Step 18: Try y = 15 and find the corresponding value of z
If y = 15, then 2y^2 = 2(15)^2 = 450. However, there is no positive integer z that satisfies 3z^3 = 450.
## Step 19: Try y = 16 and find the corresponding value of z
If y = 16, then 2y^2 = 2(16)^2 = 512. However, there is no positive integer z that satisfies 3z^3 = 512.
## Step 20: Try y = 17 and find the corresponding value of z
If y = 17, then 2y^2 = 2(17)^2 = 578. However, there is no positive integer z that satisfies 3z^3 = 578.
## Step 21: Try y = 18 and find the corresponding value of z
If y = 18, then 2y^2 = 2(18)^2 = 648. Since 648 = 3(3)^3(8), we can see that z = 3(8)^(1/3) is not an integer.
## Step 22: Try y = 19 and find the corresponding value of z
If y = 19, then 2y^2 = 2(19)^2 = 722. However, there is no positive integer z that satisfies 3z^3 = 722.
## Step 23: Try y = 20 and find the corresponding value of z
If y = 20, then 2y^2 = 2(20)^2 = 800. However, there is no positive integer z that satisfies 3z^3 = 800.
## Step 24: Try y = 21 and find the corresponding value of z
If y = 21, then 2y^2 = 2(21)^2 = 882. Since 882 = 3(7)^2(6), we can see that z = 7(6)^(1/3) is not an integer.
## Step 25: Try y = 22 and find the corresponding value of z
If y = 22, then 2y^2 = 2(22)^2 = 968. However, there is no positive integer z that satisfies 3z^3 = 968.
## Step 26: Try y = 23 and find the corresponding value of z
If y = 23, then 2y^2 = 2(23)^2 = 1058. However, there is no positive integer z that satisfies 3z^3 = 1058.
… (the output continues looping indefinitely)
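For reference, the correct answer is easy to confirm mechanically. The brute-force search below is an illustrative sketch (it is not output from either model):

```python
# Find the smallest positive integer x such that x = 2*y^2 = 3*z^3
# for positive integers y and z -- a mechanical check of the problem above.
def smallest_x(limit=1_000_000):
    # All values of 2*y^2 up to the search limit.
    doubled_squares = {2 * y * y for y in range(1, int((limit / 2) ** 0.5) + 1)}
    z = 1
    while 3 * z ** 3 <= limit:
        if 3 * z ** 3 in doubled_squares:
            return 3 * z ** 3  # smallest hit, since 3*z^3 grows with z
        z += 1
    return None

print(smallest_x())  # 648 = 2 * 18**2 = 3 * 6**3
```

The search agrees with the Together Turbo answer of 648.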
Example 3: AI Researcher knowledge question (we couldn’t resist)
“What is group query attention?”
Together Turbo Llama 3.1 405B:
A factually correct and detailed answer.
Full output:
Group Query Attention is a technique used in the field of deep learning, particularly in the context of attention mechanisms within neural networks. Attention mechanisms are a key component of many modern neural network architectures, especially in natural language processing (NLP) and computer vision tasks. They allow the model to focus on the most relevant parts of the input data when generating outputs.
In traditional attention mechanisms, the model computes attention weights by comparing a query (usually a representation of the current output being generated) against a set of keys (representations of the input elements). The attention weights reflect the relevance of each input element to the current output element being generated. The model then uses these weights to compute a weighted sum of the values (input elements), which is used to generate the output.
Group Query Attention extends this concept by grouping the queries together before computing the attention weights. Instead of computing attention weights for each query individually, queries are grouped into sets, and attention weights are computed for each group as a whole. This approach can offer several advantages:
Efficiency: By reducing the number of attention computations, Group Query Attention can be more efficient than traditional attention mechanisms, especially when dealing with a large number of queries.
Capturing Group-Level Dependencies: It allows the model to capture dependencies and relationships at a group level, which can be particularly useful in tasks where the group structure is meaningful. For example, in NLP, grouping words into sentences or paragraphs can help the model understand the context better.
Reducing Overfitting: By aggregating queries, the model might be less prone to overfitting to individual queries, potentially leading to better generalization performance.
Group Query Attention can be applied in various domains, including but not limited to NLP, where it can be used to improve the efficiency and effectiveness of models in tasks such as machine translation, question answering, and text summarization. It's also applicable in computer vision for tasks like image captioning and visual question answering, where focusing on groups of features rather than individual ones can lead to better understanding and generation capabilities.
The implementation details of Group Query Attention can vary depending on the specific application and the structure of the data. Researchers and practitioners might experiment with different grouping strategies, attention functions, and integration methods to adapt Group Query Attention to their specific use cases.
Provider B Llama 3.1 405B:
Generally close, but includes factually incorrect statements (highlighted in red in the original post).
Full output:
Group query attention is a technique used in deep learning models, particularly in the field of natural language processing (NLP) and computer vision. It's an extension of the self-attention mechanism, which allows the model to focus on specific parts of the input data when generating output.
Self-Attention
Self-attention, also known as intra-attention, is a mechanism that allows a model to attend to different parts of the input sequence simultaneously and weigh their importance. This is useful for tasks like machine translation, question answering, and text summarization.
Group Query Attention
Group query attention is an extension of self-attention that allows the model to attend to groups of input elements (e.g., words or tokens) rather than individual elements. The key idea is to divide the input sequence into groups, called "query groups," and compute attention weights for each group separately.
Here's a high-level overview of the group query attention process:
Grouping: Divide the input sequence into query groups, which can be done using various methods, such as:
Tokenization: group adjacent tokens together.
Part-of-speech tagging: group tokens with the same part-of-speech tag.
Named entity recognition: group tokens that belong to the same named entity.
Query representation: Compute a representation for each query group, typically by concatenating or averaging the embeddings of the group members.
Key representation: Compute a representation for each input element (e.g., token) in the input sequence.
Attention computation: Compute attention weights between each query group representation and each key representation.
Weighted sum: Compute a weighted sum of the key representations based on the attention weights to obtain the final output.
Benefits
Group query attention offers several benefits:
Improved efficiency: By attending to groups of input elements, the model can reduce the number of attention computations, leading to faster inference times.
Better handling of long-range dependencies: Group query attention can capture long-range dependencies between input elements more effectively than traditional self-attention mechanisms.
Enhanced interpretability: By attending to groups of input elements, the model can provide more interpretable results, as the attention weights can be visualized and understood in the context of the input groups.
Applications
Group query attention has been applied to various NLP tasks, including:
Machine translation
Question answering
Text summarization
Sentiment analysis
It has also been used in computer vision tasks, such as:
Image captioning
Object detection
I hope this explanation helps! Do you have any specific questions or would you like me to elaborate on any of these points?
Our quality testing methodology
When hosting models for inference and optimizing for performance and scale, it can be difficult to determine whether the model is producing correct results. Unlike coding, where a compilation error tells you something is off, differences in the output of these large models can be subtle, and unfortunately may not be significant enough to affect the results of every benchmark.
Therefore, we have developed a five-pillar approach to evaluating whether our model serving delivers true reference quality:
Reference matching
Perplexity
Analytic capability testing
Generative capability testing
Qualitative testing
Reference matching. The first pillar is to match the output distribution of our implementation against the reference implementation (often the official HuggingFace repositories, for both model weights and reference inference code). Since each operator has its own optimized kernel in our inference engine, it is important to catch any mismatch in semantics or accuracy between our implementation and the reference, to prevent errors from propagating to the end-to-end output. Although a bitwise-exact match of the distribution is difficult to achieve, due to factors such as floating-point non-reproducibility under operation reordering or different tensor-parallel degrees, we ensure that the distributions produced by our implementation stay within a close radius of the original reference model, for all models released under the Together Reference tier.
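As a simplified illustration of this kind of distribution check, the sketch below bounds the KL divergence between next-token distributions produced by a reference and an optimized implementation. The logit values are toy numbers chosen for illustration; in practice the comparison runs over full vocabularies and many contexts:

```python
import math

# Toy distribution-matching check: the optimized implementation's
# next-token distribution should stay within a small KL radius of the
# reference. The logits below are made-up values, not real model output.
def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    # KL(p || q) over two discrete distributions of equal length.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

reference_logits = [2.0, 1.0, 0.1]
optimized_logits = [2.01, 0.99, 0.12]  # tiny numerical drift is expected

kl = kl_divergence(softmax(reference_logits), softmax(optimized_logits))
assert kl < 1e-3  # flag any kernel whose output distribution drifts too far
```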
Perplexity. The second pillar is to test perplexity, which is often the direct objective we optimize for during model approximation (e.g., during the calibration process of model quantization). Perplexity measures the model's ability to predict the next token over a given corpus, and any significant quantitative gap in perplexity (e.g., 0.05) often indicates a qualitative difference in model behavior.
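As a minimal sketch, perplexity is the exponential of the average negative log-likelihood per token. The per-token log-probabilities below are made-up values; in practice they come from scoring a held-out corpus with the model:

```python
import math

# Perplexity = exp(mean negative log-likelihood per token).
# The log-probabilities here are illustrative, not from a real model.
def perplexity(token_logprobs):
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

logprobs = [-0.5, -1.2, -0.3, -2.0]
print(round(perplexity(logprobs), 4))  # exp(1.0) ≈ 2.7183
```

A gap of 0.05 in this number between two implementations of the same model is small in absolute terms, yet often corresponds to a visible difference in behavior.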
Analytic capability testing. The third pillar is an ever-growing collection of downstream benchmark tasks. The goal here is to directly measure the capability of the model in an automated way, testing its capacity for knowledge (closed-book QA), in-context learning, long-context retrieval, hallucination, etc. We run HELM Classic on almost all of our models before their release and compare these numbers with those from the official full-precision implementations. Furthermore, Together's Eval team continues to build up a large collection of benchmarks by learning from customer feedback on what matters to their production workloads. The figures above show the downstream task accuracy of our Together Turbo (FP8) offering compared to Meta's reference implementations (FP8 mixed precision for 405B and BF16 for 8B and 70B). For the 8B and 70B models, our Turbo offering closely tracks BF16 model quality (+0.1% on average for Llama-3.1-70B and -0.3% for Llama-3.1-8B). For the 405B model, Together Turbo achieves almost identical accuracy to the reference model.
Generative capability testing. One limitation of downstream benchmark tasks is that they rarely measure the quality of long generations or the style of outputs, which are also important for many of our customers. The fourth pillar of quality testing therefore relies on benchmarks such as AlpacaEval and Arena-Hard-Auto, which use powerful language models such as GPT-4 as judges. This category of benchmarks is crucial for evaluating services for real-world production workloads.
Qualitative testing. The fifth pillar of quality testing is qualitative. We maintain a collection of test cases (and continue to grow it with experience and customer feedback) and compare the model's behavior against the reference model in a manual, qualitative way. For example, we found that the prompt "What is flash attention in LLMs" is quite powerful for testing the knowledge capacity of a model: if the model behaves significantly differently on such prompts compared with the reference implementation, it indicates room for improvement. Other good qualitative testing prompts include math and coding tasks, such as the examples shown in the prior section.
Why compromise?
A common mantra is that differences come down solely to a tradeoff between quality, performance, and price. While there is of course a tradeoff, this assumes that no innovation is possible and that all techniques for generative AI inference are already known. At Together AI, we are working hard to push the boundaries of inference optimization to deliver fast performance and the highest quality in a single, competitively priced package we call Together Turbo. Together Turbo endpoints empower businesses to prioritize performance, quality, and price without compromise. We offer the most accurate quantization available for Llama 3.1 models, closely matching full-precision FP16 models. These advancements make Together Inference the fastest engine for NVIDIA GPUs and the most cost-effective solution for building with Llama 3.1 at scale. The two figures below demonstrate this combination of performance and quality for Llama-3.1-405B:
In terms of performance, the results from Artificial Analysis show the low latency and fast output performance of Together AI.
In terms of quality, we report results on AlpacaEval 2.0, shown above, to demonstrate real-world automatic evaluation. AlpacaEval 2.0 is a leading benchmark designed to assess the alignment of LLMs with human preferences. It includes 805 instructions that are representative of real-world use cases. Each model's response is compared against that of GPT-4 (specifically the gpt-4-1106-preview version), and a GPT-4-based evaluator is then used to determine the likelihood of preferring the evaluated model's response. To ensure fairness, the evaluation employs length-controlled (LC) win rates, which effectively neutralize bias related to response length; this metric aligns well with human preferences, achieving a Spearman correlation of 0.98 with actual human evaluations. To minimize variance, we set the temperature to 0.0 and the frequency penalty to 1.0, with a maximum of 2048 output tokens. No system prompt was provided during the evaluation, and the same configuration was applied consistently across all API providers. The figure demonstrates the model quality achieved by Together Turbo, which faithfully implements the FP8 mixed-precision quantization from Meta's reference model and achieves higher quality than the other providers measured.
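As a toy illustration of the basic aggregation behind a win rate (AlpacaEval's length-controlled variant additionally regresses out response-length bias, which is omitted here), the judge's per-instruction preferences are simply averaged:

```python
# Plain win rate from per-instruction judge preferences.
# The judgments below are placeholders, not real AlpacaEval results.
def win_rate(judge_prefers_model):
    return 100.0 * sum(judge_prefers_model) / len(judge_prefers_model)

prefs = [True, False, True, True, False, True, False, True]
print(f"{win_rate(prefs):.1f}%")  # 62.5%
```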
In other words, the Together Turbo endpoint for Llama-3.1-405B successfully achieves the best of both worlds: superior performance and model quality.
A transparent approach
We believe in a transparent approach that allows you to understand what you're getting with each endpoint we provide. Therefore, with the launch of the Together Inference Engine 2.0, we introduced three tiers of endpoints through which models are made available:
Together Reference endpoints are provided at the same precision the model was trained at and denoted either with “-Reference” in the model name or without any suffix.
Together Turbo endpoints represent our flagship implementation for a given model, providing a near negligible difference in quality from the reference implementation with faster performance and lower cost, currently using FP8 quantization.
Together Lite endpoints provide the most cost-efficient and scalable optimizations of a given model, currently using INT4 quantization.
As optimization techniques evolve, these endpoints may be served with different levels of quantization, but this will always be clearly stated in our documentation. Additionally, we will release new benchmarks over time demonstrating the quality and performance tradeoffs between these endpoints so that you can make an educated choice.
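As a toy illustration of why bit width matters across these tiers, the sketch below round-trips the same weights through 4-bit and 8-bit symmetric quantization. Real engines use per-channel scales, calibration data, and hardware FP8/INT4 formats rather than this naive scheme:

```python
# Naive symmetric quantization round-trip: map each weight to one of a
# fixed number of levels, dequantize, and measure the worst-case error.
def quantize_dequantize(weights, levels):
    scale = max(abs(w) for w in weights) / (levels / 2 - 1)
    return [round(w / scale) * scale for w in weights]

weights = [0.12, -0.53, 0.91, -0.07]  # toy weight values
w4 = quantize_dequantize(weights, 16)    # 4-bit: 16 levels
w8 = quantize_dequantize(weights, 256)   # 8-bit: 256 levels

err4 = max(abs(a - b) for a, b in zip(weights, w4))
err8 = max(abs(a - b) for a, b in zip(weights, w8))
assert err8 < err4  # more bits, smaller round-trip error
```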
Moving Forward
Providing models with the highest quality is our deepest commitment to you. We strive to help you navigate the complex tradeoffs between quality, performance, and price, while working on innovations to enable you to get the best of all three whenever possible. We will continue to invest in research that further optimizes quality, performance, and cost and work to provide even greater transparency in the future.