Mesa Graphics Testing Services 

 

The LunarG test system is a service that provides regular regression testing on Mesa releases for Intel and AMD OpenGL and Vulkan graphics drivers. It also enables users to test their own Mesa builds and compare results to LunarG baselines, LunarG test runs, or user test runs.

Objectives of the LunarG test system:

  1. Detection of regressions (both image rendering and performance) from one Mesa release to the next.
  2. Provide a service to Mesa Developers to test out their development branches for regressions before merging.
  3. Provide ongoing testing on Mesa Releases and maintain a history of results that is helpful in continued monitoring of Mesa driver release quality as it relates to Steam games.
  4. Provide a maintained baseline of known good images and performance for various Steam game frames. Note: Known good images are not “Golden”, but rather are a snapshot taken at a point in time to enable detection of regressions (changes) in future test runs. If it is determined that a future run is better than the snapshot, the baseline will be updated. So over time the baselines will improve, but they will not be perfect.
  5. NOT an objective: a performance benchmarking test suite. The tests in the Mesa testing system are not a measure of overall game performance, since game logic is not executed during trace replay. The tests are focused on Mesa driver functionality and performance through the GPU.

The test suite is a collection of trace files created from hundreds of Linux games from Steam. These trace files are used to automate testing of rendering correctness as well as game performance. There are six graphics platforms available for testing (three configurations each of Intel and AMD graphics); for Vulkan, there are two platforms available (one each of Intel and AMD). The full suite of games typically runs overnight, with results posted the next day.

Types of Tests

Each trace represents a gameplay sequence that is a reasonable representation of actually playing the game.

For each Steam game, the following tests exist:

  1. Game.replay - The game replay action. When the trace is made, the game is actually played, and a reasonable play-through is captured so that the trace is a good representation of playing the game.
  2. Game.loadtime - The time from game startup to when game play actually begins.
  3. Game.loopXX - A loop inside the game of a frame (or perhaps multiple frames) that is repeated for many iterations while FPS is measured. Note: Loops are not yet available for all Vulkan based tests.
  4. Game Image # - For tracking rendering correctness, each game also has a series of screenshots captured as images.

How performance is measured

Starting with the Mesa 18.1.0 branch, each test type is run twice and the best result of the two is reported. Previously (before Mesa 18.1.0), we reported the average of the two test results.
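
As an illustration only, here is a minimal sketch of that selection rule (the function name, parameters, and the boolean switch are hypothetical, not part of the test system; for load-time tests, "best" would presumably be the shortest time rather than the highest FPS):

    # Hypothetical sketch: pick the value reported for a test that was run twice.
    def reported_fps(run1_fps: float, run2_fps: float, best_of_two: bool = True) -> float:
        """Best of the two runs (Mesa 18.1.0 and later) or their average (earlier releases)."""
        if best_of_two:
            return max(run1_fps, run2_fps)      # current behavior: best of two
        return (run1_fps + run2_fps) / 2.0      # pre-18.1.0 behavior: average

    print(reported_fps(118.4, 121.7))           # -> 121.7
    print(reported_fps(118.4, 121.7, False))    # -> 120.05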

What is a Performance Regression?

When a performance test is compared against the chosen baseline, a regression is flagged if there is a loss of performance.

You can view the Performance tab for the detailed results and sort on performance difference if you want to see finer-grained performance changes.

Regarding High FPS tests

We use a simple >10% slowdown metric for tagging regressions. However, if the performance test baseline is running at 3000 FPS and the test run was 2500 FPS, how important is that regression? Keep that in mind when viewing the performance regressions.
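
As a rough illustration of that metric (a sketch only; the parameter names are hypothetical and the actual system's threshold handling may differ):

    # Hypothetical sketch of the >10% slowdown check used to tag a performance regression.
    def is_performance_regression(baseline_fps: float, test_fps: float,
                                  threshold: float = 0.10) -> bool:
        """Flag a regression when the test run is more than 10% slower than the baseline."""
        slowdown = (baseline_fps - test_fps) / baseline_fps
        return slowdown > threshold

    # The high-FPS example from the text: 3000 FPS -> 2500 FPS is a ~16.7% slowdown,
    # so it is flagged even though both rates are far beyond what a display can show.
    print(is_performance_regression(3000.0, 2500.0))   # True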

It is possible that at the time of tracing the FPS was much lower because the game was doing CPU calculations. When the trace was captured (during game play), the results of those calculations were stored in the trace file as the resulting OpenGL/Vulkan API calls. So at replay time the FPS can be very high, since the CPU calculations are no longer being done.

What is a Rendering Regression?

Technically, any change in pixel value from one run to the next is a regression. But we all know that slight changes to drivers and algorithms can lead to variations in pixel values that are still “correct”. So for simplicity's sake, the detailed test results tag anything with a non-zero pixel difference. In truth, the images need to be examined and judgment applied to determine whether a change is a regression.
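
For illustration, a minimal sketch of what a non-zero pixel difference count means, using NumPy arrays as a stand-in for the captured screenshots (the actual comparison tooling is not specified here):

    import numpy as np

    # Hypothetical sketch: count the pixels that differ between a baseline screenshot
    # and a test-run screenshot.  Any non-zero count is tagged in the detailed results,
    # even though the change may still look "correct" to a human.
    def differing_pixel_count(baseline: np.ndarray, test: np.ndarray) -> int:
        """Number of pixels whose color values differ in any channel."""
        assert baseline.shape == test.shape, "screenshots must be the same size"
        return int(np.any(baseline != test, axis=-1).sum())

    baseline_img = np.zeros((1080, 1920, 3), dtype=np.uint8)
    test_img = baseline_img.copy()
    test_img[10, 20] = (1, 0, 0)                           # one pixel off by a single unit
    print(differing_pixel_count(baseline_img, test_img))   # 1 -> tagged as a difference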

Benign changes (False Negatives) are pixel differences that cannot be perceived by the human eye or that, in LunarG’s judgment, are not incorrect. For each Mesa release test run, these False Negatives are identified by test suite admins and flagged as “Only Noise”. In the detailed image report, a “~” is used to identify these situations.

LunarG Baselines

Baseline Definition:

A baseline is a snapshot of known good images and performance values for the Steam game trace tests, taken at a point in time and used to detect regressions (changes) in future test runs.

Maintaining the Baseline

For each new Mesa release, the previous release’s baseline is used as a starting point and updated with “better” results from the current Mesa release test run.

A baseline is updated for a test under the following situations:

  1. If performance continuously improves for a performance-based test over 3+ consecutive test runs (see the sketch after this list)
  2. If an image regression is determined to be better than the current image in the baseline
    1. LunarG prefers to get guidance from a Mesa developer in this situation.
  3. If an image change is considered benign and the same difference has occurred consistently for 3 or more consecutive test runs.
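
As one reading of the first rule, here is a sketch of the kind of check involved for performance baselines (the data layout and function name are hypothetical; image baseline updates remain a human judgment call):

    # Hypothetical sketch of rule 1: update the performance baseline only when the
    # last few consecutive test runs all beat the current baseline value.
    def should_update_perf_baseline(baseline_fps: float,
                                    recent_fps: list[float],
                                    required_runs: int = 3) -> bool:
        """True if the most recent `required_runs` results are all better than the baseline."""
        if len(recent_fps) < required_runs:
            return False
        return all(fps > baseline_fps for fps in recent_fps[-required_runs:])

    print(should_update_perf_baseline(60.0, [62.1, 63.0, 64.2]))   # True
    print(should_update_perf_baseline(60.0, [62.1, 59.0, 64.2]))   # False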

Note: If you are a Mesa developer with good insight into what the baseline values should be, you can request to be added to the “Baseline Maintainer” group, which will give you the privilege to update baselines. To make this request, send email to info@lunarg.com.

Test Runs

On the left panel, there are three categories:

  1. Mesa Releases - This test run is a run of all tests against a Mesa release. The name is “‘Mesa Release’ ‘HW’”. These runs are created by LunarG and are available for users of the system to view. Each of these test runs has an associated “published test report” that shows the test results relative to the maintained baseline. This published test report becomes visible once LunarG has moved the Mesa Release test run to a status of “done” and is accessed by clicking the “View Test Report” button.
  2. User Tests - These are test runs specified by a user of the Mesa test system. A user test run specifies a URL and build SHA (commit) that identify a Mesa driver build to be tested. The name of the test run is automatically generated as a concatenated string composed of “user date HW commit”. By default, a logged-in user sees only their own test runs. If a user chooses to make a test run public, it is also displayed as selectable for viewing.
  3. Baselines - These are the LunarG maintained baselines for each Mesa release.

Viewing Test Results

Click on a test run in the left list to view it. Once viewing a test run, the user can compare its results as follows.

If the test run selected was a Mesa Release, the user can select:

  1. “Auto Generated Report” to see the test results relative to the maintained baseline
  2. “Compare Test Results” to compare the test run results to another test run or baseline.

If the test run selected was a “User Test”, the user can select:

  1. “Compare Test Results” to compare the test run results to another test run or baseline.

The “Compare Test Results” function allows the user to select a baseline to which these test results will be compared. The user can select from any existing completed test run or from the LunarG baselines associated with Mesa releases. Once a baseline for comparison is chosen, a test report is created that shows image regressions and performance regressions.

The “Auto Generated Report” function displays the test run results compared to the maintained LunarG baseline. Because history is maintained over consecutive Mesa release runs, this report has some additional capabilities:

  1. Summary Report: You have access to a summary report for performance and image regressions as well as all the detailed results. Any performance degradation of 10% or more is included in the summary report. Any non-benign image differences are included in the summary report (False Negatives are not logged as failures in the summary report).
  2. Regression/improvement patterns over time
    1. Images - You can see how many consecutive test runs have had the same pixel count difference. Benign changes (False Negatives) are flagged with “~”.
    2. Performance - You can see the performance trend of a test over time.

Detailed Report - Images

You can view baseline images vs. test run images with overlay and difference tools to visually examine the regressions. You can sort any column in the detailed report to select failures of a specific range of interest.

Explanation of Fields

  1. Name - name of the game
  2. Image - screenshot number within a given game/test
  3. Difference - Difference between the baseline and the test run. For performance tests this is a percentage. For image regressions this is the count of different pixels.

The following fields are only available to published test reports associated with a Mesa Release:

  1. Comment - Notes from LunarG intended to be seen by Mesa developers.
  2. Regression History - For the last 6 consecutive Mesa Releases, you will see the regression values.
  3. Baseline Version - Mesa version for the baseline.

Detailed Report - Performance

You can see the performance result for every test in the test run. You can sort any column in the detailed report to select failures of a specific range of interest.

Explanation of fields

  1. Name - name of the game.
  2. Description - Indicates if it is a loadtime, replay, or loop performance test.
  3. Baseline - the FPS value for the test in the baseline.
  4. Test Run - the FPS measurement for the test in this test run.
  5. Difference - A percentage that represents how much the performance of this test differs relative to the baseline (see the sketch below).
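
For clarity, the Difference percentage can be read as follows (a sketch of the presumed calculation; the exact formula and sign convention used by the report are not documented here):

    # Hypothetical sketch: the Difference column as the percent change of the
    # test-run FPS relative to the baseline FPS.
    def difference_percent(baseline_fps: float, test_run_fps: float) -> float:
        """Negative values mean the test run was slower than the baseline."""
        return (test_run_fps - baseline_fps) / baseline_fps * 100.0

    print(difference_percent(100.0, 88.0))    # -12.0  (would show up as a >10% regression)
    print(difference_percent(100.0, 105.0))   #   5.0  (a modest improvement)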

The following fields are only available to published test reports associated with a LunarG Mesa Release Run:

  1. Comment - Notes from LunarG intended to be seen by Mesa developers.
  2. Regression/Improvement - A bar chart that shows the FPS over consecutive Mesa releases for this test.
  3. Baseline Version - Mesa version for the baseline.

Performance Dashboard

You will notice a “Dashboard” button top and center. This dashboard displays by default when you enter the test system and provides performance data over time.

The values on the Y axis are normalized FPS values for all games and their tests. The X axis is the Mesa releases in the order they were run. The normalization on the Y axis is done by finding the fastest FPS value for a test and marking it as 100%; all remaining FPS values for that test are then a percentage of this fastest value.
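
A sketch of that normalization for a single test across several releases (release names and values are made up for illustration):

    # Hypothetical sketch: normalize one test's FPS across Mesa releases so that the
    # fastest release is 100% and every other release is a percentage of it.
    def normalize_fps_history(fps_by_release: dict[str, float]) -> dict[str, float]:
        fastest = max(fps_by_release.values())
        return {release: fps / fastest * 100.0
                for release, fps in fps_by_release.items()}

    history = {"18.0.0": 450.0, "18.1.0": 500.0, "18.2.0": 480.0}
    print(normalize_fps_history(history))
    # {'18.0.0': 90.0, '18.1.0': 100.0, '18.2.0': 96.0}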

Game traces are updated or modified periodically for the following reasons:

  1. The game itself has been updated, and the game trace was updated to reflect the more current API calls being used by the game.
  2. A necessary modification to the tracing and replaying utility was identified which required the game to be re-traced.

Updating a trace file results in new loop (performance) tests being created. The new loop tests cannot be guaranteed to loop over the same API calls as the previous loop tests, so the new loops could have faster or slower FPS values depending on the API calls within the loop.

Consequently:

  1. At the highest level of the dashboard (DASHBOARD - Platforms): The lines on each graph are dotted. This is a hint that increases or decreases in performance cannot be guaranteed to be real, because trace files may have changed, which does not allow for an "apples to apples" comparison over time. You should drill down into the Games level for more precise detail.
  2. At the 2nd level of the dashboard (DASHBOARD - Games): A contiguous line means that all of the games and their tests represent trace files that were not modified. As soon as any game has a modified trace file, a break in the line is created and it is shown as a new point. If all you see are points on the graph, then the trace files were changed frequently enough that there are no contiguous lines. At this level of the dashboard, increases and decreases in performance are accurate.
  3. At the 3rd level of the dashboard (Game Tests): Each game is a group of tests consisting of loop, replay, and load-time tests:
    1. Replay - Once the game is loaded, this is the FPS value for the entire traced game play.
    2. Loop - Areas within the game are chosen for looping. These are smaller, tighter loops focused on a few frames.
    3. Load-time - The clock time it took to complete the initial loading of the game.

 

Trace Library Maintenance

The trace library is continuously maintained over time:

  1. Traces are from the top 200 games played on Steam. This top 200 list is evaluated every quarter to determine if games should be added or culled.
  2. We keep the number of games (tests) to a number that allows a test run to complete within 24 hours. Hence culling may be required.
  3. Games update periodically as well, and we periodically refresh the traces for a game as it updates on Steam so that we capture new usages of OpenGL introduced by the game updates.

Note: Due to the explicitness of the Vulkan API, a game's trace is created for each GPU. For example, it is not reasonable to create a trace of a game on an Nvidia GPU and then replay it on an AMD GPU (although there is some portability in the trace/replay tools across GPUs, doing so modifies the behavior of the application and may not fully test the specific GPU as used by the game). The fact that there are unique traces for each GPU for a game is mostly invisible to the user of the system. However, it is possible that "game A" has not been traced on all available GPUs, and hence on some hardware configurations "game A" may not be an available test.

Receiving notifications

If you would like email notifications when a new test result is ready for a Mesa release, create a user account and sign up for the notifications.

Creating an account will also allow you to post questions/issues on the Mesa release test runs.

Requesting Test Runs

Creating an account will also allow you to request private test runs on your private Mesa builds.