24 Debugging
Alistair_Findlay edited this page 2024-02-18 13:04:01 +01:00

Debugging

As debugging is a very complex topic, this page will compile tips specifically for debugging 0 A.D. problems. For info about profiling the game, see EngineProfiling.

Tools

The tools you use for debugging depend on your operating system of choice.

Windows

  • Visual Studio - The basic tool for debugging the game on Windows. Break into the debugger on a breakpoint, on a crash or assertion failure, or any other time. Visual Studio Community Edition is free and contains similar debugging features. Can be used to analyze crash dumps and get a useful call stack.
  • WinDbg - Part of the Windows SDK, a very powerful debugging suite which is primarily command line driven, unlike Visual Studio. Analyze crashes in more detail than VS.
  • DebugView - If you don't run the process in a debugger, DebugView lets you view its normally hidden debug output. Users can install and run this much more easily than a full debugging suite.
  • VMMap - Free tool from Microsoft to analyze the virtual memory usage of a process; shows fragmentation, can be useful for observing memory leaks or finding why a large allocation fails.
  • OpenGL Debugger - Debug and profile OpenGL applications. Useful for debugging GL errors and finding unexpected behavior.
  • Notepad++ - Small, simple, powerful text editor. You need a decent text editor on Windows.
  • A hex editor, like Bless - Useful for examining binary simulation state dumps, either for saved games or serialization errors.
  • VirtualBox - Virtual machines are very convenient for testing the game on other operating systems, or for testing multiplayer games locally. VBox is FOSS and works quite well.
  • Dependencies - Replacement for Dependency Walker, which allows viewing linked DLLs and symbols for Windows binaries. Very useful for troubleshooting missing DLLs and symbols, and verifying the game isn't depending on old VC runtime libraries.

TODO: I know there are more...

Linux

  • gdb - The basic tool for debugging in the GNU toolchain. gdb lets you e.g. break into the engine on a breakpoint or when a seg fault or assertion failure occurs. You can start your process in gdb or attach gdb to a running process.
  • valgrind - Debugging and profiling suite. Find memory leaks, invalid memory accesses and more.
  • OpenGL Debugger - Debug and profile OpenGL applications. Useful for debugging GL errors and finding unexpected behavior.
  • A hex editor - Useful for examining binary simulation state dumps, either for saved games or serialization errors.
  • VirtualBox - Virtual machines are very convenient for testing the game on other operating systems, or for testing multiplayer games locally. VBox is FOSS and works quite well.

macOS / OS X

  • Xcode - Free IDE for development on macOS, also has suite of debugging tools.
  • gdb
  • lldb - Part of the LLVM project, replacement for gdb in recent Xcode. LLDB equivalents for GDB commands can be found here.
  • VirtualBox - Virtual machines are very convenient for testing the game on other operating systems, or for testing multiplayer games locally. VBox is FOSS and works quite well.

Debugging Crashes and Assertion Failures

A crash, generally speaking, is when the game encounters an error and ceases to run at some point. In some crashes, the process will stop running and can no longer be debugged (e.g. a segmentation fault on *nix), or it will enter a debug state and wait for user action. An "assertion failure" may also be referred to as a crash, and often leads to a crash if ignored, but they are simply conditions that have been programmed as necessary for the software to run correctly. As 0 A.D. is still in development, there are many possible assertion failures and crashes.

You want two things to debug a crash: 1) steps to reproduce, and 2) a call stack (i.e. back trace).

Reproducing the crash

Important info to gather:

  • Build environment - custom build, SVN autobuild, or release package? Which compiler version?
  • Hardware (e.g. system_info.txt)
  • Operating system (e.g. system_info.txt)
  • Which version of the game was the user playing?
  • What was the user doing when the crash occurred?
  • Where there any errors or visible problems before the crash? (e.g. interestinglog.html)
  • What are the minimal steps to get the crash? Is it consistent?

See ReportingErrors, for the recommended process of reporting crashes and errors.

There are many different causes of crashes (running out of memory, heap corruption, invalid pointers or iterators, buggy drivers - to name only a few). Sometimes they are only reproducible with one OS, one type of GPU drivers, one version of a compiler, one type of CPU, or only after following a complex series of steps. Collecting the above information can point you and others in the right direction.

Sometimes, but not always, when the game crashes it will create a crashlog.txt in the log directory (see GameDataPaths). This file contains basic system info and often a call stack. On Windows, a crashlog.dmp may also be created, from which a call stack can be obtained.

Call stack on Windows

Reading crashlog.dmp:

  • It can be opened in Visual Studio or WinDbg. For this to be useful, you need to have the debug symbols and source code matching the affected build of the game. Most users use the autobuild version of the game, you can simply download the correct autobuilt binaries from SVN. Or for a release, install that particular version of the game and acquire the matching source package from http://releases.wildfiregames.com/. In the future we should automate this process, see #290.
  • You also need to set up symbol paths and a cache location for Microsoft symbol server, see Use the Microsoft Symbol Server to obtain debug symbol files.
  • In Visual Studio, after setting up your debug symbol paths, open the crashlog.dmp and choose to debug natively. You should get a crash of some kind, then you can break into the debugger. The call stack window will show you which functions were being called at that point. Note that in release builds, some data will be optimized out and not easily viewable.
  • In WinDbg, after setting up your debug symbol and source code paths, open the crash dump and use the ~*kp command to get a full call stack of each thread. See this extremely helpful article for more useful commands in WinDbg. For example, .frame 3 lets you set the current stack frame to !#3 (the 3rd from the top of the call stack), then you can e.g. use the source code window to see exactly the line of code matching this function call, and locals window to see the variables in that function. Note that WinDbg can often open dump files that VS fails to open.

Without crashlog.dmp, the process is slightly more difficult:

  • If crashlog.txt was created and it contains a call stack, you may have enough information there to find the source of the crash.
  • However, if the crashlog or stack dump failed for some reason, you either need to reproduce the crash locally, or get the user to run the game in a debugger (e.g. WinDbg).
  • A third option on newer versions of Windows is to have the user create a memory dump from Windows task manager. The user can find pyrogenesis.exe in the task manager, right-click it and choose Create Dump File. Beware the resulting MEMORY.DMP will be very large as it contains all memory pages being accessed by the process at the time, but it may be compressed with e.g. 7-Zip down to a more reasonable size.

Call stack on Linux / macOS

On Linux or macOS, you won't have a crash dump, so the best tool for getting your call stack is gdb. Like Windows, you need symbols to be set up properly to make sense of what gdb tells you. If you just see a bunch of hex numbers (addresses of functions) and no names of functions, then you don't have symbols set up correctly. If you're using a release package of the game on Linux, the symbols may have been omitted to reduce the package size, but there may be an optional debug package that can be installed (e.g. on Debian and Ubuntu). Note: debug symbols do NOT require a debug build of the game. A debug build disables optimizations to make some debugging easier (at the expense of running very slowly), but a release build also includes debug symbols, optionally.

Once you have debug symbols, if you can reproduce the crash, run the game in gdb (e.g. gdb ./pyrogenesis, or use lldb on recent macOS) and make it crash. Then you will return to the gdb command line. The most helpful command is bt to get the backtrace (call stack) at the moment of the crash. Often it's even more helpful to use t a a bt full to get the full backtrace of all running threads in the game. set height 0 is useful if gdb keeps prompting you to continue, when viewing a long backtrace.

gdb is quite powerful and has more features than can be reasonably explained here, so check the manual or search for tutorials. One thing you might find useful is selecting the current stack frame with frame n, e.g. frame 0 is the top of the stack. Then you can use info locals to view local variables in that stack frame. Note that in a release build, many of these will be optimized out, or the structure may be too complex for gdb to understand.

Debug symbols

Debug symbols relate a binary build of the game (and its libraries) to the source code used to compile the binary. Obviously, if the source code changes, the debug symbol info will also change. Debug symbols are invaluable in debugging the game, as they provide a "friendly" view of what a running process is doing or was doing. Function calls, parameters and variable contents can be inspected with the aid of debug symbols. Acquiring and setting up the matching symbols is critical to debugging crashes. This is a brief overview of where symbols come from and how to get them.

  • Debug symbols can contain a lot of data (10+ MB each is not uncommon), and most users aren't interested in debugging software, so often the symbols are omitted from release packages. This is very common with Linux packages.
  • For Windows builds, WFG manages the distribution of the game via SVN and official releases (see #290, #1684). Some but not all debug symbols are distributed as PDB files, which Visual Studio and WinDbg can read. These are generated and committed by the autobuild process.
    • Symbols for Windows libraries are distributed by Microsoft and can be acquired from their public symbol server.
    • Symbols for proprietary drivers (e.g. graphics drivers) are typically not publicly distributed.
    • If symbols are missing: this is generally the case with the game's libraries (SpiderMonkey, FCollada, NVTT, etc.) if they were built by WFG. In this case, you may have no choice but to rebuild the library in question and the game, then try to reproduce the crash once debug symbols are obtained. See #1684.
  • For Linux builds, the package maintainers handle the distribution of the game. It is up to them to choose how or whether they will distribute the debug symbols.
    • If symbols are missing: first check if there is a "debug" package of the module available. If not, the same advice applies as for Windows: try to build the library with debug symbols and reproduce the crash.

Note: It is still possible to debug a crash without symbols, by inspecting the disassembled machine code from the binary. In other cases, it is useful to see what exactly the given source code has compiled into (to see the precise effect of optimizations, for example). However, this is a complex and difficult process, and well beyond the scope of this page.

Debugging Out of Sync and Serialization Errors

Replay mode

Replays are created for every game session, see replays in the userdata directory (GameDataPaths). They are organized by date. commands.txt stores not only commands sent by each player each turn, but periodically the hash of the simulation state on a given turn. Replay mode is activated with the following command line option: -replay=/path/to/commands.txt. It's best to run the replay in a debugger (Windows) or command prompt (*nix) to view its output, no separate window is created.

Important: Don't forget to specify which mod(s) to load with the command mod=public (or whatever mod you want to load) when running replays. Else the loader won't be able to find the mod-dependant data (like maps, pathfinding codes, etc).

When replaying a commands.txt, the hashes are checked periodically. This can also be used to verify that no breaking changes are introduced by a new patch. Replay mode also creates a profile.txt with profiling information every 20 turns, this can be processed into a nice graph as explained here.

Out of sync

Out of sync (OOS) and serialization errors are generally difficult to debug, but knowing where to look can make this process simpler. An OOS error occurs in a multiplayer game when one player's serialized simulation state isn't identical to another player's serialized simulation state (breaking the concept of network synchronization). This is detected in-game by hashing the serialisation data. There are two types of hashes: quick-hashes that only hash positions and the "full" ones.

The following data are useful to collect to debug OOS:

  • oos_dump.txt - a human readable snapshot of the simulation state near the point of OOS detection, created on each player's computer. Found in the logs folder, see GameDataPaths. Each player should zip these files and send them to the person troubleshooting the bug. The most useful one to have is the OOS player. Note that OOS logs may not represent the exact turn of the OOS (because of network delays): that can be accounted for, see below.
  • commands.txt - this is the commands issued by each player during the game, which can be used to replay the game exactly as it happened. The commands should also contain the hashes: you should be able to replay the OOS or non-OOS track with a given revision.

Other useful information:

  • Each player's game version - SVN revision, recompilation status, presence of mods. Note that players saying they are on the same revision can be reasonably doubted, as there's more than a few moving parts.
  • OS and hardware info for each player (system_info.txt) - Some serialization bugs are platform specific, so knowing the systems involved is key to reproducing the error.

Known causes of OOS

OOS errors can be tricky, and can have a wild range of causes. Sometimes the cause is improbable or unexpected, so it pays to keep an open mind. It also pays to know what the most common causes are:

  • Rejoining a multiplayer game with AIs - because the AIs don't fully serialize their state, the rejoining player's state will differ and cause an OOS, see #1089.

Common causes:

  • For the past few alphas OOS have mostly been "OOS on rejoin". If that's the case, the most likely explanation is a bad serialisation (something is forgotten or incorrect). In those cases, one should check the OOS_dump for deltas (see below), and the most obvious candidates are the code that has been changed recently. The -rejointest (see below) can usually make debugging these easy.
  • A small part of all OOS issues are actually caused by Undefined Behaviour in C++. This is a long topic, but the most likely cause is that some variable was not initialised properly. Again, this should be detectable via the OOS dump, and the recent commits are the most suspicious.

Less likely causes:

  • Actual non-determinism is possible, though we have not seen one for years. This could be caused by e.g. the computer architecture differing.
  • SpiderMonkey has some JIT compiler issues (see #2000 for example). If you want to rule out the JIT compiler, you can disable it by uncommenting the lines in ScriptInterface.cpp that set the options JSOPTION_JIT and JSOPTION_METHODJIT. If it is the JIT compiler, you should still try to find the exact location of the problem because disabling JIT compiling completely is bad for performance.
  • Data affecting the simulation is kept past the runtime of one game (see #2285 for example). In this case OOS errors typically occur if one or more players have previously played another game without shutting down the engine afterwards. It's difficult to troubleshoot with replays because they always start a fresh instance of the engine.
  • Due to SpiderMonkey internals, floating point operations can differ slightly on different machines and different architectures (see Bugzilla #531915).

Diffing the OOS dump

The easiest place to begin is doing a simple text diff of the oos_dump.txt files to see where they differ. In Windows, you can use the diff tool built into TortoiseSVN, TortoiseGit, or some other tool, on *nix you simply use diff. Your diff tool may not like the binary data spit out by the CCmpAIManager component, so you can remove that by editing manually in a text editor. Note that in multiplayer games, a hash of the full simulation state only occurs every 20 turns for performance reasons, so it's possible that the states began diverging earlier

In the best case, you will see a small diff of changes comparing two or more of the dump files. This can point you to the component and property that differ, then by analyzing the code that writes to that property, you can see if it does anything unsafe, or (in the case of a C++ component) if it's not serialized correctly.

The diff will contain false positives: range visualisations & local entities (waypoints...) appear, but are not part of the actual serialized data. Those can be ignored.

It is possible that the difference only happens in the 'binary' OOS dump. Usually, this happens because the JSON representation of the dump doesn't allow the actual value or (for a JavaScript value) the difference is in SpiderMonkey's internal representation of the data. This has been reported before with e.g. NaN, having a single bit difference depending on the JIT behavior (see #1879). Note that NaN is no longer serialized.

You can apply a patch to manually generate OOS dumps on a replay, and save/load to cut down time: https://code.wildfiregames.com/P237#2058

Rejointest & Serialization Test

Two options can be used to reproduce OOS: the -rejointest=turn and the -serializationtest option.

The former is much faster and often sufficient: it starts a second instance of the game at a given turn (as if a new player rejoined the game) and the runs it, comparing hashes. If you have an OOS on rejoin, a little exploration will usually trigger it.

The -serializationtest command line option tells the simulation to do a full test of the simulation state every turn. This will be extremely slow compared to a normal run of the game, but can reveal problems the moment they occur. Note: it will currently always fail with AI players, since they don't serialize properly.
Note: sometimes, OOS only happen over several turns. This makes them unreproducible with the basic serializationtest.

When the serialization test fails, an error window will be shown, but the more useful data is created in the game's log folder (see GameDataPaths) inside oos_log. In the following filenames, .a means data from the primary simulation state (e.g. the one used in a typical game), while .b is data from the secondary simulation state (e.g. the one being reconstructed every turn to compare with the primary state). When they differ, it's a serialization test failure.

  • debug.after.a / debug.after.b - A good place to begin, these are the debug output of the serializer after the current turn updates and can be compared with a diff tool (like oos_dump.txt for OOS errors).
  • state.after.a / state.after.b - Binary dump of the simulation state after the current turn update occurs. If an error occurs here, it is probably because some data that affects the simulation state isn't being serialized.
  • state.before.a / state.before.b - Binary dump of the simulation state before the current turn update occurs. If an error occurs here, it's probably a bug in the (de)serializer or the way it is (de)serializing the data.
  • hash.after.a / hash.after.b / hash.before.a / hash.before.b - hash values used to compare the above states.

If the debug dumps aren't helpful, the binary dumps can be viewed with a hex editor. Perhaps even more useful is a binary diff tool that can show exactly the offset of each difference. Because some ASCII data gets encoded in the binary simulation state, it is often possible to determine which component and property contained the difference.

OOS logging mode

There is another useful option for debugging serialization errors, the -ooslog option. This will dump the simulation state in both binary and debug form every turn. The game will run much slower with this option, but for multiplayer testing, it can be useful to determine exactly when the states differed between two players, and how they differed.

OOS caused by the AI

An out of sync error is caused by the AI if the following conditions are both true:

  1. The AI sends different commands to the simulation on different clients or even on the same client when running the game multiple times.
  2. The simulation state does not differ before and the AI gets the same input data.

You can add some debugging code to binaries/data/mods/public/simulation/helpers/Commands.js in the function ProcessCommand to print all the commands coming from the AI. When using a replay you know which player number is the AI player and can filter the output like that:

	if (player == 3 || player == 4)
		warn(uneval(cmd));

Now you can add more debugging output in the part of the AI that is responsible for sending the differing commands and try to figure out why it's happening. If you add the same debugging output on multiple machines, you can use diff to easily find the place in the log where the first difference is found.

Other tips for serializer debugging

  • Debug annotations can prove very helpful when viewing the binary simulation state in a hex editor. The dump will be much larger but it will also contain more textual data. Set DEBUG_SERIALIZER_ANNOTATE to 1 in [StdSerializer.h]source:/ps/source/simulation2/serialization/StdSerializer.h. ** Check hashes more frequently. By default, the game balances e.g. OOS checks with the performance impact of serializing and hashing the state. You might find it helpful to change e.g. CReplayPlayer* Replay in [Replay.cpp]source:/ps/source/ps/Replay.cpp to check hashes more frequently in replay mode, you can also generate hashes more frequently by changing CNetTurnManager::TurnNeedsFullHash (multiplayer games) or CNetLocalTurnManager::NotifyFinishedUpdate (single player games) in [NetTurnManager.cpp]source:/ps/source/network/NetTurnManager.cpp.

Debugging script errors

Note: The JS Debugger was removed (see #410).

A generic JavaScript debugger wouldn't really work for the game's scripts, so instead we created our own debugger, see JavascriptDebugging for more details.

If the JS debugger doesn't help, you can always fall back to the basic debugging "by hand" technique of inserting logging functions into the suspected problematic code, and inspecting data to narrow down the cause.

Out of memory errors

Our JavaScript engine, SpiderMonkey, uses a simplistic garbage collection (GC) to manage script memory. Scripts run in contexts, with one or more contexts per runtime (data can be passed between contexts but not trivially between runtimes - they might be in different threads). Each runtime is initialized with a fixed size, like 16MB or 128MB, and the GC runs periodically to free up allocated memory. Because of this, it's entirely possible that a script runs out of available memory and begins failing. This is an out of memory (OOM) error and is also an uncatchable error.

TODO: document heap dump?

Debugging C++ / engine errors

Because C++ has to be compiled when changes are made, the debugging "by hand" technique is still useful but slightly more annoying than when debugging scripts. It's often more convenient to simply run the game in a debugger like Visual Studio or gdb, either a release or debug build, and set breakpoints to break into the debugger and get a sense of what is happening in the program. The process for this is similar to that described above for debugging crashes.

TODO: links to C++ debugging tutorials/books?

Test suite

Always run the test suite. It gets built along with the game, as either test.exe on Windows or test on *nix. The test suite contains simple cases that might make it more obvious where a bug is located and why it occurs. Of course a failure in the test suite might also mean the test suite needs updating :)