PinSim::FrontEndControls - Screen Capture Additions

Michael Roberts
September 27, 2024
Status: Draft

This is a starting point for possible future additions to the PinSim::FrontEndControls protocol, with new features to help a pinball simulator and a front-end launcher program coordinate capturing videos and screen shots from a running game. The goal is to make it possible for the simulator to supply the video frames, while the front-end program handles the user interface to initiate and control the capture process.

The front-end launcher programs all show preview graphics for the available games they can launch, which are basically screen shots and video captures of the games in action. In the old days, users had to hunt around on the Web for pre-made media packs with the preview media, but the newer launchers simplify things by creating the media files themselves, by actually launching the games, while simultaneously running a screen-capture program such as FFMPEG to create a video or still image of the live game session.

Currently, all of the front ends perform the media capture without the simulator being aware that a capture is taking place. The capture program simply grabs frames from the Windows or GPU video buffers, so the only thing the simulator has to do is run a game session as normal, rendering its frames as it normally would to the system video buffers. This proposal describes an extension of the PinSim::FrontEndControls protocol that would let the front-end program explicitly bring the simulator into the capture process as an active participant. The simulator would know that a capture is taking place, and would do some additional work to send the frames to the capture program itself, rather than letting the capture program read frames out of the system video buffers. This would make it possible for the simulator supply the video frames from any suitable point in its rendering pipeline, which could potentially make the capture process more efficient and improve the captured video quality. Feeding frames directly from the simulator to the encoder could improve video quality by perfectly synchronizing the video encoder's frame sequence to the simulator's rendering cycle, which isn't possible under the traditional method, where the capture program reads from the live display buffers asynchronously.

Background
Second (latest) draft proposal
First draft proposal
Thoughts towards a second proposal

Background

Vincent Bousquet (vbousquet on github) and I had a brief conversation about this in relation to my initial FrontEndControls pull request for VPX, where Vincent suggested that the protocol could include commands to coordinate capture with the front end. The front end is clearly the right place to initiate a capture operation at the user interface level, since the point of the capture is to add the captured files to the front end's collection of preview media. There's a lot of information about file locations and video encoding settings that's all determined by the front-end program, and has nothing to do with the simulator, so the front end should be in control of setting up all of the encoding parameters. The front-end program is also the place to put the entire user interface, since the whole point of the front ends is to provide a uniform UI across all simulators for everything outside of the actual game play. But the simulator is the optimal place to perform the capture, at the level of grabbing the raw video frames, since the simulator is the source of the graphics (for the main playfield, at least; other windows might be generated by separate helper programs). The ideal division of labor is to have the front-end program provide the UI to initiate a capture, and handle all of the details of setting up the video encoder; and to have the simulator do the frame grabbing.

As it stands now, with the current front ends and their capture techniques, the simulator has no knowledge that a capture is even taking place. The front end simply launches the simulator as normal, and at the same time runs a screen capture program such as FFMPEG to capture from the system video buffers.

The main benefit of explicitly inviting the simulator into the process, and making it the source of frames for the video encoder, is that the simulator could precisely synchronize its generated frame rate with the captured video file's frame rate. In the current setup, where the simulator doesn't have any knowledge of the capture process, the capture program and the simulator don't have any way to synchronize their frame rates, so there's no way to ensure that the final video file contains exactly one video frame per simulator-generated frame. The resulting video file therefore might (and, in practice, does) show motion artifacts from the asynchronous frame rates: some frames are captured twice, some aren't captured at all, so you get the sort of syncopated video artifacts that you often see when re-coding a video with a new frame rate. And since the two rates in question are mutually asynchronous, the syncopation is random, so it's not even the kind of regular artifacts you get when converting between rates that are merely dissimilar but still synchronized, such as converting a 24fps video to 30fps.

I think it's a great idea to coordinate the capture somehow, but we didn't have enough of a fleshed-out design at the time to include anything about it in the initial FrontEndControls proposal. I wrote this note to document the thinking on it so far, as a starting point towards a complete design.

Second (latest) draft proposal

This section presents a second draft of a concrete specification for the feature. This attempts to improve on my rather limited first proposal and address the shortcomings Vincent identified. This proposal envisions the following procedure for capturing video:

The user initiates a capture through the front-end program's user interface. This includes making any necessary selections for which screens to capture, where to save the output files, what video encoding parameters to apply, how long the videos should run, and so forth. After making the necessary selections, the user hits the GO button to kick off the capture.
The front-end program launches the simulator (if it's not already running) to start the subject game.
The front-end program creates an anonymous pipe. It uses DuplicateHandle() to create a foreign-process pipe handle to the WRITE end of the pipe for the simulator, and passes this to the simulator as the LPARAM of a PREPARE_CAPTURE command.
The simulator accepts the handle, sets up its internal state to indicate that capture is in progress, and returns a success code as the message result.
The simulator constructs a struct, defined in the protocol, that contains information on the raw video frame format it will use for the capture, and writes this to the pipe.
The front-end program, upon seeing the successful return code from the PREPARE_CAPTURE command, reads the frame format struct from the pipe. It uses this, in combination with its own settings for the output video file format, to launch an FFMPEG child process (or it could launch some other comparable video encoder program as a child process, or it could just start an in-process thread to do the encoding). It sets up the FFMPEG command line to read the input stream from stdin, and it sets the child's stdin to the READ handle to the anonymous pipe (this redirection is done via the Windows process startup parameters). The front end also configures the FFMPEG command line to let FFMPEG know what kind of video frame format to expect to read from the pipe, based on the information that the simulator passed back to the front end through the special struct data it sends before any video data.
Once the front end has launched FFMPEG, and detected that the new process has finished starting up, it sends a BEGIN_CAPTURE command to the simulator. This triggers the simulator to start capturing frames from its graphics renderer and writing them to the pipe.
FFMPEG is now reading its input from the pipe that the simulator is writing its raw frames into, so the capture and encoding process will proceed automatically. The front end simply stands back and lets the frames flow across the pipe, into FFMPEG, and out to the encoded video file.
When the front end decides that it's time to finish the capture, it sends an END_CAPTURE command to the simulator. The simulator resets its internal capture state and closes its end of the pipe. FFMPEG will see this as an EOF on read, which should trigger it to finish the video file and exit.
The front-end program now waits for the FFMPEG child process to terminate, at which point the capture is complete.

Note: this procedure isn't meant to be a prescription for precisely how the front-end must do the encoding. It's more of an example, and a rationale for how we get to the protocol. What we're really specifying here is the protocol, which is embodied in the SendMessage() commands outlined below. The front end can use it in whatever way is most suitable for its own design.

I think this has all of the desired properties identified so far:

Capture is initiated and terminated entirely by the front end, without the need for any separate user interface in the simulator, so that the user interface for performing the capture is entirely within the front end
The front end is in control of all of the output media details (file name and location, encoding format, container format, etc)
The frame generation process is fully controlled by the simulator, including the real-time frame generation cadence
The video encoding program reads the exact series of frames that the simulator generates, so the recorded video is perfectly synchronized with the generated video
It works within the confines of the SendMessage() protocol
It foists fairly minimal work onto the simulator, which only has to be able to capture its own video frames and write them to a pipe in a straightforward format, with some flexibility to choose a format that represents the easiest and/or most efficient conversion from the native format that the simulator uses to generate the graphics for regular display output

FFMPEG vs other capture programs/libraries

Everything above superficially assumes that the actual video encoding is done via FFMPEG. But I'm really only using this as a concrete reference point, and trying to keep the design agnostic as to the encoding technology. Nothing here should be taken as specifying that the FFMPEG must be used as the encoder. I think that the protocol is abstractly defined enough that any other encoder could be substituted - and the substitute need not even take the form of FFMPEG, which is to say, a command-line program spawned as a child process. It could just as well be a video library that's integrated into the front-end program as a DLL or static library, or a program running on a network peer, or even an outboard USB device. It could simply be a regular disk file handle, where the output will be saved in raw format to be encoded later during a second pass run by the front end. The pipe could really go anywhere that can accept the raw video frame stream as input. In the framework procedure outlined above, the step where the front end "launches FFMPEG" could be handled instead by feeding the pipe into its own internal encoder library calls, doing everything in-process. What we're specifying here is only the protocol, and at that level, the simulator doesn't have any idea who's consuming the other end of the pipe, and doesn't care. It just knows that it has to write raw video frames there. The front end and whatever encoder "frame sink" it chooses likewise don't have to know and shouldn't care where those frames are coming from specifically.

SET_CAPTURE_SUBJECT command

WPARAM: 5
LPARAM: target window type code (see table)
Return: 1 on success, 0 on failure

This lets the front-end program establish which type of window it wishes to capture in a subsequent PREPARE_CAPTURE command. Simulators running in "pin cab mode" might display multiple windows on different monitors, and the front end will typically want to capture media from all of these different windows to create its full preview. Each capture stream applies to only one window, though, so the front end needs a way to select the subject window for each new capture. It does so by sending this command prior to a capture; the simulator records the window selection internally, and applies it to the next PREPARE_CAPTURE, and any subsequent captures after that, until a superseding SET_CAPTURE_SUBJECT command changes the selection to a different window.

The LPARAM specifies which type of window to capture from:

LPARAM	Window
1	Main playfield
2	Backglass
3	DMD/score panel
4	Topper
5	Apron instruction card
6	Apron score card

Other values are reserved for future use and should not be used.

The setting affects all following PREPARE_CAPTURE commands, until the next SET_CAPTURE_SUBJECT command changes it.

If the simulator doesn't provide the type of window requested, or it's simply not possible technically to capture from that window, the simulator returns failure.

PREPARE_CAPTURE command

WPARAM: 6
LPARAM: a Windows file system write HANDLE, in the simulator's process address space, coerced to LPARAM
Return: 1 on success, 0 on failure

This prepares to start a new capture in the window currently selected via SET_CAPTURE_SUBJECT. The simulator sets up its internal state to indicate that capture is in progress on the current subject window, and stores the handle as the capture output receiver. The simulator initializes the new capture, but it doesn't actually start writing any frames to the handle until it receives a corresponding BEGIN_CAPTURE command. The recording is placed "on pause" until then. This allows the front end to get its side of the pipe ready to receive frames, so that the simulator doesn't get stuck waiting for the pipe to clear by writing frames before the receiver is ready to read them.

The handle which represents an open Windows HANDLE object with write access. The handle can be attached to any Windows object that can be represented with a file system HANDLE that can be used in a WriteFile() call, so it could be simple disk file handle open for writing, the write handle to an anonymous pipe, a network socket, or a handle to a USB endpoint for an outboard encoder device.

Important: The handle provided in the LPARAM must be within the simulator's process address space. Windows HANDLE objects are process-specific, so a HANDLE created by one process for its own use can't be used directly in another process. We expect this command to be sent from a front-end program process to a separate process running a pinball simulator like Visual Pinball, and since the front end is responsible for creating the output handle object, it's also responsible for providing a copy of the handle within the simulator's address space. Windows has an API specifically for this purpose, DuplicateHandle(), which allows one process (in this case, the front end) to create a copy of one of its own handles (the output handle opened by the front end) that's within the address space of a separate process (the simulator).

The simulator returns 1 to indicate that the capture has started successfully, and writes the following struct to the pipe, to communicate the raw video format back to the front-end program:

   struct PinSimFrontEndControls_CaptureHeader
   {
      uint32_t structSize;   // structure size in bytes, for version detection
      float frameRate;       // frame rate in Hz
      uint32_t width;        // frame width in pixels
      uint32_t height;       // frame height in pixels
      uint32_t rowStride;    // row stride - number of bytes per row
      uint16_t bpp;          // bytes per pixel
      uint16_t format;       // pixel format code, taken from a table specified below (TO DO)
   };

Note that the capture header is intended to be consumed by the front-end program, not by the video encoder (e.g., FFMPEG). It's not part of the video stream, and it's not meant to resemble any standard video container format's file header or stream header. The front-end program is expected to read this structure out of the pipe before handing off the pipe to the encoder program or library. It's just a convenient way of conveying this large struct back to the front-end program, since we can't convey this much information through the LRESULT.

To be determined: Format codes. We'll have to look at FFMPEG's raw video frame format list to come up with a list of formats acceptable to FFMPEG and any other encoders that front ends might want to use. The set of allowable formats might constrain the buffer layout in such a way that the rowStride and bpp elements aren't needed (since these may be fully constrained in all available formats and thus would only be redundant), and on the other hand, might call for new elements to be added to fill in parameters that an encoder would require for certain formats and that aren't inherent in the format code itself.

An alternative to enumerating a list of known format codes by arbitrary integer ID would be to use a string identifier, perhaps with a small fixed-size char[] buffer in the struct. FFMPEG's formats all have names that we could specify in this fashion. But that would be another factor cementing FFMPEG as the only usable encoder, which I'd like to avoid if possible.

To be determined: It might be necessary to negotiate the frame format, rather than just allowing the simulator to choose one unilaterally. Without any negotiation, we're sort of cementing the idea that FFMPEG is the only allowable encoder, or we're at least insisting that any substitute must support exactly the same set of format inputs that FFMPEG accepts, and no others. Negotiation would presumably just be a matter of the front end sending a list of acceptable formats, and the front end choosing from one of these proposed formats, rather than choosing from all possible formats. This would require a further extension of the protocol because of the limitations on SendMessage() parameters (in that we've already used up the whole LPARAM with the pipe handle). Perhaps another pre-capture message could be sent with the LPARAM containing a bit mask of acceptable frame formats. That's not ideal in that it has very limited extensibility, limiting us to 32 total formats ever, but that might be adequate anyway given that they're really not inventing a lot of new pixel formats every year. Alternatively, the front end could call an ADD_CAPTURE_FORMAT command repeatedly with one format at a time as the parameter, in its preference order, and the simulator accumulates an internal list of all formats that have been sent so far.

To be determined: What happens if a PREPARE_CAPTURE arrives while a capture is already in effect? An easy answer would be that it either fails with an error, or it cancels the previous capture and starts a new one. However, we could define it such that multiple simultaneous captures are allowed, so a new PREPARE_CAPTURE simply starts a new one in parallel with any already running.

I don't think it would actually add much implementation complexity to allow multiple captures at once. On the simulator side, it would just be a matter of marking each graphics output surface as being a capture subject and writing frames to its pipe, and on the front-end side, it would mostly be a matter of launching and tracking multiple FFMPEG child processes. I think the only thing we'd have to change at the interface level is a way to specify which stream to end in an END_CAPTURE command. Perhaps the front end simply passes the same pipe handle again in END_CAPTURE to identify the stream being closed.

Right now, the front ends (PinballY, at least) do capture one window at a time, and just run captures serially if the user wants to grab from multiple windows. So the front ends probably wouldn't think to use the feature this way right now. And it would take a seriously beefy machine to encode two or three video streams while also running a real-time physics simulation and 3D rendering. But maybe machines that can handle that will become commonplace before too long, and it would certainly save the user some time, so maybe we ought to bake it into the protocol as an option that future front ends can exploit.

To be determined: Is there anything we can do to integrate audio capture as well? If we don't, the capture program will presumably just capture from the audio loopback device (which is what the capture programs all do now), but I'm not sure if it would be possible to maintain good audio sync with that combination. In the old setup, the capture program was reading both streams (video and audio) out of system buffers in real time, and could therefore sync both to some shared real-time hardware clock, such as the video refresh cycle. It's not clear to me how this would work when video is coming from a program source, but the audio is coming from the real-time system buffers. On the other hand, audio sync can already be pretty crappy when all you're doing is capturing directly from the screen buffer via GDIGRAB or DDAGRAB, so maybe it won't make much difference.

BEGIN_CAPTURE command

WPARAM: 7
LPARAM: the pipe handle from PREPARE_CAPTURE, coerced to LPARAM
Return: 1 on success, 0 on failure

This releases the "pause" on the capture set up on the same handle with PREPARE_CAPTURE. After receiving this command, the simulator is free to start writing frames to the pipe at its convenience.

The point of separating PREPARE_CAPTURE and BEGIN_CAPTURE into separate phases is to allow the front-end program to fully initialize the encoder that will receive the frames on the pipe, so that the encoder is ready to start reading frames immediately as soon as the simulator starts sending them. The front-end program can't fully initialize its encoder until it receives the response to the PREPARE_CAPTURE, since it needs to know the pixel format the simulator will send on the pipe first. This setup time could be non-trivial, especially if the front end implements encoding by launching a child process such as FFMPEG. If the simulator started writing frames before the encoder is ready, it would potentially block on the write for the duration of the encoder startup, which could cause noticeable glitching at the start of every captured video. The two-phase startup should avoid this by controlling the start of frame writing from the front end, which should have a good idea of when the encoder is finally ready to accept input.

END_CAPTURE command

WPARAM: 8
LPARAM: the pipe handle from PREPARE_CAPTURE, coerced to LPARAM
Return: 1 on success, 0 on failure

Terminates the capture in progress identified by the given pipe handle. The simulator closes its pipe handle and resets its internal state so that it's no longer capturing frames.

The inclusion of the pipe handle as a parameter is meant to allow for the possibility that PREPARE_CAPTURE allows multiple simultaneous captures.

First draft proposal

(This section is for historical reference. This was my first draft proposal, to follow up on Vincent's initial idea posted in the PR thread with a concrete interpretation of what the interface design might look like.)

Vincent noted that this design still completely leaves the simulator out of the capture process, other than informing it that a capture is taking place. Vincent's more specific vision for the interface is that the front end would essentially put the simulator into single-step video mode, asking the simulator to generate one frame at a time at a given frame rate, returning the raw frame to the front end. At each request, the simulator would evolve the physics simulation by the amount of elapsed time in one video frame at the negotiated video frame rate for the captured video file, generate the graphics buffer for the new physics state, and return the raw graphics frame buffer to the front-end program. The front-end program would be responsible for delivering this raw graphics frame buffer to a video stream encoder to add into the final video file.

The immediate challenge with the idea of returning the video frame to the front-end program is that the FrontEndControls protocol has no way to send back a large data structure like a video frame, since the protocol is based on the simple Win32 SendMessage() API, which only allows passing back an LRESULT (a native platform integer, essentially, 32 or 64 bits wide, with no ability to interpret this as a pointer because the message is being sent across process boundaries). That's why my first draft below left the capture process up to the simulator: there was simply no way within this protocol to return video frames. So my version of the command does nothing more than instruct the simulator to lock the generated frame rate to match the desired capture program frame rate.

I think there's still some slight value in coordinating the frame rate like this. Since the capture program and simulator are running on the same machine, they're referenced to the same system clock, so it's likely that they'd stay in perfect lock step if they're both explicitly using the same frame rate. But Windows has many internal clocks and many APIs to access them, so there's really no guarantee. Vincent's vision of explicitly coordinating the frame source and sink steps would clearly be superior.

START_CAPTURE command

WPARAM: 5
LPARAM: High byte = frame rate in Hz; low byte = window selection bit mask; bit 0x100 = disable exclusive mode; all other bits reserved, set to 0
Return: On success, a bit mask of the updated windows, using the same bits as in the LPARAM; 0 on failure or if not implemented

This command lets the front-end program notify the simulator that it's about to start capturing and recording the video display output from the simulator via a screen-capture process, and ask the simulator to set a desired target video frame output rate. The quality of the captured video will generally be best if the capture program can sample frames at exactly the same rate at which the simulator is generating them, so that there's a 1:1 cadence between the generated frames and the recorded video.

The LPARAM in the request is treated as a 32-bit DWORD that's composed of a collection of bit fields. Note that we treat this as a 32-bit value even on 64-bit systems so that all features are accessible on x86 systems.

The high byte ((lparam >> 24) & 0xFF) contains the desired frame rate in Hz, as an unsigned 8-bit value. The most typical values are probably 30 and 60. (This isn't capable of representing fractional Hz frame rates such 29.997. Although such frame rates are common at the video hardware level, it doesn't seem necessary to support them for the purposes of capture, since capture output goes to a digital video file, where the frame rate is presumably at the discretion of the capture software.)
Bit 0x100 is set to 1 to request disabling "exclusive" mode in the selected window(s). Exclusive mode is a graphics mode giving the simulator direct access to the GPU for a given monitor, bypassing the Windows CPU-side GDI graphics buffers that are normally used to compose graphics from multiple windows and processes. Exclusive mode makes it impossible for other processes to capture video via the Windows GDI layer because of its ability to bypass the GDI buffers. It's still possible for other programs to capture from exclusive windows by using direct GPU access themselves (which Windows 8 and later expose via the "ddagrab" API), but not all capture programs are so equipped, so a caller might wish to disable exclusive mode for the duration of a capture.
If none of the selected windows are running in exclusive mode, the simulator can simply ignore this flag, since no changes are required to satisfy the request.

The low byte is a mask of the windows that the front-end program wishes to capture from:

Bit	Meaning
0x01	Main playfield
0x02	Backglass
0x04	DMD/score panel
0x08	"Topper" (typically an extra monitor placed on top of the backbox)
0x10	Apron instruction card
0x20	Reserved
0x40	Reserved
0x80	Reserved

On success, the simulator returns a bit mask of the updated windows, using the same bit positions as the low byte of the LPARAM. The window bit mask on return is only meant to indicate which windows the simulator actually displays. It's not an error if the caller includes a bit for a window that the simulator doesn't actually display, but it is an error if the simulator displays a window and can't make the requested frame rate change or exclusive-mode change in that window.

On failure, returns 0. If the simulator can't satisfy the frame rate request or the exclusive mode change, it should make no changes to its internal state and simply return 0 to indicate failure.

END_CAPTURE command

WPARAM: 6
LPARAM: Reserved, must be 0
Return: 1 on success, 0 on failure or if not implemented

Cancels the video mode changes made by START_CAPTURE, returning to the game's normal display operation.

Thoughts towards a second proposal

(This is for historical reference and context - I actually do have a revised second proposal that addresses all of the points below, outlined above.)

I don't yet have a concrete proposal for something better than the above, but I can at least document the desired properties it should have.

We'll need some sort of IPC, alongside SendMessage(), to allow the two programs to exchange more complex data types. In Vincent's original proposal, this would mean passing back a whole video graphics frame buffer from the simulator to the front end, which could be accomplished with shared memory or a pipe, among other things.
Vincent's initial idea was to have the front-end program control the generation of each successive video frame by calling a "get next frame" API via the protocol. However, it seems better to me to leave the simulator in control of the frame generation, so that it can maintain synchronization with real-time user input. I'd propose that we replace the notion of a "get next frame" API with something more like a "put next frame" callback. This would leave the front-end in charge of collecting and encoding the raw frames into a video stream, but it would no longer be in control of the real-time pacing. A callback could be provided, for example, via a COM object, or a pipe write handle. We'd probably still need something more sophisticated than SendMessage() just to pass this callback information, so more work is needed there.
If we're going to go that far, maybe we should just delegate the entire video capture job to the simulator, including creating the video file. Moving the video frames across process boundaries in real time could pose performance challenges that might more easily be solved by handling the whole job in-process in the simulator. If the simulator has to capture the frames and feed them into some kind of interface, the best interface might be a video encoder API. In this case, the protocol between front end and simulator might be something like this:
- Start Recording: parameters include a media type (image, video, audio), codec selection, codec parameters, container format, and output filename
- Finish Recording
I'll have to look at the available libraries derived from the FFMPEG project to see if there's a good encoder interface. It's possible that we could come up with a nice general-purpose encoding library that simulators like VPX could use to implement the encoding work in-process without doing much more than work than the frame capture part.