OpenGL Optimizations in an FFmpeg Filter

Using Pixel Buffer Objects and YUV input data.

In an earlier post, I walked through an ffmpeg video filter which runs frames through a pair of GLSL shaders. Building on that, I’d like to talk about a couple of performance optimizations, using the previous example, vf_genericshader.c, as the substrate.

In practice, the most dramatic performance gains are likely tied up in the transcoding portions of the runtime, and I’ll cover hardware accelerated decoding & encoding in a future post - but there remain significant1 performance gains we can exploit within the filter.

1 Around 15% for my use case, which mostly involves very large videos, processed on AWS g2 instances.

Background

Pixel Buffers

The two sets of functions we use to transfer data to/from the GPU - glTex*Image* & glReadPixels - are synchronous. Depending on hardware, texture dimensions, etc., both may take a frustrating amount of time to execute. Given the architecture of ffmpeg filters - decoding and encoding happen implicitly at either end of each filter_frame callback - equivalent, asynchronous GPU transfer operations would let us make progress between calls to filter_frame.

Mechanics

When a buffer is bound to GL_PIXEL_PACK_BUFFER, glReadPixels returns immediately, triggering the asynchronous transfer of pixels to a region of memory we can directly access via glMapBuffer. If called before the asynchronous transfer completes, glMapBuffer stalls, in a similar manner to vanilla glReadPixels.

On the other side, if a buffer is bound to GL_PIXEL_UNPACK_BUFFER, the pointer returned from glMapBuffer becomes the destination for our texture data. Rather than passing a memory address to glTex*Image*(), we keep the buffer bound and pass byte offsets into whatever data we previously copied into the buffer. Again, if the GPU hasn't finished with the buffer, glMapBuffer stalls.
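To make the pattern concrete, here's a minimal sketch of both directions, outside the context of the filter (pack_pbo, unpack_pbo, src_pixels, width & height are all illustrative names, not anything from the filter):

// Readback: with a pack buffer bound, glReadPixels queues the transfer and
// returns; the pixels are collected later via glMapBuffer
glBindBuffer(GL_PIXEL_PACK_BUFFER, pack_pbo);
glReadPixels(0, 0, width, height, GL_RGB, GL_UNSIGNED_BYTE, 0);
/* ... useful work while the transfer proceeds ... */
GLubyte *pixels = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
/* ... consume pixels ... */
glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);

// Upload: copy into an unpack buffer, then hand glTexSubImage2D a byte
// offset into that buffer instead of a client memory address
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, unpack_pbo);
GLubyte *dst = glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY);
memcpy(dst, src_pixels, width * height);
glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height,
                GL_RED, GL_UNSIGNED_BYTE, (const GLvoid *)0);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);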

YUV

For the purposes of demonstration, the first pass of the filter only accepted inputs in AV_PIX_FMT_RGB24, which, while convenient, isn’t a great choice for a couple of reasons:

  • If a video’s native pixel format isn’t RGB (hint: it’s not), ffmpeg’ll use libswscale to convert the frames before throwing them at our filter. As we’ve already got a live GL context, this is something we could handle ourselves in a fragment shader.

  • The native pixel format of the video is likely lower bandwidth than RGB - YUV420P, for instance, carries 12 bits per pixel against RGB24's 24.

To keep things manageable, we’ll only support a single input pixel format - YUV420P. While this is a lot more useful than RGB24, more useful still would be to accept frames in a range of common formats, both packed & planar. It should be relatively clear how to do that from a single example, and the changes oughtn’t be far-reaching1.

On the output side, we’ll still read RGB data from the GPU (which libswscale will likely be converting to YUV, depending on what’s happening at runtime). Changing that’d be pretty awkward, and a less obvious victory than addressing the input.

1 You ought to benchmark this before going wild. Depending on your platform & the spread of formats in your input sample, there may be a negligible (or negative) performance delta between FFmpeg’s accelerated format translation and whatever you’re doing in the shader.

Walkthrough

We’ll be covering only the chunks of the filter which have changed significantly since the previous post.

#define OUTPUT_BUFFERS 2
#define INPUT_BUFFERS  2
#define PIXEL_FORMAT   GL_RED

typedef struct {
  const AVClass *class;
  GLuint        program;
  GLuint        pbo_in[INPUT_BUFFERS];   // unpack PBOs, YUV upload
  GLuint        pbo_out[OUTPUT_BUFFERS]; // pack PBOs, RGB readback
  GLuint        tex[3];                  // one texture per Y/U/V plane
  AVFrame       *last;                   // frame rendered on the previous call
  int           frame_idx;
  GLFWwindow    *window;
  GLuint        pos_buf;
} GenericShaderContext;

pbo_in and pbo_out are going to be treated as circular arrays of input (unpack) and output (pack) pixel buffers.

Rather than poll/synchronize to avoid stalling, we’ll stay one frame behind (i.e. first call to filter_frame outputs nothing, second call outputs first frame, etc.). We’re assuming that the interval between filter_frame callbacks is sufficiently long that we can avoid stalling with only a small number of pixel buffers. If that wasn’t the case, we could increase the buffer count constants, memory permitting.
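Concretely, with two buffers in each direction, the schedule looks like this (illustrative, not code from the filter):

// call 0: frame 0 -> pbo_in[0], readback begins into pbo_out[0], nothing emitted
// call 1: frame 1 -> pbo_in[1], readback begins into pbo_out[1], pbo_out[0] mapped, frame 0 emitted
// call 2: frame 2 -> pbo_in[0], readback begins into pbo_out[0], pbo_out[1] mapped, frame 1 emitted
// ...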

Setup

static void pbo_setup(AVFilterLink *inl) {
  AVFilterContext     *ctx = inl->dst;
  GenericShaderContext *gs = ctx->priv;

  // Unpack (input) buffers: one full YUV420P frame each, i.e. 1.5 bytes/pixel
  glGenBuffers(INPUT_BUFFERS, gs->pbo_in);
  for(int i = 0; i < INPUT_BUFFERS; i++) {
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, gs->pbo_in[i]);
    glBufferData(
      GL_PIXEL_UNPACK_BUFFER, inl->w * inl->h * 3 / 2, 0, GL_STREAM_DRAW);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
  }

  // Pack (output) buffers: one full RGB24 frame each, i.e. 3 bytes/pixel
  glGenBuffers(OUTPUT_BUFFERS, gs->pbo_out);
  for(int i = 0; i < OUTPUT_BUFFERS; i++) {
    glBindBuffer(GL_PIXEL_PACK_BUFFER, gs->pbo_out[i]);
    glBufferData(
      GL_PIXEL_PACK_BUFFER, inl->w * inl->h * 3, 0, GL_STREAM_READ);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
  }
}

We’re inputting YUV and outputting RGB, hence the differing buffer sizes: YUV420P carries 1.5 bytes per pixel, RGB24 carries 3.
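For a sense of scale, with a hypothetical 1920x1080 input (nothing in the filter depends on these numbers):

// YUV420P: 1920 * 1080 * 3 / 2 = 3,110,400 bytes per unpack buffer
// RGB24:   1920 * 1080 * 3     = 6,220,800 bytes per pack buffer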

static void tex_setup(AVFilterLink *inl) {
  AVFilterContext     *ctx = inl->dst;
  GenericShaderContext *gs = ctx->priv;

  glGenTextures(3, gs->tex);

  for(int i = 0; i < 3; i++) {
    glActiveTexture(GL_TEXTURE0 + i);
    glBindTexture(GL_TEXTURE_2D, gs->tex[i]);

    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);

    // One 8-bit, single-channel texture per plane; the chroma planes (i > 0)
    // are half-size in each dimension for 4:2:0
    glTexImage2D(GL_TEXTURE_2D, 0, GL_R8,
                 inl->w / (i ? 2 : 1),
                 inl->h / (i ? 2 : 1),
                 0, PIXEL_FORMAT, GL_UNSIGNED_BYTE, NULL);
  }

  glUniform1i(glGetUniformLocation(gs->program, "tex_y"), 0);
  glUniform1i(glGetUniformLocation(gs->program, "tex_u"), 1);
  glUniform1i(glGetUniformLocation(gs->program, "tex_v"), 2);
}

One texture for each of the first three texture units, with the large luminance/Y texture on unit 0 and the smaller chrominance/U & V textures on the following two.
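One GL requirement worth flagging, since it's easy to miss: glUniform1i acts on whichever program is currently in use, so gs->program needs to have been made current before tex_setup runs - something along the lines of:

glUseProgram(gs->program);  // the glUniform1i calls in tex_setup target this program
tex_setup(inl);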

Frame Processing

Let’s look at the per-frame callback before expanding the functions it’s calling:

static int filter_frame(AVFilterLink *inl, AVFrame *in) {
  AVFilterContext     *ctx = inl->dst;
  AVFilterLink       *outl = ctx->outputs[0];
  GenericShaderContext *gs = ctx->priv;

  AVFrame *out = ff_get_video_buffer(outl, outl->w, outl->h);
  av_frame_copy_props(out, in);

  // Upload frame N & kick off its (asynchronous) processing/readback
  input_frame(inl, in, gs->pbo_in[gs->frame_idx % INPUT_BUFFERS]);
  process_frame(inl, in,
                gs->pbo_in[gs->frame_idx % INPUT_BUFFERS],
                gs->pbo_out[gs->frame_idx % OUTPUT_BUFFERS]);

  // Emit frame N-1, whose readback was started on the previous call
  int ret = 0;
  if (gs->last) {
    ret = output_frame(
      inl, gs->last,
      gs->pbo_out[(gs->frame_idx-1) % OUTPUT_BUFFERS]);
  }

  av_frame_free(&in);
  gs->last = out;
  gs->frame_idx++;

  return ret;
}

After the standard frame allocation boilerplate, input_frame maps the buffer behind one of the entries in pbo_in and copies the input AVFrame’s planes into it:

static void input_frame(AVFilterLink *inl, AVFrame *in, GLuint pbo) {
  glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);

  // Pack the three planes back-to-back: Y (w*h), then U & V (w/2 * h/2 each).
  // Assumes the planes are tightly packed, i.e. linesize == plane width.
  GLubyte *ptr = glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY);
  memcpy(ptr, in->data[0], inl->w*inl->h);
  ptr += inl->w*inl->h;
  memcpy(ptr, in->data[1], inl->w/2 * inl->h/2);
  ptr += inl->w/2 * inl->h/2;
  memcpy(ptr, in->data[2], inl->w/2 * inl->h/2);

  glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);
  glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
}

We pack the planar YUV data (Y in data[0] through V in data[2]) into a single buffer in the obvious order. While we have 3 different textures, there’s no benefit to using a distinct buffer for each texture’s data.
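For a w x h frame, the layout within each unpack buffer ends up as (byte offsets):

// [0,           w*h)           Y plane
// [w*h,         w*h + w*h/4)   U plane
// [w*h + w*h/4, w*h + w*h/2)   V plane

These are the same offsets process_frame hands to glTexSubImage2D below.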

It’s the responsibility of process_frame to refresh the 3 textures from the current pbo_in:

static void process_frame(
  AVFilterLink *inl, AVFrame *in, GLuint pbo_in, GLuint pbo_out) {

  AVFilterContext     *ctx = inl->dst;
  GenericShaderContext *gs = ctx->priv;

  glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo_in);

  intptr_t offset = 0;
  for(int i = 0; i < 3; i++) {
    int w = inl->w / (i ? 2 : 1);
    int h = inl->h / (i ? 2 : 1);

    glActiveTexture(GL_TEXTURE0 + i);
    glBindTexture(GL_TEXTURE_2D, gs->tex[i]);

    glPixelStorei(GL_UNPACK_ROW_LENGTH, in->linesize[i]);
    // With an unpack buffer bound, the final argument is a byte offset into
    // the buffer rather than a client memory address
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, w, h,
                    PIXEL_FORMAT, GL_UNSIGNED_BYTE, (const GLvoid *)offset);
    offset += w * h;
  }
  glPixelStorei(GL_UNPACK_ROW_LENGTH, 0);
  glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);

  glDrawArrays(GL_TRIANGLES, 0, 6);

Each call to glTexSubImage2D with the unpack buffer bound pulls the plane data copied in input_frame into the bound texture, starting at offset and ending at offset + w*h.

The function ends:

  glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo_out);
  glReadPixels(0, 0, inl->w, inl->h, GL_RGB, GL_UNSIGNED_BYTE, 0);

  glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
}

As a pack buffer is bound and no memory address supplied, glReadPixels operates asynchronously. It’s the job of output_frame to actually access the pixel data:

static int output_frame(
  AVFilterLink *inl, AVFrame *out, GLuint pbo) {

  AVFilterContext *ctx  = inl->dst;

  glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);

  // Stalls here if the readback kicked off in process_frame hasn't finished.
  // The mapped pointer is only valid until glUnmapBuffer, so copy the pixels
  // out (assuming out->linesize[0] == inl->w * 3; see below for padded frames)
  GLubyte *ptr = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
  memcpy(out->data[0], ptr, inl->w * inl->h * 3);

  glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
  glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);

  return ff_filter_frame(ctx->outputs[0], out);
}

As we passed GL_RGB to glReadPixels, the buffer contains packed RGBRGB... samples - exactly what we told ffmpeg we’d be giving it - so we copy the mapped buffer into the out frame’s data plane before unmapping.
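One caveat - an assumption of mine rather than something the filter guards against: ff_get_video_buffer is free to return a frame whose linesize[0] is padded beyond w * 3, in which case the single memcpy above would shear the image. A row-by-row copy covers that case:

GLubyte *ptr = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
for (int y = 0; y < inl->h; y++)
  memcpy(out->data[0] + y * out->linesize[0],
         ptr          + y * inl->w * 3,
         inl->w * 3);
glUnmapBuffer(GL_PIXEL_PACK_BUFFER);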

YUV-specific

The fragment shader appears as a string constant in the filter’s C source file; it’s reproduced here for readability:

uniform sampler2D tex_y;
uniform sampler2D tex_u;
uniform sampler2D tex_v;
varying vec2 tex_coord;

// mat3 is filled column by column - each line below is one column, so e.g.
// R = 1.164 * y + 1.596 * v
const mat3 bt601_coeff = mat3(1.164,  1.164, 1.164,
                                0.0, -0.392, 2.017,
                              1.596, -0.813,   0.0);
const vec3 offsets     = vec3(-0.0625, -0.5, -0.5);

vec3 sampleRgb(vec2 loc) {
  float y = texture2D(tex_y, loc).r;
  float u = texture2D(tex_u, loc).r;
  float v = texture2D(tex_v, loc).r;
  return bt601_coeff * (vec3(y, u, v) + offsets);
}

void main() {
  gl_FragColor = vec4(sampleRgb(tex_coord), 1.);
}

At the top-level we’ve got some coefficients for converting values from our Y (luminance) & U, V (chrominance) textures to a vector of R, G, B. The particular values here are for the BT.601 colorspace - in real life, you may want a coefficient uniform, rather than a constant, with a matrix of format-dependent values supplied by the filter1.

offsets shifts the YUV samples before conversion: studio-swing luminance, which occupies roughly 16/256 (picture black) to 235/256 (picture white), is moved down to start at zero, and the chrominance samples, stored in the range 0-1, are re-centred around zero.

1 I don’t know a huge amount (i.e. I know nothing) about colorspaces.
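A sketch of the uniform approach mentioned above, assuming the shader declares uniform mat3 yuv_coeff in place of the constant; the BT.709 values below are the usual limited-range ones, but verify them against a reference before trusting me:

// Column-major, to match the GLSL mat3 constructor above
static const GLfloat bt709_coeff[9] = {
  1.164,  1.164, 1.164,   // column 0: contribution of Y
  0.0,   -0.213, 2.112,   // column 1: contribution of U
  1.793, -0.533, 0.0      // column 2: contribution of V
};

// With gs->program in use:
glUniformMatrix3fv(glGetUniformLocation(gs->program, "yuv_coeff"),
                   1, GL_FALSE, bt709_coeff);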

And finally the modified query_formats definition:

static int query_formats(AVFilterContext *ctx) {
  AVFilterFormats *formats = NULL;
  int ret;
  if ((ret = ff_add_format(&formats, AV_PIX_FMT_YUV420P)) < 0 ||
      (ret = ff_formats_ref(formats, &ctx->inputs[0]->out_formats)) < 0) {
    return ret;
  }
  formats = NULL;
  if ((ret = ff_add_format(&formats, AV_PIX_FMT_RGB24)) < 0 ||
      (ret = ff_formats_ref(formats, &ctx->outputs[0]->in_formats)) < 0) {
    return ret;
  }
  return 0;
}
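If you did want the wider set of input formats mentioned earlier, the input half might grow into something like the sketch below - the format list is illustrative, and each added format needs matching changes to the upload code and shader:

static const enum AVPixelFormat in_fmts[] = {
  AV_PIX_FMT_YUV420P, AV_PIX_FMT_YUV422P, AV_PIX_FMT_YUV444P,
  AV_PIX_FMT_NONE
};

// Replacing the single ff_add_format call on the input side:
AVFilterFormats *formats = NULL;
int ret;
for (int i = 0; in_fmts[i] != AV_PIX_FMT_NONE; i++)
  if ((ret = ff_add_format(&formats, in_fmts[i])) < 0)
    return ret;
if ((ret = ff_formats_ref(formats, &ctx->inputs[0]->out_formats)) < 0)
  return ret;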

Hopefully that made some kind of sense. As above, the code is on GitHub - suggestions welcome.