PBCore Audiovisual Tricks

This post, written by Dave Rice, outlines how various tools like MediaInfo and ffmpeg can be used in audiovisual description, access, and preservation workflows.

Basic MediaInfo outputs

A general mediainfo output can be produced with a command such as:

    mediainfo your-file-here.mkv

Note that if your instance of MediaInfo was compiled with cURL support then you could also use this with a URL such as:

    mediainfo http://samples.ffmpeg.org/flac/Yesterday.ogg

A more verbose version of MediaInfo’s output can be generated by adding the -f (for full) option.

    mediainfo -f your-file-here.mkv

And for a more parseable output, the --language=raw option ensures that all metadata labels are unique.

    mediainfo -f --language=raw your-file-here.mkv

MediaInfo supports many different types of metadata outputs such as HTML, XML, JSON, EBUCore, and … PBCore! … and PBCore2. Unless legacy versions of PBCore are needed, PBCore2 is recommended. This can be generated with the --Output=PBCore2 option such as:

    mediainfo --Output=PBCore2 your-file-here.mkv

This output can be written to a file with redirection, by adding > outputfile.xml to the end of the command, such as:

    mediainfo --Output=PBCore2 your-file-here.mkv > now-i-have-a-pbcore.xml

To create a batch of sidecar PBCore xml files, a loop can be run such as:

    find /look/here -name '*.mov' | while IFS= read -r file ; do mediainfo --Output=PBCore2 "${file}" > "${file}_pbcore.xml" ; done

Just replace /look/here with a path to a directory in which you’d like to look for files and possibly change -name '*.mov' to another pattern to match the files you want to make PBCore records for.
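
For larger batches it can help to make the loop restartable. The following is a minimal sketch, not something that ships with MediaInfo: the function name make_pbcore_sidecars and the temporary .tmp suffix are assumptions made for illustration. It skips files that already have a non-empty sidecar, and it keeps a new sidecar only when mediainfo exits successfully and actually writes some output.

```shell
# make_pbcore_sidecars DIR PATTERN
# Hypothetical helper: writes FILE_pbcore.xml next to each matching file.
make_pbcore_sidecars() {
    dir=$1
    pattern=$2
    find "$dir" -type f -name "$pattern" | while IFS= read -r file; do
        sidecar="${file}_pbcore.xml"
        # Skip files that already have a non-empty sidecar.
        if [ -s "$sidecar" ]; then
            continue
        fi
        # Write to a temporary name first so an interrupted or failed run
        # does not leave a partial sidecar behind.
        if mediainfo --Output=PBCore2 "$file" > "${sidecar}.tmp" && [ -s "${sidecar}.tmp" ]; then
            mv "${sidecar}.tmp" "$sidecar"
        else
            rm -f "${sidecar}.tmp"
            echo "no PBCore written for: $file" >&2
        fi
    done
}
```

It could then be called as make_pbcore_sidecars /look/here '*.mov'. Like the one-line loop, this reads one path per line, so filenames that contain newlines are not handled.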

Customizing PBCore output

The PBCore output from MediaInfo is constructed from information gathered by MediaInfoLib. In some cases, one may want to replace some of that data or add information that is covered by PBCore’s Instantiation element but is not reportable by MediaInfo (such as instantiationGenerations). Metadata can be added in the process of calling MediaInfo by using the --ExternalMetadata option such as:

    mediainfo --Output=PBCore2 --ExternalMetadata="instantiationLocation;my house" file-at-home.mp4

The syntax here is that within the --ExternalMetadata option a value is provided that contains the PBCore metadata name then a semicolon and then the value for that metadata (instantiationLocation set to ‘my house’ in this example). This will replace the instantiationLocation provided by MediaInfo (which is the filepath of the file), with the user-supplied information. This can be useful when the file path is not wanted for instantiationLocation or when it makes more sense to supply a general value, such as ‘WXXX Archive’ rather than a specific value such as that file’s path in its file system.

Note that --ExternalMetadata will work with instantiationGenerations and instantiationLocation.

Using PBCore to supply description metadata in derivative creation

When generating derivative audiovisual files to facilitate access to media, it is helpful to supply embedded audiovisual metadata within that derivative to facilitate search and access for the user. Several audiovisual distribution formats such as m4a, mp3, mp4 and others offer methods to store descriptive metadata in a manner that audiovisual players can use. This information is important for keeping our iTunes libraries and other personal media collections from turning into a complete mess.

In a usual process of transcoding an audiovisual file to make a derivative, we have a process like this:

    preservation_master.mkv -> access_file.mp4

preservation_master.mkv may have some embedded metadata, but in most collection management environments the descriptive metadata is found in a separate database (perhaps a database which can represent descriptive records in PBCore format). In the scenario above the resulting access_file.mp4 will share some of the same technical metadata as the source file but does not reflect any of the descriptive metadata that may be stored in the database. When a PBCore XML is introduced into the transcoding process, it could look more like:

    preservation_master.mkv + pbcore.xml -> access_file_with_metadata.mp4

There are several methods to accommodate this. If using FFmpeg to facilitate transcoding, then one method is to convert the PBCore into an ffmetadata document which can be referenced in the transcoding. The attached bash script called pbcore2ffmetadata extracts select PBCore data in order to create an ffmetadata file. It can be used in the following way:

    pbcore2ffmetadata your-pbcore.xml
    ffmpeg -i your-media.mkv -i your-pbcore.xml.ffmetadata -map_metadata 1 access-file.mp4

In the above command, when pbcore2ffmetadata runs with your-pbcore.xml as an input, it creates a new file called your-pbcore.xml.ffmetadata. That ffmetadata file can then be used as a second input in the subsequent ffmpeg command. Since that ffmpeg command has two inputs (a source audiovisual file and a metadata file), the option -map_metadata 1 is added to clarify that the metadata of input number 1 (counting of inputs starts from zero) should be used in the transcoding. Thus, adding -i your-pbcore.xml.ffmetadata -map_metadata 1 after the first input incorporates the metadata from PBCore into the output file.
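
For reference, an ffmetadata file is plain text: a ;FFMETADATA1 header line followed by key=value lines, in which the characters =, ;, # and \ are escaped with a backslash. The sketch below is not the pbcore2ffmetadata script. It is a hypothetical helper (write_ffmetadata and ffmeta_escape are made-up names) that only illustrates the shape of such a file, given values that have already been pulled out of a PBCore record. It assumes every value fits on a single line.

```shell
# Escape the characters that the ffmetadata format treats as special
# ('=', ';', '#' and '\') by prefixing each one with a backslash.
ffmeta_escape() {
    printf '%s' "$1" | sed -e 's/[=;#\\]/\\&/g'
}

# write_ffmetadata OUTFILE KEY VALUE [KEY VALUE ...]
# Writes a global-metadata ffmetadata file with one KEY=VALUE line per pair.
write_ffmetadata() {
    out=$1
    shift
    printf ';FFMETADATA1\n' > "$out"
    while [ "$#" -ge 2 ]; do
        printf '%s=%s\n' "$(ffmeta_escape "$1")" "$(ffmeta_escape "$2")" >> "$out"
        shift 2
    done
}
```

A call such as write_ffmetadata your-pbcore.xml.ffmetadata title "Evening News; 6pm" date "1985-03-02" writes the header plus two lines, with the semicolon in the title stored as \; so that FFmpeg does not read the rest of the line as a comment. The keys and values here are invented examples, and which keys a player actually displays depends on the output format.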

Embedding PBCore into Matroska

Matroska supports arbitrary attachments, which provides a way for PBCore XML records to be embedded into any Matroska file. The following sections give two recommendations for embedding PBCore into Matroska: while creating the Matroska file with FFmpeg, or by adding PBCore XML to an existing Matroska file.

Embed PBCore into Matroska while using FFmpeg

When creating a Matroska file with FFmpeg, these options can be used to embed a PBCore XML as an attachment while the transcoding is occurring.

    -attach pbcore.xml
    -metadata:s:t:0 mimetype=text/xml
    -metadata:s:t:0 title="PBCore"

The -attach pbcore.xml option clarifies which file to use as an attachment. Replace pbcore.xml with the filepath to the PBCore XML file that you would like to embed.

The -metadata:s:t:0 mimetype=text/xml option fulfills a requirement in Matroska that the mime type be stored for all attachments. The s:t:0 in the metadata option is a stream specifier where s means stream, t means attachment, and 0 means the first (counting from zero), so this option will set the mime type metadata for the first provided attachment.

The -metadata:s:t:0 title="PBCore" option will store the word PBCore in the resulting Matroska’s FileDescription Element in that Attachment Element. This helps describe the attachment more specifically than the mime type value can.
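
Put together, the three options sit between the input and the output filename. The following is a minimal sketch, assuming a plain remux with -c:v copy -c:a copy (substitute whatever codec options the transcode actually calls for, since not every codec can be copied into Matroska). The function name mkv_with_pbcore is a placeholder made up for this example.

```shell
# mkv_with_pbcore INPUT PBCORE_XML OUTPUT_MKV
# Hypothetical wrapper: remux INPUT into Matroska and attach PBCORE_XML
# in the same pass.
mkv_with_pbcore() {
    ffmpeg -i "$1" \
        -c:v copy -c:a copy \
        -attach "$2" \
        -metadata:s:t:0 mimetype=text/xml \
        -metadata:s:t:0 title="PBCore" \
        "$3"
}
```

For example: mkv_with_pbcore your-media.mov pbcore.xml your-media.mkv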

Embed PBCore XML into an existing Matroska file

The prior scenario can be used to adapt an existing FFmpeg command that creates a Matroska file to embed PBCore XML in the same process. If the Matroska file is already generated, FFmpeg does not support a method to attach a PBCore XML quickly without rewriting the entire file. To attach PBCore XML into an existing file, mkvpropedit can attach the XML more quickly by only rewriting the parts of the Matroska file that need adjustment and leaving the audiovisual data as is.

    mkvpropedit --attachment-description "PBCore" --add-attachment pbcore.xml existing_file_that_could_use_a_pbcore_attachment.mkv

Similar to the FFmpeg command above, this mkvpropedit command supplies a description and filename for the utility to use when adjusting the output file. In this case mkvpropedit can correctly guess the mimetype of the XML as “text/xml”, so this option does not need to be specifically supplied.

To extract PBCore XML from Matroska

For a file named test.mkv the following commands with FFmpeg and mkvextract may be used to extract the PBCore xml from the file to recreate the original XML document.

Using FFmpeg:

    ffmpeg -dump_attachment:t "" -i test.mkv

Using mkvextract:

    mkvextract test.mkv attachments 1
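
To confirm that an embedded record survives intact, the extracted attachment can be compared byte for byte with the XML that was attached. The following is a minimal sketch, assuming the PBCore XML is attachment ID 1 and a version of mkvextract that takes the source file before the attachments mode, as in the command above. The name pbcore_roundtrip_ok is a hypothetical helper, not part of MKVToolNix.

```shell
# pbcore_roundtrip_ok FILE.mkv ORIGINAL.xml
# Returns 0 if attachment 1 of FILE.mkv is byte-identical to ORIGINAL.xml.
pbcore_roundtrip_ok() {
    tmpdir=$(mktemp -d) || return 1
    status=1
    # "1:PATH" asks mkvextract to write attachment ID 1 to PATH.
    if mkvextract "$1" attachments "1:${tmpdir}/pbcore.xml" > /dev/null; then
        # cmp -s prints nothing and exits 0 only when the files match.
        cmp -s "$2" "${tmpdir}/pbcore.xml" && status=0
    fi
    rm -rf "$tmpdir"
    return "$status"
}
```

For example: pbcore_roundtrip_ok test.mkv pbcore.xml && echo "attachment matches"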

Gathering PBCore samples

The attached script pbcoresamples is written to help use online media to generate a set of PBCore sample files from MediaInfo. This script includes an array called samples which contains a comma-delimited list of values, each consisting of a url followed by a bracketed name to be used for the output PBCore document. By running this script, a set of PBCore samples will be generated from the sample list for review. Examining PBCore data from a wide variety of files is helpful in generating feedback and suggestions for MediaInfo’s PBCore export.

Giving Feedback on MediaInfo’s PBCore support

Feedback on MediaInfo’s PBCore export should be written into the issue tracker at MediaInfoLib at https://github.com/MediaArea/MediaInfoLib/issues. MediaInfoLib is the library that MediaInfo uses for file analysis and metadata reporting. If you’d like a closer look at what’s happening behind-the-scenes, the PBCore2 exporter in MediaInfoLib is reviewable at https://github.com/MediaArea/MediaInfoLib/blob/master/Source/MediaInfo/Export/Export_PBCore2.cpp.