PROJ RFC-8: Embedding resource files into libproj

Author:

Even Rouault

Contact:

even.rouault @ spatialys.com

Status:

Adopted, implemented

Implementation target:

PROJ 9.6

Last Updated:

2024-Oct-01

Summary

This RFC provides an optional way of embedding proj.db and proj.ini files directly into libproj, either using C23 #embed pre-processor directive when supported by compilers, or falling back to a CMake-based script for older compilers.

Motivation

Most common practical uses of PROJ depend critically on the availability of proj.db. Locating that resource file on the file system can be painful in use cases that involve relocating the PROJ binary at installation time. One such case is the PROJ embedded in Rasterio or Fiona binary wheels, where PROJ_DATA must currently be set correctly. DuckDB Spatial also patches PROJ to embed proj.db in its static build. WebAssembly (WASM) use cases also come to mind as users of PROJ builds where resources are directly included in libproj. Given the existence of several out-of-tree patches to support embedding proj.db (such as https://github.com/OSGeo/PROJ/issues/2998#issuecomment-1004185741), it makes sense to have an upstream-vetted solution the community can build around, and potentially use as a starting point for further work.

Technical solution

C23 #embed

The C23 standard includes a #embed "filename" pre-processor directive that ingests the specified file and returns its content as tokens that can be stored in an unsigned char or char array.

Getting the content of a file into a variable is as simple as the following:

static const unsigned char proj_db[] = {
#embed "data/proj.db"
};

Support for that directive is still very new. clang 19.1 is the first compiler which has a release including it, and has an efficient implementation of it, able to embed very large files with minimum RAM and CPU usage.

The development version of GCC 15 also supports it, but in a non-optimized way for now: embedding large files of several tens of megabytes can take significant compilation time, with no impact on runtime. GCC developers have expressed the intent to improve this in the future.

Embedding PROJ's 9.1 MB proj.db with GCC 15dev at time of writing takes 18 seconds and 1.7 GB of RAM, compared to 0.4 seconds and 400 MB of RAM for clang 19. The GCC figures are still reasonable: generating proj.db itself from its source .sql files takes one minute on the same system.

There is no timeline for Visual Studio C/C++ at time of writing, although users have requested it.

Note that clang 19.1 currently only supports #embed in .c files, not C++ ones (the C++ standard has not yet adopted this feature). Embedding resources must therefore be done in a .c file, which is not a problem since symbols and functions can easily be exported from a .c file for use from C++.
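As an illustration of the approach described above, here is a minimal sketch of a .c file that uses #embed when the pre-processor offers it and exposes the bytes through a C function that C++ code can call. The names sketch_proj_db, sketch_get_proj_db and SKETCH_HAVE_EMBED are hypothetical and are not taken from the PROJ sources. When #embed is unavailable or data/proj.db cannot be found, the sketch substitutes a 16-byte stand-in so that it still compiles; the real build uses the generated files described in the next section instead.

```c
#include <stddef.h>

/* Declaration a companion header could carry; the extern "C" guard lets
 * C++ translation units link against this C symbol. */
#ifdef __cplusplus
extern "C" {
#endif
const unsigned char *sketch_get_proj_db(size_t *size);
#ifdef __cplusplus
}
#endif

/* __has_embed only exists in pre-processors that know about #embed, so the
 * check is nested to keep older pre-processors from having to parse it. */
#if defined(__has_embed)
#if __has_embed("data/proj.db")
#define SKETCH_HAVE_EMBED 1
#endif
#endif

static const unsigned char sketch_proj_db[] = {
#ifdef SKETCH_HAVE_EMBED
#embed "data/proj.db"
#else
    /* Assumption for this sketch only: a stand-in equal to the 16-byte
     * header that starts every SQLite 3 database file. */
    'S', 'Q', 'L', 'i', 't', 'e', ' ', 'f',
    'o', 'r', 'm', 'a', 't', ' ', '3', '\0'
#endif
};

/* Hypothetical accessor: returns the embedded bytes and, when size is not
 * NULL, stores their count in *size. */
const unsigned char *sketch_get_proj_db(size_t *size)
{
    if (size != NULL)
        *size = sizeof(sketch_proj_db);
    return sketch_proj_db;
}
```

Keeping the array static and exporting only an accessor means the C++ side never needs to see the #embed directive.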

Fallback C99/C++11 mode

For compilers without C23 support, a CMake script derived from https://jonathanhamberg.com/post/cmake-file-embedding/ (which is MIT licensed) generates a .c and .h file per file to embed. The C file consists of a const uint8_t content[] = { .... } array, which matches what a non-optimized implementation of C23 #embed typically does.

This script has been improved because it performed very poorly on large files such as proj.db. Its execution time is now 8 seconds for proj.db.
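To make the shape of the generated file concrete, the following sketch performs the same kind of transformation in C: it writes a buffer out as the source text of a byte array plus a size constant. This is not the CMake script used by PROJ; the function name sketch_emit_c_array and the layout of 16 bytes per line are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdio.h>

/* Write len bytes of data to out as C source defining a uint8_t array
 * called name, followed by a <name>_size constant.
 * Returns 0 on success, -1 on empty input or on a write error. */
int sketch_emit_c_array(FILE *out, const char *name,
                        const unsigned char *data, size_t len)
{
    size_t i;

    /* An empty initializer list is not valid before C23, so refuse it. */
    if (len == 0)
        return -1;

    if (fprintf(out, "#include <stdint.h>\nconst uint8_t %s[] = {", name) < 0)
        return -1;

    for (i = 0; i < len; ++i) {
        /* Start a new indented line every 16 bytes. */
        const char *sep = (i % 16 == 0) ? "\n    " : " ";
        if (fprintf(out, "%s0x%02x,", sep, (unsigned int)data[i]) < 0)
            return -1;
    }

    if (fprintf(out, "\n};\nconst unsigned %s_size = sizeof(%s);\n",
                name, name) < 0)
        return -1;
    return 0;
}
```

Emitting one literal per byte is also why a naive generator is slow on a file the size of proj.db: the output is several times larger than the input, and the compiler then has to parse every token.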

memvfs

Loading the embedded proj.db uses the SQLite3 memvfs, as done by DuckDB Spatial.

New CMake options

Resources will only be embedded if the new EMBED_RESOURCE_FILES CMake option is set to ON. This option will default to ON for static library builds and if C23 #embed is detected to be available. Users can also set it to ON for shared library builds. A CMake error is emitted if the option is turned on but the compiler lacks support for it.

A complementary CMake option USE_ONLY_EMBEDDED_RESOURCE_FILES will also be added. It will default to OFF. When set to ON, PROJ will not try to locate resource files in the PROJ_DATA directory burnt into libproj at build time (${install_prefix}/share/proj), or in the directory designated by the PROJ_DATA configuration option.

In other words, if EMBED_RESOURCE_FILES=ON but USE_ONLY_EMBEDDED_RESOURCE_FILES=OFF, PROJ will first try to locate resource files on the file system, and fall back to the embedded version if they are not found.
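The precedence between the two options can be summarised as a small decision function. This is only a sketch of the rules stated above, not code from PROJ: the names are hypothetical, the options are modelled as run-time booleans rather than build-time settings, and the sketch assumes that USE_ONLY_EMBEDDED_RESOURCE_FILES has no effect when EMBED_RESOURCE_FILES is OFF, a combination this RFC does not describe.

```c
#include <stdbool.h>

typedef enum {
    SKETCH_SOURCE_NONE,        /* resource cannot be found anywhere */
    SKETCH_SOURCE_FILE_SYSTEM, /* external file located by the usual search */
    SKETCH_SOURCE_EMBEDDED     /* copy compiled into libproj */
} sketch_source;

/* Decide where a resource such as proj.db would be read from. */
sketch_source sketch_pick_source(bool embed_resource_files,
                                 bool use_only_embedded,
                                 bool found_on_file_system)
{
    /* Assumption: "only embedded" is meaningful only when embedding is on. */
    const bool skip_file_system = embed_resource_files && use_only_embedded;

    /* An external file takes precedence unless the search is disabled. */
    if (!skip_file_system && found_on_file_system)
        return SKETCH_SOURCE_FILE_SYSTEM;

    /* Otherwise fall back to the embedded copy when there is one. */
    if (embed_resource_files)
        return SKETCH_SOURCE_EMBEDDED;

    return SKETCH_SOURCE_NONE;
}
```

For example, with both options ON the function selects the embedded copy even when a file exists on disk, which matches the intent of USE_ONLY_EMBEDDED_RESOURCE_FILES.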

The resource files will still be installed in ${install_prefix}/share/proj, unless USE_ONLY_EMBEDDED_RESOURCE_FILES is set to ON.

Impacted code

  • cmake/FileEmbed.cmake: compatibility script for non-C23 mode to generate embedded resources

  • data/CMakeLists.txt: take into account USE_ONLY_EMBEDDED_RESOURCE_FILES to not install proj.db/proj.ini when it is ON

  • docs/source/install.rst: document EMBED_RESOURCE_FILES and USE_ONLY_EMBEDDED_RESOURCE_FILES

  • src/embedded_resources.c and .h: new files that use #embed or make a bridge to files generated by FileEmbed.cmake

  • src/filemanager.cpp: to take into account EMBED_RESOURCE_FILES for proj.ini

  • src/iso19111/factory.cpp: to take into account EMBED_RESOURCE_FILES for proj.db

  • src/lib_proj.cmake: takes into account EMBED_RESOURCE_FILES and USE_ONLY_EMBEDDED_RESOURCE_FILES in both C23 and non-C23 modes

  • src/memvfs.c and .h: code originating from https://www.sqlite.org/src/file/ext/misc/memvfs.c to handle an in-memory proj.db, with bug fixes and adaptations for PROJ needs

  • src/sqlite3_utils.cpp and .hpp: interface layer of memvfs with src/iso19111/factory.cpp

Out of scope

Embedding of resource files in PROJ is currently limited to proj.db and proj.ini, as those are expected to be the ones most needed in typical embedded use cases. Extension to other resources (such as ITRFxxxx files) could be done as a follow-up enhancement if the need arises, although supporting dual C23/non-C23 mode for too many files could be a bit tedious.

The sky is the limit, so potentially grid files could also be embedded. That would require developing a MemFile implementation in filemanager.cpp (in parallel to the existing FileStdio, FileWin32 or NetworkFile).

Backward compatibility

Fully backwards compatible with default settings.

Static builds will default to EMBED_RESOURCE_FILES=ON, but USE_ONLY_EMBEDDED_RESOURCE_FILES will default to OFF. So an external proj.db and proj.ini found by existing search mechanisms will still have precedence over the embedded files.

Even when EMBED_RESOURCE_FILES and/or USE_ONLY_EMBEDDED_RESOURCE_FILES is enabled, the user can still use proj_context_set_database_path() to provide an alternate database. Network-based fetching of grids is also orthogonal to those settings.

C23 is not required: it is just an opportunity for faster build time when available.

Documentation

The 2 new CMake variables will be documented.

Testing

The existing fedora:rawhide continuous integration target, which now has clang 19.1 available, will be modified to test the effect of the new variables.

Local builds using the GCC 15dev builds from https://jwakely.github.io/pkg-gcc-latest/ have also been done successfully during the development of the candidate implementation.

Voting history

+1 from PSC members KurtS, KristianE, JavierJS, ThomasK and EvenR