Opt. 2: Removing Redundant Attribute Values
SpatialJSON Writer Implementation
It is entirely up to the SpatialJSON writer to decide which strings to add to the shared string table, and several strategies are possible. The current implementation in this module, however, makes no attempt to create an optimal shared string table. To stay fast, it adds strings in the order in which they are encountered while features are serialized. Building an optimal table would likely require iterating over the features several times, calculating string frequencies, and so on.
Nevertheless, this module’s SpatialJSON writer applies some simple rules when building the shared string table. Even in worst-case scenarios, these rules try not to use (many) more bytes than the same result would need without a shared string table. (In theory, there are cases in which the shared string table adds some extra bytes to the result.) For most real-world datasets, however, this strategy can save a moderate to significant number of bytes.
These are the rules that prevent a string from being added to the shared string table:
The string’s UTF-8 encoded byte length is less than a hard-coded minimum (currently 2, may be configurable in the future)
The shared string table is full, that is, it contains 2,147,483,647 entries (not really expected)
The string’s UTF-8 encoded byte length (including quotes) is less than the number of digits of its designated index
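The three rules above can be sketched as a small first-come table. This is only an illustration, not the module's actual code: the class name SharedStringTable, the method lookup_or_add, zero-based indexes, and counting the quotes as exactly two extra bytes (ignoring JSON escaping) are all assumptions made for the example.

```python
MIN_BYTE_LENGTH = 2          # hard-coded minimum named in the rules
MAX_ENTRIES = 2_147_483_647  # table capacity named in the rules


class SharedStringTable:
    """Hypothetical first-come shared string table (illustrative only)."""

    def __init__(self, min_byte_length=MIN_BYTE_LENGTH, max_entries=MAX_ENTRIES):
        self.min_byte_length = min_byte_length
        self.max_entries = max_entries
        self.strings = []    # index -> string, in order of first appearance
        self.index_of = {}   # string -> index, for fast lookup

    def lookup_or_add(self, value):
        """Return the table index for value, or None if it stays inline."""
        if value in self.index_of:
            return self.index_of[value]

        byte_length = len(value.encode("utf-8"))
        next_index = len(self.strings)  # assumed zero-based designated index

        # Rule 1: the string is shorter than the hard-coded minimum.
        if byte_length < self.min_byte_length:
            return None
        # Rule 2: the table is full.
        if next_index >= self.max_entries:
            return None
        # Rule 3: the quoted string is shorter than the digits of its index
        # (quotes assumed to add exactly two bytes).
        if byte_length + 2 < len(str(next_index)):
            return None

        self.strings.append(value)
        self.index_of[value] = next_index
        return next_index
```

In this sketch, a writer would call lookup_or_add once per string attribute value and write the string inline whenever the result is None. Because strings are added in the order they appear, no second pass over the features is needed.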
Obviously, the largest savings are achieved if a dataset contains only a few distinct, long strings. That may be the case for attributes that contain values of an enumeration, for example. The more often a certain string is used in the dataset, the more space can be saved by using a shared string table. In contrast, if every string in the set of encoded features is used only once (e.g. attributes that contain random or UUID-like strings), no savings will be achieved (in fact, using a shared string table in that case will produce slightly bigger results).
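A rough back-of-the-envelope estimate illustrates both extremes. The cost model below is a simplifying assumption, not the actual SpatialJSON encoding: each string is stored once in the table as a quoted value, each occurrence is replaced by its decimal, zero-based index, and separators are ignored. The function name estimate_bytes is hypothetical.

```python
from collections import Counter


def estimate_bytes(values):
    """Return (inline_bytes, table_bytes) under a simplified cost model."""
    # Without a table: every occurrence is written as a quoted string.
    inline = sum(len(v.encode("utf-8")) + 2 for v in values)

    # With a table: each distinct string is stored once (quoted), and every
    # occurrence is replaced by the decimal digits of its index.
    counts = Counter(values)  # keeps order of first appearance
    table = 0
    for index, (value, count) in enumerate(counts.items()):
        quoted = len(value.encode("utf-8")) + 2
        table += quoted + count * len(str(index))
    return inline, table


# One enumeration-like value repeated 1000 times: large savings.
print(estimate_bytes(["residential"] * 1000))               # (13000, 1013)

# Ten unique ID-like values: the table version is slightly bigger.
print(estimate_bytes([f"id-{i:04d}" for i in range(10)]))   # (90, 100)
```

Under these assumptions, the repeated value shrinks from 13,000 to about 1,000 bytes, while the unique values grow from 90 to 100 bytes, because each string is stored in full and referenced once on top of that.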