RPGDX > Forums > Design and programming > Using strings for IDs

Mattias Gustavsson · Posted: Tue Jun 02, 2009 6:01 pm Post subject: Using strings for IDs

Sometimes, it's nice to be able to identify things by name, rather than using a number or an enum.

The problem is, that if you're not careful, using strings can take a lot of memory, and LOADS of CPU-power, with all the stricmp's. When I was transferred to the Tycoon City project, with the task of speeding up the game (which was *really* slow at that point), it turned out they were using about 400MB for storing strings (of which most was used as ID:s, and was mostly duplicates of duplicates), and that string comparisons and string copying was right at the top of the profiling results. I won't go into how I went about solving that particular problem - it as late in the project, and things turned out the way they turned out - but it got significantly faster and smaller, that's for sure.

But ever since then, I'm always a bit wary about using strings - it can easily get out of hand, and it feels wrong to waste both memory and CPU-power on something if you don't have to. So, I thought I'd share my own system for solving this problem, extracted from my (recently cleaned up) Pixie Game Engine. It's a solid, fast solution, ready to just drop into any project, and consists of only two classes.

At its core is a class I call StringId, which is easily created by:

DeveloperX · Posted: Tue Jun 02, 2009 8:52 pm Post subject:

Nice. finally some explanation to that cryptic mess XD
I'll have a look through it sometime later. :)
_________________
Principal Software Architect
Rambling Indie Games, LLC

See my professional portfolio

Ninkazu · Posted: Wed Jun 03, 2009 12:06 am Post subject:

Why not have string ids in the global table also have an integer id (a simple index) so when a script is loaded you can replace the ids at load time and have truly constant lookup rather than amortized? Or is this so you can hardcode using string ids rather than use scripts? *shudder*

Rainer Deyke · Demon Hunter Joined: 05 Jun 2002 Posts: 672

This seems to be a fairly common pattern. It looks like my own identifier class, and bears more than a passing resemblance to boost::flyweight.

It looks like you're preserving the hash value of the string for when the StringId is used as the key in another hash map. I think this is unnecessary. Since StringId already contains a pointer to a shared string, why not use that pointer directly as a hash value? i.e.

RedSlash · Mage Joined: 12 May 2005 Posts: 331

Rainer Deyke · Demon Hunter Joined: 05 Jun 2002 Posts: 672

Mattias Gustavsson · Posted: Wed Jun 03, 2009 10:24 am Post subject:

Yes, Mr. Deyke is quite right - you could just as well use the string pointer as a hash value.

As pointed out, my code does:
- Calculate hash value.
- Use hash value to look up shared string.
- Store the pointer to the shared string.
- Also store the calculated hash value for future lookups using the StringId as key. I want to clarify, that these future lookups has nothing to do with the StringId system itself - at this point, the shared string is already found. But sometimes, I make use of another class, called HashTable, which can store and quickly look up elements based on a hash value. In those cases, I can reuse the hash table stored in the StringId class, rather than recalculate it.

I've found that the hashing function gives me a significantly more even distribution (less collisions) in my HashTable than using the shared string pointers, so that's my reasoning for storing it along with the StringId.

But I realize that this is not useful as part of this easy-drop-in version of the StringId system, so I've uploaded another version with that bit taken out - thanks to Rainer for pointing it out :-)

Download the version without the extra hash value here
_________________
www.mattiasgustavsson.com - My blog
www.rivtind.com - My Fantasy world and isometric RPG engine
www.pixieuniversity.com - Software 2D Game Engine

tcaudilllg · Posted: Thu Jun 04, 2009 11:44 am Post subject:

That is insightful. Good job.

Hajo · Posted: Fri Jun 05, 2009 8:47 am Post subject:

Not sure how useful this is, but Java String class has a method called "intern" that puts a string into a global table, so that you can compare all interned string objects by their references, not needing to compare the contents.

All string literals in Java programs are automatically interned.

This seems to be much like what Matthias was doing. It might be a sign that his idea was good, or a sign that proper languages bring the feature with them by default ;)

Mattias Gustavsson · Posted: Sun Jun 14, 2009 7:28 pm Post subject:

After a lot of discussions, testing and optimizing (you can read the whole disussion here if you understand Swedish), I have a new version of the system, which is even faster and takes less memory (and makes a lot less dynamic allocations):

http://www.colossusentertainment.com/forumref/pixie_stringid_v2.zip
_________________
www.mattiasgustavsson.com - My blog
www.rivtind.com - My Fantasy world and isometric RPG engine
www.pixieuniversity.com - Software 2D Game Engine

cowgod · Posted: Sat Jun 20, 2009 3:08 am Post subject:

If you want it to be faster (especially for large string tables), you could try using a "trie" instead. You can get the general idea of what one is by looking it up in Wikipedia. The algorithms book I read that mentioned tries (Algorithms in C++, Volumes 1-4, 3rd edition) included a slight variation where you only have nodes up to the point at which there's no chance of the result being a different string.

Tries have an interface similar to Hash Maps, but you don't need to deal with hash codes.

Although speed might be improved, the memory usage will most likely be greater.

cowgod · Posted: Sat Jun 20, 2009 3:08 am Post subject:

If you want it to be faster (especially for large string tables), you could try using a "trie" instead. You can get the general idea of what one is by looking it up in Wikipedia. The algorithms book I read that mentioned tries (Algorithms in C++, Volumes 1-4, 3rd edition) included a slight variation where you only have nodes up to the point at which there's no chance of the result being a different string.

Tries have an interface similar to Hash Maps, but you don't need to deal with hash codes.

Although speed might be improved, the memory usage will most likely be greater.

cowgod · Posted: Sat Jun 20, 2009 3:08 am Post subject:

If you want it to be faster (especially for large string tables), you could try using a "trie" instead. You can get the general idea of what one is by looking it up in Wikipedia. The algorithms book I read that mentioned tries (Algorithms in C++, Volumes 1-4, 3rd edition) included a slight variation where you only have nodes up to the point at which there's no chance of the result being a different string.

Tries have an interface similar to Hash Maps, but you don't need to deal with hash codes.

Although speed might be improved, the memory usage will most likely be greater.

Mattias Gustavsson · Posted: Mon Jun 22, 2009 6:44 pm Post subject:

Well, we did a basic implementation of tries, and it was a lot slower. I'm sure it could be sped up quite a lot though, if proper care was taken when implementing it (ie. not using STL/Boost for example), but the current StringId implementation is pretty fast, so I don't feel a need to do change the way it works.

The timings looked like this, though it doesn't say much really, as it's comparing my optimized StringId implementation against the unoptimized Tries implementation

however, if someone wants to try and see if they can make their own faster implementation (which can be a great learning experience), there's a test-bed (including the basic Tries implementation) here:

stringid testbed (note: the testbed requires boost, for the Tries implementation, and contains an earlier version of StringId).

it creates and then locates 100,000 strings, and can either randomly generate them (so all of them is different) or read them from a text-file (the full text of "dr. jekyll and mr. hyde", which is included), resulting in lots of duplicates, which might be a more realistic test case.

I've found that it can be really quite interesting to try these sort of things, as you usually learn quite a bit about what's fast and what isn't.
_________________
www.mattiasgustavsson.com - My blog
www.rivtind.com - My Fantasy world and isometric RPG engine
www.pixieuniversity.com - Software 2D Game Engine