Why hash tags are broken, and ideas for what to do instead.
I was at Moseley Bar Camp last Sunday and there were some great sessions. Andy Mabbett stood up to lead a discussion entitled Let’s Play Tag: recent developments and emerging issues in the use of tagging for added semantic richness.
Andy was looking for discussion on how to solve the problem of ambiguity in hash tags - a popular technique for categorising community tweets on twitter. His example is classic event tagging: the tag for the event was #mbcamp, which works fine for the duration of a Sunday afternoon event, but what if you want tags to be more enduring?
Andy took us step-by-step through the issue of ambiguity of usernames as tags on twitter and flickr and described some of the issues of differing tag normalisation rules.
Andy also asked why we tag:
- To add semantic richness?
- To help your friends find stuff?
- To help machines in 100 years find stuff?
- Don't know
Andy's issue, then, is with the value of these tags longer term and on more enduring stuff like blog posts, photos on flickr and so on. Perhaps 100 years might be pushing it, but it's worth thinking about.
The problem with hash tags comes from the tension between finding something specific enough for the moment, something short enough to not use up too many of the 140 characters and something easy to remember. That's two forces pulling one-way (shorter) and only one pulling the other.
The shorter the tag goes the easier it is to remember and to type, and the fewer characters it uses up, but it also becomes more likely to clash with others. Perhaps some mainstream trends might get away with very short tags, I thought. #fb, for example, means facebook, surely, but looking at how it is actually used, the references to facebook are far outweighed by the noise.
So, twitter's 140 character limit and the profusion of clients means we can only have short, easy to remember text tags, but the need for disambiguation and to be more specific means we need something longer.
We could solve the ambiguity problem by using something like a guid, but that's not easy to remember or type, and is generally quite long. The length issue could be solved by encoding it using unicode characters. Twitter counts multi-byte UTF8 characters as single characters, which is correct, and this opens up some interesting unique tags for those willing to forego the easy typing.
By long I mean cf629dc3-d425-4707-8119-1f35d35d7687 which is a fairly typical GUID and is 36 characters long. That's too long if you only have 140 characters to play with. The length comes from the need to encode it as ASCII. Twitter, where our length obsession comes from, doesn't require characters to be ASCII. The 140 character limit is for 140 UTF8 characters, so we can use a much greater range of characters to represent the same degree of uniqueness in a shorter UTF8 string.
UTF8 isn't ideal as a starting point, though, as the number of bytes per character varies. The bulk of unicode in everyday use sits in the Basic Multilingual Plane, which uses nice simple 2 byte indexes, so we match each group of 4 hex digits from the GUID to a unicode character, then use the UTF8 encoding for those to write it down. By using unicode and UTF8 the 32 hex digits become just a handful of characters, just 8 for this GUID.
cf62 콢, 9dc3 鷃, d425 퐥, 4707 䜇, 8119 脙, 1f35 ἵ, d35d 퍝, 7687 皇
This gives us a tag of #콢鷃퐥䜇脙ἵ퍝皇 which is not easy to type, would be difficult for many to visually identify and could, for all I know, be extremely offensive to those who read CJK, Hangul or Greek. I may have got lucky with that GUID too, there may be GUIDs that don't produce valid unicode pairs.
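The mapping above can be sketched in a few lines of Python. This is only an illustration of the idea, not a recommendation: the function name guid_to_tag is made up for the example, and the only invalid case it guards against is a group that lands in the surrogate range U+D800 to U+DFFF, which is one way a GUID can fail to produce a valid character. Groups that map to control characters or unassigned code points are not checked here.

```python
import uuid


def guid_to_tag(guid: str) -> str:
    """Pack a GUID's 32 hex digits into 8 Unicode characters.

    Hypothetical helper for illustration: each group of 4 hex digits
    is read as one 16-bit code point.
    """
    hex_digits = uuid.UUID(guid).hex  # 32 hex digits, hyphens removed
    chars = []
    for i in range(0, 32, 4):
        group = hex_digits[i:i + 4]
        code_point = int(group, 16)
        # U+D800..U+DFFF are surrogates, not characters, so such a
        # group cannot be written down as a character in UTF-8.
        if 0xD800 <= code_point <= 0xDFFF:
            raise ValueError(f"group {group} is a surrogate code point")
        chars.append(chr(code_point))
    return "".join(chars)


tag = guid_to_tag("cf629dc3-d425-4707-8119-1f35d35d7687")
print("#" + tag)
# 8 characters, but 24 bytes once encoded as UTF-8
print(len(tag), len(tag.encode("utf-8")))
```

The last line shows the distinction that matters for the 140 limit: the tag counts as 8 characters even though each of them takes 3 bytes in UTF8.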
But, as it's a GUID it gives a very high confidence that it is unique, it's only 8 characters long and works as a unicode tag on Flickr and a unicode tag on Hashtags. Just don't look at the raw URLs in the source of the page...
What we lose with that approach is a good deal of ease-of-use. I certainly wouldn't try this technique at an event.
If you're prepared to lose a little usability, maybe by giving people an easy place to grab a copy/paste version of the tag, then you could produce something more easily readable, if not easy to type: #dɯɐɔqɯ for example. I might be tempted to do that, or add a graphic symbol or something.
There's something else that nags at me about hash tags, though. They're really not very webby. You rely on search and on hashtags.org and other specific tools to make sense of them. They can be easily abused, as Habitat showed recently.
So are there other ways to think about tagging? Ways that work with the web rather than just on the web. Examples from those applications where the 140 character limit does not apply? Blog posts, web pages, flickr images and so on?
What if we decided that our requirements for tagging were:
- A very high degree of uniqueness
- Anyone can get information about the tag easily
- Spam and content visible on the tag controlled by the tag owner
- That the tag can be enduring
- That the tag can be used anywhere on the web easily
- That content using the tag can be found with search
- That content using the tag can be found without search
- That no particular service or piece of software is necessary
A similar thing happens with Google's PageRank algorithm. Words used in links to a page, as well as the content of the referring page, contribute to the way a page is indexed.
The Semantic Web bases everything on URIs (the difference between URI and URL is not important here). If you want to give something a name you don't pick a word, you use a URI.
I wonder if we could use URIs as tags? And how that would meet the needs above. Say we were to use http://wxwm.org.uk/moseleybarcamp/2009/June to mean the event that happened last weekend.
It has a very high degree of uniqueness, so it meets our first requirement. It can be put straight into a browser and can provide a page giving details of the event, so it's easy for anyone to get information about the tag. The page at that address can be as clever, or as dumb, as it likes about showing things that link to it - so tag spam can be removed. The link is under control of the domain owner, so can be as enduring as you want to make it. Almost everywhere on the web allows you to post links, so it's easy to use. Links to a specific URL can be easily searched for in Google and other search engines, and in Flickr and Twitter. Most browsers will send referring page information when requesting the URL, so content can be tracked without search - this means you can find out about unindexed and intranet sites referencing the tag. The URL can be a static page, or a script, it can monitor referrers and spam filter - or not. There is no centralised service needed, nor any specific software.
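To make the referrer idea a little more concrete, here is a rough sketch of what the script behind a tag URL might do, written as a plain WSGI application using only the Python standard library. Everything named here is an assumption for the sake of illustration: the tag_page function, the in-memory referrers counter and the BLOCKED set standing in for whatever spam filtering the tag owner chooses. A real page would need persistent storage and would have to allow for browsers that omit or strip the Referer header.

```python
from collections import Counter

# In-memory tally of pages that linked here (illustrative only; a real
# tag page would need persistent storage).
referrers = Counter()

# Hypothetical blocklist maintained by the tag owner to drop tag spam.
BLOCKED = {"http://spam.example/"}


def tag_page(environ, start_response):
    """WSGI app for a tag URL: record the referring page, then list them."""
    referrer = environ.get("HTTP_REFERER", "")
    if referrer and referrer not in BLOCKED:
        referrers[referrer] += 1

    lines = ["Pages linking to this tag:"]
    lines += [f"{url} ({count})" for url, count in referrers.most_common()]
    body = "\n".join(lines).encode("utf-8")

    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

To try it locally, something like wsgiref.simple_server.make_server("", 8000, tag_page).serve_forever() should be enough to serve it, and following a link to the page from elsewhere will add that page to the list.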
Oh, and it could easily be made to work as Linked Data, the pattern for publishing data on the semantic web, to provide machine-readable information about the event and the conversation happening around it...
I think that only leaves the issue of URI length. I can't get close to the 8 characters of the guid, or the 6 of mbcamp, but using bit.ly I can make a memorable short URL such as http://bit.ly/utf8tag that redirects to a much longer one, and as bit.ly don't re-use URLs the bit.ly link remains as unique and almost as enduring (subject to bit.ly's survival) as your own.
Rob, Let me get this straight: you think hashtags are broken because they are ambiguous over time? Isn't that a bit like saying longitudes are broken because they're ambiguous over latitude? I compile some data about conference hashtags: http://go-to-hellman.blogspot.com/2009/06/conference-hashtags-dont-evolve.html Also, remember that utf8 is an encoding, not a character set, so there's no such thing as a UTF8 character. Unicode characters are 2 bytes, not 4, and please don't mix up ascii and hexadecimal. Isn't blogging wonderful, you can use your readers as fact-checkers?
You're right, ambiguity in the moment can be overcome through the use of registries, but that introduces a need to know about a registry. There are some already, I just defined mbcamp on tagdef.com: http://tagdef.com/mbcamp I've only defined it on tagdef.com though, any others out there continue to lack a definition. Defining the tag that way also doesn't prevent others from using it for another purpose, or from abusing it with spam. It confers no degree of uniqueness or ownership. And, where more than one definition exists, it provides no mechanism for deciding which bits of content were referring to which definition when they were tagged. Centralised approaches, and approaches that need prior knowledge of a specific service, are less useful on the web than mechanisms that are de-centralised and discoverable through the normal mechanism of following a link... We already have a great, well-understood, scalable, de-centralised naming and lookup mechanism. It's called the web. Why try to invent another on top of it, which is what hashtags are? I agree with you, and with Andy Powell's comments on twitter that at this point putting URLs into twitter, or using them as tags on Flickr, is harder than using simple word tagging. But, the mechanisms for adding links to stuff are all over the place in other tools, there are well understood interaction design patterns to make that easy. With good UI design we could be using the web for tagging, rather than inventing a different linking mechanism on top of it.
One problem (which is also valid with any URL shortener) is when the UUID will yield something meaningful in a specific language, and you will be tagging a dog with a "cat"… In ASCII at least you can reduce your set to omit, let's say, vowels to avoid that, or worse, have a dictionary as a blacklist. In an Asian language each character has meaning and your hashtags will actually increase the ambiguity (from a human point of view). It would be interesting to restrict to graphical symbols in the unicode spectrum yielding "alien-like" tags #⍔⌊⏏⎀☢☕, looking at these symbols I wonder if people use emoji for tagging…
Hey Eric, good to hear from you. Deciding what to do now you're out of the big house? I do think ambiguity over time is sometimes an issue, but ambiguity in the moment is also an issue and comes from the tension between uniqueness and length that twitter introduces. I don't think that's the same as saying longitude is ambiguous over latitude, it's more like saying my surname alone is not enough to identify me. Or perhaps, a flight number is not enough to catch the plane, you also need to know the date and time... My reference to ASCII was simply that the usual way to represent a GUID is as an ASCII encoded hexadecimal string, but that's not the only way you can do it. I'm fully aware that UTF8 is an encoding, I'm not worried about the language being sloppy as it's not important for the point of this post. You're quite right that Unicode uses a 2 byte character id rather than a 4 byte one, my bad as I was working with the hex characters visually, I've corrected that now. Fundamentally, though, my point is that hashtags are a way of marking content that could be done better using URIs. If people started using URIs you can be sure that UIs would evolve to make it easier too.
@Laurian, yes you're quite right. I'd considered the risk of the tags being offensive but had overlooked them being misleading! I'm not really suggesting using a broader character set and GUIDs as that has some real issues. It was really an exploration of what we really want from tagging, which in my mind comes down to tagging with URIs as Linked Data.
Interesting stuff. I think the problem arises when people are using different devices (such as at conferences etc) and finding an easy way for them to add the tag. Really it would require a fundamental change in the way Twitter works. It would be :very: cool though. Oh, and the reason for the huge amount of #fb hashtags is that there's an app which allows you to post directly to Facebook from Twitter, but only if you post #fb in the body of the message, which is why so many of the tweets bear no relevance at all to Facebook.
Thanks for pointing out the #fb thing, that explains a lot. In terms of the devices, I wonder how long that issue will persist given how mobile devices are becoming more and more capable web tablets.
Rob, My gut feel is that there's a significant danger of over-engineering the solution here... In your list of requirements, 5 ("That the tag can be used anywhere on the web easily") has to come first - and it has to come first by quite a long way... because otherwise you just have something which isn't usable and therefore won't get used. Still, I suppose that one way of ensuring uniqueness is to make sure that no-one ever bothers to create tags in the first place!? :-)
I'm currently using a 22 char rendering of UUID (base64) but I was wondering if there are any other alternatives for shorter randomly generated unique ids; for example for Version 1 UUIDs the MAC can be dropped if you use them in an http URI (assuming that the domain name will substitute for the MAC's function)…
Rob, For now, I've decided that I'm blogging! Ambiguity in the moment can be addressed by the adoption of dictionaries/registries; you can then synthesize a GUID with a timestamp if you need it- a similar approach can be used for airline flights, which I think is a much better analogy than what I thought of. For both airline flights and twitter hashtags, the overriding consideration has to be user acceptance.
Why hashtags are broken, and ideas for what to do instead http://bit.ly/1ENs2c #utf8tag #semanticweb: Source: #.. http://bit.ly/3jUTMt
This comment was originally posted on Twitter