RDF and the Semantic Web are ludicrous ideas

Perspective — RDF and the Semantic Web are getting serious attention, but they are ludicrous ideas. They may be about doing the right thing, but they are doing it the wrong way.

The Semantic Web — and RDF, for that matter — is about providing information on information… But take a few steps back: How much time would you — as an everyday user — spend providing information on information by inserting tags every other word you type?

Suppose, for instance, that you are writing something about your dog Fido, and that you decide to provide some information about Fido:

:Dog rdf:type rdfs:Class .
:Fido rdf:type :Dog .
:name rdf:type rdf:Property .
:Fido :name "Fido" .
:Dog rdfs:subClassOf :Animal .

The above is a piece of so-called “simple datatyping model for RDF” taken from Sean Palmer’s Introduction to the Semantic Web. It stands for: Dog is a class of information, Fido is of type Dog, name is a property, Fido’s name is “Fido”, and Dogs are a subclass of Animal.
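For what it's worth, the mechanics here are simple enough to sketch without any RDF library at all. Here is a toy Python triple store (all names are mine, purely illustrative) that infers Fido is an Animal by walking rdfs:subClassOf:

```python
# A toy triple store: each fact is a (subject, predicate, object) tuple.
triples = {
    ("Dog", "rdf:type", "rdfs:Class"),
    ("Fido", "rdf:type", "Dog"),
    ("name", "rdf:type", "rdf:Property"),
    ("Fido", "name", "Fido"),
    ("Dog", "rdfs:subClassOf", "Animal"),
}

def types_of(resource):
    """Collect the direct rdf:type values, then walk rdfs:subClassOf upward."""
    found = {o for s, p, o in triples if s == resource and p == "rdf:type"}
    frontier = set(found)
    while frontier:
        parents = {o for s, p, o in triples
                   if s in frontier and p == "rdfs:subClassOf"}
        frontier = parents - found
        found |= parents
    return found

types_of("Fido")  # returns {'Dog', 'Animal'}
```

The point is not that this is hard for a machine; it is that someone still has to type the triples in the first place.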

While you’re at it, also provide information about yourself, about what you were doing with Fido, about the place you were with Fido, about why you are talking about him, etc., in the relevant RDF schema.

Let’s get real here: Beyond the syntax, which is not very inspiring to say the least, there’s not a chance in hell any user will fill in this kind of information on information themselves. And no computer will fill it in for you: if one could, the very idea behind the Semantic Web would make no sense at all… Take another few steps back.

What do you expect the computer will automatically fill in? The author? Great stuff: What if your assistant is managing the master document of your entire team’s report? Is she the author? Are you? Your colleagues? Did I mention she based the document on an older one? The date? Sure: the last-modified date might be correct, but the date created might not be, since the initial document was written a year ago. And what if you merge and/or split documents? The topic? Don’t make me laugh: Trying Yahoo!’s context search feature should make it clear we’re not even remotely close to extracting it properly. Let’s suppose… assume… that the computer nevertheless manages to fill in this common meta information by itself. Say, 80% of the time. Will you spend the necessary time to manually check the meta information to cope with the 20% of atypical cases where context actually matters? Will you pay someone to do the job for you, for the sake of providing potentially relevant meta information on the documents you produce? I doubt it… As for automatically inserting relevant tags within the document, take yet another few steps back…

The very purpose of the Semantic Web is to let you interconnect related information to and from your document. As such, unless you come up with an Artificial Intelligence that understands the meaning of your words — making the Semantic Web itself an irrelevant construct — you will need the Semantic Web or a HUGE hand-made Semantic Network to automatically provide you with information on the information you are manipulating in order to automatically fill the RDF data. Yes that’s right: You need the Semantic Web to be up and running in order to automatically generate the Semantic Web itself. Sounds wrong to you too? That’s because the idea itself makes no sense at all.

You’ve certainly run into a categorization problem in the past. As in: you have a document to categorize. Which category will it be? For instance, this document is a Column mainly related to Information Technology, Internet and Semantics; it might also interest someone looking for data on Computational Linguistics or on the Philosophy of Language. Hence any of these categories will do. It’s also about Strawberries and Polar Bears — because I just mentioned both. I’d categorize it as a document on Raspberries. And if you don’t agree, you are making my point.

The idea behind the Semantic Web assumes that some sort of universal language scheme cuts across individuals and cultures. Quoting Tim Berners-Lee himself:

Where for example a library of congress schema talks of an “author”, and a British Library talks of a “creator”, a small bit of RDF would be able to say that for any person x and any resource y, if x is the (LoC) author of y, then x is the (BL) creator of y. This is the sort of rule which solves the evolvability problems. Where would a processor find it?
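The rule he describes is trivial to state in code. A hedged sketch (the prefixes and names are mine, purely illustrative):

```python
# Facts as (subject, predicate, object) triples; names are illustrative only.
facts = {("ex:TimBL", "loc:author", "ex:WeavingTheWeb")}

def author_implies_creator(facts):
    """If x is the (LoC) author of y, also assert x is the (BL) creator of y."""
    derived = {(x, "bl:creator", y)
               for (x, p, y) in facts if p == "loc:author"}
    return facts | derived

facts = author_implies_creator(facts)
# ("ex:TimBL", "bl:creator", "ex:WeavingTheWeb") is now in facts
```

Note that the rule itself was written by a human who already knew the two schemas meant the same thing; the mapping problem is merely pushed up a level.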

If that doesn’t sound absurd to you, here are a few perspectives.

First off, you must be aware that semiotic signs and the concepts behind them are categories. When you speak and think of apples, your interest is in a class of items you categorize as apples.

Then, you need to know that it is mathematically impossible to come up with a finite set of categories that lets you consistently express everything a language will let you say. Unless you decide — as Ludwig Wittgenstein did in his Tractatus Logico-Philosophicus — that you are never manipulating a category, you’ll thus mechanically end up with inconsistencies — as in: “this false statement is true”.

Please note — Part of this post was lost. It was rewritten from this point onward.

Moreover, the way you categorize your environment differs from one culture to another, from one subculture to another, and from one individual to another. For instance, West Greenlandic has no fewer than 49 ways to say “snow” or “ice”. This lack of homogeneity means you’ll readily encounter subcultures and individuals within a culture or subculture that give different meanings to a word. And by all means, this is not hierarchical — far from it — or even consistent.

As such, the very idea of the Semantic Web becomes inane: On one side, there is no absolute scheme to tag the web. On the other, no two individuals will interpret a tag the same way.

You can now safely laugh too: RDF and the Semantic Web are ludicrous ideas.

Comments on RDF and the Semantic Web are ludicrous ideas

  1. I agree that the vision of the semantic web is one that will be difficult to achieve. Yet, I don’t think that means that the effort going into developing it is wasted. I think a lot of it comes down to the client/environment. The current syntax for expressing triples is generally very ugly and unwieldy. I think, however, that there are ways in which many existing systems can generate useful ‘semantic’ data automatically to allow interoperability and searchability etc. For example, your blog uses categories. You’re the author. The time you posted this article. The people who commented (just me so far) and the system you used. That’s why syndication formats like RSS have headed towards semantic web-friendliness.

    I don’t think anyone is arguing that EVERYTHING on the internet will become Semantic-enabled or whatever. Honestly, I don’t think that’s really the vision of the Semantic Web — one in which the whole internet is logically semantically annotated for instant understanding by machines or whatever. Even if it only ever got as far as being adopted by academia it would still be exceptionally useful, because any enhancements to information sharing protocols and tools means that collaboration on research becomes easier and more productive.

    If your article was aimed at people who are putting MONEY on the semantic web, then yes, I think it’s not a good monetary investment unless you have tons of cash that won’t run out for the next 5 years. But there are many little gains to be made by leaving room in new software projects to consider ‘semantic web’ possibilities.

    Burroughs said, “Language is a virus,” and the thing about that idea is that once there are clear footholds for tiny ‘semantic weblings’ to take root, the effects and techniques for making them usable will spread VERY quickly.

    BTW- I like your blog. It took me a while to figure out where your RSS/ATOM feeds were though. (maybe add a link on your page somewhere.) In any case, thanks for your time. I’ll be reading regularly.

  2. Thank you very much for your suggestion and your support!

    I think there is a very big difference between a spontaneously emerging semantic network and the semantic web discussions at the w3c. The key difference lies in the approach.

    On the one side, there’s the w3c’s top-down approach: you tell the machine what meaning is all about, and feed it structured data. In the end, you are merely configuring an expert system. Even if it gets the sentence right, it will have no clue what its meaning is.

    On the other, there’s an emerging system’s bottom-up approach: you tell the machine what learning is all about and feed it mostly unstructured data. In the end, you’ll eventually tell it how humans learn, as opposed to how humans would like to learn, and you’ll be facing a genuine AI that will make you fail the Turing test.

    Note that in the end, building a Semantic Web using the w3c’s approach may turn out to be useful nevertheless: Google’s irrelevant results remain better than no results at all.

  3. You have an RSS feed of your site. That in itself is a counterargument to your entire article.

    YES, you are right – people won’t mark up every little bit they put online, but that isn’t needed: People will mark up the bits they care about. I’d rather have info that matters marked up than all kinds of fluff.

    Companies selling online will mark up their catalogues because 1-2% extra sales for adding some extra processing of their products database is worth it, and it won’t take much of an audience for a new product search engine before they’ll be able to reach that.

    Bloggers already mark up their stuff because it helps people find it and connect.

    All kinds of organisations that have things they want to make available to the widest audience possible will start marking up their most important material as they see it driving traffic.

    Search Engine Optimisation is big business – ultimately, marking up your site with semantic information is just another form of SEO.

    Your example is contrived, because there’s no point in marking up everything – you mark up whatever will have its value increased through markup. That means data that it’s important for people to be able to find and reason about.

    I’m thinking about putting my dad’s genealogical database online in RDF, for instance. Not because it matters that much to me, but he put a tremendous effort into it, gathering data on many thousands of relatives, about 10,000 of whom are currently living, most of whom don’t know much about their family. Putting that data online in a format that is easily linked over the web would allow others to easily link it into the bits and pieces they have.

    These are the kinds of applications you will see because the data is already structured, and creating the semantic web from it means preserving structure instead of losing it by generating pure HTML that is hard to mine.

    A vast amount of the web HAS semantic information associated with it in the databases and content management systems it is generated from – but that information is lost when it is output in a form that is only human readable and not easily machine parsable.

    Unlock 5-10% of the database content that is tied to the net and we already have the Semantic Web.

  4. > Unlock 5-10% of the database content that is tied to the net and we already have the Semantic Web.

    Yes, and agreeably, we will see interesting applications emerge from this human-generated ‘semantic’ information: Google’s sorry search results are better than anything search engines sprouted a couple of years ago.

    But the case here is not so much that the Semantic Web has no use whatsoever. It’s that AI-wise, it is not a relevant idea to start with.

    Indeed, you need to wonder what semantics is. And the idea that semantics can be thought of as something entirely orthogonal to syntax is just silly.

    Consider:

    > This statement is false

    The inconsistency comes from reinjecting the meaning (= supposedly pure semantics) into the statement (= supposedly pure syntax). Are we discussing syntax, semantics, or both at the same time? You cannot separate the two.

    My conclusion: The semantic web, sure. Feel free to build taxonomies and folksonomies and pile up meta data all you want. In the meanwhile, semantic-wise, structured data and unstructured data are equivalent. The only thing that really counts is context.

  5. I agree it’s not particularly useful in an AI context, but to me at least that is not the point. To me the Semantic Web is about allowing the machine assisted discovery, linking and reuse of data in a different context.

    Semantics in the context of the Semantic Web has more to do with logic programming and expert systems than it has to do with “AI”: It’s not about intelligence, it is about inferring relationships by following specific rules in a highly deterministic manner, and using those rules to allow you to reuse the data in ways it was not originally intended.

    I agree that unstructured data CAN have the same semantic content as structured data, but it doesn’t have to. However, the real difference is in accessibility.

    Consider even something as “simple” as parsing a webpage to guess at where the list of current items is vs. using an RSS file to get the same information. The semantic content can very well be the same, but the RSS file adds value by structuring the semantic data in ways that are more easily accessible.
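To make that comparison concrete, here is a minimal sketch of the RSS side using only the Python standard library (the feed fragment is invented for the example):

```python
import xml.etree.ElementTree as ET

# A fragment standing in for a real feed; contents are made up for the example.
rss = """<rss version="2.0"><channel>
  <title>Example blog</title>
  <item><title>First post</title><link>http://example.org/1</link></item>
  <item><title>Second post</title><link>http://example.org/2</link></item>
</channel></rss>"""

# Because the structure is agreed upon in advance, extraction is one line --
# no heuristics, no guessing which part of the page holds the list of items.
titles = [item.findtext("title") for item in ET.fromstring(rss).iter("item")]
```

The scraping alternative would need page-specific guesswork; here the agreed-upon structure does all the work.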

    You could easily produce this information in a structured form using natural language too, so yes, syntax enters into the equation, but the value of RDF is decoupling the two discussions by providing a simple way of making generic assertions about data in a well documented format with well defined semantics (for describing the assertions themselves) so that we can get on with adding meaning instead of discussing syntax.

    This is essentially a way of helping computers get access to semantic markers that humans infer in ways that are still too complex for us to fully understand. The day someone shows me a program that can process a plain text file and infer the same level of semantic information that a human would be able to infer is the day the Semantic Web becomes largely obsolete and we “just” let AI interpret the raw data. However, personally, I believe that day is decades away.

    Even “simple” things like interpreting mostly structured data like a product info sheet on a webpage is a task that is today too hard to be practical. That is, it’s easy to make something that is product specific or web page specific, but beyond the state of the art to make something that will extract information as reliably as a human. Add some simple semantic markers in a well defined format and the task is suddenly trivial.

    That is the value of the Semantic Web.

    Vidar

  6. Bill Gates once said that there is a tendency to overestimate technology advances that are two years away, and a like tendency to underestimate those that are ten years away.

    Recent advances, in spite of being critic-worthy in many ways, make me a firm believer that AI is just around the corner.

    Remember: In 1995, no one was talking about the web. And today, the web is simply everything.

  7. In 1995 I was building websites, so I was certainly talking about the web ;)

    On the other hand, people have been talking about AI for around 40 years, and while AI delivers in many areas (neural networks, for instance, are clearly useful – I use a simple one for an experimental e-mail classifier/spam filter), AI in the sense of reasoning and making conclusions from unstructured data in a natural language has proven elusive despite being considered as “just around the corner” for decades.

    I don’t doubt we’ll see improvements. And I don’t doubt many of those will be impressive. But in the meantime I’d rather make use of the wealth of semantic data that is readily available but “locked away” by lossy presentation methods, and export that information onto the web so we can do impressive stuff without needing to solve the hard problems.

    I see this as a data modelling and presentation issue more than something related to AI: When we have the data available, expose it. When we have reasons to add the data, add it. Where it’s too cumbersome or not serving an immediate need, let’s leave it, and maybe the problem will be solved once AI makes enough advances.

    It’s the classic 80/20 trade off: I’d rather have the easy but imperfect 80% solution now, than wait for the remaining 20% to be sorted out.

    Vidar

  8. AI sprouted more breakthroughs in the past decade than it did in 40 years. I’d say the main reason for this is that cognitive scientists abandoned the computational model.

    The next breakthrough, I think, will come from cognitive scientists who abandon statistics-driven models and replace them with interactionism-driven models. This is taking shape as we write. The reason this will be a breakthrough is this: Statistics (and more generally maths) suppose variable independence; whereas cognitive-wise, such a state is meaningless.

    By switching to this perspective, you actually start asking the only question that makes any sense when it comes to epistemology: Is science explaining the world, or is it describing the way we think about the world?

    Imho, the real question is not so much whether a machine will pass the Turing test; it is whether a human will still pass it in a couple of years.

  9. Hi, and congrats for a very interesting bunch of articles.
    In my humble and (very) uninformed opinion, I think that the point Vidar is trying to make is that this is not about a confrontation between the “semantic” web (i.e. standardized, structurable data made more interoperable, searchable and, statistically, measurable) and AI.
    If search engines could rely on known, documented data-publishing conventions, then those Google “sorry” search results could be, well, less sorry. That is something you can see happening now, by the way, just by dropping <font> tags for <h1> tags. The former said “a font of x type and face and color and family goes here”. The latter says “a title goes here, which is more important for you, Search Engine, than any other text”. I think that’s already pretty useful for everyday users. As for these very users not using correct markup, well, tools (interoperable tools, tools that guide users in the appropriate use of markup) do certainly help.
    Granted, this is not real semantics, but I bet a future AI program will be much more pleased to parse non-presentational XHTML documents than tag soup.
    Otherwise I find research in AI and Neural Networks really interesting.

  10. To assume that all individuals would have to add meta-data to their own text is silly. As tools evolve, the meta-data will be automated, much like the abstraction from C and other primitive languages to introspective languages like Java and C#, which provide much more information from known context but do not require a large degree of extra effort from the programmer.