Standards and XML
Required Resources:
Outline:
Standards: terms and discussion
General:
Standard: (a) A document, established by consensus and approved by an accredited standards development organization, that provides for common and repeated use, rules, guidelines, or characteristics for activities or their results, aimed at the achievement of the optimum degree of order and consistency in a given context. (b) Something set up and established by authority, custom, or general consent as a rule for the measure of quantity, weight, extent, value, or quality as a model or example. (from: SEI Open Systems Glossary). Examples include: MP3 (for audio compression), ISO 9000 (for quality), or DVD-Video.
Specification: A document that prescribes, in a complete, precise, verifiable manner, the requirements, design, behavior, or characteristics of a system or system component (from: SEI Open Systems Glossary). A specification can be understood as a "rough draft" of a standard, as successful specifications are often made into standards.
De Facto & De Jure Standards: De facto is a Latin expression that means "in fact" or "in practice". It is commonly used in contrast to de jure (meaning "by law") when referring to matters of law, governance, or technique (such as standards), that are found in the common experience as created or developed without or against a regulation. (Wikipedia) (Microsoft Word is an example of a de facto standard for document formatting; ISO 9001 is a de jure standard for quality control.)
Interoperability: The ability of systems, units, or forces to provide services to and accept services from other systems, units or forces and to use the services so exchanged to enable them to operate effectively together. (Wikipedia)
Organizations:
IEEE: The Institute of Electrical and Electronics Engineers or IEEE (pronounced as eye-triple-ee) is an international non-profit, professional organization for the advancement of technology related to electricity. It describes itself as "directed toward the advancement of the theory and practice of electrical, electronics, communications and computer engineering, as well as computer science, the allied branches of engineering and the related arts and sciences." The IEEE is a leading developer of industrial standards in a broad range of disciplines, including electric power and energy, biomedical technology and healthcare, information technology, information assurance, telecommunications, consumer electronics, transportation, aerospace, and nanotechnology. (Adapted from Wikipedia: IEEE)
ISO is the International Organization for Standardization, a global federation of over a hundred national standards bodies with its central secretariat in Geneva, Switzerland. The name ISO is derived from the Greek ἴσος (isos, meaning "equal") and is not an acronym for the organization's name. Although more than 15000 ISO standards have been published so far, each identified by a document number, the term "ISO" is in some fields commonly used just on its own as a short name for something defined in one of these specifications. For example, an ISO image in computing is a disc image of an ISO 9660 file system. (Adapted from Wikipedia: ISO)
OASIS: The Organization for the Advancement of Structured Information Standards is a global consortium that drives the development of e-business and web service standards. Members of the consortium decide how and what work is undertaken through an open, democratic process. (Adapted from Wikipedia: OASIS)
W3C: World Wide Web Consortium is a consortium that produces the software specifications ("recommendations", as they call them) for the World Wide Web. (Wikipedia)
Food for Thought: Standards and the customer.
The Economist article that is required reading for this week concludes by saying that the customer --individual customers, but also businesses, large and small-- will have the final word in the development and implementation of standards for information technologies. Is this consistent with what we have learned so far in this course about network effects? Consider the connections between network effects and standards in information and network technologies.
The Anatomy of a Document (Excerpted from: Learning XML; provided as review and reinforcement for "Getting Started with SGML/XML")
Instructor comments added in italics.
XML lets you name the parts anything you want, unlike HTML, which limits you to predefined tag names. XML doesn't care how you're going to use the document, how it will appear when displayed to an end user, or even what the names of the elements mean. All that matters is that you follow the basic rules for markup described in this chapter. This is not to say that matters of organization aren't important, however. You should choose element names that make sense in the context of the document, instead of random things like signs of the zodiac.
There is also the important matter of ensuring other people and systems understand quite precisely what you intend by particular tag names. For example, does <title> refer to something that belongs to a person (e.g. Mr, Dr, Esq.) or a work (e.g. Of Mice and Men)? This can be established in a few ways:
- through a community of practice (e.g. if there are two or more people who understand and support the "time-o-gram" document format)
- through a more formal specification or standardization process, supported by an organization or initiative (e.g. OASIS, Dublin Core, ISO)
Example 2.1. A Small XML Document
<?xml version="1.0"?>
<time-o-gram pri="important">
<to>Sarah</to>
<subject>Reminder</subject>
<message>Don't forget to recharge K-9
<emphasis>twice a day</emphasis>.
Also, I think we should have his
bearings checked out. See you soon
(or late). I have a date with
some <villain>Daleks</villain>...
</message>
<from>The Doctor</from>
</time-o-gram>
The time-o-gram example, like all XML, consists of content interspersed with markup symbols. The angle brackets (< >) and the names they enclose are called tags. Tags demarcate and label the parts of the document, and add other information that helps define the structure. The text between the tags is the content of the document, raw information that may be the body of a message, a title, or a field of data. The markup and the content complement each other, creating an information entity with partitioned, labeled data in a handy package.
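To make the split between markup and content concrete, here is a minimal Python sketch that uses the standard library's xml.etree.ElementTree module to pull the two apart. The helper names tag_names and text_content are invented for this illustration and are not part of the excerpt or of any standard.

```python
import xml.etree.ElementTree as ET

# The time-o-gram from Example 2.1, embedded as a string.
TIME_O_GRAM = """<?xml version="1.0"?>
<time-o-gram pri="important">
<to>Sarah</to>
<subject>Reminder</subject>
<message>Don't forget to recharge K-9
<emphasis>twice a day</emphasis>.
Also, I think we should have his
bearings checked out. See you soon
(or late). I have a date with
some <villain>Daleks</villain>...
</message>
<from>The Doctor</from>
</time-o-gram>"""


def tag_names(xml_text):
    """Return the element names (the markup) in document order."""
    return [element.tag for element in ET.fromstring(xml_text).iter()]


def text_content(xml_text):
    """Return only the character content, with every tag stripped out."""
    return "".join(ET.fromstring(xml_text).itertext())


if __name__ == "__main__":
    print(tag_names(TIME_O_GRAM))
    print(text_content(TIME_O_GRAM))
```

Running the script prints the list of element names first and then the bare text of the memo, which is one way to see that the tags label the content without being part of it.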
Although XML is designed to be relatively readable by humans, it isn't intended to create a finished document. In other words, you can't open up just any XML-tagged document in a browser and expect it to be formatted nicely. XML is really meant as a way to hold content so that, when combined with other resources such as a stylesheet, the document becomes a finished product, complete with style and polish. (Some browsers, such as Internet Explorer 5.0, do attempt to handle XML in an intelligent way, often by displaying it as a hierarchical outline that can be understood by humans. However, while it looks a lot better than munged-together text, it is still not what you would expect in a finished document. For example, a table should look like a table, a paragraph should be a block of text, and so on. XML on its own cannot convey that information to a browser.)
We'll look at how to combine a stylesheet with an XML document to generate formatted output in Chapter 4, "Presentation: Creating the End Product". For now, let's just imagine what it might look like with a simple stylesheet applied. For example, it could be rendered as shown in Example 2-2.
Example 2.2. The Memorandum, Formatted with a Stylesheet
TIME-O-GRAM : Reminder
Important
To: Sarah From: The Doctor
Don't forget to recharge K-9 twice a day. Also, I think we should have his bearings checked out. See you soon (or late). I have a date with some Daleks...
The rendering of this example is purely speculative at this point. If we used some other stylesheet, we could format the same memo a different way. It could change the order of elements, say by displaying the From: line above the message body. Or it could compress the message body to a width of 20 characters. Or it could go even further by using different fonts, creating a border around the message, causing parts to blink on and off--whatever you want. The beauty of XML is that it doesn't put any restrictions on how you present the document.
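The idea that one document can be presented in several ways can be sketched in a few lines of Python. This is only an illustration: real stylesheets are normally written in a language such as CSS or XSLT, and the two functions below (render_plain and render_narrow, both hypothetical names) merely stand in for two different stylesheets applied to the same memo.

```python
import textwrap
import xml.etree.ElementTree as ET

# The time-o-gram from Example 2.1.
MEMO = """<?xml version="1.0"?>
<time-o-gram pri="important">
<to>Sarah</to>
<subject>Reminder</subject>
<message>Don't forget to recharge K-9
<emphasis>twice a day</emphasis>.
Also, I think we should have his
bearings checked out. See you soon
(or late). I have a date with
some <villain>Daleks</villain>...
</message>
<from>The Doctor</from>
</time-o-gram>"""


def _fields(xml_text):
    """Extract the parts of the memo, independent of any presentation."""
    root = ET.fromstring(xml_text)
    # Flatten the mixed content of <message> and normalize its whitespace.
    body = " ".join("".join(root.find("message").itertext()).split())
    return {
        "pri": root.get("pri", ""),
        "to": root.findtext("to"),
        "from": root.findtext("from"),
        "subject": root.findtext("subject"),
        "body": body,
    }


def render_plain(xml_text):
    """First 'stylesheet': a layout similar to Example 2.2."""
    f = _fields(xml_text)
    return "\n".join([
        f"TIME-O-GRAM : {f['subject']}",
        f["pri"].capitalize(),
        f"To: {f['to']} From: {f['from']}",
        f["body"],
    ])


def render_narrow(xml_text, width=20):
    """Second 'stylesheet': From: line first, body wrapped to a narrow column."""
    f = _fields(xml_text)
    lines = [f"From: {f['from']}", f"To: {f['to']}"]
    lines += textwrap.wrap(f["body"], width=width)
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_plain(MEMO))
    print()
    print(render_narrow(MEMO))
```

Note that neither function changes the XML itself. The memo stays the same and only the presentation differs.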
Let's look closely at the markup to discern its structure. As Figure 2-1 demonstrates, the markup tags divide the memo into regions, represented in the diagram as boxes containing other boxes. The first box contains a special declarative prolog that provides administrative information about the document. The other boxes are called elements. They act as containers and labels of text. The largest element, labeled <time-o-gram>, surrounds all the other elements and acts as a package that holds together all the subparts. Inside it are specialized elements that represent the distinct functional parts of the document. Looking at this diagram, we can say that the major parts of a <time-o-gram> are the destination (<to>), the sender (<from>), a message teaser (<subject>), and the message body (<message>). The last is the most complex, mixing elements and text together in its content. So we can see from this example that even a simple XML document can harbor several levels of structure.
Figure 2.1. Elements in the memo document
NOTE: The XML declaration states that this file contains an XML document conforming to Version 1.0 of the XML specification. Because the declaration does not name an encoding, UTF-8 is assumed by default (see Wikipedia for more about character sets and UTF-8). The standalone property is not mentioned, so the default value of "no" will be used.
A Tree View
Elements divide the document into its constituent parts. They can contain text, other elements, or both. Figure 2-2 breaks out the hierarchy of elements in our memo. This diagram, called a tree because of its branching shape, is a useful representation for discussing the relationships between document parts. The black rectangles represent the seven elements. The top element (<time-o-gram>) is called the root element. You'll often hear it called the document element, because it encloses all the other elements and thus defines the boundary of the document. The rectangles at the end of the element chains are called leaves, and represent the actual content of the document. Every object in the picture with arrows leading to or from it is a node.
Figure 2.2. Tree diagram of the memo
There's one piece of Figure 2-2 that we haven't yet mentioned: the box on the left labeled pri. It was inside the <time-o-gram> tag, but here we see it branching off the element. This is a special kind of content called an attribute that provides additional information about an element. Like an element, an attribute has a label (pri) and some content (important). You can think of it as a name/value pair contained in the <time-o-gram> element tag. Attributes are used mainly for modifying an element's behavior rather than holding data; later processing might print "High Priority" in large letters at the top of the document, for example.
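As a small sketch of how later processing might use the pri attribute, the Python below reads it as a name/value pair and maps it to a display label. The banner function and its table of labels are assumptions made for illustration. Only the value "important" and the "High Priority" label come from the text above.

```python
import xml.etree.ElementTree as ET

# A cut-down time-o-gram: just enough to show the attribute.
root = ET.fromstring('<time-o-gram pri="important"><to>Sarah</to></time-o-gram>')


def banner(element):
    """Map the pri attribute to a display label (the labels are hypothetical)."""
    labels = {"important": "High Priority", "normal": "Normal"}
    # element.get returns the attribute value, or the fallback if it is absent.
    return labels.get(element.get("pri", "normal"), "Normal")


if __name__ == "__main__":
    print(root.attrib)   # all attributes of the element, as name/value pairs
    print(banner(root))
```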
Now let's stretch the tree metaphor further and think about the diagram as a sort of family tree, where every node is a parent or a child (or both) of other nodes. Note, though, that unlike a family tree, an XML element has only one parent. With this perspective, we can see that the root element (a grizzled old <time-o-gram>) is the ancestor of all the other elements. Its children are the four elements directly beneath it. They, in turn, have children, and so on until we reach the childless leaf nodes, which contain the text of the document and any empty elements. Elements that share the same parent are said to be siblings.
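The family relationships can be explored with a short Python sketch. The standard library's ElementTree does not store a link from a child back to its parent, so the sketch builds a parent lookup first. The function names and the abbreviated copy of the memo are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# An abbreviated copy of the time-o-gram with the same element structure.
MEMO = """<time-o-gram pri="important">
<to>Sarah</to>
<subject>Reminder</subject>
<message>Don't forget to recharge K-9 <emphasis>twice a day</emphasis>.
I have a date with some <villain>Daleks</villain>...</message>
<from>The Doctor</from>
</time-o-gram>"""

root = ET.fromstring(MEMO)

# Every element has exactly one parent, so a plain dictionary is enough.
parent_of = {child: parent for parent in root.iter() for child in parent}


def children(element):
    """Names of the elements directly beneath this one."""
    return [child.tag for child in element]


def siblings(element):
    """Names of the other elements that share this element's parent."""
    parent = parent_of.get(element)
    if parent is None:          # the root element has no parent
        return []
    return [child.tag for child in parent if child is not element]


def ancestors(element):
    """Names of the elements above this one, nearest first."""
    chain = []
    while element in parent_of:
        element = parent_of[element]
        chain.append(element.tag)
    return chain


if __name__ == "__main__":
    villain = root.find("message/villain")
    print(children(root))
    print(siblings(villain))
    print(ancestors(villain))
```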
Every node in the tree can be thought of as the root of a smaller subtree. Subtrees have all the properties of a regular tree, and the top of each subtree is the ancestor of all the descendant nodes below it. We will see in Chapter 6, "Transformation: Repurposing Documents", that an XML document can be processed easily by breaking it down into smaller subtrees and reassembling the result later. Figure 2-3 shows some examples of subtrees in our <time-o-gram> example.
Figure 2.3. Some subtrees
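Because every subtree is itself a tree, a recursive function is a natural fit: it handles one element, calls itself on each child subtree, and combines the results. The sketch below, which assumes an abbreviated copy of the memo and a hypothetical function named outline, produces an indented outline this way.

```python
import xml.etree.ElementTree as ET

# An abbreviated copy of the time-o-gram with the same element structure.
MEMO = """<time-o-gram pri="important">
<to>Sarah</to>
<subject>Reminder</subject>
<message>Don't forget to recharge K-9 <emphasis>twice a day</emphasis>.
I have a date with some <villain>Daleks</villain>...</message>
<from>The Doctor</from>
</time-o-gram>"""


def outline(element, depth=0):
    """Return one indented line per element in the subtree rooted at element."""
    lines = ["  " * depth + element.tag]
    for child in element:
        # Each child is the root of a smaller subtree; process it the same way
        # and append its lines to the result for this level.
        lines.extend(outline(child, depth + 1))
    return lines


if __name__ == "__main__":
    root = ET.fromstring(MEMO)
    print("\n".join(outline(root)))                   # the whole tree
    print("\n".join(outline(root.find("message"))))   # just one subtree
```

The same function works unchanged on the whole document or on any element inside it, which is the practical meaning of "subtrees have all the properties of a regular tree."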
RSS: Really Simple Syndication, Rich Site Summary, and RDF Site Summary
RSS has several meanings: Really Simple Syndication, Rich Site Summary, and RDF Site Summary, where RDF stands for Resource Description Framework. In any case, it's a method of summarizing the latest news and information from a website in a form that can be easily read by many news readers or news aggregators. (FirstGov.gov)
RSS is one of the simplest and most popular technologies that combine metadata and XML in a standard syntax and semantics to connect and manipulate information in a distributed networked environment. It is an easy and accessible way to understand the principles behind more complicated XML document types and the Web services developed to exchange and process them.
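To show what "standard syntax and semantics" buys us, here is a minimal sketch that reads a tiny feed in Python. The element names (rss, channel, item, title, link, description) follow the RSS 2.0 format, but the feed contents and the example.org addresses are invented for illustration, and the headlines function is a hypothetical helper rather than part of any aggregator.

```python
import xml.etree.ElementTree as ET

# A made-up RSS 2.0 feed; titles and URLs are placeholders.
FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Library News</title>
    <link>http://example.org/news</link>
    <description>Hypothetical feed for illustration</description>
    <item>
      <title>New hours</title>
      <link>http://example.org/news/hours</link>
    </item>
    <item>
      <title>XML workshop</title>
      <link>http://example.org/news/workshop</link>
    </item>
  </channel>
</rss>"""


def headlines(feed_text):
    """Return (title, link) pairs for every item in the feed's channel."""
    channel = ET.fromstring(feed_text).find("channel")
    return [
        (item.findtext("title"), item.findtext("link"))
        for item in channel.findall("item")
    ]


if __name__ == "__main__":
    for title, link in headlines(FEED):
        print(f"{title}: {link}")
```

Because every feed that follows the format uses the same element names with the same meanings, the same few lines can read a feed from any publisher. This is the same point made earlier about agreeing on what a tag such as <title> means.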
Check out these examples of RSS use by libraries (and elsewhere):
- http://www.library.ualberta.ca/rss/
- http://www.firstgov.gov/Topics/Reference_Shelf/Libraries/RSS_Library.shtml
- feed://www.nature.com/nature/journal/v436/n7050/rss.rdf
- http://www.nasa.gov/rss/rtf_news.rss
Two Activities with RSS Feeds
Activity: Subscribe to RSS feeds. Set up a Bloglines account to subscribe to a number of RSS feeds that interest you. These can be from classmates' or friends' blogs or from the feeds provided by Bloglines itself (or from any other source):
- Click on the "sign up now" link in the center of the www.bloglines.com home page.
- On the registration page, enter any email address you wish (you can use your utoronto email, or any other email or account available to you).
- Choose any password that you'll be sure to remember.
- After registering, you will receive an email from "Bloglines Validation" that will provide a validation link that you should click to validate your account.
- Selecting the link will take you to a page where you will be able to make a number of "subscriptions." Feel free to subscribe to any of the options or feeds listed there.
- Click on the "My Feeds" tab in the upper left corner; click on "add."
- Type (or paste) in the Web address of an RSS feed (preferably from someone in the class). You can also use the "quick pick subscriptions" option to subscribe to as many other RSS feeds as interest you.
Activity: Create your own RSS Feed.
- Follow the instructions provided at: http://learningspaces.org/1311/rss.html
- As indicated in the instructions, submit the result to colin.furness@utoronto.ca.
- Email norm_friesen@sfu.ca if you have any questions about any of the steps.