
NAME

uri - a set of functions to manipulate URIs

DESCRIPTION

The header file for the library is #include <uri.h> and the library may be linked using -luri.

uri is a library that analyses and transforms URIs. It is designed to be fast and to use as little memory as possible. Its basic use is to split a URI into a structure with one field for each component of the URI, and to rebuild a URI string from such a structure.

LIBRARY MODE

The behaviour of the library is controlled by the flags described below. The default set of flags is

URI_MODE_CANNONICAL|URI_MODE_ERROR_STDERR.

URI_MODE_CANNONICAL

All objects store the URI in canonical form.

URI_MODE_LOWER_SCHEME

The scheme of the URI is always converted to lower case.

URI_MODE_ERROR_STDERR

If an error occurs, the error string is printed on the standard error channel (stderr).

URI_MODE_FIELD_MALLOC

Each field may have its own malloc'd space. When the caller sets a field, it can assume that the content of the field is saved in the object. If this flag is not set, the caller must make sure that the memory containing the value of a field is not freed before the object is deallocated.

URI_MODE_FURI_MD5

Use an MD5 key computed from the URL as the path name, instead of the readable path name described in the FURI section below. For example http://www.foo.com/ is transformed into the MD5 key 33024cec6160eafbd2717e394b5bc201 and the corresponding FURI is 33/02/4c/ec6160eafbd2717e394b5bc201.
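
The way the key is cut into directory levels in the example above can be sketched as follows. This is only an illustration: md5_hex_to_furi is a hypothetical helper, not a function of the library, and it assumes that the 32 character hexadecimal key has already been computed elsewhere.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the uri library): split a 32 character
 * hexadecimal MD5 key into the xx/xx/xx/rest layout shown above.
 * Returns 0 on success, -1 if the key or the buffer is unsuitable.
 */
int md5_hex_to_furi(const char* key, char* out, size_t out_size)
{
    /* 32 key characters + 3 slashes + terminating null = 36 bytes. */
    if (strlen(key) != 32 || out_size < 36)
        return -1;
    snprintf(out, out_size, "%.2s/%.2s/%.2s/%s",
             key, key + 2, key + 4, key + 6);
    return 0;
}
```

Spreading the files over three levels of two character directories keeps each directory small even when many URLs are stored.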

URI_MODE_URI_STRICT

Behave in strict mode (see STRICTNESS below).

URI_MODE_URI_STRICT_SCHEME

Behave in strict mode only when the scheme of a relative URI differs from the scheme of the base URI (see STRICTNESS below).

URI_MODE_FLAG_DEFAULT

The default mode of the library.

STRUCTURE AND ALLOCATION

The uri_t type is a structure describing the URI. Access functions are provided and should be used to get the values of the fields and to set new values. All the fields are character strings whose size is exactly the size of the string they contain. One can safely overwrite the value contained in a field, as long as the replacement string is no longer than the original. If the replacement string is larger, the caller must use a buffer of its own.

If the flag URI_MODE_FIELD_MALLOC is not set, which is the default, the allocation policy for a uri_t object is minimal. When an object is allocated using uri_alloc, memory is allocated by the library to store the object. This memory is released when the object is freed using uri_free. When a field is set, the pointer is stored in the object and no copy of the string is kept. It is the responsibility of the caller to make sure that the string lives as long as the object lives.

This policy is designed to avoid allocation as much as possible. Say you have a program that operates on 50 000 URLs: only one malloc and a few realloc calls are necessary, instead of 50 000 malloc/free pairs multiplied by the number of fields of the structure. The loop will look like this:



    /*
     * Allocate an empty object.
     */
    uri_t* uri = uri_alloc_1();

    for(i = 0; i < 50000; i++) {
        /*
         * Reuse the object for another url. The object grows
         * only if needed because the url is larger than
         * any previously seen url.
         */
        uri_realloc(uri, url[i], strlen(url[i]));
        /* ... do something on uri ... */
        /*
         * Print the url on stdout.
         */
        printf("%s\n", uri_uri(uri));
    }



If the flag URI_MODE_FIELD_MALLOC is set, each field has a separately allocated space, if necessary. The caller may assume that the object is always self contained and does not depend on externally allocated strings. Each set function (uri_scheme_set, uri_host_set etc.) allocates the necessary space and duplicates the string given in argument. The info field contains flags that record which fields contain malloc'd space and which do not (URI_INFO_M_* flags). This information is only valid between two calls of the library functions. For instance uri_cannonicalize will reorganize allocated space. This policy is used for integration of the library into scripting languages such as Perl.



info

Bit flags describing the URI. Each flag has a corresponding define with the following meaning.

    URI_INFO_CANNONICAL Set if the URI is in canonical form.

    URI_INFO_RELATIVE Set if the URI is a relative URI (does not start with {http,..}://).

    URI_INFO_RELATIVE_PATH Set if the URI is a relative URI and the path does not start with a /.

    URI_INFO_PARSED Set if the URI was successfully parsed. If this flag is not set the content of the object is undefined.

    URI_INFO_ROBOTS Set if the URI is an http robots.txt file.

    URI_INFO_M_* There is such a flag for each field of the uri_t structure. If the flag is set, the memory pointed to by this field has been allocated by malloc.

scheme

The scheme of the URI (http, ftp, file or news).

host

The host name part of the URI.

port

The port number associated with host, if any.

path

The path name of the URI.

params

The parameters of the URI (i.e. what is found after the ; in the path).

query

The query part of a cgi-bin call (i.e. what is found after the ? in the path).

frag

The fragment of the document (i.e. what is found after the # in the path).

user

If authentication information is set, the user name.

passwd

If authentication information is set, the password.

FUNCTIONS

uri_t* uri_alloc_1()

Allocate an empty object that must be filled with the uri_realloc function.

uri_t* uri_alloc(char* uri, int uri_length)

The uri is split into fields and the corresponding uri_t structure is returned. The structure is allocated using malloc. The URI is put in canonical form. If it cannot be put in canonical form, an error message is printed on stderr and a null pointer is returned.

uri_t* uri_object(char* uri, int uri_length)

The uri is split into fields and the corresponding uri_t structure is returned. The returned structure is statically allocated and must not be freed. The URI is put in canonical form. If it cannot be put in canonical form, an error message is printed on stderr and a null pointer is returned.

int uri_realloc(uri_t* object, char* uri, int uri_length)

The uri is split into fields in the previously allocated object structure. The URI is put in canonical form and URI_CANNONICAL is returned. If it cannot be put in canonical form, nothing is done and URI_NOT_CANNONICAL is returned.

void uri_free(uri_t* object)

The object previously allocated by uri_alloc is deallocated.

uri_t* uri_abs(uri_t* base, char* relative_string, int relative_length)

Transform the relative URI relative_string into an absolute URI using base as the base URI. The returned uri_t object is allocated statically and must not be freed.

uri_t* uri_abs_1(uri_t* base, uri_t* relative)

Transform the relative URI relative into an absolute URI using base as the base URI. The returned uri_t object is allocated statically and must not be freed.

int uri_info(uri_t* object)

returns the content of the info field.

char* uri_scheme(uri_t* object)

returns the content of the scheme field.

char* uri_host(uri_t* object)

returns the content of the host field.

char* uri_port(uri_t* object)

returns the value of the port field of the object. If the port field is empty, returns the default port for the corresponding scheme. For instance, if the scheme is http the string 80 is returned. The returned string is statically allocated and must not be freed.
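
The fallback to a default port can be sketched as below. This is not the code of the library: port_or_default is a hypothetical helper, and only two well known defaults (http and ftp) are listed since the table used by the library is not described here.

```c
#include <string.h>

/*
 * Hypothetical helper (not part of the uri library): return port if it is
 * set, otherwise a well known default port for scheme.
 * The returned pointer is either port itself or a string literal, so the
 * caller must not free it.
 */
const char* port_or_default(const char* scheme, const char* port)
{
    if (port != NULL && port[0] != '\0')
        return port;
    if (strcmp(scheme, "http") == 0)
        return "80";
    if (strcmp(scheme, "ftp") == 0)
        return "21";
    /* No default is known for this scheme in this sketch. */
    return "";
}
```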

char* uri_path(uri_t* object)

returns the content of the path field.

char* uri_params(uri_t* object)

returns the content of the params field.

char* uri_query(uri_t* object)

returns the content of the query field.

char* uri_frag(uri_t* object)

returns the content of the frag field.

char* uri_user(uri_t* object)

returns the content of the user field.

char* uri_passwd(uri_t* object)

returns the content of the passwd field.

char* uri_netloc(uri_t* object)

returns a concatenation of the host and port fields, separated by a :. If the host field is not set, the null pointer is returned and a message is printed on stderr. The returned string is statically allocated and must not be freed.

char* uri_auth_netloc(uri_t* object)

returns a concatenation of the host and port fields, separated by a :. If the user field is set, the user and passwd fields are prepended to the netloc, separated from it by a @. If the host field is not set, the null pointer is returned and an error condition is set. The returned string is statically allocated and must not be freed.
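
The layout of the returned string can be illustrated with plain strings. The sketch below is not the code of the library: build_auth_netloc is a hypothetical helper that writes into a buffer supplied by the caller, and it assumes that an unset passwd is rendered as an empty string.

```c
#include <stdio.h>

/*
 * Hypothetical helper (not part of the uri library): build
 * user:passwd@host:port, or host:port when user is not set.
 * port must be a non null string. Returns 0 on success, -1 if host is
 * not set or if the result does not fit in out.
 */
int build_auth_netloc(const char* user, const char* passwd,
                      const char* host, const char* port,
                      char* out, size_t out_size)
{
    int n;

    if (host == NULL || host[0] == '\0')
        return -1;
    if (user != NULL && user[0] != '\0')
        n = snprintf(out, out_size, "%s:%s@%s:%s",
                     user, passwd != NULL ? passwd : "", host, port);
    else
        n = snprintf(out, out_size, "%s:%s", host, port);
    /* snprintf reports the length it needed: detect truncation. */
    return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}
```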

char* uri_auth(uri_t* object)

returns a concatenation of the user and passwd fields, separated by a :, or an empty string if either of them is not set. The returned string is statically allocated and must not be freed.

char* uri_all_path(uri_t* object)

returns a concatenation of the path, params and query fields in the form /path;params?query. Note that a leading slash is only prepended to the returned value if the object is not a relative URI. The returned string is statically allocated and must not be freed.
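
The /path;params?query form can be sketched as follows. This is an illustration and not the code of the library: build_all_path is a hypothetical helper, and it assumes that the path is stored without its leading slash and that an empty params or query is omitted together with its separator.

```c
#include <stdio.h>

/*
 * Hypothetical helper (not part of the uri library): build
 * /path;params?query in out. The leading slash is only added when
 * relative is 0. Returns 0 on success, -1 if the result does not fit.
 */
int build_all_path(int relative, const char* path, const char* params,
                   const char* query, char* out, size_t out_size)
{
    int has_params = (params != NULL && params[0] != '\0');
    int has_query = (query != NULL && query[0] != '\0');
    int n = snprintf(out, out_size, "%s%s%s%s%s%s",
                     relative ? "" : "/",
                     path != NULL ? path : "",
                     has_params ? ";" : "", has_params ? params : "",
                     has_query ? "?" : "", has_query ? query : "");

    return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}
```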

void uri_info_set(uri_t* object, int value)

set the info field to value.

void uri_scheme_set(uri_t* object, char* value)

set the scheme field to value. The URI_INFO_RELATIVE flag is updated according to the new value.

void uri_host_set(uri_t* object, char* value)

set the host field to value. The URI_INFO_RELATIVE flag is updated according to the new value.

void uri_params_set(uri_t* object, char* value)

set the params field to value.

void uri_query_set(uri_t* object, char* value)

set the query field to value.

void uri_user_set(uri_t* object, char* value)

set the user field to value.

void uri_passwd_set(uri_t* object, char* value)

set the passwd field to value.

void uri_copy(uri_t* to, uri_t* from)

copy the content of object from into object to.

uri_t* uri_clone(uri_t* from)

creates a new object containing the same data as from. The returned object must be freed using uri_free.

void uri_clear(uri_t* object)

clear all information contained in object.

void uri_set_root(const char* root)

Set the path that uri_furi will prepend to the FURI. By default it is the empty string.

const char* uri_get_root()

Returns the path set by uri_set_root, or the empty string if none was set.

char* uri_furi(uri_t* object)

returns a string containing the FURI (File equivalent of an URI) built from object. The returned string is statically allocated and must not be freed.

char* uri_uri(uri_t* object)

returns a string containing the URI built from object. The returned string is statically allocated and must not be freed.

void uri_string(uri_t* object, char** stringp, int* string_sizep, int flags)

Build a string representation of object in stringp according to flags. The possible values of flags are described under the uri_cannonicalize_string function. Upon return, the pointer referenced by stringp points to a buffer of string_sizep bytes allocated with malloc. If that pointer is not null on entry, it must point to a buffer allocated with malloc, which is reallocated to fit the needs of the string conversion. This function is the backend of all object to string translation functions.

char* uri_escape(char* string, char* range)

return a statically allocated copy of string with all characters found in the range string transformed into escaped form (%xx). A few examples of the range argument are defined: URI_ESCAPE_RESERVED, URI_ESCAPE_PATH, URI_ESCAPE_QUERY, and uri_escape_unsafe.

char* uri_unescape(char* string)

return a statically allocated copy of string with all escape sequences (%xx) transformed to characters.
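
The %xx transformation done by uri_escape and uri_unescape can be sketched with two small functions. They are hypothetical helpers, not the code of the library: they write into a buffer supplied by the caller instead of a static one, they emit upper case hexadecimal digits, and they copy a malformed escape sequence unchanged. These three points are assumptions of the sketch.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: write in to out, escaping as %xx every character
 * found in range. Returns 0 on success, -1 if out is too small. */
int percent_escape(const char* in, const char* range,
                   char* out, size_t out_size)
{
    size_t o = 0;

    if (out_size == 0)
        return -1;
    for (; *in != '\0'; in++) {
        if (strchr(range, *in) != NULL) {
            /* Three characters plus the terminating null must fit. */
            if (o + 3 >= out_size)
                return -1;
            snprintf(out + o, 4, "%%%02X", (unsigned)(unsigned char)*in);
            o += 3;
        } else {
            if (o + 1 >= out_size)
                return -1;
            out[o++] = *in;
        }
    }
    out[o] = '\0';
    return 0;
}

/* Hypothetical helper: write in to out, replacing each %xx sequence by
 * the character it encodes. Returns 0 on success, -1 if out is too small. */
int percent_unescape(const char* in, char* out, size_t out_size)
{
    size_t o = 0;

    if (out_size == 0)
        return -1;
    while (*in != '\0') {
        if (o + 1 >= out_size)
            return -1;
        if (in[0] == '%' && isxdigit((unsigned char)in[1])
                         && isxdigit((unsigned char)in[2])) {
            char hex[3] = { in[1], in[2], '\0' };
            out[o++] = (char)strtol(hex, NULL, 16);
            in += 3;
        } else {
            out[o++] = *in++;
        }
    }
    out[o] = '\0';
    return 0;
}
```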

char* uri_cannonicalize_string(char* uri, int uri_length, int flag)

returns the canonical form of the uri given in argument. The canonical form is formatted according to the value of flag. Values of flag are bits that can be OR-ed together.

URI_STRING_FURI_STYLE return a FURI.

URI_STRING_URI_STYLE return a URI.

URI_STRING_ROBOTS_STYLE return the corresponding robots.txt URI.

URI_STRING_URI_NOHASH_STYLE do not include the frag in the returned string.

Returns 0 if uri is malformed.

uri_t* uri_cannonical(uri_t* object)

returns an object containing the canonical form of object. If the URI_MODE_CANNONICAL flag is set, the object itself is returned.

int uri_consistent(uri_t* object)

Returns 0 if object contains an unparsable URL, returns != 0 if object contains a well formed URL. Must be called after a set of field changes to reset the flags and ensure that the modified URL is well formed.

HTTP FUNCTIONS

char* uri_robots(uri_t* object)

returns a string containing the URI of the robots.txt file corresponding to the URI contained in object. For instance, if the URI contained in object is http://www.foo.com/dir/dir/file.html the returned string will be http://www.foo.com/robots.txt. The returned string is statically allocated and must not be freed.
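
The transformation can be sketched on a plain string. The function below is a hypothetical helper, not the code of the library: it works on the URI string rather than on a uri_t object, writes into a buffer supplied by the caller, and assumes an absolute URI containing ://.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the uri library): keep the
 * scheme://netloc prefix of uri and append /robots.txt.
 * Returns 0 on success, -1 if uri has no :// or out is too small.
 */
int robots_url(const char* uri, char* out, size_t out_size)
{
    const char* sep = strstr(uri, "://");
    const char* end;
    int n;

    if (sep == NULL)
        return -1;
    /* The netloc ends at the first slash after ://, or at end of string. */
    end = strchr(sep + 3, '/');
    if (end == NULL)
        end = uri + strlen(uri);
    n = snprintf(out, out_size, "%.*s/robots.txt", (int)(end - uri), uri);
    return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}
```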

CANONICAL FORM

The canonical form of a URI is an arbitrary choice to code all the possible variations of the same URI in one string. For instance http://www.foo.com/abc"def.html will be transformed to http://www.foo.com/abc%22def.html. Most of the transformations follow the instructions found in draft-fielding-uri-syntax-04 but some of them don't.

Additionally, when the path of the URI contains dots and double dots, it is reduced. For instance http://www.foo.com/dir/.././file.html will be transformed to http://www.foo.com/file.html.
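
The reduction of dots and double dots can be sketched as below. This is a simplified illustration, not the code of the library: reduce_dots is a hypothetical helper that works in place on the path part only, and it assumes an absolute path starting with a /.

```c
#include <string.h>

/*
 * Hypothetical helper (not part of the uri library): remove "." segments
 * and resolve ".." segments of an absolute path, in place.
 * A ".." found at the root is dropped. Paths that do not start with a
 * slash are left untouched.
 */
void reduce_dots(char* path)
{
    char* in = path;   /* next unread segment, points at its slash */
    char* out = path;  /* end of the reduced path built so far */

    if (path[0] != '/')
        return;
    while (*in != '\0') {
        char* seg = in + 1;
        size_t len = strcspn(seg, "/");

        if (len == 1 && seg[0] == '.') {
            /* "." : nothing to keep. */
        } else if (len == 2 && seg[0] == '.' && seg[1] == '.') {
            /* ".." : drop the last segment kept so far, if any. */
            while (out > path && *--out != '/')
                ;
        } else {
            /* Ordinary segment: keep it with its leading slash. */
            memmove(out, in, len + 1);
            out += len + 1;
        }
        in = seg + len;
    }
    if (out == path)
        *out++ = '/';
    *out = '\0';
}
```

Working in place is possible because the reduced path is never longer than the original one.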

If the URI_MODE_CANNONICAL flag is set, the uri_t object always contains the canonical form of the URL. The original form is lost.

If the URI_MODE_CANNONICAL flag is not set, the canonical form of the URI is stored in a separate object. The uri_t object contains the original form of the URI. It takes more memory to store but may be useful in some situations.

ERROR HANDLING

When an error occurs (the URI cannot be canonicalized or parsed, for instance), the global variable uri_errstr contains the full text of the error message. This variable is never reset by the library functions if no error occurs.

Additionally, the error string may be printed on the error channel (stderr) if the URI_MODE_ERROR_STDERR flag is set. This is the default.

STRICTNESS

The draft describing URI syntax (draft-fielding-uri-syntax-04) specifies that a URI of the type http:g may be interpreted in two different ways. If the URI_MODE_URI_STRICT flag is set, the library interprets it as an absolute URI, otherwise it is interpreted as a relative URI.

If URI_MODE_URI_STRICT is not set, URI_MODE_URI_STRICT_SCHEME may be set so that a relative URI containing a scheme is interpreted as an absolute URI only if the scheme is different from the scheme of the base URI.

FURI

It is sometimes convenient to convert a URI into a path name. Some functions of the uri library provide such a conversion (uri_furi for instance). These path names are called FURI (File equivalent of an URI) for short. Here is a description of the transformation.



    http://www.ina.fr:700/imagina/index.html#queau
      |    \____________/\_________________/\____/
      |          |                |           lost
      |          |                |
      v          v                v
    http/www.ina.fr:700/imagina/index.html
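
The transformation shown above can be sketched on a plain string. The function below is a hypothetical helper, not the code of uri_furi: it does not canonicalize the URI, ignores the root set by uri_set_root and the URI_MODE_FURI_MD5 mode, and assumes an absolute URI containing ://.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the uri library): turn
 * scheme://rest#frag into scheme/rest. The fragment is lost.
 * Returns 0 on success, -1 if uri has no :// or out is too small.
 */
int furi_sketch(const char* uri, char* out, size_t out_size)
{
    const char* sep = strstr(uri, "://");
    const char* rest;
    int n;

    if (sep == NULL)
        return -1;
    rest = sep + 3;
    /* Copy the scheme, a slash, then everything up to the fragment. */
    n = snprintf(out, out_size, "%.*s/%.*s",
                 (int)(sep - uri), uri,
                 (int)strcspn(rest, "#"), rest);
    return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}
```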

EXAMPLES

Show the canonical form of a URI

    char* uri = "http://www.foo.com/";
    uri = uri_cannonicalize_string(uri, strlen(uri), URI_STRING_URI_STYLE);
    if(uri) printf("uri = %s\n", uri);

Show the host and port of URI (netloc)

    char* uri = "http://www.foo.com:7000/";
    /* The variable must not be named uri_object: that would hide the function. */
    uri_t* object = uri_object(uri, strlen(uri));
    if(object) printf("netloc = %s\n", uri_netloc(object));

Change the query part of URI and show it

    char* uri = "http://www.foo.com/cgi-bin/bar?param=1";
    /* The variable must not be named uri_object: that would hide the function. */
    uri_t* object = uri_object(uri, strlen(uri));
    if(object) {
        uri_query_set(object, "param=2");
        printf("uri = %s\n", uri_uri(object));
    }

ADDING NEW SCHEMES

Add the name of the scheme to the SCHEMES file. If nothing else is done, this binds the scheme to a generic parser that follows the URI parsing rules. If you want to define specific behaviour for this scheme, mimic the uri_scheme_http.c file and recompile. If gperf(1) complains because it has conflicts, you will have to experiment with the -k option in order to find a working range that does not conflict and takes as little space as possible.

AUTHOR

Loic Dachary loic@senga.org

SEE ALSO

draft-fielding-uri-syntax-04

 
Copyright (C) 2002 Loic Dachary, 12 bd Magenta, 75010 Paris, France
Verbatim copying and distribution of this entire article is permitted in any medium, provided this notice is preserved.