
author      Simon Chabot <simon.chabot@logilab.fr>
date        Mon, 19 Nov 2012 17:13:57 +0100
changeset   151   ad4632f68727
parent      150   63f3a25ed241
child       152   d4556b2354be

[doc] Some corrections

--- a/doc.rst	Mon Nov 19 16:53:08 2012 +0100
+++ b/doc.rst	Mon Nov 19 17:13:57 2012 +0100
@@ -19,18 +19,17 @@
 1. Gather and format the data we want to align.
 
    In this step, we define two sets that we call the ``alignset`` and the
-   ``targetset``. The ``alignset`` is the set containing our data (in this case, the
-   Goncourt prize winners), and the ``targetset`` contains the data on which we would
-   like to make the links (dbpedia in this case)
-2. Compute the similarity between the items gathered
+   ``targetset``. The ``alignset`` is the set containing our data, and the
+   ``targetset`` contains the data on which we would like to make the links.
+2. Compute the similarity between the items gathered.
 
    We compute a distance matrix between the two sets according a given distance.
-3. Find the items having a high similarity thank to the distance matrix
+3. Find the items having a high similarity thanks to the distance matrix.
 
 Simple case
 ^^^^^^^^^^^
 
-Let's defining our ``alignset`` and our ``targetset`` as simple python
+Let's defining ``alignset`` and ``targetset`` as simple python
 lists.
 
 .. code-block:: python
@@ -41,14 +40,14 @@
 Now, we have to compute the similarity between each items. For that purpose, the
 `Levehshtein's distance <http://en.wikipedia.org/wiki/Levenshtein_distance>`_
 [#]_, which is well accurate to compute the distance between few words, is used.
+Such a function is provided in ``alignment.distance`` module.
 
 .. [#] Also called the *edit distance*, because the distance between two words
-   is equal to the number of single-character edits required to change one
+       is equal to the number of single-character edits required to change one
        word into the other.
 
-Such a function is provided in ``alignment.distance`` module. The next step
-is to compute the distance matrix according to the Levenshtein distance. The
-result is given in the following tables.
+The next step is to compute the distance matrix according to the Levenshtein
+distance. The result is given in the following tables.
 
 +--------------+--------------+-----------------------+-------------+
@@ -66,7 +65,7 @@
 ^^^^^^^^^^^^^^^^^^
 
 The previous case was simple, because we had only one *thing* to align (the
-name), but it is frequent to have a lot of *thing* to align, such as the name
+name), but it is frequent to have a lot of *things* to align, such as the name
 and the birth date and the birth city. The steps remains the same, except that
 three distance matrices will be computed, and *items* will be represented as
 nested lists. See the following example:
@@ -175,7 +174,7 @@
 | 1906#Jérôme et Jean Tharaud#Dingley, l'illustre écrivain (Cahiers de la Quinzaine)
 
 When using the high-level functions of this library, each item must have at
-least two elements : an *identifier* (the name, or the URI) and the *thing* to
+least two elements: an *identifier* (the name, or the URI) and the *thing* to
 compare. With the previous file, we will use the name (so the column number 1)
 as *identifier* and *thing* to align. This is told to python thanks to the
 following code:
@@ -227,11 +226,11 @@
 ``True`` if at least one matching has been done, ``False`` otherwise.
 
 It may be important to apply some pre-treatment on the data to align. For
-instance, names can be written with extra characters as punctuation or
-unwanted information in parenthesis and so on. That is why we provide some
-functions to `normalize` your data. The most useful may be the `simplify()` one
-(see the docstring for more information). So the treatments list can be given as
-follow:
+instance, names can be written with lower or upper charater, with extra
+characters as punctuation or unwanted information in parenthesis and so on. That
+is why we provide some functions to `normalize` your data. The most useful may
+be the `simplify()` one (see the docstring for more information). So the
+treatments list can be given as follow:
 
 .. code-block:: python
@@ -295,9 +294,12 @@
 Next, we define the treatments to apply. It is the same as before, but we ask
 for a non-normalized matrix (ie: the real output of the levenshtein distance).
 Finally, we call the ``alignall`` function. ``indexes`` is a tuple saying the
-position of the point on which build the kdtree, ``mode`` is the mode used to
-find neighbours [#]_, ``uniq`` ask to the function to return the best candidate
-(ie: the one having the shortest distance above the given threshold)
+position of the point on which the kdtree must be built, ``mode`` is the mode
+used to find neighbours [#]_, ``uniq`` ask to the function to return the best
+candidate (ie: the one having the shortest distance above the given threshold)
 
 .. [#] The available modes are ``kdtree``, ``kmeans`` and ``minibatch`` for
    numerical data and ``minhashing`` for text one.
+
+The function output a generator yielding tuples where the first element is the
+identifier of the ``alignset`` item and the second is the ``targetset`` one.
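
The patched documentation describes two core steps: defining an ``alignset``
and a ``targetset``, then computing a Levenshtein distance matrix between them.
A minimal standard-library sketch of those steps follows; ``levenshtein`` and
``distance_matrix`` are illustrative helpers written here for clarity, not the
actual ``alignment.distance`` API shipped by the library.

```python
def levenshtein(a, b):
    """Edit distance: number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def distance_matrix(alignset, targetset):
    """Pairwise distances between the two sets (step 2 of the doc)."""
    return [[levenshtein(a, t) for t in targetset] for a in alignset]

# Hypothetical sample data, standing in for the sets used in the doc.
alignset = ['John Doe', 'Jane Doe']
targetset = ['John Doe', 'Jhon Doe', 'Jane Smith']
matrix = distance_matrix(alignset, targetset)
```

The matching step described later (``alignall`` with a threshold) then amounts
to scanning this matrix for cells below a chosen distance cutoff.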