Next: , Previous: edit_distance, Up: Top   [Contents]


24 edit_seq

%--------------------------------------------------%
% vim: ft=mercury ts=4 sw=4 et
%--------------------------------------------------%
% Copyright (C) 2019-2020 The Mercury team.
% This file is distributed under the terms specified in COPYING.LIB.
%--------------------------------------------------%
%
% File: edit_seq.m.
% Stability: medium.
%
% This module finds an edit sequence, which means that given two sequences
% of items, it finds the shortest sequence of edit operations (deletes,
% inserts and/or replaces) that will transform the first sequence
% into the second.
%
% The code is a naive implementation of the Wagner-Fischer algorithm,
% which is documented on its own wikipedia page.
%
% Given two lists of length M and N, its time complexity is O(MN),
% so it is suitable for use only on reasonably short lists.
%
% The operations in this module are intended to generate diffs to be displayed.
% Since diff traditionally has no way to display a transposition as
% anything other than an insertion/deletion pair, this module does not
% consider transpositions to be a separate kind of operation, which means
% that the distances it computes are Levenshtein distances.
% If you are after Damerau-Levenshtein distances, which *do* consider
% transpositions to be separate operations with their own costs,
% or if you want to know which of several candidate sequences is closest
% to a specific query sequence, then have a look at the edit_distance module.
%
%--------------------------------------------------%
%--------------------------------------------------%

:- module edit_seq.
:- interface.

:- import_module list.

%--------------------------------------------------%
%--------------------------------------------------%

    % Given two item sequences A and B, the edit sequence is the sequence
    % of edit operations that transforms sequence A into sequence B.
    %
    % Item numbers start at 1. The item numbers in edit operations reflect
    % the *original* position of the relevant item, i.e. they are not affected
    % by any edit operations that take place before that position.
    %
:- type edit_seq(T) == list(edit(T)).
:- type edit(T)
    --->    delete(int)
            % Delete item #N in sequence A.

    ;       insert(int, T)
            % Insert the given item from sequence B
            % after item #N in sequence A.

    ;       replace(int, T).
            % Replace item #N in sequence A with the given item
            % from sequence B.

:- type edit_params
    --->    edit_params(
                % The cost of delete, insert and replace operations
                % respectively. Only the *relative* values of the costs matter;
                % if these are fixed, their *absolute* values are irrelevant
                % (unless they are so high that they cause arithmetic
                % overflows).
                cost_of_delete      :: int,
                cost_of_insert      :: int,
                cost_of_replace     :: int
            ).

    % find_shortest_edit_seq(Params, SeqA, SeqB, Edits):
    %
    % Compute Edits as the cheapest sequence of edit operations
    % that will transform SeqA into SeqB, where the cost of each kind of
    % edit operation is specified by Params.
    %
:- pred find_shortest_edit_seq(edit_params::in, list(T)::in, list(T)::in,
    edit_seq(T)::out) is det.

%--------------------------------------------------%

    % A diff_seq represents a unified diff with unlimited context,
    % such as the output of "diff -u --context=MAXINT".
    %
    % Each line (or in general, one item) in it can be an item from SeqA
    % that is left unchanged, an item from SeqA that is to be deleted, or
    % an item (from SeqB) that is to be inserted.
:- type diff_seq(T) == list(diff(T)).
:- type diff(T)
    --->    unchanged(T)
    ;       deleted(T)
    ;       inserted(T).

    % Given an edit sequence computed by find_shortest_edit_seq, return
    % the unified diff representing that edit sequence.
    %
    % The main difference between the edit sequence and the diff sequence
    % is that given several consecutive replace edits, a naive representation
    % of those edit operations would output interleaved pairs of items
    % to be deleted and inserted, while the diff sequence would output
    % *all* of the items to be deleted by those replace operations *before*
    % printing the insertions of their replacements.
    %
:- pred find_diff_seq(list(T)::in, edit_seq(T)::in, diff_seq(T)::out) is det.

%--------------------------------------------------%

    % This type and its fields are documented below.
:- type change_hunk(T)
    --->    change_hunk(
                ch_seq_a_start      :: int,
                ch_seq_a_length     :: int,
                ch_seq_b_start      :: int,
                ch_seq_b_length     :: int,
                ch_diff             :: diff_seq(T)
            ).

    % find_change_hunks(ContextSize, DiffSeq, ChangeHunks):
    %
    % A diff_seq may contain long sequences of unchanged items, which are
    % often not of interest. This predicate computes from a diff sequence
    % a list of its *change hunks*, which are its interesting parts,
    % the parts that contain insertions and/or deletions.
    %
    % A change hunk looks like this, using the syntax of "diff -u".
    % The ContextSize of this example is 3.
    %
    %   @@ -25,6 +25,7 @@
    %    Roosevelt
    %    Taft
    %    Wilson
    %   +Pershing
    %    Harding
    %    Coolidge
    %    Hoover
    %
    % This change hunk shows the insertion of one line containing "Pershing"
    % into a list of US presidents. The "-25,6" part of the header shows that
    % the part of the original sequence (sequence A) covered by this change
    % hunk contains six lines, starting at line 25. The "+25,7" part shows that
    % the part of the updated sequence (sequence B) contains seven lines,
    % starting at line at 25 in that sequence as well. The first four fields
    % of the change_hunk type contain these two pairs of numbers.
    %
    % A change hunk consists of three parts, of which the first and/or last
    % may be empty.
    %
    % - The first part is a sequence of up to ContextSize unchanged items
    %   (the initial context).
    % - The second part is a sequence of unchanged, insertion or deletion
    %   items that
    %       * starts with an insertion or deletion item,
    %       * ends with an insertion or deletion item, and
    %       * contains at most 2 * ContextSize consecutive unchanged items.
    %   The start and end item may be the same, as in the example above.
    % - The third part is a sequence of up to ContextSize unchanged items
    %   (the trailing context).
    %
    % The idea is to surround regions of changes with ContextSize unchanged
    % items to provide context (hence the name ContextSize). The first and
    % third parts will always contain *exactly* ContextSize unchanged items,
    % unless the changed region occurs so close to the start or to the end
    % of the item sequence that there are fewer than ContextSize unchanged
    % items there.
    %
    % The reason why there may be up to 2 * ContextSize consecutive unchanged
    % items in the middle of a change hunk is that if the limit were any lower,
    % then some of those unchanged items would end up *both* in the trailing
    % context of one change hunk and the initial context of the next change
    % hunk.
    %
    % To make sense, ContextSize must be least one. This predicate throws
    % an exception if ContextSize is zero or negative.
    %
:- pred find_change_hunks(int::in, diff_seq(T)::in,
    list(change_hunk(T))::out) is det.

%--------------------------------------------------%
%--------------------------------------------------%


Next: , Previous: edit_distance, Up: Top   [Contents]