Next: enum, Previous: edit_distance, Up: Top [Contents]
%--------------------------------------------------% % vim: ft=mercury ts=4 sw=4 et %--------------------------------------------------% % Copyright (C) 2019-2020, 2023-2024 The Mercury team. % This file is distributed under the terms specified in COPYING.LIB. %--------------------------------------------------% % % File: edit_seq.m. % Stability: medium. % % This module finds an edit sequence, which means that given two sequences % of items, it finds the shortest sequence of edit operations (deletes, % inserts and/or replaces) that will transform the first sequence % into the second. % % The code is a naive implementation of the Wagner-Fischer algorithm, % which is documented on its own wikipedia page. % % Given two lists of length M and N, its time complexity is O(MN), % so it is suitable for use only on reasonably short lists. % % The operations in this module are intended to generate diffs to be displayed. % Since diff traditionally has no way to display a transposition as % anything other than an insertion/deletion pair, this module does not % consider transpositions to be a separate kind of operation, which means % that the distances it computes are Levenshtein distances. % If you are after Damerau-Levenshtein distances, which *do* consider % transpositions to be separate operations with their own costs, % or if you want to know which of several candidate sequences is closest % to a specific query sequence, then have a look at the edit_distance module. % %--------------------------------------------------% %--------------------------------------------------% :- module edit_seq. :- interface. :- import_module list. %--------------------------------------------------% %--------------------------------------------------% % Given two item sequences A and B, the edit sequence is the sequence % of edit operations that transforms sequence A into sequence B. % % Item numbers start at 1. The item numbers in edit operations reflect % the *original* position of the relevant item, i.e. they are not affected % by any edit operations that take place before that position. % :- type edit_seq(T) == list(edit(T)). :- type edit(T) ---> delete(int) % Delete item #N in sequence A. ; insert(int, T) % Insert the given item from sequence B % after item #N in sequence A. ; replace(int, T). % Replace item #N in sequence A with the given item % from sequence B. :- type edit_params ---> edit_params( % The cost of delete, insert and replace operations % respectively. Only the *relative* values of the costs matter; % if these are fixed, their *absolute* values are irrelevant % (unless they are so high that they cause arithmetic % overflows). cost_of_delete :: int, cost_of_insert :: int, cost_of_replace :: int ). % find_shortest_edit_seq(Params, SeqA, SeqB, Edits): % % Compute Edits as the cheapest sequence of edit operations % that will transform SeqA into SeqB, where the cost of each kind of % edit operation is specified by Params. % :- pred find_shortest_edit_seq(edit_params::in, list(T)::in, list(T)::in, edit_seq(T)::out) is det. %--------------------------------------------------% % A diff_seq represents a unified diff with unlimited context, % such as the output of "diff -u --context=MAXINT". % % Each line (or in general, one item) in it can be an item from SeqA % that is left unchanged, an item from SeqA that is to be deleted, or % an item (from SeqB) that is to be inserted. :- type diff_seq(T) == list(diff(T)). :- type diff(T) ---> unchanged(T) ; deleted(T) ; inserted(T). % Given an edit sequence computed by find_shortest_edit_seq, return % the unified diff representing that edit sequence. % % The main difference between the edit sequence and the diff sequence % is that given several consecutive replace edits, a naive representation % of those edit operations would output interleaved pairs of items % to be deleted and inserted, while the diff sequence would output % *all* of the items to be deleted by those replace operations *before* % printing the insertions of their replacements. % :- pred find_diff_seq(list(T)::in, edit_seq(T)::in, diff_seq(T)::out) is det. %--------------------------------------------------% % This type and its fields are documented below. :- type change_hunk(T) ---> change_hunk( ch_seq_a_start :: int, ch_seq_a_length :: int, ch_seq_b_start :: int, ch_seq_b_length :: int, ch_diff :: diff_seq(T) ). % find_change_hunks(ContextSize, DiffSeq, ChangeHunks): % % A diff_seq may contain long sequences of unchanged items, which are % often not of interest. This predicate computes from a diff sequence % a list of its *change hunks*, which are its interesting parts, % the parts that contain insertions and/or deletions. % % A change hunk looks like this, using the syntax of "diff -u". % The ContextSize of this example is 3. % % @@ -25,6 +25,7 @@ % Roosevelt % Taft % Wilson % +Pershing % Harding % Coolidge % Hoover % % This change hunk shows the insertion of one line containing "Pershing" % into a list of US presidents. The "-25,6" part of the header shows that % the part of the original sequence (sequence A) covered by this change % hunk contains six lines, starting at line 25. The "+25,7" part shows that % the part of the updated sequence (sequence B) contains seven lines, % starting at line at 25 in that sequence as well. The first four fields % of the change_hunk type contain these two pairs of numbers. % % A change hunk consists of three parts, of which the first and/or last % may be empty. % % - The first part is a sequence of up to ContextSize unchanged items % (the initial context). % - The second part is a sequence of unchanged, insertion or deletion % items that % * starts with an insertion or deletion item, % * ends with an insertion or deletion item, and % * contains at most 2 * ContextSize consecutive unchanged items. % The start and end item may be the same, as in the example above. % - The third part is a sequence of up to ContextSize unchanged items % (the trailing context). % % The idea is to surround regions of changes with ContextSize unchanged % items to provide context (hence the name ContextSize). The first and % third parts will always contain *exactly* ContextSize unchanged items, % unless the changed region occurs so close to the start or to the end % of the item sequence that there are fewer than ContextSize unchanged % items there. % % The reason why there may be up to 2 * ContextSize consecutive unchanged % items in the middle of a change hunk is that if the limit were any lower, % then some of those unchanged items would end up *both* in the trailing % context of one change hunk and the initial context of the next change % hunk. % % To make sense, ContextSize must be least one. This predicate throws % an exception if ContextSize is zero or negative. % :- pred find_change_hunks(int::in, diff_seq(T)::in, list(change_hunk(T))::out) is det. %--------------------------------------------------% %--------------------------------------------------%
Next: enum, Previous: edit_distance, Up: Top [Contents]