Portable C string and wide string replacement functions: repl_str and repl

Portable C string and wide string replacement functions: repl_str and repl_wcs

The standard C library doesn't include a string replacement function, which it might have called strrep, strrepl, strreplace or even, less likely, str_replace. The Standard requires that we do not use any of those names for our user-space implementation, because they impose on the reserved function namespace, str*. I've chosen instead the name repl_str. The purpose of this function is to replace in the input string all occurrences of a source/search ("from") substring with a target/replace ("to") substring, and to then return a post-replacement string, doing all this as efficiently as is possible without specific knowledge of the nature of the strings.

For those familiar with PHP, this function is a C equivalent for PHP's str_replace function, except for its order of parameters, its inability to accept array parameters, the possibility that it will return NULL, and the need to free the memory of the char pointer that it returns.

The ordinary-string (char *) function: repl_str

Description:	Replaces in the string `str` all the occurrences of the source string `from` with the destination string `to`. The lengths of the strings `from` and `to` may differ. The string `to` may be of any length, but the string `from` must be of non-zero length - the penalty for providing an empty string for the `from` parameter is an infinite loop. In addition, none of the three parameters may be NULL.
Returns:	The post-replacement string, or `NULL` if memory for the new string could not be allocated. Does not modify the original string. The memory for the returned post-replacement string may be deallocated with the standard library function `free` when it is no longer required.
Dependencies:	For this function to compile, you will need to also `#include` the following files: `<string.h>`, `<stdlib.h>` and `<stddef.h>.`
Portability:	Portable to C compilers implementing the ISO C Standard, any version i.e. from C89/C90 onwards. It is possible to compile this function as C++ code with the addition of casts to the returns of the `realloc` and `malloc` functions. You can click the "C++ify" link to add these casts (Javascript), or add them yourself (if your browser doesn't support Javascript: the correct types are `(ptrdiff_t )` for `realloc`'s return, and `(char )` for `malloc`'s return). Note: I am not knowledgeable about C++, so cannot say whether or not this is valid C++ code; all I know is that a reader has contacted me with this advice, having successfully compiled the code using g++ after adding these casts.
Author:	Me (Laird Shaw).
Licence:	Public domain.
Attribution:	Optional. If you choose to indicate attribution when using this function, feel free to link to this page.

C++ify the code

#include <string.h>
#include <stdlib.h>
#include <stddef.h>

#if (__STDC_VERSION__ >= 199901L)
#include <stdint.h>
#endif

char *repl_str(const char *str, const char *from, const char *to) {

	/* Adjust each of the below values to suit your needs. */

	/* Increment positions cache size initially by this number. */
	size_t cache_sz_inc = 16;
	/* Thereafter, each time capacity needs to be increased,
	 * multiply the increment by this factor. */
	const size_t cache_sz_inc_factor = 3;
	/* But never increment capacity by more than this number. */
	const size_t cache_sz_inc_max = 1048576;

	char *pret, *ret = NULL;
	const char *pstr2, *pstr = str;
	size_t i, count = 0;
	#if (__STDC_VERSION__ >= 199901L)
	uintptr_t *pos_cache_tmp, *pos_cache = NULL;
	#else
	ptrdiff_t *pos_cache_tmp, *pos_cache = NULL;
	#endif
	size_t cache_sz = 0;
	size_t cpylen, orglen, retlen, tolen, fromlen = strlen(from);

	/* Find all matches and cache their positions. */
	while ((pstr2 = strstr(pstr, from)) != NULL) {
		count++;

		/* Increase the cache size when necessary. */
		if (cache_sz < count) {
			cache_sz += cache_sz_inc;
			pos_cache_tmp = (ptrdiff_t *)realloc(pos_cache, sizeof(*pos_cache) * cache_sz);
			if (pos_cache_tmp == NULL) {
				goto end_repl_str;
			} else pos_cache = pos_cache_tmp;
			cache_sz_inc *= cache_sz_inc_factor;
			if (cache_sz_inc > cache_sz_inc_max) {
				cache_sz_inc = cache_sz_inc_max;
			}
		}

		pos_cache[count-1] = pstr2 - str;
		pstr = pstr2 + fromlen;
	}

	orglen = pstr - str + strlen(pstr);

	/* Allocate memory for the post-replacement string. */
	if (count > 0) {
		tolen = strlen(to);
		retlen = orglen + (tolen - fromlen) * count;
	} else	retlen = orglen;
	ret = (char *)malloc(retlen + 1);
	if (ret == NULL) {
		goto end_repl_str;
	}

	if (count == 0) {
		/* If no matches, then just duplicate the string. */
		strcpy(ret, str);
	} else {
		/* Otherwise, duplicate the string whilst performing
		 * the replacements using the position cache. */
		pret = ret;
		memcpy(pret, str, pos_cache[0]);
		pret += pos_cache[0];
		for (i = 0; i < count; i++) {
			memcpy(pret, to, tolen);
			pret += tolen;
			pstr = str + pos_cache[i] + fromlen;
			cpylen = (i == count-1 ? orglen : pos_cache[i+1]) - pos_cache[i] - fromlen;
			memcpy(pret, pstr, cpylen);
			pret += cpylen;
		}
		ret[retlen] = '\0';
	}

end_repl_str:
	/* Free the cache and return the post-replacement string,
	 * which will be NULL in the event of an error. */
	free(pos_cache);
	return ret;
}

Performance

In case you are curious enough to want to compare this implementation with alternative implementations, you are welcome to compile and run the code in the benchmarking file, replacebench.c. In that file, the above function is included as replace_cache, and the functions originally published on this page as replace_str2 and replace_str are included as replace_opt2 and replace_opt. Code/functions are included from the comp.lang.c thread, how to replace a substring in a string using C?, from answers to the stackoverflow question, What is the function to replace string in C?, and also from private correspondence. See the comments top of file for instructions on compiling and running it.

In most scenarios, the fastest replacement function, by around 10-20%, is Albert Chan's replace_smart, followed by the above function: repl_str aka replace_cache. There are some scenarios, though, where repl_str is faster than replace_smart, sometimes by up to 200%. These scenarios involve long strings with few matches. Why, if Albert's function is generally slightly faster than the above repl_str function, is it not the focus of this page? Because it generally uses much more memory than repl_str.

The third fastest implementation is typically replace_str2 aka replace_opt2. For longer strings in the case in which the lengths of the "from" and "to" strings differ, repl_str aka replace_cache beats it by margins of up to about 80%. For smaller strings, and in the case where the lengths of the "from" and "to" strings are identical, replace_str2 aka replace_opt2 is faster, by a maximum margin of about 35%, sometimes in those scenarios beating replace_smart too. Some of the other functions are also faster for smaller strings. The even-match point between replace_str2 and repl_str (assuming "from" and "to" string lengths differ) depends on how far into the string the final match occurs, how many matches there are, and the comparative lengths of the old and "to" strings, but roughly it occurs for strings of 700-800 bytes in length.

This analysis is based on compiling with GCC and testing on a 64-bit Intel platform running Linux, however brief testing with Microsoft Visual C++ 2010 Express (scroll down to "Additional information" at that link) on Windows 7 seemed to produce similar results.

The wide string (wchar_t *) variant: repl_wcs

The function description is as above except under "Dependencies", replace <string.h> with <wchar.h>, and, if compiling as C++ code, the cast to add for the return of malloc should be (wchar_t *) (this can be inserted automatically again by clicking the "C++ify" link if your browser supports Javascript). The function itself is also identical except with ordinary-string functions replaced by wide string functions. Given an appropriate locale, this function can be used to perform replacements on Unicode strings, including those originally encoded in UTF-8. An example usage is provided in the wreplacebench.c file.

C++ify the code

#include <wchar.h>
#include <stdlib.h>
#include <stddef.h>

#if (__STDC_VERSION__ >= 199901L)
#include <stdint.h>
#endif

wchar_t *repl_wcs(const wchar_t *str, const wchar_t *from, const wchar_t *to) {

	/* Adjust each of the below values to suit your needs. */

	/* Increment positions cache size initially by this number. */
	size_t cache_sz_inc = 16;
	/* Thereafter, each time capacity needs to be increased,
	 * multiply the increment by this factor. */
	const size_t cache_sz_inc_factor = 3;
	/* But never increment capacity by more than this number. */
	const size_t cache_sz_inc_max = 1048576;

	wchar_t *pret, *ret = NULL;
	const wchar_t *pstr2, *pstr = str;
	size_t i, count = 0;
	#if (__STDC_VERSION__ >= 199901L)
	uintptr_t *pos_cache_tmp, *pos_cache = NULL;
	#else
	ptrdiff_t *pos_cache_tmp, *pos_cache = NULL;
	#endif
	size_t cache_sz = 0;
	size_t cpylen, orglen, retlen, tolen, fromlen = wcslen(from);

	/* Find all matches and cache their positions. */
	while ((pstr2 = wcsstr(pstr, from)) != NULL) {
		count++;

		/* Increase the cache size when necessary. */
		if (cache_sz < count) {
			cache_sz += cache_sz_inc;
			pos_cache_tmp = (ptrdiff_t *)realloc(pos_cache, sizeof(*pos_cache) * cache_sz);
			if (pos_cache_tmp == NULL) {
				goto end_repl_wcs;
			} else pos_cache = pos_cache_tmp;
			cache_sz_inc *= cache_sz_inc_factor;
			if (cache_sz_inc > cache_sz_inc_max) {
				cache_sz_inc = cache_sz_inc_max;
			}
		}

		pos_cache[count-1] = pstr2 - str;
		pstr = pstr2 + fromlen;
	}

	orglen = pstr - str + wcslen(pstr);

	/* Allocate memory for the post-replacement string. */
	if (count > 0) {
		tolen = wcslen(to);
		retlen = orglen + (tolen - fromlen) * count;
	} else	retlen = orglen;
	ret = (wchar_t *)malloc((retlen + 1) * sizeof(wchar_t));
	if (ret == NULL) {
		goto end_repl_wcs;
	}

	if (count == 0) {
		/* If no matches, then just duplicate the string. */
		wcscpy(ret, str);
	} else {
		/* Otherwise, duplicate the string whilst performing
		 * the replacements using the position cache. */
		pret = ret;
		wmemcpy(pret, str, pos_cache[0]);
		pret += pos_cache[0];
		for (i = 0; i < count; i++) {
			wmemcpy(pret, to, tolen);
			pret += tolen;
			pstr = str + pos_cache[i] + fromlen;
			cpylen = (i == count-1 ? orglen : pos_cache[i+1]) - pos_cache[i] - fromlen;
			wmemcpy(pret, pstr, cpylen);
			pret += cpylen;
		}
		ret[retlen] = L'\0';
	}

end_repl_wcs:
	/* Free the cache and return the post-replacement string,
	 * which will be NULL in the event of an error. */
	free(pos_cache);
	return ret;
}

Case-insensitive matching

Both functions above match case-sensitively. If you wish to match case-insensitively, then you will need to replace strstr and wcsstr with case-insensitive versions. The Standard, unfortunately, does not provide these. Many Linux/UNIX systems, however, provide strcasestr as a case-insensitive version of strstr. If you need your code to work on a system that does not provide strcasestr, and you either will not distribute your application, or are able to distribute it under the GPL, then you could adapt the glibc implementation, in the files strcasestr.c and str-two-way.h. Otherwise, you will need to roll your own or find a variant under a licence compatible with your own code.

There does not seem to be an equivalent wide string version even on Linux/UNIX systems, so you will either need to roll your own or perform a web search for wcsistr and adapt any code you find under a licence compatible with your own code.

Changelog

03 March 2018
- Fixed the broken links in the syntax highlighting of the source code.
- Updated the links to strcasestr.c and str-two-way.h.
12 December 2016
- Moved this changelog to the bottom of the page from a messy series of "update" notices at the top.
10 September 2016
- Fixed a potential realloc memory leak.
- Fixed so gcc doesn't issue a warning when compiling under the C99 standard or higher.
28 February 2016
- Added a note on C++ compatibility, and the "C++ify" links. Renamed the function parameters from old and new to from and to respectively, because "new" is a reserved word in C++.
- Noted under Performance that Albert's function, whilst generally fastest, also generally uses more memory than repl_str.
07 May 2015
- Updated the initialisation of cache_sz_inc_factor from 2 to 3, which seems to be optimal.
21 April 2015
- Added a wide string version
- Added a section on case-insensitive matching.
02 March 2015
- Having realised that I can't be 100% sure of the copyright of the functions I had originally included on this page, due to realising that they were largely based on someone else's code as posted to the comp.lang.c newsgroup, and having discovered some old code of my own lying around which used a slightly different approach - that of caching - I decided to clean up that old code and publish it here instead of keeping the functions of more questionable copyright up. You can still find those old functions in the benchmarking code linked to above.