Testing a draft of `.replaceRegexp`

prko · February 12, 2024, 7:21am

Hello,

I am attaching a draft of `.replaceRegexp`. Could anyone review it before I make a PR?

method:

+ String {
	replaceRegexp { arg findRegexp, replace;
		var founds, replaced;
		founds = this.findRegexp(findRegexp);
		founds = if(findRegexp[0] == $^) {
			founds.collect { |array| if (array[0] == 0) {array} {} }
		} {
			founds
		};
		while { founds.includes(nil) } { founds.remove(nil) };
		founds = founds.asSet.asArray.sort({ |a, b| a[0] < b[0] });
		replaced = this;
		if(founds.size > 0) {
			founds.reverse.do { |idx_str|
				var foundIndex, foundString;
				#foundIndex, foundString = idx_str;
				replaced = if (foundIndex > 0) {
					var lastString = replaced[foundIndex + foundString.size ..];
					lastString = if(lastString != nil) { lastString } { "" };
					replaced[0 .. foundIndex - 1] ++ replace ++ lastString
				} {
					replace ++ replaced[foundString.size ..]
				}
			}
		} {
			replaced
		};
		^replaced
	}
}

Test code:

/* Removal of all single numbers */
"012qW567<>?,. /".replaceRegexp("[0-9]", "") // qW<>?,. /
"012qW567<>?,. /".replaceRegexp("\\d", "")   // qW<>?,. /

/* Removal of two adjacent single numbers */
"012qW567{}|[`~]".replaceRegexp("[0-9]{2}", "") // 2qW7{}|[`~]
"012qW567{}|[`~]".replaceRegexp("\\d{2}", "")   // 2qW7{}|[`~]

/* Removal of three or more adjacent single numbers */
"0q12W345;6789 :'\"\\\(\)".replaceRegexp("\\d{3,}", "")   // 0q12W; :'"\()

/* Removal of three to five adjacent single numbers */
"0q12W345{6789}|12345[123456\\1234567~]".replaceRegexp("\\d{3,5}", "")   // 0q12W{}|[6\67~]

/* Removal of everything other than single numbers */
"123qWe456!@£$%^&*".replaceRegexp("\\D", "")    // 123456
"123qWe456!@£$%^&*".replaceRegexp("[^\\d]", "") // 123456
"123qWe456!@£$%^&*".replaceRegexp("[^0-9]", "") // 123456

/* Removal of the single number at the beginning of the string */
"123qWe456!@£$%^&*".replaceRegexp("^[0-9]", "") // 23qWe456!@£$%^&*
"123qWe456!@£$%^&*".replaceRegexp("^\\d", "")   // 23qWe456!@£$%^&*

/* Removal of the series of numbers at the beginning of the string */
"123qWe456!@£$%^&*".replaceRegexp("^[0-9]+", "") // qWe456!@£$%^&*
"123qWe456!@£$%^&*".replaceRegexp("^\\d+", "")   // qWe456!@£$%^&*

/* Removal of the single number at the end of the string */
"123QWE456rty789".replaceRegexp("\\d$", "")            // 123QWE456rty78

/* Removal of the series of numbers at the end of the string */
"123QWE456rty789".replaceRegexp("\\d+$", "")           // 123QWE456rty

/* Removal of all numbers preceded by a non-number */
":123QWE456rty789".replaceRegexp("(?<=\\D)\\d", "") // :23QWE56rty89

/* Removal of all numbers not preceded by a non-number */
":123QWE456rty789".replaceRegexp("(?<!\\D)\\d", "") // :1QWE4rty7

/* Removal of all numbers followed by a non-number */
"432:123QWE456rty789".replaceRegexp("\\d(?=\\D)", "") // 43:12QWE45rty789

/* Removal of all numbers not followed by a non-number */
"432:123QWE456rty789".replaceRegexp("\\d(?!\\D)", "") // 2:3QWE6rty

/* Removal of all single lowercase letters */
"123qWe456RTY!@£$%^&*".replaceRegexp("[a-z]", "") // 123W456RTY!@£$%^&*
"123qWe456RTY!@£$%^&*".replaceRegexp("\\l", "")   // 123W456RTY!@£$%^&*

/* Removal other than single lowercase letter */
"123qWe456RTY!@£$%^&*".replaceRegexp("\\L", "") // qe*

/* Removal of all single uppercase letters */
"123qWe456RTY!@£$%^&*".replaceRegexp("[A-Z]", "") // 123qe456!@£$%^&*
"123qWe456RTY!@£$%^&*".replaceRegexp("\\u", "")   // 123qe456!@£$%^&*

/* Removal other than single uppercase letter */
"123qWe456RTY!@£$%^&*".replaceRegexp("\\U", "")   // WRTY

VIRTUALDOG · February 12, 2024, 8:38am

It should be a C++ primitive that uses boost.regex’s replace; you could use stripRtf and findRegexp as starting points.

github.com/supercollider/supercollider/blob/db7eed2a17c361503dbc7f70a557874b6001e3cd/lang/LangPrimSource/PyrStringPrim.cpp#L623

          #ifdef _WIN32
                  SetEnvironmentVariable(key, value);
          #else
                  setenv(key, value, 1);
          #endif
              }
          
              return errNone;
          }
          
          int prStripRtf(struct VMGlobals* g, int numArgsPushed) {
              PyrSlot* a = g->sp;
              int len = slotRawObject(a)->size;
              char* chars = (char*)malloc(len + 1);
              memcpy(chars, slotRawString(a)->s, len);
              chars[len] = 0;
              rtf2txt(chars);
          
              PyrString* string = newPyrString(g->gc, chars, 0, false);
              SetObject(a, string);
              free(chars);

github.com/supercollider/supercollider/blob/db7eed2a17c361503dbc7f70a557874b6001e3cd/lang/LangPrimSource/PyrStringPrim.cpp#L414

                  SetObject(array->slots + 1, matched_string);
                  g->gc->GCWriteNew(array, matched_string); // we know matched_string is white so we can use GCWriteNew
              };
          
              --g->sp; // pop the stack back to the receiver slot since we stored result_array there above
              SetObject(a, result_array); // now we can set the result in a
          
              return errNone;
          }
          
          static int prString_FindRegexpAt(struct VMGlobals* g, int numArgsPushed) {
              /* not reentrant */
              static detail::regex_lru_cache regex_lru_cache(boost::regex_constants::ECMAScript);
          
              using namespace boost;
          
              PyrSlot* a = g->sp - 2; // source string
              PyrSlot* b = g->sp - 1; // pattern
              PyrSlot* c = g->sp; // offset
          
              if (!isKindOfSlot(b, class_string) || (NotInt(c))) {

prko · February 12, 2024, 2:24pm

Thank you for your guidance.

In my draft, I basically used the `.findRegexp` method, which seems to be backed by one of the primitives you mentioned, and then processed its results to replace the found strings. As far as I have tested, it seems to be correct and functional. However, I am not sure whether it will work in every case…
I cannot figure out how to use .stripRtf to replace text with a regular expression. If you mean a feature that replaces the text of an RTF file using a regular expression, I think this should be user-side code along the lines of a = File(path, "r"), a.readAllString.stripRTF, a.close, a.replaceRegexp… Of course, it would be convenient to have this as a single method. Is this what you mean?

I have one more question, to avoid confusion: is the C++ primitive only a reference to start from, or is modifying the C++ primitives unavoidably necessary to implement .replaceRegexp? Modifying the C++ primitives is currently beyond my capabilities… Would it not be enough to implement this feature by modifying SC class files, if my draft method works correctly?

VIRTUALDOG · February 12, 2024, 2:43pm

This is unreasonable when, for a complex task, there is a perfectly good library method you can use in C++, which doesn’t require additional review, testing or maintenance. If you are unable to write the primitive I would suggest leaving it as a task for someone else. You can always supply your code as a Quark.

You would want to write a new primitive, and the two I listed would be good as examples because they handle the two related tasks of (1) using a function from boost.regex and (2) performing replacement operations on a string.
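
As a rough illustration of those two tasks, the string-handling core of such a primitive might look like the sketch below. It is not SuperCollider source: `replaceRegexCore` is a hypothetical name, `std::regex` is used as a stand-in for boost.regex so that the example builds with the standard library alone, and all interpreter glue (slots, garbage collector, error codes) is left out.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper, not part of SuperCollider: the string-processing core
// that a replace primitive might wrap. std::regex stands in for boost.regex.
std::string replaceRegexCore(const char* chars, std::size_t len,
                             const std::string& pattern,
                             const std::string& replacement) {
    // Build the input from pointer + length instead of relying on a
    // terminating '\0' being present in the source buffer.
    const std::string input(chars, len);
    const std::regex re(pattern, std::regex::ECMAScript);
    // By default regex_replace substitutes every non-overlapping match.
    return std::regex_replace(input, re, replacement);
}
```

Called on the eight bytes of `012qW567` with the pattern `[0-9]` and an empty replacement, this returns `qW`, which matches the first case in the test code at the top of the thread.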

shiihs · February 12, 2024, 7:49pm

Maybe interesting, although I’m a bit late to let you know: a quark happens to exist which implements a replaceRegex (and some other operations as well).

smoge · February 12, 2024, 11:04pm

If I’m not mistaken, there is a chapter on writing language primitives in the supercollider book. Maybe we can just share it?

Too bad this is a bit intrusive in the project, since there isn’t some kind of lang plugin.

prko · February 12, 2024, 11:06pm

@VIRTUALDOG
Thanks for the detailed explanation! I now understand more the importance of writing primitives using the C++ library, and also why there are so many Quarks with similar functionality.

@shiihs
Thank you for your Quark! I have tested it with my test code. It also gave me an opportunity to review the problems in my draft method.

Your method returns the same result for the following two cases:

/* Removal of the single number at the beginning of the string */
"123qWe456!@£$%^&*".replaceRegex("^[0-9]", "") // expected: 23qWe456!@£$%^&* // returned: qWe456!@£$%^&*
"123qWe456!@£$%^&*".replaceRegex("^\\d", "")   // expected: 23qWe456!@£$%^&* // returned: qWe456!@£$%^&*

/* Removal of the series of numbers at the beginning of the string */
"123qWe456!@£$%^&*".replaceRegex("^[0-9]+", "") // qWe456!@£$%^&*
"123qWe456!@£$%^&*".replaceRegex("^\\d+", "")   // qWe456!@£$%^&*

The following returns an error:

/* Removal of all numbers preceded by a non-number */
":123QWE456rty789".replaceRegex("(?<=\\D)\\d", "") // :23QWE56rty89 //<-(expected)

The following returns an unexpected result on Windows (my draft method has the same problem):

/* Removal of all single uppercase letters */
"123qWe456RTY!@£$%^&*".replaceRegexp("\\u", "")   // 123qe456!@£$%^&* // <-(expected) 
// 123qe456!@�$%^&* // <- on Windows. My method has the same problem on Windows.

/* Removal other than single uppercase letter */
"123qWe456RTY!@£$%^&*".replaceRegex("\\U", "")   // WRTY // <-(expected)
// WRTY�  // <- on Windows. My method has the same problem on Windows.

These oddities in Windows discourage me from continuing to work in this way.

VIRTUALDOG · February 12, 2024, 11:30pm

The “Writing Primitives” page in the SuperCollider 3.12.2 Help is also a resource.

Btw, neither of these SC implementations takes into account match groups or match format syntax (e.g. s/ (\d)/\1/ in sed; see boost’s “Perl Format String Syntax” documentation, 1.84.0), which is a major part of regex substitution functionality.
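
To make the point about match groups concrete, here is a small sketch of format-string substitution. It uses `std::regex` with its default ECMAScript format rules as a stand-in for boost.regex, so the exact placeholder syntax is an assumption about the standard library rather than a statement about boost; the function names are hypothetical.

```cpp
#include <regex>
#include <string>

// Illustrative only: std::regex stands in for boost.regex, and both function
// names are hypothetical.
std::string swapKeyValue(const std::string& input) {
    // Two capture groups separated by '='.
    const std::regex re("(\\w+)=(\\w+)");
    // In the default (ECMAScript) format syntax, $1 and $2 expand to the
    // text captured by the first and second group.
    return std::regex_replace(input, re, "$2=$1");
}

std::string bracketDigits(const std::string& input) {
    const std::regex re("\\d+");
    // $& expands to the whole match, so each run of digits is wrapped in [].
    return std::regex_replace(input, re, "[$&]");
}
```

Here `swapKeyValue("freq=440 amp=1")` yields `440=freq 1=amp` and `bracketDigits("a12b3")` yields `a[12]b[3]`. A replacement built by splicing the literal replacement string into the source, as the sclang drafts do, would emit the placeholders verbatim instead.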

smoge · February 12, 2024, 11:47pm

Something interesting about Scheme as an extension language: a shared library is linked to the running Guile image only when required, optimizing resource usage and flexibility.

I’m sure the devs have discussed this for sclang, and it must be challenging to do something like this.

I wrote an implementation of rational numbers that you reviewed, but you said a primitive would be better. Ok, at the time, I even compiled a version using the boost lib implementation, but it is a little intimidating to modify source files that are so central to the project. Never completed this project.

The quark is good enough for me, but people tend to make more improvised implementations than using a quark. Maybe that’s a trust issue.

VIRTUALDOG · February 12, 2024, 11:56pm

Language plugins would be great, although nobody has taken the effort for it. But let’s stick to the main topic please. (:

smoge · February 13, 2024, 12:02am

Sorry about that. I agree with you, the boost regexp is good. The part I wanted to contribute to the conversation is that if everything must be implemented as primitive (in a scheme style), there should at least be an effort to improve it. Otherwise, things just get stuck.

prko · February 13, 2024, 11:25am

I tried to write primitive using Gemini, but gave up.

It is a shame that I want to implement this feature but can personally only do so in sclang (even though I think my draft works almost correctly), because in this case it should be implemented as a primitive…

So unless there is someone in the development group or a user who actively wants this feature and can handle C++, this feature won’t be implemented in the official SuperCollider build. Instead, several quarks will be written by different users. (Currently at least two…) Oh, I think this is not user friendly (and even less so for musicians)…

It would be better for a user like me to ask advanced users with C++ programming skills and donate some money or a gift via Amazon for example. (I think learning C++ is too far away from music. Am I wrong?)

The philosophy of Open Source is ideal, but the state of development of SC is not ideal if a normal user wants a new feature or if all users have a similar level of programming…

Anyway, I would like to include this feature as an external method in one of my class files in my Quarks that I intend to publish, but how could I write the appropriate part of this method for the String.schelp?

VIRTUALDOG · February 13, 2024, 12:39pm

It does not cover the majority of Boost’s (or any standard) regex replace functionality. It would be bad if this were the core library’s regex-replace. A simple example of a replace which doesn’t work is "a".replaceRegexp(".", "$0") → "a".

Until Rust takes over, C and C++ are the standard languages for writing audio applications. If you aren’t going to learn them, that’s fine, but then you shouldn’t be surprised if you have difficulty contributing to a codebase which is already overwhelmingly C++ code.

The good news is that there are many, many things you can write in sclang which do not require C++ primitives and which would solve open issues. It’s primarily in the case of needing good performance, or functionality which is best provided by a C or C++ library. So you can always try one of those instead. Or, you could take a course on C++, since you have already identified this as an in-demand skill for this project.

Look at the Quark linked above: https://github.com/shimpe/scstringext/tree/master

jordan · February 13, 2024, 12:57pm

I had a go, seems to work as expected

prko · February 13, 2024, 1:02pm

@jordan Thank you for your kindness!

@VIRTUALDOG I had decided not to investigate learning more languages, but … I should reconsider… Anyway, thanks for the clarification. You made me understand the whole scene more deeply!

jordan · February 13, 2024, 1:27pm

Just to chime in…

this took me about 2.5 hours, which given it was the first time I’ve touched the lang side of sc wasn’t too bad.

The real issue isn’t learning the language, it’s reading the existing code and figuring out all the unspoken rules… I kept getting a nasty bug because I assumed the char* of SuperCollider strings was null-terminated (which I think is reasonable), yet it isn’t… I still haven’t found exactly where this is stated in the code or documentation…

It is things like that, along with many of the older conventions, that make this code very challenging for a beginner. The only solution is for someone to clean it up, but that requires that the writer understand all the code in the first place… I don’t think there is anyone with enough knowledge of this code to perform such a task (might be wrong though!).

I’m not too familiar with rust, but I imagine most of the code in the vm would have to be unsafe, so perhaps not a good match here, but definitely in the audio side of things.

smoge · February 13, 2024, 1:58pm

Yes, it’s not super complicated to do, but for the reasons you mentioned, it is. The C language is not exactly the problem.

jamshark70 · February 13, 2024, 2:22pm

A post was split to a new topic: Sclang extensibility

VIRTUALDOG · February 13, 2024, 2:25pm

Thanks @jordan !

Getting used to a new codebase is always a learning curve. I don’t think it is documented, and it probably should be since that would be a reasonable assumption. In SC Strings aren’t null terminated because they are defined as a type of array. So in C++ it would be more like std::vector<char> (which isn’t null terminated) than std::string (which is).
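
A minimal sketch of what that means for C++ code that receives such a buffer: the length has to travel with the pointer, and a terminator has to be added explicitly before calling anything that expects a C string. The helper names are hypothetical; `toCString` only mirrors the allocate `len + 1`, copy, terminate pattern visible in the `prStripRtf` excerpt quoted earlier.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: copy a length-delimited buffer into a std::string.
// The (pointer, length) constructor never looks for a '\0'.
std::string fromSizedBuffer(const char* chars, std::size_t len) {
    return std::string(chars, len);
}

// Hypothetical helper: produce a null-terminated copy for C APIs, following
// the same "len + 1, memcpy, terminate" steps as the prStripRtf excerpt.
std::vector<char> toCString(const char* chars, std::size_t len) {
    std::vector<char> out(len + 1);
    std::memcpy(out.data(), chars, len);
    out[len] = '\0';
    return out;
}
```

Calling `strlen` directly on the original pointer would instead read past the end of the string until it happened to hit a zero byte.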

jordan · February 13, 2024, 3:13pm

I also found that symbols are null terminated, which really confused me because they store their size as well, but it is a uint8_t and doesn’t include the null terminator.

I just had a quick look and the uint8_t causes at least one bug…

// works, produces a file with 255 'a's
a = ('a'!255).reduce('++').asString.asSymbol
f = File("~/tmp/test.txt".standardizePath, "w")
f.write(a)
f.close


// does not work, produces an empty file
a = ('a'!256).reduce('++').asString.asSymbol
f = File("~/tmp/test.txt".standardizePath, "w")
f.write(a)
f.close

There are also a few other places this occurs, although most primitives call strlen on the char*.
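
The difference between 255 and 256 characters above is consistent with an 8-bit length wrapping around. The sketch below shows only that arithmetic; it is not SuperCollider's symbol structure, and `storedLength` is a hypothetical name.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical illustration: what an 8-bit length field would hold for a
// given real length. An 8-bit unsigned value can only represent 0..255.
std::uint8_t storedLength(std::size_t actualLength) {
    // Unsigned narrowing keeps only the low 8 bits (the value modulo 256).
    return static_cast<std::uint8_t>(actualLength);
}
```

With this, `storedLength(255)` is 255 but `storedLength(256)` is 0, so any code that trusts the stored length rather than calling `strlen` would treat a 256-character symbol as empty.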

Definitely off topic now, but I will make a gh issue.