Wikipedia talk:Bots/Noticeboard/Attribution bot proposal
Discuss issues with the Attribution bot proposal.
Proposal: change 'Comment' lines to noinclude content
DreamRimmer, if it is all right with you, I would like to change the concept of 'Comment lines' in the §§ Input file format and input file description, and remove its definition as any string between /* these delimiters */. That creates some awkwardness in interaction with non-data lines on the page, as you can see in the wikicode of User:JeyReydar97/Attribution set 1, where it awkwardly interacts with the hatnote at the top of the page, and with the {{div col}} template that folds the list in reader view mode. The page works, it's just weird looking, and non-standard.
I still want to keep the concept of user-defined, non-data line elements, but more following the wiki-way. I suggest we simply use pairs of <noinclude>...</noinclude> tags around whatever the user wishes to have on the page that is not one of the data lines. If you are okay with this proposal, I will change the spec; or if you have a better idea, please lmk. Thanks, Mathglot (talk) 22:19, 14 December 2024 (UTC)
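As a rough illustration of the noinclude idea, here is a minimal Python sketch of how a script might discard everything inside <noinclude>...</noinclude> pairs before looking for data lines. The function name `data_lines` and the rule that a data line starts with `*` are assumptions made for this example; they are not taken from the spec or from the existing script.

```python
import re

# Matches one <noinclude>...</noinclude> pair, including newlines inside it.
NOINCLUDE_RE = re.compile(r"<noinclude>.*?</noinclude>", re.DOTALL | re.IGNORECASE)


def data_lines(wikitext: str) -> list[str]:
    """Return candidate data lines, ignoring anything wrapped in noinclude tags.

    Assumption for this sketch: a data line is any remaining line that
    starts with '*'. Hatnotes, intro paragraphs and folding templates
    are expected to be wrapped in noinclude tags by the user.
    """
    visible = NOINCLUDE_RE.sub("", wikitext)
    return [line.strip() for line in visible.splitlines()
            if line.lstrip().startswith("*")]


if __name__ == "__main__":
    page = (
        "<noinclude>{{hatnote|Input file}}\n{{div col}}</noinclude>\n"
        "* [[Dan Center Tower]]; [[:he:BBC Tower]]; 19:10, 23 November 2024\n"
        "<noinclude>{{div col end}}</noinclude>\n"
    )
    for line in data_lines(page):
        print(line)
```

Unclosed or nested noinclude tags are not handled here; a real run would probably want to report those as errors rather than guess.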
User requests and administration
I've added a new section § User requests and administration. Please look it over and change it in whatever way makes your life easiest. Mathglot (talk) 23:05, 14 December 2024 (UTC)
- Wow! User requests and operation section looks awesome. Thank you for all your hard work, Mathglot. I really appreciate it :) – DreamRimmer (talk) 04:20, 15 December 2024 (UTC)
Please hold off on live run...
DreamRimmer, please hold off on making a live run; I am making a change to the edit summary text, and will ask for your approval. Please stand by... Mathglot (talk) 23:05, 14 December 2024 (UTC)
- DreamRimmer, I made a bunch of changes, but almost all in the interest of clarity; very little substantive change. Only two that might affect you:
- There is no longer a concept of /* comment delimiters */; instead, please see new subsection § Comment lines, which explains the use of inclusion control as a sort of comment-line workaround. This plays much nicer with other material the user may wish to see when viewing their page, such as hatnotes, See-alsos, Intro paragraphs, templates to fold the list, and so on.
- The summary text has been slightly modified to append some sort of bot id. I have proposed "(by AttriBot)" but feel free to place anything there that will somehow identify it. Later, if/when a bot is approved, we should also link the id to the bot landing page.
- Maybe a third thing, regarding the words Dummy edit in the summary. In your test edit, you included the words "Dummy edit to note that the", but that all seems superfluous to me. In fact, none of the suggested wording at WP:CWW, including examples at WP:TFOLWP and at WP:RIA, mentions in the edit summary that the dummy edit is a dummy edit. WP:RIA suggests starting with the word NOTE:, which makes sense to me, because exceptionally it is talking about *some other edit*, so I feel that is a worthwhile inclusion (and it's brief). What I do personally, when repairing insufficient attribution manually, is include the word NOTE: and hyperlink it to the RIA shortcut, so I end up with something like this: [[WP:RIA|NOTE:]] Content in the edit of $TIMESTAMP was translated...
- and that would be my preference; unhyperlinked NOTE: second choice. But I don't feel strongly about it, and if you prefer explicitly identifying it as a dummy edit in the summary, I don't object.
- My chief concern is the first two points about comment lines, and some kind of bot identifier tag; are you okay with those two? If so, I think we are ready for the next step: a debug run. Mathglot (talk) 01:36, 15 December 2024 (UTC)
- @Mathglot: Sure, I will run it in dry mode and only generate logs. Since I will be using my alternate account for this, I cannot use “(by AttriBot)” in this run. If we handle more than 500 articles or 3 to 4 requests in the future, I will file a BRFA, and we can include it then. Regarding the dummy edit wording in the edit summary, I noticed it being used by others, which is why I adopted it, but I am fine with your suggested format. Thank you for formatting the page. This works for now, but for future runs, I will provide a format that the bot can easily understand. There are some issues with dummy edits that automated user scripts or bots may encounter, such as being unable to make an edit by adding a space if there is already trailing whitespace or a newline. In such cases, the bot cannot perform an actual edit, so I plan to run it in supervised mode for the first few edits to compare the new content with the old content before saving any changes. – DreamRimmer (talk) 01:53, 15 December 2024 (UTC)
- DreamRimmer, Understood; we can deal with appended id later. If it's just a dry run, then we don't need to wait for confirmation from JR (see next section), but we should do, before doing a live run. A couple of questions:
- Shall I change those /* comment delimiters */ to <noinclude>s? That would be my preference, but if you are ready to go, we can change that later.
- Does your procedure make it easy enough for you to write the output log anywhere you want? In section § Logging of the spec, I suggested writing the log to a subpage named '/log' of the input file, but I don't really care where it goes as long as we can do something systematic enough to be documented.
- Thanks, Mathglot (talk) 02:05, 15 December 2024 (UTC)
- @Mathglot: Please check log at User:JeyReydar97/Attribution set 1/log. – DreamRimmer (talk) 02:41, 15 December 2024 (UTC)
- Wow, this is encouraging, thanks! I glanced at the top few, but am about to be on some other things and then offline for a bit, possibly till tomorrow. Should have it checked within a day, though. Log format is different than spec'ed, and I would prefer seeing the input line echoed in the log on top of the edit summary line, rather than just see the en-wiki article name; for one thing, because the second argument contains the lang-code of the Wikipedia in question, and without the :fr: I can't tell if it is printing the correct language name or not. But the log format is not as important as the edit summary, so I will start validating those, soon.
- Another thing I was going to do, was to draw up a test file, as a kind of smoke test. For example, does the current version handle arg 4 (copy/translate token) and arg 5 (user-supplied comment)? And how does it react to extra white space in various places?
- In the meantime, I just wanted to acknowledge all the great work you are doing, and let you know how much I appreciate it. I think this is the core of something that is going to be a very useful and productive tool.
- JeyReydar97, don't feel you have to "validate" anything in the edit summary wording in the log file output by the tool, but you'll probably be interested to see this, as it is the first clear result to come out of all this, including the time you spent drawing up the input file, and in discussion before that. I also wanted to thank you as well for all your effort; we couldn't have done it without you. It's a pleasure to see the emergence of something new and useful that comes out of a collaborative effort like this. The output of Dream's process is here. Pretty cool, eh? (Note: this is just a dry run so far; no articles were changed.) Mathglot (talk) 03:06, 15 December 2024 (UTC)
- Dream, one other thing: how hard is it to run this? Is it a very Rube Goldberg thing, all scotch-taped together on your laptop so unworkable elsewhere, or could I maybe download some software, learn some commands from a manual, and end up being able to run it myself? If more in the latter category than the former, if you could add a new subsection to § User requests and operation with some instructions or tips, that would be nice. Or, if there is a lot to it, either a new section, or a separate page. Thanks again, Mathglot (talk) 03:10, 15 December 2024 (UTC)
- @Mathglot: Just to note, I have not implemented the full proposal you suggested because we currently do not have enough articles or requests to justify such a task. If this becomes a regular task in the future, I will develop a fully automated process. For now, the current version of the code needs to be adjusted manually based on specific requests. For example, in this task, I am tweaking the code according to the provided list and values format. We are working with four key values: article name, target project language, target article name, and timestamp. Additionally, the user part ("by JeyReydar97") has been hardcoded into the edit summary for this task, but I will provide a format later to make it easier for others to supply all the values needed for the bot to understand. If we use the current format you applied, it would require regex, which increases the risk of false positives. Therefore, for the initial requests, I prefer to handle them manually, tweaking the code to match the provided values and format to minimize mistakes. For this run, I have thoroughly checked everything and found no issues, so we can proceed with a live run. However, I have no objection if you and JeyReydar97 would like to re-check these logs/edits. Finally, since the edits will be made on my end using my account, I take full responsibility for any errors the code might make. Please don’t worry about mistakes, as I always review everything multiple times before executing any changes :) – DreamRimmer (talk) 03:31, 15 December 2024 (UTC)
- I can provide you with an easier version of the code that you can run yourself for a small number of pages (100 to 200). I will also provide the necessary documentation to help you set it up and run it. You just need to be careful when using it to ensure it doesn't make any mistakes. – DreamRimmer (talk) 03:36, 15 December 2024 (UTC)
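For readers following along, here is a minimal Python sketch of how the four key values mentioned above (article name, source language code, source article name, timestamp) could be pulled out of one input line by plain string splitting rather than regex. The names `Entry` and `parse_line` are hypothetical, the line layout is inferred from the entries in Attribution set 1, and this is not the code DreamRimmer is running.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    article: str        # en-wiki article name
    source_lang: str    # language code of the source Wikipedia, e.g. "he"
    source_title: str   # article name on the source Wikipedia
    timestamp: str      # timestamp text exactly as given in the input line


def _unlink(field: str) -> str:
    """Strip the surrounding [[ ]] from one field, or fail loudly."""
    field = field.strip()
    if not (field.startswith("[[") and field.endswith("]]")):
        raise ValueError(f"not a wikilink: {field!r}")
    return field[2:-2].strip()


def parse_line(line: str) -> Entry:
    """Parse '* [[Article]]; [[:xx:Source title]]; HH:MM, D Month YYYY'.

    Assumptions for this sketch: fields are separated by ';', titles
    contain no ';', and any fields after the third are ignored.
    """
    body = line.strip()
    if not body.startswith("*"):
        raise ValueError("data lines are assumed to start with '*'")
    fields = body[1:].split(";")
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 fields, got {len(fields)}")
    article = _unlink(fields[0])
    lang, sep, title = _unlink(fields[1]).lstrip(":").partition(":")
    if not (sep and lang.strip() and title.strip()):
        raise ValueError(f"no language prefix in {fields[1].strip()!r}")
    return Entry(article, lang.strip(), title.strip(), fields[2].strip())


if __name__ == "__main__":
    print(parse_line("* [[Dan Center Tower]]; [[:he:BBC Tower]]; 19:10, 23 November 2024"))
```

Failing with a `ValueError` on anything unexpected, instead of guessing, is one way to keep the false-positive risk low while the format is still being settled.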
- Log format changed: User:JeyReydar97/Attribution set 1/log – DreamRimmer (talk) 04:06, 15 December 2024 (UTC)
- Grouping replies to previous messages at various levels:
- Implementation level: understood, re: incomplete implementation, tweaks for prefixes, hard-coded userid, etc.; I'm grateful you got something going that fast, and obviously it's a proof of concept that could be generalized to something more configurable if demand pans out, as you say. I think we will get a better feel for that once we advertise this at a centralized discussion location, but I'd like to have at least one live run under our belt first, so we have something to point to. I will proofread the log file, so we can do that tomorrow, or perhaps the next day if there are issues in this run.
- Log file: that looks ideal; big thanks for that.
- Other things: one minor point: the final stats line is nice, but for a debug run, it should show 0 edits. It would be nice if the final line also echoed runtime params (or up front, if you prefer, as a header line).
- I was also thinking of failure conditions of various sorts. One type might be a typo in an article name in the input file, or conversely, an article that is correct in the input but cannot be read for some reason at run time. I was thinking it would be nice if the input line was echoed to the log first; then attempt to edit the article and add the edit summary to the history. If it succeeds, echo the edit summary to the log; if it doesn't, write some error message to the log. That way we will know which input line the error belongs to, because it's already identified in the log, and it can be looked into later.
- Was also thinking about what to do with redirects, and I don't think we should follow them. Users aren't copying content or translating from one redirect to another, and if they specify a redirect rather than the content-bearing page in their input file, that should be an error. There are even redirects that are linked via wikidata to redirects on other Wikipedias, and I still don't think we should process them: it makes no sense to "copy" or "translate" redirect content from one page to another, and even if in some unique, wacky edge case a user claims to have done so (copying some {{Rcat}}s?), it doesn't require attribution because the content of a redirect page is non-creative content and that cannot be copyrighted, and therefore does not require attribution. So I think we should just not try to process redirects and emit an error message to the log if we encounter one.
- Would love a version of the code, but to save you unnecessary duplicate effort, maybe not yet. Let's see how the dry run validation goes tomorrow, perhaps there will be code tweaks that come out of it. After we have a good dry run and then a good live run and you feel reasonably comfortable with whatever version you are using, then I'll ask for one. I would say we were ahead of schedule if we had a schedule, but since we don't, I'll just say I'm very pleased with the way this is going.
Mathglot (talk) 07:43, 15 December 2024 (UTC)
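The logging and error-handling ideas in the comment above (echo the input line first, then log either the edit summary or an error; treat redirects as errors; show 0 edits for a debug run) can be sketched roughly as follows. Everything here is hypothetical: `run`, `page_state` and `save` are illustrative names, and the two callables stand in for whatever the real script uses to inspect and edit pages.

```python
from typing import Callable, Iterable, List, Tuple


def run(
    entries: Iterable[Tuple[str, str, str]],
    page_state: Callable[[str], str],
    save: Callable[[str, str], None],
    dry_run: bool = True,
) -> List[str]:
    """Process (input_line, article, summary) triples and return log lines.

    page_state(article) is assumed to return "ok" or a short description
    such as "missing" or "a redirect"; save(article, summary) is assumed
    to perform the dummy edit and to raise an exception on failure.
    """
    log: List[str] = []
    edits = 0
    total = 0
    for total, (line, article, summary) in enumerate(entries, start=1):
        log.append(f"{total}. {line}")            # echo the input line first
        state = page_state(article)
        if state != "ok":                          # typo, unreadable page, redirect
            log.append(f":ERROR: [[{article}]] is {state}; skipped")
            continue
        if not dry_run:
            try:
                save(article, summary)
            except Exception as exc:               # keep going, but record the failure
                log.append(f":ERROR: could not save [[{article}]]: {exc}")
                continue
            edits += 1
        log.append(f":{summary}")
    # Final stats line also echoes the run-time parameter.
    log.append(f"Entries: {total}; edits made: {edits}; dry_run={dry_run}")
    return log
```

In a dry run `save` is never called, so the final line reports 0 edits, and every error line sits directly under the input line it belongs to.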
Test run prep
I believe we are almost ready for a test run. Before we do it, I just want to confirm the following points with User:JeyReydar97:
- We are talking here about the articles listed in the input file User:JeyReydar97/Attribution set 1.
- You, JeyReydar97, are the user who made all the edits identified by the articles and associated timestamps in the input file; i.e. you are not reporting edits made by any other user.
- All of the articles listed there represent translations from some other language Wikipedia; that is, none of them are content copied from one English Wikipedia article to another English Wikipedia article; they are all translated.
Can you confirm that all of these statements are true? Thanks, Mathglot (talk) 01:49, 15 December 2024 (UTC)
- Yes, I can confirm all of the points made above. JeyReydar97 (talk) 20:26, 15 December 2024 (UTC)
- Great; I'm thinking that a questionnaire like this might be good to have if & when we formalize this into a user request process for requesting bot runs, because this is a minimum bar, I would think, before running the current, semi-automated process. DreamRimmer, do you agree? (A fully automated process down the road, perhaps category-based, might require additional safeguards, but I think this is a minimum.) Mathglot (talk) 20:42, 15 December 2024 (UTC)
Input file verification
Right-to-left issues
Before I verify the debug output log (here), I am spot-checking the Attribution set 1 input file. I am finding issues with seven input lines about halfway down the page involving Hebrew originals. No doubt this is some sort of right-to-left script directional issue; I think something probably got mangled in the copy-paste out of the contribution history. Here is a copy of those lines:
Seven input lines pertaining to right-to-left script in the source page
* [[Arlozorov Young Towers]]; [[:he:מגדלי הצעירים]]; 23:23, 27 November 2024
* [[Dan Center Tower]]; [[:he:BBC Tower]]; 19:10, 23 November 2024
* [[Eden Tower]]; [[:he:מגדל עדן (בת ים)]]; 22:54, 5 December 2024
* [[Hi Tower]]; [[:he:מגדל Hi Tower]]; 22:57, 22 November 2024
* [[Midtown Tel Aviv]]; [[:he:מגדלי מידטאון]]; 21:39 22 November 2024
* [[Nimrodi Tower]]; [[:he:מגדל נמרודי]]; 16:20, 23 November 2024
* [[Rom Tel Aviv]]; [[:he:מגדל רום]]; 11:41, 5 December 2024

62. Dan Center Tower; he:BBC Tower; 19:10, 23 November 2024
:[[WP:RIA|NOTE:]] Content in the edit of 19:10, 23 November 2024 (UTC) by [[Special:Contributions/JeyReydar97|JeyReydar97]] was translated from the Hebrew Wikipedia article [[:he:BBC Tower]]; see that article's history for attribution.
63. Eden Tower; he:מגדל עדן (בת ים); 22:54, 5 December 2024
:[[WP:RIA|NOTE:]] Content in the edit of 22:54, 5 December 2024 (UTC) by [[Special:Contributions/JeyReydar97|JeyReydar97]] was translated from the Hebrew Wikipedia article [[:he:מגדל עדן (בת ים)]]; see that article's history for attribution.
64. Hi Tower; he:מגדל Hi Tower; 22:57, 22 November 2024
:[[WP:RIA|NOTE:]] Content in the edit of 22:57, 22 November 2024 (UTC) by [[Special:Contributions/JeyReydar97|JeyReydar97]] was translated from the Hebrew Wikipedia article [[:he:מגדל Hi Tower]]; see that article's history for attribution.
65. Midtown Tel Aviv; he:מגדלי מידטאון; 21:39 22 November 2024
:[[WP:RIA|NOTE:]] Content in the edit of 21:39 22 November 2024 (UTC) by [[Special:Contributions/JeyReydar97|JeyReydar97]] was translated from the Hebrew Wikipedia article [[:he:מגדלי מידטאון]]; see that article's history for attribution.
66. Arlozorov Young Towers; he:מגדלי הצעירים; 23:23, 27 November 2024
:[[WP:RIA|NOTE:]] Content in the edit of 23:23, 27 November 2024 (UTC) by [[Special:Contributions/JeyReydar97|JeyReydar97]] was translated from the Hebrew Wikipedia article [[:he:מגדלי הצעירים]]; see that article's history for attribution.
67. Nimrodi Tower; he:מגדל נמרודי; 16:20, 23 November 2024
:[[WP:RIA|NOTE:]] Content in the edit of 16:20, 23 November 2024 (UTC) by [[Special:Contributions/JeyReydar97|JeyReydar97]] was translated from the Hebrew Wikipedia article [[:he:מגדל נמרודי]]; see that article's history for attribution.
68. Rom Tel Aviv; he:מגדל רום; 11:41, 5 December 2024
:[[WP:RIA|NOTE:]] Content in the edit of 11:41, 5 December 2024 (UTC) by [[Special:Contributions/JeyReydar97|JeyReydar97]] was translated from the Hebrew Wikipedia article [[:he:מגדל רום]]; see that article's history for attribution.
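The edit summaries in the log entries above all follow one fixed pattern, which can be sketched in Python as below. The function name `build_summary` and the small `LANGUAGE_NAMES` table are assumptions for illustration only; the real script may look up language names differently.

```python
# Illustrative mapping from language code to language name (an assumption;
# only the codes that appear in this discussion are listed).
LANGUAGE_NAMES = {"he": "Hebrew", "fr": "French", "it": "Italian"}


def build_summary(user: str, lang: str, title: str, timestamp: str) -> str:
    """Return the WP:RIA edit summary in the wording shown in the log."""
    language = LANGUAGE_NAMES[lang]  # raises KeyError for an unknown code
    return (
        f"[[WP:RIA|NOTE:]] Content in the edit of {timestamp} (UTC) "
        f"by [[Special:Contributions/{user}|{user}]] was translated from the "
        f"{language} Wikipedia article [[:{lang}:{title}]]; "
        "see that article's history for attribution."
    )


if __name__ == "__main__":
    print(build_summary("JeyReydar97", "he", "BBC Tower", "19:10, 23 November 2024"))
```

The timestamp is passed through unchanged, so an input line with a missing comma (as in the Midtown Tel Aviv entry) produces a summary with the same missing comma.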
User:JeyReydar97, could you look at those seven lines at /Attribution set 1 and attempt to fix them? If you run into problems, please lmk and I will try and take care of it. Mathglot (talk) 19:37, 15 December 2024 (UTC)
- Yes, I'll fix them in a minute! JeyReydar97 (talk) 20:27, 15 December 2024 (UTC)
- Because the Hebrew language is written backwards, I ran into a writing problem while fixing the dates, so I just put them under the corresponding object as subpoints. JeyReydar97 (talk) 20:34, 15 December 2024 (UTC)
- Thank you for looking into that, and attempting a fix. However, that won't work for the automated procedure; the input file must be one line per article, per the format given at § Input file; I've reverted. I will look into this (a bit later in the day) and get it fixed. Mathglot (talk) 20:47, 15 December 2024 (UTC)
- I tried everything. It simply won't get written inline. That's why I got them underpointed. By the way, can I add one more article on the list? I just translated one a couple of hours ago and I didn't give the attributions in the edit summary (maybe because I'm not yet used to it but I'm trying desperately to not repeat this mistake over and over again). JeyReydar97 (talk) 22:50, 15 December 2024 (UTC)
- JeyReydar97 I will take care of getting the Hebrew articles into the Input file; don't worry about that. Yes, you could in theory add one more article to the list, but please don't. It would be a useful skill for you to learn how to do it on your own. The reason is, as you go forward, if you have just one or two such articles that need fixing, it isn't fair to ask someone to take the time to configure and run the bot just for a handful of articles. So may I ask you to try this one manually? I assume we are talking about Rimini Skyscraper, right?
- The two things that you need to know, are:
- the text of the edit summary that needs to be added (see WP:RIA; your exact text is the following:)
[[WP:RIA|NOTE]]: Content in the edit of 22:16, 15 December 2024 was translated by [[Special:Contributions/JeyReydar97|JeyReydar97]] from the existing Italian Wikipedia article at [[:it:Grattacielo di Rimini]]; see its history for attribution.
- how to perform a dummy edit. If you change nothing on the page, the Publish button will not save the edit summary; you have to change something. Typical is to just find a blank somewhere, and turn that blank into two blanks. That is enough of a change, that when you add the edit summary and hit Publish, it will save it.
- That is all you need. Are you willing to give it a try? Mathglot (talk) 23:07, 15 December 2024 (UTC)
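For the automated case, the manual recipe above (turn one blank into two) could be sketched in Python like this. Both function names are hypothetical, and the `rstrip` comparison is only an approximation of the issue raised earlier in this discussion, namely that adding a space where there is already trailing whitespace or a newline does not produce a real edit.

```python
def dummy_edit(text: str) -> str:
    """Return text with the first interior single blank doubled.

    'Interior' means the blank has a non-whitespace character on both
    sides, so the change cannot be lost as trailing whitespace.
    """
    for i in range(1, len(text) - 1):
        if text[i] == " " and not text[i - 1].isspace() and not text[i + 1].isspace():
            return text[:i] + "  " + text[i + 1:]
    raise ValueError("no interior blank to double")


def is_real_change(old: str, new: str) -> bool:
    """Approximate check that the edit would not be silently discarded.

    Assumption: trailing whitespace at the end of the page does not
    count as a change when the page is saved.
    """
    return old.rstrip() != new.rstrip()


if __name__ == "__main__":
    old = "Rimini Skyscraper is a building.\n"
    new = dummy_edit(old)
    print(len(new) - len(old), is_real_change(old, new))
```

This does not check where the blank sits, so a supervised comparison of old and new text before saving, as proposed earlier in the thread, is still advisable.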
- Sure. Let me try it just now! JeyReydar97 (talk) 23:18, 15 December 2024 (UTC)
- Update It seems that the edit was saved. I added the text you pointed above. The edit history section recognizes the edit so I assume it's been done correctly. JeyReydar97 (talk) 23:22, 15 December 2024 (UTC)
- JeyReydar97, your edit added the text to the edit summary, so that complies with the licensing requirement, so you are done. Congrats! One minor point: it looks like you did not do a dummy edit, which would have shown a +1 byte change to the article size in the History, but instead combined the attribution in the edit summary with a small addition to the article that increased it by +48 bytes. In particular, adding the words "as well as the tallest in Rimini" (diff). This doesn't invalidate the attribution, so all is well, but it is not customary to do it this way, because it leaves the addition of that phrase to the article without an edit summary. Next time, if possible just do the dummy edit (adding a blank and nothing more) along with the WP:RIA edit summary. But this is fine for now; thanks! Mathglot (talk) 23:40, 15 December 2024 (UTC)
- JeyReydar97, Oh no, it's not fine, I take it back! Just noticed that you added the wrong timestamp. You should have copied the text I gave you above. I will fix it. Mathglot (talk) 23:43, 15 December 2024 (UTC)
- Oh, I thought that the timestamp should match the very time in which the edit has been made, but however it makes sense. And thanks for the tip regarding the dummy edits. I'll keep that in mind from now! Thank you for all your help and for the fact that you noticed these problems all along! JeyReydar97 (talk) 23:46, 15 December 2024 (UTC)
Done. Do you see the difference, and understand why? Mathglot (talk) 23:47, 15 December 2024 (UTC)
- No, not the present moment, but the timestamp of when you made the translation, as you have already figured out, I think. It is the translation that needs to be attributed, which is why you have to identify it later via WP:RIA, if you forgot to do it the first time. Mathglot (talk) 23:49, 15 December 2024 (UTC)
- Roger. Your explanations have been crystal clear. It's so convenient to receive such help. Thank you. JeyReydar97 (talk) 23:54, 15 December 2024 (UTC)
- I believe I have solved the right-to-left script issue in this edit. It involved addition of a trailing left-to-right mark after Hebrew titles. We won't know for sure until we do another debug run (which we are not ready for, so please not yet).
- A question remains open in my mind whether the marker should be inside or outside the closing brackets; it may not matter, but I think the most logical place is inside—after the title, and before the brackets. That's how it is now in Attribution set 1 after this fix.
- Another issue is mixed LTR and RTL text in the same title, such as in the title he:מגדל Hi Tower at Hebrew Wikipedia. I think we have this one right as well, even though the Hebrew page title field shows it reversed (that is, with the English portion to the left); the url shows it with the English portion to the right. If you scroll down the input file and click on the Hebrew titles, they all bring up the Hebrew Wikipedia page, so they look right to me this way. When we get to the point of writing test cases for a bot, all of these cases should be included.
- I am now returning to input file verification, and will switch to output log verification tomorrow or the next day. Mathglot (talk) 02:55, 16 December 2024 (UTC)
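One possible way to automate the left-to-right mark fix described above is sketched below in Python. The rule used here (append U+200E inside the brackets when the last strongly directional character of the title is right-to-left) is an assumption chosen for the example; the comment above does not say whether mixed titles such as the Hi Tower one also received a mark. The names `needs_lrm`, `mark_title` and `source_link` are hypothetical.

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK


def needs_lrm(title: str) -> bool:
    """True if the last strongly directional character is right-to-left."""
    for ch in reversed(title):
        bidi = unicodedata.bidirectional(ch)
        if bidi == "L":              # Latin letters, and LRM itself
            return False
        if bidi in ("R", "AL"):      # Hebrew, Arabic and similar scripts
            return True
    return False                     # digits/punctuation only


def mark_title(title: str) -> str:
    """Append a left-to-right mark after the title when needed."""
    return title + LRM if needs_lrm(title) else title


def source_link(lang: str, title: str) -> str:
    """Build the interlanguage link with the mark inside the brackets."""
    return f"[[:{lang}:{mark_title(title)}]]"


if __name__ == "__main__":
    hebrew = "\u05de\u05d2\u05d3\u05dc \u05e8\u05d5\u05dd"  # Hebrew title from the list
    print(repr(source_link("he", hebrew)))
    print(repr(source_link("he", "BBC Tower")))
```

Because the mark itself is a left-to-right character, calling `mark_title` twice does not add a second mark.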
- @Mathglot: There is no problem with these Hebrew entries. If you copy the log output and put it into an edit summary, then preview it, it looks correct. The script is functioning as intended, so I suggest leaving it as it is. – DreamRimmer (talk) 03:23, 16 December 2024 (UTC)
- @Mathglot, any update? – DreamRimmer (talk) 02:05, 20 December 2024 (UTC)
- Sorry for the delay, I've been stuck on a template. Will be back in a day or two. Am thinking when I finish my check we can do a small live run of maybe ten or twelve items, so in case there's a problem we can adjust manually. Also, I will move a couple of the Hebrew ones up to near the top, to make sure we pick up a couple of RTL examples. Mathglot (talk) 04:14, 20 December 2024 (UTC)
Small live test run
@Mathglot: Any update on this, or has this been abandoned? – DreamRimmer (talk) 03:54, 8 January 2025 (UTC)
- Apologies, I got distracted, and then busy. I have a shortened file consisting of just a dozen or so entries that is ready to go as a small test set for a live run, and if we can run just that one, I will check the output, looking for any anomalies. (If there are any, the number is small enough that I can adjust any problems manually after the run.) If it all looks good, I will make a second file consisting of everything else (i.e., 95% of the original file), to complete the run. The shortened file is User:JeyReydar97/Attribution set 1a. Sorry for the delay. Mathglot (talk) 07:50, 8 January 2025 (UTC)
- Thanks for the update. Someone mentioned the issue of unattributed articles on Discord yesterday, which reminded me of this proposal. I thought I'd check if there are any updates. I'll run it on the first set. – DreamRimmer (talk) 08:32, 8 January 2025 (UTC)
Done @Mathglot, @JeyReydar97: Edits – DreamRimmer (talk) 09:10, 8 January 2025 (UTC)
- Thanks. From just a quick once-through, all the wording looks right. I will check tomorrow more in depth to verify dates, links, and so on. Thanks again, this is a really great tool, and I look forward to seeing it generalized, if there is a demand for it that exceeds just the occasional request. Mathglot (talk) 09:25, 8 January 2025 (UTC)
- @Mathglot: I have checked all the dates and links, and they are all correct. There is not a single issue. I can complete this full set now if you have no problem. I am supervising, so there will be no problem at all :) – DreamRimmer (talk) 09:29, 8 January 2025 (UTC)
- Okay, but hold off a second, because the full set also includes the dozen you just did, and running the full set will generate duplicate attributions, which, while not a huge problem, is kind of a gaffe. I'll create another set for you with the 95%; stand by. (Or if you have an intersection tool that can extract them from the full set—I don't—then feel free.) Mathglot (talk) 09:46, 8 January 2025 (UTC)
- By the way, where is the log for the run you just did? I thought I would find it at User:JeyReydar97/Attribution set 1a/log, but that link is red at the moment. That would also be the place to write any problems or errors, as described at § Logging and § Error handling. (edit conflict) Mathglot (talk) 09:52, 8 January 2025 (UTC)
- Of course, I will not make duplicate edits. I have removed all the processed entries. – DreamRimmer (talk) 09:49, 8 January 2025 (UTC)
- User:JeyReydar97/Attribution set 1a#Log – DreamRimmer (talk) 09:54, 8 January 2025 (UTC)
- Oh, it's included in the file; that works! (Maybe not for the full set, though?) (post-ec) Mathglot (talk) 09:59, 8 January 2025 (UTC)
- Okay, you sound ready; in that case, go for it; please link the input file you plan to use so I know what to look at, and please leave a log file for the run so we can look up any problems that might occur: subpage */log is fine, or wherever is convenient for you. Looking forward to the next steps (looping in one or two people very interested in the attribution topic who I think will be impressed, and may have good feedback based on their patrolling and repair experiences). I probably won't respond again tonight, will check back late tomorrow. (edit conflict) Mathglot (talk) 09:58, 8 January 2025 (UTC)
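The step discussed above, removing the already-processed entries from the full set so that no duplicate attributions are made, amounts to a simple set difference. Here is a minimal Python sketch; `article_key` and `remaining_entries` are hypothetical names, and matching on the first field (the en-wiki article) is an assumption about what counts as "the same entry".

```python
from typing import Iterable, List


def article_key(line: str) -> str:
    """Key an input line by its first field, ignoring bullet and spacing."""
    first_field = line.split(";", 1)[0]
    return " ".join(first_field.strip("* \t").split())


def remaining_entries(full_set: Iterable[str], processed: Iterable[str]) -> List[str]:
    """Return lines of full_set whose article is not in processed, in order."""
    done = {article_key(line) for line in processed if line.strip()}
    return [line for line in full_set
            if line.strip() and article_key(line) not in done]


if __name__ == "__main__":
    full = [
        "* [[Dan Center Tower]]; [[:he:BBC Tower]]; 19:10, 23 November 2024",
        "* [[Rimini Skyscraper]]; [[:it:Grattacielo di Rimini]]; 22:16, 15 December 2024",
    ]
    first_batch = ["* [[Dan Center Tower]]; [[:he:BBC Tower]]; 19:10, 23 November 2024"]
    for line in remaining_entries(full, first_batch):
        print(line)
```

If the same article legitimately appears twice with different timestamps, keying on the first field alone would drop both; keying on the whole normalised line would be the safer variant in that case.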
Done User:JeyReydar97/Attribution set 1b (log) Edits – DreamRimmer (talk) 11:21, 8 January 2025 (UTC)
- Pinging a few folks who handle copyright-related tasks to hear their thoughts on this proposal. @Vanderwaalforces, @GreenLipstickLesbian, @Reconrabbit. – DreamRimmer (talk) 17:26, 8 January 2025 (UTC)
- The examples and results look excellent. Though I've been aware of the ongoing missing attribution issue since I learned about it some months ago and fix it when I see it, when it happens once with a user it's likely not the first time. This will be a great help in just those situations. Of course Vanderwaalforces' script has been of great help in identifying the less obvious instances of copying/translating and input from the author will probably be of more weight than mine. Reconrabbit 17:49, 8 January 2025 (UTC)
- This is superb, I checked the example edits made, they're awesome. There are thousands of unattributed translations the last time I checked, and I think this would make the work easy to a very large extent. Vanderwaalforces (talk) 13:41, 10 January 2025 (UTC)
- Great to hear these responses, and I admit to not having been aware of Vanderwaalforces's script before. When the dust settles, and after we have improved the proposal doc a bit more (notably to add instructions on how to run this, so that DreamRimmer doesn't have to be the only one to bear the burden of responding to every request, I would like to help, and I imagine y'all would, too), I will add a fairly brief note to WP:VPR to formalize this as a proposal, and link back here for the details. In the meantime, feel free to loop in other folks that take a particular interest in finding, reporting, or dealing with instances of unattributed copying or translation. Thanks, Mathglot (talk) 16:33, 10 January 2025 (UTC)
Dummy edits
@Mathglot: Sometimes automated scripts and bots struggle to make a dummy edit by simply adding a space, for various reasons, so if we could append a comment like: <!-- A bot made a dummy edit to attribute translation/copied material. Please see the history of this page for more information. -->
and then remove the comment after an hour, it would make dummy edits much easier. I’m not sure how the community feels about this idea, but if it is allowed it could be a great solution. – DreamRimmer (talk) 10:25, 15 December 2024 (UTC)
- I think this is allowed per Help:Dummy edit#Methods. If we file a BRFA and BAG asks us to clean it up, I will remove any comments we use for dummy edits. But for this first run, I think we should leave it as is, since making another edit just to remove this would be a cosmetic edit. We can shorten the comment if needed. – DreamRimmer (talk) 12:11, 15 December 2024 (UTC)
<!-- Dummy edit to attribute translation/copied material; can be deleted. -->
would be good. – DreamRimmer (talk) 12:17, 15 December 2024 (UTC)
- Pinging @User:Primefac to confirm because Help:Dummy edit is a help page, not a policy. – DreamRimmer (talk) 12:22, 15 December 2024 (UTC)
- It would be better to find a way to have the bot make a true dummy edit than make two edits to the page - people don't like that. Primefac (talk) 18:09, 15 December 2024 (UTC)
- DreamRimmer, I am not (yet?) a bot writer, but I fail to see why this strategy would not work for a dummy edit: "Append one blank ( ) to the end of the page". I am struggling to think of a case where that could cause problems. Alternatively: "change the first newline on the page to blank + newline" (which however would not work in the rare case of a one-paragraph article stub with no newlines; however it would be very unlikely that such an article would be the result of a translation). Mathglot (talk) 19:40, 15 December 2024 (UTC)
- Btw, I think Help:Dummy edit#Methods is aimed at the individual editor making manual edits, not bots. However, what it labels "the simplest method" should work fine for bots, too (although I think my first suggestion is simpler for a bot). Mathglot (talk) 19:45, 15 December 2024 (UTC)
- Just noting that adding a space/newline to the end of a page will most likely not work as extra whitespace in that area tends to get ignored/cut off; the suggestion of adding an extra space at the end of the first line is more likely to stick. Primefac (talk) 12:10, 16 December 2024 (UTC)
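The "space at the end of the first line" approach discussed above can be sketched as a pure text transformation. This is a minimal illustration only, not the code the bot actually runs: the function name `toggle_dummy_space` is hypothetical, and toggling (removing the space if one is already there) is an assumption about how repeated runs should behave.

```python
def toggle_dummy_space(wikitext: str) -> str:
    """Add one trailing space to the first line, or remove it if present.

    Hypothetical helper: works on the page text only and makes no edits itself.
    """
    first, newline, rest = wikitext.partition("\n")
    if first.endswith(" "):
        first = first[:-1]    # a trailing space already exists: remove it
    else:
        first = first + " "   # otherwise append exactly one space
    return first + newline + rest
```

For a page with no newline at all, the first line is also the end of the page, so the added space falls in the area where trailing whitespace tends to be cut off; such pages would probably need separate handling.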
Edit summary suggestion: include diff
Including a Special:Diff link would identify the edit with zero ambiguity due to time zones or user renames. For example, Special:Diff/1247478351 for Galicia Central Tower. I understand that collecting the oldids is additional work, especially since JeyReydar97 has already compiled a list. Flatscan (talk) 05:28, 15 December 2024 (UTC)
- Thanks for your feedback. I tend to favor permalinks or diffs in a lot of situations, and I agree with you that rev ids are more precise, but honestly, I think this would be overkill. How often do you really get two edits on the same page involving translated (or copied) content within the same calendar second? And even if it is a good idea, this is the wrong place to propose it. This page is just a bot proposal page, which attempts to recreate the edit summary proposed by the editing guideline Wikipedia:Copying within Wikipedia in section § Repairing insufficient attribution. If you can get consensus at WP:CWW to make this change, then this proposal page will definitely follow suit. Mathglot (talk) 06:57, 15 December 2024 (UTC)
Input file format: article linkage
DreamRimmer, during Input file verification (which I am still working on, due to some RTL issues with Hebrew) I noticed that JR wikilinked params 1 and 2 in Attribution set 1, even though the spec currently calls for them to be unlinked (see § Input format). I think that's a good idea, but wanted to discuss it.
The dry run of your procedure appears to be expecting, or at least handling, the linked source file correctly, and placing it correctly into the edit summary, wikilinked as required, as shown in the Attribution set 1/log of the debug run. So that's good and shows that you and JR are on the same page. However, the param linkage doesn't match the current spec, which calls for them to be unlinked. I think we should change the spec and allow (or require) wikilinks in params 1 & 2. Not sure if you are looking for and parsing the brackets in the wikilinks in the input, or just splitting on semicolon or what, but I agree that having the two files wikilinked in the input is helpful for humans (red links for the English articles would jump out at you, and both links are helpful for vetting the input file), so the links make the input file easier to verify.
So, I'd like to change the spec to match current usage, so that instead of saying that the input line format is unlinked, namely:
* ArticleTitle; SourceTitle; Timestamp; Type; Comment
(current spec)
we would instead say that it is linked:
* [[ArticleTitle]]; [[SourceTitle]]; Timestamp; Type; Comment
(proposed new version)
Is that okay with you? Do we want to require users to use the bracketed format, with both articles wikilinked ? I would be in favor of that, I just want to make sure it doesn't cause you any problems with the procedure. Alternatively, we could allow them to be wikilinked or not, and accept both formats, if you prefer.
As a secondary issue: am I correct in assuming that your procedure does not currently parse args 4 and 5 (Type and Comment)? If so, for the time being, I'd prefer to leave them in the spec of the Input file format as a future goal, and just mention afterward that args 4 and 5 are not implemented yet, if you are okay with that. Mathglot (talk) 01:24, 16 December 2024 (UTC)
- Both formats are good, but the wikilinked format is more helpful, so I think we should use it. Regarding args 4 and 5, they are not implemented yet, but there is no issue with keeping them in the input file format spec as a future goal. – DreamRimmer (talk) 03:50, 16 December 2024 (UTC)
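To illustrate how a procedure could accept both the linked and the unlinked format agreed on above, here is a minimal sketch. It is an assumption about one possible approach (split on semicolons, then strip optional `[[...]]` brackets), not a description of DreamRimmer's script; the names `strip_link` and `parse_line` are hypothetical.

```python
import re

# Matches a field that consists entirely of one wikilink, e.g. "[[Foo]]".
WIKILINK = re.compile(r"^\[\[(.+?)\]\]$")

def strip_link(field: str) -> str:
    """Return the title inside [[...]], or the field unchanged if unlinked."""
    field = field.strip()
    match = WIKILINK.match(field)
    return match.group(1).strip() if match else field

def parse_line(line: str) -> dict:
    """Parse '* Article; Source; Timestamp; Type; Comment' (titles linked or not).

    Type and Comment are treated as optional, since they are not implemented yet.
    """
    body = line.strip().lstrip("*").strip()
    # At most 5 fields, so a Comment may itself contain semicolons.
    parts = [p.strip() for p in body.split(";", 4)]
    if len(parts) < 3 or not all(parts[:3]):
        raise ValueError(f"expected at least 3 fields: {line!r}")
    parts += [""] * (5 - len(parts))
    article, source, timestamp, type_, comment = parts
    return {
        "article": strip_link(article),
        "source": strip_link(source),
        "timestamp": timestamp,
        "type": type_,
        "comment": comment,
    }
```

One limitation of splitting on semicolons: an article title that itself contains a semicolon would be split incorrectly, so such titles would need special handling.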
Interconnect finder-bot with repair-bot (with some manual intervention)
Had an idea. Vanderwaalforces, in a section above, you said:
There are thousands of unattributed translations the last time I checked
and this raised a question in my mind, and inspired an idea for further improvement of the productivity of this process by using the output of (an upgraded version of) your script to feed DreamRimmer's procedure, which I am calling the repair-bot (based on WP:RIA) for the purposes of this discussion.
My question is, since your script operates on one article at a time (namely, the one you are viewing at a given moment), how do you know that there are thousands of unattributed translations? Does that mean you have a bot-ified version of your script that can scan an input feed or list of articles? (I'm calling such a bot a finder-bot.)
The idea that it inspired is the following: let's say you have such a finder-bot, or could create one that operates off an article list, and generates an output log. I imagine that the log is something like a list of articles, along with one of the display_messages from your script saying whether it is a likely unattributed translation or not. In that case, maybe you could create an upgraded version of it, that would also find the likely revision that wrote translated text to the English article, and scrape the timestamp of that edit, and the likely source article from the edit summary, and write them to a log file. You would also have the username available from the history entry. From that, you construct an output log with lines that might look like this:
* ArticleTitle; Userid; SourceTitle; Timestamp; Type; Comment
A massaged version of this file could then be used to feed an upgraded version of DreamRimmer's repair-bot, as its § Input file format is very similar to that.
Two caveats:
- The log file would have to undergo manual inspection before passing it to the repair-bot, to check for false positives. The human editor would delete all lines that did not represent an unattributed translation.
- DreamRimmer's script currently has no input file parameter for Userid, and assumes every line in the file belongs to one user. But presumably, DreamRimmer could make an updated version of the script, where the modified input file format was as shown above.
If your script outputted lines in that format, setting Type = translate and Comment equal to the display message from your script (or more likely, a short token to stand in for it), and DreamRimmer updated his script accordingly, then we would have a three- or four-step process for rapidly reducing the number of unattributed translations:
- Run finder-bot to find possible unattributed translations, generating an output log as posited above
- Volunteer editor goes through the log, eliminating false positives; the remainder represent pages with unattributed translations.
- Editor moves the edited log file to location where repair-bot is looking for work to do, which runs the file and writes a log.
- Might need occasional follow-up by human editors checking the repair-bot output log to deal with failures or other anomalies.
Despite the fact that there is some manual intervention involved, this is still a whole lot better than the way we do it (or mostly, don't do it) now. Seems to me we could make a big dent in reducing the number of unattributed translations, maybe eventually eliminating the backlog entirely, and then catching suspicious candidates almost live off the new articles feed. I know this is kind of getting ahead of ourselves, as we are just validating initial output of the repair-bot above, but I wonder what you both think about this in principle? I don't see anything a priori to prevent this; have I missed anything? Mathglot (talk) 22:29, 10 January 2025 (UTC) —updated per latest spec; by Mathglot (talk) 22:44, 12 January 2025 (UTC)
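The "massaging" step between the two bots could look something like the sketch below. Everything in it is an assumption for illustration: it takes finder-log lines in the `ArticleTitle; Userid; SourceTitle; Timestamp; Type; Comment` shape posited above (after a human has already removed false positives) and regroups them under one `== Userid; Type ==` header per user, which is the section-header form that comes up later on this page. The function name `regroup_finder_log` is hypothetical.

```python
from collections import defaultdict

def regroup_finder_log(finder_lines):
    """Group reviewed finder-log lines into per-user repair-bot sections."""
    sections = defaultdict(list)  # (userid, type) -> list of data lines
    for raw in finder_lines:
        line = raw.strip()
        if not line.startswith("*"):
            continue  # skip anything that is not a data line
        parts = [p.strip() for p in line[1:].split(";", 5)]
        if len(parts) < 5:
            raise ValueError(f"expected at least 5 fields: {raw!r}")
        parts += [""] * (6 - len(parts))
        article, userid, source, timestamp, type_, comment = parts
        entry = f"* {article}; {source}; {timestamp}"
        if comment:
            entry += f"; {comment}"
        sections[(userid, type_)].append(entry)
    out = []
    for (userid, type_), entries in sections.items():
        out.append(f"== {userid}; {type_} ==")
        out.extend(entries)
    return "\n".join(out)
```

Grouping by user moves the Userid out of every line and into a header, so the per-line format stays close to what the repair-bot already reads.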
- @Mathglot This is a whole lot of ideas, but there is actually no "finder-bot". I mean, there's no bot-ified version of my script. There are users who are fond of doing this TRANSVIO and some of them have created or rather translated a lot of articles already. For example, in User:Vanderwaalforces/Sandbox, you'd see a user's entire page creations, over 700 for one and over 200 for the other, yeah and they're unattributed translations. I started working on them a while ago but haven't worked on it for a while now. Vanderwaalforces (talk) 21:39, 12 January 2025 (UTC)
- Vanderwaalforces, but you found those users somehow; how did you? Was it via a search that could be turned into a bot? Anyway, even if there didn't use to be a botified version of your script up until yesterday, now it appears that there is; see the last two comments in § Another test run by DR (the one starting 10:07, 12 January; diff). I don't know where DreamRimmer keeps the code for it, but maybe the two of you could collaborate on it; as I understand, that version takes a userid as an input param, but maybe it could be extended to work off a category like, say, Category:Articles with possible unattributed translations (which at least initially, we could populate manually) or by reading a new articles feed, or something. Or if the way you found those users was the result of some kind of search based on the string translat like your script does, maybe that could be the basis for it. Mathglot (talk) 21:53, 12 January 2025 (UTC)
- I forgot—actually, Category:Wikipedia articles with possible unattributed translations already exists, and is populated by Template:Unattributed translation, but someone has to place that template manually. Maybe a finder-bot could add the category as well. Mathglot (talk) 21:55, 12 January 2025 (UTC)
- Damn, that category exists... But, come to think of it, instead of a human to tag an article as a possible unattributed translation, can't they just do the attribution? I mean, that would instead reduce or not increase the backlog. Vanderwaalforces (talk) 22:40, 12 January 2025 (UTC)
- If they would just do the attribution, that would be ideal; then there would be no need for the proposal on this page, and you, me, and DreamRimmer wouldn't be here trying to figure out how to automate the attributions as much as possible, and we could all go off and improve the encyclopedia some other way. But alas, users don't always do what they are supposed to do. Some lacunae can be just left alone, or tagged, or even ignored sometimes, but attribution being a policy with legal implications, it cannot be ignored, nor overridden by consensus or even by changing the policy; you would have to change the Wikimedia wmf:Terms#7b, and the U.S. and world copyright laws, and that isn't going to happen. So, here we are. Mathglot (talk) 22:56, 12 January 2025 (UTC)
- Oh wait, I think I misread you; you mean the person that *found* the article that lacked attribution, not the person who added unattributed content, is that right? In that case, sometimes they can, especially if there is only one other foreign Wikipedia article, and its creation time is a lot earlier than the suspect edit at en-wiki. But, what if it's not clear which language article should be tagged: maybe the English one was created first, and the Spanish one was later, and maybe we should be tagging the Spanish one? Or maybe it's clear that the English one was created later than the French, Spanish, Catalan, and Italian versions, but they are all somewhat similar (or used to be), at least in the translated part, then we can't add the attribution, at least not without a ton of investigation into past versions, results of automatic translation of those versions, and careful comparison, and maybe not even then. That is not a fair amount of work to impose on a volunteer who is just trying to tag problems, when the original author could have added the proper attribution in a few seconds. So, there are a few reasons to tag the article with the template, rather than try to repair the missing attribution yourself. Mathglot (talk) 23:31, 12 January 2025 (UTC)
- Yep, I meant the person that found the article that lacked attribution. Anyway, we're all volunteering, so, yeah. Vanderwaalforces (talk) 09:50, 13 January 2025 (UTC)
Another test run
Hi, DreamRimmer. Can you launch a debug-only run on User:Torimem/Attribution set 1? Please generate only a log file, without updating any article pages with the attribution. Thanks, Mathglot (talk) 13:58, 11 January 2025 (UTC)
- Thanks, that was quick! You may have noticed that there is discussion going on at User talk:Torimem which is resulting in changes and additions to the list of articles in the input file, so ultimately another debug run will be needed, but article discovery and changes are still in progress. I'll let you know when things have quiesced. Mathglot (talk) 00:06, 12 January 2025 (UTC)
- Status update: we now have file Attribution set 2 in progress for this user. It is a superset of set 1, containing all of the latter, minus the translate; token, plus another 50 articles. I'll check the new ones (they were drafted by Torimem in input file format at my request) and spot check the previous 87 from your previous run, and then we should do a new debug run. I won't be checking every one, so there might be some errors that crop up, but that will be a good test, as well. Stand by for further updates.
- Oh, one other thing: he compiled the new lines in two batches headed by a caption each; I have preserved his comments in the input file within <noinclude>...</noinclude> tags (per § Comment lines) which I believe you are already handling. Mathglot (talk) 02:41, 12 January 2025 (UTC)
- @Mathglot: User:Torimem/Attribution set 2/log – DreamRimmer (talk) 06:59, 12 January 2025 (UTC)
- Hi, that was fast! The one thing I noticed at first glance is that the Comment; field isn't being echoed in the input line, but it is in the edit summary line. I originally imagined it the other way round (only for the echoed input line), but maybe it's a good idea to add it to the edit summary line as well; that way users could add stuff to the edit summary as I often do when I do manual attribution. So maybe it should be in both places. That does offer the user more flexibility, so probably a good idea. Will check tomorrow for validity of individual data fields. Mathglot (talk) 08:46, 12 January 2025 (UTC)
- @Mathglot: User:Torimem/Attribution_set_2/log#Log_2 – DreamRimmer (talk) 08:59, 12 January 2025 (UTC)
- Final input format:
== User; type ==
* ArticleTitle; :langcode:SourceTitle; Timestamp; Comment
- Only the comment field is optional; all others are required. – DreamRimmer (talk) 09:04, 12 January 2025 (UTC)
- Yes, I was simultaneously adding pretty much the same information to the § Input file section of the doc. The only difference was that I included wikilink brackets around ArticleTitle, and did not require :langcode: before SourceTitle, in the case of using the procedure for copy attribution, where the langcode would not be needed. Mathglot (talk) 09:56, 12 January 2025 (UTC)
- @Mathglot: Based on Vanderwaalforces' script, I have written a script to check for possible unattributed articles by a particular user. I have posted the test results at User:DreamRimmer/sandbox. There are some false positives, so it will need a human review to confirm. – DreamRimmer (talk) 10:07, 12 January 2025 (UTC)
- User:DreamRimmer/Possible unattributed articles – DreamRimmer (talk) 17:59, 12 January 2025 (UTC)
- Nice! Can you provide links to the code for this, as well as the other script? If only on your computer, can you upload it somewhere? See also VWF's comment above, and their list at User:Vanderwaalforces/Sandbox. Thanks, Mathglot (talk) 22:36, 12 January 2025 (UTC)
Proposed spec changes to section header and input line format
- @Mathglot: I suggest keeping all parameters required, with no optional parameters. The first parameter should be the article name, the second should include the language code of the original language followed by the article name, and the third should be the date of translation. The user and type of attribution (such as "copy" or "translate") should be included but appear only once in the heading, not repeated for each entry. "Copy" and "translate" sections should be kept separate. If the "revision diff" parameter is used, it should also be required for all future sets. Final set would be:
== Username; translate ==
* [[Foreign battalions in the São Paulo Revolt of 1924]]; :pt:Batalhões estrangeiros na Revolta Paulista de 1924; 11:48, 13 August 2023
* [[São Paulo Revolt of 1924 in the interior]]; :pt:Interior de São Paulo na Revolta Paulista de 1924; 20:14, 12 August 2023
* [[Urban combat in the São Paulo Revolt of 1924]]; :pt:Combate urbano na Revolta Paulista de 1924; 15:41, 6 August 2023
– DreamRimmer (talk) 16:55, 11 January 2025 (UTC)
- User:Torimem/Attribution set 1#Log – DreamRimmer (talk) 17:03, 11 January 2025 (UTC)
- Regarding proposed § Input file format modifications, I am fine at present with a section header of == Username; translate == if that will help (and it is useful for documentary purposes). Based on some things Vanderwaalforces said earlier, I had the impression that they had a way to discover hundreds of missing attributions, and it sounded like it was based on a feed (or maybe AWB? not sure), which might result in articles with translate-edits by multiple editors all mixed in, in which case it would be convenient to have the username in the Input file lines. However, I am not clear what they were referring to, and in any case, we don't have Username; in the Input file now, so in order not to complicate things, let's just keep it the way it is now (no username in input lines), and if needed at some point in the future as an upgrade, we can look into it then.
- Regarding the revision diff: I do not want to make this required; this was just something I stuck into the Comment; field as § Input file format optional field #5 (would become #4, if we drop field #4, Type, which is currently translate; in all cases) so that users can comment their code lines without bothering your batch repair procedure.
- So if we are on the same page, I will make the following changes:
  - eliminate field #4, Type, from the § Input file format
    - consequently, modify User:Torimem/Attribution set 1 to drop translate; from every line in their Input file
  - modify the spec to add a required level-2 header at the top in format: == Username; translate ==
    - consequently, add == Torimem; translate == to the top of their Input file
- We will retain the Comment field as optional field 4 (formerly field 5) where the user can add anything they want (I happened to add a diff, but it is still just a comment), which will be ignored by the procedure for processing, other than to continue to echo it to the log as part of the input line in debug runs, and similarly just before the line is processed in a live run. Are you okay with the above as written?
- Also, one other thing regarding ordering: I presume the order of the input file lines does not matter to the procedure. So far, it has been easiest for me to create the file from the history page of the editor's new file creations, which is listed by mediawiki in reverse chrono order. But it may be easier for users to have the list alphabetical by en-wiki article title (or in some other order of their preference). As it is, I have proposed alphabetizing Torimem's file to make it easier for him to find what articles might be missing from the list. Just wanted to confirm with you that the order of lines makes no difference to the process. If that is the case, I will add something to the spec to that effect, so users may add lines in whatever order makes their life easier. Mathglot (talk) 00:06, 12 January 2025 (UTC)
- Ping DreamRimmer to new section, due to my addition of a new section header which you won't have a subscription to yet. Mathglot (talk) 00:10, 12 January 2025 (UTC)
- Oh, one other thing: the section header is actually quite helpful for another reason: suppose an admin wanted to submit an input file involving missing attribution for three different users. Are you okay with their creating one input file in the admin's user space having three top-level (H2) section headings naming three different users, where each one is followed by the input lines corresponding to that user? (If yes, that should probably be restricted to admins.) Or would you require them to submit three separate files, each with one H2 header? Mathglot (talk) 00:21, 12 January 2025 (UTC)
- As a corollary, non-admin users should probably not be allowed to submit files for other users; i.e., the H2 section header would be required, and the Username; field should match {{ROOTPAGENAME}} of the input file (and of the submitter?) or result in an abort if no match. Mathglot (talk) 00:40, 12 January 2025 (UTC)
- Based empirically on the results of the latest run, which works with the new format, and your comments above at 09:04, 12 Jan., I've gone ahead and updated the § Input file portion of the spec accordingly. Mathglot (talk) 09:59, 12 January 2025 (UTC)
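The header check suggested above (required H2 header, Username matching the root page name unless the submitter is an admin) might be sketched as follows. This is only an illustration of the idea under stated assumptions: the names `root_page_name` and `check_headers` and the `submitter_is_admin` flag are hypothetical, and the comparison is a plain string match that ignores MediaWiki's title normalization (first-letter case, underscores versus spaces).

```python
import re

# Level-2 header of the form "== Username; type ==" (does not match H3 or deeper).
HEADER = re.compile(r"^==\s*([^;=]+?)\s*;\s*([^;=]+?)\s*==\s*$")

def root_page_name(title: str) -> str:
    """Approximate {{ROOTPAGENAME}}: drop the namespace prefix and any subpages."""
    namespace, colon, rest = title.partition(":")
    base = rest if colon else namespace
    return base.split("/", 1)[0]

def check_headers(page_title: str, wikitext: str, submitter_is_admin: bool = False):
    """Return [(username, type), ...] or raise ValueError to abort the run."""
    headers = []
    for line in wikitext.splitlines():
        match = HEADER.match(line.strip())
        if match:
            headers.append((match.group(1), match.group(2)))
    if not headers:
        raise ValueError("missing required '== Username; type ==' header")
    if not submitter_is_admin:
        root = root_page_name(page_title)
        for username, _ in headers:
            if username != root:
                raise ValueError(f"header user {username!r} does not match {root!r}")
    return headers
```

With `submitter_is_admin=True`, a single file may carry several H2 sections for different users, which corresponds to the three-user admin scenario raised above.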
Test sets
Just recording here the test sets we have been looking at:
- User:JeyReydar97/Attribution set 1
- User:JeyReydar97/Attribution set 1b
- User:Torimem/Attribution set 1
- User:Torimem/Attribution set 2
Mathglot (talk) 10:32, 13 January 2025 (UTC)
- Needed: User:Pete Maverick/Attribution set 1; see User talk:Pete Maverick#Missing translation attribution
- Proto-set: User:Thriley/translations; not in proper format, but has all the info needed to tweak it into compliance by regex. These come from articles created in mainspace.
- And also: User:Thriley/translation drafts. These come from articles created in Draft space.
- In the works: User:SuperSkaterDude45/translation_worksheet. Mathglot (talk) 19:32, 10 March 2025 (UTC)
Test set ready for production run; spec change
DreamRimmer, the file User:Torimem/Attribution set 2 is ready for a production run.
Please note: I have made a change to the spec data line format, so that the source file is now also linked. The new version is at § Data lines, and is now the following:
* [[ArticleTitle]]; [[SourceTitle]]; Timestamp; Comment
where the SourceTitle has a lang-code prefix in the case of translations, and where the comment field is still optional as before.
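As an illustration of the data line format just described, here is a minimal parsing sketch. It is not the procedure's actual code; the names `DataLine` and `parse_data_line` are hypothetical, and the timestamp pattern assumes the `HH:MM, D Month YYYY` form used in the examples on this page, with English month names (which `strptime` only recognizes under an English or C locale).

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional

class DataLine(NamedTuple):
    article: str
    lang: Optional[str]   # None for copy attribution (no lang-code prefix)
    source: str
    when: datetime
    comment: str

LINE = re.compile(
    r"^\*\s*\[\[(?P<article>[^\]]+)\]\]\s*;"
    r"\s*\[\[(?P<source>[^\]]+)\]\]\s*;"
    r"\s*(?P<ts>[^;]+?)\s*(?:;\s*(?P<comment>.*))?$"
)
LANG = re.compile(r"^:(?P<lang>[a-z][a-z0-9-]*):(?P<title>.+)$")

def parse_data_line(line: str) -> DataLine:
    """Parse '* [[ArticleTitle]]; [[SourceTitle]]; Timestamp; Comment'."""
    match = LINE.match(line.strip())
    if not match:
        raise ValueError(f"not a valid data line: {line!r}")
    source = match.group("source").strip()
    lang_match = LANG.match(source)
    lang = lang_match.group("lang") if lang_match else None
    title = lang_match.group("title") if lang_match else source
    when = datetime.strptime(match.group("ts"), "%H:%M, %d %B %Y")
    comment = (match.group("comment") or "").strip()
    return DataLine(match.group("article").strip(), lang, title, when, comment)
```

Parsing the timestamp up front, rather than passing it through as text, means a malformed date aborts the line before any edit is attempted.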
Because a spec change is involved, if you would rather do another debug run first, we can do that. Alternatively, if it is easier to do a live run with the old format (i.e., unlinked SourceFile), then just undo my last edit at User:Torimem/Attribution set 2, namely, rev. 1279678514 of 22:50, 9 March 2025, and it will restore the previous format.
Note that the previous dry run log file has been moved to: User:Torimem/Attribution set 2/log debug run to make way for a live run with the log file at User:Torimem/Attribution set 2/log.
Thanks, Mathglot (talk) 00:21, 10 March 2025 (UTC)
- @Mathglot: Done. I didn't find any problems. The logs are saved at User:Torimem/Attribution set 2/log. Please review the edits and logs and let me know if you notice any issues. There are some duplicate edits on the first 5-6 pages because I restarted the kernel and forgot to remove the entries that were already done. – DreamRimmer (talk) 05:18, 10 March 2025 (UTC)
- Understood, and will do; thanks. For the record, this history link lists the 148 edits involved (> 137 due to the manual reverts, I presume). By the way, since a dummy edit is only about the edit summary, I'm not sure it makes sense to do a revert; a duplicate attribution—that is, the same attribution appearing twice in the edit history— is not harmful. Otoh, if the attribution is in error, then a subsequent edit, not a revert but a further dummy edit with an edit summary noting/correcting the error is needed (e.g., contrary to the previous edit summary of <timestamp>, this article was not a translation, or, ...was a translation of the German article 'Bar', not the 'Foo' article or whatever the case may be). Such a need to correct the record will no doubt arise eventually, and the proposal should probably contain a section about how to handle it.
- A future flash: I have two more sets in the works: the first is format-compliant on the surface but needs substing in the code and is not yet content-verified (here), and another which is in an early, key=value state at the moment (here). That will give us experience with four live runs, and my plan is to bring it to WP:VPI or WP:VPR to publicize it a bit, and solicit input to bootstrap it to the next level. Before that happens, I'll start a section here to discuss how to frame that. Mathglot (talk) 07:42, 10 March 2025 (UTC)
- Just to clarify, those edits weren't manual reverts. As I mentioned earlier, when I restarted the kernel, I forgot to remove the entries from the set before running the script again, which is why it edited those pages twice. For dummy edits, the script adds a space in the first line of the article, and if there's already a space, it removes it. The same thing happened on these pages: in the first edit, it added a space, and in the duplicate edits, it removed the space. I'll update the script to check and ignore pages that are already attributed. – DreamRimmer (talk) 02:36, 11 March 2025 (UTC)
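The space-toggling dummy edit described above can be illustrated with a minimal sketch. It assumes the space goes at the end of the article's first line; the comment does not say exactly where the script puts it, and the function name `toggle_dummy_space` is hypothetical.

```python
def toggle_dummy_space(text: str) -> str:
    """Return page text with a trailing space toggled on its first line.

    Assumption: the space is added at the END of the first line. If the line
    already ends with a space, that space is removed instead.
    """
    first, sep, rest = text.partition("\n")
    if first.endswith(" "):
        first = first[:-1]   # a space is already there: remove it
    else:
        first = first + " "  # no space yet: add one
    return first + sep + rest
```

Applying the function twice returns the original text, which matches the add-then-remove behaviour seen on the pages that were edited twice.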
- Ah, got it. The proposed script update sounds like a great robustness enhancement. One question: how will your update react to situations where the user forgot that he already attributed a page, but included it in the input file for the script, in effect inviting the script to add a duplicate attribution? The best outcome, if possible, would be to recognize the old attribution (even if many edits ago) and skip that one. I wouldn't spend a ton of time on that, as we already noted that a duplicate attrib isn't a huge concern, but it would be a nice-to-have.
- In either case, it would be nice to add another bin to the totals you are accumulating while running, so that at the end in your totals line, you could put, "12 entries skipped to avoid duplicate attribution" with the log showing something similar inline (instead of an attribution line) at the right position in the log file, for a line that would've resulted in the duplicate attribution. Mathglot (talk) 08:10, 11 March 2025 (UTC)
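One way the skip-and-count idea might look is sketched below. It works on in-memory data only: `history` stands in for the edit summaries already present on each article, and every name here (`already_attributed`, `process`, the log wording) is a hypothetical illustration rather than the actual script.

```python
from collections import Counter


def already_attributed(summaries, source_title):
    """True if any earlier edit summary already links to the source page."""
    needle = f"[[{source_title}]]"
    return any(needle in summary for summary in summaries)


def process(entries, history):
    """Tally attributed and skipped entries.

    entries: list of (article_title, source_title) pairs from the input file.
    history: dict mapping article_title -> list of earlier edit summaries.
    Returns (totals, log_lines); no page is edited in this sketch.
    """
    totals = Counter()
    log = []
    for article, source in entries:
        if already_attributed(history.get(article, []), source):
            totals["skipped"] += 1
            log.append(f"SKIPPED (duplicate attribution): [[{article}]]")
            continue
        totals["attributed"] += 1
        log.append(f"ATTRIBUTED: [[{article}]] from [[{source}]]")
    log.append(
        f"{totals['skipped']} entries skipped to avoid duplicate attribution"
    )
    return totals, log
```

Because the check looks at the whole list of summaries, an attribution made many edits ago would be caught as well, provided the earlier summary links the source title in the same form.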
Other namespaces
DreamRimmer, I'm guessing this is a non-issue for you, but I thought I better mention that as the attribution requirement is not limited to mainspace, it is possible that there may be lists pointing to pages in Draft space, User space (likely a subpage or sandbox), and conceivably other spaces. So the procedure should be prepared to deal with that. The only difference would be a prefix in the article title (i.e., the first token on the data line). In fact, there is a new data set in the works where they are all in Draft space: (current rev: 1279667092). Just FYI. Mathglot (talk) 08:27, 11 March 2025 (UTC)
- @Mathglot: The current code can run in any namespace on English Wikipedia, but if the source page on another wiki is in a different namespace, I'll need to update the code. So far, I haven't noticed any such translations, but if there are any, I'll address them. – DreamRimmer (talk) 08:57, 11 March 2025 (UTC)
- What I figured; we're not there yet, so probably not an issue, at least for the foreseeable future. Mathglot (talk) 09:03, 11 March 2025 (UTC)
- If you are interested, I can explain the current state of the code to make it easier for you to update the proposal. – DreamRimmer (talk) 09:00, 11 March 2025 (UTC)
- That would be great. Mathglot (talk) 09:03, 11 March 2025 (UTC)
- As of now, I manage a config file that includes settings such as ATTRIBUTION_PAGE (set page), LOG_PAGE (where we want the log), USERNAME (name of the user/translator), DEBUG_MODE (test run; only logs), and INTERACTIVE_MODE. Given the nature of this task, where the bot may sometimes be unable to make a dummy edit due to various situations, I set all the configurations before running the script. The script currently accepts the following formats:
- For copy:
* [[ArticleTitle]]; [[SourceTitle]]; Timestamp; Comment
- Result:
NOTE: The previous edit as of 22:31, October 14, 2015, copied content from the Wikipedia page at [[Exact name of the page copied from]]; see its history for attribution.
- For translation:
* [[ArticleTitle]]; [[:LangCode:SourceTitle]]; Timestamp; Comment
- Result:
NOTE: Content in the edit of 01:25, January 25, 2023, was translated from the existing French Wikipedia article at [[:fr:Exact name of the French article]]; see its history for attribution.
- It currently doesn't require separate sections for these entries, so if translate and copy entries are mixed, it will work properly without any issues.
- If debug mode is true, it will only publish logs; if debug mode is off, it will perform actual editing. Now, if you can explain point by point what other functionality you want in it or have questions, I will answer those. If any changes or functions need to be added to the code, I will do so. – DreamRimmer (talk) 09:38, 11 March 2025 (UTC)
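As an illustration of the two formats above, here is a minimal sketch that parses a data line and builds the corresponding edit summary. The regular expressions, the `LANG_NAMES` table, and the function name `build_summary` are assumptions made for this example; the timestamp is passed through unchanged, and the optional comment is parsed but not used, since its handling is not described here.

```python
import re

# Hypothetical lookup table; a real run would need a fuller list of languages.
LANG_NAMES = {"fr": "French", "de": "German", "es": "Spanish"}

# * [[ArticleTitle]]; [[SourceTitle]]; Timestamp; Comment   (comment optional)
LINE_RE = re.compile(
    r"^\*\s*\[\[(?P<article>[^\]]+)\]\]\s*;\s*"
    r"\[\[(?P<source>[^\]]+)\]\]\s*;\s*"
    r"(?P<timestamp>[^;]+?)\s*(?:;\s*(?P<comment>.*))?$"
)
# A translation source looks like :LangCode:SourceTitle
LANG_RE = re.compile(r"^:(?P<lang>[a-z][a-z-]*):(?P<title>.+)$")


def build_summary(line: str) -> str:
    """Return the edit summary for one data line (copy or translation)."""
    m = LINE_RE.match(line.strip())
    if m is None:
        raise ValueError(f"not a data line: {line!r}")
    source = m["source"].strip()
    timestamp = m["timestamp"]
    lang = LANG_RE.match(source)
    if lang is None:
        # No language prefix: treat the entry as a copy within this wiki.
        return (
            f"NOTE: The previous edit as of {timestamp}, copied content from "
            f"the Wikipedia page at [[{source}]]; see its history for attribution."
        )
    name = LANG_NAMES.get(lang["lang"], lang["lang"])
    return (
        f"NOTE: Content in the edit of {timestamp}, was translated from the "
        f"existing {name} Wikipedia article at [[{source}]]; "
        f"see its history for attribution."
    )
```

Since the choice between the two messages depends only on the source link, mixed copy and translation entries can be processed in a single pass, as described above.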
- Makes sense. So, my understanding about mixed translate and copy entries is that you are using the wikipedia code (LANG_CODE) as a control field, and logically speaking, are doing something like this:
if empty (LangCode) then AttribMode = 'copy' else AttribMode = 'translate' end
- (perhaps dispensing with a mode variable, and just calling the proper message-gen function in each case), is that correct? Whatever the case, it sounds like I should remove the copy/translate token in the § Section header in the spec; do you agree? (Corollary: the procedure should probably barf if it finds en in the LangCode; and now your earlier comment about everything being automatic as long as we are just talking about English Wikipedia operation makes sense.)
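For concreteness, the control-field logic and the suggested rejection of `en` might be sketched as follows. The function name `attribution_mode` is hypothetical, and whether the real script keeps a mode variable at all is an open question in the comment above.

```python
from typing import Optional


def attribution_mode(lang_code: Optional[str]) -> str:
    """Pick the attribution mode from the language code of the source link.

    An empty or missing code means a copy within English Wikipedia; any other
    code means a translation. A code of 'en' is rejected as a likely input error.
    """
    if not lang_code:
        return "copy"
    if lang_code.strip().lower() == "en":
        raise ValueError("LangCode 'en' is not valid for a translation entry")
    return "translate"
```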
- What is INTERACTIVE_MODE? Is that something that, instead of running through the input file in batch, pauses after each entry and waits for user input before continuing? I imagine it might echo the log line and the proposed edit summary line it plans to write but has not yet written, then wait for your okay (carriage return?) before editing the article and adding the line, and then pause at the next entry. Because if that is what it is, then that is something that I would've named DEBUG_MODE, and I wonder if you picked INTERACTIVE_MODE because 'debug mode' was already in use?
- If so, I think we should swap the names: pausing and asking for user input is definitely a 'debug mode' to me, and I think it should get that name. It's a minor change, and will probably save endless headaches going forward. What we are now calling debug mode, should in that case be renamed to something else: how about, ECHO_ONLY or LOG_ONLY or something like that? Oh, I got it, we've already been using it informally: let's call it DRY_RUN. Under this scheme, the user could opt for log_only (perhaps in the section header) that would translate into LOG_ONLY in your config, but a user could never choose the pause/interactive/debug mode (whatever we call it), only an operator could do that.
- If my assumptions are correct on both counts, then I propose changing the § Section header to this:
== user=Jimbo; log_only=yes; run_comment=append this to every edit summary ==
- where the user is required, the other two are optional. Are you good with this, or what works best for you? I think bit by bit, we are resolving items that have been a bit hand-wavy, and formalizing them sufficiently to avoid confusion which gets us closer to a production-level procedure, so this is all very encouraging. Mathglot (talk) 00:20, 12 March 2025 (UTC)
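A minimal sketch of how the proposed header could be parsed, with `user` required and the other two keys optional. The function name `parse_section_header` and the chosen defaults (`log_only` off, empty `run_comment`) are assumptions for illustration. The sketch splits on semicolons, so a `run_comment` containing a semicolon would need different handling.

```python
import re

HEADER_RE = re.compile(r"^==\s*(?P<body>.*?)\s*==\s*$")


def parse_section_header(line: str) -> dict:
    """Parse '== user=...; log_only=...; run_comment=... ==' into a dict."""
    m = HEADER_RE.match(line.strip())
    if m is None:
        raise ValueError(f"not a section header: {line!r}")
    params = {}
    for part in m["body"].split(";"):
        if not part.strip():
            continue  # tolerate a stray trailing semicolon
        key, sep, value = part.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {part.strip()!r}")
        params[key.strip()] = value.strip()
    if not params.get("user"):
        raise ValueError("the 'user' key is required")
    return {
        "user": params["user"],
        "log_only": params.get("log_only", "no").lower() == "yes",
        "run_comment": params.get("run_comment", ""),
    }
```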
Driven by category
DreamRimmer, as this matures, somewhere in the productionization of it we should transition from just tapping you on the shoulder when a set is ready to go, to something a bit more formal, and categorization certainly seems like the way to go. I don't think this is urgent, but I have started to think about it. The WP:AFC has a small set of categories to handle articles that are submitted for review, not submitted, declined, rejected, and maybe one or two others. I thought we could have a small set of state-based categories like that, which would drive the process, and wanted to get feedback.
- Category:Attribution requests pending – the main hopper driving the bot/procedure/exhausted admin
- Category:Attribution requests not submitted – a waiting area, where a user can fiddle with their list or get feedback, help, etc.
- Category:Attribution requests declined – a previously pending request that has been reviewed by an operator, and is not ready for an actual run; gross syntax problems, permission errors; etc.
- Category:Attribution requests completed – a previously pending request that has been run; the category contains the log file. (Should this just keep getting bigger, or age out, or have year-month subcategories?)
- Category:All Attribution requests – the set union of all attribution request categories.
I wasn't sure where the request for a debug-only vs. a live run should happen. That could be done via a couple of sister categories mirroring the above, or by adding a token like |mode=debug or |mode=run on the § Section header line (or maybe both, for extra fail-safe).
I think categorization would streamline processing, reduce errors and stuff falling between the cracks, and even encourage specialization, in that you wouldn't necessarily have to follow a single set through the whole process; an interested admin or process gnome could pick one up and help it to the next step without having to be chained to the same set all the way through.
Note that I have mocked up one category, namely Category:Attribution requests completed, which has one log file in it, namely, the one you just completed, so that we can have something concrete to look at to see how it goes. (edit conflict) Mathglot (talk) 09:01, 11 March 2025 (UTC)
User config, and log file output mode
In response to your invitation for other functionality above, an optional log_file output mode param would be very helpful, and should be a very easy enhancement. Please add an option which allows the user to specify that they want the returned log file in "HTML mode", or just the raw mode as it is now. I put that in quotes, because we need a better term for it, but what I mean is, they should be able to opt in to a clickable version of the log, so the linked source file, revision or other wikilinked items provided by the user in their input file, would appear linked in the log, if the user opts for that option. I believe it is implementable simply by the presence/absence of the <pre>...</pre> tags and nothing else.
I'm not sure what to call it, maybe raw_log? I think probably the clickable version should be the default, because then the lines will most closely resemble what a user sees, when they go to the History tab for an article that was attributed, and also because they are far more user-friendly for the purpose of verifying that the info is correct; otherwise the user is faced with doing dozens or hundreds of copy-pastes to verify the output. If clickable is the default, then the opt-in to show a monospaced, unclickable version (that is, the one with the <pre> tags) could be called |raw_log=yes or some such, defaulting to no.
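The log-mode option could be sketched roughly as below, assuming that the raw form is the one wrapped in <pre> tags and that the clickable form is the same text without them. The function name `render_log` and the one-entry-per-line layout are assumptions.

```python
def render_log(lines, raw_log=False):
    """Return log wikitext: raw (inside <pre>) or clickable (no wrapper).

    Assumption: the only difference between the two modes is the presence
    or absence of the <pre>...</pre> wrapper.
    """
    body = "\n".join(lines)
    if raw_log:
        return "<pre>\n" + body + "\n</pre>"
    return body
```

One caveat: outside <pre>, consecutive lines of wikitext run together into a single paragraph, so the clickable form may also need a list marker or similar at the start of each line.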
A corollary of this, already starting to be visible before this point, is that currently there seem to be two types of configuration:
- the internal config that you use in the procedure, that users neither have access to, nor do they need to be informed about;
- the run-time options that a user may, or must, set.
Currently, they are all defined as items in the § Section header, but with three items there already, four if we add raw_log, it's probably reaching its capacity (and a section header was probably never an ideal location for that anyway). Maybe for userid, it still makes sense to have it in the section header, but less so for the others.
I wonder if we need to think about another way for users to be able to define run-time params like |log_only= (echo only), |run-time comment=, and so on, some other way than in the header; i.e., a user-config of some sort. As the project matures and we expose it at Village Pump, it's only more likely that other functions will be requested, and it seems entirely foreseeable that they will quickly overrun the section header. So I'm thinking maybe something more like a MiszaBot config in a second input file with a canonical name you can derive, so for User:Jimbo/Set_1, you should look for User:Jimbo/Set_1/config. Maybe the config could even be optional, if all of the params have default values. Thoughts? Mathglot (talk) 01:19, 12 March 2025 (UTC)
- Okay, I have a way that I think makes sense: the user creates a config file as a subpage of their input file, and the config resides there. It contains a set of |key=value pairs, which the bot can look for and load. I've simplified this for the user by creating template {{Attribution request}} to generate it. The template will also categorize the config file page appropriately. (Alternatively, the config could go on the same page as the article list; can't decide which is better.) The template is working as designed afaict but it is not exhaustively tested yet; please try it out from User space, or use ExpandTemplates with the Context title set to a User space subpage title; for example, try Context title=User:Jimbo/Attribution set/config and set the input field to {{Attribution request}}. Mathglot (talk) 06:40, 12 March 2025 (UTC)
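A rough sketch of the config-subpage idea follows: derive the config title from the set title, then read |key=value pairs from the page text, falling back to defaults when the page is missing. The parameter names are borrowed from this discussion, and the default values, the regular expression, and the function names are all hypothetical; the actual parameters of {{Attribution request}} are not assumed here.

```python
import re

# Hypothetical defaults, so that the config page can be optional.
DEFAULTS = {"log_only": "no", "raw_log": "no", "run_comment": ""}

# Matches |key=value up to the next pipe, brace, or end of line.
PAIR_RE = re.compile(r"\|\s*([^=|{}\n]+?)\s*=\s*([^|{}\n]*)")


def config_title(set_title: str) -> str:
    """User:Jimbo/Set_1 -> User:Jimbo/Set_1/config"""
    return set_title.rstrip("/") + "/config"


def load_config(wikitext=None) -> dict:
    """Return run-time options, using DEFAULTS for anything not set.

    wikitext is the text of the config page, or None if the page is absent.
    """
    config = dict(DEFAULTS)
    if wikitext:
        for key, value in PAIR_RE.findall(wikitext):
            config[key] = value.strip()
    return config
```

For example, `load_config(None)` returns the defaults unchanged, which is what would make the config page optional.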
Template:Attribution request
Why is Template:Attribution request automatically categorized incorrectly into Category:Wikipedia template-protected pages? –LaundryPizza03 (dc̄) 08:39, 7 July 2025 (UTC)