Wikipedia:Bots/Requests for approval/KasparBot 3


KasparBot 3

Operator: T.seppelt (talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search)

Time filed: 18:11, Wednesday, November 4, 2015 (UTC)

Automatic, Supervised, or Manual: automatic

Programming language(s): Java, own framework

Source code available: not yet

Function overview: removing {{Persondata}} from all articles and copying the information to a database that will be accessible on Tool Labs

Links to relevant discussions (where appropriate): first RfC, second RfC, Bot request

Edit period(s): one time run

Estimated number of pages affected: 1.2 million

Exclusion compliant (Yes/No): no

Already has a bot flag (Yes/No): Yes

Function details:

  • request all pages with {{Persondata}}
  • fetch the parameters
  • copy them into a database (Wikidata won't be affected)
  • remove {{Persondata|...}}
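Since the bot's source code is not yet available, the removal step can only be sketched. One non-obvious detail: deleting a {{Persondata|...}} transclusion needs brace matching rather than a simple regular expression, because parameter values can contain nested templates. The class and method names below are hypothetical, not the actual implementation:

```java
// Hypothetical sketch of the removal step; KasparBot's real source is not yet published.
public class PersondataRemover {

    // Removes the first {{Persondata|...}} transclusion, matching nested {{...}} pairs
    // so that templates inside parameter values (e.g. {{circa}}) do not truncate the cut.
    static String removePersondata(String wikitext) {
        int start = wikitext.indexOf("{{Persondata");
        if (start < 0) return wikitext; // nothing to do
        int depth = 0;
        int i = start;
        while (i < wikitext.length() - 1) {
            if (wikitext.startsWith("{{", i)) {
                depth++;
                i += 2;
            } else if (wikitext.startsWith("}}", i)) {
                depth--;
                i += 2;
                if (depth == 0) {
                    // Also swallow one trailing newline so no blank line is left behind.
                    if (i < wikitext.length() && wikitext.charAt(i) == '\n') i++;
                    return wikitext.substring(0, start) + wikitext.substring(i);
                }
            } else {
                i++;
            }
        }
        return wikitext; // unbalanced braces: leave the page untouched
    }
}
```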

-- T.seppelt (talk) 18:11, 4 November 2015 (UTC)

Discussion

@Pigsonthewing, Magioladitis, Izno, GoingBatty, Hawkeye7, and Dirtlawyer1: -- T.seppelt (talk) 18:11, 4 November 2015 (UTC)

  • @T.seppelt: What will be the purpose of the "database" to which Persondata information will be copied as the Persondata templates are being deleted from all Wikipedia articles? Without a plan to review it, parse it, and transfer usable information to Wikidata, I'm not sure that creating a massive database with approximately 1.2 million Persondata profiles serves much of a function. Simply transferring potentially usable information to a database where most English language editors do not have practical access to it, hoping that someone with proper skills, time and motivation will actually do something with it in the future is, I'm afraid . . . well, wishful thinking. Dirtlawyer1 (talk) 18:30, 4 November 2015 (UTC)
    The purpose of this database is to allow users to add the information to Wikidata. The information will be parsed and a simple interface will be provided which allows users to decide one-by-one if a certain piece of information is suitable for Wikidata. Imagine it in this way: You will have 3 buttons (Next/Skip, Yes, No) and the rest of the work is done by software. It will be easier than comparing articles and items manually and adding complex statements. -- T.seppelt (talk) 18:39, 4 November 2015 (UTC)
    T.seppelt: I endorse your approach whole-heartedly. I look forward to working with your newly created database and toolset. Thank you for investing your skills, time and effort to do this. Dirtlawyer1 (talk) 18:49, 4 November 2015 (UTC)
    (edit conflict) Yes, thank you for your time. I am not sure that I would work with the proposed tool, as follows. -P64
    Does Wikidata really want the two-part reference added as part of every statement whose source is English Wikipedia, "imported from: English Wikipedia; retrieved: 4 November 2015" (not to mention the same with a UTC timestamp)? Then an improved interface really is a must. For me it might be sufficient to have a feature at Wikidata that amounts to "repeat the last reference". For English Wikipedia and the few "stated in" sources such as LCNAF that I commonly use there, I would be willing to provide the two-part reference once and then use a convenient "repeat the last reference".
    I doubt that I would dedicate time to the transfer of leftover Persondata to Wikidata; I would only continue to do some of that where something else takes me to the Wikipedia article. So I doubt (but may be wrong) that the extra efficiency of a dedicated PD-to-WD interface would matter to me. I have no clue how many editors might focus on leftover PD, where such extra efficiency should be very welcome. Maybe I would help with that myself, but I doubt it.
    I haven't read the Wikidata instructions concerning alternative names, which are perhaps what I most frequently and carefully added to our PD templates myself. I don't use any of the WD statements that pertain to such content, only the "Also known as" header field. I don't know how much good alternative-name data there is in our PD templates, nor whether the proposed interface would actually be efficient for adding WD statements such as birth name, pseudonym, etc. --P64 (talk) 19:40, 4 November 2015 (UTC)
    This task is, for the most part, probably going to be "sort out alternative names". I'm not aware of whether "alternative name" really fits into anything but the alias fields (what you refer to as "also known as"). --Izno (talk) 22:47, 4 November 2015 (UTC)
  • Even though the ping didn't reach me, I, of course, advocated this idea (following Alakzi's comment in the second RFC).

    What I'd also like this bot (or another) to do would be to simply remove data that is already present on Wikidata, thus never pulling that into the database. We might enlist @GoingBatty: (because the ping didn't hit me, I assume it didn't hit him) to do a bot run first removing all the uninteresting data, and then start a bot doing this work, T.seppelt. --Izno (talk) 22:47, 4 November 2015 (UTC)

    @Izno: I'm happy to help clear a path for T.seppelt if that would be beneficial. The hard part has been getting consensus on the definition of "uninteresting data". GoingBatty (talk) 03:31, 5 November 2015 (UTC)
    I think the start that you and Dirtlawyer got on in one or the other RFCs was probably the right direction, taking into account the seeming issue of the calendars (which I'm still fairly certain isn't resolved, though I haven't been watching those tasks). @Jc3s5h: since he cares. --Izno (talk) 04:03, 5 November 2015 (UTC)
  • Interestingly, I wasn't pinged or otherwise made aware of that second RFC. Nonetheless, there's nothing stopping the first portion of this from being enacted; you can use database dumps or the API to scrape the information into whatever form you desire. --slakrtalk / 23:57, 4 November 2015 (UTC)
  • Agreed—it would be good to get a sense of how the data will be presented, and whether the people who are interested in working with it find the interface useful. — Earwig talk 00:07, 5 November 2015 (UTC)
  • Support. User-friendly interface certainly makes any plan to export Persondata to an external database more appealing. (And more appealing than having to rely on past revisions). Of course, I would like to see a demonstration of the interface functionality and migration accuracy etc before any mass data migration. If it can be successfully implemented then I feel this is a good compromise. Question: Previous discussions have highlighted the confusion between Gregorian and Julian dates (I still don't understand, and most people are probably completely unaware). How would this be dealt with?
    (On a side note, I am also a little disturbed that the "second" RfC was not flagged on Persondata talk pages, and that I'm only hearing about it now. But so long as this plan goes ahead, there's probably no point contesting it.) —Msmarmalade (talk) 03:08, 5 November 2015 (UTC)
    I asked in the RFC whether T.seppelt might extend his authority control tool, which I got an affirmative for then. Basically, the tool in question helps us resolve authority control issues and mismatches; this application seemed similar enough to me given that we were fairly certain we would need to deal with human checking. --Izno (talk) 04:03, 5 November 2015 (UTC)
    I use the standard Wikidata date parsing service, which is accessible through the Wikidata API (wbparsevalue), in order to get the same results as a user would get. -- T.seppelt (talk) 11:50, 9 November 2015 (UTC)
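For reference, a minimal sketch of how such a request could be built. Only the wbparsevalue module and its values/datatype parameters come from the Wikibase API; the class and method names are hypothetical, not taken from the bot's unpublished code:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; only the wbparsevalue module and its parameters come from
// the Wikibase API documentation, the rest is illustrative.
public class DateParseRequest {

    // Builds a GET URL for the Wikidata wbparsevalue module, which parses a
    // free-text date the same way a user's input is parsed in the Wikidata UI.
    static String buildUrl(String rawDate) {
        return "https://www.wikidata.org/w/api.php?action=wbparsevalue"
                + "&datatype=time&format=json"
                + "&values=" + URLEncoder.encode(rawDate, StandardCharsets.UTF_8);
    }
}
```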
  • Support - As Hawkeye and others have pointed out, I strongly suggest that we get behind this plan and turn the bot loose. Every day, Persondata information is being manually deleted by users who have no clue regarding the present status of this discussion, the specifics of any of the recent RfCs, or the efforts to parse and transfer remaining usable information from Persondata to Wikidata. If we're going to do this, we need to do it before any more potentially usable Persondata information is lost to manual deletion without review or transfer to Wikidata. If this is the plan, let's do it. Dirtlawyer1 (talk) 02:22, 7 November 2015 (UTC)
    I started to fetch the data. The program is still running, but the results are available live at [1]. Nothing will be lost from now on. -- T.seppelt (talk) 11:50, 9 November 2015 (UTC)
    Looks good so far. — Earwig talk 07:49, 11 November 2015 (UTC)
  • Support - To be absolutely honest, when Persondata was deprecated a bot should've been run much, much sooner .... I and others have removed a lot of Persondata from articles assuming it was already at Wikidata. One wonders what the actual point of Wikidata is, but that's for another day. The bot's needed ASAP, so I whole-heartedly support. –Davey2010Talk 13:52, 12 November 2015 (UTC)
  • Question: At what rate will the bot be editing? Kharkiv07 (T) 17:06, 12 November 2015 (UTC)
    As preferred: I can have it do up to thirty edits per minute, or I can limit the edit rate. Maybe 10 edits per minute would be a good amount? Regards, -- T.seppelt (talk) 19:43, 12 November 2015 (UTC)
    Some numbers here. Given 1.2 million transclusions, 10 edits/minute would take about 84 days to clear all persondata, while 30 edits/minute would take 28 days. Policy recommends not going faster than six per minute. In practice we often let go bots faster than that, but within reason; 30 seems a bit fast for me. I suggest 10 per minute; i.e., a six-second sleep between edits. Make sure your library respects maxlag. — Earwig talk 03:36, 13 November 2015 (UTC)
    I am implementing things as you proposed. Regards, --T.seppelt (talk) 07:34, 13 November 2015 (UTC)
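The throttling arithmetic above can be checked with a small sketch (a hypothetical helper, not part of the bot):

```java
public class RateEstimate {

    // Whole days needed to clear `pages` pages at `editsPerMinute`, rounded up.
    static long daysToClear(long pages, int editsPerMinute) {
        double minutes = (double) pages / editsPerMinute;
        return (long) Math.ceil(minutes / (60 * 24));
    }
}
```

With 1.2 million transclusions this gives 84 days at 10 edits/minute and 28 days at 30 edits/minute, matching the figures quoted in the discussion.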
  • Support - at this point the template is just causing confusion and not serving any useful purpose. I think it's time to clean it up. Kaldari (talk) 06:55, 14 November 2015 (UTC)
  • Oppose There is no need to combine two unrelated proposals and create controversy where none existed before. The two proposals here are -
  1. Copy this data into a database elsewhere (totally uncontroversial - let the bot do this)
  2. Delete the data from here before the other database is established (why even do this?)
This proposal is premised on a database, to be created in the future, replacing the need for the current system. Why delete the current system before there is broad community approval of this other database, which does not even exist yet? Feel free to make the other database. Once that database is appreciated, then make a second proposal to use it to replace the current system.
I fail to recognize any reason why these steps ought to be combined into a single proposal. What am I missing? Blue Rasberry (talk) 18:02, 15 November 2015 (UTC)
Oppose and agree with Bluerasberry's reasoning. Why would we make someone go to two places to look for that information? — Sctechlaw (talk) 01:00, 17 November 2015 (UTC)
@Sctechlaw: What do you mean by "two places"? — Earwig talk 01:30, 17 November 2015 (UTC)
@The Earwig: The two places being discussed are English Wikipedia and a project on Tool Labs. Blue Rasberry (talk) 14:38, 17 November 2015 (UTC)
The final home of these data is Wikidata, regardless. Intermediary but pursuant to the two RFCs (one to remove Persondata and the second to remove it by bot) would be a "holding pen" of sorts (hosted on Labs) for people to more easily assign the data to Wikidata items (through a person-centric UI). --Izno (talk) 14:43, 17 November 2015 (UTC)
Right, that was my understanding. I asked because I'm not clear how we are making people go to two places to look for information. As far as this task is concerned, if we keep {{persondata}} around for too long after the Labs database has been created, we're just encouraging them to go out of sync as people deal with both. I think it makes sense to start running the bot as soon as we are satisfied with the way the tool is structured. — Earwig talk 01:21, 18 November 2015 (UTC)
I would like to emphasize that the only alternative to T.seppelt's creation of a Persondata database and interface for review and transfer to Wikidata is the outright deletion of all existing Persondata with no further transfer of usable Persondata information to Wikidata. And I would also like to take note that T.seppelt's analysis below demonstrates that the statements made during the two previous RfCs -- that there remained no usable Persondata information that could be practically transferred to Wikidata -- were uninformed at best and outright misrepresentations at worst. Again, I commend T.seppelt for undertaking this project, and I urge editors who are opposed to this proposal to familiarize themselves with the two previous RfCs related to the removal of Persondata (May 2015 and September-October 2015) and prior bot request (June 2015). This is now the only game in town to preserve and transfer usable Persondata. Dirtlawyer1 (talk) 01:40, 18 November 2015 (UTC)

Update I am considering different ways of making the data accessible for user-assisted import after the removal. In order to assess the options, I did an analysis of the persondata which is currently accessible on enwiki. These are the results:

Persondata field     Wikidata                    New ready   New unparsable   Conflict   Conflict unparsable
DATE OF BIRTH        P569                            51269             4695      88790                  5093
PLACE OF BIRTH       P19                            310575            32086      44907                 27230
DATE OF DEATH        P570                            26379             2335      67835                  2724
PLACE OF DEATH       P20                             90996            10654      14996                 10737
ALTERNATIVE NAMES    alias                          101976              n/a        n/a                   n/a
SHORT DESCRIPTION    description                     21417           135961        n/a                   n/a
NAME                 label (in future: alias)           54           244569        n/a                   n/a

As you can see, so far we have 479,219 statements which could be directly imported. The best option for this data, in my view, is to give it to the Primary Sources Tool. For the conflicting statements, the unparsable data, the aliases, the descriptions and the labels, I will provide a software solution. But please consider that we should start this removal process as soon as possible due to the long time it will take. There will be 1,295,269 user actions necessary to complete the import. The 479,219 statements will be accessible in the next few days. Even checking these statements will take weeks; in the meantime the next parts of the dataset will be available through the proposed tool. Warm regards, -- T.seppelt (talk) 20:19, 15 November 2015 (UTC)
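As a quick sanity check, the 479,219 figure is the sum of the four "New ready" counts for the date and place properties in the table (a hypothetical snippet, just verifying the arithmetic):

```java
public class ImportCount {

    // Sums the "New ready" counts for P569, P19, P570 and P20 from the analysis table.
    static long directlyImportable() {
        long[] newReady = {51269, 310575, 26379, 90996};
        long total = 0;
        for (long n : newReady) total += n;
        return total;
    }
}
```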

Great news You can use the Primary sources tool now to add place of birth and place of death statements to Wikidata. Tpt just uploaded the dataset ([2]). I am working on the tool for descriptions now. It will be available in the next few hours or tomorrow. Warm regards, -- T.seppelt (talk) 19:41, 20 November 2015 (UTC)

Tool is launched in beta I worked on the proposed tool, and it is ready for public testing now. You can find it here. Please check your contributions to Wikidata in order to find bugs. Let me know if something goes wrong, and please come up with ideas for improvement. Warm regards, -- T.seppelt (talk) 15:25, 21 November 2015 (UTC)