Wikipedia talk:Link rot

Archive.is

I think we should go slow on advocating http://archive.is. The field is littered with defunct archive sites - just look at this article's history. Archive.is looks good, very good in fact, and its performance and coverage of essentially all used sources is very encouraging. But IMHO Wikipedia can't afford to depend on a brand-new site which so far discloses no public information about its funding, affiliation, or future. I have communicated with the owner, and I am confident the owner is acting in good faith, but it's a solo effort. I'd like to see if the site is here in a year. In the meantime, I would like to advocate using WebCite in parallel with Archive.is, meaning at least archiving at WebCitation, even if not citing it in the ref. I hope this is received as a sensible precaution, in the best interest of Wikipedia's future source verifiability. --Lexein (talk) 02:10, 17 September 2012 (UTC)

I agree that we need to be circumspect. Just before seeing your comment above, I asked at http://blog.archive.is/ask :
"Who runs this site? If we're going to trust it (see Wikipedia:Link_rot#Repairing_a_dead_link) we need to have good reason to think it's stable/funded/likely to stick around indefinitely. The webcite faq is CC-NC-SA, so consider using it as a starting point for your own faq."
If we don't hear back soon, we should remove it. If Archive.is triggered a WebCite archive, in addition to its own, then I'd support its continued mention here starting now. Also, the IA now supports on-demand archiving. It just doesn't appear online for months. --Elvey (talk) 17:25, 5 October 2012 (UTC)
IA certainly supports on-demand archiving. Traditionally, archived pages took many (three to six?) months to appear. In mid-2012, many pages seemed to be returned within about three to five weeks. By early 2013, this seems to have further reduced to about three to seven days, especially when archiving pages from several well-known sites. In more recent times, some archived pages have been returned in around 200 minutes by IA but this very much depends on the site being archived. -- 31.52.117.100 (talk) 20:27, 29 July 2013 (UTC)
+1 for removing archive.is from the instructions, or at least not promoting it so strongly over sites like archive.org and other institutions that are part of the International Internet Preservation Consortium --Edsu (talk) 16:52, 16 November 2012 (UTC)
+2 for removing http://archive.is from the instructions, until such time as its reliability and persistence are better demonstrated. Beyond the web archives already mentioned, the List of Web archiving initiatives and Memento Project pages may be other useful resources to point to in the instructions. --nullhandle (talk) 21:46, 16 November 2012 (UTC)
Sorry, I found your message neither in my inbox nor in the Tumblr control panel :( Luckily, I found this conversation by searching for archive.is on Twitter. As I found the questions here, I answer here as well.
About FAQ and more info on the page: a new design is being prepared. It will have more information (both textual and infographic) about how to use the site, how to search for saved pages, etc.
About funding: it was started as a side project, because I had a computational cluster with huge hard drives whose disk space was not being used. It was a kind of experiment, to see whether people would need a service like this, and to choose ways to develop the service based on how people use it.
About stability: currently it is hosted on budget hosting providers (ovh.net and hetzner.de) using the Hadoop File System. Although the hardware is cheap, all data is replicated 3 times across 2 datacenters (in Germany and France), and the system is designed to survive hardware faults with minimal downtime.
Almost all external links of Wikipedia (all Wikipedias, not only English) were archived in May 2012, pursuing two goals: to preserve pages which may disappear, and to stress-test and find bugs in my software. If you see that a link has rotted, you can check it on archive.is and change the link to the saved page. If you do not trust archive.is but it is the only site which has preserved your content, you can save archive.is's page on WebCite or another site, thus making more copies and increasing redundancy.
Vice versa, you can save WebCite's or IA's pages on archive.is to increase redundancy. (IA is not likely to go offline, but a new domain owner may put "Disallow: /" in robots.txt and thus remove the previous domain owner's content from IA, so it may make sense.)--Rotlink (talk) 04:25, 18 November 2012 (UTC)
Also, there are some popular sites IA and WebCite cannot work with. Facebook.com is a big example. --Rotlink (talk) 04:58, 21 November 2012 (UTC)
I've rewritten the archive.is mention as "under evaluation", and emphasized that it should not be used alone until consensus agrees it is reliable. I did not delete it because we have quite a history of suggesting trying out services without advocating them. Back when IA was broken in 2008-2010, I was desperate, and used anything that seemed like it would work. Many of those services later vanished. But WebCitation, as sketchy and unfunded as it first seemed, has survived, Javascript malscripts be damned. So can we AGF for archive.is as "under evaluation"? --Lexein (talk) 23:09, 16 November 2012 (UTC)

It very much looks like Archive.is keeps only the newest snapshots when it archives external links automatically. It archives the external links once in a while, discarding the old archived versions. In the end, it is archiving dead links. And that is very bad. I detailed the process at Talk:Archive.is#How does automatic archiving work?. The owner of Archive.is probably doesn't realize that the program deletes old versions. —  Ark25  (talk) 00:35, 27 July 2013 (UTC)

I've written to archive.is both on the Ask Me Anything form and by email, to ask about this behavior. I have not yet checked the old archive.is links I've used to see if this is a global problem. --Lexein (talk) 02:29, 18 August 2013 (UTC)
I am sorry, my bad. I didn't know that Archive.is makes incremental backups and that it started to create backups on all Wikipedias in May–June 2013 - see Talk:Archive.is#How does automatic archiving work? and User talk:Rotlink#Questions about Archive.is. Sorry for the false alarm! —  Ark25  (talk) 22:52, 21 August 2013 (UTC)
I think Archive.is is very nice for making automatic backups of all links in Wikipedia. It really deserves to be integrated as a WikiMedia project, or at least to be paid for by Wikipedia. It's very important to preserve the archives of the newspapers. —  Ark25  (talk) 23:00, 21 August 2013 (UTC)
The proprietor of Archive.is, User:Rotlink, assures us here and on its FAQ page that it is financially secure. However, Webcitation.org has stated on its home page that it will be in financial trouble later this year. This has become a topic of discussion at meta:WebCite. --Lexein (talk) 18:59, 25 August 2013 (UTC)

Per the WP:Archive.is RFC, all archive.is links are to be removed from the English Wikipedia. This note has been added so that anyone finding this discussion will be aware of both the existence of that RfC and its results. — Makyen (talk) 05:11, 20 March 2014 (UTC)

We need an anti-link-rot bot!!!

What the hell is really going on here, big picture? I just read Wikipedia:Bots/Requests_for_approval/RotlinkBot and it's a long discussion between botmaster HELLKNOWZ and Rotlink. To me, at first, at least, it looked a bit like the botmaster of a competing bot that is INACTIVE hassled Rotlink and drove him out of town in Aug 2013. User:H3llBot has made no edits since November 5, 2013; I guess its trial run was not a success? I haven't been able to investigate deeply enough and wonder what the hell is really going on here. Why aren't any of these bots running? Seems like someone is anti-archive. I guess I could go read H3llBot, DASHBot and BlevintronBot and their talk pages, which HELLKNOWZ referred Rotlink to, but perhaps someone who has can summarize? Lexein? Ark25? Rotlink? (I know, you're blocked; we can copy any comments you make on your talk page here.) Hellknowz? Blevintron? I don't give a shit what happened AFTER the bot was withdrawn by its operator (though I have read the RFC page). I want to understand why that happened in the first place. It seems to me more than a bit odd that despite all the efforts that have been made, no bot is performing this task. Perhaps HELLKNOWZ is merely familiar with big, hidden roadblocks to getting/keeping such a bot running and was just trying to help Rotlink do what HellBot has been unable to do, at least lately. The RFC closer wrote, and I too "urge the folks at Archive.is to come forward, apologize for policy violations carried out thus far, and work within the system." AND, I'd like to have a bot taking corrective, or better yet preventative, action to address link rot, so I'd like to see us do what it takes to make that happen. The first step is getting an idea of what went wrong. Did Rotlink just run out of patience? Did some folks tell Rotlink to calm down so they could stop whipping him, as so often happens 'round here? I'm guessing there was lots more discussion beyond what's on the RFBA. The RFC closer said, "It seems likely this service could be valuable to the community". I think the enormous potential value of an anti-link-rot bot is certain. So, just to be clear, I'm not looking to assign blame, I'm looking to understand what keeps going wrong, so it can be addressed, as that seems to me to be the first step. Help? --{{U|Elvey}} (tc) 19:05, 5 June 2014 (UTC)

AGAIN: I want to understand why the bot was withdrawn by its operator in the first place. AGAIN: I don't give a shit what happened AFTER the bot was withdrawn by the operator. The RFC happened AFTER the bot was withdrawn by the operator. I don't recall seeing anything there that discussed why the bot was withdrawn by the operator in the first place. If I should take another look, what should I look at, Makyen?--{{U|Elvey}} (tc) 05:41, 10 June 2014 (UTC)
I've got some ideas for a bot and I've been thinking about proposing one for a while. I guess the place to start would be to raise an RFC and check that there is consensus for a bot to do this (or look for a past discussion on the subject)? I'm also somewhat daunted by the process and feel that I'm too inexperienced with the RFC / BRFA processes, and with Wikipedia in general - not to mention lacking the time to formulate the RFC...--Otus scops (talk) 22:05, 5 June 2014 (UTC)
I see a separate RFC as inappropriate. A BRFA is the process for getting a bot approved. If it's approved, the bot should be able to do the thing it was approved to do. Someone could add an RFC tag to the BRFA if they felt wider notice was needed. That's as simple as adding {{rfc|proj}} to the BRFA. That makes the BRFA an RFC too. Presto! --{{U|Elvey}} (tc) 05:38, 10 June 2014 (UTC)
The only information I have seen as to why the bot was withdrawn by the operator is the comment Rotlink left in response to that question on his talk page: "I need time to rewrite the description, it is too unclear."
Note that he withdrew the bot close to 24 hours after making posts to Wikipedia:Bots/Requests for approval/RotlinkBot which appeared to indicate that he was proceeding with development. His answer (linked above) to the "why?" question was provided 9–15 hours later (multiple edits to his response). — Makyen (talk) 07:20, 10 June 2014 (UTC)
Maybe we should go back further and see what happened before he filed his bot request.
  • RotlinkBot edited prior to approval, and prior to even requesting approval
  • The "bot" was editing in good faith, as its operator mistakenly believed that as a supervised, but unautomated script, it could be run without prior approval
  • But other editors were reporting problems with these edits, so evidently the supervision was lacking
  • Only after it was blocked was a bot request filed.
  • "You would be surprised at the amount of tiny problems that all need to be fixed (and why previous bots aren't even running)."
I do not in any way see Hellknowz hassling Rotlink. I see a civil conversation where legitimate concerns are being raised. Reading between the lines and speculating a bit, from the later email conversation that you said was off-topic, I think that Rotlink, who was just doing this as a hobby or something like that (I'm assuming good faith and that he was not setting up some nefarious moneymaking enterprise), misjudged the amount of work it would take to bring up a robust, acceptable bot (as apparently had several others before him) and decided that, as a volunteer, it was just not worth the time and trouble, so he abandoned the project. Bots are frequently withdrawn after their operators realize that they bit off more than they could chew. Unfortunately, the fact that Rotlink was operating an archive site himself clouds the situation, as that gives rise to suspicions that he could favor linking to his own archive site.
Again, I repeat my assertion that I don't think that a bot is the way to go, as I don't think that restoration of archived links should be automated, due to copyright, BLP, "right to be forgotten", etc. issues. A human needs to check for those issues. But as Rotlink has said, saving copies of linked references is relatively easy; heck, hard drive prices are down to where almost anyone could afford to build the server for archive storage. It's the legal issues that are a bigger concern than the technical ones. I think the Foundation should just automatically save copies of all externally URL-linked pages whenever a new link is added. These archived copies could then be restored later by an administrator, after verifying that there were no issues with that. Is this a viable model for dealing with link rot? Wbm1058 (talk) 16:55, 10 June 2014 (UTC)
A big thanks for all that. I think your speculation (that misjudging the amount of work it would take to bring up a robust, acceptable bot is the main issue, for the several who tried) may well be spot-on with respect to what the hell is really going on here, big picture. If Rotlink or anyone else wants to archive every URL added to a MediaWiki as it's added, by monitoring recent-changes-type feeds, there's nothing anyone can do that would stop him. We could stop him from adding links to those archives to articles, yes. But I see that, and any other restriction based on some sort of "right to be forgotten" theory, as an attempt to close the barn door after the horse has already run out. Very similarly, http://en.wikichecker.com/ can't be stopped (or forced to honor opt-in). --{{U|Elvey}} (tc) 20:52, 10 June 2014 (UTC)
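For what it's worth, the "archive every URL as it's added" idea is mechanically simple. Below is a minimal sketch assuming today's public Wikimedia recent-changes stream and the Wayback Machine's save endpoint; the revision-diffing shortcut is illustrative rather than how a production bot would do it:

    import json
    import re
    import requests
    from sseclient import SSEClient  # pip install sseclient

    STREAM = "https://stream.wikimedia.org/v2/stream/recentchanges"
    API = "https://en.wikipedia.org/w/api.php"
    URL_RE = re.compile(r'https?://[^\s\]<>"|}]+')

    def revision_urls(revid):
        # External URLs present in one revision's wikitext.
        data = requests.get(API, params={
            "action": "query", "prop": "revisions", "revids": revid,
            "rvprop": "content", "format": "json", "formatversion": "2",
        }).json()
        text = data["query"]["pages"][0]["revisions"][0]["content"]
        return set(URL_RE.findall(text))

    for event in SSEClient(STREAM):
        if not event.data:
            continue
        change = json.loads(event.data)
        if change.get("wiki") != "enwiki" or change.get("type") != "edit":
            continue
        added = (revision_urls(change["revision"]["new"])
                 - revision_urls(change["revision"]["old"]))
        for url in added:
            # Fire a capture request at the Wayback Machine and move on.
            requests.get("https://web.archive.org/save/" + url)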

Newspaper websites which have undergone link format changes

Might be good to record a list of newspaper websites which have changed their article link format. This would help systematic, albeit manual, review of citations based on said newspapers. For instance:

  • periodical: Arizona Daily Star
  • old format: azstarnet.com/{section}/{article id}
  • new format: azstarnet.com/{section}/{abbreviated title}/article_{identifier}.html
  • change date: sometime after 2010

Just a thought. --User:Ceyockey (talk to me) 11:47, 16 January 2013 (UTC)

Yes, I find it a very good idea. I am doing that extensively on my native-language Wikipedia (Romanian). I am putting such information on the talk page of every newspaper's article. Check for example:

For example, the website of Adevărul changed its link formatting twice:

==>

==>

It's also an interesting exercise; I learned that many sites (most) changed their formatting like this:

In this case, first it was like:

and then:

It's interesting that you can access the article like this:

Print version:

Mobile version:

PDF version:

It's useful to know all those things. It's a little bit of reverse engineering. But it helps those who try to repair broken links. Such knowledge even helped me to repair broken links with a robot. Links like

Transformed into:

Here is the robot at work: [1] Ark25  (talk) 00:18, 27 July 2013 (UTC)
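The robot described above boils down to a table of regular-expression rewrite rules. A minimal Python sketch of the idea (the pattern is an invented placeholder, since the actual example links did not survive on this page):

    import re

    # One (old-pattern, new-template) rule per format change a site has made.
    # This rule is hypothetical; substitute the site's real old/new formats.
    REWRITE_RULES = [
        (re.compile(r"http://www\.example\.ro/articol/(\d+)\.html"),
         r"http://www.example.ro/news/article_\1.html"),
    ]

    def repair(url):
        # Return the rewritten URL if a rule matches, otherwise None.
        for pattern, template in REWRITE_RULES:
            if pattern.match(url):
                return pattern.sub(template, url)
        return None

    print(repair("http://www.example.ro/articol/12345.html"))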

Yahoo news - when they disappear, do they disappear altogether?

I found recently that at least some content at Yahoo News has been captured at archive.org (internet archive). The item which caught my attention ... http://web.archive.org/web/20090502072711/http://news.yahoo.com/s/ap/20090427/ap_on_re_mi_ea/ml_odd_israel_kosher_flu . It might be that after a certain date, Yahoo! blocked the archiving and they've not bothered to reach back and request removal of older content?? --User:Ceyockey (talk to me) 15:09, 26 January 2013 (UTC)

Mementos - cross-archive searching

Hi all,

A colleague of mine has just alerted me to the Mementos interface - it's hosted by the UK Web Archive, but searches across a range of archive sites. Here's an example of a search run for news.bbc.co.uk; as you can see, it picks up a couple of smaller repositories, such as the LoC, as well as the usual suspects.

Any objections to my pointing to this as a resource in the "Web archive services" section? Andrew Gray (talk) 16:15, 29 January 2013 (UTC)

Hi, I work for the UK Web Archive and created that interface. We have no problem with it being publicised more widely. AndyJ 13:08, 19 February 2014 (UTC)
Cool! I've added it, but to the (Internet archives section of the) Repairing a dead link section, as repair is what the tool is most useful for.--Elvey (talk) 18:06, 19 March 2014 (UTC)
I have tried using the interface to which we are directed. I found it to have a significant issue. On the first page I tried, it removed all parameters passed with the URL. I was trying to find archives of a dead reference link I stumbled upon. Specifically the link was:
http://www.autonews.com/apps/pbcs.dll/article?AID=/20011203/ANE/112030837
When this URL is entered, the interface strips it to:
http://www.autonews.com/apps/pbcs.dll/article
This, obviously, produces erroneous results. As such, the interface was completely unusable for this particular search.
While the interface will work for many URLs, it appears that it is useless for any that require parameters. I am going to add some explanatory text to the project page and move it from the top of the section. While it is useful, this issue significantly limits its usability. I also found that it was not immediately obvious that the interface had effectively corrupted the URL for which I was searching. Thus, I expect that users will find it confusing when they encounter this issue. — Makyen (talk) 05:36, 20 March 2014 (UTC)
You could simply encode it http://www.webarchive.org.uk/mementos/search/http%3A%2F%2Fwww.autonews.com%2Fapps%2Fpbcs.dll%2Farticle%3FAID%3D%2F20011203%2FANE%2F112030837. Or even just the parameter delimiters http://www.webarchive.org.uk/mementos/search/http://www.autonews.com/apps/pbcs.dll/article%3FAID=/20011203/ANE/112030837 —  HELLKNOWZ  ▎TALK 10:56, 20 March 2014 (UTC)
Thank you for the information. I came to the same conclusion after experimenting a bit. However, my main point was that we should not have Mementos as the first thing mentioned in the section on archiving unless it is working perfectly (e.g. without the need for users to hand-edit the URL for which they are searching). I believe the text now reflects the need for encoding the query string separator (i.e. the "?"). This appears to get people 90% of the way to usability without having to go into great detail about how to encode a URL. The bookmarklet continues to function as intended. In the example I used above, the bookmarklet was not very useful because the site now in control of the domain redirects the page. — Makyen (talk) 12:40, 20 March 2014 (UTC)
Good catch and good workarounds. @AndrewNJackson: Hope the memento folks make use of an appropriate urlencode/urldecode library soon. Kudos all 'round. --Elvey (talk) 15:47, 2 April 2014 (UTC)
Yes, thank you, we have resolved the underlying issue and will endeavour to get this deployed across the relevant web archives ASAP. AndyJ 15:51, 2 April 2014 (UTC)
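(For reference, the full percent-encoding suggested above can be generated rather than typed by hand; a short Python illustration using the example URL from this thread:)

    from urllib.parse import quote

    target = ("http://www.autonews.com/apps/pbcs.dll/article"
              "?AID=/20011203/ANE/112030837")
    # safe="" percent-encodes every reserved character, including "/" and "?".
    print("http://www.webarchive.org.uk/mementos/search/" + quote(target, safe=""))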
Currently, it looks like it would be a good idea for us to also state that Mementos should not be the only site checked as Mementos can sometimes return no results when archives exist at sites which it normally includes. An example of this is trying to find archives of Battle of the Atlantic. As of April 2014, Archive.org reports it has 63 or 64 archives (https, http). Mementos reports 0 archives (https, http). Mementos usually finds archives at Archive.org, but clearly that is sometimes not the case. We should recommend that editors also do their own searches at least in addition to searching on Mementos.
AndrewNJackson, given your response above, my hope is that this also gets fixed.
Note: I said 63 or 64 archives reported by Archive.org because the number it reports changes upon different refreshes of the results pages. — Makyen (talk) 20:50, 2 April 2014 (UTC)
Hmm, the central Memento service my UI depends upon treats Wikipedia as special, and attempts to redirect to the different versions of content held on Wikipedia itself, rather than web archives (and my UI does not cope well with this). I'll follow up with the Memento developers as to their intentions. AndyJ 21:20, 2 April 2014 (UTC)
Hmm, http://www.webarchive.org.uk/mementos/search is down yet again. Been down all day, but displaying "Sorry, but there was an unexpected error that will prevent the Memento from being displayed. Try again in 5 minutes."--{{U|Elvey}} (tc) 06:33, 6 June 2014 (UTC)

Citations on Wikipedia and discussion at meta:WebCite

There is a discussion at meta:WebCite regarding citations on Wikipedia that would be of interest to those that watchlist this page. Wikipedia currently has 182,368 links to this archive site. Regards. 64.40.54.47 (talk) 11:41, 11 February 2013 (UTC)

suggest removing: Web archiving is especially important when citing web pages that are unstable or prone to changes, like time sensitive news articles or pages hosted by financially distressed organizations.

I believe we should remove "Web archiving is especially important when citing web pages that are unstable or prone to changes, like time sensitive news articles or pages hosted by financially distressed organizations." Any news article that changes places, you can just update the link to. Most don't change themselves, though. If ever the link stops working, then you can add an archive to replace it. Over at Talk:Garrett_(character), an editor is quoting that sentence as a reason to include archive links all over an article, when it's not needed since links to the content still work fine on their own. Dream Focus 22:34, 12 March 2013 (UTC)

Late reply: I don't agree with removing this sentence. All sources are ephemeral. Some are more ephemeral than others, like news such as AP (with contractual expiration times), UPI, NYT, Google News (15-30 days), and anything Google caches (15-120 days). Many archive links in an article "all over" aren't a problem, since they're just an incremental burden when using templates, and can be filled in automatically by some tools, like reflinks. Eventually, all links rot. This is invariant. Archive.org and Webcitation.org don't/can't archive all websites due to robots.txt. I've even seen whole sites which used to be archiveable by them completely disappear behind a domain owner's new robots.txt (archive.org respects the current robots.txt, not past ones). My argument is that we should archive early, defensively, redundantly (multiple archives), and often, to avoid being caught flatfooted by such blackouts. I use the webcitation bookmarklet like a Tourette syndrome twitch. --Lexein (talk) 00:55, 30 July 2013 (UTC)

Pay Wall

The " Web archive services " section implies that the use of a web archiving service is useful in cases where material is moved behind a paywall. This position is troubling. Do we really mean this, and if so, how do we justify it?--SPhilbrick(Talk) 23:45, 19 March 2013 (UTC)

Not likely to get a response over here. That's why I asked that over at User_talk:Jimbo_Wales#violating_copyright_laws_by_linking_to_archived_sites_when_original_site_is_still_live. Also have the discussion still going on at Talk:Garrett (character). Dream Focus 23:53, 19 March 2013 (UTC)

Should the original url= be required when using archiveurl=

People here may be interested in commenting on the issue described at:

Wikipedia:Village pump (policy)/Archive_105#Citations: Should the original url.3D be required when using archiveurl.3D. Dragons flight (talk) 18:47, 8 April 2013 (UTC)

Link Rotting Across the Universe

Tmol42 and I have been discussing link archiving, and I'd just like some clarification on a matter. Sorry, I know you've probably had this so many times in so many forms, but I'd be grateful if you'd humour me! I found a dead link to a PDF file at Parish councils in England, and found a backup at the Wayback Machine. I added the archiveurl= and archivedate= parameters, and took the information from Wayback. My revision can be found here. The other user then changed this, removing the archive parameters and setting the URL to the Wayback archive. His change can be seen here. Which, if either, is correct?  drewmunn  talk  16:22, 24 May 2013 (UTC)

Yours is correct; we don't link directly to archives, for many reasons. We even have bots to correct this. —  HELLKNOWZ  ▎TALK 20:24, 24 May 2013 (UTC)

Better archiving of external links

On Romanian Wikipedia we were trying to use a WikiWix gadget. Each external link is accompanied by its WikiWix cache link. In order to archive the external link, you just have to click on its corresponding WikiWix link. The first time you click it, it will archive your external link/reference; the next time you click it, you will get the archived page. This is a much better way to archive web pages than submitting a link to WebCite, or even than using a bookmarklet. It works very fast: if you want to archive 20 references, you just have to open them all in tabs. However, WikiWix has some issues: it seems to have some daily or weekly or monthly quota (you can't archive more than, say, 100 links per week), which makes it quite unusable.

You can see the cached links on WikiWix using a gadget that you can activate in your preferences.

The best solution would be something like the WikiWix gadget, because you don't have to bother presenting the archived link; the gadget shows it automatically. And it's very easy to archive a page by just clicking on its archive link for the first time. However, we need a better solution: a robot to cache all the external links in Wikipedia automatically (I just noticed in the discussion above that Archive.is did just that). And without quotas like WikiWix's, of course.

One solution would be to create a gadget for WebCite, to show the archived (cached) links near each external link. Together with a robot to take care of archiving all external links.

Another solution would be to create a gadget for Archive.is, to show the cached links, since its owner claims it cached all the external links last year. And, if possible, to arrange with him to archive the new external links each month or so.

For those who are not clear on how WikiWix works, check this page: ro:Șantierul Naval Constanța. In the "Note" section, you will see the following link:

In order to see the WikiWix cached links, you have to activate the WikiWix gadget: http://ro.wikipedia.org/wiki/Special:Preferences#mw-prefsection-gadgets - the last checkbox: Versiunea arhivată pentru legăturile externe

Now, near each link you can see a small yellow image (like 10x10 pixels). In this case, right before the date (31.08.2009), the WikiWix archive link: http://archive.wikiwix.com/cache/?url=http://www.zf.ro/burse-fonduri-mutuale/bosanceanu-si-a-dublat-afacerile-la-santierul-naval-constanta-4823139/&title=Bosanceanu%20si-a%20dublat%20afacerile%20la%20Santierul%20Naval%20Constanta . —  Ark25  (talk) 19:05, 26 July 2013 (UTC)
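A gadget like this essentially just derives a cache URL from each external link. For illustration only (the real gadget is JavaScript; this Python sketch merely reproduces the URL scheme visible in the example above):

    from urllib.parse import quote

    def wikiwix_cache_url(url, title):
        # Mirrors the pattern of the example link above:
        # http://archive.wikiwix.com/cache/?url=<original>&title=<title>
        return ("http://archive.wikiwix.com/cache/?url=" + url
                + "&title=" + quote(title))

    print(wikiwix_cache_url(
        "http://www.zf.ro/burse-fonduri-mutuale/bosanceanu-si-a-dublat-"
        "afacerile-la-santierul-naval-constanta-4823139/",
        "Bosanceanu si-a dublat afacerile la Santierul Naval Constanta"))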

Memento for Wayback Machine links

I just discovered Memento, a protocol that has been proposed in the past as a way for MediaWiki to provide easier access to historical revisions of pages; and that has a draft extension which could implement such a thing. AFAIK it isn't implemented on any wikis yet. But the idea is a neat one, and according to the bug request asking for the extension to be added to mediawiki core, Archive.org now recognizes memento URLs for referencing cached pages in the wayback machine. So this seems like a good time to revisit setting up a linkbot that caches links with them.

I sent an email to Alexis @ IA and Kevin, who wrote the unfinished ArchiveLinks extension, to see if any recent progress had been made. If so, it would be nice to have a guideline for including a memento timestamp in links to the archive.org cache. – SJ + 22:41, 15 August 2013 (UTC)

I created the document Memento Capabilities for Wikipedia that describes areas in which the Memento protocol could be leveraged to add end-user value to Wikipedia. One of the described capabilities relates to the link rot problem. I will propose a Wikiproject that uses the document as its starting point. Hvdsomp (talk) 21:57, 19 September 2013 (UTC)

Again about the Gadget

I have modified the Gadget I was talking about in the section above. Now, every external link on Romanian Wikipedia has another link near it that takes you to the Archive.org version of the page at that link. The gadget can be modified to take you to the Archive.is version of the page, or to show both archives - ro:MediaWiki:Cache.js (function addcache). I think English Wikipedia should have such a thing too. This is the most convenient solution, since Archive.is is archiving all the external links in Wikipedia pages and Archive.org is archiving almost all the newspaper sites. It's far more efficient than having to archive pages or search for archives manually. —  Ark25  (talk) 05:08, 27 August 2013 (UTC)

Wikipedia:Archive.is RFC republicizing

Recent events related to archive.is have left Wikipedia's links to archive.is in a state that requires a community decision.

This constitutes broader publicizing of the above Request for Comment as suggested/requested on October 3. This RFC was started September 20, 2013. --Lexein (talk) 08:05, 7 October 2013 (UTC)

Wayback API

I just learned of this: mw:Archived Pages: "The Internet Archive wants to help fix broken outlinks on Wikipedia and make citations more reliable. Are there members of the community who can help build tools to get archived pages in appropriate places? If you would like to help, please discuss, annotate this page, and/or email alexis@archive.org." - leaving the link here in case anyone else is interested or can help. –Quiddity (talk) 16:38, 31 October 2013 (UTC)

Are there any functioning linkfix bots?

http://www.language-museum.com/ used to be

a linguistic website which offers the samples of 2000 languages in the world. Every sample includes 4 parts: (1) a sample image, (2) an English translation, (3) the speaking countries and populations, (4) the language's family and branch. … constructed and maintained by Zhang Hong, an internet consultant and amateur linguist in Beijing China.

The domain's top level is now used by a language teaching company, LM Languages. But by using the Wayback Machine, I found that the content is still up there, at http://www.language-museum.com/encyclopedia/, though I don't know if Zhang Hong or anyone is still maintaining it.

I discovered this while following a link from Toba Batak language, pointing to

http://www.language-museum.com/b/batak-toba.php

That content is now at

http://www.language-museum.com/encyclopedia/b/batak-toba.php

I have updated the links in Toba Batak language. But the "museum" claims to have examples from 2000 languages, and there may be many more links in Wikipedia. So the thing to do, as I see it, is to update all links of the form

http://www.language-museum.com/LETTER/LANGUAGE.php

by inserting encyclopedia/ after .com/. Or, even simpler, change

http://www.language-museum.com/

to

http://www.language-museum.com/encyclopedia/

unless "encyclopedia" is already in the URL.

Following the links in §Bots, I posted the request at User:RileyBot/Requests. Then I saw that the user is "semiretired: no longer very active on Wikipedia as of 29 April 2013", so I tried MerlLinkBot, but that one doesn't seem to be much better. Is there ANYBOT that can do a clearly defined search-and-replace?

If you have any questions, please {{ping}} me so I don't have to add this page to my watchlist. Thanks. --Thnidu (talk) 19:30, 11 January 2014 (UTC)
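The replacement itself is a one-liner once a bot framework hands you the wikitext; a minimal sketch of the rule described above (the page-fetching machinery of a real bot is omitted):

    import re

    OLD = "http://www.language-museum.com/"
    NEW = "http://www.language-museum.com/encyclopedia/"

    def fix_language_museum_links(wikitext):
        # Insert "encyclopedia/" after the domain unless it is already there.
        pattern = re.compile(re.escape(OLD) + r"(?!encyclopedia/)")
        return pattern.sub(NEW, wikitext)

    print(fix_language_museum_links(
        "[http://www.language-museum.com/b/batak-toba.php Batak Toba]"))
    # -> [http://www.language-museum.com/encyclopedia/b/batak-toba.php Batak Toba]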

I took care of it. --Ysangkok (talk) 16:53, 4 July 2015 (UTC)
@Ysangkok: Thanks. --Thnidu (talk) 16:21, 5 July 2015 (UTC)

Archiving Archive.is snapshots with WebCite/Wayback

Since Archive.is may be on its way to being blacklisted soon, per the RfC, would it be possible to archive any Archive.is snapshots whose original URL is dead and has not been archived through WebCite/Wayback, in order to circumvent the blacklist? — Whisternefet (t · c) 00:08, 18 January 2014 (UTC)

You can circumvent it now by using MementoWeb. --Ysangkok (talk) 16:53, 4 July 2015 (UTC)

any way to search for dead links within a category

When I look at the list of dead links, it's huge. Is there a way to search it to find articles within particular categories with dead links? I guess I am more motivated to maintain articles within categories of interest to me than articles in general. Kerry (talk) 00:38, 21 January 2014 (UTC)

Many WikiProjects have their own to-do listings, which organize maintenance tasks by subject. Check out Category:To-do list templates for WikiProjects, and also the Cleanup Listing tool, which will allow you to search maintenance tasks by category. Hope that helps! - œ 11:56, 19 February 2014 (UTC)
Thanks, this is exactly what I was looking for! Kerry (talk) 12:12, 19 February 2014 (UTC)

Repair vs. archive

I've been doing some link fixing, mostly to news articles – hopefully some of it even correctly. If I come across a link that's dead and I can find both a replacement live link to the same article and an archive copy based on the original location, should I have a preference for one of the following, or does policy / established consensus let me pick whichever I prefer?

  1. Leave the dead url, and add the archiveurl and archivedate
  2. Update url to the working one

I tend to use option 2, always archive the new page and usually add the archiveurl, archivedate and deadurl=no, but I guess those are all optional.

If I do option 2, should I update the accessdate to today or leave it set to the original date? —Otus scops (talk) 17:53, 24 April 2014 (UTC)
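For concreteness, the two options look roughly like this in citation-template terms (the URLs and dates are invented for illustration):

Option 1, keep the dead URL and add the archive:

    {{cite web |url=http://example.com/old-story |title=Example story |accessdate=1 May 2012 |deadurl=yes |archiveurl=https://web.archive.org/web/20120501000000/http://example.com/old-story |archivedate=1 May 2012}}

Option 2, point at the working URL, optionally with a preemptive archive:

    {{cite web |url=http://example.com/news/old-story |title=Example story |accessdate=1 May 2012 |deadurl=no |archiveurl=https://web.archive.org/web/20140601000000/http://example.com/news/old-story |archivedate=1 June 2014}}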

Personally, I'd go with option 1, because it's much less work for me, and is almost as valuable. However, I think option 2 is better, and say go for it; you could add "(updated)" after the updated URL to help anyone who was confused and failed to look at the edit history. --{{U|Elvey}} (tc) 19:19, 5 June 2014 (UTC)
I usually consider the |accessdate= as the date on which the article text was checked against the reference to verify that what the article says is supported by the reference. If you are not re-verifying the article text, I would add the archiveurl, but not the new URL. I would probably put the new, updated URL in as a wiki-comment to aid the next person who comes through. The purpose of any reference is to support the article text. For the next person who comes through to verify, the best situation is to be able to see the reference as close as possible to how it existed when last verified.
If you are verifying the article text against the reference, I would do option 2, including the preemptive archiving you mention. Given that creating the preemptive archive is a one-click action, I might initiate archiving at both archive.org and WebCite. Obviously, you only include one archive in the citation, but having the second exist doesn't hurt and may be desirable at some point in the future. — Makyen (talk) 20:06, 5 June 2014 (UTC)
@Elvey and Makyen:Thank you both. I'll tend towards 1 when I don't reverify and 2 when I do, then. I suppose that I should probably have another go at getting the WebCite bookmarklet to work properly...—Otus scops (talk) 21:49, 5 June 2014 (UTC)
When I fix a URL matter for a reference, I usually choose option 1. Sometimes there is no archive URL, and it's simply a matter of locating the updated URL (because the URL was moved instead of archived). Flyer22 (talk) 21:58, 5 June 2014 (UTC)

Memento MediaWiki extension

Does wikipedia.org have it installed? https://www.mediawiki.org/wiki/Extension:Memento — Preceding unsigned comment added by 89.47.81.188 (talk) 17:08, 25 June 2014 (UTC)

Template:Google videos

All those references are dead. --93.216.69.148 (talk) 15:02, 10 July 2014 (UTC)

Wikipedia:Archive.is RFC 3 republicizing

This constitutes broader publicizing of the above Request for Comment.--{{U|Elvey}} (tc) 19:59, 11 September 2014 (UTC)

sportsillustrated.cnn.com moved to www.si.com

Enwiki has ~16,600 external links to this domain; dewiki has about 500 links. Is there somebody who is able to make a list mapping old links to new links? Boshomi (talk) 09:31, 26 October 2014 (UTC)

Boshomi, how would one know what the new link is? There's nothing in the old links to provide guidance. Probably the thing to do is to link to the Internet Archive, i.e. https://web.archive.org/web/20130125052017/http://sportsillustrated.cnn.com/vault/article/magazine/MAG1152491/4/index.htm -- GreenC 18:16, 21 June 2015 (UTC)

Meanwhile I have fixed all 500 links in dewiki [LinkSearch ns=0 sportsillustrated.cnn.com]. I found the articles at http://www.si.com/; sometimes I needed archive.org and archive.is. The archives can help to find the article at si.com. Boshomi (talk) 20:17, 22 June 2015 (UTC)
Boshomi: Nice work. How do the archives help finding the si.com link? -- GreenC 21:34, 24 June 2015 (UTC)
GreenC: Both web.archive.org/web/ and archive.is/ are very useful.
  1. The first step was to search the web archives. I have written software for creating lists like de:Wikipedia:WikiProjekt_Weblinkwartung/Toter_Link/Liste_afp_google (a list for afp.google.com that I created today)
  2. The second step was to google the archived title, like »"Sports Title" site:si.com«, sometimes with the date
  3. If nothing was found at si.com, I used web.archive.org/web/ or archive.is/ (web.archive.org/web/ is preferred, but we also use archive.is, though only with full URLs, no short links, and we keep an eye on the use of this web archive)
  4. In case nothing was found either at si.com or in the archives, I searched in other media Boshomi (talk) 19:40, 25 June 2015 (UTC)
I see. Good idea to use Google to site:si.com based on title. -- GreenC 16:42, 26 June 2015 (UTC)

Links that have now become spam

On page "Rae Baker" there is a link to "burnettgrangercrowther.co.uk" which I assume once was the actress's agent's page but now appears to be controlled by a blatant spammer. I have marked the link as [dead link] but wonder whether the link should be removed altogether. Can anyone advise? — Preceding unsigned comment added by Toffeepot (talkcontribs) 16:20, 25 November 2014

In general, I would remove such links. If the link was something like the website in the infobox or an external link, removal should occur. If it is there to verify some fact, the situation is murkier, because in principle the dead link could be reactivated, or the content may have moved to some other website. If the verified text is uncontroversial, it might be OK to remove the ref and hope someone will fix it eventually. We are supposed to search for the information and try to find whether it is verified anywhere else, as at WP:LINKROT. In this article, there is a problem because the verified assertion is rather weird. I would be inclined to search the article history to see how it was added, and remove most of the sentence if the editor's intention was unclear. Johnuniq (talk) 22:59, 25 November 2014 (UTC)

Found a Great Tool (Save Page to Wayback Machine Bookmarklet)

This tool makes saving pages incredibly faster and easier. All you have to do is click on the big blue "save to wayback machine" banner and drag it to your bookmarks (the bookmarks bar must be visible, which in Chrome puts it under the search bar). Now when you click on the bookmark it automatically saves the page to the Wayback Machine, instead of your having to copy and paste single links manually into a bunch of Wayback Machine tabs. Never again shall there be a dead link lol. WikiOriginal-9 (talk) 20:54, 3 December 2014 (UTC)
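The bookmarklet is just a shortcut to the Wayback Machine's save endpoint, so the same capture can be scripted; a minimal Python sketch (the endpoint is the standard web.archive.org/save/ pattern; that the response redirects to the new snapshot is an assumption):

    import requests

    def save_to_wayback(url):
        # Ask the Wayback Machine to capture the live page now.
        resp = requests.get("https://web.archive.org/save/" + url)
        resp.raise_for_status()
        return resp.url  # normally the URL of the freshly created snapshot

    print(save_to_wayback("http://example.com/"))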

Reverse linkrot

You might want to look at a discussion that I've started at Wikipedia talk:Moving a page#Breaking incoming links. Thanks, Bazonka (talk) 08:27, 11 December 2014 (UTC)

There wasn't a lot of discussion, so I have been bold and added a section to this page. I think any further discussion should take place here now. Bazonka (talk) 21:48, 16 December 2014 (UTC)
Bazonka - Some websites have a "Permanent URL" link that never changes, though I have not thought through how that would help at Wikipedia. -- GreenC 17:57, 21 June 2015 (UTC)

Userbox related to link rot

Is there a user box I can place on my user page which says something like "This user always adds archived URLs to citations"? Lugevas (talk) 21:23, 5 January 2015 (UTC)

Alright folks, I answered my own question. There are a whole bunch here. Lugevas (talk) 21:30, 5 January 2015 (UTC)
WebCite This user believes in archiving sources to prevent link rot.



Moved external webpage

If a page has been moved to another url, should I just replace the url in the citation with the new, working link? --Prisencolinensinainciusol (talk) 23:35, 2 February 2015 (UTC)

Yes. -- Michael Bednarek (talk) 03:13, 3 February 2015 (UTC)

Semi-protected edit request on 17 March 2015

Please change DEAD LINK "Men's Fitness Interviews: Milo Ventimiglia" to "Men's Fitness Facts by Kerin Richael" Nidalsaim (talk) 04:31, 17 March 2015 (UTC)

Not done: this is the talk page for discussing improvements to the page Wikipedia:Link rot. Please make your request at the talk page for the article concerned. Stickee (talk) 05:20, 17 March 2015 (UTC)

Semi-protected edit request on 17 March 2015

Please moderate the dead link "Zoidis, John D. (1999). "The Impact of Air Pollution on COPD". RT: for Decision Makers in Respiratory Care.[dead link]" to "Air and Water Pollution Effects on Health and Hair by Kerin Richael, "Air and Water Pollution Effects on Health & Hair" Nidalsaim (talk) 05:42, 17 March 2015 (UTC)

Not done: You are in the wrong place, as this page is only to discuss improvements to Wikipedia:Link rot.
If you want to suggest a change, please request this on the talk page of the relevant article - I have no idea which article you are talking about. - Arjayay (talk) 08:49, 17 March 2015 (UTC)

Semi-protected edit request on 18 March 2015

Please change the following Dead Link ("Official Women Of Wrestling (OWOW) – Trish Stratus Biography". Official Women of Wrestling. Retrieved August 20, 2007") to ( Complete Women Fitness Wrestling 2015 by Kerin Richael Complete Fitness Guide for Perfect Shape and Wrestling Nidalsaim (talk) 09:49, 18 March 2015 (UTC)

Not done: as you are in the wrong place, since this page is only to discuss improvements to Wikipedia:Link rot.
If you want to suggest a change, please request this on the talk page of the relevant article - I don't know which one you are referring to. - Arjayay (talk) 11:54, 18 March 2015 (UTC)

New Archive-Bot Initiative

Okay, I am sick and tired of not being able to follow citations on Wikipedia due to link rot. It's happening on almost every article I visit now.

I think it's time that an archival bot solves this problem! Really shouldn't be too hard - find a citation, send to Archive.org, link back on WP. Flag dead links as such. (Ideally, I think Wikipedia would actually host their _own_ archive of a page without relying on a third party service, perhaps on the Commons or WikiData, but that's probably a different conversation.)

Is there any reason why this currently doesn't exist? I saw that there were other bots that attempted to do this in the past, but they are no longer operating - are they forbidden?

I think I could whip up a new bot to do this in a day or two. If I made a new bot to do this, would the administrators allow it? What would be the administrative/technical hurdles?

Thanks! Miserlou (talk) 19:16, 20 April 2015 (UTC)

Miserlou - It's a hairy problem in the details, but I agree this is doable. I just wrote a bot to fix kirjasto.sci.fi, which went offline in March and is linked in 580 articles (details here). But I did it in manual mode; getting permission to run a bot across millions of articles is going to take some time and effort, I suspect. The average lifespan of a link is 6 years, so it's only surprising there are not more dead links. -- GreenC 18:02, 21 June 2015 (UTC)
You should see previous attempts and BRFAs on the matter. It is way more complex than a day or two. In fact, the many issues and the ridiculous number of corner cases and exceptions are why none of the bot operators have time to run and maintain this. See User:H3llBot/ADL#Relevant_links for some links (not recently updated). —  HELLKNOWZ  ▎TALK 19:25, 21 June 2015 (UTC)
I think the projects have stalled because it's ambitious to fully automate across all articles, a silver-bullet solution. I'm taking the approach of focusing on a single website that has died, and updating all links for that website. That makes the exceptions manageable, though there is still manual work involved. You have to train your bot by updating the code based on real-world experience of what shows up during a manual AWB run of a couple hundred articles. It is very challenging to fully automate, but about 20% of the cases are 90% of the work, so if we can get the low-hanging 80% and skip the rest, it is better than nothing. Those others can then be noted and manually fixed later. - GreenC 17:12, 22 June 2015 (UTC)

Dead Link server

Thinking about the dead link problem, one solution is a "dead link server". This would be an API-based application running on Tools that would redirect incoming requests to the appropriate Wayback page.

Say for example a {{cite web}} template has |dead-url=yes but the |archiveurl= and |archivedate= are blank or non-existent, e.g.:

{{cite web |url=http://sportsillustrated.cnn.com |accessdate=23 March 2007 |dead-url=yes}}

The cite web template would generate a URL to the hypothetical Dead Link Server like this:

https://tools.wikimedia.org/dls/dls.py?http://sportsillustrated.cnn.com&20070323

The dls.py script would then do the work of determining where to redirect the request on a real-time basis. It will use the Wayback API to determine which page to send to:

http://archive.org/wayback/available?url=http://sportsillustrated.cnn.com&timestamp=20070323

This will return the Wayback URL closest to the date 20070323 without going past that date and which contains valid content. This URL becomes the redirect served up to the end user's browser.

The Dead Link Server has advantages:

  1. Automating an archive flag is easy. Only need to add |dead-url=yes which basic bots can handle without much oversight.
  2. Easier on the community as you don't need to look up and find the Wayback page, just add a single |dead-url=yes
  3. It allows for using more than one archiving service. If Wayback doesn't return a result, the server can check other services like archive.is
  4. It allows for easy changes on a site-wide basis. If Wayback went dead, or Wikimedia started its own service, or URL formats changed etc.. only need to change dls.py instead of millions of articles.
  5. It works with existing templates and methods. If a cite web already is using |archiveurl= it continues to use the "hard coded" URL and not redirect to the Dead Link Server.
  6. If the idea is successful, it could later be built-in to the MediaWiki software. Say for example sportsillustrated.cnn.com is a known dead domain. MediaWiki has a list of known dead domains and when rendering the page to HTML will transpose the URL to go to the Dead Links Server. This means end users don't need to worry about dead links, the software does it all automatically and hidden.

There are some possible cons:

  1. If the Tool Server goes down (it happens), the service will become unavailable.
  2. It initially only works with templated citations, but it could work with any form (including bare URLs in external links) if #6 were implemented.

That was my thought of the day. I'm surely missing something but wanted to post it before I forgot :) -- GreenC 19:50, 23 June 2015 (UTC)
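To make the idea concrete, here is a minimal sketch of what dls.py might look like, built on the Wayback availability API quoted above (Flask and named query parameters are illustrative choices; the positional ?url&date form sketched earlier would need a little extra parsing):

    import requests
    from flask import Flask, abort, redirect, request

    app = Flask(__name__)
    WAYBACK_API = "http://archive.org/wayback/available"

    @app.route("/dls")
    def dead_link_server():
        url = request.args.get("url")
        timestamp = request.args.get("timestamp", "")
        if not url:
            abort(400)
        data = requests.get(WAYBACK_API,
                            params={"url": url, "timestamp": timestamp}).json()
        snapshot = data.get("archived_snapshots", {}).get("closest")
        if not snapshot:
            abort(404)  # a real service could fall back to other archives here
        return redirect(snapshot["url"])

    if __name__ == "__main__":
        app.run()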

@Green Cardamom: check this URL out: http://timetravel.mementoweb.org/memento/2010/http://www.insc.anl.gov/neisb/neisb4/NEISB_3.3.A1.html Now try the same link with 2015 instead of 2010 in the URL. My point is that the Wayback has many unusable archived pages. Someone needs to choose the right date. Machine learning could probably solve this problem, but I do not think there is consensus for it. --Ysangkok (talk) 20:05, 5 July 2015 (UTC)
@Ysangkok: -- "Someone needs to choose the right date" - the Wayback API is designed for this. Try this: https://archive.org/wayback/available?url=http://www.insc.anl.gov/neisb/neisb4/NEISB_3.3.A1.html .. it returns (as JSON) the URL of the best available copy, in this case the 2010 capture. -- GreenC 23:50, 5 July 2015 (UTC)

Proposal to link to MementoWeb.org instead of webcitation.org and archive.org

There have been issues with all the archives:

  • some pages have disappeared from archive.org reflecting changes in robots.txt
  • webcitation.org has said it is about to cease its activities for financial reasons
  • ...not to mention the problems with archive.today

I propose to link to mementoweb.org instead of any particular archive's domain.

It is a metaindex of all the archives, and it is led by a respectable institution and well funded.

More info here: Memento Project

Links will have the form http://timetravel.mementoweb.org/memento/2015/http://en.wikipedia.org
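Building such a link is purely mechanical; a one-function sketch of the pattern above:

    def timetravel_url(target, date="2015"):
        # http://timetravel.mementoweb.org/memento/<date>/<original URL>
        return "http://timetravel.mementoweb.org/memento/%s/%s" % (date, target)

    print(timetravel_url("http://en.wikipedia.org"))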

Strange situation

A site used as a source for some material at Commonwealth realm went down recently, now showing only "server error": [2]. None of the resources provided here recover any archived version of the page. However, I can actually still access the site using Safari on my iPhone. I find this very confusing and there is currently an ongoing dispute that rests heavily on the unavailability of this website to seemingly all users except myself. Can anyone provide some clarification and/or advice on how to proceed? -- MIESIANIACAL 19:52, 15 November 2015 (UTC)

There is little if anything in the dispute, https://en.wikipedia.org/wiki/Talk:Commonwealth_realm#On_the_United_Commonwealth_Society, that depends upon whether or not the site is active. Juan Riley (talk) 20:02, 15 November 2015 (UTC)
There is indeed, since no one involved (except myself) knows what the contents of the United Commonwealth Society's site are. Indeed, most of you think a couple of wacky individuals' material on the internet is representative of the UCS when it most certainly is not, and your having access to the UCS's page would prove it. (Either that, or there's a way I can provide screenshots from my phone without uploading the images into Wikipedia's collection.) -- MIESIANIACAL 21:13, 15 November 2015 (UTC)
No comment since you are proving my point. Good luck. Juan Riley (talk) 21:15, 15 November 2015 (UTC)
Thank you for conceding you've jumped to conclusions, then. -- MIESIANIACAL 21:18, 15 November 2015 (UTC)