User talk:Fæ

For older archives, see User talk:Fæ/2021, User talk:Fæ/2020, User talk:Fæ/2019, et seq.

Old projects

Exemplars for historical documents and maps projects
Scan released last year at the Internet Archive
New upload of an 1887 address book for Riga, Latvia (800 MB).
Example map: a high-resolution map of Africa (created 1774), approx. 373 megapixels in size, similar to a 5-foot-square printed poster.

I had a look back at the IA upload project and realized that none of these scripts run because of Python, Pywikibot and internetarchive changes. It turns out I find it quite difficult to remember almost anything about these projects too. So don't be surprised if I'm testing it out and there are flaws. I'll do my best to repair any oddities. I'm seeing '+99' notices on my account which don't clear; it could be a wm bug for big numbers, so I might not notice changes. It's not deliberate. For the moment please don't ask me to take on large projects, I'd rather pace myself at 'slow'. Fæ (talk) 13:07, 2 June 2026 (UTC)

Thanks, and welcome back! — 🇺🇦Jeff G. ツ please ping or talk to me🇺🇦 08:11, 3 June 2026 (UTC)Reply

Another new "feature" of the Wikimedia API is rate limiting. Added a couple of slow-down precautions, including slapping down multiprocessing, but it's definitely dogging uploads despite the account being visible to the API as an established user. It may be necessary to revisit the throttling system rather than bumping into it. It's sad this creates extra work for volunteers.

For the first time the queries found restricted items at Internet Archive like 1826histoirenumismatiquedelare, where the Washington University appears to be claiming copyright in a 200 year old publication. Good grief, I hope this is not a trend that IA is tolerating. --Fæ (talk) 07:09, 5 June 2026 (UTC)Reply

I have updated a process for finding the text of Google cover pages in the recent uploads. The 'pending' queue is at Category:OCR detected cover page. Yet to update the page removal process. There's no hurry and I will get to this slowly. --Fæ (talk) 06:29, 8 June 2026 (UTC)Reply

Note that larger files, like the 800 MB PDF transcluded, are now possible thanks to limits changing. The larger files might cause things to break, sometimes in predictable ways like the SHA check from MediaWiki taking some time to process and be available on the system. Behind the scenes, these uploads do not behave well and invariably fail to report back that they are successfully uploaded. Fæ (talk) 14:26, 17 June 2026 (UTC)

Extra note about maps: though there is a fairly quick upload of jpegs of maps to Category:David_Rumsey_Historical_Map_Collection, which may take a couple of days, there is a much, much slower process to recover high resolution versions. A rough estimate is that this could take 2 or perhaps 4 months, due more to the new WMF API throttling limits than to processing time. --Fæ (talk) 12:44, 20 June 2026 (UTC)

@Fæ: Couldn't you technically request permission for the upscale to run on Faebot so it is exempt from the API rate limits? --Nintendofan885^{T&Cs apply} 13:30, 20 June 2026 (UTC)Reply

From what is available to easily read about it, probably not. Faebot has a bot flag, but a brief experiment shows that the API throttle is keyed on IP address. Consequently the map uploads mentioned here are being forced to sleep 90s before the upload is allowed to complete, and then there's another 90s before the file page can be updated with formatting or a category. These events are in a multiprocessing queue, but it makes no difference to the API whether it's a bot or not, or whether Pywikibot can set a bot flag for the action.

Keep in mind, I'm trying not to spend hours at a keyboard, so not looking for complex extra volunteer work or phabricator requests for a fairly modest 50,000 files. Fæ (talk) 13:41, 20 June 2026 (UTC)Reply
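Note for readers: the forced 90s waits described above amount to enforcing a minimum gap between API writes. A minimal sketch of that idea follows, assuming a single shared throttle object per IP address; the class name MinIntervalThrottle and the injectable clock and sleep arguments are illustrative and are not taken from Faebot or Pywikibot.

```python
import time


class MinIntervalThrottle:
    """Block so that consecutive wait() calls are at least `interval` seconds apart."""

    def __init__(self, interval, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self._clock = clock   # injectable so the logic can be tested without waiting
        self._sleep = sleep
        self._last = None     # time of the previous permitted action

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.interval - (now - self._last)
            if remaining > 0:
                # Too soon after the previous action: sleep off the difference.
                self._sleep(remaining)
                now = self._clock()
        self._last = now
        return now
```

Calling wait() immediately before each upload and each file page edit would space the writes out; whether 90 seconds is the right interval is an assumption taken from the comment above, not a documented limit.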

Faebot is now running on the somewhat revised Toolforge, so the mass link fixing has been moved to the WMF servers; that at least is 6x faster as a result and has reduced local processing by about 7%. Other stuff might move there eventually. Fæ (talk) 14:50, 21 June 2026 (UTC)

Fresh eye this morning, and the map upgrading is running around 4x faster locally and probably is obeying the new WMF API throttling rules. The run might be 6 weeks rather than 4 months, which is fine. Fæ (talk) 06:39, 22 June 2026 (UTC)Reply

Welcome back

As you probably have worked out, I've been following this talk page, checking whether DRs of your uploads are valid or not, and responding to the DR with a rationale if I think something should be kept. May I assume you are sufficiently "back" that I can let go of monitoring that?

Welcome back, in any case. - Jmabel ! talk 14:06, 3 June 2026 (UTC)Reply

If you wish, though I'll probably continue to ignore most DRs and let others chip in. After your first million images, the DRs that matter are ones that are a meaningful case study for thousands of others and could be collated with some automated method.

I have noticed your patient work and very much appreciate it! Fæ (talk) 15:21, 3 June 2026 (UTC)Reply

Yes, welcome back! So glad to see this. Krok6kola (talk) 16:06, 3 June 2026 (UTC)Reply
It's been almost 5 years. I hope you had a good time! Welcome back. - Alexis Jazz ^{ping plz} 10:03, 5 June 2026 (UTC)Reply

@Fæ Yes, happy to see you here again :) --PantheraLeo1359531 😺 (talk) 09:32, 8 June 2026 (UTC)Reply

Good to see you again, also from me. -- Deadstar (msg) 10:27, 9 June 2026 (UTC)Reply

Marvellous to see you editing again. Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 15:15, 9 June 2026 (UTC)Reply

IA uploads

Hi, I see that you have restarted uploading PDFs from Internet Archive. FYI, I created a bot which fixed the 21,000+ files that you uploaded and that were in Category:Book scans with Google Books cover sheets (to remove). Could you please add files with Google cover pages to this category, so that the cover pages can be removed?

I think that all PDF files without any meaningful category are not very useful. If you can't add a category, you should at the very least add them to a "PDF files needing category" category. Thanks, Yann (talk) 19:53, 8 June 2026 (UTC)

See Commons:IA_books#Automatic_detection_and_deletion_of_cover_pages and the section above on this talk page where I mention OCR. The use of pytesseract is slightly updated to be more accurate and has already detected many pdfs with secondary pages that were missed. The idea is to automate secondary detection when stripping each cover, as some have an English google cover followed by a German one. This is in testing, so a complete run and automated stripping are not ready yet due to false positives and it would be nice not to rely on manual double checking.

Automated categorization needs careful handling due to IA inconsistencies. Though a collection may have standard 'tags', this varies wildly. This is discussed at the IA book project page and it makes sense to revisit those same formulas to both encompass new releases at IA to those collections and think of including more. The IA uploads have several hidden categories, like 'Old books from American libraries' which acts as a default uncategorized category, and as anyone can use Petscan to list those with no visible categories, it seemed unnecessary. As you suggest it, we can add Category:Uncategorized images but it may lead to complaints about burdening others by appearing to leave hundreds of thousands in the queue, or duplicate categories being created because the categories available for old books are already complex.

None of this will be fast, weeks not days, time at the keyboard has to be in small doses these days. Fæ (talk) 08:22, 9 June 2026 (UTC)Reply
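Note for readers: the secondary cover page check described above can be sketched as a text match over the OCR output of the first few pages. This is a minimal sketch, not the script in use. It assumes the per-page text has already been produced (for example by pytesseract's image_to_string), the function names are illustrative, and the only marker phrase included is the English one quoted further down this page; markers for other languages would need to be supplied by the caller.

```python
import re

# English marker phrase quoted elsewhere on this page; others are left to the caller.
GOOGLE_COVER_MARKERS = (
    "This is a digital copy of a book that was preserved for generations",
)


def normalise(text):
    """Collapse whitespace and lowercase, so OCR line breaks do not split a phrase."""
    return re.sub(r"\s+", " ", text).strip().lower()


def is_cover_page(ocr_text, markers=GOOGLE_COVER_MARKERS):
    """True if any marker phrase occurs in the OCR text of one page."""
    flat = normalise(ocr_text)
    return any(normalise(marker) in flat for marker in markers)


def count_leading_cover_pages(page_texts, markers=GOOGLE_COVER_MARKERS, max_pages=3):
    """Count consecutive cover pages at the start of a scan, stopping at the first miss."""
    count = 0
    for text in page_texts[:max_pages]:
        if not is_cover_page(text, markers):
            break
        count += 1
    return count
```

Exact substring matching will miss pages where the OCR garbles the phrase, so the false positives and negatives mentioned above would still need a tolerance step or manual review.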

Hi, Thanks for your answer. Yes, cover pages detection is a bit tricky. I did have to make a lot of trial-and-error tests. My script now detects cover pages properly for most cases, except some books in Portuguese, Scandinavian, and Slavic languages. In these cases, I put back the file in the category, and the second page is then removed. Hopefully this only concerns a few books (maybe around 1%). Please see Special:ListFiles/YannBot.

For the categories, there is Category:IA uploads needing categories. Regards, Yann (talk) 09:31, 9 June 2026 (UTC)Reply

BTW, I don't know if you have noticed, there is a nasty bug affecting PDF files for several months: phab:T420341. You may want to add your opinion there. Thanks, Yann (talk) 16:23, 9 June 2026 (UTC)Reply

The zero sized PDF was a long running problem. Were one to be debugging it, a start could be pulling the list of all pdfs with zero size matches, or pageid==0. At the time the presumption was a fundamental problem with the way the WMF servers were using off-the-shelf pdf handling and even if WMF devs looked at it, the solution might be a dirty work around rather than a rewrite of anything. The numbers involved were so small in proportion that it would be okay to shove these in a housekeeping category for manual attempts at reuploading or recoding the source pdf. --Fæ (talk) 08:28, 10 June 2026 (UTC)

This issue is never with the files themselves. They have changed the configuration, so now purging always displays them properly. I did a kind of off-the-shelf survey among the 20,000+ files I have processed. There is no discernible pattern, but it appears more often with big files than smaller ones. Yann (talk) 08:36, 10 June 2026 (UTC)Reply

Pleasingly, the Google cover page detection is going well behind the scenes. There's no rush, trying to stick to my keyboard time limits, so making sure this can run using the equivalent of a local ramdisk for processing and reuploading and preferably with one touch on the files. Seeing German secondary pages fairly rarely and want to ensure there are not other variations to detect. It might make sense to invert the teapot and prove there's no secondary cover pages and process those first anyway. --Fæ (talk) 09:10, 10 June 2026 (UTC)Reply

In related questions: perhaps you have useful insights for this topic Commons:Village_pump/Proposals#Leveraging_SD_for_book_categorizations_(PDFs,_déjavus,_categories,_). I wouldn't know where to start. -- Deadstar (msg) 10:36, 10 June 2026 (UTC)Reply

See Commons:Bots/Work_requests#Book_renaming and the section on categorization or diffusion projects at COM:IA books. The biggest impact would be collection (often specialist library) related categories, though author might be practical if the variations in name could be corrected for by a bio database or wikidata. Taxonomies are a rabbit hole worth avoiding, so trying to extract topics and using those for categories might be a mistake unless there are specific obvious needs with large results, like medical texts of the 19th century. This might be too big a project for automation right now, so avoiding giving thoughts in the discussion, but it is of interest longer term. --Fæ (talk) 11:50, 10 June 2026 (UTC)Reply

@Yann: Having run into the 0x0 bug for trimmed pdfs like this one (it looks okay now, but after overwriting this was returning a 0x0 size), it is correct that it can be fixed manually using the url based purge parameter. However when automated by using a pywikibot call, this returns the error "Fæ" does not have required user right "purge"; which seems weird when nothing stops this being done manually. Is this a right that can be added or readded? It could be bundled with something else as it does not appear in the list of groups. Meanwhile that job is paused out of caution. --Fæ (talk) 18:56, 10 June 2026 (UTC)

OMG, this is a complex area. The 'purge' right is implicit, but the invocation within pywikibot seems the issue. Don't worry about it, but it shows the 0x0 thing is muddy. Fæ (talk) 19:08, 10 June 2026 (UTC)Reply

I do it with my bot with curl -s -d "action=purge" -d "titles=File:$filename" -d "format=json" "https://commons.wikimedia.org/w/api.php". Yann (talk) 19:16, 10 June 2026 (UTC)Reply

Just in case anyone else runs into this, in pywikibot speak it looks like:

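import pywikibot.data.api

# site and file_page are assumed to be defined already: a logged-in
# Commons Site object and the FilePage that was just overwritten.
purge_req = pywikibot.data.api.Request(
    site=site,
    parameters={
        'action': 'purge',
        'titles': file_page.title(),
        'forcelinkupdate': True,  # also refresh the links tables
    }
)
purge_req.submit()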

This fix seems to work and I don't have an insight into why large pdfs might be unaffected. It's annoying that the size verification loop has to be on the (volunteer) uploader side, rather than reliably being immediately flagged and repaired on the (WMF) server side. It does seem the error is rare and has not appeared on pdfs even over 800 pages. --Fæ (talk) 19:57, 10 June 2026 (UTC)
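Note for readers: the uploader-side size check mentioned above could look something like the following. It is a sketch under assumptions: it takes the already-decoded JSON of an action=query request with prop=imageinfo and iiprop=size in the default response layout (pages keyed by page id), and the function name zero_size_titles is illustrative. Fetching the response and issuing the purge are left out.

```python
def zero_size_titles(api_response):
    """Return titles whose first imageinfo entry reports a zero or missing width or height."""
    pages = api_response.get("query", {}).get("pages", {})
    suspect = []
    for page in pages.values():
        # imageinfo is a list of revisions; only the current one is inspected here.
        info = (page.get("imageinfo") or [{}])[0]
        if not info.get("width") or not info.get("height"):
            suspect.append(page.get("title"))
    return sorted(suspect)
```

Files with no imageinfo at all are also reported, which is deliberate in this sketch: they are as much in need of a second look as a 0x0 result. The titles returned could then be fed to a purge request and rechecked after a pause.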

Mathematics Journals

Hi, Do you have in your list of future uploads these journals: Annals of Mathematics, American Journal of Mathematics? If not, could you please add them? Thanks, Yann (talk) 07:39, 11 June 2026 (UTC)Reply

Maybe, it looks like some thought on copyright is needed. In the Annals there are multiple index only prints, which arguably could be uncopyrightable, so best to be cautious anyway, but the scans of content with papers are likely to have copyright with mathematicians either still alive or having died within the last 70 years. Only flicked through that set of digitized microfiche but not found a copyright statement and the uploader has not made a statement about copyright being expired or similar. Taking a literal interpretation might mean only taking volumes up to 1920? Perhaps a later date could be agreed on as uncontroversial in the absence of copyright statements. Nice to see some names I recognize from my time studying last century.

So long as mentioned here, it probably will not be forgotten, though the uploads might be after the current update uploads, retrospective renames and coverpage housekeeping. --Fæ (talk) 07:56, 11 June 2026 (UTC)Reply

Category:Annals of Mathematics, Category:American Journal of Mathematics, selected before 1931 as published in USA. Not sure what the best parent cat would be, so left that for you to think about. For some reason the links in the IA description have not been beautified, but as this is a small collection (300 odd matching files), parked that for now. Rusty, forgotten entirely how this worked until reading my own breadcrumbs.

Let me know if any are missing. For some reason the location test was getting flagged as non-US and not sure if the root cause could be bad data on the IA side, or a bug on my side. Fæ (talk) 08:17, 12 June 2026 (UTC)Reply

Thanks for uploading these. FYI: s:American Journal of Mathematics. Yes, I have found more: [1]. Also volumes after 1900 should be PD-US-expired rather than PD-old-100-expired. Idem for s:Annals of Mathematics: [2]. Yann (talk) 15:19, 14 June 2026 (UTC)Reply

There was a hard coding of ignoring everything after 1925, in a super precautionary way. Now set to 1930. 18th C. is so much quieter to handle. Fæ (talk) 19:41, 14 June 2026 (UTC)Reply

The Cambridge History of English Literature

Hi, Could you please upload all these? You can add them to Category:The Cambridge History of English Literature. We already have some books, but not a complete set in good quality. Thanks, Yann (talk) 15:27, 17 June 2026 (UTC)Reply

The advanced IA query is inaccurate as there is no collection defined. There are a couple of mismatches that can be moved to the old American books general category unless there is some obvious better one. Fæ (talk) 19:49, 17 June 2026 (UTC)Reply

Hi, I am not sure I understand. You mean you only upload files from US libraries? But there are books from different libraries there: California, Cornell, Princeton, etc. Hopefully there will be a complete good set among them. Do you intend to upload from other sources (Internet Archive, Toronto?). Thanks, Yann (talk) 20:16, 17 June 2026 (UTC)Reply

The CHEL is published by CUP though as it was in New York as well as London the US license holds, probably. There are other things that matched in the search given above but these are not specifically the CHEL. There was no limiting of sources as it was set to find any matches in any collection to (The Cambridge History of English Literature) date:[1900 TO 1930]. CHEL was published from 1907. If there are publications definitely missing, let me know and a second look might work out a different search approach. Fæ (talk) 05:28, 18 June 2026 (UTC)Reply
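Note for readers: the search described above can be written as a small query builder. This is only a sketch; the function name build_ia_query and the optional collection argument are illustrative, and the query syntax simply reproduces the string quoted in the comment above. The resulting string could presumably be passed to the internetarchive library's search_items, but that call is not shown or tested here.

```python
def build_ia_query(title, first_year, last_year, collection=None):
    """Build an advanced-search string of the form '(title) date:[first TO last]'."""
    if first_year > last_year:
        raise ValueError("first_year must not be after last_year")
    query = f"({title}) date:[{first_year} TO {last_year}]"
    if collection:
        # Narrowing to one collection reduces the stray matches mentioned above.
        query += f" AND collection:{collection}"
    return query
```

For example, build_ia_query("The Cambridge History of English Literature", 1900, 1930) reproduces the query quoted above; a start year of 1907 would match the first publication date more tightly.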

File:Hurricane Ivan, Natural Hazards DVIDS726877.jpg

Duplicate but with a different DVIDS ID. Not sure how we should best handle the merge. - Jmabel ! talk 00:46, 15 June 2026 (UTC)Reply

Category:Faebot identified duplicates

Category:Images from DoD uploaded by Fæ (duplicate)

Unfortunately it is a rabbit hole. As the SHA values do not match, there's no mediawiki way to find these 'almost but not quite' digitally identical copies. A lot of time was spent automating image hashing, but it is an expensive process and does not eliminate human intervention to decide the best way to merge.

Yes, the backlog really is over 20,000 files and 9 years old. However it remains a low priority issue compared to other enigmatic Commons puzzles.

WRT this example, they were released by the military one week apart according to the metadata. Picking the first one officially released on their system seems a fine choice considering neither is currently in use. Fæ (talk) 02:00, 15 June 2026 (UTC)Reply

Semi-related: do you know about Commons:International Standard Content Code? Might be relevant for some of the work you do. - Jmabel ! talk 19:05, 15 June 2026 (UTC)Reply

This is interesting, thank you. As it seems to work as a Hamming space, and so gives a "distance", this is presumably a type of image hash, but it seems to be used as a fingerprint, which may not be guaranteed with other methods. Looking at the database website though, it might be on an indefinite beta status, but it is worth a bit of research and reading for my education. Fæ (talk) 08:13, 16 June 2026 (UTC)
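Note for readers: the idea of a hash that gives a "distance" can be illustrated with a difference hash and a Hamming distance. This is a generic sketch, not ISCC and not the hashing Faebot used. It assumes the image has already been reduced to a small grid of grayscale values, and the function names are illustrative.

```python
def dhash_bits(gray_rows):
    """Difference hash: one bit per horizontal neighbour pair, 1 if the left pixel is brighter."""
    bits = 0
    for row in gray_rows:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming(a, b):
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")
```

Because only the relative brightness of neighbours is recorded, a uniformly brighter copy of the same grid hashes identically, while a SHA digest of the two files would differ completely. A small Hamming distance suggests a near-duplicate, but a human still has to decide which copy to keep.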

Thank you

I just wanted to say, "Thank you," for all of the images you have made accessible. I count on them to complete family history books for my clients and each time I see "Fæ" pop up, I send a bit of gratitude your way. With full attribution to you, your time and effort show up in the analog world, too. ~2026-35350-40 (talk) 14:31, 17 June 2026 (UTC)Reply

File:Arthur and Fritz Kahn Collection 1889-1932 (20345633841).jpg

Welcome back!

Looks like you were gone for a bit. Are you working on any new large chunks of uploads? RAN (talk) 21:44, 17 June 2026 (UTC)Reply

Slightly updated the internetarchive PDF uploading, so these are being refreshed:

Category:Scans from University of Toronto

Category:Old books from American Libraries - this in particular needs contributors to surf the content and imagine categories which can better break up the 150,000 books.

Category:David Rumsey Historical Map Collection - new content but a few thousand jpegs of interesting maps. Some review will be needed as the source library curators may have presumed their collections are public domain by age when it can be more complicated.

Category:Images dezoomed by Fæ is being ever so slowly populated by overwriting the small versions from IA. This is probably why this was not done years ago. Hopefully the maps will not be moved before this dezooming task finishes in several days' time.

Category:Books in the Prelinger Library

Category:Genealogy books from the Internet Archive - a modest number of IA books under this collection, but some are eye-wateringly large; one has been having an upload attempt of over an hour for a 1,781.2 MB PDF. It may continue to fail, though several that succeeded are at the 1 GB size.

Quietly restarted the slow job of trimming Google cover pages off pdfs which is harder and slower on local processing. Also been experimenting with nccommons, behind the scenes that's been the ghastly issue of getting timedtext to work on videos and battling anti-bot tools for a site that in theory wants you to take their publications.

There is plenty of gnomic work, so for the moment not looking for anything controversial or technically too challenging. Little projects that do not mean sitting at a keyboard for more than an hour at a time are best. Fæ (talk) 05:20, 18 June 2026 (UTC)Reply

Are we ever going to get something like "search inside this book" when you are at the pdf page, like we get at Google Books? The only time I see the ASCII text is when I do a general search for all of Commons and I see a snippet, or when the book is transcribed at Wikisource. --RAN (talk) 18:59, 18 June 2026 (UTC)
The search can be used to find PDF contents matches and is a quick way to find prospective matches if doing copyright or cover page statement testing. For example ("This is a digital copy of a book that was preserved for generations" filetype:pdf intitle:IA) will find PDFs with Google cover pages. However there's no way of using MediaWiki to tell you what pages the text is on, or do anything else really smart with it. Well, no easy designed way; vaguely remember doing something smarter but can't remember how that worked.

A solution for the IA uploads alone would be to navigate from the commons IA upload, back to IA and then interrogate either the djvu.txt file there, or the xml version. If one needed to filter through, say, 100,000 files, that would be way more efficient than having to download each one and run it through another OCR. Fæ (talk) 19:13, 18 June 2026 (UTC)Reply
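Note for readers: the route back from a Commons upload to the text at IA can be sketched as below. The "(IA identifier)" filename suffix is taken from the file names on this page; the download URL layout for the djvu.txt derivative and all function names are assumptions for illustration. How the text is split into pages is left to the caller, as that depends on which derivative is used.

```python
import re

# Matches the "(IA identifier)" suffix used in the upload file names on this page.
_IA_SUFFIX = re.compile(r"\(IA ([^()\s]+)\)\.(?:pdf|djvu)$", re.IGNORECASE)


def ia_identifier(commons_title):
    """Extract the Internet Archive identifier from a Commons file title, or None."""
    match = _IA_SUFFIX.search(commons_title.strip())
    return match.group(1) if match else None


def djvu_text_url(identifier):
    """Assumed URL layout for the plain-text OCR derivative at IA."""
    return f"https://archive.org/download/{identifier}/{identifier}_djvu.txt"


def pages_containing(page_texts, phrase):
    """1-based page numbers whose text contains the phrase, ignoring case."""
    needle = phrase.lower()
    return [n for n, text in enumerate(page_texts, start=1) if needle in text.lower()]
```

This avoids downloading each PDF and re-running OCR locally, which is the efficiency point made above; it does rely on IA's own OCR being good enough for the search in question.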

File:Acts of Assembly, passed in the island of Jamaica; from 1681, to 1737, inclusive. Fleuron T144016-27.png

File:Acts of Assembly, passed in the island of Jamaica; from 1681, to 1737, inclusive. Fleuron T144016-27.png
Commons:Deletion requests/File:Acts of Assembly, passed in the island of Jamaica; from 1681, to 1737, inclusive. Fleuron T144016-27.png ~2026-35527-52 (talk) 00:10, 19 June 2026 (UTC)Reply

File:Amber Cousino won a rice cooker during the raffle at the Live, Laugh and Learn event at the Del Mar Beach Resort, Camp Pendleton, Calif., April 19, 2013 130419-M-LD192-226.jpg

File:Amber Cousino won a rice cooker during the raffle at the Live, Laugh and Learn event at the Del Mar Beach Resort, Camp Pendleton, Calif., April 19, 2013 130419-M-LD192-226.jpg
Commons:Deletion requests/File:Amber Cousino won a rice cooker during the raffle at the Live, Laugh and Learn event at the Del Mar Beach Resort, Camp Pendleton, Calif., April 19, 2013 130419-M-LD192-226.jpg - Alexis Jazz ^{ping plz} 09:59, 19 June 2026 (UTC)Reply

File:The art of midwifery improv'd Fleuron T115049-2.png

File:Mapillary 20H15M52S000 Collier County us (gAS1T8MW4tmO7uqzG6kKvr, 1165545434717956) (cubanoboi) 2024-08-10.jpg

This seems to be a part of a mass upload of Mapillary images.

As these were being uploaded I'd been adding | other_fields = {{information field|Academic context|Geospatial street-level imagery from external site.}}, so these were not seen as random indiscriminate uploads of 'random' locations and verges.

However the uploads seem to be faster than I can cope with manually. Any chance of seeing if something like Faebot is able to do the additions as the files are uploaded? Thanks. ShakespeareFan00 (talk) 20:00, 20 June 2026 (UTC)Reply

@DaxServer: Would this be a suitable minor addition to the curator, or is it left for others to tack it on if wanted? Fæ (talk) 20:24, 20 June 2026 (UTC)Reply

@Fæ @ShakespeareFan00

Done File:Mapillary (792440271635475, tulzukst7vufhdo1e4z60f) (b4sti4n) 2017-06-24 13H44M25S000.jpg Thanks! -- DaxServer (talk) 11:55, 21 June 2026 (UTC)Reply

For this upload, thanks. The hope was that the addition could be made to related uploads. These are in scope, but given some patrollers... I felt it was reasonable to indicate why they were in fact not just random verge-side images. ShakespeareFan00 (talk) 14:25, 21 June 2026 (UTC)

File:100th Anniversary PNE Parade (4913329591).jpg

File:The story of the Malakand field force - an episode of frontier war (IA storyofmalakandf00chur).pdf

File:The story of the Malakand field force - an episode of frontier war (IA storyofmalakandf00chur).pdf
Commons:Deletion requests/File:The story of the Malakand field force - an episode of frontier war (IA storyofmalakandf00chur).pdf Nighfidelity (talk) 12:51, 24 June 2026 (UTC)Reply