Increased file server capacity
Author | Message |
---|---|
Joined: 15 Jul 05 Posts: 248 Credit: 5,974,599 RAC: 0 |
Since Tuesday evening, we have had intermittent upload failures caused by a combination of two things: a large number of new BOINC hosts joined, coincidentally, at the same time as larger ATLAS tasks were introduced. Our file server capacity has been increased, and backlogged tasks waiting for upload should upload again soon. (Please refer to the ATLAS application and Number crunching forums for more details.) |
Joined: 15 Jul 05 Posts: 248 Credit: 5,974,599 RAC: 0 |
We will temporarily stop all our file servers to allow a maintenance operation. All uploads will fail for a while, with a different message. |
Joined: 18 Dec 15 Posts: 1814 Credit: 118,477,824 RAC: 30,457 |
It seems the situation with failing downloads and uploads has become rather critical by now. Regardless of which LHC sub-project I'd like to run on my 3 PCs, nothing works at all, or it only works after numerous retries over hours :-( It seems that some thorough clean-up should be made at CERN, once people are back after the holidays. |
Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0 |
Once I excluded ATLAS, I have had no problems with uploads or downloads on CMS, LHCb or Theory. The only issue seems to be the usual failures on CMS from lack of work, but that is not much of a problem. I await notice of the new hardware upgrade, hopefully next month. |
Joined: 29 Aug 05 Posts: 1060 Credit: 7,737,455 RAC: 1,317 |
|
Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0 |
CMS works often enough (in terms of time) to keep me busy. And best wishes on your trip to Hobart if you go that route. |
Joined: 29 Aug 05 Posts: 1060 Credit: 7,737,455 RAC: 1,317 |
|
Joined: 18 Dec 15 Posts: 1814 Credit: 118,477,824 RAC: 30,457 |
"we seem to have CMS jobs in the pipeline again." Thanks for the good news, Ivan. However, when trying to download, it says "no tasks available for CMS simulation" :-( even though the Project Status Page shows the usual 200 "unsent" tasks. |
Joined: 18 Dec 15 Posts: 1814 Credit: 118,477,824 RAC: 30,457 |
"when trying to download, it says 'no tasks available for CMS simulation' :-(" Finally, after almost endless retries, CMS tasks came in. I've experienced the same with all other sub-projects as well, not to mention ATLAS, for which downloads and uploads hardly work at all any more. Obviously, the whole I/O operation over there is on the verge of breakdown :-( |
Joined: 23 Apr 10 Posts: 5 Credit: 1,349,240 RAC: 0 |
Hello, still a LOT of upload problems :( Best, Phil1966 |
Joined: 29 Aug 05 Posts: 1060 Credit: 7,737,455 RAC: 1,317 |
|
Joined: 27 Sep 08 Posts: 847 Credit: 691,678,194 RAC: 114,217 |
I still have 8 ATLAS and some SixTrack; I gave up running ATLAS, so it's probably lower than it could be. |
Joined: 16 Sep 17 Posts: 100 Credit: 1,618,469 RAC: 0 |
Yes, the problem persists. Additionally the project has been hard to reach in the last 24 hours. I keep cancelling short tasks in the hope of alleviating the file fragment issue, but my reliability/prio has already taken a huge hit from buffering work over the holidays. EDIT: I dropped ATLAS as well. |
Joined: 15 Jun 08 Posts: 2534 Credit: 253,839,093 RAC: 37,502 |
SixTrack: Didn't get any for more than 24h now, although my hosts periodically request them.
ATLAS: Stopped it, as it's very hard to get a task and the upload/reporting also needs several attempts.
CMS: RTS queue is still empty :-(
LHCb: running, but most WUs are very short; this needs to be explained, but unfortunately there has been no comment from the project team for months (guess they need an Ivan clone).
Theory: running, nothing to complain about.
<edit>
Sorry, Theory also has something to complain about: "207 (0x000000CF) EXIT_NO_SUB_TASKS" as well as some:
Guest Log: [DEBUG] Testing connection to Condor server on port 9618
Guest Log: [DEBUG] nc: connect to vccondor01.cern.ch port 9618 (tcp) timed out: Operation now in progress
Guest Log: [DEBUG] 1
Guest Log: [ERROR] Could not connect to Condor server on port 9618
Guest Log: [INFO] Shutting Down.
</edit> |
Joined: 28 Sep 04 Posts: 6 Credit: 15,285,651 RAC: 3,940 |
I have 5 transfers that won't upload. 2 of these have gone past deadline. All 5 of them start with LHC_2015_LHC_2015_260........ |
Joined: 15 Nov 14 Posts: 602 Credit: 24,371,321 RAC: 0 |
My uploads and downloads have been fine on LHCb, CMS and Theory for several days. But I set up a separate machine (without VirtualBox) to handle native ATLAS and sixtrack, and all I get are "no new tasks available" (Ubuntu 16.04). |
Joined: 28 Sep 04 Posts: 728 Credit: 49,036,265 RAC: 27,181 |
Yes, there are a few SixTrack uploads that have been uploading for more than a week now. They have gone past their deadline, but the WUs are still visible on the server and haven't reached their quorum. Only about half of the finished tasks can be uploaded in one go; the others need multiple retries. With ATLAS the situation is about the same. None of the pending uploads are past their deadlines, but considering how many you can crunch in one day, there are more pending uploads. They usually upload in a couple of days. The downloads are also having trouble, but currently none are pending; only one task is waiting to be crunched. I have managed to download about 7 ATLAS tasks per day for two hosts. At the moment I don't run any other LHC subprojects. |
Joined: 28 Sep 04 Posts: 6 Credit: 15,285,651 RAC: 3,940 |
All of mine are failing with this error: LHC@home 1/5/2018 9:47:26 AM [error] Error reported by file upload server: [LHC_2015_LHC_2015_260_BOINC_errors__57__s__62.31_60.32__4.3_4.4__5__1.5_1_sixvf_boinc199598_1_r1950326673_0] locked by file_upload_handler PID=-1 |
Joined: 29 Feb 16 Posts: 157 Credit: 2,659,975 RAC: 0 |
I experience similar issues with sixtrack tasks:
1/6/2018 10:04:08 AM | | Project communication failed: attempting access to reference site
1/6/2018 10:04:08 AM | LHC@home | Temporarily failed download of workspace1_hl13_collision_scan_62.3275_60.3100_chrom_15_oct_-300_B4__21__s__62.31_60.32__4_6__5__15_1_sixvf_boinc1784.zip: transient HTTP error
1/6/2018 10:04:08 AM | LHC@home | Backing off 00:04:49 on download of workspace1_hl13_collision_scan_62.3275_60.3100_chrom_15_oct_-300_B4__21__s__62.31_60.32__4_6__5__15_1_sixvf_boinc1784.zip
1/6/2018 10:04:11 AM | | Internet access OK - project servers may be temporarily down.
1/6/2018 10:06:17 AM | LHC@home | Temporarily failed download of workspace1_hl13_collision_scan_62.3275_60.3100_chrom_15_oct_-300_B4__21__s__62.31_60.32__0_2__5__75_1_sixvf_boinc1770.zip: transient HTTP error
1/6/2018 10:06:17 AM | LHC@home | Backing off 00:04:50 on download of workspace1_hl13_collision_scan_62.3275_60.3100_chrom_15_oct_-300_B4__21__s__62.31_60.32__0_2__5__75_1_sixvf_boinc1770.zip
1/6/2018 10:06:18 AM | | Project communication failed: attempting access to reference site
1/6/2018 10:06:20 AM | | Internet access OK - project servers may be temporarily down.
As the problem is on the BOINC server side, I guess that everything will be fixed once our IT experts are back from holidays (please see https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4541&postid=33453). I think that, as far as downloading new tasks is concerned, one thing that we can do as crunchers is to increase the amount of stored work. |
Joined: 28 Sep 04 Posts: 728 Credit: 49,036,265 RAC: 27,181 |
Yep, there is not much we can do but keep on crunching what we can. The problem with downloads seems to be intermittent. The eventual problem will come when the number of stuck uploads reaches 2 x the number of CPU cores; then BOINC will stop requesting new tasks for that project. |
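
For anyone wanting to check how close a host is to the limit mentioned in the last post, here is a rough Python sketch of the "2 x CPU cores" rule exactly as it is described there. It is not taken from the BOINC client source: the function names, the `factor` parameter and the use of "reaches" (>=) as the comparison are assumptions made for illustration, and the real client may compare slightly differently.

```python
# Sketch of the rule described in the post above: once the number of stuck
# uploads reaches 2 x the number of CPU cores, BOINC stops requesting new
# tasks for the project. All names here are hypothetical, not BOINC APIs.


def uploads_block_work_fetch(stuck_uploads: int, cpu_cores: int, factor: int = 2) -> bool:
    """Return True if stuck uploads have reached factor * cpu_cores."""
    if cpu_cores < 1:
        raise ValueError("cpu_cores must be at least 1")
    if stuck_uploads < 0:
        raise ValueError("stuck_uploads cannot be negative")
    return stuck_uploads >= factor * cpu_cores


def uploads_left_before_block(stuck_uploads: int, cpu_cores: int, factor: int = 2) -> int:
    """Return how many more stuck uploads fit before the limit is reached."""
    if cpu_cores < 1:
        raise ValueError("cpu_cores must be at least 1")
    if stuck_uploads < 0:
        raise ValueError("stuck_uploads cannot be negative")
    return max(0, factor * cpu_cores - stuck_uploads)


if __name__ == "__main__":
    # Example: a 4-core host with 5 uploads stuck in the transfer queue.
    cores, stuck = 4, 5
    print("work fetch blocked:", uploads_block_work_fetch(stuck, cores))
    print("uploads left before block:", uploads_left_before_block(stuck, cores))
```

Under these assumptions, a 4-core host would stop asking for new work at 8 stuck uploads, so dropping the sub-project with the worst upload failures (as several people here did with ATLAS) keeps the other sub-projects fed for longer.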
©2024 CERN