Message boards : ATLAS application : Just more of the same failures
greg_be

Joined: 28 Dec 08
Posts: 318
Credit: 4,235,262
RAC: 3,182
Message 41775 - Posted: 29 Feb 2020, 18:29:31 UTC - in response to Message 41773.  

I doubt this is caused by LHC/ATLAS.
Hundreds of wingmen get it running even on computers with a similar configuration or much less RAM.
I guess ATLAS is just a victim of a local issue on that specific machine.

Could be caused by hardware, such as a faulty RAM module, or by temperature or power supply issues (2 GPUs).
Could be caused by BIOS errors/settings.
Could be caused by a piece of software or a combination of different programs.

It is a lot of work to test all of that systematically.



Yes, I know my wingmen run it OK. That is what is so frustrating.
However, the cleaning was much needed. Lots of leftovers and other problems were identified and solved.
I put app_config.xml back in play. For some reason my system likes this, so if it works, I'll leave it.
Setting memory to 6600 MB seems to solve the stalling problem. I'll let ATLAS run alone for 24 hours and then slowly add back CMS and Theory.

I guess I need to do a deep clean at least every 2 weeks. I have been a bit lazy about this lately, being off work and all.

Thanks for all the help. I will come back to this thread for reference if something goes wrong.
Keep your fingers and toes crossed that all this solves the problem.
I would love to get a success rate that exceeds the failure rate for once.
ID: 41775
greg_be
Joined: 28 Dec 08
Posts: 318
Credit: 4,235,262
RAC: 3,182
Message 41776 - Posted: 29 Feb 2020, 18:49:13 UTC - in response to Message 41774.  

29 Feb 2020, 9:42:08 UTC Completed and validated 7,839.57 27,598.06 229.54 ATLAS Simulation v2.00 (vbox64_mt_mcore_atlas) windows_x86_64
29 Feb 2020, 6:21:23 UTC Completed and validated 6,705.57 23,104.97 194.11 ATLAS Simulation v2.00 (vbox64_mt_mcore_atlas) windows_x86_64
29 Feb 2020, 2:48:15 UTC Completed and validated 13,220.30 46,301.91 390.31 ATLAS Simulation v2.00 (vbox64_mt_mcore_atlas) windows_x86_64
These three tasks finished successfully today with a HITS file, so something special must have happened after that time.
Can you see in the Windows log what went wrong after 9:42 UTC?



First error at that time:
- Provider

[ Name] Microsoft-Windows-DistributedCOM
[ Guid] {1B562E86-B7AA-4131-BADC-B6F3A001407E}
[ EventSourceName] DCOM

- EventID 10016

[ Qualifiers] 0

Version 0

Level 3

Task 0

Opcode 0

Keywords 0x8080000000000000

- TimeCreated

[ SystemTime] 2020-01-13T08:42:23.019385200Z

EventRecordID 44033

- Correlation

[ ActivityID] {2a518cdf-ef20-4ff1-9fce-ec923f337462}

- Execution

[ ProcessID] 1224
[ ThreadID] 11608

Channel System

Computer DESKTOP-LFM92VN

- Security

[ UserID] S-1-5-19


- EventData

param1 application-specific
param2 Local
param3 Activation
param4 {6B3B8D23-FA8D-40B9-8DBD-B950333E2C52}
param5 {4839DDB7-58C2-48F5-8283-E1D1807D0D7D}
param6 NT AUTHORITY
param7 LOCAL SERVICE
param8 S-1-5-19
param9 LocalHost (Using LRPC)
param10 Unavailable
param11 Unavailable


But DistributedCOM has been blowing up a lot for a long time. It appears to have started complaining after a Windows Update.
EventID 10016 is the common theme in all the errors, starting back on the 12th.
These events show as Warning.

There is a DCOM ReaderNotificationClient error a few times on the 12th and 13th. These show as actual errors.


A randomly chosen warning shows this:
Log Name: System
Source: Microsoft-Windows-DistributedCOM
Date: 14/01/2020 12:50:52
Event ID: 10016
Task Category: None
Level: Warning
Keywords: Classic
User: LOCAL SERVICE
Computer: DESKTOP-LFM92VN
Description:
The application-specific permission settings do not grant Local Activation permission for the COM Server application with CLSID
{6B3B8D23-FA8D-40B9-8DBD-B950333E2C52}
and APPID
{4839DDB7-58C2-48F5-8283-E1D1807D0D7D}
to the user NT AUTHORITY\LOCAL SERVICE SID (S-1-5-19) from address LocalHost (Using LRPC) running in the application container Unavailable SID (Unavailable). This security permission can be modified using the Component Services administrative tool.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
<Provider Name="Microsoft-Windows-DistributedCOM" Guid="{1B562E86-B7AA-4131-BADC-B6F3A001407E}" EventSourceName="DCOM" />
<EventID Qualifiers="0">10016</EventID>
<Version>0</Version>
<Level>3</Level>
<Task>0</Task>
<Opcode>0</Opcode>
<Keywords>0x8080000000000000</Keywords>
<TimeCreated SystemTime="2020-01-14T11:50:52.138834300Z" />
<EventRecordID>45508</EventRecordID>
<Correlation ActivityID="{a2588d91-dfe8-4366-a385-eea323282b05}" />
<Execution ProcessID="1224" ThreadID="15920" />
<Channel>System</Channel>
<Computer>DESKTOP-LFM92VN</Computer>
<Security UserID="S-1-5-19" />
</System>
<EventData>
<Data Name="param1">application-specific</Data>
<Data Name="param2">Local</Data>
<Data Name="param3">Activation</Data>
<Data Name="param4">{6B3B8D23-FA8D-40B9-8DBD-B950333E2C52}</Data>
<Data Name="param5">{4839DDB7-58C2-48F5-8283-E1D1807D0D7D}</Data>
<Data Name="param6">NT AUTHORITY</Data>
<Data Name="param7">LOCAL SERVICE</Data>
<Data Name="param8">S-1-5-19</Data>
<Data Name="param9">LocalHost (Using LRPC)</Data>
<Data Name="param10">Unavailable</Data>
<Data Name="param11">Unavailable</Data>
</EventData>
</Event>

This was on the 14th.

Today, shortly after 18:00:
Log Name: System
Source: Microsoft-Windows-DistributedCOM
Date: 29/02/2020 18:08:23
Event ID: 10016
Task Category: None
Level: Warning
Keywords: Classic
User: DESKTOP-LFM92VN\Greg
Computer: DESKTOP-LFM92VN
Description:
The application-specific permission settings do not grant Local Activation permission for the COM Server application with CLSID
{2593F8B9-4EAF-457C-B68A-50F6B8EA6B54}
and APPID
{15C20B67-12E7-4BB6-92BB-7AFF07997402}
to the user DESKTOP-LFM92VN\Greg SID (S-1-5-21-630949258-3761359405-375428836-1001) from address LocalHost (Using LRPC) running in the application container Unavailable SID (Unavailable). This security permission can be modified using the Component Services administrative tool.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
<Provider Name="Microsoft-Windows-DistributedCOM" Guid="{1B562E86-B7AA-4131-BADC-B6F3A001407E}" EventSourceName="DCOM" />
<EventID Qualifiers="0">10016</EventID>
<Version>0</Version>
<Level>3</Level>
<Task>0</Task>
<Opcode>0</Opcode>
<Keywords>0x8080000000000000</Keywords>
<TimeCreated SystemTime="2020-02-29T17:08:23.919648000Z" />
<EventRecordID>71748</EventRecordID>
<Correlation ActivityID="{8c4a383f-ac9a-47a2-aa6b-aa12d52a6bb2}" />
<Execution ProcessID="1144" ThreadID="1184" />
<Channel>System</Channel>
<Computer>DESKTOP-LFM92VN</Computer>
<Security UserID="S-1-5-21-630949258-3761359405-375428836-1001" />
</System>
<EventData>
<Data Name="param1">application-specific</Data>
<Data Name="param2">Local</Data>
<Data Name="param3">Activation</Data>
<Data Name="param4">{2593F8B9-4EAF-457C-B68A-50F6B8EA6B54}</Data>
<Data Name="param5">{15C20B67-12E7-4BB6-92BB-7AFF07997402}</Data>
<Data Name="param6">DESKTOP-LFM92VN</Data>
<Data Name="param7">Greg</Data>
<Data Name="param8">S-1-5-21-630949258-3761359405-375428836-1001</Data>
<Data Name="param9">LocalHost (Using LRPC)</Data>
<Data Name="param10">Unavailable</Data>
<Data Name="param11">Unavailable</Data>
</EventData>
</Event>

But all of this was before I deep cleaned my system.
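
To look only at what was logged after 09:42 UTC on 29 Feb, as asked above, the System log could be narrowed with an Event Viewer custom view (Create Custom View, XML tab). This is a minimal sketch, not a tested filter from this machine. The cut-off time is taken from the last valid task listed above; the choice of severity levels and the suppression of event 10016 are assumptions made for illustration.

```xml
<QueryList>
  <Query Id="0" Path="System">
    <!-- Critical (1), Error (2) and Warning (3) events at or after
         2020-02-29 09:42:08 UTC, the timestamp of the last valid task. -->
    <Select Path="System">*[System[(Level=1 or Level=2 or Level=3) and TimeCreated[@SystemTime&gt;='2020-02-29T09:42:08.000Z']]]</Select>
    <!-- Assumption: hide the DCOM 10016 warnings, which were already being
         logged in January and so predate this failure window. -->
    <Suppress Path="System">*[System[Provider[@Name='Microsoft-Windows-DistributedCOM'] and (EventID=10016)]]</Suppress>
  </Query>
</QueryList>
```

Whatever remains in that view after the cut-off time is a better starting point than the January entries.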
ID: 41776
greg_be
Joined: 28 Dec 08
Posts: 318
Credit: 4,235,262
RAC: 3,182
Message 41779 - Posted: 29 Feb 2020, 23:07:31 UTC
Last modified: 29 Feb 2020, 23:42:43 UTC

Seriously! 1 hour and 52 mins or so and it blows up with a memory error? Come on!
'lse, errorID=HostMemoryLow message="Unable to allocate and lock memory. The virtual machine will be paused. Please close applications to free up memory or close the VM

So, last attempt:
Used Revo Uninstaller to get rid of VirtualBox 6.0.18 and all related registry keys.
Reinstalled the latest VirtualBox.
Put back app_config.xml.

The current task is advancing in its early stages at 0.2500% every 2 seconds and using 30% CPU.

As for the previous discussion about heat, that's taken care of on the CPU side with a really good radiator cooling system in a pull configuration.

This is a new case that is designed for gaming, 2 huge intake fans, plus 1 standard fan on the bottom blowing in cool air. 1 exhaust plus related open back for blowing out hot air.

Power supply: I have a 650-watt digital power supply, of which I use only 460 watts. The CPU takes 190 watts max, the GPU 165 max, and the system takes 116 max. PSU temperature is only 53C.

To me this is really just a freaking software problem between ATLAS and VBOX.
I really have no idea what more to do.
Windows kicked up a security message asking whether it was OK to let VBoxHeadless do its thing. I said yes, of course, allowing it all the access it needs.

I really hope this solves the issues. If it does I am not messing with any more configuration things. If app_config.xml makes the system happy, so be it.
ID: 41779
maeax
Joined: 2 May 07
Posts: 2108
Credit: 159,820,112
RAC: 106,484
Message 41783 - Posted: 1 Mar 2020, 7:16:15 UTC

If you have an ASUS mainboard, there is a good tool, AI Suite 3, to check and tune your system.
CPU-Z is also fine for checking whether your memory has a problem.
It is better to run Theory first and, if that works, to test CMS or ATLAS.
I have two machines with the same hardware and no problems with ATLAS.
ID: 41783
greg_be
Joined: 28 Dec 08
Posts: 318
Credit: 4,235,262
RAC: 3,182
Message 41784 - Posted: 1 Mar 2020, 8:01:02 UTC

Things are ok now. It's stable.
So: the latest VirtualBox and extensions, then an app_config.xml with the following: task download 1, run 1, CPU 4, memory 6600.
I am not changing this.
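
For reference, settings like these would typically look roughly as follows in app_config.xml. This is a sketch under assumptions, not the actual file from this machine: the app name ATLAS and the vboxwrapper options --nthreads and --memory_size_mb are assumed from common LHC@home usage, and the plan class is copied from the task list earlier in the thread. "Task download 1" is presumably a project web preference rather than something set in this file.

```xml
<app_config>
  <app>
    <!-- Assumed short app name for ATLAS -->
    <name>ATLAS</name>
    <!-- "run 1": at most one ATLAS task running at a time -->
    <max_concurrent>1</max_concurrent>
  </app>
  <app_version>
    <app_name>ATLAS</app_name>
    <!-- Plan class as shown in the task list in this thread -->
    <plan_class>vbox64_mt_mcore_atlas</plan_class>
    <!-- "cpu 4": tell the BOINC scheduler each task uses 4 cores -->
    <avg_ncpus>4</avg_ncpus>
    <!-- "memory 6600": assumed wrapper options, 4 threads and 6600 MB for the VM -->
    <cmdline>--nthreads 4 --memory_size_mb 6600</cmdline>
  </app_version>
</app_config>
```

The file goes in the project's folder under the BOINC data directory, and the client picks it up after Options, Read config files in BOINC Manager or after a restart. New values normally apply to tasks started afterwards.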

Thanks for the support.

I am not using an ASUS board, I am using MSI.
I rarely use any of the extra stuff from MSI, I do not have any need for it.
But I will run the program to see if there are any other critical updates needed.
I have been running MSI for years, after a little tweaking they are not a problem.
It's not a MOBO issue though, or memory. This has been just a software problem with BOINC and VBOX fighting. I had to find the right combo and settings to get them to stop fighting.
ID: 41784
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2418
Credit: 226,712,993
RAC: 130,464
Message 41786 - Posted: 1 Mar 2020, 8:52:51 UTC - in response to Message 41784.  

This has been just a software problem with BOINC and VBOX fighting.

As I already wrote:
BOINC, ATLAS and VBOX don't fight on hundreds of other computers, so why should they do so only on yours?
It's more likely a homemade issue, if not caused by the hardware then maybe by too much registry tweaking or using the wrong tools for monitoring.

ATLAS tasks are starting fine. That's what the logs show.
Then, when they come to a point where they need more RAM, VBOX can't allocate it since it is locked by another process.
This does not affect all of your tasks as a couple of valids show but those valids also show that ATLAS is able to run fine if it gets all required resources.
That RAM locking process has to be identified to solve the problem.
ID: 41786
maeax
Joined: 2 May 07
Posts: 2108
Credit: 159,820,112
RAC: 106,484
Message 41788 - Posted: 1 Mar 2020, 10:28:54 UTC - in response to Message 41784.  

Could the combination of TWO Nvidia graphics cards be a problem?
Do you also run Nvidia tasks under BOINC?
ID: 41788
greg_be
Joined: 28 Dec 08
Posts: 318
Credit: 4,235,262
RAC: 3,182
Message 41789 - Posted: 1 Mar 2020, 11:17:40 UTC - in response to Message 41786.  
Last modified: 1 Mar 2020, 11:32:32 UTC

This has been just a software problem with BOINC and VBOX fighting.

As I already wrote:
BOINC, ATLAS and VBOX don't fight on hundreds of other computers, so why should they do so only on yours?
It's more likely a homemade issue, if not caused by the hardware then maybe by too much registry tweaking or using the wrong tools for monitoring.

ATLAS tasks are starting fine. That's what the logs show.
Then, when they come to a point where they need more RAM, VBOX can't allocate it since it is locked by another process.
This does not affect all of your tasks as a couple of valids show but those valids also show that ATLAS is able to run fine if it gets all required resources.
That RAM locking process has to be identified to solve the problem.


Ok..but how does one sort out the RAM locking issue?
That is above my understanding.
And if things start needing more than 24 gigs, then I drop a project.
I have two that I am dedicated to because they come from where I used to live.
So Rosetta and Einstein I will keep, but stuff like Asteroids and perhaps Milkyway can go if needed.
Or I just get some larger sticks of RAM later if that solves the issue. I have two old sticks that I keep rolling over from build to build to save money. This would be the last upgrade I do on the system for a while if needed.

I think that there was too much leftover crap from Windows and all the installs and uninstalls of BOINC and VBOX. There were over 700 issues that needed to be fixed by WISE. Now that the system has been deep cleaned it's working properly again. That's all I know. And you do have to deep clean your system every so often. So I will look at deep cleaning again in 4-6 weeks.

And as I said, for regular light maintenance cleaning, the cleaner has been told to exclude the BOINC folder.
ID: 41789
greg_be
Joined: 28 Dec 08
Posts: 318
Credit: 4,235,262
RAC: 3,182
Message 41790 - Posted: 1 Mar 2020, 11:28:07 UTC - in response to Message 41788.  

Could the combination of TWO Nvidia graphics cards be a problem?
Do you also run Nvidia tasks under BOINC?


Why would that be a problem? I have been running Nvidia GPUs side by side for years. I had a 1050 and a 960 before I got the 1080. The 1080 has been running just fine for a long time now. And I personally don't see how GPUs running on their own cores would affect just one process that does not use the GPU while not interfering with other CPU + VM jobs. Only ATLAS is coughing, not any of the other projects that use a VM.

I think it is more like the other guy says, something to do with memory locking, but that is a topic I am not familiar with, so I will need more information. That was an issue for a time, and so was low memory for some reason. But the forced change to a maximum of 6600 MB dropped my RAM load by 25%. Now I max out in the high 70s to low 80s percent of my 24 GB of RAM.

As far as memory usage goes, right now after ATLAS only 2 Rosetta tasks come close in usage.
One that maxes out in the 1200 range and another that is only in the 450-470 range.

And since I made the adjustments in app_config.xml and left it in place and got the right version of VBOX and BOINC, things are chugging along just nicely with 2 NVIDIA cards running.

I am not about to go messing around with things if they work now. It's hands off on things for ATLAS. It's happy, I'm happy, so just leave it be.
ID: 41790
computezrmle
Volunteer moderator
Volunteer developer
Volunteer tester
Help desk expert
Joined: 15 Jun 08
Posts: 2418
Credit: 226,712,993
RAC: 130,464
Message 41791 - Posted: 1 Mar 2020, 11:29:14 UTC - in response to Message 41789.  

I have two old sticks that I keep rolling over from build to build to save money.

You have 24 GB RAM, right?
How many RAM sticks do you have and how do you populate your MB?
How many RAM slots are used and how many are free?
Do all your RAM sticks currently in use have the same size and the same specs?
Are all RAM sticks from the same manufacturer?
ID: 41791
maeax
Joined: 2 May 07
Posts: 2108
Credit: 159,820,112
RAC: 106,484
Message 41792 - Posted: 1 Mar 2020, 13:25:07 UTC - in response to Message 41791.  

Yes,
this combination of hardware and GPUs could be something special.
What if the Nvidia GPUs are locking the RAM for their own work?
ID: 41792
greg_be
Joined: 28 Dec 08
Posts: 318
Credit: 4,235,262
RAC: 3,182
Message 41793 - Posted: 1 Mar 2020, 17:07:11 UTC - in response to Message 41792.  

Yes,
this combination of hardware and GPUs could be something special.
What if the Nvidia GPUs are locking the RAM for their own work?


Well, they would only take what is needed to run the project. But none of my GPU projects consume as much RAM as ATLAS does. Rosetta was drawing more RAM than the GPUs were last time I looked.
Right now Moo Wrapper and PrimeGrid are running, but they have a combined use of just 117 MB.
They are more processor heavy than RAM heavy. PrimeGrid is a number search, so no real graphics to speak of. Moo is the same thing: heavy equations but no graphics.
ID: 41793
greg_be
Joined: 28 Dec 08
Posts: 318
Credit: 4,235,262
RAC: 3,182
Message 41794 - Posted: 1 Mar 2020, 17:22:52 UTC - in response to Message 41791.  

I have two old sticks that I keep rolling over from build to build to save money.

You have 24 GB RAM, right?
How many RAM sticks do you have and how do you populate your MB?
How many RAM slots are used and how many are free?
Do all your RAM sticks currently in use have the same size and the same specs?
Are all RAM sticks from the same manufacturer?


RAMMon report, with every detail you would want to know:
https://drive.google.com/open?id=1pWh86MdxvSPCKK0-xuFM9LnME9_64cfp
2 x Patriot, 4096 MB each (old sticks); 1 x Kingston and 1 x CORSAIR, both 8192 MB
All PC4-17000

Any other details just look at the report.
ID: 41794


©2024 CERN