
segfault rip rsp error

 

1.What are segfault rip/rsp numbers and how to use them:

http://stackoverflow.com/questions/1456899/what-are-segfault-rip-rsp-numbers-and-how-to-use-them

When my Linux application crashes, it produces a line in the logs something like:

segfault at 0000000 rip 00003f32a823 rsp 000123ade323 error 4

What are those rip and rsp addresses? How do I use them to pinpoint the problem? Do they correspond to something in the "objdump" or "readelf" output? Are they still useful if my program has had its symbols stripped out (to a separate file, which can be loaded with gdb)?

Asked Sep 21 '09 at 21:11 by johnnys; edited Jan 12 '10 at 2:30 by Bill the Lizard.




3 Answers

Well, the rip value (the instruction pointer) tells you which instruction caused the crash. You need to look it up in a map file.

The map file lists the functions and their starting addresses. When the application is loaded, it is loaded at a base address. The rip value minus the base address gives you the map file address. Search the map file for the function that starts at an address at or slightly below that value and is followed, in the list, by a function with a higher address: that is the function that crashed.

From there you need to try to identify what went wrong in your code. It's not much fun, but it at least gives you a starting point.

Edit: The "segfault at" value is the address the program tried to access. Here it is 0, so I'd wager you dereferenced a NULL pointer. The rsp value is the current stack pointer. Alas, it's probably not all that useful. With a memory dump you may be able to figure out more accurately where you'd got to in the function, but it can be really hard to work out exactly where you are in an optimised build.
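
The lookup described in this answer can be sketched in shell. This is a minimal sketch, not something the answer itself provides: find_function is a hypothetical helper, the symbol list is assumed to be in the "address type name" layout that nm -n prints, and the function names and addresses in the sample table are made up.

```shell
# Hypothetical helper, not part of any tool: find the function containing
# a crash address. Reads "address type name" lines (the layout `nm -n`
# prints), sorted by ascending address, on stdin.
# $1 = rip from the log (hex, no 0x); $2 = load base (hex, default 0).
find_function() (
    target=$(( 0x$1 - 0x${2:-0} ))
    best=unknown
    while read -r addr type name; do
        # Skip lines whose first field is not a hex address
        case $addr in ''|*[!0-9a-fA-F]*) continue ;; esac
        if [ $(( 0x$addr )) -le "$target" ]; then
            best=$name      # starts at or below the crash address
        else
            break           # first function above it: stop searching
        fi
    done
    printf '%s\n' "$best"
)

# Made-up symbol table for illustration; prints: read_packet
find_function 4450c5 <<'EOF'
0000000000445000 T parse_header
00000000004450c0 T read_packet
0000000000445100 T close_stream
EOF
```

For an executable loaded at its link-time addresses the base is 0 and the rip value can be used directly. For a shared library or a position-independent executable, the load base has to be passed as the second argument so that it is subtracted first.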
Answered Sep 21 '09 at 21:20 by Goz; edited Sep 21 '09 at 21:35.


This link was very useful to me.
Answered Oct 21 '10 at 14:29 by tsotso.



I got this error, too. I saw:

probe.out[28503]: segfault at 0000000000000180 rip 00000000004450c0 rsp 00007fff4d508178 error 4

probe.out is an app that uses libavformat (ffmpeg). I disassembled it:

objdump -d probe.out

The rip value is the address of the instruction that was executing:

00000000004450c0 <ff_rtp_queued_packet_time>:
4450c0: 48 8b 97 80 01 00 00 mov 0x180(%rdi),%rdx
44d25d: e8 5e 7e ff ff callq 4450c0 <ff_rtp_queued_packet_time>

So I found that the app crashed in the function ff_rtp_queued_packet_time.

PS. Sometimes the address doesn't match exactly, but it is close.
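
The fault address and the disassembly can be combined to see which pointer was bad. A hedged sketch of the arithmetic (base_register is a hypothetical helper name): the faulting instruction reads memory at 0x180(%rdi), that is %rdi + 0x180, and the kernel reported the fault at 0x180, so subtracting the displacement gives the value %rdi held.

```shell
# Hypothetical helper: recover the base register value from the
# "segfault at" address and the displacement in the faulting instruction.
# $1 = fault address (hex), $2 = displacement from the disassembly (hex).
base_register() {
    printf '0x%x\n' $(( 0x$1 - 0x$2 ))
}

# Values from the log line and from mov 0x180(%rdi),%rdx
base_register 0000000000000180 180    # prints 0x0
```

Because 4450c0 is the first instruction of the function, and %rdi carries the first integer argument under the x86-64 System V calling convention, this suggests ff_rtp_queued_packet_time was called with a NULL pointer. That is an inference from the listing, not something the answer states.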
Answered Jan 24 at 7:18 by qrtt1.



2.Bug 172933 - gdm segfaults on boot

https://bugzilla.redhat.com/show_bug.cgi?id=172933

3.Segfault error 4:

from:http://nixcraft.com/linux-software/12412-segfault-error-4-a.html

Posted by kasimani:

I have been running pgcluster 1.9rc5 for some months. Recently I started getting "segfault ... error 4" alerts in the message log.

What could be the problem, and is there a solution for it?
Can anyone tell me why this error occurs and what it means?

I am running CentOS 5 on 64-bit blade servers, partitioned into 4 parts using VMware with HT disabled.

Here are the alerts I am getting in the message log:

/var/log/messages.2:Oct 4 10:18:49 ibn-cluster3 kernel: postgres[13458]: segfault at 00002aaaae097004 rip 0000000000536e10 rsp 00007fff97608930 error 4
/var/log/messages.2:Oct 4 10:18:49 ibn-cluster3 kernel: postgres[13438]: segfault at 00002aaaae097004 rip 0000000000536e10 rsp 00007fff97608930 error 4
/var/log/messages.2:Oct 4 13:59:26 ibn-cluster3 kernel: postgres[25406]: segfault at 0000000000000094 rip 00000000005ad88e rsp 00007fff932bcad0 error 4
/var/log/messages.2:Oct 4 19:07:23 ibn-cluster3 kernel: postgres[5698]: segfault at 0000000000000094 rip 00000000005ad88e rsp 00007fff08347500 error 4
/var/log/messages.4:Sep 20 14:02:43 ibn-cluster3 kernel: postgres[31633]: segfault at 0000000000000094 rip 00000000005ad88e rsp 00007ffff4f76370 error 4
/var/log/messages.4:Sep 20 14:48:01 ibn-cluster3 kernel: postgres[32302]: segfault at 0000000000000094 rip 00000000005ad88e rsp 00007fffc9b4aca0 error 4
/var/log/messages.4:Sep 20 14:48:23 ibn-cluster3 kernel: postgres[32330]: segfault at 0000000000000094 rip 00000000005ad88e rsp 00007fffc9b4aca0 error 4
/var/log/messages.4:Sep 20 14:48:25 ibn-cluster3 kernel: postgres[32338]: segfault at 0000000000000094 rip 00000000005ad88e rsp 00007fffc9b4aca0 error 4
/var/log/messages.4:Sep 20 14:48:28 ibn-cluster3 kernel: postgres[32347]: segfault at 0000000000000094 rip 00000000005ad88e rsp 00007fffc9b4aca0 error 4
/var/log/messages.4:Sep 20 15:23:31 ibn-cluster3 kernel: postgres[1474]: segfault at 0000000000000094 rip 00000000005ad88e rsp 00007fff34cd0340 error 4
/var/log/messages.4:Sep 20 16:46:46 ibn-cluster3 kernel: postgres[2480]: segfault at 0000000000000094 rip 00000000005ad88e rsp 00007ffffc58a6f0 error 4
/var/log/messages.4:Sep 20 16:52:53 ibn-cluster3 kernel: postgres[2984]: segfault at 0000000000000094 rip 00000000005ad88e rsp 00007fff5e7e59e0 error 4
/var/log/messages.4:Sep 20 20:00:01 ibn-cluster3 kernel: postgres[6654]: segfault at 0000000000000094 rip 00000000005ad88e rsp 00007fffa24b2df0 error 4
/var/log/messages.4:Sep 20 20:00:03 ibn-cluster3 kernel: postgres[6662]: segfault at 0000000000000094 rip 00000000005ad88e rsp 00007fffa24b2df0 error 4


If more details are needed, please let me know.

Regards

nixcraft, 14th October 2008, 04:11 PM:

Maybe the article linked below will help you out:
Why Does The Segmentation Fault Occur on Linux / UNIX Systems?

Vivek Gite

websissy, 14th October 2008, 08:54 PM:

In my experience, Apache segfaults can also be caused by one or more damaged run-time components in Apache or in one of the modules it depends on. This happened to me last weekend, when fixes applied to a corrupted file system apparently ended up damaging some of Apache's components.

Shortly after that happened, I noticed cascades of segmentation faults occurring in Apache on my system. In an effort to fix it, I used aptitude on my Debian system to carefully build a detailed list of Apache2 and all the components it depended on. Once I had a complete list, I used aptitude to remove all those apps (except one which the kernel depended on), and then I used "clean" to remove all traces of those apps from my system except their config files. Finally I reinstalled Apache2 and all the components it depended on, and the result was that I managed to eliminate almost all the segfaults. Whereas before I was getting several of those errors at once (and hundreds or thousands in a 24 hour period), I'm now seeing only 5 or 6 in a 24 hour period.

The point is that segfaults can occur as a by-product of a damaged file system, as well as because of a hardware problem or a poorly written program. Try removing and reinstalling Apache and the components it depends on, and be sure to delete the binary runtimes for all those apps too -- THAT's the trickiest part -- but try to save your config files (if possible). In my case this solution reduced the number of Apache segfaults from hundreds per hour to 4 - 6 per day.

Another thing to bear in mind is that Apache makes heavy use of memory and spawns dozens of dynamic tasks (e.g. PHP and its underlying applications, Python and its apps, Ruby and its apps, etc.), which then issue requests to other apps as well (e.g. MySQL database requests, Perl requests and God KNOWS what else). In short, Apache and the tools it relies on very thoroughly exercise your system's memory. So if you had a bad stick of RAM that was throwing random errors, especially during periods of high system demand, Apache and the apps it calls might encounter that bad block of memory quite often. Have you tried running a hardware diagnostic on your system?

Last edited by websissy; 14th October 2008 at 09:08 PM.

manishkochar, 16th July 2009, 05:28 PM:

I read the articles at the links suggested by Vivek.
They are good articles, but they really can't do much to help diagnose and solve a problem, especially if you are not the software developer.

Most people who use Linux systems use open source software, but that does not mean they can understand all the programming that goes on inside.

Developers write programs, package them and distribute them to people all over the world. Developers don't even know most of their users, and most users wouldn't know how to debug a crash. If you believe in Murphy's Law, believe me, a crash occurs when you are least expecting it. I suppose we need a HowTo that bridges developers and users, for the purpose of eliminating such crashes and flaws from the software.

I hope somebody could write an article like:
Suppose you are using software that randomly crashes. Open up /var/log/messages and grep for segfault. You will notice one or more lines like:

segfault at a594dec8 eip b7cc6283 esp ab78e658 error 4

I really wish I knew more about what to do next, so that I could write the rest of the article. But I think the article should cover things like:

How and why should all software developers, both open source and closed source, ship the symbol tables of their executables and libraries?

How should end users use the symbol tables to analyse a crash, even when they do not have the original source code?

Not all segfaults are caused by a flaw in the application software. How can one make sure that a crash reported as a segfault is definitely NOT caused by a fault in the software?


Most of the articles on the web ask you to start an application under gdb and then keep using it until a crash occurs. I am amazed that so many people still believe a software user really has nothing better to do. Moreover, if replicating a crash were so easy, wouldn't every developer be able to simply ship stable releases and cut out all that alpha and beta crap? On numerous occasions I have even seen segfaults without any core dump. There are also other forms of crashes, like stack smashing, that don't generate a core dump at all, and most users wouldn't even be able to record them if they occurred in a background application or a daemon. Hopefully this article would cover points like capturing ALL output emitted by an application due to systemic errors such as stack smashing, OOM, etc.

Maybe all the stuff I am wishing for does exist somewhere and I have been too stupid to discover it, so I thank in advance all those who might be kind enough to point me to the correct links.

Hopefully the admins and users of this forum will be able to get together and put such an article in place, under Vivek's stewardship.

nixcraft, 17th July 2009, 02:48 PM:

Yes, segfault errors are a royal pain in the a$$. gdb is the best tool for debugging these problems. There is another good alternative called DTrace, a dynamic tracing framework for troubleshooting kernel and application problems on production systems in real time. However, it only works on Solaris / FreeBSD / Mac OS X, not on Linux.

So as a sysadmin you need to train yourself to use gdb. There are good books out there that teach gdb.

HTH

manishkochar, 17th July 2009, 08:46 PM:

I have been on this subject for a while now, and did a bit of surfing around in search of nirvana.

Let me share what I discovered, and let's hope there are people on this forum who might be interested in adding more:

A super link that introduces people to the world of post-mortem analysis:
YouTube - Gilad Ben-Yossef on using ldd and nm

Another discussion thread, "Getting stack traces on Unix systems, automatically - Stack Overflow",
is worth a visit.

The second link requires a lot of recoding of existing software, whereas the first link encourages analysis with whatever you already have.

After a bit more surfing around, I found that after compiling an application it is possible to export its symbols into a separate file.
Thus, even applications that are put into production after stripping can be analysed with gdb, without necessarily requiring the source code.

For example, one could do:
Code:
gdb "./${EXECUTABLE_BINARY}" --readnow <<- _EOF
maint print symbols ${SYMBOLS_FILE_FOR_THE_EXECUTABLE_BINARY}
quit
_EOF
The above dumps the symbols that gdb has read for the executable into a file.

gdb can be invoked with an executable and a separate file that contains the symbols, in case you do not have the source code and are using a stripped executable. But I am not sure that maint print symbols is the right option here: it writes a human-readable dump rather than a symbol file gdb can load, so this must be verified before use. The usual binutils route is objcopy --only-keep-debug to create the separate symbol file and objcopy --add-gnu-debuglink to attach it to the stripped binary.

But I guess that if this works, and does not allow reverse engineering, then even developers of closed source software could be encouraged to release the symbols file within their packages.

I discovered another good reference at Tuxology - a Linux embedded, kernel and training blog.
It's by the same gentleman as in the YouTube link.

DTrace, mtrace, strace, ptrace, etc. are good, but the only problem with them is that they help only if you know the application is going to crash soon, or if you know how to replicate the crash. All of them leave me miserably occupied on the console, waiting for hours for an application to crash. The scene is worse when the app is supposed to run as a daemon and we run it in the foreground just for witch-hunting. And if the stupid thing crashes just when you left for a quick cuppa, ..... !!!!

The second link I mentioned above discusses ways of making your application capable of capturing a lot of details when it gets a SIGSEGV. I suppose enough examples just need to be collected so that newbies can learn it too.

Btw. does anybody know how to actually use objdump and nm?
Their documentation only discusses how to invoke them, and says little about how to interpret the output and use it to analyse a segfault accurately.

Cheers

manishkochar, 29th July 2009, 03:50 PM:

OK, I figured out a bit about objdump!

It is possible to identify the location in the source code that causes problems like:
segfault at XXXXXX eip YYYYYY esp ZZZZZZ error 4
Typically such lines appear in /var/log/messages in the following format:

Code:
Jul 28 20:51:32 ubuntu804 kernel: [ 8146.280653] YOUR_APPLICATION[992]: segfault at 0000004c eip 08094952 esp a7acddc0 error 4
First generate an objdump of the application "YOUR_APPLICATION" with the following command:

Code:
objdump -DCl "/path/to/YOUR_APPLICATION" > APPLICATION_DEBUG
Then simply locate the eip value, that is YYYYYY, in APPLICATION_DEBUG.

In the above example 08094952 represents YYYYYY,
so I would typically do this:

Code:
grep -n -A 6 -B 6 "8094952" APPLICATION_DEBUG
Note that instead of "08094952" I trimmed the leading "0" and used "8094952", because objdump's instruction labels omit leading zeros.

The resulting output should give you a fair idea of where the problem lies in the code. grep -n tells you the line number of the match within APPLICATION_DEBUG, and you might even use cat or less to view the entire file and look at things more holistically. -A 6 -B 6 simply shows the 6 lines after and before the matching position in APPLICATION_DEBUG.

However, the information in /var/log/messages can look different, like:
Code:
segfault at 00002aaaae097004 rip 0000000000536e10 rsp 00007fff97608930 error 4
I still haven't figured that one out; I will surely post when I do!

Happy hunting, and if anybody else has notes to add, I guess this thread will be very useful to everybody, so please accept my thanks in advance.
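
On the rip/rsp form of the line: rip and rsp are the x86-64 names for the instruction pointer and stack pointer that 32-bit x86 calls eip and esp, so the same objdump lookup applies. The sketch below (parse_segfault and objdump_label are hypothetical helper names) extracts the fields from either form and builds a grep pattern from the instruction pointer. The pattern also accepts a bare ip/sp spelling, on the assumption that some kernels print it that way.

```shell
# Hypothetical helper: pull the fields out of a kernel segfault line.
# Accepts the eip/esp, rip/rsp and bare ip/sp spellings.
parse_segfault() {
    printf '%s\n' "$1" | sed -n \
      's/.*segfault at \([0-9a-f]*\) [er]*ip \([0-9a-f]*\) [er]*sp \([0-9a-f]*\) error \([0-9a-f]*\).*/fault=\1 ip=\2 sp=\3 error=\4/p'
}

# Hypothetical helper: strip leading zeros and append ":" so the result
# matches the instruction labels in objdump output, e.g. "536e10:".
objdump_label() {
    printf '%x:\n' $(( 0x$1 ))
}

parse_segfault 'postgres[13458]: segfault at 00002aaaae097004 rip 0000000000536e10 rsp 00007fff97608930 error 4'
# prints: fault=00002aaaae097004 ip=0000000000536e10 sp=00007fff97608930 error=4

objdump_label 0000000000536e10
# prints: 536e10:

# Possible use with the dump file from the post above:
#   grep -n -A 6 -B 6 "$(objdump_label 0000000000536e10)" APPLICATION_DEBUG
```

Matching on the label with its trailing colon is a design choice: it avoids hits where the same digits appear as a jump target or inside the instruction bytes.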

4.LINUX: segfault error 4 - This thread has been closed

from:http://forums11.itrc.hp.com/service/forums/questionanswer.do?admit=109447626+1300826568908+28353475&threadId=1337779

Jojo Castro, May 6, 2009 12:01:59 GMT:

Hi All,

I am currently looking into a problem with our application developer. Apparently, their application always stops after about two hours of running in the background. The process is started in the background via fork. When I checked the messages log, I found this:

May 5 05:39:34 URM01 kernel: logging_app[17789]: segfault at 0000000000000000 rip 0000003bb7861dd1 rsp 0000007fbfffe460 error 4
May 6 07:14:43 URM01 kernel: logging_app[10965]: segfault at 0000000000000000 rip 0000003bb7861dd1 rsp 0000007fbfffe430 error 4
May 6 07:14:43 URM01 kernel: logging_app[10964]: segfault at 0000000000000000 rip 0000003bb7861dd1 rsp 0000007fbfffe430 error 4
May 6 10:00:46 URM01 kernel: logging_app[12754]: segfault at 0000000000000000 rip 0000003bb7861dd1 rsp 0000007fbfffe420 error 4
May 6 13:17:46 URM01 kernel: logging_app[16639]: segfault at 0000000000000000 rip 0000003bb7861dd1 rsp 0000007fbfffe430 error 4
May 6 13:18:10 URM01 kernel: logging_app[16638]: segfault at 0000000000000000 rip 0000003bb7861dd1 rsp 0000007fbfffe430 error 4
May 6 16:14:44 URM01 kernel: logging_app[18255]: segfault at 0000000000000000 rip 0000003bb7861dd1 rsp 0000007fbfffe430 error 4
May 6 16:14:53 URM01 kernel: logging_app[18256]: segfault at 0000000000000000 rip 0000003bb7861dd1 rsp 0000007fbfffe430 error 4
May 6 19:11:16 URM01 kernel: logging_app[10646]: segfault at 0000000000000000 rip 0000003bb7861dd1 rsp 0000007fbfffe430 error 4
May 6 19:11:23 URM01 kernel: logging_app[10648]: segfault at 0000000000000000 rip 0000003bb7861dd1 rsp 0000007fbfffe430 error 4

I have already searched Google for the meaning of segfault error 4, and somebody said that it has something to do with SELinux.

Here are my queries:
1.) Can someone tell me what the error (segfault) means exactly?
2.) Strangely, the application always stops every two hours, which corresponds to the times of the segfault errors logged in messages.
3.) PLEASE PLEASE PLEASE give us a resolution to this problem.

Thanks in advance!

Matti Kurkela, May 6, 2009 13:11:38 GMT:

1.) Segfault = segmentation fault = the application is trying to access a memory address that is not mapped into its address space, or that it is not permitted to access in that way (for example, memory reserved for the kernel). The memory management unit in the CPU stops the operation and triggers an exception. The kernel's handler for that exception sends the program a SIGSEGV signal, which by default kills it.

As the message is "segfault at 0000000000000000", I'd guess the program probably tried to use a pointer whose value is NULL, for example one that was never set to point at a valid object. It is very likely that your application has a fairly serious bug in it.

The "rip" value is the Instruction Pointer: the program location the CPU was running at the time of the error. It seems it is always exactly the same, so the error is repeatable - that is good.

The "rsp" is the Stack Pointer. Its value seems to vary just a little. If your developer is good, s/he will know whether this is important or not.

2.) Not strange at all. After receiving a segfault, the program cannot continue.

3.) Without having the program source code, this is impossible. Your application developer will have to fix it him/herself.

However, there are some things you can do:

If possible, have your application developer produce a version of the application that includes debug information. If the application is compiled using gcc, this is as simple as adding the "-g" option to the compilation commands.

Before starting the application, run "ulimit -c unlimited". This allows the segfault handler to produce a core dump file when the segfault handler is triggered. This file contains all the memory used by the application, so it might be very big.

Then your application developer needs to run a debugger program on the application and the core file. If the application was compiled with debug information, the debugger can identify exactly on what line of the source code the error happened. The developer can also use the debugger to examine the values of any variables at the time of the error. The debugger has many other features which might be useful too. If your developer does not know how to use a debugger, he/she should definitely learn it.

For Linux, the most common debugger program is named "gdb" and it is available in most Linux distributions. It is usually in the "development tools" category of the distribution's package collection.

MK
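
As a supplement on the "error 4" part of these log lines: on x86 the number is the page-fault error code, a bit field. Bit 0 is set for a protection violation (clear when the page was not mapped), bit 1 for a write (clear for a read), and bit 2 when the fault happened in user mode. A minimal sketch of the decoding (decode_error is a hypothetical helper; it assumes the log prints the code in hex and only looks at the low three bits):

```shell
# Hypothetical helper: decode the low three bits of the x86 page-fault
# error code shown after "error" in the log line.
# $1 = error code as printed in the log (treated as hex).
decode_error() {
    if [ $(( 0x$1 & 1 )) -ne 0 ]; then cause="protection violation"; else cause="page not mapped"; fi
    if [ $(( 0x$1 & 2 )) -ne 0 ]; then access="write"; else access="read"; fi
    if [ $(( 0x$1 & 4 )) -ne 0 ]; then mode="user"; else mode="kernel"; fi
    printf '%s-mode %s, %s\n' "$mode" "$access" "$cause"
}

decode_error 4    # prints: user-mode read, page not mapped
decode_error 6    # prints: user-mode write, page not mapped
decode_error 7    # prints: user-mode write, protection violation
```

Read this way, error 4 is a user-mode read of an unmapped address, which fits the NULL pointer guess above; the code by itself does not point at SELinux.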

Jojo Castro, May 7, 2009 01:14:27 GMT:

Hi MK,

Thanks for the information regarding the segfault.
I have already forwarded your recommendation to our developers, and they will look into the "pointer" issue as a possible bug in their application.

Currently, these are my ulimit values:

core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
file size (blocks, -f) unlimited
pending signals (-i) 1024
max locked memory (kbytes, -l) 32
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 137215
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited


Actually, we have already tried adding ulimit -c 1000 to the .bash_profile of the account that runs the program.

Action points:

1.) Run the program in debug mode
2.) Since two applications are running at the same time, we will try running only one application at a time. This might be the "locking" issue you mentioned.
3.) How can we use gdb?

Another question, though: would adjusting some kernel parameters on the OS side help?

Thanks!

Jojo Castro, May 7, 2009 01:25:14 GMT:

By the way, here is also the finding from Metalink, as our DBA also opened a case for this...

KERNEL PARAMETER
--------------------------------
semmsl 250 / 250 / OK
semmns 32000 / 32000 / OK
semopm 100 / 32 / ----> TOO LOW
semmni 128 / 128 / OK
--> kernel.sem = 250 32000 32 128

shmall 2097152 / 2097152 / OK
shmmax - / 33554432 / OK
shmmni 4096 / 4096 / OK
file-max 65536
ip_local_port_range 1024 - 65000 / 32768 - 61000 / OK
rmem_default 262144 / 135168 ----> TOO LOW
rmem_max 262144 / 135168 ----> TOO LOW
wmem_default 262144 / 135168 ----> TOO LOW
wmem_max 262144 / 135168 ----> TOO LOW



SEMOPM
-------------
The SEMOPM kernel parameter controls the number of semaphore operations that can be performed per semop system call.

The semop system call (function) makes it possible to operate on multiple semaphores with one call. A semaphore set can hold at most SEMMSL semaphores, so it is recommended to set SEMOPM equal to SEMMSL.

Oracle recommends setting the SEMOPM to a value of no less than 100.
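To check which value is currently in effect, the four numbers of kernel.sem can be read back and split into their fields. The sketch below assumes the usual field order SEMMSL, SEMMNS, SEMOPM, SEMMNI (the order in which the kernel.sem line above lists them) and that /proc/sys/kernel/sem is available. The struct and function names are made up for illustration.

```c
#include <stdio.h>

/* The four fields of kernel.sem, in the order they appear on the line. */
struct sem_limits {
    long semmsl;  /* max semaphores per set */
    long semmns;  /* max semaphores system-wide */
    long semopm;  /* max operations per semop() call */
    long semmni;  /* max number of semaphore sets */
};

/* Parse a line such as "250 32000 32 128".
 * Returns 0 on success, -1 if fewer than four numbers were found. */
int parse_sem_limits(const char *line, struct sem_limits *out)
{
    int n = sscanf(line, "%ld %ld %ld %ld",
                   &out->semmsl, &out->semmns, &out->semopm, &out->semmni);
    return n == 4 ? 0 : -1;
}

/* Read the live values from /proc (assumed to be mounted).
 * Returns 0 on success, -1 if the file cannot be read or parsed. */
int read_sem_limits(struct sem_limits *out)
{
    char buf[128];
    FILE *f = fopen("/proc/sys/kernel/sem", "r");

    if (f == NULL)
        return -1;
    if (fgets(buf, sizeof buf, f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);
    return parse_sem_limits(buf, out);
}
```

Calling read_sem_limits() and comparing the semopm field against 100 would reproduce the check from the Metalink finding on the running system.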


ACTION PLAN
===========

1. The following kernel parameters on the system where Client 10.2.0.1 is used to run the application
must be increased to the values listed below. After that, the machine has to be rebooted - see

Oracle® Database Installation Guide
10g Release 2 (10.2) for Linux x86-64
Part Number B15667-03

http://download.oracle.com/docs/cd/B19306_01/install.102/b15667/pre_install.htm#BABCHAED


--------------------------------
semopm 100
rmem_default 262144
rmem_max 262144
wmem_default 262144
wmem_max 262144
---------------------------------

2. Oracle does NOT support user-generated makefiles - only shipped ones contained in

$ORACLE_HOME/precomp/demo/proc
$ORACLE_HOME/precomp/lib/


Thanks.
Jojo Castro
May 7, 2009 05:38:16 GMT Thread closed by author

Hi MK,

Just to give you an update, we found out that the application is holding so many files open that it hits the limit of 1024 open files.
Our application developer is now looking at the part of the application where the looping currently happens.

Thanks again for the info!

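5. Linux encounters a Segmentation fault

from:http://hi.baidu.com/goggle1/blog/item/1ee73d2fe90d985c4fc2261c.html

Linux encounters a Segmentation fault
2010-03-16 16:02

Program terminated with signal 11, Segmentation fault.
After the program had run for 8 hours, the message above appeared, and it said that a core dump file had been produced.
I found the core dump file, core.2747:
#gdb -c core.2747
#bt
No stack was visible, and no source line information at all. At first I assumed that memory had been trampled so badly that this was the result.
I searched Baidu for "Program terminated with signal 11, Segmentation fault." and found

How to find and fix faults in Linux applications

Finding 1. In fact, that was not the case; I had simply invoked gdb incorrectly. The correct usage is:
#gdb ./myprogram core.2747
#bt
Now the stack information shows up!

Finding 2. tail -f messages
Mar 16 13:59:52 localhost kernel: myprogram[2856]: segfault at 0000000000003a49 rip 000000000041f82c rsp 000000004be1bfb0 error 4
This time I googled "segfault rip rsp error 4"
and found a second good article:

"Posts tagged segfault"


I learned about dmesg, where some information can be found;
and I learned about the addr2line -e testseg 0000000000400470 command.

Both articles are so good that I paste them in full below: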
How to find and fix faults in Linux applications

Abstract:

Everybody claims that it is easy to find and fix bugs in programs written under Linux. Unfortunately it is very hard to find documents explaining how to do that. In this article you will learn how to find and fix faults without first learning how an application internally works.

_________________ _________________ _________________

Introduction

From a user perspective there is hardly any difference between closed and open source systems as long as everything runs without faults and as expected. The situation changes however when things do not work and sooner or later every computer user will come to the point where things do not work.

In a closed source system you usually have only two options:

  • Report the fault and pay for the fix
  • Re-install and pray that it works now
Under Linux you have these options too but you can also start and investigate the cause of the problem. One of the main obstacles is usually that you are not the author of the failing program and that you have really no clue how it works internally.

Despite those obstacles there are a few things you can do without reading all the code and without learning how the program works internally.

Logs

The most obvious and simplest thing you can do is to look at the files in /var/log/... What you find in those files and what the names of those log files are is configurable. /var/log/messages is usually the file you want to look at. Bigger applications may have their own log directories (/var/log/httpd/ /var/log/exim ...).
Most distributions use syslog as system logger and its behavior is controlled via the configuration file /etc/syslog.conf. The syntax of this file is documented in "man syslog.conf".

Logging works such that the designer of a program can add a syslog line to his code. This is much like a printf except that it writes to the system log. In this statement you specify a priority and a facility to classify the message:
#include <syslog.h>

void openlog(const char *ident, int option, int facility);
void syslog(int priority, const char *format, ...);
void closelog(void);

facility classifies the type of application sending the message.
priority determines the importance of the message. Possible
values in order of importance are:

LOG_EMERG
LOG_ALERT
LOG_CRIT
LOG_ERR
LOG_WARNING
LOG_NOTICE
LOG_INFO
LOG_DEBUG
With this C-interface any application written in C can write to the system log. Other languages do have similar APIs. Even shell scripts can write to the log with the command:
logger -p err "this text goes to /var/log/messages"
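As an illustration of the C interface listed above, here is a minimal sketch that logs the reason when a configuration file cannot be opened. The ident "myapp", the function name and the choice of LOG_USER are assumptions made for the example, not anything prescribed by the article.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <syslog.h>

/* Open a configuration file for reading. On failure, write the reason
 * to the system log with priority LOG_ERR and return NULL. */
FILE *open_config_logged(const char *path)
{
    FILE *f = fopen(path, "r");

    if (f == NULL) {
        int saved_errno = errno;              /* keep errno before other calls */

        openlog("myapp", LOG_PID, LOG_USER);  /* "myapp" is a placeholder ident */
        syslog(LOG_ERR, "cannot open %s: %s", path, strerror(saved_errno));
        closelog();
    }
    return f;
}
```

With a typical syslog configuration the message would end up in /var/log/messages, prefixed with the ident and the process ID.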
A standard syslog configuration (file /etc/syslog.conf) should have among others a line that looks like this:
# Log anything (except mail) of level info or higher.
# Don't log private authentication messages.
*.info;mail.none;authpriv.none /var/log/messages
The "*.info" will log anything with priority level LOG_INFO or higher. To see more information in /var/log/messages you can change this to "*.debug" and restart syslog (/etc/init.d/syslog restart).

The procedure to "debug" an application would therefore be as follows.
1) run tail -f /var/log/messages and then start the application which
fails from a different shell. Maybe you get already some hints
of what is going wrong.

2) If step 1) is not enough then edit /etc/syslog.conf and
change *.info to *.debug. Run "/etc/init.d/syslog restart" and
repeat step 1).
The problem with this method is that it depends entirely on what the developer has done in his code. If he/she did not add syslog statements at key points then you may not see anything at all. In other words you can find only problems where the developer did already foresee that this could go wrong.

strace

An application running under Linux can execute three types of functions:
  1. Functions somewhere in its own code
  2. Library functions
  3. System calls
Library functions are similar to the application's own functions except that they are provided in a different package. System calls are those functions where your program talks to the kernel. Programs need to talk to the kernel if they need to access your computer's hardware. That is: write to the screen, read a file from disk, read keyboard input, send a message over the network etc...

These system calls can be intercepted and you can therefore follow the communication between application and the kernel.

A common problem is that an application does not work as expected because it can't find a configuration file or does not have sufficient permissions to write to a directory. These problems can easily be detected with strace. The relevant system call in this case would be called "open".

You use strace like this:
strace your_application
Here is an example:
# strace /usr/sbin/uucico
execve("/usr/sbin/uucico", ["/usr/sbin/uucico", "-S", "uucpssh", "-X", "11"],
[/* 36 vars */]) = 0
uname({sys="Linux", node="brain", ...}) = 0
brk(0) = 0x8085e34
mmap2(NULL, 4096, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40014000
open("/etc/ld.so.preload", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=70865, ...}) = 0
mmap2(NULL, 70865, PROT_READ, MAP_PRIVATE, 3, 0) = 0x40015000
close(3) = 0
open("/lib/libnsl.so.1", O_RDONLY) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\300;\0"..., 1024)
= 1024
fstat64(3, {st_mode=S_IFREG|0755, st_size=89509, ...}) = 0
mmap2(NULL, 84768, PROT_READ|PROT_EXEC, MAP_PRIVATE, 3, 0) = 0x40027000
mprotect(0x40039000, 11040, PROT_NONE) = 0
mmap2(0x40039000, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 3, 0x11)
= 0x40039000
mmap2(0x4003a000, 6944, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) =
0x4003a000
close(3) = 0
open("/lib/libc.so.6", O_RDONLY) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0`X\1\000"..., 1024)
= 1024
fstat64(3, {st_mode=S_IFREG|0755, st_size=1465426, ...}) = 0
mmap2(NULL, 1230884, PROT_READ|PROT_EXEC, MAP_PRIVATE, 3, 0) = 0x4003c000
mprotect(0x40163000, 22564, PROT_NONE) = 0
mmap2(0x40163000, 12288, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED, 3, 0x126) = 0x40163000
mmap2(0x40166000, 10276, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x40166000
close(3) = 0
munmap(0x40015000, 70865) = 0
brk(0) = 0x8085e34
brk(0x8086e34) = 0x8086e34
brk(0) = 0x8086e34
brk(0x8087000) = 0x8087000
open("/usr/conf/uucp/config", O_RDONLY) = -1 ENOENT (No such file or directory)
rt_sigaction(SIGINT, NULL, {SIG_DFL}, 8) = 0
rt_sigaction(SIGINT, {0x806a700, [],
SA_RESTORER|SA_INTERRUPT, 0x40064d58}, NULL, 8) = 0
rt_sigaction(SIGHUP, NULL, {SIG_DFL}, 8) = 0
rt_sigaction(SIGHUP, {0x806a700, [],
SA_RESTORER|SA_INTERRUPT, 0x40064d58}, NULL, 8) = 0
rt_sigaction(SIGQUIT, NULL, {SIG_DFL}, 8) = 0
rt_sigaction(SIGQUIT, {0x806a700, [],
SA_RESTORER|SA_INTERRUPT, 0x40064d58}, NULL, 8) = 0
rt_sigaction(SIGTERM, NULL, {SIG_DFL}, 8) = 0
rt_sigaction(SIGTERM, {0x806a700, [],
SA_RESTORER|SA_INTERRUPT, 0x40064d58}, NULL, 8) = 0
rt_sigaction(SIGPIPE, NULL, {SIG_DFL}, 8) = 0
rt_sigaction(SIGPIPE, {0x806a700, [],
SA_RESTORER|SA_INTERRUPT, 0x40064d58}, NULL, 8) = 0
getpid() = 1605
getrlimit(RLIMIT_NOFILE, {rlim_cur=1024, rlim_max=1024}) = 0
close(3) = -1 EBADF (Bad file descriptor)
close(4) = -1 EBADF (Bad file descriptor)
close(5) = -1 EBADF (Bad file descriptor)
close(6) = -1 EBADF (Bad file descriptor)
close(7) = -1 EBADF (Bad file descriptor)
close(8) = -1 EBADF (Bad file descriptor)
close(9) = -1 EBADF (Bad file descriptor)
fcntl64(0, F_GETFD) = 0
fcntl64(1, F_GETFD) = 0
fcntl64(2, F_GETFD) = 0
uname({sys="Linux", node="brain", ...}) = 0
umask(0) = 022
socket(PF_UNIX, SOCK_STREAM, 0) = 3
connect(3, {sa_family=AF_UNIX,
path="/var/run/.nscd_socket"}, 110) = -1 ENOENT (No such file or directory)
close(3) = 0
open("/etc/nsswitch.conf", O_RDONLY) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=499, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40015000
read(3, "# /etc/nsswitch.conf:\n# $Header:"..., 4096) = 499
read(3, "", 4096) = 0
close(3) = 0
munmap(0x40015000, 4096) = 0
open("/etc/ld.so.cache", O_RDONLY) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=70865, ...}) = 0
mmap2(NULL, 70865, PROT_READ, MAP_PRIVATE, 3, 0) = 0x40015000
close(3) = 0
open("/lib/libnss_compat.so.2", O_RDONLY) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\300\25"..., 1024)
= 1024
fstat64(3, {st_mode=S_IFREG|0755, st_size=50250, ...}) = 0
mmap2(NULL, 46120, PROT_READ|PROT_EXEC, MAP_PRIVATE, 3, 0) = 0x40169000
mprotect(0x40174000, 1064, PROT_NONE) = 0
mmap2(0x40174000, 4096, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED, 3, 0xa) = 0x40174000
close(3) = 0
munmap(0x40015000, 70865) = 0
uname({sys="Linux", node="brain", ...}) = 0
brk(0) = 0x8087000
brk(0x8088000) = 0x8088000
open("/etc/passwd", O_RDONLY) = 3
fcntl64(3, F_GETFD) = 0
fcntl64(3, F_SETFD, FD_CLOEXEC) = 0
fstat64(3, {st_mode=S_IFREG|0644, st_size=1864, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40015000
_llseek(3, 0, [0], SEEK_CUR) = 0
read(3, "root:x:0:0:root:/root:/bin/bash\n"..., 4096) = 1864
close(3) = 0
munmap(0x40015000, 4096) = 0
getuid32() = 10
geteuid32() = 10
chdir("/var/spool/uucp") = 0
open("/usr/conf/uucp/sys", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/var/log/uucp/Debug", O_WRONLY|O_APPEND|O_CREAT|O_NOCTTY, 0600) = 3
fcntl64(3, F_GETFD) = 0
fcntl64(3, F_SETFD, FD_CLOEXEC) = 0
fcntl64(3, F_GETFL) = 0x401 (flags O_WRONLY|O_APPEND)
fstat64(3, {st_mode=S_IFREG|0600, st_size=296, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40015000
_llseek(3, 0, [0], SEEK_CUR) = 0
open("/var/log/uucp/Log", O_WRONLY|O_APPEND|O_CREAT|O_NOCTTY, 0644) = 4
fcntl64(4, F_GETFD) = 0
fcntl64(4, F_SETFD, FD_CLOEXEC) = 0
fcntl64(4, F_GETFL) = 0x401 (flags O_WRONLY|O_APPEND)
What do we see here? Let's look, for example, at the following lines:
open("/etc/ld.so.preload", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY) = 3
The program tries to read /etc/ld.so.preload and fails then it carries on and reads /etc/ld.so.cache. Here it succeeds and gets file descriptor 3 allocated. Now the failure to read /etc/ld.so.preload may not be a problem at all because the program may just try to read this and use it if possible. In other words it is not necessarily a problem if the program fails to read a file. It all depends on the design of the program. Let's look at all the open calls in the printout from strace:
open("/usr/conf/uucp/config", O_RDONLY)= -1 ENOENT (No such file or directory)
open("/etc/nsswitch.conf", O_RDONLY) = 3
open("/etc/ld.so.cache", O_RDONLY) = 3
open("/lib/libnss_compat.so.2", O_RDONLY) = 3
open("/etc/passwd", O_RDONLY) = 3
open("/usr/conf/uucp/sys", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/var/log/uucp/Debug", O_WRONLY|O_APPEND|O_CREAT|O_NOCTTY, 0600) = 3
open("/var/log/uucp/Log", O_WRONLY|O_APPEND|O_CREAT|O_NOCTTY, 0644) = 4
open("/etc/ld.so.preload", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY) = 3
The program now tries to read /usr/conf/uucp/config. Oh! This is strange: I have the config file in /etc/uucp/config! ... and there is no line where the program attempts to open /etc/uucp/config. This is the fault. Obviously the program was configured at compile time for the wrong location of the configuration file.
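Each "= -1 ENOENT" line in the strace output corresponds to an open() call that returned -1 with errno set to ENOENT. The sketch below shows that from the program's side, together with a hypothetical helper that tries several candidate locations in turn. The function names are made up, and whether a real program should search multiple paths is a design decision this example does not settle.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Try to open path read-only. Returns 0 on success, otherwise the
 * errno value that open() reported (ENOENT, EACCES, ...). */
int probe_open(const char *path)
{
    int fd = open(path, O_RDONLY);

    if (fd == -1)
        return errno;
    close(fd);
    return 0;
}

/* Return the first path in the list that can be opened, or NULL. */
const char *first_readable(const char *const paths[], size_t count)
{
    size_t i;

    for (i = 0; i < count; i++) {
        if (probe_open(paths[i]) == 0)
            return paths[i];
    }
    return NULL;
}
```

Run under strace, every failed probe would appear as one open line ending in an error name, which is exactly the pattern to look for in the listing above.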

As you see strace can be very useful. The problem is that it requires some experience with C-programming to really understand the full output of strace but normally you don't need to go that far.

gdb and core files

Sometimes a program just dies out of the blue with the message "Segmentation fault (core dumped)". This means that the program tried (due to a programming error) to access memory it is not allowed to access, typically by reading or writing beyond the area of memory it has allocated. Especially in cases where the program writes just a few bytes too many, it can be that only you see this problem and that it happens only once in a while. This is because memory is allocated in chunks and sometimes there is accidentally still room left for the extra bytes.

When this "Segmentation fault" happens a core file is left behind in the current working directory of the program (normally your home directory). This core file is just a dump of the memory at the time when the fault happened. Some shells provide facilities for controlling whether core files are written. Under bash, for example, the default behavior is not to write core files at all. In order to enable core files, you should use the command:
# ulimit -c unlimited

# ./lshref -i index.html,index.htm test.html
Segmentation fault (core dumped)
Exit 139
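When the shell that starts the program cannot easily be changed (a daemon, for example), the same limit can be raised from inside the program. This is a minimal sketch using getrlimit and setrlimit on RLIMIT_CORE; the function name is hypothetical. It can only raise the soft limit up to the hard limit, so it has no effect if the hard limit is already 0.

```c
#include <sys/resource.h>

/* Raise the soft core-file size limit to the hard limit, which is the
 * most an unprivileged process may do. Returns 0 on success, -1 on error. */
int enable_core_dumps(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) != 0)
        return -1;
    rl.rlim_cur = rl.rlim_max;   /* soft limit := hard limit */
    return setrlimit(RLIMIT_CORE, &rl);
}
```

Calling this early in main() has roughly the effect of "ulimit -c" with the hard limit as the value, but only for this process and its children.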
The core file can now be used with the gdb debugger to find out what was going wrong. Before you start gdb you can check that you are really looking at the right core file:
# file core.16897
core.16897: ELF 32-bit LSB core file Intel 80386, version 1 (SYSV), SVR4-style,
from 'lshref'
OK, lshref is the program that was crashing so let's load it into gdb. To invoke gdb for use with a core file, you must specify not only the core filename but also the name of the executable that goes along with that core file.
# gdb ./lshref core.23061 
GNU gdb Linux (5.2.1-4)
Copyright 2002 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
Core was generated by `./lshref -i index.html,index.htm test.html'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /lib/libc.so.6...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib/ld-linux.so.2...done.
Loaded symbols for /lib/ld-linux.so.2
#0 0x40095e9d in strcpy () from /lib/libc.so.6
(gdb)
Now we know that the program is crashing while it tries to do a strcpy. The problem is that there might be many places in the code where strcpy is used.

In general there will now be 2 possibilities to find out where exactly in the code it goes wrong.
  1. Recompile the code with debug information (gcc option -g)
  2. Do a stack trace in gdb
The problem in our case is that strcpy is a library function, and even if we re-compiled absolutely all code (including libc) it would still tell us that it fails at a given line in the C library.

What we need is a stack trace which will tell us which function was called before strcpy was executed. The command to do such a stack trace in gdb is called "backtrace". In this case, however, it does not work with the core file alone, because gdb cannot read the stack from it. You have to re-run the program in gdb (reproduce the fault):
gdb ./lshref core.23061
GNU gdb Linux (5.2.1-4)
Copyright 2002 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
Core was generated by `./lshref -i index.html,index.htm test.html'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /lib/libc.so.6...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib/ld-linux.so.2...done.
Loaded symbols for /lib/ld-linux.so.2
#0 0x40095e9d in strcpy () from /lib/libc.so.6
(gdb) backtrace
#0 0x40095e9d in strcpy () from /lib/libc.so.6
Cannot access memory at address 0xbfffeb38
(gdb) run ./lshref -i index.html,index.htm test.html
Starting program: /home/guido/lshref ./lshref -i index.html,index.htm test.html

Program received signal SIGSEGV, Segmentation fault.
0x40095e9d in strcpy () from /lib/libc.so.6
(gdb) backtrace
#0 0x40095e9d in strcpy () from /lib/libc.so.6
#1 0x08048d09 in string_to_list ()
#2 0x080494c8 in main ()
#3 0x400374ed in __libc_start_main () from /lib/libc.so.6
(gdb)
Now we can see that function main() called string_to_list() and from string_to_list strcpy() is called. We go to string_to_list() and look at the code:
char **string_to_list(char *string){
char *dat;
char *chptr;
char **array;
int i=0;

dat=(char *)malloc(strlen(string))+5000;
array=(char **)malloc(sizeof(char *)*51);
strcpy(dat,string);
This malloc line looks like a typo. Probably it should have been:
dat=(char *)malloc(strlen(string)+5000);

We change it, re-compile and ... hurrah ... it works.
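The reason the original line crashes is that the +5000 sits outside the malloc() call: it is added to the returned pointer, so dat points 5000 bytes past the start of a block that is only strlen(string) bytes long, and strcpy then writes into memory that was never allocated. A minimal sketch of a safe copy follows. The function name is made up, and it allocates exactly the needed size rather than the article's 5000 bytes of slack.

```c
#include <stdlib.h>
#include <string.h>

/* Return a freshly allocated copy of src, or NULL if malloc fails.
 * The caller is responsible for calling free() on the result. */
char *copy_string(const char *src)
{
    size_t len = strlen(src) + 1;   /* +1 for the terminating '\0' */
    char *dst = malloc(len);

    if (dst == NULL)
        return NULL;
    memcpy(dst, src, len);          /* copies the terminator as well */
    return dst;
}
```

Checking the malloc result also avoids a second way to crash in strcpy, namely copying into a NULL pointer when the allocation fails.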

Let's look at a second example where the fault is not detected inside a library but in application code. In such a case the application can be compiled with the "gcc -g" flag and gdb will be able to show the exact line where the fault is detected.

Here is a simple example.
#include <stdio.h>
#include <stdlib.h>

int add(int *p,int a,int b)
{
*p=a+b;
return(*p);
}

int main(void)
{
int i;
int *p = 0; /* a null pointer */
printf("result is %d\n", add(p,2,3));
return(0);
}
We compile it:
gcc -Wall -g -o exmp exmp.c
Run it...
# ./exmp
Segmentation fault (core dumped)
Exit 139
gdb exmp core.5302
GNU gdb Linux (5.2.1-4)
Copyright 2002 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
Core was generated by `./exmp'.
Program terminated with signal 11, Segmentation fault.
Reading symbols from /lib/libc.so.6...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib/ld-linux.so.2...done.
Loaded symbols for /lib/ld-linux.so.2

#0 0x08048334 in add (p=Cannot access memory at address 0xbfffe020
) at exmp.c:6
6 *p=a+b;
gdb now tells us that the fault was detected at line 6 and that pointer "p" pointed to memory which cannot be accessed.

We look at the above code and it is of course a simple made-up example where p is a null pointer, and you cannot store any data through a null pointer. Easy to fix...
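One possible fix, sketched here under the assumption that the caller wants the sum stored in its own variable, is to pass the address of a real int and to reject a null pointer instead of writing through it. The name add_checked and the 0/-1 return convention are choices made for this example only.

```c
#include <stddef.h>

/* Store a + b in *p. Returns 0 on success, -1 if p is NULL. */
int add_checked(int *p, int a, int b)
{
    if (p == NULL)
        return -1;      /* refuse to write through a null pointer */
    *p = a + b;
    return 0;
}
```

A caller would declare "int result;" and pass &result, so that p refers to storage that actually exists.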

Conclusion



We have seen cases where you can really find the cause of a fault without knowing too much about the inner workings of a program.

I have on purpose excluded functional faults, e.g. a button in a GUI that is in the wrong position but works. In those cases you will have to learn about the inner workings of the program. This will generally take much more time and there is no recipe on how to do that.

However the simple fault finding techniques shown here can still be applied in many situations.

Happy troubleshooting!

Original article: http://linuxfocus.berlios.de/English/July2004/article343.shtml


Posts tagged segfault


testseg[24850]: segfault at 0000000000000000 rip 0000000000400470 rsp 0000007fbffff8a0 error 6
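Messages like this are generally caused by an out-of-bounds memory access. Whether the access comes from a user-mode or a kernel-mode program, a core is produced and a line like this is written to the system log. The line gives, in order, the name of the offending program, its process ID, the address whose access faulted, the instruction pointer (rip) and the stack pointer (rsp) at that moment. The most useful piece of information is the error number at the end, which is 6 in the message above.

The error number consists of three bits, from high to low bit2, bit1 and bit0, so its value ranges from 0 to 7.

* bit2: 1 means the faulting access came from a user-mode program; 0 means it came from kernel mode
* bit1: 1 means a write caused the fault; 0 means a read caused it
* bit0: 1 means the access lacked sufficient permission for the address; 0 means the address has no corresponding page at all, i.e. it is an invalid address

In the example above the error number is 6, which is 110 in binary, i.e. bit2=1, bit1=1, bit0=0. Following the explanation of the bits, this message was caused by a user-mode program writing to an address that has no mapped page.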
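The three bits can be decoded mechanically. The following sketch follows the bit meanings listed above; the struct and function names are made up for illustration, and any higher bits that a kernel might set are simply ignored.

```c
#include <stdbool.h>

/* Meaning of the low three bits of the "error" field in a segfault line. */
struct segv_error {
    bool user_mode;    /* bit2: 1 = fault came from user mode, 0 = kernel mode */
    bool write;        /* bit1: 1 = write access, 0 = read access */
    bool protection;   /* bit0: 1 = permission violation, 0 = no page mapped */
};

/* Split an error number such as 4 or 6 into its three flags. */
struct segv_error decode_segv_error(unsigned code)
{
    struct segv_error e;

    e.protection = (code & 1u) != 0;
    e.write      = (code & 2u) != 0;
    e.user_mode  = (code & 4u) != 0;
    return e;
}
```

For error 4 this yields a user-mode read of an unmapped address, and for error 6 a user-mode write to an unmapped address.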

Using the segfault message to debug and locate the bug in a program:

#include<stdio.h>
int main()
{
int *p;
*p=12;
return 1;
}

1. Compile with gcc testseg.c -o testseg -g, run ./testseg, and check dmesg, which shows:
   testseg[26063]: segfault at 0000000000000000 rip 0000000000400470 rsp 0000007fbffff8a0 error 6
2. Run addr2line -e testseg 0000000000400470, which prints:
   /home/xxx/xxx/c/testseg.c:5
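To hand the rip value to addr2line from a script or tool, it first has to be extracted from the log line. The sketch below parses the line format shown in this post ("segfault at ... rip ... rsp ... error ..."). It assumes all four fields are hexadecimal, and it would need adjusting for kernels that label the fields differently. The struct and function names are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* The numeric fields of one kernel segfault log line. */
struct segv_record {
    unsigned long long fault_addr;  /* address whose access faulted */
    unsigned long long rip;         /* instruction pointer */
    unsigned long long rsp;         /* stack pointer */
    unsigned error;                 /* error number */
};

/* Parse a line containing "segfault at A rip B rsp C error D".
 * Returns 0 on success, -1 if the line does not match. */
int parse_segfault_line(const char *line, struct segv_record *out)
{
    const char *p = strstr(line, "segfault at ");
    int n;

    if (p == NULL)
        return -1;
    n = sscanf(p, "segfault at %llx rip %llx rsp %llx error %x",
               &out->fault_addr, &out->rip, &out->rsp, &out->error);
    return n == 4 ? 0 : -1;
}
```

The rip field, printed in hexadecimal, is the address to pass to addr2line -e with the unstripped binary.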

6. Lighttpd php segfault at 0000000000000040 rip 0000003e30228278 rsp 0000007fbffff708 error 4

from:http://www.cyberciti.biz/tips/lighttpd-php-segfault-at-0000000000000040-rip-error.html

Lighttpd php segfault at 0000000000000040 rip 0000003e30228278 rsp 0000007fbffff708 error 4
by Vivek Gite on October 17, 2006 · 0 comments

I have recently noticed this error. Although the server continues to work without problems, at some point it will crash, so it is better to fix this error. The main problem was the chrooted lighttpd installation: a few libraries were not copied. You need to use the ldd command to find the names of the libraries. In my case it was the curl library used by the DOMXML PHP module. Use the following procedure to trace the required libraries:

# mkdir /webroot/bin
# cp /bin/bash /webroot/bin
# cp /usr/bin/strace /webroot/bin
# l2chroot /usr/bin/strace
# l2chroot /bin/bash
# chroot /webroot
# strace php /path/to/script.php 2> /tmp/debug.txt
# exit
# vi /webroot/tmp/debug.txt


Now find out which shared libraries were not found. Next, copy all missing libraries to the /lib or /usr/lib location of the chroot jail. Repeat the above procedure until all required shared libraries have been copied to the chroot jail.
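Scanning the strace log by eye gets tedious when it is long. As a rough aid, the sketch below prints only the open() lines that failed with ENOENT. It assumes the log uses the "open(...) = -1 ENOENT (...)" format shown earlier in this post; newer strace output often shows openat() instead, which this filter would not match. The function names are hypothetical.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* True if an strace output line is an open() call that failed with ENOENT. */
bool is_missing_open(const char *line)
{
    return strncmp(line, "open(", 5) == 0 &&
           strstr(line, " ENOENT ") != NULL;
}

/* Print every such line from an strace log to stdout.
 * Returns the number of lines printed, or -1 if the log cannot be read. */
int print_missing_opens(const char *log_path)
{
    char buf[1024];
    int count = 0;
    FILE *f = fopen(log_path, "r");

    if (f == NULL)
        return -1;
    while (fgets(buf, sizeof buf, f) != NULL) {
        if (is_missing_open(buf)) {
            fputs(buf, stdout);
            count++;
        }
    }
    fclose(f);
    return count;
}
```

Not every reported line is a real problem (a missing /etc/ld.so.preload is normal, for instance), so the output is a list of candidates to check rather than a list of libraries to copy.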

The following is the recommended solution if you run Apache or lighttpd in a chroot jail.
Copy all shared libs from /lib and /usr/lib to the /chroot directory, but don't copy any executables from the /bin, /usr/bin or /usr/sbin directories.

# cp -avr /lib/ /chroot/lib/
# cp -avr /usr/lib/ /chroot/usr/lib/

The above solution is quite secure, and I have successfully implemented it for high performance Apache shared load balancing business hosting. More than 800 sites are hosted using 6 Apache web servers and a 2 node MySQL cluster.

Don't forget to remove /chroot/bin directory and all files after troubleshooting.
